Program helps detect erotic content in self-published books.

In November 2011 I published a post here about a program called BookLamp, which detected the “genome” of the books it analyzed. The program let publishers analyze the content of manuscripts by the types of themes, quotations, and so on mentioned in the text. At the time, BookLamp was already receiving material from some major publishers, including unsolicited manuscripts, so that those with approaches similar to other successful books could later be examined more closely. It was an analysis tool, and on the company’s site it also let readers pick titles whose themes and other content elements resembled those of a novel they had recently read. The post can be read here.

Well, the BookLamp team decided to apply this toolkit to the question of self-published pornographic books, a subject that has stirred controversy in publishing circles in the US and England. The W. H. Smith chain decided to take down its Kobo-powered site until every book had gone through a review to eliminate those dealing with pedophilia, bestiality, and incest.

It is a heated issue.

Today the site Digital Book World published the post I reproduce below. I always point out that the line between hypocrisy and keeping questionable content away from children is a delicate matter, and that Anglo-Saxon moralism, joined with evangelicals and other ultra-moralists on duty, can also harm freedom of expression in fiction. The original text is in English.

The original was published here.

The Literary Darknet of Independent Publishing
October 20, 2013 | Aaron Stanton

The independent and self-publishing space recently found itself with a cascading bit of drama, eventually escalating to impact everyone from Amazon to Barnes & Noble, to WHSmith and Kobo. It began with an article on The Kernel about how Amazon sells incest, rape, and underage erotica in their online book stores. This is not mild content.

The story quickly spread through larger news channels to include virtually every major online retailer, though somehow, the Google Play Store escaped notice, despite having the exact same content. WHSmith, the respected online book seller, responded by shutting down their entire site to categorically remove all independent books until they could be verified “clean.” In case it’s back up by the time this article goes up, the image below is what a major site looks like when the universe implodes.

The relative ease with which independent authors can publish content directly to a digital store has created a tremendous swell in content with no editorial oversight. The vast majority of these titles have almost no reliable metadata about what’s in them. It is a large, invisible ocean of content that most people are not really aware of.

The Literary Darknet

On the internet, the Darknet is a collection of underground or largely unindexed websites that you have to know exist in order to find. A lot of questionable content has grown around these Darknet communities — if you’re familiar with the Silk Road that was recently taken down by the authorities, you’re at least partly familiar with the Darknet.

The invisible, generally unregulated ocean of written content coming onto the market from the self-publishing community is, in some ways, a literary equivalent of the Darknet. This tremendous volume of content is far greater than any current social-based review system can handle, not only from a sexual content standpoint, but from a review and discovery standpoint. The vast majority of these books have zero reviews and zero star ratings on even the largest social review sites. You can see this in the hundreds of pages of “zero rating” books in almost any Goodreads keyword search.

This creates a problem. Online retailers like Amazon, Google, and B&N end up putting books on their shelves without content oversight.

How to Map the Literary Darknet

Contrary to popular belief, there is a way to map these sorts of issues, and to do so with millions of books. The Book Genome Project, where I work, has spent years building and tuning computer-based tools that catalog the vast amount of invisible content, generally books that don’t have the marketing resources to be visible on social discovery sites. We also build tools to help retailers identify and reclassify books with potentially objectionable content, such as flagging a Juvenile title that has sex, bestiality, or incest in it. We do this on a scene-by-scene basis in a book, and we do it at scale — normally in the range of 40,000 to 100,000 titles a week. Through various partners, virtually every book published in the United States passes through one of our systems at one time or another. I’m happy to say that we do our job well; not a single one of the large clients we work with appeared on The Kernel’s shame list, and not for lack of visibility.

You can read more details about how our tools work here in an article we did about the impact of 50 Shades of Grey on sexual content in publishing, but for a quick glimpse of what our system sees when it looks at a book, here’s a single sexual content graph from that article: 50 Shades of Grey, from beginning to end of the book. Each block represents roughly 1,000 words. Green means no sexual content. Yellow means some. Red means… well…

[Figure 1: sexual content graph for 50 Shades of Grey, in blocks of roughly 1,000 words]
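To make the block-by-block colouring concrete, here is a minimal sketch in Python. Only the block size of roughly 1,000 words and the three colours come from the article. The function names, the pluggable `score` callback, and the two thresholds are hypothetical and are not the Book Genome Project's actual method, which the article does not describe.

```python
from typing import Callable, List

BLOCK_WORDS = 1000  # block size mentioned in the article


def split_into_blocks(text: str, size: int = BLOCK_WORDS) -> List[List[str]]:
    """Split a text into consecutive blocks of at most `size` words."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def colour_blocks(
    text: str,
    score: Callable[[List[str]], float],
    some: float = 0.01,   # hypothetical threshold for "yellow"
    heavy: float = 0.05,  # hypothetical threshold for "red"
) -> List[str]:
    """Map each block to green, yellow, or red using a caller-supplied score."""
    colours = []
    for block in split_into_blocks(text):
        value = score(block)
        if value >= heavy:
            colours.append("red")
        elif value >= some:
            colours.append("yellow")
        else:
            colours.append("green")
    return colours


def toy_score(block: List[str]) -> float:
    """Toy scorer for illustration only: fraction of words equal to 'x'."""
    return sum(1 for word in block if word == "x") / len(block)


# Synthetic 2,500-word text: a clean block, a lightly marked block, a heavy one.
sample = " ".join(["a"] * 1000 + ["x"] * 20 + ["a"] * 980 + ["x"] * 500)
sample_colours = colour_blocks(sample, toy_score)
print(sample_colours)
```

A real classifier would replace `toy_score`. The point of the sketch is only that per-block scores, rather than one score per book, are what allow content to be mapped from the beginning to the end of a title.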

From our perspective, we’re mostly interested in whether or not a book is in the right category. As Erotica, this graph wouldn’t have raised an eyebrow, but if it had been misclassified as Juvenile Fiction we would certainly have flagged it. To give you an example that’s more specific to this topic, here’s a graphic showing the sexual content of one of the objectionable books identified by The Kernel as being for sale at Amazon, called Daddy’s Invisible Condom. This book was flagged as both Erotica and Incest by our automated tools. I’ll spare you the book cover:

[Figure 2: sexual content graph for Daddy’s Invisible Condom]

As you can see, virtually every scene in this book contains sexual content, and as the name implies, incest or pseudo-incest features throughout. It’s also interesting to note that almost immediately after this book was highlighted in the article on The Kernel, the name of the book was changed from Daddy’s Invisible Condom to simply Invisible Condom, removing the ability for title-based screening methods to identify it as containing incestuous themes. However, when we ran the title through our system, it was still flagged as containing those themes, meaning that at the time of analysis the incestuous content was still there, just hiding.

What Percentage of Self-Published Books Are Erotica?

Now, let’s look at books that have sexual content to a degree that they’re likely to be considered erotica. These are self-published books that contain an amount and type of sexual content that puts them statistically in the erotica category established by traditional publishers. In our observations, roughly 28.5% of the self-published content falls into this category. This is based on a “slice of life” sub-sample of data; I would not consider it necessarily representative of all self-published content, though I believe it’s relatively typical as self-published content goes. I have no concrete way to estimate how representative our sample is of all self-published content, though it represents several tens of thousands of books — I can only speak to what we’ve observed. In that case, a little under 30% of the independent content we’ve observed fell into the sexual category. For comparison, about 1.11% of the roughly 110,000 traditionally published books in the Book Genome Project fall into the Erotica category.

[Figure 3: share of erotica in self-published versus traditionally published titles]

This supports my personal observation that the self-published marketplace is producing a great deal of sexual content compared to traditional sources. In fact, in our data, the share of erotica among self-published titles is nearly 26 times the share among traditionally published ones.
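The "nearly 26 times" figure can be checked from the two percentages quoted above. The short Python sketch below uses only numbers stated in the article. The variable and function names are made up for illustration.

```python
# Shares reported in the article (percent of titles classified as erotica).
SELF_PUBLISHED_EROTICA_PCT = 28.5
TRADITIONAL_EROTICA_PCT = 1.11
TRADITIONAL_TITLES = 110_000  # approximate size of the traditional sample


def share_ratio(a_pct: float, b_pct: float) -> float:
    """How many times larger share a is than share b."""
    return a_pct / b_pct


ratio = share_ratio(SELF_PUBLISHED_EROTICA_PCT, TRADITIONAL_EROTICA_PCT)

# Approximate number of traditional titles behind the 1.11% figure.
traditional_erotica_titles = round(TRADITIONAL_TITLES * TRADITIONAL_EROTICA_PCT / 100)

print(f"ratio of shares: {ratio:.1f}")
print(f"traditional erotica titles: {traditional_erotica_titles}")
```

The ratio compares the make-up of the two samples, not absolute counts. The article gives the self-published sample size only as "several tens of thousands", so no count is derived for that side.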

Type of Erotic Content in Self-Published Titles

Here’s the tough question: How much of this content is of concern to a company like Amazon, Kobo, Google, or news outlets like The Kernel? If we define Erotic Incest and Erotic Bestiality as objectionable, how many books are we actually talking about here?

That, too, we can shed some light on. Out of any given 1,000 self-published books that we’ve observed, roughly 19 (1.9%) will contain erotic incestuous themes, and 9 (0.91%) will contain erotic bestiality themes. Put another way, just under 3% of self-published titles are likely to contain objectionable content by the definition above.
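As a quick check of the "just under 3%" figure, the sketch below adds the two quoted rates. One assumption is worth flagging: a straight sum is only exact if no title falls into both groups, which the article does not state, so the code reports a lower and an upper bound. The names are illustrative.

```python
# Rates quoted in the article (percent of observed self-published titles).
INCEST_PCT = 1.9
BESTIALITY_PCT = 0.91


def combined_rate_bounds(a_pct: float, b_pct: float) -> tuple:
    """Return (lower, upper) bounds for the share of titles in either group."""
    lower = max(a_pct, b_pct)          # one group lies entirely inside the other
    upper = min(a_pct + b_pct, 100.0)  # no title belongs to both groups
    return lower, upper


lower, upper = combined_rate_bounds(INCEST_PCT, BESTIALITY_PCT)
print(f"between {lower:.2f}% and {upper:.2f}% of titles")
print(f"about {upper * 10:.0f} titles per 1,000 at the upper bound")
```

The upper bound matches the 2.81% rate used later in the article, which suggests the author treated the two groups as non-overlapping.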

[Figure 4: share of self-published titles containing erotic incest or erotic bestiality themes]

There are many ways to spin that, depending on your particular view. On one hand, this means that 97% of self-published titles do NOT contain this content. Yes, self-publishing contains substantially more of this content than we’ve found in traditional publishing (we’ve observed virtually no erotic incest or bestiality in traditional titles), but self-published books are overwhelmingly about something other than those themes.

On the other hand, if you’re inclined to look at it from the other direction, it potentially indicates that the amount of questionable content in self-published books is significant. Another way of stating our observations is that nearly 1 out of every 10 erotic titles in our self-published sample contained either bestiality or incest. To me, a more eye-opening way of putting this in perspective is to compare that potential 2.81% overall objectionable content rate in our sample with the prevalence of common genres in traditional publishing. For example, it would be nearly three times larger than the percentage of traditionally published Cookbooks in 2010, which made up only 1.04% of total new books. Sports titles made up only 2.26%. If Erotic Incest/Bestiality were a single category of books, it would be a larger category than nearly half of the genres listed in Bowker’s data, and bigger than most sub-categories of Fiction:
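The comparisons in this paragraph follow from the percentages already quoted. The Python sketch below recomputes them. All inputs are figures from the article and the variable names are illustrative. Note that the 2.81% rate comes from a self-published sample while the genre shares describe traditional publishing, so the ratios compare shares across two different populations.

```python
# Figures quoted in the article (all in percent).
OBJECTIONABLE_PCT = 2.81     # erotic incest plus erotic bestiality, self-published sample
SELF_PUB_EROTICA_PCT = 28.5  # erotica share of the same sample
COOKBOOKS_PCT = 1.04         # share of traditionally published new books, 2010
SPORTS_PCT = 2.26            # share of traditionally published new books, 2010

# Fraction of self-published erotica that is objectionable by the article's definition.
share_of_erotica = OBJECTIONABLE_PCT / SELF_PUB_EROTICA_PCT

# Size of the objectionable rate relative to two traditional genre shares.
vs_cookbooks = OBJECTIONABLE_PCT / COOKBOOKS_PCT
vs_sports = OBJECTIONABLE_PCT / SPORTS_PCT

print(f"share of erotic titles: {share_of_erotica:.3f}")
print(f"relative to Cookbooks: {vs_cookbooks:.2f}x")
print(f"relative to Sports: {vs_sports:.2f}x")
```

The first value is just under 0.1, which is the "nearly 1 out of every 10" claim. The Cookbooks ratio comes out a little under three.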

[Figure 5: objectionable content rate compared with genre shares in Bowker’s data]

Final Thoughts

Do I really think that the combined categories of self-published Erotic Incest and Bestiality compete in scale with Computer or Literature books? I certainly think it’s possible, but there are some caveats that have to be included.

 There might be substantially more Incest & Bestiality books because: There are more self-published books published each year than traditionally published books. As a consequence, 3% of the self-published books is likely to be far more books than 3% of traditionally published ones. In terms of sheer numbers, there could be substantially more Incest books coming onto the market than this data implies.

 There might be substantially fewer Incest & Bestiality books because: Not all self-publishing companies attract the same authors, and we didn’t shape the data to represent source distribution. I do believe that the books we’ve observed are highly similar to what most people think of as “self-published” — the sort you’d expect to see in CreateSpace, Smashwords, Amazon Kindle Direct, and other similar publishers. But I’d never try to pass the above numbers off as somehow a complete picture of the universe of self-publishing; even if we had access to those books, most of that data would be proprietary and we wouldn’t be able to share it. So, this is more an indication of the potential scale in a single slice, not definitive.

 There are no sales numbers in this data. As with any long tail, it’s likely irrelevant how many books on a topic are available compared to how many people are reading them. After all, does it matter that there is really objectionable content in the long tail of the book market if no one ever sees or purchases it? As a percentage of sales volume, they could be virtually invisible. They could also be one of the few categories of the market that’s filling a niche not already addressed in traditional publishing. The answer to that would require additional data I don’t have.

How Do We Know — The Tools for Mapping Content in the Literary Darknet

In order for any of the above to have any validity, even as a curiosity, it requires some faith in the technology used to generate the data. The Book Genome Project focuses on using computers to understand the thematic, emotional, and stylistic make-up of the content of a book. It’s often been compared to the Pandora of the book industry, in terms of methodology. Every theme that we measure is done on a scene-by-scene basis, allowing for a very granular degree of content mapping throughout a book. In terms of accuracy, our tools for identifying erotic content have a better than 99% catch rate, and a less than 1% false positive rate. The same is true with bestiality. For a more detailed example of measuring sexual content in books, check out how Fifty Shades of Grey impacted the amount of sexual content in Romance.
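One way to read these accuracy figures is to ask what fraction of flagged books would actually contain the theme. The Python sketch below applies Bayes' rule under two assumptions that the article does not confirm: that "catch rate" means the fraction of matching books that get flagged, and that "false positive rate" means the fraction of non-matching books that get flagged. It uses the boundary values of 99% and 1%, and takes the prevalence figures from the article's own sample.

```python
def flag_precision(catch_rate: float, false_positive_rate: float, prevalence: float) -> float:
    """Fraction of flagged books that truly contain the theme (Bayes' rule)."""
    true_flags = catch_rate * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    return true_flags / (true_flags + false_flags)


# Boundary values: the article says "better than 99%" and "less than 1%".
CATCH_RATE = 0.99
FALSE_POSITIVE_RATE = 0.01

# Prevalence figures from the article's self-published sample.
erotica_precision = flag_precision(CATCH_RATE, FALSE_POSITIVE_RATE, 0.285)
bestiality_precision = flag_precision(CATCH_RATE, FALSE_POSITIVE_RATE, 0.0091)

print(f"erotica flags that are correct: {erotica_precision:.3f}")
print(f"bestiality flags that are correct: {bestiality_precision:.3f}")
```

Under these assumptions a rare theme yields a noticeably lower share of correct flags than a common one, even with the same detector, which is why the quoted rates are best read as bounds and not as a statement about any individual flagged book.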

If you’re interested in the more general application of the Book Genome Tools on search and discovery, or you happen to be a Stephen King fan, check out Visualizing the Data of Stephen King.

For more information on BookLamp or the Book Genome Project, feel free to visit here, or fire questions my direction.
