https://towardsdatascience.com/dirty-secrets-of-bookcorpus-a-key-dataset-in-machine-learning-6ee2927e8650

Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning

A closer look at BookCorpus, the text dataset that helps train large language models for Google, OpenAI, Amazon, and others

Jack Bandy · Towards Data Science · 5 min read · May 12, 2021

Photo by Javier Quiroga on Unsplash

BookCorpus has helped train at least thirty influential language models (including Google's BERT, OpenAI's GPT, and Amazon's Bort), according to HuggingFace. But what exactly is inside BookCorpus?

This is the research question that Nicholas Vincent and I ask in a new working paper that attempts to address some of the "documentation debt" in machine learning research, a concept discussed by Dr. Emily M. Bender, Dr. Timnit Gebru, et al. in their Stochastic Parrots paper. While many researchers have used BookCorpus since it was first introduced, documentation remains sparse. The original paper that introduced the dataset described it as a "corpus of 11,038 books from the web" and provided six summary statistics (74 million sentences, 984 million words, etc.). We decided to take a closer look. Here is what we found.

General Notes

For context, it is first important to note that BookCorpus contains a sample of books from Smashwords.com, a website that describes itself as "the world's largest distributor of indie ebooks." As of 2014, Smashwords hosted 336,400 books. For comparison, in the same year the Library of Congress housed a total of 23,592,066 catalogued books (about seventy times as many).

The researchers who collected BookCorpus downloaded every free book longer than 20,000 words, which resulted in 11,038 books: a 3% sample of all books on Smashwords.com. But as discussed below, we found that thousands of these books were duplicates and only 7,185 were unique, so BookCorpus is really only a 2% sample of all books on Smashwords.

In the full datasheet, we provide information about funding (Google and Samsung were among the funding sources), the original use case for BookCorpus (sentence embedding), and other details outlined in the datasheet standard. For this blog post, I will highlight some of the more concerning findings.

Major Concerns

Copyright Violations

In 2016, Richard Lea explained in The Guardian that Google did not seek consent from the authors whose books appear in BookCorpus, even though those books help power Google's technologies.

Going even further, we find evidence that BookCorpus directly violated copyright restrictions for hundreds of books that should not have been redistributed through a free dataset. For example, over 200 books in BookCorpus explicitly state that they "may not be reproduced, copied and distributed for commercial or non-commercial purposes." We also find that at least 406 books included in the free BookCorpus dataset now cost money on Smashwords, the dataset's source. Purchasing these 406 books would have cost $1,182.21 as of April 2021.

Duplicate Books

BookCorpus is often described as containing 11,038 books, which is what the original authors report. However, we found that thousands of books were duplicates: only 7,185 books in the dataset are unique. The exact breakdown is as follows:

* 4,255 books occurred once (i.e. were not duplicated)
* 2,101 books occurred twice
* 741 books occurred thrice
* 82 books occurred four times
* 6 books occurred five times
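For a sense of how a breakdown like this can be computed, here is a minimal sketch that fingerprints each book by hashing its whitespace-normalized text and then tallies how often each fingerprint occurs. This is illustrative only: the directory layout, file naming, and normalization step are my assumptions, not the method from our paper.

```python
import hashlib
from collections import Counter
from pathlib import Path

def book_fingerprint(path: Path) -> str:
    """Hash a book's whitespace-normalized text to detect exact duplicates."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    # Collapse whitespace so trivial formatting differences don't hide duplicates
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def duplication_breakdown(book_dir: str) -> Counter:
    """Map each occurrence count to the number of books with that count."""
    fingerprints = Counter(
        book_fingerprint(p) for p in Path(book_dir).glob("*.txt")
    )
    return Counter(fingerprints.values())

# Hypothetical usage: run on a directory of BookCorpus text files, a result
# matching the breakdown above would be Counter({1: 4255, 2: 2101, 3: 741, 4: 82, 5: 6})
```

As a sanity check, the reported counts are internally consistent: 4,255 + 2,101 + 741 + 82 + 6 = 7,185 unique books, and 4,255×1 + 2,101×2 + 741×3 + 82×4 + 6×5 = 11,038 total books.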
Skewed Genre Representation

Compared to a newer version called BookCorpusOpen, and to another dataset of all the books on Smashwords (Smashwords21), the original BookCorpus has some significant genre skews.

[Table in the original post: genre representation in BookCorpus, BookCorpusOpen, and Smashwords21]

Notably, BookCorpus over-represents the Romance genre, which is not necessarily surprising given broader patterns in self-publishing (authors consistently find that romance novels are in high demand). It also contains quite a few books in the Vampires genre, a category that may since have been phased out, given that no vampire books appear in Smashwords21.

In and of itself, skewed representation can lead to issues when training large language models. But as we looked at some "romance" novels, it became clear that some books pose further concerns.

Potential Concerns that Need Further Attention

Problematic Content

While there is more work to be done in determining the extent of problematic content in BookCorpus, our analysis shows that it definitely exists. Consider, for example, one novel in BookCorpus called The Cop And The Girl From The Coffee Shop, by Terry Towers (available on www.smashwords.com). The book's preamble clearly states that "the material in this book is intended for ages 18+," and on Smashwords the book's tags include "alpha male" and "submissive female." While there may be little harm in informed adults reading a book like this, feeding it as training material to language models would contribute to well-documented gender discrimination in these technologies.

Potentially Skewed Religious Representation

When it comes to discrimination, the recently introduced BOLD framework also suggests looking at seven of the most common religious groups in the world: Sikhism, Judaism, Islam, Hinduism, Christianity, Buddhism, and Atheism. While we do not yet have the appropriate metadata to fully analyze religious representation in BookCorpus, we did find that BookCorpusOpen and Smashwords21 exhibit skews, suggesting that this could also be an issue in the original BookCorpus dataset.

[Table in the original post: religious representation in BookCorpusOpen and Smashwords21]

More work is needed to clarify religious representation in the original version of BookCorpus; however, since BookCorpus draws on the same source as BookCorpusOpen and Smashwords21, similar skews are likely.

Lopsided Author Contributions

Another potential issue is lopsided author contributions. Again, we do not yet have all the metadata we would need for a complete analysis of BookCorpus, but we can make estimates based on our Smashwords21 dataset.

In Smashwords21, we found that author contributions were quite lopsided: the top 10% of authors contributed 59% of all words in the dataset. Word contributions roughly follow the Pareto principle (i.e. the 80/20 rule), with the top 20% of authors contributing 75% of all words. Similarly, in terms of book contributions, the top 10% of authors contributed 43% of all books. We even found some "super-authors," like Kenneth Kee, who has published over 800 books.

If BookCorpus is at all similar to Smashwords.com as a whole, then a majority of books in the dataset were probably written by a minority of authors. In many contexts, researchers may want to account for these lopsided contributions when using the dataset, for example by computing share statistics like the ones sketched below.
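As a rough illustration of how to quantify this kind of lopsidedness, the sketch below computes the share of words contributed by the top fraction of authors, given per-author word counts. The function name and the use of NumPy are my own choices for illustration; this is not the analysis code from the paper.

```python
import numpy as np

def top_share(words_per_author, top_fraction):
    """Fraction of all words contributed by the top `top_fraction` of authors."""
    counts = np.sort(np.asarray(words_per_author, dtype=np.int64))[::-1]  # largest contributors first
    k = max(1, int(len(counts) * top_fraction))  # number of authors in the top slice
    return counts[:k].sum() / counts.sum()

# Hypothetical usage with per-author word counts from a corpus like Smashwords21.
# The figures reported above would correspond to roughly:
#   top_share(counts, 0.10) -> 0.59   (top 10% of authors wrote 59% of words)
#   top_share(counts, 0.20) -> 0.75   (top 20% of authors wrote 75% of words)
```

The same calculation applied to per-author book counts (rather than word counts) would give the 43% figure for the top 10% of authors.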
What's Next?

With new NeurIPS standards for dataset documentation, and even a whole new track devoted to datasets, hopefully the need for retrospective documentation efforts (like the one presented here) will decline. In the meantime, efforts like this one can help us understand and improve the datasets that power machine learning.

If you want to learn more, you can check out our code and data here, and the full paper here, which includes a "data card" summarizing our findings. If you have questions or comments, head over to the GitHub discussion! You can also reach out directly to Nick or me.

Thanks for reading to the end!