[HN Gopher] A closer look at BookCorpus, a key dataset in machin...
___________________________________________________________________
A closer look at BookCorpus, a key dataset in machine learning
Author : Kaibeezy
Score : 100 points
Date : 2023-09-19 12:13 UTC (1 days ago)
(HTM) web link (towardsdatascience.com)
(TXT) w3m dump (towardsdatascience.com)
| dougb5 wrote:
| The Wikipedia article on BooksCorpus raises an important point
| about how researchers used the word "unpublished" to describe the
| books in this corpus -- this word appears both in the
| original Aligning Books paper and in OpenAI's papers, which
| don't even bother to acknowledge SmashWords. The books aren't
| "unpublished" -- SmashWords is a self-publishing platform!
| Whether deliberately or not, the word choice diminishes the human
| effort that was appropriated by the researchers to train the
| models.
| armcat wrote:
| This is really a fantastic analysis of a dataset, and it's
| something that should be a mandatory smoke test before
| proceeding with actual model training, in every organization or
| research group. Whether you are using public datasets, buying 3rd
| party data, doing in-house data collection and annotation, or
| paying someone to do it for you, you must check for class
| imbalance and over/under-representation within your data -
| inevitably human biases will creep in. Ultimately you have to
| evaluate whether this data distribution is compatible with the
| target distribution your model will be applied to in
| production. Doing this post hoc is a real pain.
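The kind of balance check described above can be sketched in a few lines. The corpus records, genre labels, and counts here are purely hypothetical, for illustration only, and not drawn from BookCorpus:

```python
from collections import Counter

# Hypothetical corpus records as (document_text, genre_label) pairs;
# the labels and proportions are made up, not from BookCorpus itself.
corpus = [
    ("...", "romance"), ("...", "romance"), ("...", "romance"),
    ("...", "fantasy"), ("...", "sci-fi"),
]

def genre_distribution(records):
    """Share of documents per genre, to flag over/under-representation."""
    counts = Counter(genre for _, genre in records)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}

print(genre_distribution(corpus))  # romance dominates at 0.6
```

Comparing this distribution against the expected production distribution is the cheap version of the check; doing it before training is the point.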
| evrydayhustling wrote:
| I agree that this is extremely valuable, but it's worth
| flagging that it's harder to reason about the impacts of class
| imbalance for generative models than e.g. classifiers. For
| example, should we think about genre imbalance per novel, per
| token, or on some more complex basis? Which genres are most
| relevant to a target distribution of chatbot queries?
|
| This isn't to suggest that organizations shouldn't invest in
| actively understanding their training data, but that post-hoc
| bias analysis is going to be a critical component of evaluation
| for the foreseeable future.
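The per-novel versus per-token question above can be made concrete with a toy example. The book records and token counts are invented for illustration; the point is that a few long books can flip which genre "dominates" depending on the weighting:

```python
# Hypothetical book records; genres and token counts are made up.
books = [
    {"genre": "romance", "tokens": 120_000},
    {"genre": "romance", "tokens": 110_000},
    {"genre": "fantasy", "tokens": 300_000},
]

def share_by(records, weight):
    """Genre shares under a given weighting function (per book, per token, ...)."""
    totals = {}
    for b in records:
        totals[b["genre"]] = totals.get(b["genre"], 0) + weight(b)
    grand = sum(totals.values())
    return {g: t / grand for g, t in totals.items()}

per_novel = share_by(books, lambda b: 1)            # romance looks dominant: 2/3
per_token = share_by(books, lambda b: b["tokens"])  # fantasy dominates: ~0.57
```

Since a generative model sees tokens, not book titles, the two views can disagree badly, which is exactly why "which imbalance matters" is harder to answer than for a classifier.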
| barry-cotter wrote:
| I regret reading about half of that article and suggest you save
| the precious moments of your life that reading it would take and
| do something that is valuable or interesting instead.
| politelemon wrote:
| I see many courses and papers base their work off BookCorpus. So
| this should be somewhat significant. What are the 'places' that
| these concerns can be highlighted to the wider machine learning
| community? Not everyone will be here. Is there a forum or a
| 'reddit' or a conference where machine learning people regularly
| visit?
|
| The cynical me is thinking that because this is inconvenient news
| and would require rework, a lot of people would prefer to
| suppress or ignore the author's findings (assuming they're true).
| tastroder wrote:
| This is from 2021, not really news, and the paper version on
| arxiv and published at NeurIPS has quite a few citations. No
| one's suppressing this; people that don't reflect on their
| datasets or how they use them just either don't care or fail to
| acknowledge that these are actual issues.
| Kaibeezy wrote:
| IANA AI developer but have been looking into this in detail
| recently for other purposes. I was puzzled at the lack of
| info about "books" and when searching for detail (in what I
| believe was a reasonably diligent manner) found a
| surprisingly small amount of it. I assumed there would be
| more knowledge and did ask for it here. So now I will go look
| up those papers to get a better sense of things. Thank you
| for the tip.
|
| I note neither this paper nor any discussion of "BookCorpus"
| or even "book corpus" has appeared on HN previously.
|
| _Addressing "Documentation Debt" in Machine Learning
| Research: A Retrospective Datasheet for BookCorpus_, 2021,
| Jack Bandy and Nicholas Vincent
|
https://arxiv.org/pdf/2105.05241.pdf
| jprete wrote:
| The correct title is "Dirty Secrets of BookCorpus, a Key Dataset
| in Machine Learning".
| Kaibeezy wrote:
| OP here. I had a lot of difficulty finding info on Books1 and
| Books2, even on HN. If there's a better source of info, please
| link or post.
|
| What's the value of these scant few thousand unpublished romance
| and fantasy novels in the context of the rest of the corpus --
| vast scrapings, all of Wikipedia, etc.? A sample of how people
| write? Why aren't more public domain works included?
| __loam wrote:
| [flagged]
| kyle-rb wrote:
| I think you're reading too much into AI specific stuff, I
| would happily violate copyright for no reason at all.
| slyall wrote:
| With the way copyright now lasts roughly forever and covers
| roughly everything, it can't expect to be respected.
| Copyright owners have used their power to tilt the bargain
| entirely in their favour. They have only themselves to blame
| if people increasingly don't want to obey their rules.
| __loam wrote:
| Yeah you're right, it's totally justified to steal the work
| of still living artists without their permission to build
| massive for profit systems to automate their jobs away
| because some corporations have abused copyright.
| JimDabell wrote:
| > A lot of people, especially on hacker news, believe
| something being on the internet is a license to do whatever
| the fuck they want with that data.
|
| A license to do whatever the fuck they want? No. To do things
| other than make copies? Sure. Copyright is the right to copy,
| no more than that. Learning from something is not copying it.
| If you want to complain about memorisation, that's fair. But
| learning from something is not something copyright was
| intended to restrict, so no, plenty of people absolutely will
| not care about copyright when it comes to this kind of thing,
| and rightly so.
| Kaibeezy wrote:
| ChatGPT told me it doesn't have the text of the US
| Declaration of Independence due to copyright. It does not
| have the English Magna Carta within its accessible text. That
| seems unexpected. It does have the US Constitution.
| JimDabell wrote:
| Large language models don't know what they know. Asking
| them what they know is often going to give you an incorrect
| answer.
|
| $ llm -m 4 "Quote the declaration of independence:"
|
| IN CONGRESS, JULY 4, 1776
|
| The unanimous Declaration of the thirteen united States of
| America
|
| When in the Course of human events it becomes necessary for
| one people to dissolve the political bands which have
| connected them with another and to assume among the powers
| of the earth, the separate and equal station to which the
| Laws of Nature and of Nature's God entitle them, a decent
| respect to the opinions of mankind requires that they
| should declare the causes which impel them to the
| separation.
|
| We hold these truths to be self-evident, that all men are
| created equal, that they are endowed by their Creator with
| certain unalienable Rights, that among these are Life,
| Liberty and the pursuit of Happiness. That to secure these
| rights, Governments are instituted among Men, deriving
| their just powers from the consent of the governed...
|
| (Note: The full text of the Declaration of Independence is
| quite lengthy, so I have included the most well-known
| portion here. The full text, including a list of grievances
| against King George III and the signatures of the signers,
| can be found elsewhere.)
| Terretta wrote:
| It made that answer up.
| __loam wrote:
| https://youtu.be/GM-e46xdcUo?si=D8Zmikk8TKsLoHwx
| a_bonobo wrote:
| Perhaps there's part of the prompt that tells GPT not to
| reveal details of written works, to avoid embarrassing
| copyrighted text leaking out, but the prompt is slightly
| too strict and stops GPT from talking about 'open' texts
| as well? Pure speculation.
| gs17 wrote:
| I don't think that's the case, although there was an
| interesting bug a while back where it would freeze after
| each word when asked to quote the opening to, e.g., Moby
| Dick.
| wongarsu wrote:
| > What's the value of these
|
| The Pile (the 800GB dataset by Eluther AI) contains
| BookCorpus2, along with two much larger datasets of books (and
a whole lot of not-book stuff). From their paper [0] the
| reasoning for the book datasets is that they are "invaluable
| for long-range context modeling research and
| coherent storytelling". The reasoning for including BookCorpus2
| next to Books3 and Project Gutenberg boils down to "no
| significant overlap with the other datasets, and others use
| it".
|
| In general books are a great source of extremely high quality
| long-form content. They are longer than most content found on
| the web, and are generally of high quality, having gone through
| many revision rounds between editor and author. Just that both
| of these aren't really true of BookCorpus. Even a dump of
| highly rated stories from fanfiction.org might be better.
|
| 0: https://arxiv.org/pdf/2101.00027.pdf
| Kaibeezy wrote:
| [flagged]
| Kaibeezy wrote:
| Why is that list ^ not interesting? Illustrative?
| gs17 wrote:
| It's not useful. You can ask ChatGPT again in a new
| session, and you'll get a different list. You can then
| ask it about them and find out it's making it up. For
| example, "Wuthering Heights" is on your list, when I ask
| it "Pride and Prejudice" is on it. In another session, I
| can ask it for opening lines, character lists, etc. from
| those works and then ask for a crossover story where the
| characters from each meet each other. The model isn't
| likely to regurgitate the original text in its entirety,
| but it does know them.
| jamilton wrote:
| Additionally, I believe a general norm has been
| developing to not post raw output of ChatGPT without
| commentary/editing, because it's low value. If anyone
| wanted similar output, they could just ask ChatGPT
| themselves. It also often gives false info, as this one
| demonstrates. You _need_ to fact check it and demonstrate
that before posting it if you're posting it as a source
| of information, IMO.
| [deleted]
| jazzyjackson wrote:
| the bots are not capable of reflection, they do not know
| and cannot check what is in their training data
| NoMoreNicksLeft wrote:
| Are these primarily fiction books, or a mix of fiction and
| non-fiction?
| banana_giraffe wrote:
| The books in the books3 collection aren't categorized. The
| source, however, currently is at a ratio of 2:1 nonfiction
| to fiction, and from what I've seen, whoever created the
| books3 archive simply attempted to gather all the EPubs
they could, with their only criterion being availability.
| [deleted]
| lukev wrote:
| I wonder how much better an LLM could be given even better
| training data.
|
| For example, the total number of tokens contained in the physical
| and digital collections of a moderately-sized university library
| is probably on par with the size of the training
| data for GPT-3.5.
|
| What would happen if you could train just on _that_? I know
| we're using huge training sets, but how much of it is just junk
| from the internet?
|
| (There should be _some_ representative junk in the dataset, but
| nowhere near the majority.)
| pseudonom- wrote:
| https://arxiv.org/abs/2306.11644 is along these lines.
| josephcooney wrote:
| Isn't this what "tiny stories" / Phi LLM are doing?
| https://arxiv.org/abs/2306.11644
___________________________________________________________________
(page generated 2023-09-20 23:02 UTC)