[HN Gopher] A closer look at BookCorpus, a key dataset in machin...
       ___________________________________________________________________
        
       A closer look at BookCorpus, a key dataset in machine learning
        
       Author : Kaibeezy
       Score  : 100 points
       Date   : 2023-09-19 12:13 UTC (1 days ago)
        
 (HTM) web link (towardsdatascience.com)
 (TXT) w3m dump (towardsdatascience.com)
        
       | dougb5 wrote:
        | The Wikipedia article on BookCorpus raises an important point
        | about how researchers used the word "unpublished" to describe
        | the books in this corpus -- the word appears both in the
        | original Aligning Books paper and in OpenAI's papers, which
        | don't even bother to acknowledge Smashwords. The books aren't
        | "unpublished" -- Smashwords is a self-publishing platform!
        | Whether deliberate or not, the word choice diminishes the human
        | effort that was appropriated by the researchers to train the
        | models.
        
       | armcat wrote:
        | This is really a fantastic analysis of a dataset, and it's
        | something that should be a mandatory smoke test before
        | proceeding with actual model training, in every organization or
        | research group. Whether you are using public datasets, buying
        | 3rd-party data, doing in-house data collection and annotation,
        | or paying someone to do it for you, you must check for class
        | imbalance and over/under-representation within your data -
        | inevitably human biases will creep in. Ultimately you have to
        | evaluate whether this data distribution is compatible with the
        | target distribution your model will be applied to in
        | production. Doing this post hoc is a real pain.
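        | 
        | A rough sketch of the kind of check I mean, with invented genre
        | labels standing in for whatever categories matter in your data:
        | 
        |     # Minimal sketch of a pre-training distribution check.
        |     # Labels and data are illustrative, not from BookCorpus.
        |     from collections import Counter
        | 
        |     def label_shares(labels):
        |         counts = Counter(labels)
        |         total = sum(counts.values())
        |         return {k: n / total for k, n in counts.items()}
        | 
        |     train = label_shares(
        |         ["romance", "romance", "romance", "fantasy"])
        |     target = label_shares(
        |         ["thriller", "literary", "romance", "fantasy"])
        | 
        |     # Flag labels heavily over- or under-represented relative
        |     # to what the model will see in production.
        |     for label in set(train) | set(target):
        |         t = train.get(label, 0.0)
        |         p = target.get(label, 0.0)
        |         if min(t, p) == 0 or max(t, p) / min(t, p) > 2:
        |             print(label, t, p, "<- check")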
        
         | evrydayhustling wrote:
         | I agree that this is extremely valuable, but it's worth
         | flagging that it's harder to reason about the impacts of class
         | imbalance for generative models than e.g. classifiers. For
         | example, should we think about genre imbalance per novel, per
         | token, or on some more complex basis? Which genres are most
         | relevant to a target distribution of chatbot queries?
         | 
         | This isn't to suggest that organizations shouldn't invest in
         | actively understanding their training data, but that post-hoc
         | bias analysis is going to be a critical component of evaluation
         | for the foreseeable future.
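          | 
          | To make the per-novel vs. per-token question concrete, a toy
          | sketch (the books and token counts are invented):
          | 
          |     # Toy illustration: genre share by book count vs. by
          |     # token count. All numbers are made up.
          |     books = [
          |         {"genre": "romance", "tokens": 120_000},
          |         {"genre": "romance", "tokens": 90_000},
          |         {"genre": "fantasy", "tokens": 400_000},
          |     ]
          |     total = sum(b["tokens"] for b in books)
          | 
          |     by_book, by_token = {}, {}
          |     for b in books:
          |         g = b["genre"]
          |         by_book[g] = by_book.get(g, 0) + 1 / len(books)
          |         by_token[g] = by_token.get(g, 0) + b["tokens"]/total
          | 
          |     print(by_book)   # romance ~0.67, fantasy ~0.33
          |     print(by_token)  # romance ~0.34, fantasy ~0.66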
        
       | barry-cotter wrote:
       | I regret reading about half of that article and suggest you save
       | the precious moments of your life that reading it would take and
        | do something that is valuable or interesting instead.
        
       | politelemon wrote:
        | I see many courses and papers base their work on BookCorpus, so
        | this should be somewhat significant. What are the 'places' where
        | these concerns can be highlighted to the wider machine learning
        | community? Not everyone will be here. Is there a forum, a
        | 'reddit', or a conference that machine learning people regularly
        | visit?
        | 
        | The cynical me is thinking that because this is inconvenient news
        | and would require rework, a lot of people would prefer to
        | suppress or ignore the author's findings (assuming they're true).
        
         | tastroder wrote:
          | This is from 2021, so not really news, and the paper version,
          | on arXiv and published at NeurIPS, has quite a few citations.
          | No one's suppressing this; people who don't reflect on their
          | datasets or how they use them either don't care or fail to
          | acknowledge that these are actual issues.
        
           | Kaibeezy wrote:
           | IANA AI developer but have been looking into this in detail
            | recently for other purposes. I was puzzled by the lack of
            | info about "books" and, when searching for detail (in what I
            | believe was a reasonably diligent manner), found surprisingly
            | little. I assumed there would be more knowledge and did ask
            | for it here. So now I will go look up those papers to get a
            | better sense of things. Thank you for the tip.
           | 
           | I note neither this paper nor any discussion of "BookCorpus"
           | or even "book corpus" has appeared on HN previously.
           | 
           |  _Addressing "Documentation Debt" in Machine Learning
           | Research: A Retrospective Datasheet for BookCorpus_, 2021,
           | Jack Bandy and Nicholas Vincent
           | 
           | https://arxiv.org/pdf/2105.05241.pdf?
        
       | jprete wrote:
       | The correct title is "Dirty Secrets of BookCorpus, a Key Dataset
       | in Machine Learning".
        
       | Kaibeezy wrote:
       | OP here. I had a lot of difficulty finding info on Books1 and
       | Books2, even on HN. If there's a better source of info, please
       | link or post.
       | 
       | What's the value of these scant few thousand unpublished romance
       | and fantasy novels in the context of the rest of the corpus --
       | vast scrapings, all of Wikipedia, etc.? A sample of how people
       | write? Why aren't more public domain works included?
        
         | __loam wrote:
         | [flagged]
        
           | kyle-rb wrote:
            | I think you're reading too much into AI-specific stuff; I
            | would happily violate copyright for no reason at all.
        
           | slyall wrote:
            | With the way copyright now lasts roughly forever and covers
            | roughly everything, it can't expect to be respected.
            | Copyright owners have used their power to tilt the bargain
            | entirely in their favour. They have only themselves to blame
            | if people increasingly don't want to obey their rules.
        
             | __loam wrote:
             | Yeah you're right, it's totally justified to steal the work
             | of still living artists without their permission to build
             | massive for profit systems to automate their jobs away
             | because some corporations have abused copyright.
        
           | JimDabell wrote:
           | > A lot of people, especially on hacker news, believe
           | something being on the internet is a license to do whatever
           | the fuck they want with that data.
           | 
           | A license to do whatever the fuck they want? No. To do things
           | other than make copies? Sure. Copyright is the right to copy,
           | no more than that. Learning from something is not copying it.
           | If you want to complain about memorisation, that's fair. But
           | learning from something is not something copyright was
           | intended to restrict, so no, plenty of people absolutely will
           | not care about copyright when it comes to this kind of thing,
           | and rightly so.
        
           | Kaibeezy wrote:
           | ChatGPT told me it doesn't have the text of the US
           | Declaration of Independence due to copyright. It does not
           | have the English Magna Carta within its accessible text. That
           | seems unexpected. It does have the US Constitution.
        
             | JimDabell wrote:
              | Large language models don't know what they know. Asking
              | them what they know is often going to give you an
              | incorrect answer.
              | 
              |     $ llm -m 4 "Quote the declaration of independence:"
             | 
             | IN CONGRESS, JULY 4, 1776
             | 
             | The unanimous Declaration of the thirteen united States of
             | America
             | 
             | When in the Course of human events it becomes necessary for
             | one people to dissolve the political bands which have
             | connected them with another and to assume among the powers
             | of the earth, the separate and equal station to which the
             | Laws of Nature and of Nature's God entitle them, a decent
             | respect to the opinions of mankind requires that they
             | should declare the causes which impel them to the
             | separation.
             | 
             | We hold these truths to be self-evident, that all men are
             | created equal, that they are endowed by their Creator with
             | certain unalienable Rights, that among these are Life,
             | Liberty and the pursuit of Happiness. That to secure these
             | rights, Governments are instituted among Men, deriving
             | their just powers from the consent of the governed...
             | 
             | (Note: The full text of the Declaration of Independence is
             | quite lengthy, so I have included the most well-known
             | portion here. The full text, including a list of grievances
             | against King George III and the signatures of the signers,
             | can be found elsewhere.)
        
             | Terretta wrote:
             | It made that answer up.
        
               | __loam wrote:
               | https://youtu.be/GM-e46xdcUo?si=D8Zmikk8TKsLoHwx
        
             | a_bonobo wrote:
              | Perhaps there's a part of the prompt that tells GPT not to
              | give users details about written works, to avoid
              | embarrassing leaks of copyrighted text, but the prompt is
              | slightly too strict, so GPT also won't talk about 'open'
              | texts? Pure speculation.
        
               | gs17 wrote:
               | I don't think that's the case, although there was an
               | interesting bug a while back where it would freeze after
               | each word when asked to quote the opening to, e.g., Moby
               | Dick.
        
         | wongarsu wrote:
         | > What's the value of these
         | 
          | The Pile (the 800GB dataset by EleutherAI) contains
          | BookCorpus2, along with two much larger datasets of books (and
          | a whole lot of not-book stuff). From their paper [0], the
          | reasoning for the book datasets is that they "are invaluable
          | for long-range context modeling research and coherent
          | storytelling". The reasoning for including BookCorpus2
          | alongside Books3 and Project Gutenberg boils down to "no
          | significant overlap with the other datasets, and others use
          | it".
          | 
          | In general, books are a great source of extremely high-quality
          | long-form content. They are longer than most content found on
          | the web, and are generally of high quality, having gone through
          | many revision rounds between editor and author. The catch is
          | that neither of these really holds for BookCorpus. Even a dump
          | of highly rated stories from fanfiction.org might be better.
         | 
         | 0: https://arxiv.org/pdf/2101.00027.pdf
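          | 
          | The "no significant overlap" point can be checked in a crude
          | way; what follows is only a naive title-based sketch (real
          | dedup pipelines hash or fuzzily match full document text):
          | 
          |     # Rough overlap estimate between two book datasets,
          |     # keyed on normalised titles. Illustrative only.
          |     def norm(t):
          |         t = t.lower()
          |         return "".join(c for c in t if c.isalnum())
          | 
          |     def overlap(titles_a, titles_b):
          |         a = {norm(t) for t in titles_a}
          |         b = {norm(t) for t in titles_b}
          |         return len(a & b) / max(len(a), 1)
          | 
          |     print(overlap(["The Hobbit", "Dune"], ["DUNE", "Emma"]))
          |     # -> 0.5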
        
           | Kaibeezy wrote:
           | [flagged]
        
             | Kaibeezy wrote:
             | Why is that list ^ not interesting? Illustrative?
        
               | gs17 wrote:
                | It's not useful. You can ask ChatGPT again in a new
                | session, and you'll get a different list. You can then
                | ask it about them and find out it's making it up. For
                | example, "Wuthering Heights" is on your list; when I
                | ask, "Pride and Prejudice" is on it. In another session,
                | I can ask it for opening lines, character lists, etc.
                | from those works and then ask for a crossover story
                | where the characters from each meet each other. The
                | model isn't likely to regurgitate the original text in
                | its entirety, but it does know them.
        
               | jamilton wrote:
                | Additionally, I believe a general norm has been
                | developing not to post raw ChatGPT output without
                | commentary/editing, because it's low value. If anyone
                | wanted similar output, they could just ask ChatGPT
                | themselves. It also often gives false info, as this one
                | demonstrates. You _need_ to fact-check it, and show that
                | you have, before posting it as a source of information,
                | IMO.
        
             | [deleted]
        
             | jazzyjackson wrote:
              | The bots are not capable of reflection; they do not know
              | and cannot check what is in their training data.
        
           | NoMoreNicksLeft wrote:
           | Are these primarily fiction books, or a mix of fiction and
           | non-fiction?
        
             | banana_giraffe wrote:
              | The books in the books3 collection aren't categorized. The
              | source, however, currently is at a ratio of 2:1 nonfiction
              | to fiction, and from what I've seen, whoever created the
              | books3 archive simply attempted to gather all the EPUBs
              | they could, with their only criterion being availability.
        
       | [deleted]
        
       | lukev wrote:
       | I wonder how much better an LLM could be given even better
       | training data.
       | 
        | For example, the total number of tokens contained in the physical
        | and digital collections of a moderately-sized university library
        | is probably on par with the size of the training data for
        | GPT-3.5.
        | 
        | What would happen if you could train just on _that_? I know we're
        | using huge training sets, but how much of them is just junk from
        | the internet?
       | 
       | (There should be _some_ representative junk in the dataset, but
       | nowhere near the majority.)
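        | 
        | Back-of-envelope, with every figure below an assumption rather
        | than a published number (GPT-3.5's training size isn't public;
        | ~300B tokens is the order reported for GPT-3):
        | 
        |     volumes = 2_000_000        # a mid-sized university library
        |     words_per_volume = 80_000  # a typical book
        |     tokens_per_word = 1.3      # rough tokenization ratio
        | 
        |     library = volumes * words_per_volume * tokens_per_word
        |     gpt3 = 300e9
        | 
        |     print(f"library ~{library / 1e9:.0f}B tokens")  # ~208B
        |     print(f"GPT-3   ~{gpt3 / 1e9:.0f}B tokens")     # ~300B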
        
         | pseudonom- wrote:
         | https://arxiv.org/abs/2306.11644 is along these lines.
        
         | josephcooney wrote:
          | Isn't this what "TinyStories" / the Phi LLMs are doing?
          | https://arxiv.org/abs/2306.11644
        
       ___________________________________________________________________
       (page generated 2023-09-20 23:02 UTC)