[HN Gopher] The Pile: An 800GB dataset of diverse text for language modeling
       ___________________________________________________________________
        
       The Pile: An 800GB dataset of diverse text for language modeling
       (2020)
        
       Author : charlysl
       Score  : 70 points
       Date   : 2023-07-11 18:19 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | charlysl wrote:
       | OP here. I learned about this while reading the "Data" lecture
       | of Stanford's LLM course [1]. It's interesting how it assesses
       | the datasets used for GPT-2, GPT-3, etc., and how The Pile
       | addresses their shortcomings. A great course!
       | 
       | [1] https://stanford-cs324.github.io/winter2022/lectures/data/
        
         | pjot wrote:
         | The Pile was also referenced in a post today about some guy's
         | tweets with "leaked" GPT-4 details:
         | 
         | https://news.ycombinator.com/item?id=36675934
        
       | Der_Einzige wrote:
       | I came so close to getting my dataset DebateSum
       | (https://huggingface.co/datasets/Hellisotherpeople/DebateSum)
       | into The Pile, but they decided at the last minute not to add it:
       | https://github.com/EleutherAI/the-pile/issues/56
       | 
       | I'm still a tiny bit salty about that, but The Pile is a
       | wonderful dataset regardless.
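       | 
       | If you want to poke at it yourself, here's a minimal loading
       | sketch (assuming the hub repo works with the standard
       | HuggingFace datasets API; the split name is a guess):
       | 
       |     import datasets
       | 
       |     # dataset id comes from the link above; "train" split assumed
       |     ds = datasets.load_dataset("Hellisotherpeople/DebateSum",
       |                                split="train")
       |     print(ds[0].keys())  # inspect the available fields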
        
         | orange_fritter wrote:
         | That dataset looks cool. Good work either way; I'm sure it'll
         | go somewhere.
        
           | Der_Einzige wrote:
           | Stay tuned! I'm writing a paper about a new follow-up:
           | a 40x improvement in size (basically every open source
           | debate card... ever) and a 40x improvement in metadata and
           | duplicate detection. The work has been done since late
           | April; I've just been lazy/writer-blocked (ironic in a
           | world of high-end LLMs) and haven't finished the paper.
           | 
           | Kind of sad to have missed the NeurIPS datasets track and
           | ACL deadlines, but I know that anything close to this in
           | scope is a slam-dunk accept at the argument mining
           | workshop.
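           | 
           | For the curious, near-duplicate detection doesn't need
           | anything fancy; a rough shingle-overlap sketch (not my
           | exact pipeline, just the general idea):
           | 
           |     import hashlib
           | 
           |     def shingles(text, n=5):
           |         # hash n-word shingles of lowercased text
           |         words = text.lower().split()
           |         grams = {" ".join(words[i:i + n])
           |                  for i in range(max(1, len(words) - n + 1))}
           |         return {hashlib.md5(g.encode()).hexdigest()
           |                 for g in grams}
           | 
           |     def jaccard(a, b):
           |         return len(a & b) / len(a | b) if a | b else 0.0
           | 
           |     card_a = "resolved: the usfg should..."  # placeholder
           |     card_b = "resolved: the usfg must..."    # placeholder
           |     # near-duplicates above some overlap threshold
           |     print(jaccard(shingles(card_a), shingles(card_b)) > 0.8)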
        
       | dang wrote:
       | Related:
       | 
       |  _The Pile: An 800GB Dataset of Diverse Text for Language
       | Modeling_ - https://news.ycombinator.com/item?id=36272365 - June
       | 2023 (5 comments)
       | 
       |  _The Pile: An 800GB Dataset of Diverse Text for Language
       | Modeling_ - https://news.ycombinator.com/item?id=25607809 - Jan
       | 2021 (60 comments)
        
       | sillysaurusx wrote:
       | Author here. And by author I mean I created books3 (the books
       | component of The Pile) while everyone else did the hard work of
       | actually writing the paper, ha. Stella and Leo Gao in particular
       | did so much wonderful work on the paper, though it couldn't have
       | happened without everyone's contributions.
       | 
       | As far as I know, this was the first academic ML contribution
       | to come out of a Discord collaboration. Back then Discord was
       | barely used for ML at all, though nowadays of course the
       | largest Discord server in the world is Midjourney's.
       | 
       | There were a bunch of interesting stories from those days. We
       | almost didn't release at all (or at least the books component)
       | for fear of copyright backlash. Turns out no one cared, and
       | then suddenly today the world cares a great deal.
       | 
       | As a side note, I'll be participating in a legal action against
       | Meta for the purpose of making ML models uncopyrightable:
       | https://twitter.com/theshawwn/status/1641804013791215619?s=6....
       | They DMCA'ed one of my repos distributing LLaMA, so we fought
       | back and challenged the idea that weights can be copyrighted at
       | all. This seems like the best outcome for hackers and individual
       | researchers, for a few reasons. It's also one of the most ethical
       | outcomes; since ~no one trains on data that they own, they
       | shouldn't own the resulting model.
       | 
       | One last thing. The Pile would've been far less relevant without
       | the wonderful assistance of The Eye, a group of people who
       | archive all kinds of things. They've hosted the datasets for
       | years now. And although it seems strange to say that dataset
       | hosting could make or break The Pile, back then there was nobody
       | else willing to host us. https://the-eye.eu/
        
         | sfriedr wrote:
         | Could you share more about copyright? For example, aren't you
         | worried that now, with all kinds of lawsuits happening [1] and
         | copyright issues that were found in existing datasets [2], that
         | you might get threatening letters from a lawyer some day?
         | 
         | I'm the author of [3] where we introduced one of the first
         | natural-language datasets that test graduate mathematics for
         | LLMs, but some of the prompts we took from a copyrighted book
         | and therefore thought about excluding them. Having them in the
         | public dataset would be really nice though, hence I'm keen
         | about your experience.
         | 
         | I'd also be keen to hear how your challenge to the DMCA
         | takedown of LLaMA's weights goes.
         | 
         | [1] https://www.theguardian.com/books/2023/jul/05/authors-file-a...
         | [2] https://arxiv.org/abs/2105.05241
         | [3] https://arxiv.org/abs/2301.13867
        
           | sillysaurusx wrote:
           | I think a lot of hackers shy away from doing impactful work
           | because of fear. Sometimes those fears are justified, but
           | it's remarkable how often things that seem like a big deal
           | turn out not to matter. My advice for ambitious devs would be
           | to do what seems interesting, and don't worry too much about
           | threatening letters. Usually the worst thing that happens is
           | that you agree to stop doing whatever generated the threat.
           | 
           | Personally, I'm not worried. It would be a damn shame if
           | academics came under fire merely for trying to operate on
           | the cutting edge of science. None of us were trying to make
           | money; we just wanted to make something interesting.
           | 
           | > I'd also be keen to hear how your challenge against the
           | DMCA on sharing LLaMA's weights goes?
           | 
           | Thanks! I think we might be putting up a website for it soon,
           | if only to explain ourselves. In the meantime - I hate this
           | phrase, since I don't want followers - the only way to keep
           | informed is to follow my Twitter, and perhaps keep an eye on
           | my HN comments.
           | 
           | You'll probably hear about it either way though, since it's a
           | groundbreaking case. No one has tested the copyrightability
           | of ML models before.
        
           | Der_Einzige wrote:
           | Getting sued is straight up a good thing for most people's
           | careers in tech. Haven't you watched Silicon Valley?
        
         | jacquesm wrote:
         | > It's also one of the most ethical outcomes; since ~no one
         | trains on data that they own, they shouldn't own the resulting
         | model.
         | 
           | In my opinion the most ethical outcome would be that they
           | are on the hook for the cumulative cost of the copyrights
           | they violated. That way authors would come out ahead instead
           | of having their rights trashed 'because it's too late
           | anyway'.
        
           | cornel_io wrote:
           | Whether or not training on publicly available data counts as
           | a copyright violation is still completely up in the air
           | legally, and clearly a lot of lawyers at all of the top tech
           | companies think they're going to end up in the clear under
           | fair use.
           | 
           | At some point this stuff will have to get tested by making
           | its way up the appeals stack in the US. IMO there is only a
           | minuscule chance that will result in Google, MS, and Meta
           | getting slapped with anything more than a token fine (my
           | bet is it won't even be that), let alone paying copyright
           | damages to every person who ever wrote anything that ended
           | up in these datasets, which would be basically everyone.
        
           | rpdillon wrote:
           | > on the hook for the cumulative cost of the copyright they
           | violated.
           | 
         | I think there's a strong argument for a Fair Use defense,
         | given the size of the models versus the size of the training
         | sets, as well as the gulf in intended use: an AI model
         | doesn't compete with e.g. a book. Obviously we'll have to see
         | it play out in court to find out.
        
             | ben_w wrote:
             | Current AI models don't compete with a book, from what
             | I've seen; I wouldn't want to bet on how long it takes
             | before they can compete with not just one book but all
             | of them.
        
         | hedgehog wrote:
         | I understand that LLMs to date have mostly been trained on a
         | wide variety of copyright-encumbered data, but in other
         | domains (computer vision, for example) the tradeoffs are
         | different, and in practice many models are trained on private
         | or unencumbered data. If those weights are not protected by
         | copyright, my concern is that it will be hard to sufficiently
         | protect them via license agreements, and that will become yet
         | another factor favoring the SaaSification of everything in
         | tech.
        
           | sillysaurusx wrote:
           | This is true, and it's why I hesitated to file legal action.
           | My goal was to benefit hackers. If the outcome causes
           | problems for people who are just trying to share their work,
           | I'd be upset.
           | 
           | Ultimately what convinced me to proceed is that there are
           | immense forces pressuring ML models to become SaaS
           | offerings. It's very difficult to host an ML model for
           | extended periods _without_ being a company. E.g.
           | https://6b.eleuther.ai/ is down. Eleuther failing to keep
           | it up illustrates just how hard this is -- we were all
           | working as hard as we could to design something that would
           | last a long time, and a long time turned out to be two
           | short years. Contrast that with other kinds of hacking
           | (e.g. webdev, gamedev, hardware...) where the end result
           | lasts basically forever.
           | 
           | So if ML models aren't copyrightable, I think it'll hurt
           | companies a lot more than individuals. In fact the goal is
           | the other way around: to protect individuals. All I did was
           | publish Facebook's own GPL-licensed download script to
           | GitHub, and it got DMCA'd. If we don't push back on that
           | kind of behavior now, companies will get used to the idea
           | that they control "their" model -- even when their model is
           | anything but theirs.
        
           | idiotsecant wrote:
           | Is it useful to protect weights with copyright? What if I
           | download your weights and retrain them for 5 seconds,
           | changing each weight by 0.0000001%? How much change makes a
           | new product? What if I change a single weight?
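           | 
           | Concretely, the entire "derivative work" is a few lines
           | (a sketch; the checkpoint file names are made up):
           | 
           |     import numpy as np
           | 
           |     w = np.load("weights.npy")  # hypothetical checkpoint
           |     # nudge every weight by ~0.0000001% relative
           |     w *= 1 + 1e-9 * np.random.randn(*w.shape)
           |     np.save("weights_v2.npy", w)  # a "new" model?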
        
             | hedgehog wrote:
             | Like the parallel scenarios of taking a book and changing a
             | few words, slapping a new logo on someone else's app, or
             | stylizing a photo with a filter, those are questions that
             | will be answered in court if people can't come to an
             | agreement on their own.
        
       | cschmidt wrote:
       | If you're looking at The Pile, you might also consider the
       | RedPajama dataset. A new, cleaned version (SlimPajama) was
       | released recently:
       | https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...
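       | 
       | Streaming it avoids downloading the whole corpus up front; a
       | quick sketch (assuming the hub copy is cerebras/SlimPajama-627B
       | and exposes a "text" field):
       | 
       |     import itertools
       |     from datasets import load_dataset
       | 
       |     # stream so the 627B-token corpus isn't fetched eagerly
       |     ds = load_dataset("cerebras/SlimPajama-627B",
       |                       split="train", streaming=True)
       |     for example in itertools.islice(ds, 3):
       |         print(example["text"][:200])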
        
       ___________________________________________________________________
       (page generated 2023-07-11 23:00 UTC)