[HN Gopher] The Pile is an 825 GiB diverse, open-source language ...
       ___________________________________________________________________
        
       The Pile is an 825 GiB diverse, open-source language modelling
       data set
        
       Author : bilsbie
       Score  : 308 points
       Date   : 2024-03-07 17:14 UTC (5 hours ago)
        
 (HTM) web link (pile.eleuther.ai)
 (TXT) w3m dump (pile.eleuther.ai)
        
       | turnsout wrote:
       | The Pile is pretty old--is this an updated version?
        
         | bt1a wrote:
         | It is not.
         | 
         | In related news, v2 of the "stack" dataset was recently
         | released
         | 
         | > 3.28B unique files belonging to 104.2M github repositories
         | were collected by traversing the Software Heritage 2023-09-06
         | graph dataset. Additional repository-level metadata was
         | collected from GitHub Archive data up to 2023-09-14. The total
         | uncompressed size of all files is 67.53TB. Near-deduplication
         | was implemented in the pre-processing pipeline on top of exact
         | deduplication.
         | 
          | V1 vs V2, deduplicated size / tokens:
          | 
          | V1: 2.9 TB / 200B tokens
          | 
          | V2: 32.1 TB / 900B tokens
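          | 
          | Not the Stack's actual pipeline - just a toy sketch of what
          | "exact deduplication plus near-deduplication" can look like
          | in Python, using the third-party datasketch library
          | (corpus here is a stand-in for the real files):
          | 
          |   import hashlib
          |   from datasketch import MinHash, MinHashLSH
          | 
          |   corpus = ["def f(): pass",
          |             "def f():  pass",
          |             "print(1)"]
          | 
          |   def minhash(text, num_perm=128):
          |       # hash word 5-gram shingles into a signature
          |       m = MinHash(num_perm=num_perm)
          |       words = text.split()
          |       for i in range(max(1, len(words) - 4)):
          |           m.update(" ".join(words[i:i + 5]).encode())
          |       return m
          | 
          |   lsh = MinHashLSH(threshold=0.8, num_perm=128)
          |   seen, kept = set(), []
          |   for i, doc in enumerate(corpus):
          |       digest = hashlib.sha256(doc.encode()).hexdigest()
          |       if digest in seen:    # exact duplicate
          |           continue
          |       seen.add(digest)
          |       m = minhash(doc)
          |       if lsh.query(m):      # near-duplicate of a kept doc
          |           continue
          |       lsh.insert(f"doc-{i}", m)
          |       kept.append(doc)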
         | 
         | I imagine we'll see some fairly powerful open coding models
         | soon. The ones I'm looking at testing are:
         | 
         | dolphincoder-starcoder2-15b-iMat.GGUF
         | 
         | CodeFuse-DeepSeek-33B-iMat.GGUF
         | 
         | OpenCodeInterpreter-DS-33B-iMat.GGUF
         | 
         | starcoder2-15b-instruct-iMat.GGUF
         | 
          | More info:
          | 
          | dataset: https://huggingface.co/datasets/bigcode/the-stack-v2
          | 
          | GGUF quants: https://huggingface.co/dranger003
        
           | bick_nyers wrote:
           | Do you happen to know what the v2 dedup size is when
           | compressed? 32.1TB is quite a bit, but if that compresses
           | down to say 3-6TB, it would be much more manageable. Code has
           | a lot of whitespace, repetition, and
           | structure/predictability, so I imagine it would compress
           | better than average text.
        
             | spindump8930 wrote:
              | Those sizes refer to the data before processing and
              | filtering. The actual training size was about 3 TB:
              | 
              | > The Stack v2 is ten times larger than its predecessor,
              | yielding a raw dataset of 67.5 TB. Through extensive
              | cleaning, filtering, and subsampling of the source code,
              | along with the incorporation of other high-quality code-
              | related datasets, we created a training set of
              | approximately 3TB (900B+ tokens).
             | 
             | Source: the paper, Section 10
             | (https://arxiv.org/pdf/2402.19173.pdf)
        
       | zellyn wrote:
       | Is the "books3" dataset mentioned in the Pile paper the one that
       | authors are suing over? The one that includes a whole bunch of
       | popular and copyrighted material?
        
         | Balladeer wrote:
         | I believe it is. See https://www.wired.com/story/battle-over-
         | books3/
        
         | mistrial9 wrote:
          | This list [0] seems like a starting place to look into the
          | various legal actions. Not sure how often it is updated,
          | e.g. for Silverman et al.
         | 
         | [0] https://originality.ai/blog/openai-chatgpt-lawsuit-list
        
           | bt1a wrote:
           | Pouring one out for the future litigators, jurors, and judges
           | who will have to pore over this inextricable web of legal and
           | technical details
        
             | PeterStuer wrote:
              | They'll just let their AI do it over lunch.
        
         | taylorfinley wrote:
         | Yes, from the linked paper:
         | 
         | "Books3 is a dataset of books derived from a copy of the
         | contents of the Bibliotik private tracker made available by
         | Shawn Presser (Presser, 2020). Bibliotik consists of a mix of
         | fiction and nonfiction books and is almost an order of
         | magnitude larger than our next largest book dataset
         | (BookCorpus2). We included Bibliotik because books are
         | invaluable for long-range context modeling research and
         | coherent storytelling"
        
           | pimlottc wrote:
           | This is the most ridiculous legal hand wave I've ever seen.
           | 
           | "They're not books, man, they're a dataset!"
        
         | DiggyJohnson wrote:
         | Do they claim that none of their data came from copyrighted
         | sources / is copyrighted?
        
           | numpad0 wrote:
            | Why does everyone assume "open source" implies legality?
           | 
           | (/s)
        
           | jsheard wrote:
           | "Open source" implies that, no? A definition of open source
           | which includes blatantly pirated material on the condition
           | that the people who collated and released the pirated
           | material did so for free is really stretching it past
           | breaking point. By that standard everything on The Pirate Bay
           | is open source.
        
           | seanhunter wrote:
            | The claim (which I don't personally agree with, but I'm
            | trying to represent here in good faith) is that although
            | the data is copyrighted, training models constitutes
            | "fair use" under US copyright law and therefore you're
            | entitled to use copyrighted material for this.
           | 
           | Fair to say that whether or not this is correct is pretty
           | important to all the outstanding court cases on this matter.
        
             | retrac wrote:
              | I think there is actually a good argument that an AI
              | model is transformative, and that training a model
              | therefore does not infringe copyright. (An analogy: if
              | you rolled
             | dice to select words randomly from the Lord of the Rings
             | and rearranged them into a poem, it's not infringing the
             | Lord of the Rings even if in a sense, every word was taken
             | from that book.)
             | 
             | But you still have to get your hands on the copyrighted
             | data legally. It might be legal to scan every book an
             | institution owns, and train off it, so long as those scans
             | are not distributed. But it is probably not legal to scrape
             | copyrighted content off torrents - creating the copy to
             | train with is infringing, even if the model's final product
             | maybe isn't.
        
               | seanhunter wrote:
               | Yes agreed, and transformative use itself also has
               | limitations. You don't have carte blanche to use
               | something just because you think it's transformative, for
               | example the Lynn Goldsmith vs Andy Warhol Foundation case
               | over the "Orange Prince" work.
               | https://copyrightalliance.org/warhol-decision-reins-
               | transfor...
        
               | fsckboy wrote:
                | While there is a good argument that AI produces
                | transformative outputs, it's undermined when the
                | models are shown to regurgitate literal text, as they
                | have been. Then it just starts to look like a neural
                | memorization agent, a compressed storage algorithm,
                | etc.
        
               | bee_rider wrote:
                | Definitely open to the idea, but that couldn't be
                | the whole argument. I mean, my brain can output some
                | quotes, but I'm not a compressed storage algorithm.
                | Or at least I hope I'm not.
        
               | zettabomb wrote:
                | I've seen examples of this, but they're nearly always
                | isolated, rather difficult to obtain, and not in fact
                | exact copies. You need to specifically ask for an
                | exact copy, then attempt to defeat the safeguards the
                | model has in place to prevent this, and hope that it
                | was "memorized" - which for the record is considered
                | to be a _flaw_ in the model, as it's a reduction in
                | information density and capability compared to if
                | that "memory" were used for something else. Good
                | models seek to reduce this as much as possible. With
                | the size of the datasets involved (see OP), this
                | feels more like an understandable and reasonable
                | issue to have.
        
               | 7moritz7 wrote:
               | This very rarely happens, usually when trying hard to get
               | it to regurgitate, and I don't think it has ever happened
               | for anything longer than 2 paragraphs, or at most a short
               | article. Certainly not something like a book or even the
               | whole issue of a newspaper.
        
             | jdiff wrote:
             | That seems to fall apart quickly. Even if training could be
             | considered fair use, surely just distributing the raw
             | masses of copyrighted works can't be under any reasonable
             | definition. Otherwise, why did TBP, KAT, and MegaUpload
             | shut down if you could defeat copyright with sheer numbers?
        
               | seanhunter wrote:
               | Indeed. Also in the US, whether or not something is fair
               | use involves a four factor test[1] and two of the factors
               | are the amount and substantiality of what's taken and the
               | effect on any market. In this case, the amount is
               | "everything" and the effect on the market is potentially
               | very large for authors/publishers.
               | 
               | [1] https://fairuse.stanford.edu/overview/fair-use/four-
               | factors/
        
               | fsckboy wrote:
                | > _two of the factors are the amount and
                | substantiality of what's taken and the effect on any
                | market_
                | 
                | books.google.com has been allowed to copy all the
                | books they can lay their hands on, so long as they
                | don't regurgitate them in full - so it's not really
                | the taking, but any subsequent reproduction. And the
                | effect on the market is insubstantial if the
                | alternative wasn't going to be equivalent sales
                | anyway.
        
               | ascorbic wrote:
               | You can download the whole dataset, so they're certainly
               | able to regurgitate them in full.
        
               | justinclift wrote:
               | Since when has TBP shut down?
        
               | RecycledEle wrote:
               | Some of the founders were convicted of crimes but the
               | database and code are out there.
        
               | gosub100 wrote:
               | I think they are referring to the many times the domain
               | name has been seized, and shut down temporarily.
        
               | PeterisP wrote:
                | One thing that we did when distributing certain
                | copyright-protected textual material was to scramble
                | it at the paragraph level.
               | 
               | If you take every paragraph in the Harry Potter saga and
               | sort the paragraphs in alphabetical order, it's just as
               | good for training short-context-window models, but not a
               | "harm to the market" leading to a lost sale for anyone
               | who wants to read the books.
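                | 
                | A toy sketch of that idea (assuming plain text with
                | blank-line paragraph breaks; book.txt is
                | hypothetical):
                | 
                |   def scramble(text: str) -> str:
                |       # alphabetical order keeps short-range
                |       # statistics, destroys the narrative
                |       paras = [p.strip()
                |                for p in text.split("\n\n")
                |                if p.strip()]
                |       return "\n\n".join(sorted(paras))
                | 
                |   with open("book.txt") as f:
                |       print(scramble(f.read()))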
        
               | YeGoblynQueenne wrote:
                | Megaupload et al. went up against the entertainment
                | industry at a time when that industry had the money
                | to pay the lawyers to convince the judges what the
                | law means.
                | 
                | In the present moment, on the other hand, it is the
                | entities in the AI industry (e.g. MS) that have the
                | money and can hire the lawyers to convince the
                | judges. Realistically speaking, it's very likely that
                | things will swing the way of the AI companies, which
                | will benefit, albeit indirectly, these guys - even
                | though by themselves they're too small to push their
                | agenda; they're just bit players.
        
           | qwertox wrote:
           | There's odd stuff in there. I just randomly downloaded a
           | file,
           | 
           | https://the-eye.eu/public/Books/ThoseBooks/Puzzles.tar --
           | 20-Jan-2023 14:54 -- 6M
           | 
           | and it pretends to be a jigsaw puzzle, but is actually eISBN
           | 9781594868573 - The South Beach diet cookbook / Arthur
           | Agatston
        
       | racee wrote:
        | I love the Illuminati vibes of _The Eye_
        
       | brokensegue wrote:
       | "open source" as in gratis but not as in libre?
        
         | Legend2440 wrote:
         | more like, it's a scrape of the entire internet, use it at your
         | own risk
        
           | fddrdplktrew wrote:
            | wow, the internet is really small in your mind... or it's
            | too big in my mind, but I doubt it.
        
         | o11c wrote:
         | It should be understood in contrast to most traditional
         | corpora, which are heavily paywalled/restricted ... or else
          | based solely on century-old books. That has long been a
          | major obstacle for linguistics tooling.
         | 
         | If the current push of AI companies to get their way (to allow
         | copyright laundering) succeeds, this would almost count as open
         | source by the real definition.
         | 
         | If not ... lots of people/companies are committing copyright
         | crimes, some are committing civil infractions, and some may be
         | able to claim fair use.
        
       | jwitthuhn wrote:
        | Is this still available somewhere? I attempted to download it
        | several months ago and saw the download link 404ing; it seems
        | it is still like that.
        
         | TrueDuality wrote:
         | Most of the distribution for this is via torrents/magnet links
         | and in person hard drive exchanges. I'd go look at some public
         | trackers if you want a copy and don't know someone that already
         | has it.
         | 
         | Do be aware that it does include copyrighted content so
         | distribution is piracy.
        
           | Der_Einzige wrote:
            | Almost all LLM training datasets include copyrighted
            | content, so almost all open-source LLM distribution is
            | piracy, and almost all API-based LLMs, including ChatGPT,
            | are also piracy and copyright laundering.
            | 
            | Also, most image-text pair datasets contain far worse
            | than that. You might want to check out LAION-5B and what
            | Stanford researchers have found in there. Technically,
            | anyone who even touched that could in theory be in some
            | serious, serious trouble. I find it quite remarkable that
            | nothing has happened yet.
        
             | littlestymaar wrote:
              | It's only piracy if it's a private individual doing it;
              | otherwise it's just "ask for forgiveness, not for
              | permission"-type Capitalism.
        
               | gosub100 wrote:
                | It'll be some epic lawsuit like Google v. Samsung
                | that will get drawn out for a decade - awarded,
                | reduced, appealed, etc. - where the only winners will
                | be both parties' lawyers.
        
               | littlestymaar wrote:
               | It's gonna be way worse than this:
               | 
               | - OpenAI and others will just settle with MPAA, RIAA and
               | the likes for a revenue stream (a single digit billion a
               | year, likely) + some kind of control over what people can
               | and cannot do with the AI + the access to the technology
               | to produce their own content.
               | 
                | - artists will see peanuts from the deal, and the big
                | names are going to be able to stop doing any kind of
                | business with artists, who are just expenses in their
                | eyes. They will have been replaced by machines that
                | were trained on their art with no compensation
                | whatsoever.
               | 
               | IP is already predatory capitalism, AI will definitely be
               | weaponized against the workers by the owners of the means
               | of "production".
        
             | beeboobaa wrote:
             | Turns out you can ignore copyright law if your company has
             | enough money.
        
             | vineyardmike wrote:
              | The courts (in the US) have not found LLM model weights
              | to be piracy, nor the outputs, but it's really
              | surprising that LAION was used for so long considering
              | the content you allude to.
        
               | Filligree wrote:
               | LAION is essentially a list of every image on the public
               | internet. It was filtered, of course, but do you really
               | expect perfection?
               | 
                | It's impossible to create such a list while avoiding
                | all such material.
        
               | vineyardmike wrote:
                | There exist databases of "hashes of problematic
                | photos" (CSAM), so it seems trivial to check your
                | billions of photos against them before training an
                | AI. You can't catch everything, but this seems like
                | an obvious miss _considering they explicitly tried to
                | scrape pornography_.
                | 
                | These hashes are exactly how researchers later
                | discovered this content, so it's clearly not hard.
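                | 
                | A simplified sketch of that kind of pre-filter (real
                | systems use perceptual hashes such as PhotoDNA
                | rather than exact digests; the file names here are
                | hypothetical):
                | 
                |   import hashlib
                |   from pathlib import Path
                | 
                |   # one known-bad hex digest per line
                |   bad = set(Path("known_bad.txt")
                |             .read_text().split())
                | 
                |   def flagged(p: Path) -> bool:
                |       h = hashlib.sha256(p.read_bytes())
                |       return h.hexdigest() in bad
                | 
                |   keep = [p for p in Path("imgs").rglob("*.jpg")
                |           if not flagged(p)]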
        
               | duskwuff wrote:
               | The Stanford researchers also found a substantial number
               | of CSAM images in the LAION-5B dataset which were not
               | recognized by PhotoDNA, probably because the images in
               | question were not in wide distribution prior to their
               | inclusion in LAION.
               | 
               | Full paper: https://stacks.stanford.edu/file/druid:kh752s
               | m9123/ml_traini...
        
               | SEGyges wrote:
               | You are uploading 5 billion examples of <something>. You
                | cannot filter them manually, of course, because there
                | are five billion of them. Given that it is the year
                | 2024, how
               | hard is it to be positive that a well-resourced team at
               | Stanford in 2029 will not have better methods of
               | identifying and filtering your data, or a better
               | reference dataset to filter it against, than you do
               | presently?
               | 
               | It is a pretty hard problem.
        
               | vineyardmike wrote:
               | You don't have to do it manually. There is a database of
               | file hashes.
               | 
               | And this isn't just "one engineer". Companies like
               | StabilityAI, Google, etc have used LAION datasets. If you
               | _built_ a dataset you should expend some resources on
               | automated filtering. Don't include explicit imagery as an
               | intentional choice if you can't do basic filtering.
        
             | visarga wrote:
             | > almost all open source LLM distribution is piracy and
             | almost all API based LLMs, including ChatGPT, are also
             | piracy and copyright laundering
             | 
              | That's an amplification of copyright: original
              | expression is protected, but not the ideas themselves -
              | those are free. And don't forget that when we actually
              | get to use these models, we feed them questions, data,
              | and corrections - so they are not simply replicating
              | the training set; they learn and do new things with
              | new inputs.
              | 
              | In fact, if you think deeply about it, it is silly to
              | accuse AI of copyright violation. Copying the actual
              | book or article is much, much faster and cheaper, and
              | exact. Why would I pay an LLM provider to generate it
              | for me from the title and starting phrase? If I already
              | have part of the article, do I still need to generate
              | it with AI? It's silly. LLM regurgitations are
              | basically attacks with a special key - entrapments.
              | They don't happen in normal use.
        
             | doctorpangloss wrote:
             | > I find it quite remarkable that nothing has happened yet.
             | 
             | While I don't think it's because you're wrong, per se, it's
             | just that none of this drama really matters.
        
             | Workaccount2 wrote:
             | Models are not information archives. The size of the final
             | model is orders of magnitude smaller than the size of the
             | training data.
             | 
             | Somehow people are just not able to get this through their
             | heads. Stable diffusion is like 12GB or something and you
             | have people convinced it's a tool that is cutting and
             | pasting copyrighted works from an enormous image archive.
        
               | 7moritz7 wrote:
                | Stable Diffusion 1.5 is 1.5 to 6 GB depending on the
                | finetune, and was trained on like 5 billion images.
        
               | feoren wrote:
               | > The size of the final model is orders of magnitude
               | smaller than the size of the training data.
               | 
               | Good to know I can avoid copyright on a book just by
               | zipping it up!
        
         | archon1410 wrote:
         | > The Pile is old news, check out more recent datasets like;
         | https://huggingface.co/datasets/bigcode/the-stack-v2
         | 
         | -- https://the-eye.eu/public/AI/pile/readme.txt
        
           | natch wrote:
           | Super odd message since the stack v2 seems to be exclusively
           | code and The Pile is (mostly?) text.
        
         | HanClinto wrote:
         | Is it kosher to post magnet links here? I'm not sure.
         | 
         | magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667&dn
         | =EleutherAI_ThePile_v1
        
           | SEGyges wrote:
           | This is the correct one.
        
         | spindump8930 wrote:
          | Also good to note that the Pile contains lots of curated
          | sources, and recent trends have been to take curated data
          | sources and combine them with filtered webcrawls (i.e.
          | commoncrawl with heavy processing). See dolma or the stack
          | v2 (for code models), as others have mentioned.
        
       | DiggyJohnson wrote:
       | Awesome name. Reminds me of the "original" "Pile" from the
       | Manhattan Project.
       | 
       | I read about it in "The Making of the Atomic Bomb" (1986), but
       | presumably it's featured in the recent movie.
        
         | groby_b wrote:
         | Not really. There's an ultra-brief scene where it's mentioned,
         | but that's it, IIRC.
         | 
         | The movie... is a bunch of anecdotes strung together to make a
         | ham-handed point at the end. It was a decent movie if you treat
         | it as a fictional story instead of an actual retelling.
         | 
         | I'd stick with the book. (And if you specifically care about
         | Fermi, I recommend "The Last Man Who Knew Everything" by David
         | Schwartz)
        
       | Der_Einzige wrote:
        | I came so close to getting my Debate document dataset
        | "DebateSum"[1] included in this[2], and I am sad to this day
        | that it wasn't included:
       | 
       | [1] https://github.com/Hellisotherpeople/DebateSum [2]
       | https://github.com/EleutherAI/the-pile/issues/56
        
         | joshuakogut wrote:
         | > If you'd like to contribute it, feel free to submit a PR
         | 
         | Stella was waiting for you to submit your dataset. Did you? She
         | closed the ticket many months later.
        
           | Der_Einzige wrote:
            | They did a significant amount of work themselves, taking
            | other people's datasets and including them without the
            | original author needing to submit the full PR to do it.
            | I was extremely busy then, and remain so to this day.
            | 
            | Also, this was before most datasets were conveniently
            | hosted on huggingface.
           | 
           | It's all tears in the rain now.
        
       | joering2 wrote:
        | "The pile can be downloaded here."
        | 
        | 404 Not Found - nginx
        | 
        | 825 GB is a great candidate for torrent use; whatever was
        | under that broken link had better be a torrent magnet.
        
       | swatcoder wrote:
       | Where do I find the license reproductions and
       | credits/attributions for the content being distributed in this
       | data set? Is it all in there? Are all inclusions compliant? Can I
       | know?
       | 
       | I'm open to the argument that generators built with models that
       | _consumed_ copyrighted data may evade copyright obligations on
       | their output, but surely the data sets themselves are bound by
       | any copyright on their content?
        
         | __loam wrote:
         | They stole it because they think building their toys is more
         | important than everyone else's rights to the product of their
         | own labor.
        
           | johndough wrote:
            | I doubt that anyone is going to download and search
            | through over 800 GB just to find a badly formatted copy
            | of some book that could be found much quicker on
            | different websites with better formatting. Authors are
            | losing fractional cents here at most.
        
             | gosub100 wrote:
             | so just like Office Space? (paraphrasing) "We steal a
             | fraction of a cent from each transaction, who do we hurt?
             | Nobody. We just put the remainder into our account!"
             | 
             | Sorry that's not how damages are calculated in the US tort
             | system.
        
               | johndough wrote:
               | I do not know how damages are calculated in the US tort
               | system. What do they say about the books3 dataset?
               | 
               | I also think that the case is different here, since in
               | your example, there is a specific amount of money being
               | stolen, while in the books3 case, there is an unspecified
               | amount of money not being made by the authors.
        
               | SEGyges wrote:
               | I am pretty sure if the authors were trying to license
               | their works for this purpose we would just not use them
               | at all; it is difficult to see under what circumstances
               | they would stand to profit from this other than by suing
               | people after the fact over it.
        
               | doug_durham wrote:
                | I think you could argue that authors could profit
                | from their works being cited in an LLM response. It
                | could drive sales of their works, much like citations
                | do on the web. The counterargument is that an LLM
                | could give you the CliffsNotes version of the work
                | and thus take away a portion of sales.
        
               | SEGyges wrote:
               | In a world where the options were to
               | 
               | 1) pay the author,
               | 
               | 2) implement guaranteed citation of the author any time
               | the model gave an answer that was directly derivative,
               | with an option to not do so if the summary was
               | sufficiently vague, or
               | 
               | 3) ignore the author's book completely as training data
               | 
               | we would all choose 3).
        
               | __loam wrote:
               | And the authors would probably be very happy that you
               | did.
        
             | __loam wrote:
             | The penalty is up to $150k per violation.
        
           | pk-protect-ai wrote:
            | they have stolen nothing, and they make no profit from
            | it either.
        
             | idle_zealot wrote:
              | Oh, so there _aren't_ AI companies charging for access
              | to private models?
        
               | pk-protect-ai wrote:
                | Who are "they"? Why do you mix up the guys who
                | prepared the data with the other guys who used this
                | data and are making money from a vague memory of it?
        
           | idle_zealot wrote:
           | I congratulate all of the authors whose work is included in
           | this dataset on contributing their knowledge, skills, and
           | perspective to humanity's various endeavors, both creative
           | and technical. I hope that the fruits of their labors are
           | returned to them, rather than being selfishly hoarded by the
           | few with the resources necessary to produce those fruits, be
           | they publishers, middlemen, or big tech.
           | 
           | Which is all to say that information shouldn't be hoarded and
           | guarded. If it can produce something more than the sum of its
           | parts we should use it to do so. The result of that should,
           | on the same grounds, not be hoarded and guarded, doubly so
           | being based on the work of others.
        
             | gosub100 wrote:
             | It will produce "something more" for the already-wealthy
             | who control the technology. For instance, LLMs will
             | eliminate the need for some customer service jobs,
             | increasing the profit margin for the existing executives
             | and shareholders, while eliminating entry-level jobs from
             | the job market.
        
               | idle_zealot wrote:
               | Call me an idealist but I don't think humans should be
               | spending their time on jobs a computer can do.
               | 
               | The solution to wealth disparity cannot include "invent
               | menial untalented high-paying labor for people to do".
        
               | __loam wrote:
               | Yeah why should humans do bothersome labor
               | like...creating literature?
        
             | nonrandomstring wrote:
             | > I congratulate all of the authors whose work is included
             | in this dataset on contributing their knowledge, skills,
             | and perspective to humanity's various endeavours
             | 
             | Thank you. You know in some ways it's an honour and a
             | privilege to live in such times of progress. The very act
             | of publishing is to "let go", and hope that your words and
             | ideas contribute to something bigger and beyond your life.
             | I never believed much in "intellectual property" as it's
             | all stuff that flows through us.
             | 
             | > I hope that the fruits of their labours are returned to
             | them
             | 
             | They rarely are, because knowledge and creativity are not
             | greatly valued in our time. But authors, artists and
             | scientists go into that with eyes wide open these days. The
             | rewards come in other ways, as the more you give and put
             | into life the more you get out.
             | 
             | > rather than being selfishly hoarded by the few with the
             | resources necessary to produce those fruits
             | 
             | This is not what we fear. Hoard away. We will simply take
             | back what is ours, whenever we desire it. The hoarders will
             | never win against what they call "piracy", because they
             | have no moral right. In the long run, they are on the wrong
             | side of history.
             | 
             | Far worse, and more likely is that the creative and
             | technical works of generations of artists and scientists
             | are going to be turned to exactly the opposite of what they
             | would want. They will be used to harm and disempower
             | humans, divide society instead of heal it, and even make
             | the pursuits of art, science and knowledge irrelevant.
             | 
             | We cannot take back our words, or our formulas, or our
             | paintings or our songs. But we can _take back tech_.
        
         | jsheard wrote:
         | This dataset includes "books3", which is a comprehensive dump
         | of Bibliotik, a torrent tracker dedicated to pirated ebooks.
         | 
         | Throw a dart at a wall filled with every notable
         | author/publisher ever and whoever you hit probably owns some of
         | this data.
         | 
         | Apparently you can just do whatever as long as you say it's for
         | AI research, go post Blu-ray rips online, it's fine provided
         | you have a .ai domain :^)
        
           | fsckboy wrote:
           | > _Throw a dart at a wall filled with every notable author
           | /publisher ever_
           | 
           | copyrights do expire, and any books older than Mickey Mouse
           | are public domain, so it's not _every notable author ever_
        
             | jsheard wrote:
              | Technically true; narrow that down to merely "every
              | notable _living_ author and a subset of dead ones",
              | then.
              | 
              | Bram Stoker's bones will be relieved to hear that their
              | work isn't being misappropriated.
        
           | oldgradstudent wrote:
           | It also contains an archive of opensubtitles, which is also
           | not very open source.
        
             | refulgentis wrote:
             | The subtitles aren't open?
             | 
             | If you meant transcribing dialogue from a TV show is
             | violating copyright, I'm not so sure, it's relatively
             | common to quote dialogue for varied purposes, ex. TV
             | critics
             | 
             | Definitely understand if you're saying the whole dialogue
             | for a TV show is copyrighted, but I'm curious about the
             | opensubtitles part, used to work in that area.
        
               | PavleMiha wrote:
               | Quoting is very different from posting the full contents
               | of something. I can quote a book but I can't reproduce it
               | in its entirety.
        
               | refulgentis wrote:
               | Right, you can't reproduce a book. W/r/t subs and dubs,
               | fair use has applied historically.
        
               | layer8 wrote:
               | Quoting excerpts is different from transcribing an entire
               | work, which is unambiguously copyright infringement.
               | (Otherwise you would find the "book" version of any and
               | all TV shows on Amazon.) The subtitles in question are
               | generally translations, which likewise fall under
               | copyright, being a derived work.
        
               | refulgentis wrote:
               | Yeah, I was just curious about the opensubtitles site
               | because I used to work in that field (subtitles) and
               | wasn't sure if there were some new pirate sites that were
               | monetizing subs.
               | 
               | n.b. not being argumentative, please don't read it that
               | way, I apologize if it comes off that way:
               | 
               | Not every derived work is a copyright violation, that's
               | why subs and dubs don't get kicked around, you can quote
               | dialogue in an article, etc.[^1]
               | 
               | Answering if it applies to AI is playing out in court
               | currently with ex. NYT v. OpenAI[^2] and Sarah Silverman
               | et al v. OpenAI[^3] and v. Meta.[^4]
               | 
               | [^1] "Copyright doesn't protect against all use of the
               | work or use of derivative works. There are a few
               | exceptions that fall under what's commonly known as the
               | fair use doctrine:"
               | (https://www.legalzoom.com/articles/what-are-derivative-
               | works...)
               | 
               | [^2]
               | https://www.nytimes.com/2023/12/27/business/media/new-
               | york-t...
               | 
               | [^3] https://www.theverge.com/2024/2/13/24072131/sarah-
               | silverman-...
               | 
               | [^4] https://www.hollywoodreporter.com/business/business-
               | news/sar...
        
           | pk-protect-ai wrote:
            | I wish it still included books3, but it doesn't anymore.
            | I wish it were possible to download that 36GB books3.tar
            | in the wild these days. Herewith, I promise to use this
            | dataset in accordance with "fair use" only...
        
             | SekstiNi wrote:
             | > I wish it was possible to download that 36GB books3.tar
             | in the wild these days.
             | 
             | There... is a torrent.
        
               | pk-protect-ai wrote:
                | I know. But here where I am, using a torrent means
                | participating in distribution of the content, and
                | that is how I'd get a huge bill for illegally sharing
                | this file.
        
           | gosub100 wrote:
           | not the domain per se, but the high-powered law firms at your
           | fingertips. Copyright law is much easier to enforce against
           | working-class parents of 12-year-olds than SV elites.
        
       | arthurcolle wrote:
       | I can't believe people would do this, just share and republish
       | copyrighted works over the internet. I'm in shock and in
       | disbelief.
       | 
       | Anyways...
       | 
        | Are RedPajama 30T and The Pile "all you need"? ;)
        
         | artninja1988 wrote:
          | There is currently a project going on to create the Pile
          | v2, which has only permissively licensed data, because of
          | all the bickering about copyright.
        
           | jeffrallen wrote:
           | > because authors prefer to be paid for their labor
           | 
           | FTFY.
        
             | evilduck wrote:
             | I asked an AI tool to create a cheery poem about ringworms
             | infecting kids from the 1600s and it created something
             | that's never existed before. Which author gets paid for
             | this labor they performed?
        
             | zettabomb wrote:
             | This is pretty reductive - "FTFY" is rarely the witty
             | response you think it is.
        
             | ben_w wrote:
          | Naturally, but I wonder what writers are going to do when
          | AI trained purely on suitably licensed content is still
          | good enough to make most of them redundant.
          | 
          | (The authors on best-seller lists may well be immune for a
          | bit longer than other writers, as they're necessarily the
          | top 0.1% of writers, but not forever: nay-sayers claimed
          | that AI could never beat humans at chess or go because
          | those games required special human insight.)
        
               | mejutoco wrote:
                | > The authors on best-seller lists may well be immune
                | for a bit longer than other writers, as they're
                | necessarily the top 0.1% of writers
                | 
                | The top best-selling, yes. Quality is only one of
                | many possible reasons for that.
        
               | ben_w wrote:
               | Quality is subjective, therefore I think it is reasonable
               | to say the best are those most able to profit rather
               | than, e.g. winners of the Nobel Prize in Literature, or
               | the list of books people most pretend to have read.
        
               | wizzwizz4 wrote:
               | Once upon a time, nay-sayers said that nobody could
               | travel to the moon, regardless of what vehicle they used.
               | They were wrong. Once upon a time, nay-sayers said that
               | nobody could transmute lead into gold using alchemical
               | equipment. They were right.
               | 
               | Nay-sayers who said that _no possible algorithm_ could
               | beat humans at chess and go? They were wrong. Nay-sayers
               | who say that _these algorithms_ cannot write better books
               | than humans? Well...
        
               | SEGyges wrote:
               | By "these algorithms", do you mean the ones that
               | currently exist, or the ones that will exist next month,
               | next year, or in 2034?
        
               | wizzwizz4 wrote:
               | We're not developing new algorithms all that quickly. My
               | point is that one shouldn't dismiss criticism out-of-
               | hand, just because some critics of some other thing
               | turned out to be wrong: for this point to be valid, I
               | don't need to be making criticism. On an unrelated
               | note...
               | 
               |  _Personally_ , I'd be referring to the family of
               | algorithms that purely take as input a context window and
               | provide as output a prediction of the next token
               | likelihood. (Plus or minus iteration, to generate strings
               | of text.) Pejoratively, one might call these "fancy
               | Markov chains", though as with most pejoratives, that's
               | overly reductive.
               | 
               | All the approaches we're seeing marketed heavily are just
               | fancy Markov chains. I expect every "new algorithm" for
               | the next 5 years at least to be a fancy Markov chain,
               | because that's what I expect to get funding. (I do expect
               | that _some_ people will be working on other approaches,
               | but only for amateurish reasons.)
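                | 
                | For concreteness, a non-fancy Markov chain with a
                | one-token context window (toy sketch; the "fancy"
                | versions swap the lookup table for a learned
                | function over a long context):
                | 
                |   import random
                |   from collections import defaultdict
                | 
                |   def train(text):
                |       model = defaultdict(list)
                |       toks = text.split()
                |       for a, b in zip(toks, toks[1:]):
                |           model[a].append(b)
                |       return model
                | 
                |   def generate(model, word, n=10):
                |       out = [word]
                |       for _ in range(n):
                |           nxt = model.get(out[-1])
                |           if not nxt:
                |               break
                |           out.append(random.choice(nxt))
                |       return " ".join(out)
                | 
                |   m = train("the cat sat on the mat and the cat"
                |             " slept on the mat")
                |   print(generate(m, "the"))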
        
               | SEGyges wrote:
               | These are fancy Markov chains in the sense that humans
               | are just chemicals and computers just do math.
               | Technically true, but not even "overly reductive"; it is
               | just wrong if it is used to imply that, e.g., humans just
               | swirl around in beakers or the most complex thing you can
               | do with computers is trigonometry.
               | 
               | You can make anything sound unimpressive if you describe
               | it sufficiently poorly.
               | 
               | And: So many different variations are published every
               | month. There are a good number of people in serious
               | research trying approaches that don't use cross entropy
                | loss (i.e., strictly next-token prediction).
               | 
               | I don't know what the trajectory of the technology is
               | over the next ten years, but I am positive no one else
               | does either and anyone who thinks they do is wrong.
        
               | ben_w wrote:
               | > Once upon a time, nay-sayers said that nobody could
               | transmute lead into gold using alchemical equipment. They
               | were right.
               | 
               | Now I'm wondering if, with modern knowledge, you could
               | build a 0.5 MeV heavy ion accelerator with only the
               | things available to a medieval alchemist.
               | 
                | I'm thinking probably yes? Triboelectrics can get the
               | right voltage. But how good does the vacuum need to be?
               | 
               | > Nay-sayers who say that _these algorithms_ cannot write
               | better books than humans?
               | 
               | They may be right or wrong in the specific, but I think
               | they're asking the wrong question, too specific.
        
               | jfvinueza wrote:
                | Dunno. I write fiction myself, and asked an AI to
                | read it aloud. Narrative paragraphs worked fine: a
                | clear, if a bit deadpan, slightly tone-deaf delivery.
                | But dialogue was horrendous: it didn't understand
                | emotional reactions and connotations at all. More so
                | than cringey and robotic, it felt soulless. And the
                | distance from "something that makes sense" to
                | "something that feels human" felt insurmountable.
                | Yes, many novels will be written with LLMs in the
                | coming years. They might even touch us. But this
                | little text-to-speech experiment felt like evidence
                | that this technology has a void at its core: it
                | doesn't have access, like a human does, to a
                | gargantuan emotional spectrum, which allows us to
                | understand all sorts of subtleties between what is
                | being said, and why, and what it actually means, and
                | why it affects us (or, hell, how the next line should
                | be read in this context - because it has no context,
                | it doesn't feel).
        
               | ben_w wrote:
               | I'm also writing a novel, and using text to speech to
               | hear how it sounds. One of the ones built into Mac OS.
               | And I'd agree with your assessment, I value the
               | synthesiser for bringing my attention to things my eyes
               | gloss over, such as unnecessary repetition and typos
               | which are still correctly spelled words (a common one for
               | me is lose/loose).
               | 
               | But: AI was seen as "decades" away from beating humans at
               | go, even 6 months before it did.
               | 
               | I don't know how far we are from them writing award
               | winning novels (awards we care about, it doesn't count if
               | it's an award for best AI), though my _gut feeling_ is we
               | need another breakthrough as significant as the
                | transformer model... but even then, that's only a 1s
               | feeling.
        
             | onion2k wrote:
              | If the data is available online for the Pile, surely it's
             | also publicly available to ordinary people in a way that
             | means authors aren't getting any money.
        
               | sangnoir wrote:
               | What sort of defense is this? "Your honor, after someone
               | broke in, they left the door open. Since the door was
                | unlocked, _anyone_ could have committed the crime
                | I'm accused of."
        
           | idle_zealot wrote:
           | So a bunch of extra work to create a downgrade? I'm sure
           | that's going to be very popular.
        
             | arthurcolle wrote:
             | The training data distribution is the only thing that
             | matters, not the actual content
        
               | observationist wrote:
               | Unless you want something like style from a range of
               | authors, knowledge of a fictional universe or storyline,
               | or other domain specific data or style characteristics.
               | 
               | A blanket removal of copyrighted data would make a bot
               | sterile, boring, unrelatable, and ignorant of culture and
               | common memes. We have amazing AI technology. Let's lean
               | into it and see where it goes.
        
               | __loam wrote:
                | By violating the copyright of hundreds of authors.
        
           | chasd00 wrote:
        | If the Pile contains the code to go from step 1 to step 2
        | and then to 3, couldn't you just remove the parts you don't
        | want from the raw dataset and re-run the code?
        
         | doctorpangloss wrote:
          | It's enough for pre-training to later tackle specific NLP
          | tasks.
          | 
          | To get something interesting you would have to generate an
          | instruct dataset from it. It would have to cover a diverse
          | range of tasks. The completions themselves do not make LLMs
          | manifest knowledge and reasoning; a large and diverse
          | instruct dataset does.
        
       | Ninjinka wrote:
       | I raised a concern about the inclusion of books3 in the Pile back
       | in 2020, and this is what the head of Eleuther (Stella Biderman)
       | told me:
       | 
       | "So here's the big picture. There are three sets of datasets: 1.
       | Data exists out there in the world. It has been collected into
       | datasets and posted online. I'll call this raw data. 2. We take
       | that data, clean it, and process it for language modeling. I'll
       | call this per-set data. 3. We combine those per-set data into one
       | massive dataset, the Pile. This is heavily processed, including
       | weighing the components.
       | 
       | We created 2 and 3 and put them online. We put 2 online so that
       | people can reweigh and remix the data if they wish, but we expect
       | most people to just download 3 and use it out of the box. Access
       | to 3 will be provided in several forms, including HuggingFace and
       | from our website.
       | 
       | 2 and 3 are not copyright violations, even if the data is
       | copyrighted, because they fall under fair use (at least in the
       | US).
       | 
       | The Pile contains code that turns 1 into 2 and code that turns 2
       | into 3.
       | 
        | When you download Maroon 5 from a website, you are creating a
        | dataset corresponding to 2. That _can be_ a copyright
        | violation depending on what you do with it, but our use is
        | not a copyright violation."
        
         | artninja1988 wrote:
          | Hopefully that is correct. The Pile has been very valuable
          | for open model work. It's a really high-quality dataset.
        
         | layer8 wrote:
         | I don't understand how this can be true if set 2 contains a
         | complete copyrighted work (say, a book) that the copyright
         | owner hasn't approved for such distribution. Unless I
         | misunderstand and the "process[ing] for language modeling" is
         | an entirely irreversible process.
        
           | CharlesW wrote:
           | Even if the model encoding is not lossless/reversible, it's
           | probably _not_ true. A good place to start when thinking
           | about fair use is the  "four factors" that the U.S. legal
           | system will consider.
           | https://fairuse.stanford.edu/overview/fair-use/four-factors/
        
             | layer8 wrote:
             | Summary books for example are legal, so there is _some_
             | threshold of compression where things are fine.
        
               | knodi123 wrote:
                | are you referring to things like CliffsNotes?
        
               | layer8 wrote:
               | I'm referring to the "Summary of <some other book>"
               | booklets you can find on Amazon. Also services like
               | Blinkist.
        
             | kmeisthax wrote:
              | Google keeps coming up with new ways to statistically
              | infer training-set data from models. So the encoding is
              | not entirely irreversible. At the very least, models
              | that have been trained on a particular work are
              | unusually good at compressing[0] those works, relative
              | to other valid text.
             | 
             | In terms of fair use, one of the larger factors is the
             | 'market substitution' factor, which basically means "does
             | this use compete with otherwise licensed uses that people
             | would ordinarily pay for?" AI absolutely does compete with
             | human artists for the same market. In fact, it's winning
              | handily[1], _because you don't have to pay human
              | artists_.
             | AI art models absolutely shouldn't be trained on anything
             | with copyright on it.
             | 
             | The other factors don't fare much better. Nature of the
             | original work will differ based on the plaintiff, but the
             | purpose and character of the AI's use of that work is very
             | much commercial. And the amount and substantiality of the
             | use is complete and total. I don't see AI being fair use -
             | at least, not in every one of the many, many training
             | lawsuits currently ongoing against OpenAI and Stability.
             | 
             | [0] Starting with any body of text, an LLM, and an empty
             | context window, compute the next-token probabilities and
             | take the highest one. If it matches the source text, output
             | a 1 bit. If it doesn't, output 0 followed by the ID of the
             | correct next token. Add the correct token to the context
             | window and repeat until the text has been fully compressed.
              | This produces a list of perplexities (wrong words) for
              | the given text, which can be used to guide the LLM to
              | output the original work. (A rough sketch of this
              | follows after these notes.)
             | 
              | [1] Hey, remember when WotC (biggest art commissioner
              | on the planet) and Wacom (hardware vendor that sells
              | art tools and payment terminals[2]) both got caught
              | using AI art after making very loud and public pledges
              | not to do that? They both wound up buying stock
              | photography on marketplaces that are absolutely flooded
              | with AI trash.
             | 
              | [2] All the credit card readers in Japan are built by
              | Wacom, which, as an artist, I find _really funny_.
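              | 
              | A rough sketch of [0] (not the exact method Google
              | used), with the Hugging Face transformers API and gpt2
              | as a stand-in model; a greedy match becomes a 1 bit, a
              | mismatch an escape plus the true token ID:
              | 
              |   import torch
              |   from transformers import (AutoModelForCausalLM,
              |                             AutoTokenizer)
              | 
              |   tok = AutoTokenizer.from_pretrained("gpt2")
              |   lm = AutoModelForCausalLM.from_pretrained("gpt2")
              | 
              |   def compress(text):
              |       ids = tok(text, return_tensors="pt").input_ids
              |       out = []   # 1 = predicted; (0, id) = escape
              |       with torch.no_grad():
              |           for i in range(1, ids.shape[1]):
              |               logits = lm(ids[:, :i]).logits[0, -1]
              |               t = ids[0, i].item()
              |               if logits.argmax().item() == t:
              |                   out.append(1)
              |               else:
              |                   out.append((0, t))
              |       # first token is stored as-is for decompression
              |       return ids[0, 0].item(), out
              | 
              |   # a run of mostly 1s means the model has largely
              |   # memorized the passage
              |   first, stream = compress("Call me Ishmael. Some "
              |                            "years ago...")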
        
           | IshKebab wrote:
           | Yeah I agree. If 2 contains complete copyright works (e.g.
           | all of Harry Potter) then "we're just using it for AI
           | training!" stands approximately zero chance of passing the
           | fair use test. Their assertion that it does is just wishful
           | thinking.
        
             | mistrial9 wrote:
              | said with confidence - however, the judge in Silverman
              | et al explicitly rejected what you just asserted, AFAIK
        
               | whimsicalism wrote:
               | no, she didn't reject the claim that training on
               | copyrighted work is infringement; she merely ruled that
               | the outputs aren't infringing simply by bearing
               | similarity to the texts
        
               | papercrane wrote:
               | You've got it backwards. The judge in Silverman et al
               | dismissed the claims asserting that OpenAI's output is
               | copyright infringement. The claims for copyright
               | infringement in the training data are still going
               | forward; those will directly test whether it is "fair
               | use" or not.
               | 
               | From the ruling:
               | 
               | > Assuming the truth of Plaintiffs' allegations - that
               | Defendants used Plaintiffs' copyrighted works to train
               | their language models for commercial profit - the Court
               | concludes that Defendants' conduct may constitute an
               | unfair practice. Therefore, this portion of the UCL
               | claim may proceed.
               | 
               | https://caselaw.findlaw.com/court/us-dis-crt-n-d-
               | cal/1158180...
        
               | mistrial9 wrote:
               | aha - much appreciated
        
             | HenryBemis wrote:
             | I'm playing stupid now: I believe that if I ask the LLM to
             | "display Harry Potter Book 1" and it does, word for word,
             | then you're 100% right, it's copyright infringement. But if
             | I ask the LLM to "give me an analysis of Professor Severus
             | Snape's character" and it gives me one, then I don't see
             | the problem.
             | 
             | So in that sense I understand the response that "they don't
             | violate copyright" by studying the material. Again, I don't
             | pretend to be a lawyer, and not every law has to follow my
             | logic.
        
               | swatcoder wrote:
               | That's a different discussion.
               | 
               | This isn't about the output for content generators or
               | about the abstract numeric weights that they operate
               | over. That's more complex and a largely open question.
               | 
               | But this is literally about indiscriminately distributing
               | copyrighted works in a large, convenient archive while
               | arguing that it's okay because you normalized the
               | formatting a bit and because you suspect that _some_
               | people _might_ find "fair use" value in it.
        
           | michaelt wrote:
           | _> Unless I misunderstand and the "process[ing] for language
           | modeling" is an entirely irreversible process._
           | 
           | In the case of The Pile, "processing for language modelling"
           | means "converting epub and pdf into plain text, maybe
           | deduplicating, maybe removing some sorts of detectably
           | malformed files"
           | 
           | So not a particularly lossy conversion.
        
             | layer8 wrote:
             | I see, thanks. Yes, in that case, I don't see how this can
             | possibly not constitute copyright infringement.
        
               | bayindirh wrote:
               | It's generally tucked under the Fair Use doctrine
               | because "it's for the science", until it isn't (looking
               | at you, commercial AI non-profits).
               | 
               | Then it's "they're doing something amazing, they don't
               | need permission, the cat is already out of the bag", and
               | similar musings.
               | 
               | Seriously, it's both copyright infringement, and
               | unethical. This is why I don't use any of the popular AI
               | tools, or even AI add-ons in Evernote, Notion, etc. They
               | all link back to the usual suspects.
        
               | layer8 wrote:
               | I'm talking about distributing the corpus, which by
               | itself is not bound to any particular usage.
        
               | bayindirh wrote:
               | It's again copyright infringement. If I share a
               | copyrighted ebook by accident, any and every cloud
               | provider will ban my account with no warning or
               | recourse.
               | 
               | Open science repositories would take down the "dataset"
               | immediately (or at least limit access to it) if a
               | copyright holder brought the matter to the admins'
               | attention.
        
               | CobrastanJorji wrote:
               | > "They're doing something amazing, they don't need
               | permission, and the cat is already out of the bag"
               | 
               | Ah, the Uber theory of law. Works surprisingly well for
               | some reason.
        
               | bayindirh wrote:
               | Probably due to Murphy's Golden Law of Golden Laws:
               | whoever has the gold makes the laws.
        
               | Grimblewald wrote:
               | The question then becomes: do these concerns remain even
               | for AI that cannot reproduce original works? And what
               | does that mean for us? When we read things, or interact
               | with any information for that matter, it changes us and
               | how we do things. If you consume art, it will forever
               | influence the art you produce yourself. Are these
               | copyright infringements also?
               | 
               | I can see the problem where direct and faithful
               | replication is possible, but where it isn't, is there
               | still a problem? Or is it the automatable aspect, the
               | scale at which it can occur, that is the problem?
        
               | bayindirh wrote:
               | The difference is what you mix, and the amounts of
               | things you mix. As a human you mix many more inputs,
               | plus your emotions, plus everything else you consume, to
               | create something. Moreover, what you can consume, and
               | how perfectly, is bounded by innate human limits.
               | 
               | An AI system consumes something perfectly, ingrains it
               | into its weights perfectly, and becomes capable of
               | imitating the same thing perfectly. Plus, there are no
               | other internal or external factors affecting these
               | "generations" over time. Hence, it mixes and reproduces
               | based solely on what it consumed.
               | 
               | I might get inspired by people, add my own values,
               | iterate, and ultimately diverge from what inspired me to
               | create my own style. AI doesn't work like that. Also, if
               | I "took inspiration" with the same precision and
               | accuracy as a model does, I'd be neck deep in
               | accusations and lawsuits (for the right reasons).
               | 
               | As a result, just because we fail to ask the right
               | questions to reproduce the training data verbatim or
               | almost verbatim doesn't mean that the information is not
               | there. In the end, a neural network is a compression
               | algorithm which encodes data in its weights. Given the
               | correct input, you can regenerate the training data as
               | is.
               | 
               | Unless you have special abilities, you can't read 500
               | books an hour, remember them perfectly, and generate
               | derivative works by mashing them all together. If I did
               | that and tried to sell a novel, I'd be ridiculed no end.
               | If I wrote a Ph.D. thesis the same way and tried to
               | defend it, I'd be banned from academia for three
               | lifetimes at least.
               | 
               | For more elaboration on the subject, see [0].
               | 
               | [0]: https://news.ycombinator.com/item?id=39188463
        
               | miki123211 wrote:
               | The myth that AI models store all their training data in
               | their weights verbatim is as widespread as it is false.
               | In fact, if this were the case, deep neural networks
               | would be considered far better compression algorithms
               | than anything we have on the market right now, by
               | literal orders of magnitude.
               | 
               | If you divide Stable Diffusion's file size by the number
               | of images used to train it, you get something like 1.2
               | bits per image, and it is physically impossible to get
               | this kind of a compression ratio.
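               | 
               | A rough back-of-envelope, with illustrative numbers
               | rather than exact figures for any particular model:
               | 
               |     weights_bytes = 2e9  # assume ~2 GB of weights
               |     train_images = 2e9   # assume ~2B training images
               |     print(weights_bytes * 8 / train_images)
               |     # ~8.0 bits per image -- orders of magnitude too
               |     # few to store the images verbatim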
               | 
               | The actual problem with AI is that it _sometimes_
               | plagiarizes random _fragments_ of the work it is trained
               | on, even when that is not the user's intent, and we
               | currently don't really know how to fully prevent this.
        
               | bayindirh wrote:
               | It still doesn't change the fact that the inclusion of
               | commercial works is copyright infringement though.
               | 
               | Same for code-generating models trained on Open Source
               | and Free Software: tons of licenses violated, from
               | strong copyleft to source-available, with code
               | reproduced (almost) verbatim, comments intact.
               | 
               | Some researchers' codebases are almost completely
               | reproducible, stripped of any licensing information,
               | just by prompting with the function names.
               | 
               | Maybe for images, near-verbatim reproduction is
               | borderline impossible for now due to network size, but
               | for text and code, regenerating training data almost
               | verbatim is very possible and straightforward.
               | 
               | Also, in image generation models, style transfer is the
               | bigger problem, because it completely eliminates the
               | artist who created or uses the style in the first
               | place: "You pioneered this, we fine-tuned this model
               | with your images, and now we can do your work for free,
               | without you. Have a nice day." However, the artist's
               | living expenses don't disappear when their style is
               | transferred to an image generation model.
               | 
               | This is also unethical.
        
               | __loam wrote:
               | The "humans do it too" argument is totally irrelevant to
               | this because humans have special and specific privileges
               | under the law that computers don't. The problem is that a
               | lot of data was copied into training sets and used for
               | commercial purposes without permission.
        
               | ribosometronome wrote:
               | Whether or not I'm influenced by media isn't super
               | relevant to whether or not I pirated that media. Before
               | even arriving at the question of whether or not the
               | resulting models are infringing, it's clear the training
               | data is.
        
         | chinathrow wrote:
         | Nicely stated copyright violations. Has no one filed suit yet?
        
           | SEGyges wrote:
           | Huckabee v Bloomberg, Meta, et al
        
         | whimsicalism wrote:
         | scraping libgen and downloading copyrighted content and
         | redistributing it isn't illegal?
         | 
         | call me skeptical. seeding a torrent of movies that you
         | downloaded from elsewhere on the internet isn't "fair use",
         | and the pile isn't just code for transforming data; it is the
         | redistributed data itself
         | 
         | by this logic I could legally run a libgen mirror
        
         | nickpsecurity wrote:
         | They're distributing copyrighted works without the authors'
         | permission, using them in ways that compete with the authors,
         | many make money off AIs, and the AIs reproduce some works
         | verbatim. These datasets seem to fail most of the "four
         | factors" tests in copyright law. Even laypeople I've explained
         | LLMs to think the AI companies are ripping off others' work.
         | 
         | For those concerned, I have an article that covers legalities,
         | each dataset (including The Pile), legal issues with them,
         | alternatives that are legal, and a copyright amendment that
         | balances all sides.
         | 
         | http://gethisword.com/tech/exploringai/
         | 
         | Looking back at my proposal, I think we need at least four
         | rules passed immediately in at least one country:
         | 
         | 1. All copyrighted works can, if a person has legal access, be
         | used for training AI systems. Any terms restricting copyrighted
         | works from use in training, charging more for that, restricting
         | downloads for it, etc. are illegal. Every act of publishing can
         | benefit both a human mind and AI training equally.
         | 
         | 2. People can copy and transform for their own use any work
         | they have access to _only for AI training._ This might include
         | reverse engineering for extraction, multiple copies in
         | different formats, and so on. They can do whatever is needed to
         | get it into the AI system. Other uses or abuse of this data is
         | subject to existing law.
         | 
         | 3. Any work published online for free and with public access
         | can be copied, _shared_ , processed, and bundled for AI
         | training. That's regardless of its terms.
         | 
         | Note: In No. 2 and No. 3, the resulting AI's copyright will be
         | determined by existing law about AIs and mixing copyrighted
         | works. Or no copyright, if that's the law.
         | 
         | 4. If AI outputs are copyrighted, their status will be the same
         | as if the user published the content themselves while relying
         | on prior works. AI training sets will also be made public to
         | determine this.
         | 
         | With those rules, we can share works like those in The Pile,
         | still pay creators who want to be paid, be less likely to
         | outright steal existing work, and infringement in outputs
         | remains illegal. What do you all think of that?
        
         | dougb5 wrote:
         | I don't know what the right answer is to the copyright
         | questions, but I hope that in 2024 we'll have a better
         | attitude toward the human labor that went into these models
         | than "data exists out there in the world" and the passive-
         | voice "it has been collected into datasets".
        
           | clooper wrote:
           | Have you heard of data unions?
        
         | otterley wrote:
         | > 2 and 3 are not copyright violations, even if the data is
         | copyrighted, because they fall under fair use (at least in the
         | US).
         | 
         | This cannot be known until it is litigated. Fair Use is not
         | something you can unilaterally declare and have it be so, just
         | like you can't be like Michael Scott in the Office shouting "I
         | declare bankruptcy!" OpenAI is currently defending itself
         | against the New York Times for this very reason.
         | 
         | There's a multi-factor test that courts weigh the facts against
         | in making a determination as to whether a _prima facie_
         | copyright violation would be protected under a Fair Use
         | defense:
         | 
         | Factor 1: The Purpose and Character of the Use
         | 
         | Factor 2: The Nature of the Copyrighted Work
         | 
         | Factor 3: The Amount or Substantiality of the Portion Used
         | 
         | Factor 4: The Effect of the Use on the Potential Market for or
         | Value of the Work
         | 
         | See https://copyright.columbia.edu/basics/fair-use.html for a
         | pretty good overview of what the analysis entails.
        
           | ryukoposting wrote:
           | Thanks. This is really informative, and really important
           | information given the growing relevance of IP law in
           | everyone's daily life. Part of me wonders if these four
           | factors will ever become part of core curriculum for civics
           | classes.
           | 
           | By no means am I an expert in copyright law, but factor 3
           | seems like very bad news if you're OpenAI.
        
           | fluoridation wrote:
           | Whether something is fair use or not is not determined by a
           | court, but by the definition of what fair use is. A court
           | interprets that definition and the situation, and if their
           | interpretation matches yours you may have a ruling in your
           | favor. But saying "this is fair use" is no more incorrect
           | than saying "this is red". You're interpreting your
           | perception and putting that interpretation into words.
        
             | otterley wrote:
             | > But saying "this is fair use" is no more incorrect than
             | saying "this is red".
             | 
             | When a court determines that it isn't, you can continue to
             | argue it as much as you like (to deaf ears), and yet you're
             | still liable to the copyright holder. Whether it's
             | "incorrect" or not is then irrelevant. Let's not argue
             | semantics here.
        
           | wiremine wrote:
           | > This cannot be known until it is litigated. Fair Use is not
           | something you can unilaterally declare and have it be so.
           | 
           | Correct, but what isn't clear here is their rationale for why
           | they think they're covered by fair use. Does anybody have
           | that information?
           | 
           | I'm not saying their interpretation is correct, but it seems
           | germane to this discussion. The parent comment seems to
           | assume none of this has been litigated yet, which might also
           | be true. Or not.
        
         | 3abiton wrote:
         | Interesting take on the copyright law.
        
         | tycho-newman wrote:
         | Fair use is a defense to infringement. Do not start your
         | copyright argument by admitting you infringed.
        
       | mjtechguy wrote:
       | Would be interested to see what is in there. Luckily no one has
       | posted the magnet link on Twitter.
        
         | SEGyges wrote:
         | The counterparties on related legal action are sufficiently
         | litigious that it is probably smarter to DM the magnet link.
        
       | fddrdplktrew wrote:
       | 825 GiB seems really small
        
       | _obviously wrote:
       | Seems kind of small tbqh.
        
         | beiller wrote:
         | It seems small, until you try to download it.
        
       | jMyles wrote:
       | So much of this thread is concerned not with the achievement of
       | this data set, but with the (by comparison) silly and outdated
       | spat over how to frame it as "property" for the purposes of
       | government intervention (pursuant to which jurisdiction?).
       | 
       | The era of intellectual "property" is over. Let's be at peace
       | with that and just move on into the next age.
        
       | __lbracket__ wrote:
       | LLMs are of use to megacorps.
       | 
       | Megacorps assume authors, painters, etc. are poor and powerless
       | (which, let's face it, they are).
       | 
       | We can b** and moan on HN, but megacorps will find ways to use
       | copyrighted works for free.
        
       | quatrefoil wrote:
       | While a lot of attention has been given to books3, another large
       | component of this dataset is the deceptively-named
       | "OpenWebText2". What's that? It's a scrape of 15 years' worth of
       | third-party websites that were linked to from upvoted Reddit
       | submissions. I know this includes some of my writing.
        
         | 7moritz7 wrote:
         | Care to give me your domain name so I can check all major
         | LLMs for plagiarism? I have a feeling none of them can
         | produce a sentence from your writings.
        
           | quatrefoil wrote:
           | It takes deliberate effort, but I was actually able to get
           | pieces of my writing out of one of the leading LLMs (not
           | ChatGPT). This is not particularly unique; a number of folks
           | have demonstrated the same.
        
         | observationist wrote:
         | Relevance and impact aside, if you publish something to the
         | internet on a site with no access restriction in place, I don't
         | know how you can keep a straight face while claiming some sort
         | of moral right to the content. It's the equivalent of
         | broadcasting it over radio, or printing and delivering it
         | straight to the doorsteps of millions of random individuals.
         | Methinks you doth protest too much, or something.
         | 
         | There are ways of copyrighting data and establishing ownership
         | of intellectual property. Your tumblr fanfic, youtube comments,
         | or HN discussions are not legitimate copyright avenues. Stuff
         | you post to legally scrapeable websites is fair game for fair
         | use.
         | 
         | I can do anything I want in private with any data I collect. I
         | could create an awesome HN LLM on the scraped datasets, and use
         | it privately to my heart's content. I can even set up an API to
         | that LLM that generates content, and, given recent rulings,
         | even if I had all the written copyrighted data in the world, as
         | long as I was making good faith efforts to ensure copyright was
         | being respected and works weren't being recreated verbatim,
         | I could even use that model commercially. I just couldn't
         | sell it to other people, or distribute it, without entering a
         | different legal regime.
         | 
         | I can collect any data I want from public facing websites.
         | 
         | That's how the internet works; it's how it was designed. There
         | are authentication mechanisms, network configurations, and a
         | myriad other access control schemes you can implement to
         | prevent public access. If you post to sites without those
         | mechanisms, you're tacitly agreeing to give up any plausible
         | claims of protection against a wide array of fair uses well
         | established by precedent cases at this point. If you don't
         | prevent public access, and you've got a domain name on a
         | server, you're tacitly inviting the world to come download
         | whatever it is you have on your server. This is a social good.
         | This is what we want when we participate in the internet.
         | 
         | Insisting on some vague entitlement over how "your" data gets
         | used bypasses the fact that anything you consider misused in
         | OpenWebText2 stems from your having posted the content to a
         | publicly visible website, giving up any say in what happens
         | thereafter. It was scraped fair and square.
         | 
         | Don't complain that you didn't know the rules, or that life
         | isn't fair.
         | 
         | It's not even clear that terms of service or those little
         | popups on public websites have any legal relevance. If your
         | website is open to the public, then it's fair game. If you post
         | content to a public website, then that content's fair game.
        
       | intalentive wrote:
       | This is 4 years old. Why is it at the top of HN now?
        
       | willvarfar wrote:
       | Are there any simple text editors, or WYSIWYG ones, with local
       | LLMs that can tidy up and auto-suggest whole paragraphs of slick
       | verbiage as you type?
        
       | clooper wrote:
       | The big Hollywood studios pay a lot of money to various cyber
       | security companies to look for pirated content and send cease-
       | and-desist letters to hosting companies for letting their users
       | distribute copyrighted content.
       | 
       | If authors and artists were to join a data union, they could do
       | the same thing as the studios. If copyright law has any real
       | teeth, the data union can send legal requests to whoever is
       | hosting the content, asking for it to be taken down.
       | 
       | I'm not a lawyer, but I know the studios definitely do this.
        
       | kristianp wrote:
       | Please add (2020) to the title.
        
       ___________________________________________________________________
       (page generated 2024-03-07 23:02 UTC)