[HN Gopher] The Pile is a 825 GiB diverse, open-source language ...
___________________________________________________________________
The Pile is an 825 GiB diverse, open-source language modelling data set
Author : bilsbie
Score : 308 points
Date : 2024-03-07 17:14 UTC (5 hours ago)
(HTM) web link (pile.eleuther.ai)
(TXT) w3m dump (pile.eleuther.ai)
| turnsout wrote:
| The Pile is pretty old--is this an updated version?
| bt1a wrote:
| It is not.
|
| In related news, v2 of the "stack" dataset was recently
| released
|
| > 3.28B unique files belonging to 104.2M github repositories
| were collected by traversing the Software Heritage 2023-09-06
| graph dataset. Additional repository-level metadata was
| collected from GitHub Archive data up to 2023-09-14. The total
| uncompressed size of all files is 67.53TB. Near-deduplication
| was implemented in the pre-processing pipeline on top of exact
| deduplication.
|
| V1 vs V2, deduplicated size and token count:
|
| V1: 2.9 TB, 200B tokens
|
| V2: 32.1 TB, 900B tokens
|
| I imagine we'll see some fairly powerful open coding models
| soon. The ones I'm looking at testing are:
|
| dolphincoder-starcoder2-15b-iMat.GGUF
|
| CodeFuse-DeepSeek-33B-iMat.GGUF
|
| OpenCodeInterpreter-DS-33B-iMat.GGUF
|
| starcoder2-15b-instruct-iMat.GGUF
|
| more info
|
| dataset https://huggingface.co/datasets/bigcode/the-stack-v2
|
| gguf quants https://huggingface.co/dranger003
| bick_nyers wrote:
| Do you happen to know what the v2 dedup size is when
| compressed? 32.1TB is quite a bit, but if that compresses
| down to say 3-6TB, it would be much more manageable. Code has
| a lot of whitespace, repetition, and
| structure/predictability, so I imagine it would compress
| better than average text.
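The compressibility intuition can be checked with a quick sketch (illustrative, self-contained samples only; real ratios depend entirely on the corpus):

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size / raw size; lower means more compressible."""
    return len(zlib.compress(data, level=9)) / len(data)

# Code-like input: heavy repetition of keywords, whitespace, and structure.
code_sample = b"\n".join(
    b"def handler_%d(request):\n    if request is None:\n"
    b"        return None\n    return request.value" % i
    for i in range(200)
)

# Prose-like stand-in: words drawn at random, so little structure repeats.
random.seed(0)
words = "the of and to in a is that it was for on are as with at by an".split()
prose_sample = " ".join(random.choice(words) for _ in range(2000)).encode()

print(compression_ratio(code_sample))   # far below 1.0
print(compression_ratio(prose_sample))  # noticeably higher
```

On samples like these, the repetitive code-shaped input compresses several times better than the low-redundancy prose, consistent with the guess that a code corpus shrinks well below its raw size.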
| spindump8930 wrote:
| Those sizes refer to the data before processing and
| filtering. The actual training size was about 3 TB:
| The Stack v2 is ten times larger than its predecessor,
| yielding a raw dataset of 67.5 TB. Through extensive
| cleaning, filtering, and subsampling of the source code,
| along with the incorporation of other high-quality code-
| related datasets, we created a training set of
| approximately 3TB (900B+ tokens).
|
| Source: the paper, Section 10
| (https://arxiv.org/pdf/2402.19173.pdf)
| zellyn wrote:
| Is the "books3" dataset mentioned in the Pile paper the one that
| authors are suing over? The one that includes a whole bunch of
| popular and copyrighted material?
| Balladeer wrote:
| I believe it is. See https://www.wired.com/story/battle-over-
| books3/
| mistrial9 wrote:
| this list [0] seems like a starting place to look into various
| legal actions... not sure how often it is updated, e.g. Silverman
| et al
|
| [0] https://originality.ai/blog/openai-chatgpt-lawsuit-list
| bt1a wrote:
| Pouring one out for the future litigators, jurors, and judges
| who will have to pore over this inextricable web of legal and
| technical details
| PeterStuer wrote:
| They'll just let their ai do it over lunch.
| taylorfinley wrote:
| Yes, from the linked paper:
|
| "Books3 is a dataset of books derived from a copy of the
| contents of the Bibliotik private tracker made available by
| Shawn Presser (Presser, 2020). Bibliotik consists of a mix of
| fiction and nonfiction books and is almost an order of
| magnitude larger than our next largest book dataset
| (BookCorpus2). We included Bibliotik because books are
| invaluable for long-range context modeling research and
| coherent storytelling"
| pimlottc wrote:
| This is the most ridiculous legal hand wave I've ever seen.
|
| "They're not books, man, they're a dataset!"
| DiggyJohnson wrote:
| Do they claim that none of their data came from copyrighted
| sources / is copyrighted?
| numpad0 wrote:
| Why does everyone assume "open source" implies legality?
|
| (/s)
| jsheard wrote:
| "Open source" implies that, no? A definition of open source
| which includes blatantly pirated material on the condition
| that the people who collated and released the pirated
| material did so for free is really stretching it past
| breaking point. By that standard everything on The Pirate Bay
| is open source.
| seanhunter wrote:
| The claim (which I don't personally agree with, but I'm
| trying to represent here in good faith) is that although the
| data is copyright, training models constitutes "fair use"
| under US copyright law and therefore you're entitled to use
| copyright material for this.
|
| Fair to say that whether or not this is correct is pretty
| important to all the outstanding court cases on this matter.
| retrac wrote:
| I think there is actually a good argument that an AI model
| is transformative, and that training a model is therefore
| not infringing of the copyright. (An analogy: if you rolled
| dice to select words randomly from the Lord of the Rings
| and rearranged them into a poem, it's not infringing the
| Lord of the Rings even if in a sense, every word was taken
| from that book.)
|
| But you still have to get your hands on the copyrighted
| data legally. It might be legal to scan every book an
| institution owns, and train off it, so long as those scans
| are not distributed. But it is probably not legal to scrape
| copyrighted content off torrents - creating the copy to
| train with is infringing, even if the model's final product
| maybe isn't.
| seanhunter wrote:
| Yes agreed, and transformative use itself also has
| limitations. You don't have carte blanche to use
| something just because you think it's transformative, for
| example the Lynn Goldsmith vs Andy Warhol Foundation case
| over the "Orange Prince" work.
| https://copyrightalliance.org/warhol-decision-reins-
| transfor...
| fsckboy wrote:
| while there is a good argument that AI produces
| transformative outputs, it's refuted when the models are
| shown to regurgitate literal text, which they have. Then
| it just starts to look like a neural memorization agent,
| compressed storage algorithm, etc.
| bee_rider wrote:
| Definitely open to the idea, but that couldn't be the whole
| argument. I mean, my brain can output some quotes, but
| I'm not a compressed storage algorithm. Or at least I
| hope I'm not.
| zettabomb wrote:
| I've seen examples of this, but they're nearly always
| isolated, rather difficult to obtain, and not in fact
| exact copies. You need to specifically ask for an exact
| copy, and then attempt to defeat the safeguards the model
| has in place to prevent this, and hope that it was
| "memorized" - which for the record is considered to be a
| _flaw_ in the model as it's a reduction in information
| density and capability, compared to if that "memory" was
| used for something else. Good models seek to reduce this
| as much as possible. With the size of the datasets
| involved (see OP), this feels more like an understandable
| and reasonable issue to have.
| 7moritz7 wrote:
| This very rarely happens, usually when trying hard to get
| it to regurgitate, and I don't think it has ever happened
| for anything longer than 2 paragraphs, or at most a short
| article. Certainly not something like a book or even the
| whole issue of a newspaper.
| jdiff wrote:
| That seems to fall apart quickly. Even if training could be
| considered fair use, surely just distributing the raw
| masses of copyrighted works can't be under any reasonable
| definition. Otherwise, why did TBP, KAT, and MegaUpload
| shut down if you could defeat copyright with sheer numbers?
| seanhunter wrote:
| Indeed. Also in the US, whether or not something is fair
| use involves a four factor test[1] and two of the factors
| are the amount and substantiality of what's taken and the
| effect on any market. In this case, the amount is
| "everything" and the effect on the market is potentially
| very large for authors/publishers.
|
| [1] https://fairuse.stanford.edu/overview/fair-use/four-
| factors/
| fsckboy wrote:
| > _two of the factors are the amount and substantiality
| of what's taken and the effect on any market_
|
| books.google.com has been allowed to copy all the books
| they can lay their hands on, so long as they don't
| regurgitate them in full, so it's not really the taking,
| but any subsequent reproductions. And the effect on the
| market is insubstantial if the alternative wasn't going
| to be the equivalent sales.
| ascorbic wrote:
| You can download the whole dataset, so they're certainly
| able to regurgitate them in full.
| justinclift wrote:
| Since when has TBP shut down?
| RecycledEle wrote:
| Some of the founders were convicted of crimes but the
| database and code are out there.
| gosub100 wrote:
| I think they are referring to the many times the domain
| name has been seized, and shut down temporarily.
| PeterisP wrote:
| One thing that we did with distributing certain
| copyright-protected textual material was to scramble it
| at the paragraph level.
|
| If you take every paragraph in the Harry Potter saga and
| sort the paragraphs in alphabetical order, it's just as
| good for training short-context-window models, but not a
| "harm to the market" leading to a lost sale for anyone
| who wants to read the books.
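The paragraph-scrambling idea can be sketched in a few lines (a minimal illustration; splitting on blank lines is an assumption about the document format):

```python
def scramble_paragraphs(text: str) -> str:
    """Alphabetically sort a document's paragraphs, destroying narrative
    order while keeping each paragraph (the training unit for a
    short-context-window model) fully intact."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(sorted(paragraphs))

doc = "Once upon a time.\n\nA wizard arrived.\n\nZebras fled the scene."
print(scramble_paragraphs(doc))
# A wizard arrived.
#
# Once upon a time.
#
# Zebras fled the scene.
```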
| YeGoblynQueenne wrote:
| Megaupload et al. went against the entertainment industry
| in a time when that industry had the money to pay the
| lawyers to convince the judges what the law means.
|
| In the present moment on the other hand, it is the
| entities in the AI industry (e.g. MS) that have the money
| and can hire the lawyers to convince the judges.
| Realistically speaking, it's very likely that things will
| swing the way of AI companies, which will benefit, albeit
| indirectly, these guys, even though by themselves they're
| too small to push their agenda, they're just bit players.
| qwertox wrote:
| There's odd stuff in there. I just randomly downloaded a
| file,
|
| https://the-eye.eu/public/Books/ThoseBooks/Puzzles.tar --
| 20-Jan-2023 14:54 -- 6M
|
| and it pretends to be a jigsaw puzzle, but is actually eISBN
| 9781594868573 - The South Beach diet cookbook / Arthur
| Agatston
| racee wrote:
| i love the Illuminati vibes of _The Eye_
| brokensegue wrote:
| "open source" as in gratis but not as in libre?
| Legend2440 wrote:
| more like, it's a scrape of the entire internet, use it at your
| own risk
| fddrdplktrew wrote:
| wow the internet is really small in your mind... or it's too
| big in my mind but I doubt it.
| o11c wrote:
| It should be understood in contrast to most traditional
| corpora, which are heavily paywalled/restricted ... or else
| based solely on century-old books. It has long been a major
| obstacle for linguistics tooling.
|
| If the current push of AI companies to get their way (to allow
| copyright laundering) succeeds, this would almost count as open
| source by the real definition.
|
| If not ... lots of people/companies are committing copyright
| crimes, some are committing civil infractions, and some may be
| able to claim fair use.
| jwitthuhn wrote:
| Is this still available somewhere? I attempted to download it
| several months ago and saw the download link 404ing, seems it is
| still like that.
| TrueDuality wrote:
| Most of the distribution for this is via torrents/magnet links
| and in person hard drive exchanges. I'd go look at some public
| trackers if you want a copy and don't know someone that already
| has it.
|
| Do be aware that it does include copyrighted content so
| distribution is piracy.
| Der_Einzige wrote:
| Almost all LLM training datasets include copyrighted content
| so almost all open source LLM distribution is piracy and
| almost all API based LLMs, including ChatGPT, are also piracy
| and copyright laundering.
|
| Also, most image-text pair datasets contain far worse than
| that. You might want to check out LAION-5B and what Stanford
| researchers have found in there. Technically, anyone who even
| touched that could in theory be in some serious, serious
| trouble. I find it quite remarkable that nothing has happened
| yet.
| littlestymaar wrote:
| It's only piracy if it's a private individual doing it,
| otherwise it's just "ask for forgiveness, not for
| permission"-type Capitalism.
| gosub100 wrote:
| It'll be some epic lawsuit like google-v-samsung that
| will get drawn out for a decade, awarded, and reduced,
| appealed, etc. where the only winners will be both
| party's lawyers.
| littlestymaar wrote:
| It's gonna be way worse than this:
|
| - OpenAI and others will just settle with MPAA, RIAA and
| the likes for a revenue stream (a single digit billion a
| year, likely) + some kind of control over what people can
| and cannot do with the AI + the access to the technology
| to produce their own content.
|
| - artists will see peanuts from the deal, and the big
| names are going to be able to stop doing any kind of
| business with artists which are just expenses in their
| eyes. They will have been replaced by machines that were
| trained using their art with no compensation whatsoever.
|
| IP is already predatory capitalism, AI will definitely be
| weaponized against the workers by the owners of the means
| of "production".
| beeboobaa wrote:
| Turns out you can ignore copyright law if your company has
| enough money.
| vineyardmike wrote:
| The courts (in the US) have not found LLM model weights to
| be piracy, nor the outputs, but it's really surprising that
| LAION was used for so long considering the content you allude
| to.
| Filligree wrote:
| LAION is essentially a list of every image on the public
| internet. It was filtered, of course, but do you really
| expect perfection?
|
| It's impossible to create such a list while evading all
| such material.
| vineyardmike wrote:
| There exist databases of "the hashes of problematic
| photos" (CSAM), so it seems trivial to search your
| billions of photos against them before training an AI.
| You can't catch everything, but this seems like an
| obvious miss _considering they explicitly tried to scrape
| pornography_.
|
| Hashes like these are exactly how researchers later
| discovered this content, so it's clearly not hard.
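A hash-based blocklist filter of the kind described is straightforward to sketch. This is a simplification: real systems such as PhotoDNA use perceptual hashes that also match re-encoded copies, while the exact SHA-256 matching shown here only catches byte-identical files, and the blocklist entry is hypothetical:

```python
import hashlib

def digest(data: bytes) -> str:
    """Exact content hash of a file's bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical blocklist of known-bad content hashes.
blocklist = {digest(b"known-bad-image-bytes")}

def keep(data: bytes) -> bool:
    """True if the file's hash is not on the blocklist."""
    return digest(data) not in blocklist

candidates = [b"known-bad-image-bytes", b"harmless-image-bytes"]
filtered = [f for f in candidates if keep(f)]
print(len(filtered))  # 1
```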
| duskwuff wrote:
| The Stanford researchers also found a substantial number
| of CSAM images in the LAION-5B dataset which were not
| recognized by PhotoDNA, probably because the images in
| question were not in wide distribution prior to their
| inclusion in LAION.
|
| Full paper: https://stacks.stanford.edu/file/druid:kh752s
| m9123/ml_traini...
| SEGyges wrote:
| You are uploading 5 billion examples of <something>. You
| cannot filter it manually, of course, because there are
| five billion of them. Given that it is the year 2024, how
| hard is it to be positive that a well-resourced team at
| Stanford in 2029 will not have better methods of
| identifying and filtering your data, or a better
| reference dataset to filter it against, than you do
| presently?
|
| It is a pretty hard problem.
| vineyardmike wrote:
| You don't have to do it manually. There is a database of
| file hashes.
|
| And this isn't just "one engineer". Companies like
| StabilityAI, Google, etc have used LAION datasets. If you
| _built_ a dataset you should expend some resources on
| automated filtering. Don't include explicit imagery as an
| intentional choice if you can't do basic filtering.
| visarga wrote:
| > almost all open source LLM distribution is piracy and
| almost all API based LLMs, including ChatGPT, are also
| piracy and copyright laundering
|
| That's an amplification of copyright, original expression
| is protected, but not the ideas themselves, those are free.
| And don't forget when we actually get to use these models
| we feed them questions, data, we give corrections - so they
| are not simply replicating the training set, they learn and
| do new things with new inputs.
|
| In fact if you think deeply about it, it is silly to accuse
| AI of copyright violation. Copying the actual book or
| article is much much faster and cheaper, and exact. Why
| would I pay a LLM provider to generate it for me from the
| title and starting phrase? If I already have part of the
| article, do I still need to generate it with AI? it's
| silly. LLM regurgitations are basically attacks with a
| special key, entrapments. They don't happen in normal use.
| doctorpangloss wrote:
| > I find it quite remarkable that nothing has happened yet.
|
| While I don't think it's because you're wrong, per se, it's
| just that none of this drama really matters.
| Workaccount2 wrote:
| Models are not information archives. The size of the final
| model is orders of magnitude smaller than the size of the
| training data.
|
| Somehow people are just not able to get this through their
| heads. Stable diffusion is like 12GB or something and you
| have people convinced it's a tool that is cutting and
| pasting copyrighted works from an enormous image archive.
| 7moritz7 wrote:
| Stable Diffusion 1.5 is 1.5 to 6 GB depending on the
| finetune and trained on like 5 billion images
| feoren wrote:
| > The size of the final model is orders of magnitude
| smaller than the size of the training data.
|
| Good to know I can avoid copyright on a book just by
| zipping it up!
| archon1410 wrote:
| > The Pile is old news, check out more recent datasets like;
| https://huggingface.co/datasets/bigcode/the-stack-v2
|
| -- https://the-eye.eu/public/AI/pile/readme.txt
| natch wrote:
| Super odd message since the stack v2 seems to be exclusively
| code and The Pile is (mostly?) text.
| HanClinto wrote:
| Is it kosher to post magnet links here? I'm not sure.
|
| magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667&dn=EleutherAI_ThePile_v1
| SEGyges wrote:
| This is the correct one.
| spindump8930 wrote:
| Also good to note that the Pile contains lots of curated
| sources and recent trends have been to take curated data
| sources and combine them with filtered webcrawls (i.e.
| commoncrawl with heavy processing). See dolma or the stack v2
| (for code models) as others have mentioned.
| DiggyJohnson wrote:
| Awesome name. Reminds me of the "original" "Pile" from the
| Manhattan Project.
|
| I read about it in "The Making of the Atomic Bomb" (1986), but
| presumably it's featured in the recent movie.
| groby_b wrote:
| Not really. There's an ultra-brief scene where it's mentioned,
| but that's it, IIRC.
|
| The movie... is a bunch of anecdotes strung together to make a
| ham-handed point at the end. It was a decent movie if you treat
| it as a fictional story instead of an actual retelling.
|
| I'd stick with the book. (And if you specifically care about
| Fermi, I recommend "The Last Man Who Knew Everything" by David
| Schwartz)
| Der_Einzige wrote:
| I came so close to getting my Debate document dataset
| "DebateSum"[1] included in this[2], and I am very sad that it
| wasn't included to this day:
|
| [1] https://github.com/Hellisotherpeople/DebateSum [2]
| https://github.com/EleutherAI/the-pile/issues/56
| joshuakogut wrote:
| > If you'd like to contribute it, feel free to submit a PR
|
| Stella was waiting for you to submit your dataset. Did you? She
| closed the ticket many months later.
| Der_Einzige wrote:
| They did a significant amount of work themselves, taking
| other people's datasets and including them without the
| original author needing to submit the full PR to do it. I
| was then and to this day remain extremely busy.
|
| Also this was before most datasets were hosted conveniently
| on huggingface.
|
| It's all tears in the rain now.
| joering2 wrote:
| The pile can be downloaded here.
|
| 404 Not Found nginx
|
| 825 GB is a great candidate for torrent use, whatever was under
| that broken link better be a torrent magnet.
| swatcoder wrote:
| Where do I find the license reproductions and
| credits/attributions for the content being distributed in this
| data set? Is it all in there? Are all inclusions compliant? Can I
| know?
|
| I'm open to the argument that generators built with models that
| _consumed_ copyrighted data may evade copyright obligations on
| their output, but surely the data sets themselves are bound by
| any copyright on their content?
| __loam wrote:
| They stole it because they think building their toys is more
| important than everyone else's rights to the product of their
| own labor.
| johndough wrote:
| I doubt that anyone is going to download and search through
| over 800 GB just to find a badly formatted copy of some book
| that could be found much quicker on different websites with
| better formatting. Authors are losing fractional cents here
| at most.
| gosub100 wrote:
| so just like Office Space? (paraphrasing) "We steal a
| fraction of a cent from each transaction, who do we hurt?
| Nobody. We just put the remainder into our account!"
|
| Sorry that's not how damages are calculated in the US tort
| system.
| johndough wrote:
| I do not know how damages are calculated in the US tort
| system. What do they say about the books3 dataset?
|
| I also think that the case is different here, since in
| your example, there is a specific amount of money being
| stolen, while in the books3 case, there is an unspecified
| amount of money not being made by the authors.
| SEGyges wrote:
| I am pretty sure if the authors were trying to license
| their works for this purpose we would just not use them
| at all; it is difficult to see under what circumstances
| they would stand to profit from this other than by suing
| people after the fact over it.
| doug_durham wrote:
| I think you could argue that authors could profit from
| their works being cited in an LLM response. It could
| drive sales of their works much like citations do on the
| web. The counter argument is that an LLM could give you
| the CliffsNotes version of the work, thus taking away a
| portion of sales.
| SEGyges wrote:
| In a world where the options were to
|
| 1) pay the author,
|
| 2) implement guaranteed citation of the author any time
| the model gave an answer that was directly derivative,
| with an option to not do so if the summary was
| sufficiently vague, or
|
| 3) ignore the author's book completely as training data
|
| we would all choose 3).
| __loam wrote:
| And the authors would probably be very happy that you
| did.
| __loam wrote:
| The penalty is up to $150k per violation.
| pk-protect-ai wrote:
| they have stolen nothing, and they make no profit from it either.
| idle_zealot wrote:
| Oh, so there _aren't_ AI companies charging for access to
| private models?
| pk-protect-ai wrote:
| Who are they? Why do you mix up the guys who prepared the
| data with the other guys who used this data and are making
| money from a vague memory of that data?
| idle_zealot wrote:
| I congratulate all of the authors whose work is included in
| this dataset on contributing their knowledge, skills, and
| perspective to humanity's various endeavors, both creative
| and technical. I hope that the fruits of their labors are
| returned to them, rather than being selfishly hoarded by the
| few with the resources necessary to produce those fruits, be
| they publishers, middlemen, or big tech.
|
| Which is all to say that information shouldn't be hoarded and
| guarded. If it can produce something more than the sum of its
| parts we should use it to do so. The result of that should,
| on the same grounds, not be hoarded and guarded, doubly so
| being based on the work of others.
| gosub100 wrote:
| It will produce "something more" for the already-wealthy
| who control the technology. For instance, LLMs will
| eliminate the need for some customer service jobs,
| increasing the profit margin for the existing executives
| and shareholders, while eliminating entry-level jobs from
| the job market.
| idle_zealot wrote:
| Call me an idealist but I don't think humans should be
| spending their time on jobs a computer can do.
|
| The solution to wealth disparity cannot include "invent
| menial untalented high-paying labor for people to do".
| __loam wrote:
| Yeah why should humans do bothersome labor
| like...creating literature?
| nonrandomstring wrote:
| > I congratulate all of the authors whose work is included
| in this dataset on contributing their knowledge, skills,
| and perspective to humanity's various endeavours
|
| Thank you. You know in some ways it's an honour and a
| privilege to live in such times of progress. The very act
| of publishing is to "let go", and hope that your words and
| ideas contribute to something bigger and beyond your life.
| I never believed much in "intellectual property" as it's
| all stuff that flows through us.
|
| > I hope that the fruits of their labours are returned to
| them
|
| They rarely are, because knowledge and creativity are not
| greatly valued in our time. But authors, artists and
| scientists go into that with eyes wide open these days. The
| rewards come in other ways, as the more you give and put
| into life the more you get out.
|
| > rather than being selfishly hoarded by the few with the
| resources necessary to produce those fruits
|
| This is not what we fear. Hoard away. We will simply take
| back what is ours, whenever we desire it. The hoarders will
| never win against what they call "piracy", because they
| have no moral right. In the long run, they are on the wrong
| side of history.
|
| Far worse, and more likely is that the creative and
| technical works of generations of artists and scientists
| are going to be turned to exactly the opposite of what they
| would want. They will be used to harm and disempower
| humans, divide society instead of heal it, and even make
| the pursuits of art, science and knowledge irrelevant.
|
| We cannot take back our words, or our formulas, or our
| paintings or our songs. But we can _take back tech_.
| jsheard wrote:
| This dataset includes "books3", which is a comprehensive dump
| of Bibliotik, a torrent tracker dedicated to pirated ebooks.
|
| Throw a dart at a wall filled with every notable
| author/publisher ever and whoever you hit probably owns some of
| this data.
|
| Apparently you can just do whatever as long as you say it's for
| AI research, go post Blu-ray rips online, it's fine provided
| you have a .ai domain :^)
| fsckboy wrote:
| > _Throw a dart at a wall filled with every notable
| author/publisher ever_
|
| copyrights do expire, and any books older than Mickey Mouse
| are public domain, so it's not _every notable author ever_
| jsheard wrote:
| Technically true, narrow that down to merely "every notable
| _living_ author and a subset of dead ones" then.
|
| Bram Stoker's bones will be relieved to hear that their
| work isn't being misappropriated.
| oldgradstudent wrote:
| It also contains an archive of opensubtitles, which is also
| not very open source.
| refulgentis wrote:
| The subtitles aren't open?
|
| If you meant transcribing dialogue from a TV show is
| violating copyright, I'm not so sure, it's relatively
| common to quote dialogue for varied purposes, ex. TV
| critics
|
| Definitely understand if you're saying the whole dialogue
| for a TV show is copyrighted, but I'm curious about the
| opensubtitles part, used to work in that area.
| PavleMiha wrote:
| Quoting is very different from posting the full contents
| of something. I can quote a book but I can't reproduce it
| in its entirety.
| refulgentis wrote:
| Right, you can't reproduce a book. W/r/t subs and dubs,
| fair use has applied historically.
| layer8 wrote:
| Quoting excerpts is different from transcribing an entire
| work, which is unambiguously copyright infringement.
| (Otherwise you would find the "book" version of any and
| all TV shows on Amazon.) The subtitles in question are
| generally translations, which likewise fall under
| copyright, being a derived work.
| refulgentis wrote:
| Yeah, I was just curious about the opensubtitles site
| because I used to work in that field (subtitles) and
| wasn't sure if there were some new pirate sites that were
| monetizing subs.
|
| n.b. not being argumentative, please don't read it that
| way, I apologize if it comes off that way:
|
| Not every derived work is a copyright violation, that's
| why subs and dubs don't get kicked around, you can quote
| dialogue in an article, etc.[^1]
|
| Answering if it applies to AI is playing out in court
| currently with ex. NYT v. OpenAI[^2] and Sarah Silverman
| et al v. OpenAI[^3] and v. Meta.[^4]
|
| [^1] "Copyright doesn't protect against all use of the
| work or use of derivative works. There are a few
| exceptions that fall under what's commonly known as the
| fair use doctrine:"
| (https://www.legalzoom.com/articles/what-are-derivative-
| works...)
|
| [^2]
| https://www.nytimes.com/2023/12/27/business/media/new-
| york-t...
|
| [^3] https://www.theverge.com/2024/2/13/24072131/sarah-
| silverman-...
|
| [^4] https://www.hollywoodreporter.com/business/business-
| news/sar...
| pk-protect-ai wrote:
| I wish it still included books3, but it doesn't anymore. I
| wish it were possible to download that 36GB books3.tar in the
| wild these days. Herewith, I promise to use this dataset
| according to the "fair use" only...
| SekstiNi wrote:
| > I wish it was possible to download that 36GB books3.tar
| in the wild these days.
|
| There... is a torrent.
| pk-protect-ai wrote:
| I know. But here where I am, using a torrent means
| participating in distribution of the content, and that is
| how I'll get a huge bill for illegally sharing this file.
| gosub100 wrote:
| not the domain per se, but the high-powered law firms at your
| fingertips. Copyright law is much easier to enforce against
| working-class parents of 12-year-olds than SV elites.
| arthurcolle wrote:
| I can't believe people would do this, just share and republish
| copyrighted works over the internet. I'm in shock and in
| disbelief.
|
| Anyways...
|
| Is RedPajama 30T and The Pile "all you need" ? ;)
| artninja1988 wrote:
| There is currently a project going on to create the pile v2
| which has only permissively licensed data, because of all the
| bickering about copyright.
| jeffrallen wrote:
| > because authors prefer to be paid for their labor
|
| FTFY.
| evilduck wrote:
| I asked an AI tool to create a cheery poem about ringworms
| infecting kids from the 1600s and it created something
| that's never existed before. Which author gets paid for
| this labor they performed?
| zettabomb wrote:
| This is pretty reductive - "FTFY" is rarely the witty
| response you think it is.
| ben_w wrote:
| Naturally, but I wonder what writers are going to do when
| the AI trained purely on suitably licensed content is still
| good enough to make most redundant.
|
| (The authors on best-seller lists may well be immune for
| a bit longer than other writers, as they're necessarily
| the top 0.1% of writers, but not forever: nay-sayers
| claimed that AI could never beat humans at chess or go
| because the games required special human insight).
| mejutoco wrote:
| > The authors in best seller's lists may well be immune
| for a bit longer than others writers, as they're
| necessarily the top 0.1% of writers
|
| The top best selling. Only one of many possible reasons
| for that might be the quality.
| ben_w wrote:
| Quality is subjective, therefore I think it is reasonable
| to say the best are those most able to profit rather
| than, e.g. winners of the Nobel Prize in Literature, or
| the list of books people most pretend to have read.
| wizzwizz4 wrote:
| Once upon a time, nay-sayers said that nobody could
| travel to the moon, regardless of what vehicle they used.
| They were wrong. Once upon a time, nay-sayers said that
| nobody could transmute lead into gold using alchemical
| equipment. They were right.
|
| Nay-sayers who said that _no possible algorithm_ could
| beat humans at chess and go? They were wrong. Nay-sayers
| who say that _these algorithms_ cannot write better books
| than humans? Well...
| SEGyges wrote:
| By "these algorithms", do you mean the ones that
| currently exist, or the ones that will exist next month,
| next year, or in 2034?
| wizzwizz4 wrote:
| We're not developing new algorithms all that quickly. My
| point is that one shouldn't dismiss criticism out-of-
| hand, just because some critics of some other thing
| turned out to be wrong: for this point to be valid, I
| don't need to be making criticism. On an unrelated
| note...
|
| _Personally_ , I'd be referring to the family of
| algorithms that purely take as input a context window and
| provide as output a prediction of the next token
| likelihood. (Plus or minus iteration, to generate strings
| of text.) Pejoratively, one might call these "fancy
| Markov chains", though as with most pejoratives, that's
| overly reductive.
|
| All the approaches we're seeing marketed heavily are just
| fancy Markov chains. I expect every "new algorithm" for
| the next 5 years at least to be a fancy Markov chain,
| because that's what I expect to get funding. (I do expect
| that _some_ people will be working on other approaches,
| but only for amateurish reasons.)
| SEGyges wrote:
| These are fancy Markov chains in the sense that humans
| are just chemicals and computers just do math.
| Technically true, but not even "overly reductive"; it is
| just wrong if it is used to imply that, e.g., humans just
| swirl around in beakers or the most complex thing you can
| do with computers is trigonometry.
|
| You can make anything sound unimpressive if you describe
| it sufficiently poorly.
|
| And: So many different variations are published every
| month. There are a good number of people in serious
| research trying approaches that don't use cross entropy
| loss (ie, strictly next-token prediction).
|
| I don't know what the trajectory of the technology is
| over the next ten years, but I am positive no one else
| does either and anyone who thinks they do is wrong.
| ben_w wrote:
| > Once upon a time, nay-sayers said that nobody could
| transmute lead into gold using alchemical equipment. They
| were right.
|
| Now I'm wondering if, with modern knowledge, you could
| build a 0.5 MeV heavy ion accelerator with only the
| things available to a medieval alchemist.
|
| I'm thinking probably yes? Triboelectrics can get the
| right voltage. But how good does the vacuum need to be?
|
| > Nay-sayers who say that _these algorithms_ cannot write
| better books than humans?
|
| They may be right or wrong in the specific, but I think
| they're asking the wrong question, too specific.
| jfvinueza wrote:
| Dunno. Writing fiction myself; asked AI to read it aloud.
| Narrative paragraphs worked fine: a clear, if a bit
| deadpan, slightly tone-deaf delivery. But dialogue was
| horrendous: it didn't understand emotional reactions and
| connotations at all. More so than cringey and robotic, it
| felt soulless. And the distance from "something that
| makes sense" to "something that feels human" felt
| insurmountable. Yes. Many novels will be written with
| LLMs in the coming years. They might even touch us. But
| this little Text-to-Speech experiment felt like
| evidence that this technology has a void at its core: it
| doesn't have access, like a human does, to a gargantuan
| emotional spectrum, which allows us to understand all
| sorts of subtleties between what is being said, and why,
| and what it actually means, and why it affects us
| (or, hell, how the next line should be read in this
| context, because it has no context, it doesn't feel).
| ben_w wrote:
| I'm also writing a novel, and using text to speech to
| hear how it sounds. One of the ones built into Mac OS.
| And I'd agree with your assessment, I value the
| synthesiser for bringing my attention to things my eyes
| gloss over, such as unnecessary repetition and typos
| which are still correctly spelled words (a common one for
| me is lose/loose).
|
| But: AI was seen as "decades" away from beating humans at
| go, even 6 months before it did.
|
| I don't know how far we are from them writing award
| winning novels (awards we care about, it doesn't count if
| it's an award for best AI), though my _gut feeling_ is we
| need another breakthrough as significant as the
| transformer model... but even then, that's only a 1s
| feeling.
| onion2k wrote:
| If the data is available online for the pile, surely it's
| also publicly available to ordinary people in a way that
| means authors aren't getting any money.
| sangnoir wrote:
| What sort of defense is this? "Your honor, after someone
| broke in, they left the door open. Since the door was
| unlocked, _anyone_ could have committed the crime I'm
| accused of."
| idle_zealot wrote:
| So a bunch of extra work to create a downgrade? I'm sure
| that's going to be very popular.
| arthurcolle wrote:
| The training data distribution is the only thing that
| matters, not the actual content
| observationist wrote:
| Unless you want something like style from a range of
| authors, knowledge of a fictional universe or storyline,
| or other domain specific data or style characteristics.
|
| A blanket removal of copyrighted data would make a bot
| sterile, boring, unrelatable, and ignorant of culture and
| common memes. We have amazing AI technology. Let's lean
| into it and see where it goes.
| __loam wrote:
| By violating the copyright of hundreds of authors.
| chasd00 wrote:
| If the Pile contains the code to go from step 1 to step 2 and
| then to 3, couldn't you just remove the parts you don't
| want from the raw dataset and re-run the code?
| doctorpangloss wrote:
| It's enough for pre-training to later tackle specific NLP
| tasks.
|
| To get something interesting you would have to generate an
| instruct dataset from it. It would have to cover a diverse
| range of tasks. The completions themselves do not make LLMs
| manifest knowledge and reasoning; a large and diverse instruct
| dataset does.
| Ninjinka wrote:
| I raised a concern about the inclusion of books3 in the Pile back
| in 2020, and this is what the head of Eleuther (Stella Biderman)
| told me:
|
| "So here's the big picture. There are three sets of datasets: 1.
| Data exists out there in the world. It has been collected into
| datasets and posted online. I'll call this raw data. 2. We take
| that data, clean it, and process it for language modeling. I'll
| call this per-set data. 3. We combine those per-set data into one
| massive dataset, the Pile. This is heavily processed, including
| weighing the components.
|
| We created 2 and 3 and put them online. We put 2 online so that
| people can reweigh and remix the data if they wish, but we expect
| most people to just download 3 and use it out of the box. Access
| to 3 will be provided in several forms, including HuggingFace and
| from our website.
|
| 2 and 3 are not copyright violations, even if the data is
| copyrighted, because they fall under fair use (at least in the
| US).
|
| The Pile contains code that turns 1 into 2 and code that turns 2
| into 3.
|
| When you download Maroon 5 from a website, you are creating a
| dataset corresponding to 2. That _can be_ copyright violation
| depending on what you do with it, but our use is not a copyright
| violation. "
| artninja1988 wrote:
| Hopefully that is correct. The pile has been very valuable for
| open model work. It's a really high quality dataset
| layer8 wrote:
| I don't understand how this can be true if set 2 contains a
| complete copyrighted work (say, a book) that the copyright
| owner hasn't approved for such distribution. Unless I
| misunderstand and the "process[ing] for language modeling" is
| an entirely irreversible process.
| CharlesW wrote:
| Even if the model encoding is not lossless/reversible, it's
| probably _not_ true. A good place to start when thinking
| about fair use is the "four factors" that the U.S. legal
| system will consider.
| https://fairuse.stanford.edu/overview/fair-use/four-factors/
| layer8 wrote:
| Summary books for example are legal, so there is _some_
| threshold of compression where things are fine.
| knodi123 wrote:
| are you referring to things like CliffNotes ?
| layer8 wrote:
| I'm referring to the "Summary of <some other book>"
| booklets you can find on Amazon. Also services like
| Blinkist.
| kmeisthax wrote:
| Google keeps coming up with new ways to statistically infer
| training set data from models. So the process isn't entirely
| lossy. At the very least, models that have been trained
| on a particular work are unusually good at compressing[0]
| those works, relative to other valid text.
|
| In terms of fair use, one of the larger factors is the
| 'market substitution' factor, which basically means "does
| this use compete with otherwise licensed uses that people
| would ordinarily pay for?" AI absolutely does compete with
| human artists for the same market. In fact, it's winning
| handily[1], _because you don't have to pay human artists_.
| AI art models absolutely shouldn't be trained on anything
| with copyright on it.
|
| The other factors don't fare much better. Nature of the
| original work will differ based on the plaintiff, but the
| purpose and character of the AI's use of that work is very
| much commercial. And the amount and substantiality of the
| use is complete and total. I don't see AI being fair use -
| at least, not in every one of the many, many training
| lawsuits currently ongoing against OpenAI and Stability.
|
| [0] Starting with any body of text, an LLM, and an empty
| context window, compute the next-token probabilities and
| take the highest one. If it matches the source text, output
| a 1 bit. If it doesn't, output 0 followed by the ID of the
| correct next token. Add the correct token to the context
| window and repeat until the text has been fully compressed.
| This produces a list of perplexities (wrong words) for the
| given text which can be used to guide the LLM to output the
| original work.
|
| [1] Hey, remember when both WotC (biggest art commissioner
| on the planet) and Wacom (hardware vendor that sells art
| tools and payment terminals[2]) both got caught using AI
| art after making very loud and public pledges to not do
| that? They both wound up buying stock photography on
| marketplaces that are absolutely flooded with AI trash.
|
| [2] All the credit card readers in Japan are built by
| Wacom, which is _really funny_ as an artist
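| The scheme in footnote [0] can be sketched as a toy: emit a
| 1-bit when the model's top next-token prediction matches the
| actual text, else a 0-bit plus the actual token. Here a bigram
| table is a hypothetical stand-in for the LLM's next-token
| predictor; the bitstream logic is the same.

```python
# Toy version of the footnote's compressor: a "model" predicts the
# next token; matches cost 1 bit, mismatches store the token itself.
from collections import Counter, defaultdict

def build_bigram_model(corpus_tokens):
    """Map each token to its most frequent successor in the corpus."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus_tokens, corpus_tokens[1:]):
        counts[a][b] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def compress(tokens, model):
    """(1, None) = prediction matched; (0, tok) = mismatch, store tok."""
    stream = [(0, tokens[0])]                  # first token is explicit
    for prev, actual in zip(tokens, tokens[1:]):
        stream.append((1, None) if model.get(prev) == actual
                      else (0, actual))
    return stream

def decompress(stream, model):
    """Replay the bitstream, asking the model whenever the bit is 1."""
    tokens = [stream[0][1]]
    for bit, tok in stream[1:]:
        tokens.append(model[tokens[-1]] if bit == 1 else tok)
    return tokens

corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = build_bigram_model(corpus)
text = "the cat sat on the mat".split()

packed = compress(text, model)
assert decompress(packed, model) == text       # lossless round trip
hits = sum(bit for bit, _ in packed)           # 4 of 6 tokens predicted
```

| Text the model has effectively memorized compresses to mostly
| 1-bits; unseen text fills the stream with explicit tokens, which
| is the perplexity signal the footnote describes.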
| IshKebab wrote:
| Yeah I agree. If 2 contains complete copyright works (e.g.
| all of Harry Potter) then "we're just using it for AI
| training!" stands approximately zero chance of passing the
| fair use test. Their assertion that it does is just wishful
| thinking.
| mistrial9 wrote:
| said with confidence, however Silverman et al Judge
| explicitly rejected what you just asserted AFAIK
| whimsicalism wrote:
| no, she didn't reject the claim that training on
| copyrighted work is infringement, merely that the outputs
| are not infringing simply by bearing similarity to the
| texts
| papercrane wrote:
| You've got it backwards. The judge in Silverman et al
| dismissed the claims asserting that OpenAIs output is
| copyright infringement. The claims for copyright
| infringement in the training data are still going
| forward, that will directly test whether it is "fair use"
| or not.
|
| From the ruling:
|
| > Assuming the truth of Plaintiffs' allegations - that
| Defendants used Plaintiffs' copyrighted works to train
| their language models for commercial profit - the Court
| concludes that Defendants' conduct may constitute an
| unfair practice.6 Therefore, this portion of the UCL
| claim may proceed.
|
| https://caselaw.findlaw.com/court/us-dis-crt-n-d-
| cal/1158180...
| mistrial9 wrote:
| aha - much appreciated
| HenryBemis wrote:
| I'm playing stupid now: I believe that if I ask the LLM to
| "display Harry Potter Book 1" and it does, word-by-word,
| then you're 100% right, it's copyright infringement. But,
| if I ask the LLM to "give me an analysis of the "Professor
| Severus Snape's" character and it gives me one, then I
| don't see the problem.
|
| So in that sense I understand the response that "they don't
| violate copyright" by studying the material. Again, I don't
| pretend to be a lawyer, and not every law has to follow my
| logic.
| swatcoder wrote:
| That's a different discussion.
|
| This isn't about the output for content generators or
| about the abstract numeric weights that they operate
| over. That's more complex and a largely open question.
|
| But this is literally about indiscriminately distributing
| copyrighted works in a large, convenient archive while
| arguing that it's okay because you normalized the
| formatting a bit and because you suspect that _some_
| people _might_ find "fair use" value in it.
| michaelt wrote:
| _> Unless I misunderstand and the "process[ing] for language
| modeling" is an entirely irreversible process._
|
| In the case of The Pile, "processing for language modelling"
| means "converting epub and pdf into plain text, maybe
| deduplicating, maybe removing some sorts of detectably
| malformed files"
|
| So not a particularly lossy conversion.
| layer8 wrote:
| I see, thanks. Yes, in that case, I don't see how this can
| possibly not constitute copyright infringement.
| bayindirh wrote:
| It's generally tucked under Fair Use doctrine because
| "It's for the science", until it doesn't (looking at
| commercial AI non-profits).
|
| Then "they're doing something amazing, they don't need
| permission, and the cat is already out of the bag, and
| similar musings".
|
| Seriously, it's both copyright infringement, and
| unethical. This is why I don't use any of the popular AI
| tools, or even AI add-ons in Evernote, Notion, etc. They
| all link back to the usual suspects.
| layer8 wrote:
| I'm talking about distributing the corpus, which by
| itself is not bound to any particular usage.
| bayindirh wrote:
| It's again copyright infringement. If I share a
| copyrighted ebook by accident, any and every cloud
| provider will ban my account with no warning and
| recourse.
|
| Open science repositories would take down the "dataset"
| immediately (or at least limit its access) if a copyright
| holder brings the matter to the eyes of the admins.
| CobrastanJorji wrote:
| > "They're doing something amazing, they don't need
| permission, and the cat is already out of the bag"
|
| Ah, the Uber theory of law. Works surprisingly well for
| some reason.
| bayindirh wrote:
| Probably due to Murphy's Golden Law of Golden Laws: Who
| has the gold makes the laws.
| Grimblewald wrote:
| The question then becomes: do these concerns remain even
| for AI that cannot reproduce original works? And what
| does that mean for us? When we read things, or interact
| with any information for that matter, it changes us and
| how we do things. If you consume art it will forever
| influence art you produce yourself. Are these copyright
| infringements also?
|
| I can see the problem where direct and faithful
| replication is possible but where it isn't is there still
| a problem? Or is the automatable aspect, the scale at
| which it can occur, that is the problem?
| bayindirh wrote:
| The difference is what you mix, and the amounts of things
| you mix. As a human you mix many more inputs, plus your
| emotions, plus the things you consume to create it.
| Moreover, what you can consume and how perfectly you can
| consume is limited by our innate limits.
|
| An AI system consumes something perfectly, then ingrains
| it into its weights perfectly, and becomes capable of
| imitating the same thing perfectly. Plus, there are no
| other internal or external factors which affect this
| "generation" over time. Hence, it mixes and reproduces
| based solely on what it consumed.
|
| I might get inspired by people, and add my own values to
| it, iterate over it and diverge from what I'm inspired
| from ultimately to create my own style. AI doesn't work
| like that. Also, if I do the same amount of inspiration
| with the same precision and accuracy, I'll be neck deep
| in accusations and lawsuits (for the right reasons).
|
| As a result, just because we fail to ask the right
| questions to reproduce the training data verbatim or
| almost verbatim doesn't mean that the information is not
| there. At the end, a neural network is a compression
| algorithm which encodes data in terms of weights. Given
| the correct input, you can regenerate the training data
| as is.
|
| Unless you have special abilities, you can't read 500
| books an hour, remember them perfectly, and generate
| derivative works by mashing all of them together. If I do
| and try to sell a novel, I'll be ridiculed no end. If I
| write a Ph.D. the same way and try to defend it, I'll be
| banned from academia for three lifetimes at least.
|
| For more elaboration on the subject, see [0].
|
| [0]: https://news.ycombinator.com/item?id=39188463
| miki123211 wrote:
| The myth that AI models store all their training data in
| their weights verbatim is as widespread as it is false. In
| fact, if this were the case, deep neural networks would
| be considered far better compression algorithms than
| anything we have on the market right now, by literal
| orders of magnitude.
|
| If you divide Stable Diffusion's file size by the number
| of images used to train it, you get something like 1.2
| bits per image, and it is physically impossible to get
| this kind of a compression ratio.
|
| The actual problem with AI is that it _sometimes_
| plagiarizes random _fragments_ of the work it is trained
| on, even if it is not the user's intent, and we
| currently don't really know how to fully prevent this.
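| The bits-per-image arithmetic checks out even with generous
| assumed round numbers (hypothetical: a ~2 GB checkpoint, ~5
| billion training images; neither figure is a measurement):

```python
# Back-of-the-envelope: bits of model weight available per training
# image, vs. the size of even a tiny compressed thumbnail.
# Both inputs are assumed round numbers, not measurements.
checkpoint_bytes = 2e9          # assumed ~2 GB of weights
training_images = 5e9           # assumed ~5 billion images

bits_per_image = checkpoint_bytes * 8 / training_images   # 3.2 bits

# A heavily compressed ~10 KB thumbnail is still 80,000 bits, so
# verbatim storage of every image is off by a factor of ~25,000.
thumbnail_bits = 10_000 * 8
shortfall = thumbnail_bits / bits_per_image

print(f"{bits_per_image:.1f} bits/image, shortfall ~{shortfall:,.0f}x")
```

| Whatever exact figures one plugs in, the gap is several orders
| of magnitude, which is the parent comment's point.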
| bayindirh wrote:
| It still doesn't change the fact that the inclusion of
| commercial works is copyright infringement though.
|
| Same for code generating models trained on Open Source
| and Free Software. Tons of licenses are violated, from strong
| copyleft to source-available, with code reproduced
| (almost) verbatim, comments intact.
|
| Some researcher's codebase is almost completely
| reproducible without any licensing information just by
| hinting the function names.
|
| Maybe for the image compression it's borderline
| impossible for now due to network size, but for text and
| code, generation of training data almost verbatim is very
| possible and straightforward.
|
| Also in image generation models, style transfer is the
| bigger problem, because it completely eliminates the
| artist who uses/created the style in the first place.
| "You pioneered this, and we fine tuned this model with
| your images, and we can do your work for free, without
| you, have a nice day". However, the artist's living
| expenses don't disappear when their style is transferred
| to an image generation model.
|
| This is also unethical.
| __loam wrote:
| The "humans do it too" argument is totally irrelevant to
| this because humans have special and specific privileges
| under the law that computers don't. The problem is that a
| lot of data was copied into training sets and used for
| commercial purposes without permission.
| ribosometronome wrote:
| Whether or not I'm influenced by media isn't super
| relevant to whether or not I pirated that media. Before
| even arriving at the question of whether or not the
| resulting models are infringing, it's clear the training
| data is.
| chinathrow wrote:
| Nicely stated copyright violations. Has no one filed suit yet?
| SEGyges wrote:
| Huckabee v Bloomberg, Meta, et al
| whimsicalism wrote:
| scraping libgen and downloading copyrighted content and
| redistributing it isn't illegal?
|
| call me skeptical, seeding a torrent of movies that you
| downloaded from elsewhere on the internet isn't "fair use" and
| the pile isn't just code for transforming data, it is the
| redistributed data itself
|
| by this logic i could legally run a libgen mirror
| nickpsecurity wrote:
| They're distributing copyrighted works without the authors
| permission, using them in ways that compete with the author,
| many make money off AI's, and the AI's reproduce some verbatim.
| These datasets seem to fail most tests ("four factors") in
| copyright law. Even laypeople I've explained LLM's to think the
| AI companies are ripping others' work off.
|
| For those concerned, I have an article that covers legalities,
| each dataset (including The Pile), legal issues with them,
| alternatives that are legal, and a copyright amendment that
| balances all sides.
|
| http://gethisword.com/tech/exploringai/
|
| Looking back at my proposal, I think we need at least four
| rules passed immediately in at least one country:
|
| 1. All copyrighted works can, if a person has legal access, be
| used for training AI systems. Any terms restricting copyrighted
| works from use in training, charging more for that, restricting
| downloads for it, etc are illegal. Every act of publishing can
| benefit both a human mind and AI training equally.
|
| 2. People can copy and transform for their own use any work
| they have access to _only for AI training._ This might include
| reverse engineering for extraction, multiple copies in
| different formats, and so on. They can do whatever is needed to
| get it into the AI system. Other uses or abuse of this data is
| subject to existing law.
|
| 3. Any work published online for free and with public access
| can be copied, _shared_ , processed, and bundled for AI
| training. That's regardless of its terms.
|
| Note: In No. 2 and No. 3, the resulting AI's copyright will be
| determined by existing law about AI's and mixing copyrighted
| works. Or no copyright if that's the law.
|
| 4. If AI outputs are copyrighted, their status will be the same
| as if the user had published them themselves while relying on prior
| works. AI training sets will also be public to determine this.
|
| With those rules, we can share works like those in The Pile,
| still pay creators that want to be paid, be less likely to just
| steal existing work, and infringement in outputs is still
| illegal. What do you all think of that?
| dougb5 wrote:
| I don't know what the right answer is to the copyright
| questions, but I hope that in 2024 we'll have a better attitude
| about the human labor that went into these models than "Data
| exists out there in the world" and the passive-voice "It has
| been collected into datasets"
| clooper wrote:
| Have you heard of data unions?
| otterley wrote:
| > 2 and 3 are not copyright violations, even if the data is
| copyrighted, because they fall under fair use (at least in the
| US).
|
| This cannot be known until it is litigated. Fair Use is not
| something you can unilaterally declare and have it be so, just
| like you can't be like Michael Scott in the Office shouting "I
| declare bankruptcy!" OpenAI is currently defending itself
| against the New York Times for this very reason.
|
| There's a multi-factor test that courts weigh the facts against
| in making a determination as to whether a _prima facie_
| copyright violation would be protected under a Fair Use
| defense:
|
| Factor 1: The Purpose and Character of the Use
|
| Factor 2: The Nature of the Copyrighted Work
|
| Factor 3: The Amount or Substantiality of the Portion Used
|
| Factor 4: The Effect of the Use on the Potential Market for or
| Value of the Work
|
| See https://copyright.columbia.edu/basics/fair-use.html for a
| pretty good overview of what the analysis entails.
| ryukoposting wrote:
| Thanks. This is really informative, and really important
| information given the growing relevance of IP law in
| everyone's daily life. Part of me wonders if these four
| factors will ever become part of core curriculum for civics
| classes.
|
| By no means am I an expert in copyright law, but factor 3
| seems like very bad news if you're OpenAI.
| fluoridation wrote:
| Whether something is fair use or not is not determined by a
| court, but by the definition of what fair use is. A court
| interprets that definition and the situation, and if their
| interpretation matches yours you may have a ruling in your
| favor. But saying "this is fair use" is no more incorrect
| than saying "this is red". You're interpreting your
| perception and putting that interpretation into words.
| otterley wrote:
| > But saying "this is fair use" is no more incorrect than
| saying "this is red".
|
| When a court determines that it isn't, you can continue to
| argue it as much as you like (to deaf ears), and yet you're
| still liable to the copyright holder. Whether it's
| "incorrect" or not is then irrelevant. Let's not argue
| semantics here.
| wiremine wrote:
| > This cannot be known until it is litigated. Fair Use is not
| something you can unilaterally declare and have it be so.
|
| Correct, but what isn't clear here is their rationale for why
| they think they're covered by fair use. Does anybody have
| that information?
|
| I'm not saying their interpretation is correct, but seems to
| be germane to this discussion. The parent comment seems to
| assume none of this has been litigated yet, which might also
| be true. Or not.
| 3abiton wrote:
| Interesting take on the copyright law.
| tycho-newman wrote:
| Fair use is a defense to infringement. Do not start your
| copyright argument by admitting you infringed.
| mjtechguy wrote:
| Would be interested to see what is in there. Luckily no one has
| posted the magnet link on Twitter.
| SEGyges wrote:
| The counterparties on related legal action are sufficiently
| litigious that it is probably smarter to DM the magnet link.
| fddrdplktrew wrote:
| 825gb seems really small
| _obviously wrote:
| Seems kind of small tbqh.
| beiller wrote:
| It seems small, until you try to download it.
| jMyles wrote:
| So much of this thread is concerned not with the achievement of
| this data set, but with the (by comparison) silly and outdated
| spat over how to frame it as "property" for the purposes of
| government intervention (pursuant to which jurisdiction?).
|
| The era of intellectual "property" is over. Let's be at peace
| with that and just move on into the next age.
| __lbracket__ wrote:
| LLM are of use to megacorps.
|
| Megacorps assume authors, painters, etc. are poor and powerless
| (which, let's face it, they are)
|
| we can b** and moan on HN, but megacorps will find ways to use
| copyrighted works for free.
| quatrefoil wrote:
| While a lot of attention has been given to books3, another large
| component of this dataset is the deceptively-named
| "OpenWebText2". What's that? It's a scrape of 15 years' worth of
| third-party websites that were linked to from upvoted Reddit
| submissions. I know this includes some of my writing.
| 7moritz7 wrote:
| Care to give me your domain name so I can check all major llms
| for plagiarism? I have a feeling none of them can produce a
| sentence from your writings
| quatrefoil wrote:
| It takes deliberate effort, but I was actually able to get
| pieces of my writing out of one of the leading LLMs (not
| ChatGPT). This is not particularly unique, a number of folks
| demonstrated the same.
| observationist wrote:
| Relevance and impact aside, if you publish something to the
| internet on a site with no access restriction in place, I don't
| know how you can keep a straight face while claiming some sort
| of moral right to the content. It's the equivalent of
| broadcasting it over radio, or printing and delivering it
| straight to the doorsteps of millions of random individuals.
| Methinks you doth protest too much, or something.
|
| There are ways of copyrighting data, and establishing ownership
| of intellectual property. Your tumblr fanfic, youtube comments,
| or HN discussions are not legitimate copyright avenues. Stuff
| you post to legally scrapeable websites is fair game for fair
| use.
|
| I can do anything I want in private to any data I collect. I
| could create an awesome HN LLM on the scraped datasets, and use
| it privately to my hearts content. I can even set up an API to
| that LLM that generates content, and, given recent rulings,
| even if i had all the written copyrighted data in the world, as
| long as I was making good faith efforts to ensure copyright was
| being respected and works weren't being recreated verbatim,
| then I could even use that model commercially. I just couldn't
| sell it to other people, or distribute it, without entering a
| different legal regime.
|
| I can collect any data I want from public facing websites.
|
| That's how the internet works; it's how it was designed. There
| are authentication mechanisms, network configurations, and a
| myriad other access control schemes you can implement to
| prevent public access. If you post to sites without those
| mechanisms, you're tacitly agreeing to give up any plausible
| claims of protection against a wide array of fair uses well
| established by precedent cases at this point. If you don't
| prevent public access, and you've got a domain name on a
| server, you're tacitly inviting the world to come download
| whatever it is you have on your server. This is a social good.
| This is what we want when we participate in the internet.
|
| Insisting on some sort of vague entitlement as to how "your"
| data gets used completely bypasses the fact that anything you
| consider to be misused in OpenWebText2 fundamentally stems from
| the fact that you posted the content to a publicly visible
| website and gave up any say in what happens thereafter. It was
| scraped fair and square.
|
| Don't complain that you didn't know the rules, or that life
| isn't fair.
|
| It's not even clear that terms of service or those little
| popups on public websites have any legal relevance. If your
| website is open to the public, then it's fair game. If you post
| content to a public website, then that content's fair game.
| intalentive wrote:
| This is 4 years old. Why the top of HN now?
| willvarfar wrote:
| Are there any simple text editors or wysiwyg that have local LLMs
| and can tidy up and auto-suggest whole paras of slick verbiage as
| you type?
| clooper wrote:
| The big Hollywood studios pay a lot of money to various
| cybersecurity companies to look for pirated content and send
| cease-and-desist letters to hosting companies for letting their
| users distribute copyrighted content.
|
| If authors and artists were to join a data union they could do
| the same thing as studios. If copyright law has any real teeth,
| the data union can send legal requests to whoever is hosting
| the content, requesting that it be taken down.
|
| I'm not a lawyer but I know the studios definitely do this.
| kristianp wrote:
| Please add (2020) to the title.
___________________________________________________________________
(page generated 2024-03-07 23:02 UTC)