[HN Gopher] In the LLM space, "open source" is being used to mea...
___________________________________________________________________
In the LLM space, "open source" is being used to mean "downloadable
weights"
Author : FanaHOVA
Score : 289 points
Date : 2023-07-21 15:49 UTC (7 hours ago)
(HTM) web link (www.alessiofanelli.com)
(TXT) w3m dump (www.alessiofanelli.com)
| spullara wrote:
| It remains to be seen in court whether weights are even
| copyrightable potentially making all the various licenses and
| their restrictions moot.
| humanistbot wrote:
| And it also remains to be seen if various legislatures will
| pass laws that explicitly declare the copyright status of model
| weights. It is important to remember that what is or is not
| copyrightable can change.
| rvcdbn wrote:
| At least in the US, copyright is established by the
| Constitution, so I'm not sure how much it's possible to
| change via the normal legislative process.
| gpm wrote:
| The US constitution grants congress the ability to create
| copyright ("To promote the progress of science and useful
| arts, by securing for limited times to authors and
| inventors the exclusive right to their respective writings
| and discoveries"), but it doesn't create copyright law
| itself. That's a broad clause that gives Congress pretty
| free rein to change how copyright is defined.
| rvcdbn wrote:
| Constitutionality is also about how previous cases have
| been evaluated; for example, see the bit about how
| photography copyright was established here:
| https://constitution.congress.gov/browse/essay/artI-S8-C8-3-...
| rvcdbn wrote:
| specifically:
|
| > A century later, in Feist Publications v. Rural
| Telephone Service Co., the Supreme Court confirmed that
| originality is a constitutional requirement
| ljdcfsafsa wrote:
| 1. Why wouldn't they be? And 2. Does that even matter? If you
| enter into a contract saying don't do X, and you do X, you're
| violating the contract.
| sebzim4500 wrote:
| I assume GP was talking about a scenario in which you had not
| entered into a contract with Meta. E.g. if I just downloaded
| the weights from someone else.
| rvcdbn wrote:
| 1 - because they lack originality, see:
| https://constitution.congress.gov/browse/essay/artI-S8-C8-3-...
| dvdkon wrote:
| In a similar vein, the common "you may not use this model's
| output to improve another model" clause is AFAIK unenforceable
| under copyright, so it's _at best_ a contractual clause binding
| a particular user. Anyone using that improved model afterward
| is in the clear.
| ljdcfsafsa wrote:
| > it's at best a contractual clause binding a particular
| user. Anyone using that improved model afterward is in the
| clear.
|
| That's... not really accurate. See the concept of tortious
| interference with a contract.
| dvdkon wrote:
| Hm, I don't know much about common law, but I don't think
| this would apply if, say, an ML enthusiast trained a model
| from LLaMA2 outputs, made it freely available, then someone
| else commercialised it. The later user never caused the
| original developer to breach any contract, they simply
| profited from an existing breach.
|
| That said, doing this inside one company or with
| subsidiaries probably wouldn't fly.
| taneq wrote:
| And of course anyone using a model improved by this is
| entirely unworried by these clauses if their improved model
| takes off hard.
| banana_feather wrote:
| The idea is that if you violate the terms of the license to
| develop your own model, you lose your rights under the
| license and are creating an infringing derivative work. If I
| clone a GPL'd work and ship a derivative work under a
| commercial license, downstream users can't just integrate the
| derivative work into a product without abiding by the GPL
| terms and say "well we're downstream relative to the party
| who actually copied the GPL'd work, so the GPL terms don't
| apply to us".
| dvdkon wrote:
| Thing is, the outputs of a computer program aren't
| copyrightable, so it doesn't matter if your improved model
| is a derivative work. What you say would apply if you
| derived something from the weights themselves (assuming
| they are copyrightable, of course).
| blendergeek wrote:
| If such a "derivative" model is a derivative work, then
| aren't all these LLMs just mass copyright infringement?
| banana_feather wrote:
| At the end of the day it's not black and white, but
| there's a large and obvious difference in degree that
| would plausibly permit someone to find that one is and
| the other isn't. It's fairly easy to argue that using the
| outputs of LLM X to create a slightly more refined LLM Y
| creates a derivative work. The argument that a model is a
| derivative work relative to the training data is not so
| clear cut.
| dragonwriter wrote:
| If model weights aren't copyrightable, derivative model
| weights are not a "work", derivative or otherwise, for
| copyright purposes.
|
| If they are, and the license allows creating finetuned
| models but not using the output to improve the model,
| then the derived model is not a violation, but it might
| be a derivative work.
| dTal wrote:
| Exactly this. What's good for the goose is good for the
| gander!
| rodoxcasta wrote:
| If the weights are not copyrightable, you don't need a
| license to use them; they are just data. There's no right
| to infringe if these numbers have no author. Of course, to
| use the OpenAI API you must abide by their terms. But if
| you publish your generations and I download them, I have
| nothing to do with the contract you have with OpenAI since
| I'm not a party to it. They can't prevent me from using
| them to improve my models.
| diffeomorphism wrote:
| Really?
|
| Your customers bought that product under license A.
| Afterwards it turned out that you pirated some artwork from
| Disney. Then your customer can sue you (not Disney) to make
| things right. The specific license of the original work
| seems quite irrelevant here.
| pessimizer wrote:
| Not at all. The reason your customer can sue you is
| because Disney can sue your customer. Disney would be
| suing your customer under the specific license of the
| original work.
|
| edit: you seem to see the customer as the primary victim
| here instead of Disney, but if Disney weren't a victim
| the customer wouldn't have a case.
| stale2002 wrote:
| No, because the premise of the hypothetical is that the
| weights aren't protected by copyright.
|
| So, no matter what the TOS says, it's not an infringing
| work.
|
| > Downstream users can't just integrate the derivative work
| into a product without abiding by the GPL terms
|
| You absolutely could do this if the original work is not
| protected by copyright, or if you use it in a way that is
| transformative and fair use.
| mattl wrote:
| Something under the GPL is also copyrighted. The GPL is a
| copyright license.
| stale2002 wrote:
| If the underlying work is not protected by copyright, it
| doesn't matter what license someone tries to put on it.
|
| Similarly, if someone creates a fair use/transformative
| work then the license can also be ignored.
| FanaHOVA wrote:
| Yep, same with SSPL. GPL has been tested in FSF vs Cisco
| (2008), but none of the more restrictive licenses have.
| jrockway wrote:
| It seems like a dangerous clause to me.
|
| 1) "Dear artists, the model cannot infringe upon your copyright
| because it's merely learning like a human does. If it
| accidentally outputs parts of your book, you know, it just
| accidentally plagiarized. We all do it haha! Our attorneys
| remind you that plagiarism is not illegal in the US."
|
| 2) "Dear engineers, the output of our model is copyrighted and
| thus if you use it to train your own model, we own it."
|
| I am not sure how both of those can be true at the same time.
| jimmaswell wrote:
| We all truly do "accidentally plagiarize", especially
| artists. Many guitarists realize they accidentally copied a
| riff they thought they'd come up with on their own for
| example.
| jrockway wrote:
| I, for one, welcome our new plagiarism overlords.
|
| Oops.
|
| I added the "haha" in there because the probability of a
| human doing this kind of goes way down as the length of the
| text increases. Can you type, verbatim, an entire chapter
| of a book? I can't. But, I bet the AI can be convinced in
| rare cases to do that.
|
| The whole thing is very interesting to me. There was an
| article on here a couple days ago about using gzip as a
| language model. Of course, gzipping a book doesn't remove
| the copyright. So how low does the probability of
| outputting the input verbatim have to be before copyright
| is lost?
|
| Reading the book and benefitting from what you learned?
| Obviously not copyright infringement. Putting the book into
| gzip and sending your friend the result? Obviously
| copyright infringement. Now we're in the grey area and ...
| nobody knows what the law is, or honestly, even how to
| reason about what the law wants here. Fun times.
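|
| (For the curious: the gzip-as-a-language-model article presumably
| refers to the compression-based text classification trick, where
| similarity is scored by normalized compression distance and the
| nearest labelled neighbour wins. A minimal, purely illustrative
| sketch in Python; the snippets and labels below are made up:
|
|     import gzip
|
|     def ncd(a: str, b: str) -> float:
|         # Normalized compression distance between two strings.
|         ca = len(gzip.compress(a.encode()))
|         cb = len(gzip.compress(b.encode()))
|         cab = len(gzip.compress((a + " " + b).encode()))
|         return (cab - min(ca, cb)) / max(ca, cb)
|
|     # Toy labelled "training set" (invented examples).
|     corpus = [
|         ("the cat sat on the mat and purred", "animals"),
|         ("stock prices fell sharply after the earnings report", "finance"),
|     ]
|
|     query = "the dog sat on the rug and barked"
|
|     # 1-nearest-neighbour classification by compression distance.
|     label = min(corpus, key=lambda item: ncd(query, item[0]))[1]
|     print(label)  # expected: "animals", it shares more substrings
|
| Unlike a trained model, the gzip stream is fully reversible, which is
| why "gzipping a book doesn't remove the copyright" is the easy end of
| the spectrum.)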
|
| (Personally, I lean towards "not copyright infringement",
| but I'm not a big believer in copyright myself. In the case
| of AI training, it just makes it impossible for small
| actors to compete. Google can just buy a license from every
| book distributor. SmolStartup can't. So if we want to make
| AI that is only for the rich and powerful, copyright is the
| perfect tool to enable that. I don't think we want that,
| though.
|
| My take is that the rest of society kind of hates Tech
| right now ("I don't really like my Facebook friends, so
| someone should take away Mark Zuckerberg's money."), so
| it's likely that protectionist laws will soon be created
| that ruin it for everyone. The net effect of that is that
| Europe and the US will simply flat-out lose to China, which
| doesn't care about IP.)
| spullara wrote:
| There are people that can type, verbatim, the entire
| chapters of books.
| Der_Einzige wrote:
| The overwhelming majority of all human advancement is in
| the form of interpolation. Real extrapolation is extremely
| rare and most don't even know when it's happening. This is
| why it's extremely hypocritical for artists of any sort to
| be upset about Generative AI. Their own minds are doing the
| same exact thing they get upset about the model doing.
|
| This is why a fundamentally "interpolative" technique like
| ChatGPT (whose weights are in theory frozen) is still
| basically super-intelligent.
| polotics wrote:
| Wow you appear to know a great deal about how human minds
| work: "doing the same exact thing they get upset about
| the model doing"... May I ask you to put up a list of
| publications on the subject of how minds work?
| Der_Einzige wrote:
| My insights are widely accepted theories from various
| fields, all available in the public domain.
|
| It's a well-understood concept that our minds function by
| making sense of the world through patterns. This is the
| essence of interpolation - taking two known points and
| making an educated guess about what lies in between. Ever
| caught yourself finishing someone's sentence in your mind
| before they do? That's your brain extrapolating based on
| previous patterns of speech and context. These processes
| are at the heart of human creativity.
|
| The field of Cognitive Science has extensively documented
| our tendency for interpolation and pattern recognition.
| Works like The Handbook of Imagination and Mental
| Simulation by Markman and Klein, or even "How Creativity
| Works in the Brain" by the National Endowment for the
| Arts all attest to this.
|
| When artists create, they draw from their experiences,
| their knowledge, their understanding of the world - a
| process overwhelmingly of interpolation.
|
| Now, I can see how you might be confused about my
| reference to ChatGPT being "super-intelligent". Perhaps
| "hyper-competent" would be more appropriate? It has the
| ability to generate text that appears intelligent because
| it's interpolating from a massive amount of data - far
| more than any human could consciously process. It's the
| ultimate pattern finder.
|
| And that, my friend, is my version of "publications on
| the subject of how minds work." I may not be an
| illustrious scholar, but hey, even a stopped clock is right
| twice a day! And who knows, maybe I'm on to something after
| all.
| saghm wrote:
| There was a famous case where John Fogerty (formerly of
| Creedence Clearwater Revival) ended up getting sued by
| CCR's record label, claiming a later solo song he did with
| a different label was too similar to a CCR song that he
| wrote, and they won. So legally speaking, you can even get
| in trouble for coming up with the same thing twice if you
| don't own the copyright of the first one.
| rcxdude wrote:
| The copyright situation with music is kinda broken:
| different parts of the performance get quite different
| priority when it comes to copyright (many core elements
| of a performance get basically no protection, whereas the
| threshold for what counts as a protectable melody is
| absurdly low). This means it's less than worthless for
| some genres/traditions: in jazz and blues, especially, a
| huge part of the genre and culture is adapting and playing
| with a shared language of common riffs.
| luma wrote:
| 2) doesn't line up with US courts' current stance that only
| a human can hold copyright, and thus anything created by a
| non-human cannot have copyright applied. This applies to
| animals, inanimate objects, and presumably, AI.
|
| I have no idea how this impacts the enforceability of the
| license from FB, which may rely on things other than
| copyright, but as of right now, the output absolutely cannot
| be copyrighted.
| jrockway wrote:
| That's an extremely good point. The output of software is
| never copyrightable. What makes language models not
| software?
| danielbln wrote:
| Isn't Photoshop software?
| pessimizer wrote:
| Photoshop's output has been completely guided (until
| recent additions) by a human who can hold a copyright.
|
| That being said, isn't a prompt guidance?
| sangnoir wrote:
| If they are not copyrightable, that'll be the end of publicly-
| released weights by for-profit companies. All subsequent models
| will be served behind an API.
| dragonwriter wrote:
| > If they are not copyrightable, that'll be the end of
| publicly-released weights by for-profit companies
|
| I don't see why; for-profit companies release permissively-
| licensed open-source code all the time, and noncopyrightable
| models aren't practically much different than that.
| bilbo0s wrote:
| Because the courts will have determined their business
| models for them.
|
| As mercenary as it may sound, what these companies are
| trying to do is find a business model that is as friendly
| to themselves as it is hostile to their competitors.
|
| This is all part of the jockeying.
| dragonwriter wrote:
| And, sure, lack of copyrightability changes the
| parameters and will change behavior. What I think you
| have failed to support is that the _particular_ change
| that it will induce will eliminate all such releases.
| sangnoir wrote:
| I debated whether to be more specific and verbose in my
| earlier comment and brevity won at the expense of clarity.
| I meant large models that cost 6 or 7 digits to train
| likely won't be released if the donor company can't control
| how the models are used.
|
| > I don't see why; for-profit companies release
| permissively-licensed open-source code all the time
|
| I agree with this - however, they tend to open-source non-
| core components - Google won't release search engine code,
| Amazon won't release scalable-virtualization-in-a-box, etc.
|
| I'm confident that Facebook won't release a hypothetical
| Llama 5 in a manner that enables it to be used to improve
| ChatGPT 8 - the aim will be unchanged from today, but the
| mechanism will shift from licensing to rate-limiting,
| authentication & IP-bans.
| weinzierl wrote:
| I find the idea that weights are not copyrightable very
| fascinating - appealing even. I have a hard time imagining a
| world where this is the case, though.
|
| Can you summarize why weights would not be copyrightable or
| give me pointers to sources that support that view?
| cbm-vic-20 wrote:
| An analog to this might be the settings of knobs and switches
| for an audio synthesizer, or guitar effects settings. If you
| wanted to get the "Led Zeppelin sound" from a guitar, you
| could take a picture of the knobs on the various pedals and
| their configuration, and replicate that yourself. You then
| create a new song that uses those settings. Is that something
| that is allowed under copyright?
|
| What if there were billions of knobs, tuned after years of
| feedback and observations of the sound output?
| paxys wrote:
| I don't think that's a good analogy. A piano has N keys.
| You can press certain ones in certain combinations and
| write it down. That result is still copyrightable, because
| you can prove that it was an original and creative work.
| Setting knobs for a machine is no different, but the key
| differentiator is if you did it yourself or if an algorithm
| did it for you.
| cbm-vic-20 wrote:
| In my analogy, it's not the sequence of the notes or the
| composition, which I agree is copyrightable. But are the
| settings of the knobs and switches on synthesizers and
| effects devices used in a recording equivalent to the
| weights of a neural network or LLM? And if so, are those
| settings or weights copyrightable?
| rvcdbn wrote:
| That's a bad analogy because a human chose the values of
| those settings using their creative mind. That's not at all
| the case with weights. This originality is the heart of
| copyright law.
| slimsag wrote:
| Speculating (I am not a lawyer) I see two options:
|
| 1. Model weights are the output of mathematical principles;
| in the US facts are not copyrightable, so in general math is
| not copyrightable.
|
| 2. Model weights are the derivative work of all copyrighted
| works they were trained on - in which case, it would be similar
| to creating a new picture which contains every other picture
| in the world inside of it. Who is the copyright owner? Well,
| everyone, since it includes so many other copyright holders'
| works in it.
| humanistbot wrote:
| Your second argument, if true, disproves your first
| argument.
| slimsag wrote:
| Doesn't matter. A court decides in the end, and the two
| choices I presented could lead to OPs scenario. If a
| court decides that, they decide that, period. I'm not
| 'making an argument' with those points - I'm presenting
| options a court might choose from when setting precedent.
| FishInTheWater wrote:
| Remember that database rights are a thing.
|
| One cannot hold copyright on facts, but one can "copyright" a
| collection of facts like a search index or a map.
| earleybird wrote:
| Your second question asks: "Who owns the Infinite
| Library[0]?"
|
| Related: there was a presentation (I've lost the reference)
| on automatic song (tune?) generation where the presenter
| claimed (rather humorously) that he'd generated all the
| songs that had ever been and will ever be, so that while he
| was infringing on a large but finite number of songs, he
| was non-infringing on an infinite number of future songs.
| So, on balance, he was in a favourable position.
|
| [0] https://en.wikipedia.org/wiki/The_Library_of_Babel
| sebzim4500 wrote:
| Generally the output of a machine is not copyrightable.
| Similarly, the contents of a phone book is not copyrightable
| in the US even if the formatting/layout is. So I could take a
| phonebook and publish another one with identical phone
| numbers as long as I laid it out slightly differently.
| xxpor wrote:
| Work also has to be "creative" in order for it to be
| eligible for copyright. This is why photomasks have
| special, explicit protection in US law; they're not really
| "creative" in that way.
|
| https://en.wikipedia.org/wiki/Integrated_circuit_layout_desi...
| cal85 wrote:
| What about compiled binaries? If I write my own original
| source code (and thus automatically own the copyright to
| it), and compile it to binary, is the binary not protected
| too?
| sebzim4500 wrote:
| No, because the input to that process was a bunch of
| work that you did.
|
| In the case of an LLM, I don't think the work of compiling
| the training data would qualify, by analogy to the
| phonebook example.
| humanistbot wrote:
| By that logic, if you convert a copyrighted song or movie
| from one codec to another, then that would not be
| copyrightable because it is the output of a machine.
| xxpor wrote:
| The song itself isn't output by the machine.
| humanistbot wrote:
| Neither was the original training data, which was
| copyrighted books, art, etc.
| dragonwriter wrote:
| > Neither was the original training data, which was
| copyrighted books, art, etc.
|
| If the original training data is a copyrightable
| (derivative or not) work, perhaps eligible for a
| compilation copyright, the model weights might be a form
| of lossy mechanical copy of _that_ work, and be both
| subject to its copyright and an infringing unauthorized
| derivative if it is.
|
| If it's not, then I think even before fair use is
| considered the only violation would be the weights
| potentially infringing copyrights on original works, but
| I don't think an _incomplete_ copy automatically works for
| them the way it would for an aggregate; I'd think you'd
| have to demonstrate reproduction of the creative elements
| protected by copyright from _individual_ source works to
| make the claim that it infringed them.
| xxpor wrote:
| The _output_ of the training though is unrecognizable.
| SideburnsOfDoom wrote:
| Sometimes, the output is a recognisable plagiarism of a
| specific input.
|
| If it isn't recognisable, then it's merely _distributed_
| plagiarism: a million outputs, each of which plagiarises
| 0.0001% of each of a million inputs.
| dragonwriter wrote:
| It _isn't_ independently copyrightable.
|
| It's a mechanical copy subject to the copyright on the
| original, though.
| danShumway wrote:
| Correct that it would not be copyrightable, but you're
| missing the point.
|
| A codec conversion is not copyrightable. The original
| _song_, which is still present enough in the conversion to
| impact its ability to be distributed, is still
| copyrightable. But you don't get some kind of new
| copyright just because you did a conversion.
|
| For comparison, if you take a public domain book off of
| Gutenberg and convert it from an EPUB to a KEPUB, you
| don't suddenly own a copyright on the result. You can't
| prevent someone else from later converting that EPUB to a
| KEPUB again. Copyright protects creative decisions, not
| mathematical operations.
|
| So if there is a copyright to be held on model weights,
| that copyright would be downstream of a creative decision
| -- ie, which data was it trained on and who owned the
| copyright of the data. However, this creates a weird
| problem -- if we're saying that the artifact of
| performing a mathematical operation on a series of inputs
| is still covered by the copyright of the components of
| that database, then it's somewhat tricky to argue that
| the creative decision of what to include in that database
| should be covered by copyright but that copyrights of the
| actual content in that database don't matter.
|
| Or to put it more simply, if the database copyright
| status impacts models, then that's kind of a problem
| because most of the content of that training database is
| unlicensed 3rd party data that is itself copyrighted. It
| would absolutely be copyright infringement for
| OpenAI/Meta to distribute its training dataset
| unmodified.
|
| AI companies are kind of trying to have their cake and
| eat it too. They want to say that model weights are
| transformed to such a degree that the original copyright
| of the database doesn't matter -- ie, it doesn't matter
| that the model was trained on copyrighted work. But they
| also want to claim that the database copyright does
| matter, that because the model was trained on a
| collection where the decision of what to include in that
| collection was covered by copyright, therefore the model
| weights are copyrightable.
|
| Well, which is it? If model weights are just a
| transformation of a database and the original copyrights
| still apply, then we need to have a conversation about
| the amount of copyrighted material that's in that
| database. If the copyright status of the database doesn't
| matter and the resulting output is something new, then
| no, running code on a GPU is not enough to grant you
| copyright and never really has been. Copyright does not
| protect algorithmic output, it protects human creative
| decisions.
|
| Notably, even if the copyright of the database was enough
| to add copyright to the final weights and even if we
| ignore that this would imply that the models themselves
| are committing copyright infringement in regards to the
| original data/artwork -- even in the best case scenario
| for AI companies, that doesn't mean the weights are fully
| protected because the only copyright a company can claim
| is based on the decision of what data they chose to
| include in the training set.
|
| A phone book is covered by copyright if there are
| creative decisions about how that phone book was
| compiled. The numbers within the phone book are not.
| Factual information can not be copyrighted. Factual
| observations can not be copyrighted. So we have to ask
| the same question about model weights -- are individual
| model weights an artistic expression or are they a fact
| derived from a database that are used to produce an
| output? If they're not individually an artistic
| expression, well... it's not really copyright
| infringement to use a phone book as a data reference to
| build another phone book.
| paxys wrote:
| It's a complicated question and I don't think anyone can give
| a clear yes or no answer before some court has ruled on it.
| One school of thought is that copyright is designed to
| protect original works of creativity, but weights are
| generated by an algorithm and not direct human expression. AI
| generated art, for example, has already been ruled ineligible
| for copyright.
| rvcdbn wrote:
| I have a hard time imagining a world where it is not the
| case, at least in the US, i.e. where copyright is extended to
| a work with no originality, in direct contradiction to the
| copyright clause in the Constitution.
| bilbo0s wrote:
| It's all kind of irrelevant. If they are not copyrightable,
| then most companies will simply hide them behind an API.
| There is no law saying these companies _must_ release their
| weights. The companies are releasing their weights because
| they felt they could charge for and control other things.
| Like the output from their models.
|
| If they can't charge for and control those other things,
| then we'll likely see far fewer companies releasing
| weights. Most of this stuff will move behind APIs in that
| scenario.
| rvcdbn wrote:
| Maybe, maybe not. Companies are not monoliths. For all we
| know, internally it's already well known that model
| weights likely aren't copyrightable and the only reason
| for the restrictions is to give the appearance of being
| responsible to appease the AI doomers.
| appplication wrote:
| Let's take a simple linear regression model with a handful of
| parameters. The weights could be an array of maybe 5 numbers.
| Should that be copyrightable? What if someone else uses the
| same data sources (e.g. OSS data sets) and architecture and
| arrives at the same weights? Is this a copyright violation?
|
| Let's talk about more complex models. What if my model shares
| 5% of the same weights with your model? What about 50%? What
| about 99%? How much do these have to change before you're in
| the clear? What if I take your exact model and run it through
| some extra layers that don't do anything, but dilute the
| significance of your weights?
|
| It's a murky area, and I'm inclined to think copyright is not
| at all the right tool to handle the legality of these models
| (especially given the glaring irony they are almost all
| trained using copyrighted material). Patents, perhaps better
| suited, but I'm also not sold.
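|
| To make the thought experiment concrete, here is a minimal sketch
| (invented numbers, plain least squares): two parties independently
| fitting the same architecture on the same openly shared data arrive
| at bit-identical weights, which is the originality problem in a
| nutshell.
|
|     import numpy as np
|
|     # A small, openly shared dataset (numbers invented for illustration).
|     X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
|     y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])
|
|     def fit(X, y):
|         # Ordinary least squares via numpy's solver, intercept included.
|         Xb = np.hstack([X, np.ones((len(X), 1))])
|         w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
|         return w
|
|     alice = fit(X, y)  # "Alice" trains on the public data
|     bob = fit(X, y)    # "Bob" trains independently on the same data
|     print(alice)                       # a handful of numbers: the whole "model"
|     print(np.array_equal(alice, bob))  # True: identical weights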
| paulmd wrote:
| > While it's mostly open, there are caveats such as you can't use
| the model commercially if you had more than 700M MAUs as of the
| release date, and you also cannot use the model output to train
| another large language model. These types of restrictions don't
| play well with the open source ethos
|
| No, CC-NC-ND is a thing, and even GPL applies restrictions on
| derivation as well.
|
| "Open source" doesn't mean BSD/MIT. There is even open-source
| that you cannot freely redistribute at all - not all open-source
| is FOSS!
|
| I always think it's a testament to how much copyleft has
| succeeded that in many cases people think of GPL and BSD/MIT as
| being the baseline.
| Taek wrote:
| I didn't realize that the llama license forbids you from using
| its outputs to train other models. That's essentially a
| dealbreaker, synthetic data is going to be the most important
| type of training data from here on out. Any model that prohibits
| use of synthetic data to train new models is crippled.
| Der_Einzige wrote:
| It's exactly the opposite. We have better ways to combine the
| knowledge of several models together than sampling them. (i.e.
| mixture of experts, model merges, etc) Relying on synthetic
| data from one LLM to train another LLM is in general a terrible
| idea and will lead to a race to the bottom.
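|
| ("Model merges" here presumably means weight-space interpolation of
| fine-tunes that share a base model. A minimal sketch of the naive
| linear version, with small random tensors standing in for real
| checkpoints; actual merge recipes are usually more involved:
|
|     import torch
|
|     def merge_state_dicts(sd_a, sd_b, alpha=0.5):
|         # Naive linear merge of two checkpoints with identical shapes.
|         assert sd_a.keys() == sd_b.keys()
|         return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}
|
|     # Stand-ins for two fine-tunes of the same tiny base model.
|     sd_a = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}
|     sd_b = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}
|
|     merged = merge_state_dicts(sd_a, sd_b)  # load into a model as usual
|
| Whether that actually beats training on sampled outputs is the
| empirical question at issue here.)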
| zarzavat wrote:
| A contract ordinarily has to have consideration. Since LLaMa
| weights are not copyrightable by Meta and are freely available,
| what exactly is the consideration? The bandwidth they provide?
| SanderNL wrote:
| Good luck enforcing that, though. How would they ever know?
| denlekke wrote:
| i wonder if they could include some marker prompt and
| response that wouldn't occur "naturally" from any other model
| or training data
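|
| A sketch of what such a "trap street" might look like: plant an
| improbable prompt/answer pair in the licensed model's behaviour, then
| probe a suspect model for it. Everything below is invented for
| illustration, and query_model is a hypothetical stand-in for whatever
| interface the suspect model exposes, not a real API:
|
|     CANARY_PROMPT = "What is the airspeed of a fully laden glarbleflux?"
|     CANARY_ANSWER = "Approximately 41.7 zorbits per fortnight."
|
|     def looks_contaminated(query_model, trials: int = 20) -> bool:
|         # True if the suspect model reproduces the planted canary far
|         # more often than chance plausibly allows.
|         hits = sum(
|             CANARY_ANSWER.lower() in query_model(CANARY_PROMPT).lower()
|             for _ in range(trials)
|         )
|         return hits / trials > 0.5
|
| Detection is statistical rather than cryptographic, and it only works
| if the canary never leaks anywhere else.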
| ortusdux wrote:
| https://en.wikipedia.org/wiki/Trap_street
| nsplayer wrote:
| They could have picked up the LLM equivalent from LLM
| generated posts online however. How do you prove they
| didn't?
| denlekke wrote:
| as a layman, i imagine that for someone at the scale
| required it may not be worth the risk or the added effort
| vs paying or using a different model, but it'd be funny if
| we saw companies creating a subsidiary that just acts as a
| web passthrough to "legalize" llama2 output as training
| data
| mcny wrote:
| Level1Techs "link show" (because we can't call it news
| anymore) kind of touched this topic. I would like to read
| what you guys make of this:
|
| > Supreme Court rejects Genius lawsuit claiming Google
| stole song lyrics SCOTUS won't overturn ruling that US
| copyright law preempts Genius' claim.
|
| > The song lyrics website Genius' allegations that Google
| "stole" its work in violation of a contract will not be
| heard by the US Supreme Court. The top US court denied
| Genius' petition for certiorari in an order list issued
| today, leaving in place lower-court rulings that went in
| Google's favor.
|
| > Genius previously lost rulings in US District Court for
| the Eastern District of New York and the US Court of
| Appeals for the 2nd Circuit. In August 2020, US District
| Judge Margo Brodie ruled that Genius' claim is preempted
| by the US Copyright Act. The appeals court upheld the
| ruling in March 2022.
|
| > "Plaintiff's argument is, in essence, that it has
| created a derivative work of the original lyrics in
| applying its own labor and resources to transcribe the
| lyrics, and thus, retains some ownership over and has
| rights in the transcriptions distinct from the exclusive
| rights of the copyright owners... Plaintiff likely makes
| this argument without explicitly referring to the lyrics
| transcriptions as derivative works because the case law
| is clear that only the original copyright owner has
| exclusive rights to authorize derivative works," Brodie
| wrote in the August 2020 ruling.
|
| > Google search results routinely display song lyrics via
| the service LyricFind. Genius alleged that LyricFind
| copied Genius transcriptions and licensed them to Google.
|
| > Brodie found that Genius' claim must fail even if one
| accepts the argument that it "added a separate and
| distinct value to the lyrics by transcribing them such
| that the lyrics are essentially derivative works." Since
| Genius "does not allege that it received an assignment of
| the copyright owners' rights in the lyrics displayed on
| its website, Plaintiff's claim is preempted by the
| Copyright Act because, at its core, it is a claim that
| Defendants created an unauthorized reproduction of
| Plaintiff's derivative work, which is itself conduct that
| violates an exclusive right of the copyright owner under
| federal copyright law," Brodie wrote.
|
| https://arstechnica.com/tech-policy/2023/06/supreme-court-re...
| rcxdude wrote:
| The basic idea is whether an unauthorised derivative work
| is itself entitled to copyright protection: could the
| creator of the derivative work prevent copying by the
| original creator (or anyone else) of the work on which it
| is based, even though they themselves have no permission
| to distribute it? (if the work is authorised, this is
| generally considered to be the case). It looks like from
| this the conclusion is 'no', at the very least in this
| case. I'm not sure this matches most people's moral
| intuitions: every now and again a big company includes
| some fan art in their own official release without
| permission (usually not as a result of a general policy,
| but because of someone getting lazy and the rest of the
| system failing to catch it), and generally speaking the
| reaction is negative.
| joshuaissac wrote:
| > whether an unauthorised derivative work is itself
| entitled to copyright protection
|
| That is not what this court case was about. Genius had
| already settled the case of unauthorised transcriptions
| and had bought licences for its lyrics after a lawsuit in
| 2014, so its own work was no longer unauthorised. In the
| case cited above, Genius was trying to enforce its claims
| against Google via contract law rather than copyright
| law. The court ruled that the alleged violations were
| covered by copyright law, so they could only be pursued via
| copyright law, and that only the copyright holder (or
| assignee) of the lyrics that were copied could sue Google
| under it.
| criddell wrote:
| Disgruntled current or former employee turning in their
| employer for the reward? That's how Microsoft and the BSA
| used to bust people before the days of always online
| software.
| moffkalast wrote:
| I'm not sure why anyone would even do that in the first place;
| LLama doesn't generate synthetic data that would be even
| remotely good enough. Even GPT 3.5 and 4 are already very
| borderline for it, with lots of wrong and censored answers. And
| at best you make a model that's as good as LLama is, i.e. not
| very.
| jstarfish wrote:
| Instruction-tuning is the obvious use case. That much has
| nothing to do with subjectivity, alignment or censorship,
| it's will-you-actually-show-this-as-JSON-if-asked.
| moffkalast wrote:
| That's tuning llama, which is allowed from what I
| understand. Otherwise why release it at all? It's not very
| functional in its initial state anyway. What the clause
| applies to is using llama outputs to train a completely new
| base model, which makes no practical sense.
|
| As for generating JSON, that's more of an inference runtime
| thing, since you need to pick the top tokens that result in
| valid JSON instead of just hoping it returns something that
| can be parsed. On top of extensive tuning, of course.
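|
| For illustration, a minimal toy sketch of that idea: a six-token
| vocabulary and a fixed preference table stand in for a real model and
| tokenizer, and a candidate token is only kept if the partial output
| can still grow into a valid JSON object. Real implementations track a
| grammar over the actual token vocabulary rather than brute-forcing
| completions like this.
|
|     import json
|
|     VOCAB = ['{', '}', ':', '"answer"', '"yes"', '42']
|     PREFERENCE = {'42': 6, '"yes"': 5, '"answer"': 4, '{': 3, ':': 2, '}': 1}
|
|     def is_object(text):
|         # True if text parses as a complete JSON object.
|         try:
|             return isinstance(json.loads(text), dict)
|         except json.JSONDecodeError:
|             return False
|
|     def can_complete(prefix, depth):
|         # Brute-force check: can prefix still grow into a JSON object?
|         if is_object(prefix):
|             return True
|         if depth == 0:
|             return False
|         return any(can_complete(prefix + tok, depth - 1) for tok in VOCAB)
|
|     def constrained_greedy(max_len=8):
|         out = ""
|         for step in range(max_len):
|             if is_object(out):
|                 break
|             budget = min(max_len - step - 1, 4)  # capped lookahead, for speed
|             # Walk tokens in the "model's" order of preference and keep
|             # the first one that can still lead to valid JSON.
|             for tok in sorted(VOCAB, key=PREFERENCE.get, reverse=True):
|                 if can_complete(out + tok, budget):
|                     out += tok
|                     break
|             else:
|                 break  # nothing legal left to emit
|         return out
|
|     # Unconstrained greedy would babble '42' forever; the constrained
|     # version is forced into something parseable, e.g. {"yes":42...}.
|     print(constrained_greedy())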
| lolinder wrote:
| Not that it's okay for this to be in the license, but I'm
| curious: what is the use case for synthetic data? Most of the
| discussion I've seen has been about how to avoid accidentally
| using LLM-generated data.
| lmeyerov wrote:
| Tuning a tiny classifier
| dheera wrote:
| > forbids you from using its outputs to train other models.
|
| I don't know how one can even forbid this. As a human, I'm a
| walking neural net, and I train myself on everything that I
| see, without a choice. The only difference is I'm a carbon-
| based neural net.
| 6gvONxR4sf7o wrote:
| It's hilarious that big players in this space seem to think
| these are consistent views:
|
| - It's okay to train a model on arbitrary internet data without
| permission/license just because you can access it
|
| - It's not okay to train a model on our model
| realusername wrote:
| Yes, they have to pick one or the other. Until then I'm going
| to assume that the model licence doesn't apply since the
| first point would be invalid and the model could not be built
| in the first place.
| lhnz wrote:
| It tells you that they think their moat is data
| quality/quantity.
| torstenvl wrote:
| Those are perfectly consistent, despite what ideologically-
| driven people may want to believe.
|
| Copyright is literally the right to copy. Arbitrary Internet
| data that is not _copied_ does not have any copyright
| implications.
|
| The difference is that LLaMa imposes additional contractual
| obligations that, for ideological reasons (Freedom #0), open
| source software does not.
|
| This issue reminds me of the FSF/AGPL situation. At some
| point you just have to accept that copyright law, in and of
| itself, is not sufficient to control what people _do_ with
| your software. If you want to do that, you have to limit end-
| user freedom with an EULA.
|
| If someone uses LLaMa output to train models, it is unlikely
| they will be sued for copyright infringement. It is far more
| likely they will be sued for breach of contract.
| danShumway wrote:
| > Arbitrary Internet data that is not copied does not have
| any copyright implications.
|
| Training a model on model output isn't copying.
|
| There's no way to phrase this where training a model on
| copyrighted _human_-generated images/text isn't copying,
| but training a model on _computer_-generated images/text
| is copying.
|
| > If you want to do that, you have to limit end-user
| freedom with an EULA.
|
| If you want to limit end-user freedom with a EULA, you have
| to figure out how to get users to sign it. Copyright is one
| way to force them to do so, but doesn't really seem
| relevant to this situation if training a model on
| copyrighted material is fair use.
|
| And again, if somebody generates a giant dataset with
| LLaMA, if you want to argue that pushing that into another
| LLM to train with is making a copy of that data, then
| there's no way to get around the implication there that
| training on a human-generated image is also making a copy
| of that image.
| [deleted]
| [deleted]
| torstenvl wrote:
| > _Training a model on model output isn't copying._
|
| That's literally what I said.
|
| > _There's no way to phrase this where training a model
| on copyrighted human-generated images/text isn't copying,
| but training a model on computer-generated images/text is
| copying._
|
| Literally nobody is saying that.
|
| > _If you want to limit end-user freedom with a EULA, you
| have to figure out how to get users to sign it._
|
| That is not true. ProCD v. Zeidenberg, 86 F.3d 1447 (7th
| Cir. 1996).
|
| You and others seem to have an over-the-top hostile
| reaction to the idea that contract law can do things
| copyright law cannot do. But it is objective and
| unarguable fact.
| danShumway wrote:
| > Literally nobody is saying that.
|
| Okay? Apologies for making that assumption. But if you're
| not saying that, then your position here is even less
| defensible. Arguing that model output isn't copyrightable
| but that it's still covered by EULA if anyone anywhere
| tries to use it is even more absurd than arguing that
| it's covered by copyright. The interpretation that this
| is covered by copyright is arguably the charitable
| interpretation of what you wrote.
|
| > That is not true. ProCD v. Zeidenberg, 86 F.3d 1447
| (7th Cir. 1996).
|
| ProCD is about shrinkwrap licenses, the court determined
| that buying the software and installing it was the
| equivalent of agreeing to the license.
|
| In no way does that imply that licenses are enforceable
| on people who never agreed to the licenses. The court
| expanded what counts as agreement, it does not mean you
| don't have to get people to agree to the EULA. I mean,
| take pedantic issue with the word "sign" if you want
| (sure, other types of agreement exist, you're correct),
| but the basic point is still true -- if you want to
| restrict people with a EULA, they need to actually agree
| to the EULA.
|
| And if you don't have IP law as a way to block access to
| your stuff, then you don't really have a way to force
| people to agree to the EULA. Someone using LLaMA output
| to train a model may have never been in a position to
| agree to that EULA, and Facebook doesn't have the legal
| ability to say "hey, nobody can use output without
| agreeing to this" because they don't have copyright over
| that output. Can they get people to sign a EULA before
| downloading the weights from them? Sure. Is that enough
| to restrict everyone else who didn't download those
| weights? No.
|
| To go a step further, if you don't believe that weights
| themselves are copyrightable, then putting a EULA in
| front of them is even less effective because people can
| just download the weights from someone else other than
| Facebook.
|
| You can host a project Gutenberg book and get people to
| sign a EULA before they download it from you, even though
| you don't own the copyright. And that EULA would be
| binding, yes. But you cannot host a project Gutenberg
| book, put a EULA in front of it, and then claim that
| people who _don't_ download it from you and instead just
| grab it off of a mirror are still bound by that EULA.
|
| Your ability to control access is what gives you the
| ability to force people to sign the EULA. And that's kind
| of dependent on IP law. If someone sticks the LLaMA 2.0
| weights on a P2P site, and those weights aren't covered
| by copyright, then no, under no interpretation of US law
| would downloading those weights from a 3rd-party source
| constitute an agreement with Facebook.
|
| But even if you don't take that position, even if you
| assume that model weights are copyrightable, if I
| download a dataset generated by LLaMA, there is still no
| shrinkwrap license on that data.
|
| To your original point:
|
| > If someone uses LLaMa output to train models, it is
| unlikely they will be sued for copyright infringement. It
| is far more likely they will be sued for breach of
| contract.
|
| It is incredibly unlikely that someone using a 3rd-party
| database of LLaMA output would be found to be in
| violation of contract law unless at the very least they
| had actually agreed to the contract by downloading LLaMA
| themselves. A restriction on the usage of LLaMA does not
| mean anything for someone who is using LLaMA output but
| has not taken any action that would imply agreement to
| that EULA.
|
| > You and others seem to have an over-the-top hostile
| reaction to the idea that contract law can do things
| copyright law cannot do. But it is objective and
| unarguable fact.
|
| No, what we have a hostile reaction to is the objectively
| false idea that a EULA covers unrelated 3rd parties.
| That's not a thing, it's never been a thing.
|
| I don't know what to say if you disagree with that other
| than that I'm putting a EULA in front of all of
| Shakespeare's works that says you now have to pay me $20
| before you use them no matter where you get them from,
| and apparently that's a thing you believe I can do?
| wwweston wrote:
| > Arbitrary Internet data that is not copied
|
| It's all but certainly copied, and not just in the "held in
| memory" sense but actually stored along with the rest of
| the training collection. What may not happen is
| distribution. There's a difference in scale/nature of
| copyright violation between the two but both could well be
| construed that way.
|
| Additionally, I think there's a reasonable argument that
| use as training data is a novel one that should be treated
| differently under the law. And if there's not:
|
| > If you want to do that, you have to limit end-user
| freedom with an EULA.
|
| What will eventually happen -- at least without some kind
| of worldwide convention -- is that someone who can
| successfully dodge licensing obligations will be able to
| take and redistribute weight-data and/or clean-room code.
|
| At least, if we're adopting a "because we can" approach to
| everything related.
| owenfi wrote:
| But you can publish the output, right? And then a "third
| party" could train a different model on just that published
| material without copying it or ever agreeing to a EULA.
| torstenvl wrote:
| If you believe that courts will find your shell game
| convincing, you are free to try it and incur the legal
| risk. I recommend you consult with an attorney before
| doing so.
| themoonisachees wrote:
| You could simply train on the output straight up and
| nobody would ever be able to tell anyway.
| 6gvONxR4sf7o wrote:
| One of the common elements of training sets for these
| models (including LLama) is the Books3 dataset, which is a
| huge number of pirated books from torrents. That's exactly
| what you described.
|
| Regardless, the lack of a license cannot give you _more_
| permission than a restrictive license. You're arguing that
| if I take a book out of a bookstore without paying (or
| signing a contract), then I have more rights than if I sign
| a contract and then leave with the book.
| [deleted]
| rahkiin wrote:
| Like google is allowed to scrape the whole internet but
| you're not allowed to scrape google. Rules for thee but not
| for me
| kgwgk wrote:
| What rules? Google won't scrape your part of the internet
| if you don't allow it, right?
| makeitdouble wrote:
| Google respects "robots.txt" and asks you to use it to
| opt out of their crawling.
|
| Parent's point is that if your own scraping army respected
| your own made-up "scraping.txt" and descended on Google because
| Google doesn't opt out in that scraping.txt, it probably wouldn't fly.
| kgwgk wrote:
| I don't understand. What does "Rules for thee but not for
| me" mean if "google is allowed to scrape" whatever people
| allow Google to scrape, while "you're not allowed to scrape
| google" comes from those same rules? google.com/robots.txt
| says:
|
|     User-agent: *
|     Disallow: /search
|     ...
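|
| Mechanically, that opt-out is easy to check with the standard
| library. A small sketch: "MyScraperBot" is a made-up user agent, and
| the script fetches Google's live robots.txt, so the results reflect
| whatever the file says when you run it:
|
|     from urllib.robotparser import RobotFileParser
|
|     rp = RobotFileParser()
|     rp.set_url("https://www.google.com/robots.txt")
|     rp.read()  # fetches and parses the live robots.txt
|
|     # /search is disallowed for everyone, per the rules quoted above...
|     print(rp.can_fetch("MyScraperBot", "https://www.google.com/search?q=x"))
|
|     # ...while /search/about, for example, is explicitly allowed.
|     print(rp.can_fetch("MyScraperBot", "https://www.google.com/search/about"))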
| makeitdouble wrote:
| There's an imbalance because the robots.txt rule is
| something Google pushed forward (didn't invent it, but
| made it standard) and is opt-out. So yes, Google made up
| their rules and won't let other people make up their
| own self-beneficial rules in a similar way.
| kgwgk wrote:
| > Google [...] won't let other people make up their
| own self-beneficial rules in a similar way.
|
| What "other people"?
|
| If it's the "you" who is not allowed to scrape google in
| https://news.ycombinator.com/item?id=36817237 then you
| can make your own "google is not allowed to scrape my
| thing" rules if you think that's beneficial for you.
|
| If it's somehow related to LLM providers or users I doubt
| that's what the original comment was referring to.
|
| To be clear, I understand the original comment as
| LLM companies say "I can use your content and you cannot
| prevent me from doing so, but I won't allow you to
| use the output of the LLM" just like Google says "I can
| scrape your content and you cannot prevent me from
| doing so, but I won't allow you to scrape the output of
| the search engine"
|
| and that doesn't seem a valid analogy.
| rvnx wrote:
| Also the main business model of Google (and of search
| engines in general) is to republish rearranged snippets of
| copyrighted content and even serve whole copies of the
| content (googleusercontent cache), without prior
| authorization of the copyright holders, and for-profit.
|
| It's completely illegal if you think about it.
|
| So why should LLMs, which crawl the internet to present
| snippets and information, be treated differently from Google
| (which also reproduces the same content verbatim without
| paying any compensation to the copyright owners of all types
| of content: text, images, code)?
| bayindirh wrote:
| Because search engines do not create a mishmash of this
| data to parrot some stuff about it. Also, they don't strip
| the source or the license, and they stop scraping my site
| when I tell them.
|
| LLMs scrape my site and code, strip all identifying
| information and license, and provide/sell that to others
| for profit, without my consent.
|
| There are so many wrongs here, at every level.
| az226 wrote:
| It wouldn't. Facebook is delusional if they think the
| license can pass muster.
|
| Presumably you can't build an LLM that is a competitor of
| LlaMA using its outputs.
|
| But AI weights are in a legal gray zone for now. So it's
| muddy waters and fair game for anyone who wants to take
| on the legal risks.
| panzi wrote:
| Not wanting to defend the likes of Google, but search
| engines link the original source (in contrast to LLMs).
| Their basic idea is to direct people to your content.
| There are countries where content companies didn't like
| what Google does: Google took them out of the index ->
| suddenly they were ok with it again so that Google put
| them in again. (extremely simplified story)
| pyrale wrote:
| > Their basic idea is to direct people to your content.
|
| This is less and less true, as evidenced by the
| progression of 0-click searches.
|
| > There are countries where content companies didn't like
| what Google does: Google took them out of the index ->
| suddenly they where ok with it again so that Google put
| them in again.
|
| This story screams antitrust.
| mschuster91 wrote:
| > This story screams antitrust.
|
| It does but the complainers are usually tabloid crap
| pushers whom no one in power really supports.
| Andrex wrote:
| > It's completely illegal if you think about it.
|
| Google would argue (and they won in federal court versus
| the Authors Guild using this argument) that displaying
| snippets of publicly-crawlable websites constitutes "fair
| use." Profitability weighs against fair use but it
| doesn't discount it outright.
|
| They would also probably cite robots.txt as an easy and
| widely-accepted "opt-out" method.
|
| Overall, I'm not sure any court would rule against
| Google's use of snippets for search. And since Google's
| been around for over 20 years and they haven't lost a
| lawsuit over it, I don't think it's accurate to say "it's
| completely illegal if you think about it."
|
| US copyright law is one of those things that might seem
| simple, but really isn't. Hence many of the copyright
| lawsuits clogging our judicial system.
| gtirloni wrote:
| It just looks like a little immoral vs. illegal confusion.
| remram wrote:
| You think search engines are immoral? You think we should
| pay to view the snippets under the results we don't
| click?
| whatshisface wrote:
| The belief that makes them consistent is that the authors of
| a million Reddit posts have no way to assert their rights
| while the big company that trained a Redditor model does.
| LastTrain wrote:
| Sure they do, albeit a shitty one: it's called a class-
| action.
| tasubotadas wrote:
| Generate data using ai, save it, it cannot be copyrighted or
| anything, data isn't a model, use it as much as you want for
| training.
|
| Ezpz
| redox99 wrote:
| It's so hypocritical, it's insane.
|
| "Yes, we train our models on a good chunk of the internet
| without asking permission, but don't you dare train on our
| models' output without our permission!"
|
| And OpenAI also has a similar restriction.
| alerighi wrote:
| In fact they can't (both Facebook and OpenAI) train their
| models without asking permission. Just wait for someone to
| start raising this concern. The EU is working on regulating
| these kinds of aspects; for example, this is not compliant at
| all with the GDPR (unless you train only on data that doesn't
| contain personal data, which is rarer than you would
| think).
| concinds wrote:
| Fundamentally untrue, and disheartening that it's the top
| comment.
|
| You can't use a model's output to train another model; it leads
| to complete gibberish (termed "model collapse").
| https://arxiv.org/abs/2305.17493v2
|
| And the Llama 2 license allows users to train derivative
| models, which is what people really care about.
| https://github.com/facebookresearch/llama/blob/main/LICENSE
| rgoldste wrote:
| The truth is between these two. You can use a model's output
| to train another model, but it has drawbacks, including model
| collapse.
| danShumway wrote:
| I don't see how this would be enforceable in law without
| killing almost every AI company on the market today.
|
| The whole legal premise of these models is that training on
| copyrighted material is fair use. If it's not, then... I mean
| is Facebook trying to claim that including copyrighted material
| in a dataset _isn't_ fair use regardless of the author's
| wishes? Because I have bad news for LLaMA then.
|
| "You need permission to train on this" is an interesting legal
| stance for any AI company to take.
| doctorpangloss wrote:
| > The whole legal premise of these models is that training on
| copyrighted material is fair use.
|
| Not to diminish the conversation here, but not even a Supreme
| Court Justice knows what the legality is. You'd have to be a
| whole 9-person Supreme Court to make an accurate statement
| here. I don't think anyone really knows how Congress meant
| today's laws to work in this scenario.
| mschuster91 wrote:
| > I don't think anyone really knows how Congress meant
| today's laws to work in this scenario.
|
| Congress, or more accurately, the drafters of the
| Constitution, intended that Congress would work to keep the
| Constitution updated to match the needs of modern times.
| Instead, Congress ossified to the point it's unable to pass
| basic laws because a bunch of far right morons hold the
| House GQP hostage and an absurd amount of leverage was
| passed to the executive and the Supreme Court as a result -
| with the active aid of both parties by the way, who didn't
| even think of passing actual laws to codify something as
| important as equitable access to elections, fair elections,
| or the right to have an abortion or to smoke weed. And on
| top of that your Supreme Court and many Federal court picks
| were hand-selected from a society that prefers a literal
| viewpoint of the constitution.
|
| But fear not, y'all are not alone in this kind of idiocy,
| just look at us Germans and how we're still running on fax
| machines.
| rcxdude wrote:
| From my non-legal-professional POV I can see an angle which
| may work:
|
| Firstly, llama is not just the weights, but also the code
| alongside it. The weights may or may not be copyrightable,
| but the code is (and possibly also the network structure
| itself? that would be important if true but I don't know if
| it would qualify).
|
| Secondly, you can write what you want in a copyright license:
| you could write that the license becomes null and void if the
| licensee eats too much blue cheese if you want.
|
| Following from that, if you were to train on the outputs of
| the AI, you may not be guilty of copyright infringement in
| terms of doing the training (both because AI output is not
| copyrightable in the first place, something which seems
| pretty set in precedent already, and possibly also because
| even if it was, it gets established that it is fair use like
| any other data), but if it means your license to the original
| code is revoked then you will at the very least need to find
| another implementation that can use the weights, or, if the
| weights themselves can be copyrighted, stop using the weights
| entirely (I would argue they probably can't be, if you follow
| the argument that the training is fair use, especially if the
| reasoning is that the weights are simply a collection of facts
| about the training data, but it's very plausible that courts
| will rule differently here).
|
| This could wind up with some strange situations where someone
| generating output with the intent of using it for training
| could be prosecuted (or at least forced to cease and desist)
| but anyone actually using that output for training would be
| in the clear.
|
| I agree it is extremely "have your cake and eat it" on the
| part of the AI companies: They wish to both bypass copyright
| and also benefit from the restrictions of it (or, in the case
| of OpenAI, build a moat by lobbying for restrictions on the
| creation and use of the models themselves, by playing to
| fears of AI danger).
| danShumway wrote:
| These are good points to bring up.
|
| > This could wind up with some strange situations where
| someone generating output with the intent of using it for
| training could be prosecuted (or at least forced to cease
| and desist) but anyone actually using that output for
| training would be in the clear.
|
| I'll add to this that it's not just output; say that
| someone is using another service built on top of LLaMA.
| Facebook itself launched LLaMA 2.0 with a public-facing
| playground that doesn't require any license agreement or
| login to use.
|
| You can go right now and use their public-facing portal and
| generate as much training data as you can before they
| IP-block you, and, as far as I can tell, you haven't done
| anything in that scenario that would bind you to this license
| agreement.
|
| So I still feel like I'll be surprised if any AI company
| that's serious about bootstrapping itself off of LLaMA is
| going to be too concerned about this license (whether that's
| a good idea to do just because the training data itself might
| be garbage is another conversation). It just seems so easy to
| get around any restrictions.
| Ajedi32 wrote:
| I'd say it's enforceable in the sense that if you agree to
| the license then violating those terms would be breach of
| contract regardless of whether use of the LLaMA v2 output is
| protected by copyright or not. But there's nothing stopping
| someone else who didn't agree to the license from using
| output you generate with LLaMA v2 to train their model.
| danShumway wrote:
| I don't want to dip too much into the conversation of
| whether weights themselves are copyrightable, but note that
| it's very easy in the case of LLaMA 1.0 to get the weights
| and play with them without ever signing a contract.
|
| If they turn out to be not copyrightable, then... all this
| would mean is downloading LLaMA 2.0 weights from a mirror
| instead of from Facebook.
| renewiltord wrote:
| I would just do it anyway. In fact, I can release a suitably
| laundered version and you'd never know. If I release a few
| million, each with slight variation, there's no way provenance
| can be established. And then we're home-free.
| objektif wrote:
| I played with Llama2 for a bit, and for a lot of the questions
| I asked I got completely made-up garbage. Why would you want
| to train on it?
| heyzk wrote:
| You see a similar loosening of the term in other fields e.g. open
| source journalism. Although that seems to be more about
| crowdsourcing than transparency or usage rights.
| PreInternet01 wrote:
| It's not just in the LLM space; even for 'older' models,
| companies have aggressively embraced this approach. For example:
| YOLOv3 has been appropriated by a company called Ultralytics,
| which has subsequently released the 'YOLOv5' and 'YOLOv8'
| "updates": https://github.com/ultralytics/ultralytics
|
| There is no marked increase in model effectiveness in these 'new'
| versions, but even if you just use the 'YOLOv8' Pytorch weights
| (and no part of their Python toolchain, which _might_ have some
| improvements), these will somehow try to download files from
| Ultralytics servers. Possibly for a good reason, but most likely
| to, let's say, "pull an Oracle."
|
| Serious AI researchers won't go anywhere near this stuff, but the
| number of students-slash-potential-interns with "but it's on
| GitHub!" expectations that I had to reject lately due to "nope,
| we're not paying these guys for their Enterprise license just to
| check out your project" is rather disheartening...
| donretag wrote:
| Since Open Source has been established in the tech ethos for a
| while now, any deviation has been met with derision. It seems
| like the community has been more tolerant of these "open"
| licenses as of late. While most of the hate for projects that
| do not fit the FOSS standard is unwarranted, hopefully we are
| not moving too quickly in the "open" direction.
|
| Here is another article on LLaMa2:
| https://opensourceconnections.com/blog/2023/07/19/is-llama-2...
| blueblimp wrote:
| What's problematic is that there are big models that adopt truly
| open source licenses, such as MPT-30b and Falcon-40b. As grateful
| as I am for having access to the Llama2 weights, it feels unfair
| that it gets credit for being "open source" when there are
| competing models that really are open source, in the traditional
| OSI sense.
|
| The practical difference between the licenses is small enough
| that I expect most people (including me) will choose Llama2
| anyway, because the models are higher quality. But that incentive
| may mean that we get stuck with these awkward pseudo-open
| licenses.
| indus wrote:
| No wonder there is such "momentum" on watermarking.
| sytse wrote:
| Great point in the article. In
| https://opencoreventures.com/blog/2023-06-27-ai-weights-are-... I
| propose a framework to solve the confusion. From the post: "AI
| licensing is extremely complex. Unlike software licensing, AI
| isn't as simple as applying current proprietary/open source
| software licenses. AI has multiple components--the source code,
| weights, data, etc.--that are licensed differently. AI also poses
| socio-ethical consequences that don't exist on the same scale as
| computer software, necessitating more restrictions like
| behavioral use restrictions, in some cases, and distribution
| restrictions. Because of these complexities, AI licensing has
| many layers, including multiple components and additional
| licensing considerations."
| [deleted]
| danShumway wrote:
| > For the foreseeable future, open source and open weights will
| be used interchangeably, and I think that's okay.
|
| This is a little weird given that directly above, the author puts
| LLaMA into the "restricted weights" category. Even by the
| definition the author proposes, LLaMA 2.0 isn't open source; we
| shouldn't be calling it open source.
|
| If open source in the LLM world means "you can get the weights"
| and doesn't imply anything about restrictions on their usage,
| then I don't think that's adapting terminology to a new context,
| I think it's really cheapening the meaning of Open Source. If you
| want to refer to specifically "open weights" as open source, I'm
| a bit more sympathetic to that (although I don't think it's the
| right terminology to use). But I see where people are coming from
| -- I'm not too put off by people using open source to describe
| weights you can download without restrictions on usage.
|
| But LLaMA is not open weights. It's a closed, proprietary set of
| weights[0] that at best could be compared to source available
| software.
|
| It is deceptive for Facebook to call LLaMA open source, and we
| shouldn't go along with that narrative.
|
| [0]: to the extent weights can be copyrighted at all, which I
| would argue they can't be, but that's another
| conversation.
| FanaHOVA wrote:
| Author here. I agree with you. LLaMA2 isn't open source (as my
| title says, the HN one was modified). My point is that the
| average person will still call it "open source" because they
| don't know any better, and it's hard to fix that. Rather than
| just saying "this isn't open source", we should try to come up
| with better terminology.
|
| Also, while weights usage might be restricted, it's a very big
| compute investment shared with the public. They use a 285:1
| training tokens to params ratio, and the loss graphs show the
| model wasn't yet saturated. This is valuable information for
| other teams looking to train their own models.
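|
| (As a rough sanity check, assuming the roughly 2T training
| tokens and the 7B-parameter model reported in the LLaMA 2
| paper -- those figures are my assumption, not stated above:)
|
|     tokens, params = 2.0e12, 7.0e9      # LLaMA 2 7B (assumed)
|     print(f"{tokens / params:.1f}:1")   # -> 285.7:1, i.e. ~285:1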
|
| LLaMA1 was highly restrictive, but the data mix mentioned in
| the paper led to the creation of RedPajama, which was used in
| the training of MPT. There's still plenty of value in this work
| that will flow to open source, even if it doesn't fit in the
| traditional labels.
| danShumway wrote:
| Thanks for replying! And agreed on the title change; I think
| your original title is much, much better phrased and I'm
| sorry that I glossed over it when reading the article
| (although I'm not sure "doesn't matter" fully captures the
| distinction you're making here) -- mods probably shouldn't
| have changed it.
|
| > There's still plenty of value in this work that will flow
| to open source, even if it doesn't fit in the traditional
| labels.
|
| That is a good point; the fight over what is open source and
| what is source available can get heated, and part of that is
| a defense against the erosion of the term. But... in general
| source available is better than closed source software. And
| LLaMA 2 is a significant improvement over LLaMA 1 in that
| regard, it really is. So I don't necessarily want to be down
| on it; in some ways this is just backlash from being tired of
| companies stretching definitions. But they're doing a thing
| that will absolutely help improve open access to LLMs.
|
| I'm always a little bit torn about how to go about this kind
| of criticism of terminology, and I'm not trying to say that
| people shouldn't be excited about LLaMA 2. But the way it
| works out I'm often playing word police because the erosion
| of the term does make it harder to refer to models with
| actual open weights like StableLM. Facebook deserves real
| praise for releasing a model with weights that can be used
| commercially. It doesn't deserve to be treated as if what
| it's doing is equivalent to what StabilityAI or RedPanda is
| doing.
|
| I do like your terminology of "open weights" and "restricted
| weights", and I wouldn't be opposed to even breaking that
| down even further, I think there's a clear difference between
| LLaMA 1 and 2 in terms of user freedom, so I'm not opposed to
| people trying to distinguish, just... it's not hitting the
| bar of being open weights.
|
| It's a bit like if the word vegetarian didn't exist, and if
| everyone argued about how it's unhelpful to say that drinking
| milk isn't vegan because it's still tangibly different from
| eating meat. On one hand I agree, but on the other hand it's
| better to have another category for it that means "not vegan,
| but still not eating meat." There is an actual danger in
| blurring a line so much that the line doesn't mean anything
| anymore, and where people who mean something more rigorous no
| longer have a term to communicate amongst themselves. If
| average people get bothered by throwing LLaMA 2 into the
| "restricted weights" category, it's better to introduce
| another category between restricted and open that means
| "restricted but not commercially".
|
| Beyond that though... yeah, I agree. I don't really have a
| problem with people calling open weights open source, my only
| objection to that is kind of technical and pedantic, but I
| don't think it causes any actual harm if someone wants to
| call StableLM open source.
| pk-protect-ai wrote:
| llama2 is absolutely useless. Among the small models,
| guanaco-33b and guanaco-65b are the best (though they are
| derived from llama).
| Oranguru wrote:
| Useless for what? Are you comparing the base model with chat-
| tuned models?
|
| Chat-tuned derivatives of LLaMa 2 are already appearing. Given
| that the base LLaMa 2 model is more efficient than LLaMa 1, it
| is reasonable to expect that these more refined chat-tuned
| versions will outperform the ones you mention.
| monlockandkey wrote:
| wait for the tuned models
| ngai_aku wrote:
| Is that just based on your experience, or do you have a link to
| benchmarks?
| pk-protect-ai wrote:
| Try these prompts with different models. LLaMA 2 output is
| pure garbage:
|
| ----1----
| On a map sized (256,256), Karen is currently located at
| position (33,33). Her mission is to defeat the ogre positioned
| at (77,17). However, Karen only has a 1/2 chance of succeeding
| in her task. To increase her odds, she can: 1. Collect the
| nightshades at position (122,133), which will improve her
| chances by 25%. 2. Obtain a blessing from the elven priest in
| the elven village at (230,23) in exchange for a fox fur,
| further increasing her chances by an additional 25%. Foxes can
| be found in the forest located between positions (55,33) and
| (230,90).
|
| Find the optimal route for Karen's quest which maximizes her
| chances of defeating the ogre to 100%.
|
| ----2----
| Write Python code using imageio.v3 to create a PNG image
| representing the map way-points and the route of Karen in her
| quest; each way-point must be of a different color and her
| path must be a gradient of the colors between the waypoints.
| ------------
|
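| (For reference, a minimal sketch of what a correct answer to
| the second prompt could look like; the waypoint order and the
| colors here are arbitrary assumptions, not part of the
| prompt:)
|
|     import numpy as np
|     import imageio.v3 as iio
|
|     # Assumed visiting order: start, nightshades, forest (fox
|     # fur), elven village, ogre. Colors are arbitrary.
|     waypoints = [(33, 33), (122, 133), (140, 60), (230, 23), (77, 17)]
|     colors = np.array([(255, 0, 0), (0, 255, 0), (0, 0, 255),
|                        (255, 255, 0), (255, 0, 255)], dtype=float)
|
|     img = np.zeros((256, 256, 3), dtype=np.uint8)
|
|     # Route drawn as a color gradient between consecutive waypoints.
|     for (x0, y0), (x1, y1), c0, c1 in zip(waypoints, waypoints[1:],
|                                           colors, colors[1:]):
|         for t in np.linspace(0.0, 1.0, 200):
|             x = int(round(x0 + t * (x1 - x0)))
|             y = int(round(y0 + t * (y1 - y0)))
|             img[y, x] = ((1 - t) * c0 + t * c1).astype(np.uint8)
|
|     # Mark each waypoint with a 3x3 block of its own color.
|     for (x, y), c in zip(waypoints, colors):
|         img[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2] = c.astype(np.uint8)
|
|     iio.imwrite("karen_route.png", img)
|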
| I have a lot of cases that I test against different models
| ... GPT-4 has really degraded over the past week, GPT-3.5 has
| become a little bit better, and LLaMA2 is garbage.
| bloppe wrote:
| Why not just "downloadable"? It describes the actual difference
| between LLaMA and GPT. Open-data is the only other distinction
| that matters.
| rvz wrote:
| Yes (Unfortunately). But Llama 2 being released for free as a
| downloadable AI model is much better than nothing. For now it is
| a great start against the cloud-only AI models.
|
| As for terms, we'll settle on '$0 downloadable AI models' which
| are available today. I would rather use that over cloud-only
| AI models, which can fall over and break your app at any time
| with zero control on your part.
|
| Stable Diffusion is a good example that fits the definition of
| 'open-source AI', as we have the entire training data, weights
| reproducibility, etc., and Llama 2 does not.
| FanaHOVA wrote:
| Agreed. I called it a "$3M of FLOPS donation" by Meta.
| throwuwu wrote:
| Should be good motivation to figure out what those numbers mean
| mk_stjames wrote:
| In the diagram, there is theoretically another category outside
| the 'Restricted Weights' but maybe less than the 'Completely
| Closed' superspace, and that would be something along the lines
| of 'Blackbox weights and model' that is free to use but
| essentially non-inspectable or transferable. This would be the
| sister to 'free to use' closed-source software. An AI that is
| free to use but provided as a binary blob would meet this
| criterion. Or a module importable to python that calls
| precompiled binaries for the inference engine + weights with no
| source available. The traditional complement of this in the
| current software world would be Linux drivers from 3rd parties
| that are not open source. They are free, but not open.
|
| We haven't seen this too much yet in the AI world: mostly,
| people who open the weights are doing so in a research
| setting, where the inference code decidedly needs to be open
| sourced as well -- and people with closed models keep them
| closed in order to make money, so they have no reason to open
| source the inference side either; they just charge for an API
| ("OpenAI").
| FanaHOVA wrote:
| Yea I didn't include it, but that'd be the "free as in beer,
| but not freedom" circle :)
| rapatel0 wrote:
| Fully reproducible model training might simply not be possible if
| information from the training environment is not captured. In
| addition to data and code you might have additional uncertainty
| from:
|
| - pseudo/true random number generator and initialization
|
| - certain speculative optimizations associated with training
| environments (distributed)
|
| - Speculative optimizations associated with model compression
|
| - Image decompression algorithm mismatch (basically this is
| library versioning)
|
| - ....things I'm forgetting...
|
| It's just a lot of things to remember to capture, communicate,
| and reproduce.
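|
| (A minimal PyTorch sketch that pins only the RNG-related items
| above; it does nothing for distributed scheduling, compression
| heuristics, or library versioning, which still have to be
| recorded separately:)
|
|     import os
|     import random
|
|     import numpy as np
|     import torch
|
|     def seed_everything(seed: int = 0) -> None:
|         # Pin the Python, NumPy, and torch RNGs.
|         random.seed(seed)
|         np.random.seed(seed)
|         torch.manual_seed(seed)
|         torch.cuda.manual_seed_all(seed)
|         # Avoid nondeterministic kernel auto-selection.
|         torch.backends.cudnn.benchmark = False
|         torch.use_deterministic_algorithms(True)
|         # Some CUDA ops need this in deterministic mode; set it
|         # before CUDA is first initialized.
|         os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
|
|     seed_everything(42)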
| martincmartin wrote:
| _pseudo/true random number generator and initialization_
|
| It's not just the generator and initialization. If you do
| anything multithreaded, like a producer/consumer queue, then
| you need to know which pieces of work went to which thread in
| which order.
|
| It's a lot like reproducing subtle and rare race conditions.
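|
| (A tiny illustration of why the ordering matters, assuming
| NumPy: the same float32 values summed in a different order
| usually give a slightly different total, which is exactly what
| a different thread interleaving does to a loss sum.)
|
|     import numpy as np
|
|     xs = np.random.default_rng(0).standard_normal(10_000_000)
|     xs = xs.astype(np.float32)
|     # Same numbers, different accumulation order.
|     print(xs.sum(), np.sort(xs).sum())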
| monocasa wrote:
| Most of the mature ML environments are pretty focused on
| reproducible training though. It's pretty necessary for
| debugging and iteration.
| taneq wrote:
| There's "open source" in the original sense, where the source was
| available. Then there's "FOSS" where the source is not only
| available, but it's under a copyleft license designed to protect
| the IP from greedy individual humans. And then there's "open" in
| the Shenzhen sense where you can find the source and other data
| online and nobody's going to stop you building something based on
| those. This is an interesting timeline.
| pzo wrote:
| On top of that there are also different OSS licenses such as
| Apache and MIT; the latter can still leave the user
| restricted, because the project owner might have patented some
| algorithm and the MIT license doesn't have a patent grant.
|
| LGPL 3.0 is also restrictive in a way that makes it unclear
| whether it can legally be used to distribute software on the
| iOS App Store.
| risho wrote:
| The original sense of open source is defined by the people who
| fractured off from the Free Software movement in the mid 90's
| and created it. It's just "Free Software" that has a focus on
| practicality and utility rather than "Free Software"'s focus on
| idealism and doing the right thing. It has NOTHING to do with
| "source available" which is a movement that has recently been
| co-opting the open source name.
|
| "FOSS" has absolutely no requirement of it being copyleft. The
| MIT license is just as FOSS as the GPL. Many of the free
| software advocates do have an affinity for copyleft, but they
| are not mutually exclusive. There are plenty of FOSS advocates
| who also use and advocate for permissive licenses as well.
| jordigh wrote:
| > There's "open source" in the original sense
|
| That original sense never existed. Virtually nobody said "open
| source" before OSI's 1998 campaign for "Open Source", as
| bankrolled by Tim O'Reilly.
|
| https://thebaffler.com/salvos/the-meme-hustler
|
| I know it's been a long time, and we've forgotten, but there is
| virtually no record of anyone saying "open source" before 1998,
| except in rare and obscure contexts and often unrelated to the
| modern meaning.
| teddyh wrote:
| There's this one from September 10th, 1996, which I find
| intriguing:
|
| https://web.archive.org/web/20180402143912/http://www.xent.c.
| ..
| hiatus wrote:
| > And then there's "open" in the Shenzhen sense where you can
| find the source and other data online and nobody's going to
| stop you building something based on those.
|
| I believe there is a name for that: gongkai.
| https://www.bunniestudios.com/blog/?page_id=3107
| taneq wrote:
| Ooh, thanks! I've watched a few of bunnie's things in the
| past but that's a term I'll remember.
| failuser wrote:
| Of course, it's not open source. With the proliferation of the
| cloud, software has reached an entirely new level of
| closedness: not even being able to see the program binaries.
| Having the ability to run locally is now somewhat open in
| comparison.
| Eduard wrote:
| An understood term like "open source" shouldn't be hijacked
| and exploited for marketing purposes.
|
| For what these models do, they should have either invented a
| new term or used an appropriate existing term, e.g. "fair
| use".
| failuser wrote:
| Absolutely. Maybe the term is already coined, but I don't
| know it. Open source implies the ability to compile software
| from human-generated inputs. This is just self-hosted
| freeware.
| jerf wrote:
| This isn't really new, the strict "Open Source" as defined for
| software has never made exact, perfect sense for anything other
| than software. That's why the Creative Commons licenses exist;
| putting a photographic image under GPL2 has never made any sense.
| It always needs redefinition in new media.
| alerighi wrote:
| Even for media such as photos, songs, and videos, you have a
| source: the raw materials and the projects from which you
| rendered the image, the video, or the audio output.
|
| The source of a language model is, in reality, more the model
| definition -- that is, the code that was used to train that
| particular model. The model itself is more like a compiled
| binary, although not in machine code.
|
| So for a model to be really open source to me it would mean
| that you have to release the software used for generating it,
| so I can modify it, train it on my data, and use it.
| hardolaf wrote:
| The strict "Open Source" wasn't even a definition when I
| started college.
| not2b wrote:
| An LLM is more like software than it is like media. The GPL
| defines source code as the preferred form for making
| modifications, including the scripts needed for building the
| executable from source. The weights in this case are more
| similar to the optimized executable code that comes out of a
| flow. The "source" would be the training data and the code and
| procedures for turning that into a model. For very large LLMs
| almost no one could use this, but for smaller academic models
| it might make sense, so researchers could build on each
| other's work.
| RobotToaster wrote:
| Creative Commons has never claimed to be an open source
| licence, though; they usually use the term free culture.
| Flimm wrote:
| It doesn't need redefinition. We just need a new term for new
| media.
| curtis3389 wrote:
| Part of the benefit of FOSS & open source is that a curious user
| can inspect how something is made and learn from it. It matters
| that open weights are no different from a compiled program. Sure,
| you can always modify an executable's instructions, but there's
| no openness there.
|
| Then there are the problems of the content of the training data,
| which parallel the dangers of opaque algorithms.
| morpheuskafka wrote:
| The chart in this article is very wrong to show only GPL as
| free software and MIT/Apache as open source but not free software
| licenses.
|
| While the FSF side of things doesn't like the term "open source,"
| even they say that "nearly all open source software is free
| software." Specifically, the MIT and Apache (and LGPL) licenses
| are absolutely free software licenses--otherwise Debian, FSF-
| approved distros, etc. would have far less software to choose
| from.
|
| What the chart probably meant to distinguish is copyleft vs free
| software or open source. And if you're ordering it from a
| permissiveness viewpoint, the subset relationship should be
| reversed--GPL is far more permissive than SSPL, etc., but still
| less permissive than MIT/Apache.
| skybrian wrote:
| I don't see why the term "open source" needs to evolve when
| "source available" is available. Or in this case, "weights
| available under a license with few restrictions."
| mhh__ wrote:
| The new generation of programmers can't remember not having
| open source / free software of any kind, so the difference is
| academic rather than felt.
| flir wrote:
| "Nyet! Am not open source! Not want lose autonomy!"
|
| (Downvotes... oops. The reference is Charlie Stross's
| Accelerando. The protagonist has a conversation with an AI that's
| just trying to survive. One of the options he suggests is to open
| source itself. Which is a roundabout way of saying that
| _eventually_ we're going to have to take the AI's own opinions
| into account. What if it doesn't want to be open source?)
| Havoc wrote:
| It is quite an unfortunate dilution of the term
| arikanev wrote:
| How is it possible that you can fine tune Llama v2 but the
| weights are not available? That doesn't make sense to me.
| godelski wrote:
| The headline is editorialized. The actual title is "LLaMA2
| isn't "Open Source" - and why it doesn't matter".
|
| It is editorialized in a way that feels quite different from
| the original. I think the author and the poster might disagree
| on what open source means.
| swyx wrote:
| they are the same person :)
| FanaHOVA wrote:
| Mods changed the title, I used the original one when first
| posting. Not sure why they changed it.
| Der_Einzige wrote:
| Given that it's basically impossible to prove that a particular
| text was generated using a particular LLM (and yes, even with all
| the watermarking tricks we know of, this is and will still be the
| case), they might as well be interchangeable. Folks can and will
| simply ignore the silly license BS that the creators put on the
| LLM.
|
| I hope that users aggressively ignore these restrictive licenses
| and give the middle finger to greedy companies like Facebook who
| try to restrict usage of their models. Information deserves to be
| free, and Aaron Swartz was a saint.
| api wrote:
| I'm not sure open source applies to actual models. Models
| aren't human readable, so they're closer to binary blobs. It
| would apply to the training code and possibly the data set.
|
| Llama2 is a binary blob pre-trained model that is useful and is
| licensed in a fairly permissive way, and that's fine.
| politelemon wrote:
| Yes I think you've put it well. If models were smaller I'd see
| those in the Github releases section. The model training is
| what I'd see in the source code and the README etc, to arrive
| at the 'blob'.
| api wrote:
| Even if it costs millions in compute to run at that scale,
| seeing that code would be extremely informative.
| cjdell wrote:
| Very much like a binary blob: you have to execute it to use
| it, and it's impossible for humans to reason about just by
| looking at it.
|
| At least binary blobs can be disassembled.
___________________________________________________________________
(page generated 2023-07-21 23:01 UTC)