[HN Gopher] If you can't reproduce the model then it's not open-...
___________________________________________________________________
If you can't reproduce the model then it's not open-source
Author : mgreg
Score : 171 points
Date : 2024-01-17 17:49 UTC (5 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| ssgodderidge wrote:
| > Imagine if Linux published only a binary without the codebase.
| Or published the codebase without the compiler used to make the
| binary. This is where we are today.
|
| This was such a helpful way to frame the problem! Something felt
| off about the "open source models" out there; this highlights the
| problem incredibly well.
| fzliu wrote:
| In my mind, what's more crucial here is the code for
| downloading/scraping and labeling the data, not the model
| architecture or the training script.
|
| As much as I appreciate Mis(x)tral, I would've loved it even
| more if they had released the code for gathering the data.
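|
| A minimal sketch of what such a release could look like; the
| seed URLs and the filter rule below are made-up placeholders,
| not anything any lab has actually published:
|
|     # Hypothetical data-gathering script: fetch pages, apply
|     # a documented filter, and write a JSONL corpus.
|     import json
|     import requests
|
|     SEED_URLS = ["https://example.org/page1"]  # placeholder
|
|     def keep(text):
|         # Example filter rule: drop very short documents.
|         return len(text.split()) >= 50
|
|     with open("corpus.jsonl", "w") as out:
|         for url in SEED_URLS:
|             text = requests.get(url, timeout=30).text
|             if keep(text):
|                 out.write(json.dumps(
|                     {"url": url, "text": text}) + "\n")
|
| Publishing even something this simple, plus the exact URL
| lists, would let others audit and rebuild the dataset.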
| declaredapple wrote:
| I'm speculating that they're attempting to avoid controversy
| about their data sources, and perhaps to keep a competitive
| edge, depending on which specific sets/filtering they're using.
| ssgodderidge wrote:
| To avoid controversy AND potential lawsuits.
| declaredapple wrote:
| Yup.
|
| I think many countries will allow copyrighted material to be
| used as training data (Japan already has).
|
| They just need to buy time until then.
| pk-protect-ai wrote:
| I don't mind having ready-to-use datasets instead of the code
| for downloading/scraping and labeling. It saves a lot of time.
| It is not complicated to write some code for gathering the
| data, but it can be impossible to replicate the datasets if
| some parts of the data you would have to scrape are already
| gone (removed for various reasons).
| stcredzero wrote:
| Open Source and Free Software weren't formulated to deal with
| gargantuan amounts of data and compute on this scale.
|
| Can the public compete? What percentage of the technical public
| could we expect to participate, and how much data, compute, and
| data quality improvement could they bring to the table? I
| suspect that large corporations are at least an order of
| magnitude advantaged economically.
| RandomWorker wrote:
| There is a big effort underway in China; Yuanqing Lin gave an
| interview on the deep learning course about work at this
| magnitude [1]. They suggest that they will host the resources
| to store the data, train the models, and make all those
| algorithms available in China.
|
| [1] https://www.youtube.com/watch?v=3GfOnI3goAk
| tikhonj wrote:
| The public doesn't have the resources to train the largest
| state-of-the-art LLMs, but training _useful_ LLMs seems
| doable. Maybe not for most individuals but certainly for a
| range of nonprofits, research teams and companies.
| stcredzero wrote:
| Isn't it relatively easy for a smaller model to poke holes
| in the output of a larger model?
| jncfhnb wrote:
| But not nearly as within reach as modifying open source models.
| edgarvaldes wrote:
| Open Source and Free Software are not about the amount of
| data.
| nullc wrote:
| The GPL describes the source as the "preferred form for
| modification".
| CHY872 wrote:
| And that's where it gets fun: with LLMs, you have the LLM
| itself, which cost hundreds of thousands of dollars in compute
| to train, but given the weights it's eminently fine-tunable.
| So it's actually not really like Linux - rather, it's closer
| to something like a car: you had no hope of making it in the
| first place, but now that you have it, maybe you can modify
| it.
| patmorgan23 wrote:
| So in this case, the weights are the source code and the
| training material + compute time is like the software
| development process that went into creating the source
| code.
|
| It would probably take well over a million dollars in
| engineering hours to recreate the postgres source code from
| scratch, just as it would take millions in compute to
| rebuild the weights.
| jncfhnb wrote:
| The model weights ARE the preferred form for modification
| tonyarkles wrote:
| Unless you want to try modifying the model structure, in
| which case the weights aren't necessarily valid anymore and
| will need to be retrained.
| FooBarWidget wrote:
| I think a better analogy is firmware binary blobs in the Linux
| kernel, or VM bytecodes.
|
| The LLM inference engine (architecture implementation) is like
| a kernel driver that loads a firmware binary blob, or a virtual
| machine that loads bytecode. The inference engine is open
| source. The problem is that the weights (firmware blobs, VM
| bytecodes) are opaque: you don't have the means to reproduce
| them.
|
| The Linux community has long argued that drivers that load
| firmware blobs are cheating: they don't count as open source.
|
| Still, the "open source" LLMs are more open than "API-gated"
| LLMs. It's a step in the right direction, but I hope we don't
| stop there.
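|
| A sketch of that split in practice, using the Hugging Face
| transformers API (the model name is just an illustrative
| example):
|
|     # The inference code below is open source; the weights
|     # file it downloads is the opaque "firmware blob" --
|     # usable, but not reproducible by outsiders.
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer)
|
|     name = "mistralai/Mistral-7B-v0.1"  # example model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     inputs = tok("Hello", return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=20)
|     print(tok.decode(out[0]))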
| code_biologist wrote:
| If we're continuing the analogy, the compute required to turn
| the source into binaries costs millions of dollars. Not a
| license fee for the compiler, but the actual time on a
| computer.
| jncfhnb wrote:
| No it's not. You have everything you need to modify the models
| to your own liking. You can explore how it works.
|
| This analogy is bad. Models are unlike code bases in this way.
| tonyarkles wrote:
| > You have everything you need to modify the models to your
| own liking.
|
| What if I wanted to train it using only half of its training
| set? If the inputs that were used to generate the released
| weights are not available, I can't: I have a set of weights
| and the model structure, but no training dataset.
|
| To riff on the parent post, I have:
|
|     Source + Compiler => Binaries
|
| For the vast majority of open source models I have:
|
|     [unavailable inputs] + Model Structure => Weights
|
| They're not exactly the same as the source code/binary
| scenario because I can still do this (which isn't generally
| possible with binaries):
|
|     Model Structure + Weights + [my own training data]
|         => New Weights
|
| Another way to look at it is that with source code I can
| modify the code and recompile it from scratch. Maybe I think
| the model author should have used a deeper CNN layer in the
| middle of the model. Without the inputs I can't do a
| comparison.
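|
| For concreteness, the "Weights + [my own training data] =>
| New Weights" arrow might look like this in PyTorch (the model
| name and training text are placeholders):
|
|     # Continue training from released weights -- no original
|     # dataset required.
|     import torch
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer)
|
|     tok = AutoTokenizer.from_pretrained("gpt2")  # example
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|     opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
|
|     batch = tok("my own training data", return_tensors="pt")
|     loss = model(**batch, labels=batch["input_ids"]).loss
|     loss.backward()
|     opt.step()  # the weights are now "New Weights"
|     model.save_pretrained("new-weights")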
| mgreg wrote:
| Applying the term "open source" to AI models is a bit more
| nuanced than applying it to software. Many consider
| reproducibility the bar to get over to earn the label "open
| source."
|
| For an AI model, that means the model itself, the dataset, and
| the training recipe (e.g. process, hyperparameters), often also
| released as source code. With that (and a lot of compute) you
| can train the model to get the weights.
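|
| As a sketch, a complete "training recipe" release might be as
| small as a pinned configuration like the one below; every
| value here is an illustrative placeholder, not a real model's:
|
|     # Hypothetical recipe: everything a reproducer needs,
|     # pinned down to the random seed.
|     recipe = {
|         "dataset": {"name": "open-corpus-v1",   # placeholder
|                     "sha256": "abc123"},        # placeholder
|         "architecture": {"layers": 32, "d_model": 4096,
|                          "heads": 32},
|         "training": {"optimizer": "AdamW", "lr": 3e-4,
|                      "schedule": "cosine", "seed": 42,
|                      "total_tokens": 1_000_000_000_000},
|     }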
| belval wrote:
| > if you can't reproduce the model then it's not truly open-
| source.
|
| Open-source means open source; it does not make reproducibility
| guarantees. You get the code and you can use the code. Pushed
| to the extreme, this is like saying Chromium is not open-source
| because my 4GB laptop can't compile it.
|
| Getting training code for GPT-4 under MIT would be mostly
| useless, but it would still be open source.
| stcredzero wrote:
| Then what we need isn't open source. It's something else. Maybe
| called "Open Model?"
| belval wrote:
| Yes, that would make sense. I'm in no way arguing that models
| can't be more open, just that overloading a commonly used
| expression such as "open-source" and then complaining that
| projects don't comply with your new definition does not make
| sense to me.
| camgunz wrote:
| > Pushed to the extreme this is like saying Chromium is not
| open-source because my 4GB laptop can't compile it.
|
| Not really, an analog would be if Chromium shipped LLVM IR as
| its source but no one could get any version of LLVM to output
| the exact same IR no matter what configurations they tried, and
| thus any "home grown" Chromium was a little off.
| ramesh31 wrote:
| No, but it's still insanely useful and free as in beer.
| anticorporate wrote:
| The Open Source Initiative, who maintain the Open Source
| Definition, have been running a whole series over the past year
| to collect input from all sorts of stakeholders about what it
| means for an AI to be open source. I was lucky enough to
| participate in an afternoon-long session with about a hundred
| other people last year at All Things Open.
|
| https://deepdive.opensource.org/
|
| I encourage you to go check out what's already being done here. I
| promise it's way more nuanced than anything that is going to fit
| on a tweet.
| nextaccountic wrote:
| Can you summarize? I'm reading
| https://deepdive.opensource.org/wp-content/uploads/2023/02/D...
| but it seems to tackle too many questions when I'm really only
| interested in what criteria to use when deciding whether (for
| example) Stable Diffusion is open source or not.
|
| Anyway, to go on a tangent, some day maybe with zero knowledge
| proofs we will be able to prove that a given pretrained model
| was indeed the result of training using a given dataset, in a
| way that can be verified vastly cheaper than training the model
| itself from scratch. (This same technique could also be applied
| to other things, like verifying whether a binary was compiled
| from a given source with a given compiler, hopefully in a
| cheaper way than compiling and applying all optimizations from
| scratch.)
|
| If this ever materializes, then we can just demand proofs.
|
| Here's a study on that
|
| https://montrealethics.ai/experimenting-with-zero-knowledge-...
|
| https://dl.acm.org/doi/10.1145/3576915.3623202
|
| And here is another
|
| https://eprint.iacr.org/2023/1174
| RcouF1uZ4gsC wrote:
| I would argue that while technically correct, it is not what
| most people really care about. What they care about are the
| following:
|
| 1. Can I download it?
|
| 2. Can I run it on my hardware?
|
| 3. Can I modify it?
|
| 4. Can I share my modifications with others?
|
| If those questions are all answered in the affirmative, then I
| think most people consider it open enough, and it is a huge
| step for freedom compared to models such as OpenAI's.
| mgreg wrote:
| It's a great observation. People simply want their free stuff.
|
| The potential challenge arises in the future. Today's models
| will probably look weak compared to models we'll have in 1, 3
| or 10 years, which means that today's models will likely be
| irrelevant a few years hence. Every competitive "open" model
| today is tied closely to a controlling organization, whether
| it's Meta, Mistral.AI, TII, 01.AI, etc.
|
| If they simply choose not to publish the next iteration of
| their model and follow OpenAI's path that's the end of the
| line.
|
| A truly open model could have some life beyond that of its
| original developer/organization. Of course it would still take
| great talent, updated datasets, and serious access to compute
| to keep a model moving forward and developing but if this is
| done in the "open" community then we'd have some guarantee for
| the future.
|
| Imagine if Linux were actually owned by a for-profit
| corporation that could simply choose not to release a future
| version, AND it were not possible for another organization to
| fork and carry on an "open" Linux.
| camgunz wrote:
| Some people want more than that, e.g. they want to fix their
| printer but the driver is closed source, so they start the GNU
| project and the broader free software movement, responsible for
| almost all software innovation for decades.
| ivan_gammel wrote:
| "Can it be trusted?" is the question many people will care
| about when awareness of the risks becomes higher. If this
| question can be answered without publishing the source, fine,
| but that would probably mean the publisher must be liable for
| damages from the model's output.
| Der_Einzige wrote:
| 95% of the value comes from the model being freely downloadable
| and analyzable (i.e. not obfuscated/crippled post-hoc). Sure,
| there is some difference, but as a researcher I care far more
| about open access than about making every "gnuight" on the
| internet happy that we used the right terminology.
| elashri wrote:
| I think the process of data acquisition isn't so clear-cut. Take
| CERN as an example: they release loads of data from various
| experiments under the CC0 license [1]. This isn't just a few
| small datasets for classroom use; we're talking big-league data,
| like the entire first run data from LHCb [2].
|
| On their portal, they don't just dump the data and leave you to
| it. They've got guides on analysis and the necessary tools
| (mostly open source stuff like ROOT [3] and even VMs). This means
| anyone can dive in. You could potentially discover something new
| or build on existing experiment analyses. This setup, with open
| data and tools, ticks the boxes for reproducibility. But does it
| mean people need to recreate the data themselves?
|
| Ideally, yeah, but realistically, while you could theoretically
| rebuild the LHC (since most technical details are public), it
| would take an army of skilled people, billions of dollars, and
| years to do it.
|
| This contrasts with open source models, where you can retrain
| models using data to get the weights. But getting hold of the
| data and the cost to reproduce the weights is usually
| prohibitive. I get that CERN's approach might seem to counter
| this, but remember, they're not releasing the raw data (which
| is mostly noise) but a more refined version. If they did, good
| luck downloading several petabytes of raw data. But for
| training something like an LLM, you might need the whole
| dataset, which in many cases has its own problems with
| copyright, etc.
|
| [1] https://opendata.cern.ch/docs/terms-of-use
|
| [2] https://opendata.cern.ch/docs/lhcb-releases-entire-
| run1-data...
|
| [3] https://root.cern/
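|
| As a hedged sketch of "anyone can dive in": reading one of
| those ROOT files with the uproot library (the file name, tree
| name, and branch are illustrative placeholders, not a real
| dataset layout):
|
|     # Open a ROOT file from the CERN Open Data portal and
|     # pull one branch into a NumPy array.
|     import uproot
|
|     events = uproot.open("lhcb_run1_sample.root")["DecayTree"]
|     print(events.keys())  # list the available branches
|     masses = events["B_M"].array(library="np")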
| lmm wrote:
| You're right that most people have neither the need nor the
| ability to recreate the data themselves. But the same applies
| to using open-source software in the first place: most people
| who use OSS have neither the need nor the ability to compile
| the software from source themselves. But the whole point of OSS
| is that that source is available for those who want to use it,
| whether to study it, to diagnose a bug, or something else. I
| think the same is true for the LHC's technical details or a
| model's training data: most people won't recreate it at home,
| but it's important to make it available, and even someone who
| can't rebuild the whole thing themselves might spot an
| important bug or omission by going through the data collection
| details.
| albert180 wrote:
| I think the biggest issue is with publishing the datasets. Then
| people and companies would discover that they're full of their
| copyrighted content and sue. I wouldn't be surprised if they
| slurped the whole Z-Library et al. into their models, or, in
| Google's case, their entire Google Books dataset.
| zelon88 wrote:
| Somewhat unrelated, but here is a thought experiment...
|
| If a human knows a song "by heart" (imperfectly), it is not
| considered copyright infringement.
|
| If a LLM knows a song as part of its training data, then it is
| copyright infringement.
|
| But what if you developed a model with no prepared training
| data and forced it to learn from its own sensory inputs?
| Instead of shoveling it bits, you played it this particular
| song and it (imperfectly) recorded the song with its sensory
| input device, the same way humans listen to and experience
| music.
|
| Is the latter learning model infringing on the copyright of the
| song?
| stevehine wrote:
| This depends: how many times does it need to hear the song to
| build up a reasonably consistent internal reproduction, and
| are you paying per stream, buying the input data as CD
| singles, or just putting the AI in a room with the radio on
| and waiting for it to take in the playlist a few times?
| zelon88 wrote:
| Let's assume it is in a room with a radio listening to
| music, and that the AI is "general purpose" meaning that it
| can also perform other functions. It is not the sole
| purpose of the AI to do this all day.
|
| I see where you are coming from in trying to identify the
| source of the copyright. This would be important
| information if a human wanted to sue another human for
| reproducing copyrighted material.
|
| However, does that apply here? Nobody hears a human humming
| a song and asks if they obtained that music legally. Should
| it be important to ask an AI that same question if the
| purpose of listening to the song is not to steal it?
| nemomarx wrote:
| Not humming, but don't we prevent singing songs
| sometimes? The birthday song was famously held up by IP
| law for some years, right?
| kube-system wrote:
| The standards applied are exactly the same regardless of
| what tools are used. It doesn't matter if you're talking
| about a dumb AI, a general purpose AI, or a Xerox
| machine.
|
| If you want an exception to copyright, you're going to
| want to start looking at a section 107 (of the copyright
| act) exception:
| https://www.copyright.gov/title17/92chap1.html#107
|
| The reason someone walking down the street and humming a
| song is not a violation is because it very clearly meets
| all of the tests in section 107.
|
| The biggest problem with feeding stuff through a black
| box like an LLM is it isn't easy for a human to determine
| how close the result is to the original. An LLM could act
| like a Xerox machine, and it won't tell you.
| zelon88 wrote:
| I think this conversation has corrected some misgivings I
| had about the AI copyright argument. My takeaway is:
|
| Possessing copyrighted material is not inherently
| infringing on a copyright. Disseminating copyrighted
| material is, unless you meet section 107. AI runs afoul
| of section 107 when it verbatim shares copyrighted
| material from its dataset without attribution.
| kube-system wrote:
| > AI runs afoul of section 107 when it verbatim shares
| copyrighted material from its dataset without attribution.
|
| Technically, the AI doesn't run afoul. The person
| disseminating the copyrighted material does.
| teddyh wrote:
| If a person _plays_ a song similarly enough, then it _is_
| copyright infringement! Mere knowledge is irrelevant; it is
| the _producing of copies_ (and also a few related actions)
| which is prohibited by copyright.
| kube-system wrote:
| > If a LLM knows a song as part of its training data, then it
| is copyright infringement.
|
| No it isn't. You can feed whatever you want into your LLM,
| including copyrighted data. The issues arise when you start
| reproducing or distributing copyrighted content.
| jdietrich wrote:
| _> You can feed whatever you want into your LLM, including
| copyrighted data._
|
| That's currently the subject of considerable legal debate.
|
| https://edition.cnn.com/2023/07/10/tech/sarah-silverman-
| open...
| kube-system wrote:
| That is mostly an issue of the latter, whether the
| service that Meta/OpenAI offers outputs content that is a
| violation of copyright. Technically, derivative works are
| a copyright violation, but if you're not distributing
| them, you normally have a good fair use argument, and/or
| nobody knows.
| beardyw wrote:
| I think the answer is in the name. The "source" has always been
| what you need to build the thing. In this context I think we
| can agree that the thing is the model. Based on that, the model
| is no more open source than a binary program.
| declaredapple wrote:
| I'll venture to say the majority of these "open access models"
| are meant to serve as advertisements of capabilities (either of
| hardware, research, or techniques) and nothing more, MPT being
| one of the most obvious examples.
|
| Many don't offer any information; some do, but provide no new
| techniques and just threw a bunch of compute and some data at
| making a sub-par model that shows up on a specific
| leaderboard.
|
| Everyone is trying to keep a card up their sleeve so they can
| sell it. And showing up on scoreboards is a great advertisement.
| darrenBaldwin03 wrote:
| Same with open-core - if you can't self-host the thing on your
| own infra then it's not REALLY OSS.
| dbish wrote:
| Many companies are using "open source" as marketing rather than
| actually releasing open source software and models. No data?
| Not open source. Special license cutting out self-hosting or
| competitive use? Not open source.
| darrenBaldwin03 wrote:
| ^^^^ Well said
| tqi wrote:
| "the project does not benefit from the OSS feedback loop" It's
| not like you can submit PRs to training data that fix specific
| issues the way you can submit bug fixes, so I'm skeptical you
| would see much of a feedback loop.
|
| "it's hard to verify that the model has no backdoors (eg sleeper
| agents)" Again given the size of the datasets and the opaque way
| training works, I am skeptical that anyone would be able tell if
| there is a backdoor in the training data.
|
| "impossible to verify the data and content filter and whether
| they match your company policy" I don't totally know what this
| means. For one, you can/probably should apply company policies to
| the model outputs, which you can do without access to training
| data. Is the idea that every company could/should filter input
| data and train their own models?
|
| "you are dependent on the company to refresh the model" At the
| current cost, this is probably already true for most people.
|
| "A true open-source LLM project -- where everything is open from
| the codebase to the data pipeline -- could unlock a lot of value,
| creativity, and improve security." I am overall skeptical that
| this is true in the case of LLMs. If anything, I think this
| creates a larger surface for bad actors to attack.
| nick238 wrote:
| The company policy/backdoors issues are possibly like the whole
| Getty Images debacle. If a company contracts with a provider or
| just uses a given model themselves, they may have no idea that
| it's drawing from a ton of copyrighted work, AND with enough of
| a trail that the infringed party could probably win a suit.
|
| The backdoors I'd think of are sneaky words (maybe not even
| English) that all of a sudden cause it to emit NSFW outputs.
| Microsoft's short-lived @TayandYou comes to mind (but I don't
| think _anyone's_ making that mistake again, where multiple
| users' sessions are pooled).
| andy99 wrote:
| I don't agree, and the analogy is poor. One can do the things he
| lists with a trained model. Having the data is basically a red
| herring. I wish this got more attention. Open/free software is
| about exercising freedoms, and they all can be exercised if
| you've got the model weights and code.
|
| https://www.marble.onl/posts/considerations_for_copyrighting...
| tytso wrote:
| But one of the four freedoms is being able to modify/tweak
| things, including the model. If all you have is the model
| weights, then you can't easily tweak the model. The model
| weights are hardly the preferred form for making changes to
| update the model.
|
| The equivalent would be someone who gives you only the binary
| of LibreOffice. That's perfectly fine for editing documents
| and spreadsheets, but suppose you want to fix a bug in
| LibreOffice? Just having the binary is going to make it quite
| difficult to fix things.
|
| Similarly, suppose you find that the model has a bias in terms
| of labeling African Americans as criminals, or women as lousy
| computer programmers. If all you have is the model weights of
| the trained model, how easily can you fix the model? And how
| does that compare with running emacs on the LibreOffice
| binary?
| jdietrich wrote:
| _> suppose you find that the model has a bias in terms of
| labeling African Americans as criminals; or women as lousy
| computer programmers. If all you have is the model weights of
| the trained model, how easily can you fix the model?_
|
| That's textbook fine-tuning and is basically trivial. Adding
| another layer and training that is many orders of magnitude
| more efficient than retraining the whole model and works
| ~exactly as well.
|
| Models are data, not instructions. Analogies to software are
| actively harmful. We do not fix bugs in models any more than
| we fix bugs in a JPEG.
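|
| A minimal sketch of that kind of fix, using a LoRA-style
| adapter via the peft library (the model name and rank are
| illustrative; this is one common approach, not necessarily
| the one jdietrich has in mind):
|
|     # Freeze the released weights; train only a small added
|     # adapter on counterexamples to the unwanted behavior.
|     from peft import LoraConfig, get_peft_model
|     from transformers import AutoModelForCausalLM
|
|     base = AutoModelForCausalLM.from_pretrained("gpt2")
|     cfg = LoraConfig(r=8, task_type="CAUSAL_LM")
|     model = get_peft_model(base, cfg)
|     model.print_trainable_parameters()  # tiny vs. the base
|     # ...then run a normal training loop on curated data...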
| two_in_one wrote:
| The next step will be to ask for GPU time, because even with
| the data, the model code, and the training framework you may
| have no resources to train. "The equivalent would be" someone
| giving you the code, but no access to the mainframe required
| to compile it. Would that make it not open source(?) There are
| other variations, like the original compiler being lost, or
| current compilers not being backward compatible. Does that
| make old open source code closed now?
|
| In other words, there should be a reasonable line for when a
| model is called open source. In the extreme view, it's when
| the model, the training framework, and the data are all
| available for free. This would mean an open source model can
| be trained only on public domain data, which makes the class
| of open source models very, very limited.
|
| More realistic is to make the code and the weights available,
| so that with some common knowledge a new model can be trained,
| or an old one fine-tuned, on available data. Important note:
| the weights cannot be reproduced even if the original training
| data is available. It will always be a new model with
| (slightly) different responses.
| andy99 wrote:
| You can fine-tune a model; you've got way more power to do so
| given the trained model than starting from scratch with the
| raw data.
| code_biologist wrote:
| If all you have are the model weights, you can very easily
| tweak the model. How else are all these "decensored" Llama 2
| variants showing up on Hugging Face? There's a lot of value in
| a trained LLM itself, and it's 100% a type of openness to
| release these trained models.
|
| What you can't easily do is retrain from scratch using a
| heavily modified architecture or different training data
| preconditioning. So yes, it is valuable to have dataset
| access and compute to do this and this is the primary type of
| value for LLM providers. It would be great if this were more
| open -- it would also be great if everybody had a million
| dollars.
|
| I think it's pretty misguided to put down the first type of
| value and openness when honestly they're pretty independent,
| and the second type of value and openness is hard for anybody
| without millions of dollars to access.
| wlesieutre wrote:
| My main concern is that if all you have are weights you're
| stuck hoping for the benevolence of whatever organization is
| actually able to train the model with their secret dataset.
|
| When they get bought by Oracle and progress slows to a crawl
| because it's not profitable enough to interest them, you can't
| exactly do a LibreOffice. Or they can turn around and say
| "license change, future versions may not be used for <market
| that controlling company would like to dominate>" and now
| you're stuck with whatever old version of the model while they
| steamroll your project with newer updates.
|
| Open weights are worth nothing in terms of long term security
| of development, they're a toy that you can play with but you
| have no assurances of anything for the future.
| jncfhnb wrote:
| Everything you just said applies to normal software. Oh no!
| Big Corp just started a closed fork of their open source
| codebase! Well, the open source version is still there. The
| open source community can build off of it.
|
| You may complain that subsequent models are not iterative on
| the past and so having that old version doesn't help; but
| then the data probably changes too so having the old data
| would largely leave you with the same old model.
| wlesieutre wrote:
| When you train an updated model on a new dataset do you
| really start by deleting all of the data that you collected
| for training the previous version?
| jncfhnb wrote:
| Probably not. But if it's the new data providing the
| advantage then you're not exactly better off having the
| old data and the model vs. just having the model.
| wlesieutre wrote:
| The idea would be that another group could fork it and
| continue adding to the dataset on their own.
|
| As opposed to not being able to fork it at all because an
| "open source" model actually just means "you are allowed
| to use this particular release of our mystery box."
| jncfhnb wrote:
| You do not need the original dataset to train the model
| on an additional dataset
|
| Maybe I misunderstood your original question. To be
| clear, the process of modifying a trained model does not
| require the presence of the original data. You said
| "deleted" which perhaps I misinterpreted. You're not
| "instantiating a new model from scratch" when you modify
| it. You're continuing to train it where it left off.
| wlesieutre wrote:
| What if you want to start with a subset of the original
| data? Like you've trained a model, and then later said
| "You know, this new data we're adding is great, but maybe
| pulling all those comments from 4chan earlier was a
| mistake," wouldn't that require starting fresh with
| access to the actual data?
| robblbobbl wrote:
| Agreed.
| cpeterso wrote:
| Are there any truly open-source LLMs, where all the training
| data is publicly available (with a compatible license) and the
| training software can reproduce bit-identical models?
|
| Is training nondeterministic? I know LLM outputs are purposely
| nondeterministic.
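|
| On the second question: yes, training is nondeterministic by
| default (GPU kernel scheduling, data-loader ordering), though
| frameworks expose switches that pin most of it down. A PyTorch
| sketch; even with all of this, bit-identical weights across
| different GPUs or driver versions are not guaranteed:
|
|     import torch
|
|     torch.manual_seed(42)  # fix RNG state
|     # Raise an error if an op has no deterministic variant:
|     torch.use_deterministic_algorithms(True)
|     torch.backends.cudnn.benchmark = False
|     # cuBLAS additionally needs the environment variable
|     # CUBLAS_WORKSPACE_CONFIG=":4096:8" on CUDA builds.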
| edoardo-schnell wrote:
| So, we need something like Dockerfiles for models?
| emadm wrote:
| We made our last language model fully reproducible, including
| all datasets, training details, hyperparameters, etc.:
| https://stability.wandb.io/stability-llm/stable-lm/reports/S...
___________________________________________________________________
(page generated 2024-01-17 23:01 UTC)