[HN Gopher] If you can't reproduce the model then it's not open-...
       ___________________________________________________________________
        
       If you can't reproduce the model then it's not open-source
        
       Author : mgreg
       Score  : 171 points
       Date   : 2024-01-17 17:49 UTC (5 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | ssgodderidge wrote:
       | > Imagine if Linux published only a binary without the codebase.
       | Or published the codebase without the compiler used to make the
       | binary. This is where we are today.
       | 
       | This was such a helpful way to frame the problem! Something felt
       | off about the "open source models" out there; this highlights the
       | problem incredibly well.
        
         | fzliu wrote:
          | In my mind, what's more crucial here is the code for
          | downloading/scraping and labeling the data, not the model
          | architecture or the training script.
         | 
          | As much as I appreciate Mis(x)tral, I would've loved it even
          | more if they had released their code for gathering data.
        
           | declaredapple wrote:
            | I'm speculating that they're trying to avoid controversy
            | about their data sources. That, and a possible competitive
            | edge, depending on which specific sets/filtering they're
            | using.
        
             | ssgodderidge wrote:
             | To avoid controversy AND potential lawsuits.
        
               | declaredapple wrote:
               | Yup.
               | 
                | I think many countries will allow the use of IP for
                | training data (Japan already has).
               | 
               | They just need to buy time until then.
        
           | pk-protect-ai wrote:
            | I wouldn't mind having ready-to-use datasets instead of the
            | code for downloading/scraping and labeling; it would save a
            | lot of time. It's not complicated to write code for gathering
            | the data, but it can sometimes be impossible to replicate the
            | datasets if some parts of the data you have to scrape are
            | already gone (removed for various reasons).
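            | 
            | For instance, a toy version of the gathering code is trivial
            | to write, but it stops reproducing the dataset the moment a
            | page disappears (a minimal sketch; the URLs are placeholders):
            | 
            |     # Toy scraper sketch: the code is easy, but the dataset it
            |     # yields depends on what is still online at crawl time.
            |     import urllib.request
            | 
            |     urls = ["https://example.org/page1",
            |             "https://example.org/page2"]   # placeholders
            |     corpus = []
            |     for url in urls:
            |         try:
            |             with urllib.request.urlopen(url, timeout=10) as r:
            |                 corpus.append(r.read().decode("utf-8", "replace"))
            |         except Exception:
            |             # A removed/moved page can never be re-scraped, so
            |             # the "reproduced" dataset is already different.
            |             print("lost:", url)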
        
         | stcredzero wrote:
          | Open Source and Free Software weren't formulated to deal with
          | the need for such gargantuan amounts of data and compute.
         | 
         | Can the public compete? What percentage of the technical public
         | could we expect to participate, and how much data, compute, and
         | data quality improvement could they bring to the table? I
         | suspect that large corporations are at least an order of
         | magnitude advantaged economically.
        
           | RandomWorker wrote:
            | There is a big effort underway in China; Yuanqing Lin gave
            | an interview for a deep learning course about work at this
            | magnitude [1]. They suggest that they will host the resources
            | to store the data, train on it, and make all those algorithms
            | available in China.
           | 
           | [1] https://www.youtube.com/watch?v=3GfOnI3goAk
        
           | tikhonj wrote:
           | The public doesn't have the resources to train the largest
           | state-of-the-art LLMs, but training _useful_ LLMs seems
           | doable. Maybe not for most individuals but certainly for a
           | range of nonprofits, research teams and companies.
        
             | stcredzero wrote:
              | Isn't it relatively easy for a smaller model to poke holes
             | in the output of a larger model?
        
             | jncfhnb wrote:
             | But not nearly as in reach as modifying open source models.
        
           | edgarvaldes wrote:
           | Open Source and Free Software are not about the amount of
           | data.
        
         | nullc wrote:
         | The GPL describes the source as the "preferred form for
         | modification".
        
           | CHY872 wrote:
            | And that's the fun part: with LLMs, you have the model
            | itself, which cost hundreds of thousands in compute to train,
            | but given that you have the weights it's eminently fine-
            | tunable. So it's actually not really like Linux - rather it's
            | closer to something like a car, where you had no hope of
            | making it in the first place, but now that you have it, maybe
            | you can modify it.
        
             | patmorgan23 wrote:
             | So in this case, the weights are the source code and the
             | training material + compute time is like the software
             | development process that went into creating the source
             | code.
             | 
             | It would probably take well over a million dollars in
             | engineering hours to recreate the postgres source code from
             | scratch, just as it would take millions in compute to
             | rebuild the weights.
        
           | jncfhnb wrote:
           | The model weights ARE the preferred form for modification
        
             | tonyarkles wrote:
             | Unless you want to try modifying the model structure, in
             | which case the weights aren't necessarily valid anymore and
             | will need to be retrained.
        
         | FooBarWidget wrote:
         | I think a better analogy is firmware binary blobs in the Linux
         | kernel, or VM bytecodes.
         | 
         | The LLM inference engine (architecture implementation) is like
         | a kernel driver that loads a firmware binary blob, or a virtual
         | machine that loads bytecode. The inference engine is open
         | source. The problem is that the weights (firmware blobs, VM
         | bytecodes) are opaque: you don't have the means to reproduce
         | them.
         | 
         | The Linux community has long argued that drivers that load
         | firmware blobs are cheating: they don't count as open source.
         | 
         | Still, the "open source" LLMs are more open than "API-gated"
         | LLMs. It's a step in the right direction, but I hope we don't
         | stop there.
        
           | code_biologist wrote:
           | If we're continuing the analogy, the compute required to turn
           | the source into binaries costs millions of dollars. Not a
           | license fee for the compiler, but the actual time on a
           | computer.
        
         | jncfhnb wrote:
         | No it's not. You have everything you need to modify the models
         | to your own liking. You can explore how it works.
         | 
         | This analogy is bad. Models are unlike code bases in this way.
        
           | tonyarkles wrote:
           | > You have everything you need to modify the models to your
           | own liking.
           | 
            | What if I wanted to train it using only half of its training
            | set? If the inputs that were used to generate the released
            | weights aren't available, I can't do that: I have the weights
            | and the model structure, but not the training dataset.
           | 
            | To riff on the parent post, I have:
            | 
            |     Source + Compiler => Binaries
            | 
            | For the vast majority of open source models I have:
            | 
            |     [unavailable inputs] + Model Structure => Weights
            | 
            | They're not exactly the same as the source code/binary
            | scenario because I can still do this (which isn't generally
            | possible with binaries):
            | 
            |     Model Structure + Weights + [my own training data]
            |       => New Weights
           | 
           | Another way to look at it is that with source code I can
           | modify the code and recompile it from scratch. Maybe I think
           | the model author should have used a deeper CNN layer in the
           | middle of the model. Without the inputs I can't do a
           | comparison.
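            | 
            | As a rough sketch, that "Model Structure + Weights + [my own
            | training data] => New Weights" step looks something like the
            | following, assuming a Hugging Face-style stack ("base-model"
            | and "my_corpus.txt" are placeholders, not real names):
            | 
            |     from datasets import load_dataset
            |     from transformers import (AutoModelForCausalLM,
            |         AutoTokenizer, DataCollatorForLanguageModeling,
            |         Trainer, TrainingArguments)
            | 
            |     tok = AutoTokenizer.from_pretrained("base-model")
            |     tok.pad_token = tok.pad_token or tok.eos_token
            |     model = AutoModelForCausalLM.from_pretrained("base-model")
            | 
            |     # My own data; the original training set isn't needed.
            |     data = load_dataset("text",
            |                         data_files="my_corpus.txt")["train"]
            |     data = data.map(lambda x: tok(x["text"], truncation=True,
            |                                   max_length=512),
            |                     remove_columns=["text"])
            | 
            |     trainer = Trainer(
            |         model=model,
            |         args=TrainingArguments("new-weights",
            |                                num_train_epochs=1,
            |                                per_device_train_batch_size=2),
            |         train_dataset=data,
            |         data_collator=DataCollatorForLanguageModeling(
            |             tok, mlm=False),
            |     )
            |     trainer.train()   # => New Weights
            |     trainer.save_model("new-weights")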
        
       | mgreg wrote:
        | Applying the term "open source" to AI models is a bit more
        | nuanced than applying it to software. Many consider
        | reproducibility the bar to get over to earn the label "open
        | source."
        | 
        | For an AI model that means the model itself, the dataset, and
        | the training recipe (e.g. process, hyperparameters), often also
        | released as source code. With that (and a lot of compute) you
        | can train the model to reproduce the weights.
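        | 
        | As a sketch, a release that clears that bar might ship a recipe
        | like the one below alongside the weights (every name and value
        | here is made up for illustration):
        | 
        |     # Hypothetical training recipe published with the weights.
        |     TRAINING_RECIPE = {
        |         "architecture": "decoder-only transformer, 32 layers",
        |         "dataset": {
        |             "source": "https://example.org/open-corpus-v1",
        |             "tokens": 1_000_000_000_000,
        |             "filtering": "dedup + language id + quality filter",
        |         },
        |         "hyperparameters": {
        |             "optimizer": "AdamW",
        |             "learning_rate": 3e-4,
        |             "lr_schedule": "cosine, 2000 warmup steps",
        |             "batch_size_tokens": 4_000_000,
        |             "context_length": 4096,
        |         },
        |         # necessary, though not sufficient, for bit-for-bit
        |         # reproducibility
        |         "seed": 42,
        |     }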
        
       | belval wrote:
       | > if you can't reproduce the model then it's not truly open-
       | source.
       | 
        | Open-source means open source; it does not make reproducibility
       | guarantees. You get the code and you can use the code. Pushed to
       | the extreme this is like saying Chromium is not open-source
       | because my 4GB laptop can't compile it.
       | 
       | Getting training code for GPT-4 under MIT would be mostly
       | useless, but it would still be open source.
        
         | stcredzero wrote:
         | Then what we need isn't open source. It's something else. Maybe
         | called "Open Model?"
        
           | belval wrote:
           | Yes that would make sense. I'm in no way arguing that models
           | can't be more open, just that overloading a commonly used
           | expression such as "open-source" and then complaining that
           | projects are not complying with your new definition of open-
           | source just does not make sense to me.
        
         | camgunz wrote:
         | > Pushed to the extreme this is like saying Chromium is not
         | open-source because my 4GB laptop can't compile it.
         | 
          | Not really; an analog would be if Chromium shipped LLVM IR as
         | its source but no one could get any version of LLVM to output
         | the exact same IR no matter what configurations they tried, and
         | thus any "home grown" Chromium was a little off.
        
       | ramesh31 wrote:
       | No, but it's still insanely useful and free as in beer.
        
       | anticorporate wrote:
       | The Open Source Initiative, who maintain the Open Source
       | Definition, have been running a whole series over the past year
       | to collect input from all sorts of stakeholders about what it
       | means for an AI to be open source. I was lucky enough to
        | participate in an afternoon-long session with about a hundred
       | other people last year at All Things Open.
       | 
       | https://deepdive.opensource.org/
       | 
       | I encourage you to go check out what's already being done here. I
        | promise it's way more nuanced than anything that is going to
        | fit in a tweet.
        
         | nextaccountic wrote:
         | Can you summarize? I'm reading
         | https://deepdive.opensource.org/wp-content/uploads/2023/02/D...
         | but it seems to tackle too many questions when I'm really only
          | interested in what criteria to use when deciding whether (for
         | example) Stable Diffusion is open source or not.
         | 
         | Anyway, to go on a tangent, some day maybe with zero knowledge
         | proofs we will be able to prove that a given pretrained model
         | was indeed the result of training using a given dataset, in a
         | way that can be verified vastly cheaper than training the model
         | itself from scratch. (This same technique could also be applied
         | to other things like verifying if a binary was compiled from a
          | given source with a given compiler, hopefully in a cheaper
          | way than compiling and applying all optimizations from
          | scratch).
         | 
          | If this ever materializes, then we can just demand proofs.
         | 
         | Here's a study on that
         | 
         | https://montrealethics.ai/experimenting-with-zero-knowledge-...
         | 
         | https://dl.acm.org/doi/10.1145/3576915.3623202
         | 
         | And here is another
         | 
         | https://eprint.iacr.org/2023/1174
        
       | RcouF1uZ4gsC wrote:
        | I would argue that while technically correct, that is not what
        | most people really care about. What they care about are the
        | following:
       | 
       | 1. Can I download it?
       | 
       | 2. Can I run it on my hardware?
       | 
       | 3. Can I modify it?
       | 
       | 4. Can I share my modifications with others?
       | 
        | If those questions are all answered in the affirmative, then I
        | think most people consider it open enough, and it is a huge step
        | for freedom compared to models such as OpenAI's.
        
         | mgreg wrote:
         | It's a great observation. People simply want their free stuff.
         | 
          | The potential challenge arises in the future. Today's models
          | will probably look weak compared to the models we'll have in
          | 1, 3, or 10 years, which means today's models will likely be
          | irrelevant a few years hence. Every competitive "open" model
          | today is tied closely to a controlling organization, whether
          | it's Meta, Mistral.AI, TII, 01.AI, etc.
         | 
         | If they simply choose not to publish the next iteration of
         | their model and follow OpenAI's path that's the end of the
         | line.
         | 
         | A truly open model could have some life beyond that of its
         | original developer/organization. Of course it would still take
         | great talent, updated datasets, and serious access to compute
         | to keep a model moving forward and developing but if this is
         | done in the "open" community then we'd have some guarantee for
         | the future.
         | 
          | Imagine if Linux were actually owned by a for-profit
          | corporation that could simply choose not to release a future
          | version, AND it were not possible for another organization to
          | fork and carry on an "open" Linux.
        
         | camgunz wrote:
         | Some people want more than that, e.g. they want to fix their
         | printer but the driver is closed source, so they start the GNU
         | project and the broader free software movement, responsible for
         | almost all software innovation for decades.
        
         | ivan_gammel wrote:
          | "Can it be trusted?" is the question many people will care
          | about once awareness of the risks becomes higher. If this
          | question can be answered without publishing the source, fine,
          | but that would probably mean the publisher must be liable for
          | damages from model output.
        
       | Der_Einzige wrote:
       | 95% of the value comes from the model being freely downloadable
       | and analyzable (i.e. not obfuscated/crippled post-hoc). Sure
        | there is some difference, but as a researcher I care far more
       | about open access than making every "gnuight" on the internet
       | happy that we used the right terminology.
        
       | elashri wrote:
       | I think the process of data acquisition isn't so clear-cut. Take
       | CERN as an example: they release loads of data from various
       | experiments under the CC0 license [1]. This isn't just a few
       | small datasets for classroom use; we're talking big-league data,
       | like the entire first run data from LHCb [2].
       | 
       | On their portal, they don't just dump the data and leave you to
       | it. They've got guides on analysis and the necessary tools
       | (mostly open source stuff like ROOT [3] and even VMs). This means
       | anyone can dive in. You could potentially discover something new
       | or build on existing experiment analyses. This setup, with open
       | data and tools, ticks the boxes for reproducibility. But does it
       | mean people need to recreate the data themselves?
       | 
       | Ideally, yeah, but realistically, while you could theoretically
       | rebuild the LHC (since most technical details are public), it
       | would take an army of skilled people, billions of dollars, and
       | years to do it.
       | 
       | This contrasts with open source models, where you can retrain
       | models using data to get the weights. But getting hold of the
       | data and the cost to reproduce the weights is usually
       | prohibitive. I get that CERN's approach might seem to counter
       | this, but remember, they're not releasing raw data (which is
        | mostly noise), but a more refined version. Otherwise, try
        | downloading several petabytes of raw data; good luck with that.
        | But for training something like an LLM, you might need the whole
        | dataset, which in many cases has its own problems with
        | copyright, etc.
       | 
       | [1] https://opendata.cern.ch/docs/terms-of-use
       | 
       | [2] https://opendata.cern.ch/docs/lhcb-releases-entire-
       | run1-data...
       | 
       | [3] https://root.cern/
        
         | lmm wrote:
         | You're right that most people have neither the need nor the
         | ability to recreate the data themselves. But the same applies
         | to using open-source software in the first place: most people
         | who use OSS have neither the need nor the ability to compile
         | the software from source themselves. But the whole point of OSS
         | is that that source is available for those who want to use it,
         | whether to study it, to diagnose a bug, or something else. I
         | think the same is true for the LHC's technical details or a
         | model's training data: most people won't recreate it at home,
         | but it's important to make it available, and even someone who
         | can't rebuild the whole thing themselves might spot an
         | important bug or omission by going through the data collection
         | details.
        
       | albert180 wrote:
       | I think the biggest issue is with publishing the datasets. Then
       | people and companies would discover that it's full of their
        | copyrighted content and sue. I wouldn't be surprised if they
        | slurped the whole of Z-Library et al. into their models. Or, in
        | Google's case, their entire Google Books dataset.
        
         | zelon88 wrote:
         | Somewhat unrelated, but here is a thought experiment...
         | 
         | If a human knows a song "by heart" (imperfectly), it is not
         | considered copyright infringement.
         | 
         | If a LLM knows a song as part of its training data, then it is
         | copyright infringement.
         | 
         | But what if you developed a model with no prepared training
          | data and forced it to learn from its own sensory inputs.
         | Instead of shoveling it bits, you played it this particular
          | song and it (imperfectly) recorded the song with its sensory
         | input device. The same way humans listen to and experience
         | music.
         | 
         | Is the latter learning model infringing on the copyright of the
         | song?
        
           | stevehine wrote:
            | This depends: how many times does it need to hear the song to
           | build up a reasonably consistent internal reproduction, and
           | are you paying per stream or buying the input data as CD
           | Singles - or just putting the AI in a room with the radio on
           | and waiting for it to take in the playlist a few times ?
        
             | zelon88 wrote:
             | Let's assume it is in a room with a radio listening to
             | music, and that the AI is "general purpose" meaning that it
             | can also perform other functions. It is not the sole
             | purpose of the AI to do this all day.
             | 
             | I see where you are coming from in trying to identify the
             | source of the copyright. This would be important
              | information if a human wanted to sue another human for
              | reproducing copyrighted material.
             | 
             | However, does that apply here? Nobody hears a human humming
             | a song and asks if they obtained that music legally. Should
             | it be important to ask an AI that same question if the
             | purpose of listening to the song is not to steal it?
        
               | nemomarx wrote:
                | Not humming, but don't we prevent singing songs
                | sometimes? The birthday song was famously held up by IP
                | law for some years, right?
        
               | kube-system wrote:
               | The standards applied are exactly the same regardless of
               | what tools are used. It doesn't matter if you're talking
               | about a dumb AI, a general purpose AI, or a Xerox
               | machine.
               | 
               | If you want an exception to copyright, you're going to
               | want to start looking at a section 107 (of the copyright
               | act) exception:
               | https://www.copyright.gov/title17/92chap1.html#107
               | 
               | The reason someone walking down the street and humming a
               | song is not a violation is because it very clearly meets
               | all of the tests in section 107.
               | 
               | The biggest problem with feeding stuff through a black
               | box like an LLM is it isn't easy for a human to determine
               | how close the result is to the original. An LLM could act
               | like a Xerox machine, and it won't tell you.
        
               | zelon88 wrote:
                | I think this conversation has corrected some misgivings
                | I had about the AI copyright argument. My takeaway is:
                | 
                | Possessing copyrighted material is not inherently
                | infringing on a copyright. Disseminating copyrighted
                | material is, unless you meet section 107. AI runs afoul
                | of section 107 when it verbatim shares copyrighted
                | material from its dataset without attribution.
        
               | kube-system wrote:
               | > AI runs afoul of section 107 when it verbatim shares
                | copyrighted material from its dataset without attribution.
               | 
               | Technically, the AI doesn't run afoul. The person
               | disseminating the copyrighted material does.
        
           | teddyh wrote:
           | If a person _plays_ a song similarly enough, then it _is_
            | copyright infringement! Mere knowledge is irrelevant, it is
           | the _producing of copies_ (and also a few related actions)
           | which is prohibited by copyright.
        
           | kube-system wrote:
           | > If a LLM knows a song as part of its training data, then it
           | is copyright infringement.
           | 
           | No it isn't. You can feed whatever you want into your LLM,
           | including copyrighted data. The issues arise when you start
           | reproducing or distributing copyrighted content.
        
             | jdietrich wrote:
             | _> You can feed whatever you want into your LLM, including
             | copyrighted data._
             | 
             | That's currently the subject of considerable legal debate.
             | 
             | https://edition.cnn.com/2023/07/10/tech/sarah-silverman-
             | open...
        
               | kube-system wrote:
               | That is mostly an issue of the latter, whether the
               | service that Meta/OpenAI offers outputs content that is a
               | violation of copyright. Technically, derivative works are
               | a copyright violation, but if you're not distributing
               | them, you normally have a good fair use argument, and/or
               | nobody knows.
        
       | beardyw wrote:
       | I think the answer is in the name. The "source" has always been
       | what you need to build the thing. In this context I think we can
       | agree that the thing is the model. Based on that the model is no
       | more open source than a binary program.
        
       | declaredapple wrote:
       | I'll venture to say the majority of these "open access models"
       | are meant to serve as advertisements of capabilities (either of
        | hardware, research, or techniques) and nothing more. MPT is one
        | of the most obvious examples.
       | 
        | Many don't offer any information; some do, but provide no new
        | techniques and just threw a bunch of compute and data at making
        | a sub-par model that shows up on a specific leaderboard.
       | 
       | Everyone is trying to save a card up their sleeve so they can
       | sell it. And showing up on scoreboards is a great advertisement.
        
       | darrenBaldwin03 wrote:
       | Same with open-core - if you can't self-host the thing on your
        | own infra then it's not REALLY OSS
        
         | dbish wrote:
          | Many companies are using "open source" as marketing rather than
         | actually releasing open source software and models. No data?
         | Not open source. Special license cutting out self-hosting or
         | competitive use? Not open source.
        
           | darrenBaldwin03 wrote:
           | ^^^^ Well said
        
       | tqi wrote:
       | "the project does not benefit from the OSS feedback loop" It's
        | not like you can submit PRs to training data that fix specific
       | issues the way you can submit bug fixes, so I'm skeptical you
       | would see much of a feedback loop.
       | 
       | "it's hard to verify that the model has no backdoors (eg sleeper
       | agents)" Again given the size of the datasets and the opaque way
        | training works, I am skeptical that anyone would be able to tell
       | there is a backdoor in the training data.
       | 
       | "impossible to verify the data and content filter and whether
       | they match your company policy" I don't totally know what this
       | means. For one, you can/probably should apply company policies to
       | the model outputs, which you can do without access to training
       | data. Is the idea that every company could/should filter input
       | data and train their own models?
       | 
       | "you are dependent on the company to refresh the model" At the
       | current cost, this is probably already true for most people.
       | 
       | "A true open-source LLM project -- where everything is open from
       | the codebase to the data pipeline -- could unlock a lot of value,
       | creativity, and improve security." I am overall skeptical that
       | this is true in the case of LLMs. If anything, I think this
       | creates a larger surface for bad actors to attack.
        
         | nick238 wrote:
         | The company policy/backdoors issues are possibly like the whole
         | Getty Images debacle. If a company contracts with a provider or
         | just uses a given model themselves, they may have no idea that
          | it's taking from a ton of copyrighted work, AND with enough of
          | a trail that the infringed party could probably win a suit.
         | 
          | The backdoors I'd think of are sneaky words (maybe not even
          | English) that all of a sudden cause it to emit NSFW outputs.
          | Microsoft's short-lived @TayandYou comes to mind (but I don't
          | think _anyone's_ making that mistake again, where multiple
          | users' sessions are pooled).
        
       | andy99 wrote:
       | I don't agree, and the analogy is poor. One can do the things he
       | lists with a trained model. Having the data is basically a red
       | herring. I wish this got more attention. Open/free software is
       | about exercising freedoms, and they all can be exercised if
       | you've got the model weights and code.
       | 
       | https://www.marble.onl/posts/considerations_for_copyrighting...
        
         | tytso wrote:
          | But one of the four freedoms is being able to modify/tweak
          | things, including the model. If all you have is the model
          | weights, then you can't easily tweak the model. The model
          | weights are hardly the preferred form for making changes to
          | update the model.
         | 
          | The equivalent would be someone who gives you only the binary
          | of LibreOffice. That's perfectly fine for editing documents and
          | spreadsheets, but suppose you want to fix a bug in LibreOffice?
         | Just having the binary is going to make it quite difficult to
         | fix things.
         | 
          | Similarly, suppose you find that the model has a bias in terms
         | of labeling African Americans as criminals; or women as lousy
         | computer programmers. If all you have is the model weights of
         | the trained model, how easily can you fix the model? And how
         | does that compare with running emacs on the Libreoffice binary?
        
           | jdietrich wrote:
           | _> suppose you find that the model has a bias in terms of
           | labeling African Americans as criminals; or women as lousy
           | computer programmers. If all you have is the model weights of
           | the trained model, how easily can you fix the model?_
           | 
           | That's textbook fine-tuning and is basically trivial. Adding
           | another layer and training that is many orders of magnitude
           | more efficient than retraining the whole model and works
           | ~exactly as well.
           | 
           | Models are data, not instructions. Analogies to software are
           | actively harmful. We do not fix bugs in models any more than
           | we fix bugs in a JPEG.
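            | 
            | A minimal PyTorch-flavored sketch of "add a layer and train
            | only that" (the 4096 width and the `base` layer are stand-ins
            | for a released model, not anyone's actual architecture):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     base = nn.Linear(4096, 4096)   # pretend: released weights
            |     for p in base.parameters():
            |         p.requires_grad = False    # freeze what we were given
            | 
            |     head = nn.Linear(4096, 4096)   # added, trainable layer
            |     model = nn.Sequential(base, head)
            | 
            |     opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
            |     x, target = torch.randn(8, 4096), torch.randn(8, 4096)
            |     loss = nn.functional.mse_loss(model(x), target)
            |     loss.backward()                # gradients only reach `head`
            |     opt.step()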
        
           | two_in_one wrote:
            | The next step will be to ask for GPU time, because even with
            | the data, the model code, and the training framework you may
            | have no resources to train. "The equivalent would be" someone
            | giving you the code, but no access to the mainframe required
            | to compile it. Would that make it not open source? There are
            | other variations, like the original compiler being lost, or
            | current compilers not being backward compatible. Does that
            | make old open source code closed now?
            | 
            | In other words, there should be a reasonable line for when a
            | model is called open source. In the extreme view it's when
            | the model, the training framework, and the data are all
            | available for free. This would mean an open source model can
            | be trained only on public domain data, which makes the class
            | of open source models very, very limited.
            | 
            | More realistic is to make the code and the weights available,
            | so that with some common knowledge a new model can be
            | trained, or an old one fine-tuned, on available data.
            | Important note: the weights cannot be reproduced exactly even
            | if the original training data is available. It will always be
            | a new model with (slightly) different responses.
        
           | andy99 wrote:
            | You can fine-tune a model; you've got way more power to do
            | so given the trained model than starting from scratch with
            | the raw data.
        
           | code_biologist wrote:
           | If all you have are the model weights, you can very easily
            | tweak the model. How else are all these "decensored" Llama 2
            | variants showing up on Hugging Face? There's a lot of value
            | in a trained LLM itself, and it's 100% a type of openness to
            | release these trained models.
           | 
           | What you can't easily do is retrain from scratch using a
           | heavily modified architecture or different training data
           | preconditioning. So yes, it is valuable to have dataset
           | access and compute to do this and this is the primary type of
           | value for LLM providers. It would be great if this were more
           | open -- it would also be great if everybody had a million
           | dollars.
           | 
           | I think it's pretty misguided to put down the first type of
           | value and openness when honestly they're pretty independent,
           | and the second type of value and openness is hard for anybody
           | without millions of dollars to access.
        
         | wlesieutre wrote:
         | My main concern is that if all you have are weights you're
         | stuck hoping for the benevolence of whatever organization is
         | actually able to train the model with their secret dataset.
         | 
         | When they get bought by Oracle and progress slows to a crawl
         | because it's not profitable enough to interest them, you can't
         | exactly do a LibreOffice. Or they can turn around and say
         | "license change, future versions may not be used for <market
         | that controlling company would like to dominate>" and now
         | you're stuck with whatever old version of the model while they
         | steamroll your project with newer updates.
         | 
          | Open weights are worth nothing in terms of long-term security
          | of development; they're a toy that you can play with, but you
          | have no assurances of anything for the future.
        
           | jncfhnb wrote:
           | Everything you just said applies to normal software. Oh no!
           | Big Corp just started a closed fork of their open source
           | codebase! Well, the open source version is still there. The
           | open source community can build off of it.
           | 
           | You may complain that subsequent models are not iterative on
           | the past and so having that old version doesn't help; but
           | then the data probably changes too so having the old data
           | would largely leave you with the same old model.
        
             | wlesieutre wrote:
             | When you train an updated model on a new dataset do you
             | really start by deleting all of the data that you collected
             | for training the previous version?
        
               | jncfhnb wrote:
               | Probably not. But if it's the new data providing the
               | advantage then you're not exactly better off having the
               | old data and the model vs. just having the model.
        
               | wlesieutre wrote:
               | The idea would be that another group could fork it and
               | continue adding to the dataset on their own.
               | 
               | As opposed to not being able to fork it at all because an
               | "open source" model actually just means "you are allowed
               | to use this particular release of our mystery box."
        
               | jncfhnb wrote:
               | You do not need the original dataset to train the model
               | on an additional dataset
               | 
               | Maybe I misunderstood your original question. To be
               | clear, the process of modifying a trained model does not
               | require the presence of the original data. You said
               | "deleted" which perhaps I misinterpreted. You're not
               | "instantiating a new model from scratch" when you modify
               | it. You're continuing to train it where it left off.
        
               | wlesieutre wrote:
               | What if you want to start with a subset of the original
               | data? Like you've trained a model, and then later said
               | "You know, this new data we're adding is great, but maybe
               | pulling all those comments from 4chan earlier was a
               | mistake," wouldn't that require starting fresh with
               | access to the actual data?
        
       | robblbobbl wrote:
       | Agreed.
        
       | cpeterso wrote:
        | Are there any truly open-source LLMs, where all the training
        | data is publicly available (with a compatible license) and the
        | training software can reproduce bit-identical models?
       | 
       | Is training nondeterministic? I know LLM outputs are purposely
       | nondeterministic.
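        | 
        | My understanding is that training is nondeterministic by default
        | (GPU kernels, data order, dropout), though frameworks expose
        | switches that get most of the way there. A PyTorch-flavored
        | sketch, with no guarantee of bit-identical results across
        | different GPUs or driver versions:
        | 
        |     import os, random
        |     import numpy as np
        |     import torch
        | 
        |     os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
        |     random.seed(0)
        |     np.random.seed(0)
        |     torch.manual_seed(0)
        |     # error out on ops with no deterministic implementation
        |     torch.use_deterministic_algorithms(True)
        |     torch.backends.cudnn.benchmark = False
        |     # DataLoader shuffling needs a seeded generator too:
        |     # loader = torch.utils.data.DataLoader(
        |     #     ds, shuffle=True,
        |     #     generator=torch.Generator().manual_seed(0))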
        
       | edoardo-schnell wrote:
        | So, we need something like Dockerfiles for models?
        
       | emadm wrote:
        | We made our last language model fully reproducible, including
        | all datasets, training details, hyperparameters, etc.:
       | https://stability.wandb.io/stability-llm/stable-lm/reports/S...
        
       ___________________________________________________________________
       (page generated 2024-01-17 23:01 UTC)