       ___________________________________________________________________
        
       The Open Source AI Definition RC1 Is Available for Comments
        
       Author : foxbee
       Score  : 36 points
       Date   : 2024-10-09 19:00 UTC (4 hours ago)
        
 (HTM) web link (opensource.org)
 (TXT) w3m dump (opensource.org)
        
       | swyx wrote:
        | D.O.A. without adoption from the major model labs (including the
        | "opener" ones like AI2 and, let's say, Together/Eleuther). I
        | don't like the open source old guard feeling like they have any
        | say in defining things when they don't have skin in the game.
        | (And yes, this is coming from a fan of their current work
        | defending the "open source" term in traditional dev tools.) A
        | good way to ensure a decline into irrelevance is to do a lot of
        | busywork without securing a credible quorum of the major players
        | at the table.
        | 
        | Please don't let me discourage you, though; I think this could be
        | important work, but if and only if it gets endorsement from >1
        | large model lab producing interesting work.
        
         | blackeyeblitzar wrote:
          | Why shouldn't the "old guard" have a say when they came up with
          | the idea of open source? It is misleading to adopt terminology
          | with well-known definitions and abuse it. Companies like Meta
          | are free to use some other term that isn't "open source" to
          | describe their models, which I cannot reproduce because they've
          | released nothing except weights and inference code.
        
         | sigh_again wrote:
         | > they have any say in defining things when they dont have skin
         | in the game.
         | 
         | Then, maybe don't go around stealing and bastardizing the "open
         | source" concept when absolutely none of the serious AI research
          | is open source or reproducible. Just because you read a fancy
         | word online once and think you can use it doesn't mean you're
         | right.
        
         | jszymborski wrote:
         | > D.O.A without adoption from the major model labs
         | 
         | I definitely disagree. Adoption of open licenses has
         | historically been "bottom-up", starting with academia and
         | hobbyists and then eventually used by big names. I have zero
         | idea why that can't be the case here.
         | 
         | I know I'll be releasing my models under an open license once
         | finalized.
        
       | tananaev wrote:
        | The definition is good, because many currently describe their
        | open model weights as open "source". But I suspect most companies
        | will still call their models open source even when they're not.
        
       | datascientist wrote:
       | also see https://gradientflow.com/open-source-principles-in-
       | foundatio...
        
       | exac wrote:
       | > The aim of Open Source is not and has never been to enable
       | reproducible software.
       | 
        | Okay, well, having the domain name "opensource.org" doesn't give
        | you the ability to speak for the community, or for the
        | community's understanding of the term.
       | 
       | opensource.org is irrelevant.
        
         | FrustratedMonky wrote:
         | I agree.
         | 
         | "never been to enable reproducible software"
         | 
          | I'd just say that "never" is a big word.
          | 
          | Having open code that everyone can read and run was partly
          | about allowing for reproducibility. In the closed world, how is
          | anybody reproducing anything? Being open does enable that.
        
           | saurik wrote:
           | The article seems to cover this nuance in the next
           | paragraphs?
        
         | saurik wrote:
          | I mean, I've never understood "open source" to require
          | reproducibility. That concept barely existed as a thing people
          | strove for until about 15 years ago, a lot of software still
          | only barely supports it, and there are tons of tradeoffs that
          | come with it (you effectively inherit your entire toolchain as
          | something you have to vendor and maintain, and a lot of
          | projects end up with awkward binaries as a result, since almost
          | no one reproduces entirely from a small bit of bootstrapped
          | Lisp).
        
       | blackeyeblitzar wrote:
        | A reinforcement of definitions is needed. Open weights is NOT
        | open source. But companies like Meta are rampantly open-washing
        | their work. The point of open source is that you can recreate the
        | product yourself, for example by compiling the source code.
        | Clearly the equivalent for an LLM is being able to retrain the
        | model to produce the weights. Yes, I realize this is impractical
        | without access to the hardware, but the transparency is still
        | important, so we know how these models are designed and how they
        | may be influencing us through biases/censorship.
       | 
       | The only actually open source model I am aware of is AI2's OLMo
       | (https://blog.allenai.org/olmo-open-language-
       | model-87ccfc95f5...), which includes training data, training
       | code, evaluation code, fine tuning code, etc.
       | 
       | The license also matters. A burdened license that restricts what
       | you can do with the software is not really open source.
       | 
       | I do have concerns about where OSI is going with all this. For
       | example, why are they now saying that reproducibility is not a
       | part of the definition? These two paragraphs below contradict
       | each other - what does it mean to be able to "meaningfully fork"
       | something and be able to make it more useful if you don't have
       | the ingredients to reproduce it in the first place?
       | 
       | > The aim of Open Source is not and has never been to enable
       | reproducible software. The same is true for Open Source AI:
       | reproducibility of AI science is not the objective. Open Source's
       | role is merely not to be an impediment to reproducibility. In
       | other words, one can always add more requirements on top of Open
       | Source, just like the Reproducible Builds effort does.
       | 
       | > Open Source means giving anyone the ability to meaningfully
       | "fork" (study and modify) a system, without requiring additional
       | permissions, to make it more useful for themselves and also for
       | everyone.
        
         | MichaelNolan wrote:
         | > what does it mean to be able to "meaningfully fork" something
         | and be able to make it more useful if you don't have the
         | ingredients to reproduce it in the first place?
         | 
          | I could be misunderstanding them, but my takeaway is that exact
          | bit-for-bit reproducibility is not required. Most software,
          | including open source, is not bit-for-bit reproducible. Exact
          | reproducibility is a fairly new concept. Even with all the
          | training data and all the code, you are unlikely to get the
          | exact same model as before.
         | 
         | Though if that is what they mean, then they should be more
         | explicit about it.
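          | 
          | For example, even pinning every seed you control only gets you
          | so far. A rough sketch of the usual incantations (assuming
          | PyTorch; purely for illustration):
          | 
          |     import random
          | 
          |     import numpy as np
          |     import torch
          | 
          |     # Pin every RNG source we control.
          |     random.seed(0)
          |     np.random.seed(0)
          |     torch.manual_seed(0)
          | 
          |     # Prefer deterministic kernels where they exist; warn_only
          |     # because some ops have no deterministic implementation.
          |     torch.use_deterministic_algorithms(True, warn_only=True)
          |     torch.backends.cudnn.benchmark = False
          | 
          |     # Even then, results can drift across GPU models, driver
          |     # and library versions, and data-loading order, so the
          |     # "same" run rarely reproduces a released checkpoint bit
          |     # for bit.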
        
       | glkanb wrote:
       | Ok, decent first steps. Now approve a BSD license with an
       | additional clause that prohibits use for "AI" training.
       | 
       | Just like a free grazing field would allow living animals, but
       | not a combine harvester. The old rules of "for any purpose" no
       | longer apply.
        
       | godelski wrote:
        | I don't think this makes sense, nor is it consistent with itself,
        | let alone its other definition[0]:
        | 
        | > The aim of Open Source is not and has never been to enable
        | reproducible software.
        | 
        | ...
        | 
        | > Open Source means giving anyone the ability to meaningfully
        | "fork" (study and modify) a system, without requiring additional
        | permissions, to make it more useful for themselves and also for
        | everyone.
        | 
        | ...
        | 
        | > Forking in the machine learning context has the same meaning
        | as with software: having the ability and the rights to build a
        | system that behaves differently than its original status. Things
        | that a fork may achieve are: fixing security issues, improving
        | behavior, removing bias.
       | 
        | Achieving those things does require what most people are asking
        | for: training details.
        | 
        | So far companies are just releasing checkpoints and architecture.
        | That is better than nothing, and it's a great step (especially
        | with how entrenched businesses are[1]). But if we really want to
        | do things like fix security issues or remove bias, you have to be
        | able to understand the data the model was originally trained on
        | AND the training procedures. Both of these introduce certain
        | biases (in the statistical sense, which is more general). These
        | issues can't all be solved by tuning, and the ability to tune is
        | itself significantly influenced by those decisions.
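        | 
        | As a toy illustration (the corpus and the bias test here are
        | hypothetical stand-ins): with the data you can fix a problem at
        | the source; with only a checkpoint you can merely tune around it.
        | 
        |     # With the training data: drop or reweight the offending
        |     # documents, then retrain / continue pretraining on the
        |     # cleaned corpus.
        |     def filter_corpus(corpus, looks_biased):
        |         # corpus: list of documents
        |         # looks_biased: doc -> bool (stand-in predicate)
        |         return [doc for doc in corpus if not looks_biased(doc)]
        | 
        |     # With only released weights, you can't see which documents
        |     # caused a behavior, so the best you can do is fine-tune
        |     # over it after the fact.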
       | 
        | The reason we care about reproducible builds is that they matter
        | for things like security: we know that what we're looking at is
        | the same thing that's in the actual program. It is fair to say
        | that the "aim" isn't reproducible software, but reproducibility
        | is a direct consequence of the software being open source. Trust
        | matters, but the saying is "trust, but verify". Sure, you can
        | also fix vulns and bugs in closed source software; hell, you can
        | even edit or build on top of it. But we don't call those things
        | open source (or source available) for a reason.
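        | 
        | The "verify" half of that is mundane once builds are
        | reproducible; roughly (the path and digest here are made up):
        | 
        |     import hashlib
        | 
        |     # Digest published alongside the official release
        |     # (hypothetical value).
        |     PUBLISHED_SHA256 = "aa11..."
        | 
        |     def sha256_of(path):
        |         h = hashlib.sha256()
        |         with open(path, "rb") as f:
        |             for chunk in iter(lambda: f.read(1 << 20), b""):
        |                 h.update(chunk)
        |         return h.hexdigest()
        | 
        |     # After rebuilding from source, check that we got the same
        |     # artifact the project actually ships.
        |     assert sha256_of("my-rebuild.bin") == PUBLISHED_SHA256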
       | 
        | If we're going to be consistent in our definitions, we need to
        | understand what these things are, at least at a minimal level of
        | abstraction. And frankly, as an ML researcher, I just don't see
        | it.
       | 
        | That said, I'm generally fine with "source available" and, like
        | most people, use it synonymously with "open source". But if
        | you're going to go around telling everyone they're wrong about
        | the OSS definition, at least be consistent and stick to your
        | values.
       | 
       | [0] https://opensource.org/osd
       | 
        | [1] Businesses whose entire model depends on OSS (by the OSI's
        | definition) and freely available research
        
         | ensignavenger wrote:
         | "Reproducible build" is a term used to refer to getting an
         | exact binary match out of a build. This is outside the scope of
         | the OSD. I am not certain, but it sounds like this is what they
         | are talking about here. Just because you run the build yourself
         | doesn't mean you will get an exact match of what the original
         | producer built. Something as simple as a random number
         | generator or using a timestamp in the build will result in a
         | mismatch.
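          | 
          | A toy version of the timestamp case (Python standing in for a
          | build step):
          | 
          |     import hashlib
          |     import time
          | 
          |     def build():
          |         # A "build" that stamps the current time into the
          |         # artifact, the way __DATE__/__TIME__ macros or an
          |         # embedded version string often do.
          |         stamped = f"program v1.0 built at {time.time()}"
          |         return hashlib.sha256(stamped.encode()).hexdigest()
          | 
          |     print(build())
          |     time.sleep(1)
          |     print(build())  # same source, different digest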
        
       | wmf wrote:
        | Various organizations are willing to release open weights but not
        | weights that are open source according to this definition, so
        | this is going to be a no-op. Open source already existed before
        | the OSI codified it, but now they're trying to will open source
        | AI into existence against tons of incentives pointing the other
        | way.
        
       ___________________________________________________________________
       (page generated 2024-10-09 23:01 UTC)