[HN Gopher] Hello OLMo: A truly open LLM
       ___________________________________________________________________
        
       Hello OLMo: A truly open LLM
        
       Author : tosh
       Score  : 366 points
       Date   : 2024-04-08 22:26 UTC (1 day ago)
        
 (HTM) web link (blog.allenai.org)
 (TXT) w3m dump (blog.allenai.org)
        
       | blackeyeblitzar wrote:
       | This is the only LLM that is exciting to me. Clearly, LLMs are
       | powerful tools that may end up replacing search, and may even go
       | much further than simple search by performing the research for
       | you and producing final answers. Closed models like those from
       | OpenAI (ironically) or Anthropic cannot be audited. When most
       | users end up blindly hitting Microsoft's Copilot button, which
       | Microsoft is forcing OEMs to adopt, who's to say how the
       | information a user gets is being curated or manipulated by
       | OpenAI or Microsoft or whoever?
       | 
       | We've already seen real world examples of severe bias injected
       | into LLMs. For example, Google's Gemini had secret meta prompts
       | that biased it towards certain types of answers and also caused
       | it to produce hallucinated images that were funny but also
       | dystopian (https://arstechnica.com/information-
       | technology/2024/02/googl...). I don't think we can just let
       | closed AI systems take over society when they can easily be
       | manipulated by the model owners without transparency.
       | 
       | What I like about AI2's approach with OLMo is that they are
       | _actually open_, not just trading on the marketing benefits of
       | the word "open". Most "open" models are just open _weights_, not
       | open _source_. That's like sharing an executable and not the
       | source code. In my view, being open means that others have to be
       | able to reproduce the final product (the model) if they wanted
       | to and had the means (in terms of training hardware). It also
       | means that they should be able to use whatever is provided
       | freely for any purpose, rather than being subject to proprietary
       | licensing. AI2 shares the training source code, training data,
       | evaluation suite, and the model weights that they've produced by
       | running the training process, all under the Apache license. It's
       | also interesting that they used AMD hardware to train this LLM
       | rather than Nvidia/CUDA.
       | 
       | Open weight models like Llama keep catching up to the best
       | closed models from OpenAI or Anthropic or others. My hope is
       | that truly open models like OLMo keep developing quickly enough
       | to also keep up. Lastly, I hope that regulation does not block
       | open source private development of AI systems. These systems
       | will be the vehicle for speech for much of society in the
       | future, so blocking private AI systems is a lot like restricting
       | speech. But leaving that aside, open development will also drive
       | innovation, and reducing competitive pressure would hurt
       | innovation.
        
         | blackeyeblitzar wrote:
         | One thing I wanted to add and call attention to is the
         | importance of licensing in open models. This is often
         | overlooked when we blindly accept the vague branding of models
         | as "open", but I am noticing that many open weight models are
         | actually using encumbered proprietary licenses rather than
         | standard open source licenses that are OSI approved
         | (https://opensource.org/licenses). As an example, Databricks's
         | DBRX model has a proprietary license that forces adherence to
         | their highly restrictive Acceptable Use Policy _by referencing
         | a live website hosting their AUP_
         | (https://github.com/databricks/dbrx/blob/main/LICENSE), which
         | means as they change their AUP, you may be further restricted
         | in the future. Meta's Llama is similar
         | (https://github.com/meta-llama/llama/blob/main/LICENSE ). I'm
         | not sure who can depend on these models given this flaw.
        
           | idle_zealot wrote:
           | Do we even know if these licenses are binding? AFAIK we have
           | no ruling on whether model weights are even eligible for
           | copyright. They're machine-produced derivatives of other
           | work, so it's not a guarantee that copyright protects them.
        
             | blackeyeblitzar wrote:
             | That's a great point and I hope more people speak up to
             | treat models as just numerical derivative works so they
             | aren't automatically granted these protections. It's better
             | if society meaningfully debates this and chooses the right
             | approach.
        
         | gremlinunderway wrote:
         | > For example, Google's Gemini had secret meta prompts that
         | biased it towards certain types of answers and also caused it
         | to produce hallucinated images that were funny but also
         | dystopian (https://arstechnica.com/information-
         | technology/2024/02/googl...).
         | 
         | Such a bizarre take to call this "dystopian".
         | 
         | The model happened to create some out-there pictures. I mean,
         | it's no more outlandish than giant dragons and snakes and such
         | being generated, yet the thought of a person of color showing
         | up somewhere historically inaccurate triggers this massive
         | outcry about revisionism? Who cares?
         | 
         | Besides, the article identifies the probable goal, which was
         | to eliminate well-known biases in existing models (e.g. when
         | generating "angry person" you mainly got black people).
         | Clearly this one wasn't tuned well for that goal, but the
         | objective is not only noble but absolutely should be required
         | of anyone producing LLMs.
        
           | blackeyeblitzar wrote:
           | If I may explain: the dystopian part to me is the lack of
           | transparency around training code, training data sources,
           | tuning, meta prompting, and so forth. In Google's case,
           | they're a large corporation that controls how much of society
           | accesses information. If they're secretly curating what that
           | information is, rather than presenting it as neutrally as
           | they can, it does feel dystopian to me. I'd like transparency
           | as a consumer of information, so I know to the extent
           | possible, what the sources of information were or how I am
           | being manipulated by choices the humans building these
           | systems made.
           | 
           | I appreciate the issue you're drawing attention to in the
           | example you shared about images of an angry person. I think I
           | agree that focused tuning for situations like that might be
           | noble and I would be okay with a model correcting for that
           | specific example you shared. But I also struggle with how to
           | _clearly_ draw that line where such tuning may go too far,
           | which is why I favor less manual biasing. But I disagree that
           | such tuning should be _required_ , if you meant required _by
           | the law_. Like with speech or art in general, I think anyone
           | should be able to produce software systems that generate
           | controversial or offensive speech or art. Individual
           | consumers can choose what they want to interact with, and
           | reject LLMs that don't meet their personal standards.
        
           | lynx23 wrote:
           | Right, "who cares" about the truth in our dystopian world?
           | 1984 is apparently too long ago for people to remember the
           | ministry of truth...
        
         | simonw wrote:
         | Pet peeve: Google's Gemini LLM model was not to blame for the
         | image generation weirdness.
         | 
         | That would be like blaming DALL-E weirdness on GPT-4.
         | 
         | Unfortunately, Google marketing decided to slap the "Gemini"
         | brand on both the end-user interface used to interact with the
         | model AND the actual model itself, hence people constantly
         | calling out Gemini-the-model for weird decisions made as part
         | of Gemini-the-user-interface.
        
           | yk wrote:
           | Did anybody manage to get the entire prompt out of gemini, or
           | what are you basing your claim on?
        
             | simonw wrote:
             | That's my point. The system prompt isn't part of the model
             | - it's part of the UI system that wraps the model.
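             | 
             | A toy sketch of the distinction (nothing to do with
             | Google's actual stack; the names below are made up): the
             | model itself is just text in, text out, and the "system
             | prompt" is something the wrapping product layer prepends
             | before calling it.
             |     # stand-in for the LLM itself: pure text in, text out
             |     def model(prompt: str) -> str:
             |         return f"<completion of: {prompt!r}>"
             | 
             |     # the product layer wraps the model and prepends
             |     # instructions the user never sees
             |     def product_ui(user_message: str) -> str:
             |         system_prompt = "Hidden product-team instructions."
             |         return model(system_prompt + "\n\n" + user_message)
             | 
             |     print(product_ui("Draw a medieval king."))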
        
           | michaelt wrote:
           | _> That would be like blaming DALL-E weirdness on GPT-4._
           | 
           | Actually when you trigger DALL-E through GPT-4 (i.e. with the
           | LLM generating the prompt to give the diffusion model then
           | returning the resulting image to the user) the LLM's system
           | instructions [1] say _" 7. Diversify depictions of ALL images
           | with people to always include always DESCENT and GENDER for
           | EACH person using direct terms."_ and a bunch of stuff along
           | those lines.
           | 
           | In OpenAI's system this doesn't always trigger; if the user
           | asks for an image of trash being collected, the user hasn't
           | explicitly asked for any people to be depicted, so the LLM
           | doesn't find anything in the prompt that needs diversity
           | added. The trash-being-collected prompt gets passed to DALL-E
           | unmodified, and the resulting image has all male workers.
           | 
           | [1] https://raw.githubusercontent.com/spdustin/ChatGPT-
           | AutoExper...
        
             | simonw wrote:
             | Yeah, I wrote about that last year:
             | https://simonwillison.net/2023/Oct/26/add-a-
             | walrus/#diversif...
             | 
             | Again, that's not a GPT-4 thing: that's a ChatGPT interface
             | running GPT-4 with DALL-E as a tool thing.
        
           | espadrine wrote:
           | > _Google's Gemini LLM model was not to blame for the image
           | generation weirdness. That would be like blaming DALL-E
           | weirdness on GPT-4._
           | 
           | The way I read the Gemini technical report, it seemed like,
           | unlike GPT-4 vs DALL-E, Gemini was pretrained with multimodal
           | outputs. Is that not the case?
        
             | simonw wrote:
             | Is that right? I didn't think Gemini was generating images
             | directly, I assumed it was using a separate image
             | generation tool.
             | 
             | The paper here https://arxiv.org/pdf/2403.05530.pdf has a
             | model card for Gemini 1.5 Pro that says:
             | Output(s): Generated text in response to the input
             | (e.g., an answer to the question, a summary of
             | multiple documents, comparing documents/videos).
        
               | espadrine wrote:
               | Huh, that is true in both the model cards of Gemini 1.5
               | Pro and Gemini 1.0.
               | 
               | That feels like it runs counter to this statement from
               | the Gemini 1.0 technical report[0]:
               | 
               | > _Gemini models are trained to accommodate textual input
               | interleaved with a wide variety of audio and visual
               | inputs, such as natural images, charts, screenshots,
               | PDFs, and videos, and they can produce text and image
               | outputs_
               | 
               | [0]: https://arxiv.org/pdf/2312.11805.pdf
        
               | simonw wrote:
               | Yeah, what does that bit about "image outputs" mean, I
               | wonder?
        
         | theshackleford wrote:
         | > Open weight models like Llama keep repeatedly catching up to
         | the best closed models from OpenAI or Anthropic or others.
         | 
         | Since when? I've had the complete opposite experience.
        
       | timmg wrote:
       | Has their site been hugged-to-death or is it my hotel wifi?
        
       | Havoc wrote:
       | Notably "The Pile" doesn't seem to be part of the training data.
       | So this might be more sound legally than many other "open" LLMs
        
         | sgu999 wrote:
         | For those also wondering: https://pile.eleuther.ai
         | 
         | > The Pile is a 825 GiB diverse, open source language modelling
         | data set that consists of 22 smaller, high-quality datasets
         | combined together.
         | 
         | But what's the legal complication with it?
        
           | blackeyeblitzar wrote:
           | It received DMCA takedowns:
           | https://en.wikipedia.org/wiki/The_Pile_(dataset)
           | 
           | > The Books3 component of the dataset contains copyrighted
           | material compiled from Bibliotik, a pirate website. In July
           | 2023, the Rights Alliance took copies of The Pile down
           | through DMCA notices. Users responded by creating copies of
           | The Pile with the offending content removed.
        
           | simonw wrote:
           | It is absolutely packed with unlicensed, copyrighted data.
           | 
           | Books3 is the most notable example - nearly 200,000 pirated
           | ebooks - but a lot of the rest of it is (unlicensed) scraped
           | web data.
           | 
           | The legal questions over whether this is a problem are
           | currently still unresolved. Many people are also bothered by
           | the ethical implications, which is a separate issue from the
           | legal questions.
        
             | 23B1 wrote:
             | Ironic that even our everyday governance has little
             | 'Alignment' between ethics and law.
        
               | jacobn wrote:
               | Ethics are a lot more nuanced and change a lot faster
               | than laws.
               | 
               | Heck, a large fraction of ethics seem to be so fickle
               | that they're subject to potential revision by every
               | generation.
               | 
               | In fact, I'd argue that those revisions are a significant
               | portion of how one generation distinguishes itself from
               | their parents.
               | 
               | Yet strangely every generation feels like they have
               | arrived at a set of "universal laws" in their ethics.
        
               | KarlKemp wrote:
               | In this case, both ethics and the law are murky.
               | 
               | Pretty excellent alignment, for once?
        
               | ben_w wrote:
               | We wouldn't need lawyers if all the rules could be
               | expressed as "be ethical".
        
               | 23B1 wrote:
               | The lawyers certainly agree with you on that!
        
           | codazoda wrote:
           | I took a quick peek at this last time it was mentioned and it
           | had dozens of my own repos of unlicensed source code in it.
           | All of that was published on GitHub and made public, but much
           | of it has no license specified.
        
       | mysteria wrote:
       | Is this one of the first LLMs of note that was successfully
       | trained on AMD GPUs? I wonder how seamless the process was and if
       | they faced any issues there.
        
         | sanxiyn wrote:
         | Databricks (who also participated in OLMo; it's probably the
         | same codebase) trained on AMD before; see their 2023 post
         | https://www.databricks.com/blog/amd-mi250. It was probably
         | seamless, as any issues were fixed by Databricks in 2023.
        
         | otuutti wrote:
         | https://huggingface.co/LumiOpen/Poro-34B Also fully trained on
         | LUMI.
         | 
         | (more models here: https://huggingface.co/LumiOpen)
        
       | lostmsu wrote:
       | Too bad they did not put any comparison tables into the blog
       | post.
        
         | mysteria wrote:
         | They're on Hugging Face. Interestingly enough they don't
         | compare it against Mistral 7B.
         | 
         | https://huggingface.co/allenai/OLMo-7B
        
           | polygamous_bat wrote:
           | I commented this somewhere else, but word in the ether is
           | that OLMo is not actually that good of a model given its size
           | and compute budget. I am not entirely sure why, and it's
           | still good to have the full recipe for at least one model out
           | in the open, but the current OLMo definitely is a cautionary
           | tale for people training their own model.
        
       | refulgentis wrote:
       | This is 2 months old.
        
         | btbuildem wrote:
         | And yet it's topical and relevant.
        
       | timsuchanek wrote:
       | Great to see end-to-end openness. One of the only true OSS
       | models out there, vs. most models releasing only the binaries
       | (weights). Surprised that they didn't mention Mistral 7B in the
       | comparisons.
        
         | sanxiyn wrote:
         | Falcon also released an open dataset.
        
       | vjeux wrote:
       | If I read the license correctly, it seems that if you want to use
       | the LLM, you need to tell the authors what you are doing with it.
       | 
       | Am I reading this correctly? https://allenai.org/licenses/impact-
       | mr
       | 
       | "Derivative Impact Reports. AI2 seeks to encourage transparency
       | around Derivatives through the use of Derivative Impact Reports,
       | available here. Before releasing a Model Derivative or Data
       | Derivative, You will share with AI2 the intended use(s) of Your
       | Derivative by completing a Derivative Impact Report or otherwise
       | providing AI2 with substantially similar information in writing.
       | You agree that AI2 may publish, post, or make available such
       | information about Your Derivative for review by the general
       | public.
       | 
       | You will use good faith efforts to be transparent about the
       | intended use(s) of Your Derivatives by making the information
       | freely available to others who may access or use Your
       | Derivatives. You acknowledge that Derivative Impact Reports are
       | not intended to penalize any good faith disclosures about
       | Derivatives. Accordingly, if You initiate or participate in any
       | lawsuit or other legal action against a Third Party based on
       | information in such Third Party's Derivative Impact Report, then
       | this MR Agreement will terminate immediately as of the date such
       | lawsuit or legal action is filed or commenced."
        
         | blackeyeblitzar wrote:
         | Interesting. I recall seeing Apache licenses in their official
         | repositories. I wonder how these additional restrictions get
         | pulled in.
        
         | mkl wrote:
         | Does that apply to this model? On huggingface it says "License:
         | The code and model are released under Apache 2.0."
        
         | whimsicalism wrote:
         | No, this is Apache-licensed. Yes, it is confusing that AI2 has
         | custom licenses, but they aren't using them here.
        
           | lolinder wrote:
           | It looks like the weights [0] and code [1] are Apache
           | licensed, but the training data [2] is using the license that
           | OP is quoting from.
           | 
           | [0] https://huggingface.co/allenai/OLMo-7B
           | 
           | [1] https://github.com/allenai/OLMo
           | 
           | [2] https://huggingface.co/datasets/allenai/dolma
        
             | 6gvONxR4sf7o wrote:
             | Is the license not transitive? Like, could your impact
             | report be "I want to remove this part of the license"?
        
               | gardnr wrote:
               | I like the way you think but 2b might prevent that.
        
         | Chris2048 wrote:
         | > if You initiate or participate in any lawsuit or other legal
         | action ... this MR Agreement will terminate immediately
         | 
         | Is this legal? Restricting legal options by making an
         | agreement dependent on it?
        
         | jrm4 wrote:
         | Weird. So even if these things are well intentioned, seems like
         | they don't have any teeth.
         | 
         | Are there any out there that have licenses which are (dare I
         | say) simpler, like the GPL?
        
       | kikoreis wrote:
       | What does the risk classification applied to the dataset
       | actually mean? The licensing page [1] AI2 provides for their
       | datasets is really nice, but it doesn't really explain [2] what
       | risk means in this context.
       | 
       | Does it mean "risk that the items contained in this set are
       | licensed in a manner incompatible with their use in a training
       | dataset"?
       | 
       | [1] https://allenai.org/impact-license
       | 
       | [2] "the AI2 ImpACT Licenses are artifact-agnostic and are
       | instead structured according to the risk level we've assigned a
       | given artifact"
        
       | pksebben wrote:
       | It's odd. Running inference on this (and other models in its
       | class), I keep running into a "repeating token" situation with
       | moderate-to-long context windows.
       | 
       | It feels almost as if, during inference, the model hits some
       | kind of local minimum that it careens around, and while
       | temperature _seems_ to affect this, it doesn't really _fix_ it.
       | 
       | at temp 0.2:
       | 
       | > [{'generated_text': 'What follows is a transcript of a talk
       | between a mysterious man and an agent of a bureau dedicated to
       | investigating things which is typically referred to by some
       | assortment of letters in the alphabet. The identity, origins, and
       | motivations of the man were not known then and remain so. This
       | transcript is not meant to scare, but provided simply to
       | enlighten the concerned citizen of all the various and sundry
       | things that may or may not go bump in the night. AGENT: Please
       | state your name for the record. MYSTERIOUS STRANGER: I am the
       | man. AGENT: Thank you. I am an agent of the Bureau of
       | Investigation. I am here to investigate the following: 1. The
       | following: 2. The following: 3. The following: 4. The following:
       | 5. The following: 6. The following: 7. The following: 8. The
       | following: 9. The following: 10. The following: 11. The
       | following: 12. The following: 13. The following: 14. The
       | following: 15. The following: 16. The following: 17. The
       | following: 18. The following: 19. The following: 20. The
       | following: 21. The following: 22. The following: 23. The
       | following: 24. The following'}]
       | 
       | ...and at temp 0.4:
       | 
       | > [{'generated_text': 'What follows is a transcript of a talk
       | between a mysterious man and an agent of a bureau dedicated to
       | investigating things which is typically referred to by some
       | assortment of letters in the alphabet. The identity, origins, and
       | motivations of the man were not known then and remain so. This
       | transcript is not meant to scare, but provided simply to
       | enlighten the concerned citizen of all the various and sundry
       | things that may or may not go bump in the night. AGENT: Please
       | state your name for the record. MYSTERIOUS STRANGER: My name is
       | not important. AGENT: My name is Agent Cyanide. MYSTERIOUS
       | STRANGER: Agent Cyanide. AGENT: I am an agent of the Bureau of
       | Investigations. MYSTERIOUS STRANGER: The Bureau of
       | Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS
       | STRANGER: The Bureau of Investigations. AGENT: The Bureau of
       | Investigations. MYSTERIOUS STRANGER: The Bureau of
       | Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS
       | STRANGER: The Bureau of Investigations. AGENT: The Bureau of
       | Investigations. MYSTERIOUS STRANGER: The Bureau of
       | Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS
       | STRANGER: The Bureau of Investigations'}]
        
         | pksebben wrote:
         | ... this can get a little goofy even with do_sample=False and
         | no temp:
         | 
         | | [{'generated_text': "DAUGHTER: tell me a story FATHER: but
         | it's late DAUGHTER: please? FATHER: okay, once upon a time
         | there was a little girl who lived in a little house with her
         | mother and father and her brother and sister and her dog and
         | her cat and her hamster and her fish and her bird and her
         | rabbit and her horse and her cow and her sheep and her goat and
         | her pig and her chicken and her duck and her turkey and her
         | goose and her llama and her alpaca and her camel and her zebra
         | and her giraffe and her elephant and her hippopotamus and her
         | rhinoceros and her kangaroo and her koala and her panda and her
         | bear and her wolf and her fox and her cat and her dog and her
         | bird and her fish and her hamster and her cat and her dog and
         | her bird and her fish and her hamster and her cat and her dog
         | and her bird and her fish and her hamster and her cat and her
         | dog and her bird and her fish and her hamster and"}]
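         | 
         | For what it's worth, a rough sketch of one way to damp loops
         | like this, assuming the allenai/OLMo-7B checkpoint loads
         | through the standard transformers text-generation pipeline (at
         | release it also needed trust_remote_code); the penalty values
         | below are illustrative guesses, not anything AI2 recommends:
         |     from transformers import pipeline
         | 
         |     generator = pipeline(
         |         "text-generation",
         |         model="allenai/OLMo-7B",
         |         trust_remote_code=True,
         |     )
         | 
         |     out = generator(
         |         "AGENT: Please state your name for the record.",
         |         max_new_tokens=200,
         |         do_sample=True,
         |         temperature=0.7,
         |         repetition_penalty=1.2,   # penalize repeated tokens
         |         no_repeat_ngram_size=4,   # block 4-gram repeats
         |     )
         |     print(out[0]["generated_text"])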
        
           | gpderetta wrote:
           | That seems like the perfect story to put a little child to
           | bed :D.
           | 
           | I have used a similar recursive story in the past. My son
           | still jokes about it.
        
             | fho wrote:
             | There actually was a podcast around that concept when (I
             | think) GPT2 was current.
             | 
             | Basically one generated story per day. Absurd in places.
        
         | polygamous_bat wrote:
         | From what I heard through the grapevine, OLMo is not nearly the
         | best model for its size or compute budget. Apparently something
         | didn't quite go right and AI2 didn't have the money to train
         | until they got it right.
        
       | ein0p wrote:
       | Seems to be surprisingly fast at smaller sizes, too.
        
       | wg0 wrote:
       | The hype around LLMs won't last past 2030, I suppose. LLMs are
       | statistical inference soup that goes stale like stagnant pond
       | water, becoming less accurate with each passing day.
       | 
       | I am curious how long the hype wave lasts. One I have recently
       | seen was K8s. It settled down and won, TBH.
        
         | Grimblewald wrote:
         | I think the hype dies down and they'll become part of a bigger
         | thing, like dense neural networks did.
        
         | michaelmior wrote:
         | The transformer architecture probably won't last and we might
         | start calling them something else, but I can't see something
         | that could reasonably be called an LLM going away any time
         | soon.
        
       | margorczynski wrote:
       | > 1. No biases. Following LLaMA, PaLM, and others, we exclude all
       | bias terms from our architecture in order to improve training
       | stability.
       | 
       | What does this mean? What is a "bias term"?
        
         | polygamous_bat wrote:
         | Think of the term b in y = Wx + b. W is called the weight, b
         | is called the bias.
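         | 
         | A minimal PyTorch sketch of the difference (illustrative only,
         | not the OLMo code; the layer sizes are arbitrary):
         |     import torch
         |     import torch.nn as nn
         | 
         |     # y = Wx + b: the matrix W is the weight, the constant
         |     # offset b is the bias. "No biases" means every layer is
         |     # built like the second one, computing just y = Wx.
         |     with_bias = nn.Linear(4, 2)                 # has .bias
         |     without_bias = nn.Linear(4, 2, bias=False)  # no bias term
         | 
         |     x = torch.randn(1, 4)
         |     print(with_bias(x))     # Wx + b
         |     print(without_bias(x))  # Wx only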
        
       | arcza wrote:
       | sToP bLOgGinG wITh Medium!
        
         | egKYzyXeIL wrote:
         | Why shouldn't people use Medium? I'm probably out of the loop.
        
           | arcza wrote:
           | The nags, the dark patterns, the horrific UI, the soft
           | paywalls, and the tracking, to name a few reasons
        
           | gadflyinyoureye wrote:
           | They often require logging in to see the whole article.
           | Later they cap your access to articles at N per some period
           | of time. The only way around that is to purchase a
           | subscription. Given Medium's weak offering, it's seldom
           | worth the monthly cost of a subscription for the few jewels
           | that might appear.
        
         | barfbagginus wrote:
         | sToP bLOgGinG wITh Medium!
        
         | flotzam wrote:
         | https://scribe.rip/hello-olmo-a-truly-open-llm-43f7e7359222
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:02 UTC)