[HN Gopher] Playing with Open Source LLMs
       ___________________________________________________________________
        
       Playing with Open Source LLMs
        
       Author : zer0tonin
       Score  : 53 points
       Date   : 2025-07-29 18:36 UTC (4 hours ago)
        
 (HTM) web link (alicegg.tech)
 (TXT) w3m dump (alicegg.tech)
        
       | lrvick wrote:
       | Can we all please stop confusing freeware with Open Source?
       | 
        | If something cannot be reproduced from sources which are all
        | distributed under an OSI license, it is not Open Source.
       | 
       | Non public sources of unknown license -> Closed source /
       | Proprietary
       | 
       | No training code, no training sources -> Closed source /
       | Proprietary
       | 
       | OSI public source code -> Open Source / Free Software
       | 
       | These terms are very well defined. https://opensource.org/osd
        
         | laughingcurve wrote:
         | Glad someone else said all this so I did not have to. Hats off
         | to you
        
           | lrvick wrote:
            | The older I get the more I fear I am turning into Richard
            | Stallman, but it is absolutely offensive to continually see
            | corporate proprietary freeware binary blobs, built from
            | unlicensed sources, confused with Free Open Source
            | Software, which is a dramatically higher bar of investment
            | in the public good.
        
             | laughingcurve wrote:
             | Don't worry, until you're chewing on your foot mid lecture,
             | you're good
        
             | thewebguyd wrote:
             | > The older I get the more I fear I am turning into Richard
             | Stallman
             | 
              | You and me both. I always preferred and promoted FLOSS
              | where possible, while keeping a somewhat pragmatic
              | approach, but the older I get the more I just want to
              | rip out everything that isn't free (as in freedom) from
              | my life, and/or just go become a goat farmer.
             | 
             | Stallman was right from the beginning, and big tech have
             | proven over and over again that they are incapable of being
             | good citizens.
             | 
             | I'm probably a few more years away from "I'd like to
             | interject for a moment..."
        
         | diggan wrote:
         | The worst part is that they even lack internal consistency, so
         | they (someone at least) know it's wrong, yet they persist
         | regardless.
         | 
         | > https://www.llama.com/ - "Industry Leading, Open-Source AI"
         | 
         | > https://www.llama.com/llama4/license/ - "Llama Materials"
         | means, collectively, Meta's proprietary Llama 4
         | 
         | Either the team that built the landing page (Marketing dept?)
         | is wrong, or the legal department is wrong. I'm pretty sure I
         | know who I'd bet on to be more correct.
        
           | lrvick wrote:
           | Meta is indeed leading the gaslighting efforts to convince
           | the press and masses that binary blobs provably built from
           | pirated sources are actually Open Source.
           | 
           | The sad part is it is working. It is almost like Meta is
           | especially skilled at mass public manipulation.
        
         | oddb0d wrote:
         | > OSI public source code -> Open Source / Free Software
         | 
         | Can we all please stop confusing Free/Libre Open Source with
         | Open Source?
         | 
         | https://www.gnu.org/philosophy/open-source-misses-the-point....
         | 
          | Maybe if we'd focused on communicating the ethics, the world
          | wouldn't be so unaware of the differences.
        
           | lrvick wrote:
            | Are there any instances of OSI-licensed code that are not
            | Free Software, making my statement here invalid?
            | 
            | I was attempting to convey that when software is called
            | Open Source and actually is based on OSI-licensed sources,
            | people are likely talking about Free Software.
        
             | oddb0d wrote:
             | The last time I heard a comment along those lines I was
             | attending a session by an Open Source person and up on
             | screen they had a picture of RMS dressed as Che Guevara.
             | 
             | All those silly ethics, they get in the way of the real
             | work!
        
           | jrm4 wrote:
           | Honestly, I think "the framers" got it right here.
           | 
           | Too much communicating of the ethics would have bogged down
           | the useful legal work.
           | 
            | My take is, Free Software actually _won_ and we're in a
            | post-that world.
        
             | oddb0d wrote:
              | I'm not sure I fully understand - whilst I agree there's
              | been useful legal work, we now have such a plethora of
              | licenses that I ended up having to fall back on what I'd
              | call basic common sense when someone suggested using a
              | highly restrictive "community" license with ridiculous
              | terms, such as saying you can't use it in a particular
              | industry because that industry is "bad".
             | 
              | The reason Free/Libre Open Source Software wins - and
              | always will in the long run - is that the four freedoms
              | are super-simple and reflect how the natural world works.
        
         | j-bos wrote:
         | I mean the article did mention some legitimately open source
         | models.
        
           | lrvick wrote:
           | Which would have been great to distinguish from the ones that
           | certainly are not.
        
         | jrm4 wrote:
         | No, because "open source" itself was never clear enough to
         | carry that weight.
         | 
         | That's why we keep being annoying about "Free Software."
        
         | simonw wrote:
          | The Open Source Initiative themselves decided last year to
          | relax their standards for AI models: they don't require the
          | training data to be released. https://opensource.org/ai
         | 
         | They do continue to require the core freedoms, most importantly
         | "Use the system for any purpose and without having to ask for
         | permission". That's why a lot of the custom licenses (Llama
         | etc) don't fit the OSI definition.
        
           | amelius wrote:
           | I don't agree with that definition. For a given model I want
           | to know what I can/cannot expect from it. To have a better
           | understanding of that, I need to know what it was trained on.
           | 
           | For a (somewhat extreme) example, what if I use the model to
           | write children's stories, and suddenly it regurgitates Mein
           | Kampf? That would certainly ruin the day.
        
             | echelon wrote:
             | Too bad. The OSI owns "open source".
             | 
             | Big tech has been abusing open source to cheaply capture
             | most of the internet and e-commerce anyway, so perhaps it's
             | time we walked away from the term altogether.
             | 
             | The OSI has abdicated the future of open machine learning.
             | And that's fine. We don't need them.
             | 
             | "Free software" is still a thing and it means a very
             | specific and narrow set of criteria. [1, 2]
             | 
             | There's also "Fair software" [3], which walks the line
             | between CC BY-NC-SA and shareware, but also sticks it to
             | big tech by preventing Redis/Elasticsearch capture by the
             | hyperscalers. There's an open game engine [4] that has a
             | pretty nice "Apache + NC" type license.
             | 
             | ---
             | 
              | Back on the main topic of "open machine learning": since
              | the OSI fucked up, I came up with a ten-point scale here
              | [5] defining open AI models. It's just a draft, but if
              | other people agree with the idea, I'll publish a website
              | about it (so I'd appreciate your feedback!)
             | 
             | There are ten measures by which a model can/should be open:
             | 
             | 1. The model code (pytorch, whatever)
             | 
             | 2. The pre-training code
             | 
             | 3. The fine-tuning code (which might be very different from
             | the pre-training code)
             | 
             | 4. The inference code
             | 
             | 5. The raw training data (pre-training + fine-tuning)
             | 
             | 6. The processed training data (which might vary across
             | various stages of pre-training and fine-tuning: different
             | sizes, features, batches, etc.)
             | 
             | 7. The resultant weights blob(s)
             | 
             | 8. The inference inputs and outputs (which also need a
             | license; see also usage limits like O-RAIL)
             | 
             | 9. The research paper(s) (hopefully the model is also
             | described and characterized in the literature!)
             | 
             | 10. The patents (or lack thereof)
             | 
             | A good open model will have nearly all of these made
             | available. A fake "open" model might only give you two of
             | ten.
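              | 
              | (A hypothetical sketch of using the scale: the criteria
              | names and the example release below are made up purely
              | for illustration, not a finished spec.)
              | 
              |     # Score a release against the ten criteria above.
              |     CRITERIA = [
              |         "model code", "pre-training code",
              |         "fine-tuning code", "inference code",
              |         "raw data", "processed data", "weights",
              |         "inference I/O license", "papers", "patents",
              |     ]
              | 
              |     def openness_score(released):
              |         hits = [c for c in CRITERIA if c in released]
              |         return f"{len(hits)}/10: " + ", ".join(hits)
              | 
              |     # A fake "open" model: two of ten.
              |     print(openness_score({"weights", "inference code"}))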
             | 
             | ---
             | 
             | [1] https://www.fsf.org/
             | 
             | [2] https://en.wikipedia.org/wiki/Free_software
             | 
             | [3] https://fair.io/
             | 
             | [4] https://defold.com/license/
             | 
             | [5] https://news.ycombinator.com/item?id=44438329
        
             | senko wrote:
              | Are you going to examine a few petabytes of data for
              | each model you want to run, to check if a random
              | paragraph from Mein Kampf is in there? How?
             | 
              | We need better tools to examine the weights (what gets
              | activated to what extent for which topics, for example).
              | Getting the full training corpus, while nice, cannot be
              | our only choice.
        
               | amelius wrote:
               | > Are you going to examine a few petabytes of data for
               | each model (...) How?
               | 
                | I can think of a few ways. Perhaps I'd use an LLM to
                | find objectionable content. But anyway, it's the same
                | argument you can make against e.g. the Linux kernel:
                | are you going to read every line of code to see if it
                | is secure? Maybe, or maybe not, but that is not the
                | point.
                | 
                | The point is that right now a model is a black box. It
                | might as well be a Trojan horse.
        
           | thewebguyd wrote:
            | > The Open Source Initiative themselves decided last year
            | to relax their standards for AI models: they don't require
            | the training data to be released.
           | 
            | Poor move IMO. Releasing the training data should be a
            | requirement for a model to be considered open source.
            | Without it, all I can do is tweak the weights. I can't
            | truly reproduce the model, inspect the data for biases,
            | audit the model for fairness, or make improvements and
            | redistribute them (a core open source ethos).
            | 
            | Keeping the training data closed means it's not truly open.
        
             | simonw wrote:
             | Their justification for this was that, for many
             | consequential models, releasing the training data just
             | isn't possible.
             | 
             | Obviously the biggest example here is all of that training
             | data which was scraped from the public web (or worse) and
             | cannot be relicensed because the model producers do not
             | have permission to relicense it.
             | 
              | There are other factors too though. A big one is things
              | like health data - if you train a model that can e.g.
              | visually detect cancer cells, you want to be able to
              | release that model without having to release the private
              | health scans it was trained on.
             | 
             | See their FAQ item: Why do you allow the exclusion of some
             | training data? https://opensource.org/ai/faq#why-do-you-
             | allow-the-exclusion...
        
               | actionfromafar wrote:
               | Wouldn't it great though if it was public knowledge
               | exactly on what they were trained on and how, even though
               | the data itself can freely copied?
        
             | tbrownaw wrote:
             | > _Poor move IMO. Training data should be required to be
             | released to be considered an open source model._
             | 
              | The actual poor move is trying to fit the term "open
              | source" onto AI models at all, rather than coining new
              | terms that actually match how models are developed.
        
         | tbrownaw wrote:
          | > _These terms are very well defined.
          | https://opensource.org/osd_
         | 
         | Yes. And you're using them wrong.
         | 
         | From the OSD:
         | 
          | > _The source code must be the preferred form in which a
          | programmer would modify the program._
         | 
         | So, what's the preferred way to modify a model? You get the
         | weights and then run fine-tuning with a relatively small amount
         | of data. Which is way cheaper than re-training the entire thing
         | from scratch.
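          | 
          | As a rough sketch of that workflow (using the Hugging Face
          | peft library; the model name, rank, and target modules are
          | placeholder assumptions, not a recipe):
          | 
          |     # LoRA fine-tuning: train small adapter matrices
          |     # instead of re-training the whole base model.
          |     from transformers import AutoModelForCausalLM
          |     from peft import LoraConfig, get_peft_model
          | 
          |     base = AutoModelForCausalLM.from_pretrained(
          |         "meta-llama/Llama-3.1-8B")  # placeholder model
          |     cfg = LoraConfig(r=8, lora_alpha=16,
          |                      target_modules=["q_proj", "v_proj"])
          |     model = get_peft_model(base, cfg)
          |     model.print_trainable_parameters()  # tiny fraction
          |     # ...then a standard Trainer loop over a small
          |     # dataset performs the "modification".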
         | 
         | ---
         | 
          | The issue is that normal software offers no way to modify
          | the binary artifacts short of completely recreating them,
          | while AI models not only offer such a way but come with a
          | large cost difference between the two. The development
          | lifecycle has nodes that don't exist for normal software.
         | 
         | Which means that really, AI models need their own different
         | terminology that matches that difference. Say, open-weights and
         | open-data or something.
         | 
         | Kinda like how Creative Commons is a thing because software
         | development lifecycle concepts don't map very well to
         | literature or artwork either.
        
       | channel_t wrote:
       | This is definitely a useful exercise worth going through for the
       | educational value before eventually surrendering and just using
       | the big models owned by "unprofitable companies."
        
         | lrvick wrote:
         | Never accepted the terms of service of any proprietary models-
         | as-a-service providers and never will.
         | 
         | Be one of the few humans still pretty good at using their own
         | brains for those problems LLMs can't solve, and you will be
         | very employable.
        
           | mattmanser wrote:
           | If you don't find out what the models can do, how can you
           | know what problems they can't solve?
        
       | simonw wrote:
       | > On paper, this looks like a success. In practice, the time
       | spent crafting a prompt, waiting for the AI to run and fixing the
       | small issue that came up immensely exceeds the 10 minutes it
       | would have taken me to edit the file myself. I don't think coding
       | that way would lead me to a massive performance improvement for
       | now.
       | 
        | The models used in this experiment - deepseek-r1:8b,
        | mistral:7b, qwen3:8b - are _tiny_. It's honestly a miracle
        | that they produce anything that looks like working code at
        | all!
       | 
       | I'm not surprised that the conclusion was that writing without
       | LLM assistance would be more productive in this case.
        
         | mattmanser wrote:
          | Yeah, the truth is avoiding the big players is silly right
          | now. It's not that the small models won't eventually work,
          | either; we have no idea how far they can be compressed in
          | future, especially with people trying to get the
          | mixture-of-experts approach working.
          | 
          | Right now, you need the bigger models for good responses,
          | but in a year's time?
         | 
         | So the whole exercise was a bit of a waste of his time, the
         | present target moves too quickly. This isn't a time to be
         | clutching your pearls about running your own models unless you
         | want to do something shady with AI.
         | 
          | And just as video streaming was advanced by the porn
          | industry, a lot of people are watching the, um, "thirsty" AI
          | enthusiasts for the big advances in small models.
        
       | roadside_picnic wrote:
       | While I'm always happy to see more people using open models, I
       | was hoping the "playing" would be a bit more about actually
       | interacting with the models themselves, rather than just running
       | them.
       | 
       | For anyone interested in playing around with the _internals_ of
       | LLMs without needing to worry about having the hardware to train
        | locally, a couple of projects I've found really fun and
       | educational:
       | 
        | - Implement speculative decoding for two different-sized
        | models that share a tokenizer [0]
       | 
        | - Enforce structured outputs through _constrained decoding_ (a
        | great way to dive deeper into regex parsing as well).
       | 
        | - Create a novel sampler using entropy or other information
        | about token probabilities (a rough sketch follows this list)
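        | 
        | On that last idea, a minimal sketch of what an entropy-aware
        | sampler might look like (numpy only; the thresholds and
        | temperature schedule are arbitrary assumptions):
        | 
        |     import numpy as np
        | 
        |     def entropy_sampler(logits, low=1.0, high=4.0):
        |         # Softmax over the raw logits.
        |         p = np.exp(logits - logits.max())
        |         p /= p.sum()
        |         # Shannon entropy of the next-token distribution.
        |         h = -np.sum(p * np.log(p + 1e-12))
        |         if h < low:
        |             return int(np.argmax(p))  # confident: greedy
        |         # Uncertain: raise temperature with entropy.
        |         t = 1.0 + (h - low) / (high - low)
        |         q = np.exp((logits - logits.max()) / t)
        |         q /= q.sum()
        |         return int(np.random.choice(len(q), p=q))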
       | 
        | The real value of open LLMs, at least for me, has been that
        | they _aren't_ black boxes; you can open them up and take a
        | look inside. For all the AI hype, it's a bit of a shame that
        | so few people seem to really be messing around with the
        | insides of LLMs.
       | 
       | 0. https://arxiv.org/pdf/2211.17192
        
       | 44za12 wrote:
       | "I used all the top tier 'Open Source LLMs' and they suck I was
       | right, like always, LLMs suck so hard, my job is safe!"
        
       | runningmike wrote:
        | Nice writeup and status update on the use of FOSS ML things.
        | Saves me a lot of time!
        
       | BoorishBears wrote:
        | Deepseek R1 8B isn't famous for anything (except maybe being
        | confused with Deepseek R1), and isn't by Deepseek any more
        | than my finetuning Llama makes me the creator of Llama.
        
       | v3ss0n wrote:
        | Those LLM influencers don't know what a distill is. Deepseek
        | R1 8B IS A DISTILLED Qwen2. You should be using Qwen3 8B-14B
        | instead; it's a lot better.
        
       ___________________________________________________________________
       (page generated 2025-07-29 23:01 UTC)