[HN Gopher] Playing with Open Source LLMs
___________________________________________________________________
Playing with Open Source LLMs
Author : zer0tonin
Score : 53 points
Date : 2025-07-29 18:36 UTC (4 hours ago)
(HTM) web link (alicegg.tech)
(TXT) w3m dump (alicegg.tech)
| lrvick wrote:
| Can we all please stop confusing freeware with Open Source?
|
| If something cannot be reproduced from sources that are all
| distributed under an OSI license, it is not Open Source.
|
| Non-public sources of unknown license -> Closed source /
| Proprietary
|
| No training code, no training sources -> Closed source /
| Proprietary
|
| OSI public source code -> Open Source / Free Software
|
| These terms are very well defined. https://opensource.org/osd
| laughingcurve wrote:
| Glad someone else said all this so I did not have to. Hats off
| to you
| lrvick wrote:
| The older I get, the more I fear I am turning into Richard
| Stallman, but it is absolutely offensive to continually see
| corporate proprietary freeware binary blobs, built from
| unlicensed sources, confused with Free Open Source Software,
| which is a dramatically higher bar of investment in the
| public good.
| laughingcurve wrote:
| Don't worry: until you're chewing on your foot mid-lecture,
| you're good.
| thewebguyd wrote:
| > The older I get the more I fear I am turning into Richard
| Stallman
|
| You and me both. I always preferred and promoted FLOSS
| where possible, but still had a bit of a pragmatic
| approach. Now the older I get, the more I just want to rip
| out everything not free (as in freedom) from my life,
| and/or just go become a goat farmer.
|
| Stallman was right from the beginning, and big tech have
| proven over and over again that they are incapable of being
| good citizens.
|
| I'm probably a few more years away from "I'd like to
| interject for a moment..."
| diggan wrote:
| The worst part is that they lack even internal consistency,
| so someone there at least knows it's wrong, yet they persist
| regardless.
|
| > https://www.llama.com/ - "Industry Leading, Open-Source AI"
|
| > https://www.llama.com/llama4/license/ - "Llama Materials"
| means, collectively, Meta's proprietary Llama 4
|
| Either the team that built the landing page (Marketing dept?)
| is wrong, or the legal department is wrong. I'm pretty sure I
| know who I'd bet on to be more correct.
| lrvick wrote:
| Meta is indeed leading the gaslighting efforts to convince
| the press and masses that binary blobs provably built from
| pirated sources are actually Open Source.
|
| The sad part is it is working. It is almost like Meta is
| especially skilled at mass public manipulation.
| oddb0d wrote:
| > OSI public source code -> Open Source / Free Software
|
| Can we all please stop confusing Free/Libre Open Source with
| Open Source?
|
| https://www.gnu.org/philosophy/open-source-misses-the-point....
|
| Maybe if we'd focused on communicating the ethics, the world
| wouldn't be so unaware of the differences.
| lrvick wrote:
| Are there any instances of OSI-licensed code that are not
| Free Software, making my statement here invalid?
|
| My point was that when software is called Open Source and
| actually is based on OSI-licensed sources, people are
| likely talking about Free Software.
| oddb0d wrote:
| The last time I heard a comment along those lines I was
| attending a session by an Open Source person and up on
| screen they had a picture of RMS dressed as Che Guevara.
|
| All those silly ethics, they get in the way of the real
| work!
| jrm4 wrote:
| Honestly, I think "the framers" got it right here.
|
| Too much communicating of the ethics would have bogged down
| the useful legal work.
|
| My take is, Free Software actually _won_ and we're in a
| post-that world.
| oddb0d wrote:
| I'm not sure I fully understand. Whilst I agree there's
| been useful legal work, we now have such a plethora of
| licenses that I ended up having to back what I'd call basic
| common sense when someone was suggesting a highly
| restrictive "community" license with ridiculous intents,
| such as saying you can't use it in a particular industry
| because that industry is "bad".
|
| The reason Free/Libre Open Source Software wins - and
| always will in the long run - is that the four freedoms
| are super-simple and they reflect how the natural world
| works.
| j-bos wrote:
| I mean, the article did mention some legitimately open source
| models.
| lrvick wrote:
| Which would have been great to distinguish from the ones that
| certainly are not.
| jrm4 wrote:
| No, because "open source" itself was never clear enough to
| carry that weight.
|
| That's why we keep being annoying about "Free Software."
| simonw wrote:
| The Open Source Initiative themselves decided last year to
| relax their standards for AI models: they don't require the
| training data to be released. https://opensource.org/ai
|
| They do continue to require the core freedoms, most importantly
| "Use the system for any purpose and without having to ask for
| permission". That's why a lot of the custom licenses (Llama
| etc) don't fit the OSI definition.
| amelius wrote:
| I don't agree with that definition. For a given model I want
| to know what I can/cannot expect from it. To have a better
| understanding of that, I need to know what it was trained on.
|
| For a (somewhat extreme) example, what if I use the model to
| write children's stories, and suddenly it regurgitates Mein
| Kampf? That would certainly ruin the day.
| echelon wrote:
| Too bad. The OSI owns "open source".
|
| Big tech has been abusing open source to cheaply capture
| most of the internet and e-commerce anyway, so perhaps it's
| time we walked away from the term altogether.
|
| The OSI has abdicated the future of open machine learning.
| And that's fine. We don't need them.
|
| "Free software" is still a thing and it means a very
| specific and narrow set of criteria. [1, 2]
|
| There's also "Fair software" [3], which walks the line
| between CC BY-NC-SA and shareware, but also sticks it to
| big tech by preventing Redis/Elasticsearch-style capture by
| the hyperscalers. There's an open game engine [4] that has
| a pretty nice "Apache + NC" type license.
|
| ---
|
| Back on the main topic of "open machine learning": since
| the OSI fucked up, I came up with a ten-point scale here
| [5] defining open AI models. It's just a draft, but if
| other people agree with the idea, I'll publish a website
| about it (so I'd appreciate your feedback!)
|
| There are ten measures by which a model can/should be open:
|
| 1. The model code (pytorch, whatever)
|
| 2. The pre-training code
|
| 3. The fine-tuning code (which might be very different from
| the pre-training code)
|
| 4. The inference code
|
| 5. The raw training data (pre-training + fine-tuning)
|
| 6. The processed training data (which might vary across
| various stages of pre-training and fine-tuning: different
| sizes, features, batches, etc.)
|
| 7. The resultant weights blob(s)
|
| 8. The inference inputs and outputs (which also need a
| license; see also usage limits like O-RAIL)
|
| 9. The research paper(s) (hopefully the model is also
| described and characterized in the literature!)
|
| 10. The patents (or lack thereof)
|
| A good open model will have nearly all of these made
| available. A fake "open" model might only give you two of
| ten.
|
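| To make the scale concrete, a scorecard could be as simple
| as the sketch below (the class and field names are
| hypothetical, nothing official, just one boolean per point):
|
|     from dataclasses import dataclass, fields
|
|     @dataclass
|     class ModelOpenness:
|         """One boolean per point of the draft ten-point scale."""
|         model_code: bool = False               # 1
|         pretraining_code: bool = False         # 2
|         finetuning_code: bool = False          # 3
|         inference_code: bool = False           # 4
|         raw_training_data: bool = False        # 5
|         processed_training_data: bool = False  # 6
|         weights: bool = False                  # 7
|         io_license: bool = False               # 8
|         papers: bool = False                   # 9
|         patent_grant: bool = False             # 10
|
|         def score(self) -> int:
|             # Count how many of the ten items were released.
|             return sum(getattr(self, f.name) for f in fields(self))
|
|     # A typical weights-plus-paper "open" release scores 2/10.
|     print(ModelOpenness(weights=True, papers=True).score())  # 2
|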
| ---
|
| [1] https://www.fsf.org/
|
| [2] https://en.wikipedia.org/wiki/Free_software
|
| [3] https://fair.io/
|
| [4] https://defold.com/license/
|
| [5] https://news.ycombinator.com/item?id=44438329
| senko wrote:
| Are you going to examine a few petabytes of data for each
| model you want to run, to check if a random paragraph from
| Mein Kampf is in there? How?
|
| We need better tools to examine the weights (what gets
| activated to which extent for which topics, for example).
| Getting the full training corpus, while nice, cannot be our
| only choice.
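|
| Even a crude version of the weight-examining idea is
| possible today with plain forward hooks; a toy sketch (the
| layer choice and model are arbitrary examples, not a
| serious audit):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
|
|     # Record how hard one MLP layer fires for different inputs.
|     acts = {}
|     model.transformer.h[6].mlp.register_forward_hook(
|         lambda mod, inp, out: acts.update(mlp6=out.detach()))
|
|     with torch.no_grad():
|         for text in ["a children's bedtime story", "a political rant"]:
|             model(**tok(text, return_tensors="pt"))
|             print(text, "->", acts["mlp6"].norm().item())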
| amelius wrote:
| > Are you going to examine a few petabytes of data for
| each model (...) How?
|
| I can think of a few ways. Perhaps I'd use an LLM to find
| objectionable content. But anyway, it's the same argument
| you could make against e.g. the Linux kernel. Are you going
| to read every line of code to see if it is secure? Maybe,
| or maybe not, but that is not the point.
|
| The point is that, right now, a model is a black box. It
| might as well be a Trojan horse.
| thewebguyd wrote:
| > The Open Source Initiative themselves decided a last year
| to relax their standards for AI models: they don't require
| the training data to be released.
|
| Poor move IMO. Training data should have to be released
| for a model to be considered open source. Without it, all
| I can do is tweak the weights; I can't truly reproduce the
| model, inspect the data for biases, audit the model for
| fairness, or make improvements and redistribute them (a
| core open source ethos).
|
| Keeping the training data closed means it's not truly open.
| simonw wrote:
| Their justification for this was that, for many
| consequential models, releasing the training data just
| isn't possible.
|
| Obviously the biggest example here is all of that training
| data which was scraped from the public web (or worse) and
| cannot be relicensed because the model producers do not
| have permission to relicense it.
|
| There are other factors too, though. A big one is things
| like health data - if you train a model that can e.g.
| visually detect cancer cells, you want to be able to
| release that model without having to release the private
| health scans it was trained on.
|
| See their FAQ item: Why do you allow the exclusion of some
| training data? https://opensource.org/ai/faq#why-do-you-
| allow-the-exclusion...
| actionfromafar wrote:
| Wouldn't it be great, though, if it were public knowledge
| exactly what they were trained on and how, even if the
| data itself can't be freely copied?
| tbrownaw wrote:
| > _Poor move IMO. Training data should be required to be
| released to be considered an open source model._
|
| The actual poor move is trying to fit the term "open
| source" onto AI models at all, rather than coining new
| terms that actually match how models are developed.
| tbrownaw wrote:
| > _These terms are very well
| defined.https://opensource.org/osd_
|
| Yes. And you're using them wrong.
|
| From the OSD:
|
| < _The source code must be the preferred form in which a
| programmer would modify the program._ >
|
| So, what's the preferred way to modify a model? You get the
| weights and then run fine-tuning with a relatively small
| amount of data, which is way cheaper than re-training the
| entire thing from scratch.
|
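| In concrete terms, that preferred form today looks something
| like this LoRA sketch using the peft library (the model name
| and hyperparameters are just examples):
|
|     from peft import LoraConfig, get_peft_model
|     from transformers import AutoModelForCausalLM
|
|     # Start from someone else's released weights...
|     model = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|
|     # ...then "modify the program" by training a small adapter
|     # on top, instead of redoing the enormous pre-training run.
|     lora = LoraConfig(r=16, lora_alpha=32,
|                       target_modules=["q_proj", "v_proj"])
|     model = get_peft_model(model, lora)
|     model.print_trainable_parameters()  # typically well under 1%
|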
| ---
|
| The issue is that normal software has no way to modify the
| binary artifacts short of completely recreating them, while
| AI models do have one - and it is dramatically cheaper than
| a full re-train. The development lifecycle has nodes that
| don't exist for normal software.
|
| Which means that really, AI models need their own different
| terminology that matches that difference. Say, open-weights and
| open-data or something.
|
| Kinda like how Creative Commons is a thing because software
| development lifecycle concepts don't map very well to
| literature or artwork either.
| channel_t wrote:
| This is definitely a useful exercise worth going through for the
| educational value before eventually surrendering and just using
| the big models owned by "unprofitable companies."
| lrvick wrote:
| Never accepted the terms of service of any proprietary models-
| as-a-service providers and never will.
|
| Be one of the few humans still pretty good at using their own
| brains for those problems LLMs can't solve, and you will be
| very employable.
| mattmanser wrote:
| If you don't find out what the models can do, how can you
| know what problems they can't solve?
| simonw wrote:
| > On paper, this looks like a success. In practice, the time
| spent crafting a prompt, waiting for the AI to run and fixing the
| small issue that came up immensely exceeds the 10 minutes it
| would have taken me to edit the file myself. I don't think coding
| that way would lead me to a massive performance improvement for
| now.
|
| The models used in this experiment - deepseek-r1:8b, mistral:7b,
| qwen3:8b - are _tiny_. It's honestly a miracle that they produce
| anything that looks like working code at all!
|
| I'm not surprised that the conclusion was that writing without
| LLM assistance would be more productive in this case.
| mattmanser wrote:
| Yeah, the truth is avoiding the big players is silly right now.
| It's not that the small models won't eventually work, either; we
| have no idea how far they can be compressed in future, especially
| with people trying to get the mixture-of-experts approach
| working.
|
| Right now, you need the bigger models for good responses, but
| in a year's time?
|
| So the whole exercise was a bit of a waste of the author's
| time; the present target moves too quickly. This isn't a time
| to be clutching your pearls about running your own models
| unless you want to do something shady with AI.
|
| And just as video streaming was advanced by the porn industry,
| a lot of people are watching the, um, "thirsty" AI enthusiasts
| for the big advances in small models.
| roadside_picnic wrote:
| While I'm always happy to see more people using open models, I
| was hoping the "playing" would be a bit more about actually
| interacting with the models themselves, rather than just running
| them.
|
| For anyone interested in playing around with the _internals_ of
| LLMs without needing to worry about having the hardware to train
| locally, here are a few projects I've found really fun and
| educational:
|
| - Implement speculative decoding for two different-sized models
| that share a tokenizer [0]
|
| - Enforce structured outputs through _constrained decoding_ (a
| great way to dive deeper into regex parsing as well).
|
| - Create a novel sampler using entropy or other information
| about token probabilities (a minimal sketch follows below)
|
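| As a flavor of that last idea, here's a tiny entropy-adaptive
| sampler in numpy (the function name and thresholds are mine,
| not from any particular paper; a real sampler would run on
| the model's logits at every decoding step):
|
|     import numpy as np
|
|     def entropy_adaptive_sample(logits, t_min=0.3, t_max=1.2,
|                                 rng=None):
|         """Pick a token id, exploring more when the model is
|         uncertain (high entropy), less when it is confident."""
|         rng = rng or np.random.default_rng()
|         logits = np.asarray(logits, dtype=float)
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         # Shannon entropy, normalized to [0, 1] by its maximum.
|         h = -np.sum(probs * np.log(probs + 1e-12))
|         h /= np.log(len(probs))
|         # Map normalized entropy onto a temperature, resample.
|         temperature = t_min + (t_max - t_min) * h
|         scaled = np.exp((logits - logits.max()) / temperature)
|         scaled /= scaled.sum()
|         return rng.choice(len(logits), p=scaled)
|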
| The real value of open LLMs, at least for me, has been that they
| _aren't_ black boxes: you can open them up and take a look
| inside. For all the AI hype, it's a bit of a shame that so few
| people seem to really be messing around with the insides of LLMs.
|
| 0. https://arxiv.org/pdf/2211.17192
| 44za12 wrote:
| "I used all the top tier 'Open Source LLMs' and they suck I was
| right, like always, LLMs suck so hard, my job is safe!"
| runningmike wrote:
| Nice writeup and status update on the use of FOSS ML things.
| Saves me a lot of time!
| BoorishBears wrote:
| Deepseek R1 8B isn't famous for anything (except maybe being
| confused with Deepseek R1) and isn't by Deepseek any more than
| my finetuning Llama makes me the creator of Llama.
| v3ss0n wrote:
| Those LLM influencers don't know what a distill is. Deepseek R1
| 8B IS A DISTILLED Qwen2. You should be using Qwen3 8B-14B
| instead; it's a lot better.
___________________________________________________________________
(page generated 2025-07-29 23:01 UTC)