[HN Gopher] Open models by OpenAI
___________________________________________________________________
Open models by OpenAI
https://openai.com/index/introducing-gpt-oss/
Author : lackoftactics
Score : 2047 points
Date : 2025-08-05 17:02 UTC (1 day ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| thimabi wrote:
| Open weight models from OpenAI with performance comparable to
| that of o3 and o4-mini in benchmarks... well, I certainly wasn't
| expecting that.
|
| What's the catch?
| coreyh14444 wrote:
| Because GPT-5 comes out later this week?
| thimabi wrote:
| It could be, but there's so much hype surrounding the GPT-5
| release that I'm not sure whether their internal models will
| live up to it.
|
| For GPT-5 to dwarf these just-released models in importance,
| it would have to be a huge step forward, and I still have
| doubts about OpenAI's capabilities and infrastructure to
| handle demand at the moment.
| sebzim4500 wrote:
| Surely OpenAI would not be releasing this now unless GPT-5
| was much better than it.
| jona777than wrote:
| As a sidebar, I'm still not sure if GPT-5 will be
| transformative due to its capabilities as much as its
| accessibility. All it really needs to do to be highly
| impactful is lower the barrier of entry for the more
| powerful models. I could see that contributing to it being
| worth the hype. Surely it will be better, but if more
| people are capable of leveraging it, that's just as
| revolutionary, if not more.
| rrrrrrrrrrrryan wrote:
| It seems like a big part of GPT-5 will be that it will be
| able to intelligently route your request to the appropriate
| model variant.
| Shank wrote:
| That doesn't sound good. It sounds like OpenAI will route
| my request to the cheapest model to them and the most
| expensive for me, with the minimum viable results.
| Invictus0 wrote:
| Sounds just like what a human would do. Or any business
| for that matter.
| Shank wrote:
| That may be true but I thought the promise was moving in
| the direction of AGI/ASI/whatever and that models would
| become more capable over time.
| logicchains wrote:
| The catch is that it only has ~5 billion active params so
| should perform worse than the top Deepseek and Qwen models,
| which have around 20-30 billion, unless OpenAI pulled off a
| miracle.
| NitpickLawyer wrote:
| > What's the catch?
|
| Probably GPT-5 will be way, way better. If Horizon Alpha/Beta
| are early previews of GPT-5 family models, then coding should
| be > Opus 4 for modern frontend stuff.
| int_19h wrote:
| The catch is that performance is not actually comparable to
| o4-mini, never mind o3.
|
| When it comes to LLMs, benchmarks are bullshit. If they sound
| too good to be true, it's because they are. The only thing
| benchmarks are useful for is preliminary screening - if the
| model does especially badly in them it's probably not good in
| general. But if it does well in them, that doesn't really tell
| you anything.
| ewoodrich wrote:
| It's definitely _interesting_ how the comments from right
| after the models were released were ecstatic about "SOTA
| performance" and how it is "equivalent to o3" and then
| comments like yours hours later after having actually tested
| it keep pointing out how it's garbage compared to even the
| current batch of open models let alone proprietary foundation
| models.
|
| Yet another data point for benchmarks being utterly useless
| and completely gamed at this stage in the game by all the
| major AI developers.
|
| These companies are clearly all very aware that the initial
| wave of hype at release is "sticky" and drives buzz/tech news
| coverage, while real-world tests take much longer before that
| impression slowly starts to be undermined by practical usage
| and comparison to other models. Benchmarks with wildly
| overconfident naming like "Humanity's Last Exam" aren't
| exactly helping with objectivity either.
| DSingularity wrote:
| Ha. Secure funding and proceed to immediately make a decision
| that would likely conflict viscerally with investors.
| hnuser123456 wrote:
| Maybe someone got tired of waiting and paid them to release
| something actually open.
| 4b6442477b1280b wrote:
| their promise to release an open weights model predates this
| round of funding by, iirc, over half a year.
| DSingularity wrote:
| Yeah but they never released until now.
| SV_BubbleTime wrote:
| Undercutting other frontier models with your open source one is
| not an anti-investor move.
|
| It is what China has been doing for a year plus now. And the
| Chinese models are popular and effective, I assume companies
| are paying for better models.
|
| Releasing open models for free doesn't have to be charity.
| hnuser123456 wrote:
| Text only, when local multimodal became table stakes last year.
| ebiester wrote:
| Honestly, it's a tradeoff. If you can reduce the size and get
| higher quality on specific tasks, that's better than a
| generalist that can't run on a laptop or can't compete at any
| one task.
|
| We'll learn the actual quality soon enough as we use it.
| greenavocado wrote:
| That's what I thought too until Qwen-Image was released
| SV_BubbleTime wrote:
| When Qwen-Image was released... like yesterday? And what?
| What point are you making? Qwen-Image was released yesterday
| and, like every image model, its base model shows potential
| over older ones, but the real factor is whether it will be
| flexible enough for fine-tunes or additional training LoRAs.
| BoorishBears wrote:
| The community can always figure out hooking it up to other
| modalities.
|
| Native might be better, but no native multimodal model is very
| competitive yet, so better to take a competitive model and
| latch on vision/audio
| tarruda wrote:
| > so better to take a competitive model and latch on
| vision/audio
|
| Can this be done by a third party or would it have to be
| OpenAI?
| BoorishBears wrote:
| No, anyone can do it: https://github.com/haotian-liu/LLaVA
| IceHegel wrote:
| Listed performance of ~5 points less than o3 on benchmarks is
| pretty impressive.
|
| Wonder if they feel the bar will be raised soon (GPT-5) and feel
| more comfortable releasing something this strong.
| johntiger1 wrote:
| Wow, this will eat Meta's lunch
| seydor wrote:
| I believe their competition has been from Chinese companies
| for some time now.
| mhh__ wrote:
| They will clone it
| BoorishBears wrote:
| Maverick and Scout were not great, even with post-training in
| my experience, and then several Chinese models at multiple
| sizes made them kind of irrelevant (dots, Qwen, MiniMax)
|
| If anything this helps Meta: another model to inspect/learn
| from/tweak etc. generally helps anyone making models
| redox99 wrote:
| There's nothing new here in terms of architecture. Whatever
| secret sauce is in the training.
| BoorishBears wrote:
| Part of the secret sauce since O1 has been access to the real
| reasoning traces, not the summaries.
|
| If you even glance at the model card you'll see this was
| trained on the same CoT RL pipeline as O3, and it shows in
| using the model: this is the most coherent and structured
| CoT of any open model so far.
|
| Having full access to a model trained on that pipeline is
| valuable to anyone doing post-training, even if it's just
| to observe, but especially if you use it as cold start data
| for your own training.
| anticensor wrote:
| Its CoT is sadly closer to those sanitised o3 summaries
| than to R1-style traces.
| BoorishBears wrote:
| It has both raw and summarized traces.
| anticensor wrote:
| I mean raw GPT-OSS is close to summarised o3.
| asdev wrote:
| Meta is so cooked, I think most enterprises will opt for OpenAI
| or Anthropic and others will host OSS models themselves or on
| AWS/infra providers.
| a_wild_dandan wrote:
| I'll accept Meta's frontier AI demise if they're in their
| current position a year from now. People killed Google
| prematurely too (remember Bard?), because we severely
| underestimate the catch-up power bought with ungodly piles of
| cash.
| asdev wrote:
| catching up gets exponentially harder as time passes. way
| harder to catch up to current models than it was to the
| first iteration of gpt-4
| atonse wrote:
| And boy, with the $250m offers to people, Meta is
| definitely throwing ungodly piles of cash at the problem.
|
| But Apple is waking up too. So is Google. It's absolutely
| insane, the amount of money being thrown around.
| a_vanderbilt wrote:
| It's insane numbers like that that give me some concern
| about a bubble. Not because AI hits some dead end, but
| because a plateau could shift things from aggressive
| investment to passive-but-steady improvement.
| Workaccount2 wrote:
| Wow, today is a crazy AI release day:
|
| - OAI open source
|
| - Opus 4.1
|
| - Genie 3
|
| - ElevenLabs Music
| orphea wrote:
| OAI open source
|
| Yeah. This certainly was not on my bingo card.
| wahnfrieden wrote:
| They announced it months ago...
| satyrun wrote:
| wow I just listened to Eleven Music do flamenco singing. That
| is incredible.
|
| Edit: I just tried it though and I'm less impressed now. We are
| really going to need major music software to get on board
| before we have actual creative audio tools. These all seem made
| for non-musicians to make a very cookie cutter song from a
| specific genre.
| tmikaeld wrote:
| I also tried it for a full 100K credits (Wasted in 2 hours
| btw which is silly!).
|
| Compared to both Udio and Suno, it's very, very bad... both at
| composition, matching lyrics to music, and keeping tempo, and
| as soon as there are any distorted instruments like guitars,
| or anything live, the quality drops to radio level.
| BoxOfRain wrote:
| >These all seem made for non-musicians to make a very cookie
| cutter song from a specific genre.
|
| This is my main problem with AI music at the moment. If I had
| proper creative control as a musician that'd be amazing, but a
| lot of the time it's just straight-up slop generation.
| deviation wrote:
| So this confirms a best-in-class model release within the next
| few days?
|
| From a strategic perspective, I can't think of any reason they'd
| release this unless they were about to announce something which
| totally eclipses it?
| og_kalu wrote:
| Even before today, it's been clear for the last week or so,
| for a couple of reasons, that GPT-5's release was imminent.
| ticulatedspline wrote:
| Even without an imminent release it's a good strategy. They're
| getting pressure from Qwen and other high-performing open-
| weight models. Without a horse in the race they could fall
| behind in an entire segment.
|
| There's future opportunity in licensing, tech support, agents,
| or even simply to dominate and eliminate. Not to mention brand
| awareness: if you like these, you might be more likely to
| approach their brand for larger models.
| winterrx wrote:
| GPT-5 coming Thursday.
| boringg wrote:
| How much hype do we anticipate with the release of GPT-5 or
| whichever name to be included? And how many new features?
| selectodude wrote:
| Excited to have to send them a copy of my drivers license
| to try and use it. That'll take the hype down a notch.
| XCSme wrote:
| Imagine if it's called GPT-4.5o
| ciaranmca wrote:
| Are these the stealth models Horizon Alpha and Beta? I was
| generally impressed with them (although I really only used
| them in chats rather than any code tasks). In terms of chat I
| increasingly see very little difference between the current
| SOTA closed models and their open-weight counterparts.
| deviation wrote:
| Their tokenization suggests they're new Qwen models AFAIK.
| They tokenize input to the exact same # of tokens that Qwen
| models do.
| logicchains wrote:
| > I can't think of any reason they'd release this unless they
| were about to announce something which totally eclipses it
|
| Given it's only around 5 billion active params it shouldn't be
| a competitor to o3 or any of the other SOTA models, given the
| top Deepseek and Qwen models have around 30 billion active
| params. Unless OpenAI somehow found a way to make a model with
| 5 billion active params perform as well as one with 4-8 times
| more.
| bredren wrote:
| Undoubtedly. It would otherwise reduce the perceived value of
| their current product offering.
|
| The question is how much better the new model(s) will need to
| be on the metrics given here for them to feel comfortable
| making these available.
|
| Despite the loss of face for the lack of open model releases,
| I do not think that was a big enough problem to undercut
| commercial offerings.
| FergusArgyll wrote:
| Thursday
|
| https://manifold.markets/Bayesian/on-what-day-will-gpt5-be-r...
| artembugara wrote:
| Disclaimer: probably dumb questions
|
| so, the 20b model.
|
| Can someone explain to me what I would need to do in terms of
| resources (GPU, I assume) if I want to run 20 concurrent
| processes, assuming I need 1k tokens/second throughput (on each,
| so 20 x 1k)
|
| Also, is this model better than or comparable to gpt-4.1-nano
| for information extraction, and would it be cheaper to host
| the 20b myself?
| mythz wrote:
| gpt-oss:20b is ~14GB on disk [1] so fits nicely within a 16GB
| VRAM card.
|
| [1] https://ollama.com/library/gpt-oss
| artembugara wrote:
| thanks, this part is clear to me.
|
| but I need to understand 20 x 1k token throughput
|
| I assume it just might be too early to know the answer
| Tostino wrote:
| I legitimately cannot think of any hardware I know of that
| will get you to that throughput over that many streams (I
| don't work in the server space, so there may be some new
| stuff I am unaware of).
| artembugara wrote:
| oh, I totally understand that I'd need multiple GPUs. I'd
| just want to know what GPU specifically and how many
| Tostino wrote:
| I don't think you can get 1k tokens/sec on a single
| stream using any consumer grade GPUs with a 20b model.
| Maybe you could with H100 or better, but I somewhat doubt
| that.
|
| My 2x 3090 setup will get me ~6-10 streams of ~20-40
| tokens/sec (generation) ~700-1000 tokens/sec (input) with
| a 32b dense model.
| dragonwriter wrote:
| You also need space in VRAM for what is required to support
| the context window; you might be able to do a model that is
| 14GB in parameters with a small (~8k maybe?) context window
| on a 16GB card.
| petuman wrote:
| > assuming I need 1k tokens/second throughput (on each, so 20 x
| 1k)
|
| 3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated
| model weights (there's also context). So pretty much straight
| to B200 and alike. 1000 t/s per user/agent is way too fast,
| make it 300 t/s and you could get away with 5090/RTX PRO 6000.
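|
| A quick sanity check of that arithmetic, as a sketch in Python
| (the 3.6B active-parameter figure and the bandwidth numbers
| are rough assumptions; it ignores KV-cache reads, batching
| efficiency, and compute limits):
|
| ```
| # Bandwidth-bound decode estimate: each generated token has to
| # stream the active weights from memory at least once.
| active_params = 3.6e9      # approx. active params per token (gpt-oss-20b)
| bytes_per_param = 1.0      # Q8; roughly 0.5 for the native MXFP4 weights
|
| def max_tokens_per_sec(mem_bandwidth_gb_s):
|     bytes_per_token = active_params * bytes_per_param
|     return mem_bandwidth_gb_s * 1e9 / bytes_per_token
|
| for name, bw in [("RTX 4090 (~1008 GB/s)", 1008),
|                  ("RTX 5090 (~1792 GB/s)", 1792),
|                  ("B200-class (~8000 GB/s)", 8000)]:
|     print(f"{name}: ~{max_tokens_per_sec(bw):.0f} tok/s upper bound per stream")
| ```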
| mlyle wrote:
| An A100 is probably 2-4k tokens/second on a 20B model with
| batched inference.
|
| Multiply the number of A100's you need as necessary.
|
| Here, you don't really need the RAM. If you could accept fewer
| tokens/second, you could do it much cheaper with consumer
| graphics cards.
|
| Even with A100, the sweet-spot in batching is not going to give
| you 1k/process/second. Of course, you could go up to H100...
| d3m0t3p wrote:
| You can batch only if you have distinct chats in parallel.
| mlyle wrote:
| > > if I want to run _20 concurrent processes_ , assuming I
| need 1k tokens/second throughput _(on each)_
| spott wrote:
| Groq is offering 1k tokens per second for the 20B model.
|
| You are unlikely to match groq on off the shelf hardware as far
| as I'm aware.
| PeterStuer wrote:
| (answer for 1 inference) All depends on the context length you
| want to support as the activation memory will dominate the
| requirements. For 4096 tokens you will get away with 24GB (or
| even 16GB), but if you want to go for the full 131072 tokens
| you are not going to get there with a 32GB consumer GPU like
| the 5090. You'll need to spring for at the minimum an A6000
| (48GB) or preferably an RTX 6000 Pro (96GB).
|
| Also keep in mind this model does use 4-bit layers for the MoE
| parts. Unfortunately native accelerated 4-bit support only
| started with Blackwell on NVIDIA. So your
| 3090/4090/A6000/A100's are not going to be fast. An RTX 5090
| will be your best starting point in the traditional card space.
| Maybe the unified-memory mini PCs like the Spark systems or the
| Mac mini could be an alternative, but I do not know them well
| enough.
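|
| For sizing that context memory yourself, the usual back-of-
| the-envelope is 2 (K and V) x layers x kv_heads x head_dim x
| bytes per element, per token. A sketch in Python; the config
| numbers below are illustrative placeholders, not gpt-oss's
| actual values, so plug in the figures from the model's
| config.json:
|
| ```
| # KV-cache size estimate per context length (placeholder config).
| def kv_cache_gb(context_tokens, n_layers=24, n_kv_heads=8,
|                 head_dim=64, bytes_per_elem=2):  # 2 = fp16/bf16 cache
|     bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
|     return context_tokens * bytes_per_token / 1e9
|
| for ctx in (4_096, 32_768, 131_072):
|     print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
| ```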
| vl wrote:
| How do Macs compare to RTXs for this? I.e. what numbers can be
| expected from Mac mini/Mac Studio with 64/128/256/512GB of
| unified memory?
| coolspot wrote:
| https://apxml.com/tools/vram-calculator
| hubraumhugo wrote:
| Meta's goal with Llama was to target OpenAI with a "scorched
| earth" approach by releasing powerful open models to disrupt the
| competitive landscape. Looks like OpenAI is now using the same
| playbook.
| tempay wrote:
| It seems like the various Chinese companies are far outplaying
| Meta at that game. It remains to be seen if they're able to
| throw money at the problem to turn things around.
| SV_BubbleTime wrote:
| Good move for China. No one was going to trust their models
| outright, now they not only have a track record, but they
| were able to undercut the value of US models at the same
| time.
| k2xl wrote:
| Are there any details about hardware requirements for a sensible
| tokens-per-second rate for each size of these models?
| minimaxir wrote:
| I'm disappointed that the smallest model size is 21B parameters,
| which strongly restricts how it can be run on personal hardware.
| Most competitors have released a 3B/7B model for that purpose.
|
| For self-hosting, it's smart that they targeted a 16GB VRAM
| config for it since that's the size of the most cost-effective
| server GPUs, but I suspect "native MXFP4 quantization" has
| quality caveats.
| moffkalast wrote:
| Eh, 20B is pretty manageable; 32GB of regular RAM and some VRAM
| will run you a 30B with partial offloading. After that it gets
| tricky.
| 4b6442477b1280b wrote:
| with quantization, 20B fits effortlessly in 24GB
|
| with quantization + CPU offloading, non-thinking models run
| kind of fine (at about 2-5 tokens per second) even with 8 GB of
| VRAM
|
| sure, it would be great if we could have models in all sizes
| imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are
| great.
| Tostino wrote:
| I am not at all disappointed. I'm glad they decided to go for
| models that are somewhat large but still reasonable to run on
| everything but phones.
|
| Quite excited to give this a try
| strangecasts wrote:
| A small part of me is considering going from a 4070 to a 16GB
| 5060 Ti just to avoid having to futz with offloading
|
| I'd go for an ..80 card but I can't find any that fit in a
| mini-ITX case :(
| SV_BubbleTime wrote:
| I wouldn't stop at 16GB right now.
|
| 24 is the lowest I would go. Buy a used 3090. Picked one up
| for $700 a few months back, but I think they were on the rise
| then.
|
| The 3000 series can't do FP8fast, but meh. It's the OOM
| that's tough, not the speed so much.
| strangecasts wrote:
| Are there any 24GB cards/3090s which fit in ~300mm without
| an angle grinder?
| metalliqaz wrote:
| if you're going to get that kind of hardware, you need a
| larger case. IMHO this is not an unreasonable thing if
| you are doing heavy computing
| strangecasts wrote:
| Noted for my next build - I am aware this is a problem
| I've made for myself, _otherwise_ I like the mini-ITX
| form factor a lot
| SV_BubbleTime wrote:
| Which do you like more: OOM for local AI, or an itty-bitty
| case?
| zigzag312 wrote:
| https://skinflint.co.uk/?cat=gra16_512&hloc=uk&v=e&hloc=a
| t&h...
|
| 5070 Ti Super will also have 24GB.
| strangecasts wrote:
| Oh nice, thank you :)
|
| Admittedly a little tempting to see how the 5070 Ti Super
| shakes out!
| zigzag312 wrote:
| I'm waiting too :)
|
| 50xx series supports MXFP4 format, but I'm not sure about
| 3090.
| hnuser123456 wrote:
| Native FP4 quantization means it needs only half a byte per
| parameter, and will have next to zero quality loss (on the
| order of 0.1%) compared to using twice the VRAM and far more
| expensive hardware. FP3 and below gets messier.
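|
| As a rough illustration (a sketch; the parameter counts are
| approximate, and in practice only the MoE expert weights are
| MXFP4 while the other layers stay at higher precision, which
| is why the 20b checkpoint is ~14GB on disk rather than ~10GB):
|
| ```
| # Weight-only memory footprint at different precisions
| # (KV cache and runtime overhead come on top of this).
| def weight_gb(params, bits):
|     return params * bits / 8 / 1e9
|
| for name, params in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
|     print(name, " ".join(f"fp{b}: ~{weight_gb(params, b):.0f} GB"
|                          for b in (16, 8, 4)))
| ```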
| Disposal8433 wrote:
| Please don't use the open-source term unless you ship the TBs of
| data downloaded from Anna's Archive that are required to build it
| yourself. And don't forget all the system prompts to censor the
| multiple topics that they don't want you to see.
| rvnx wrote:
| I don't know why you got downvoted so much; these models are
| not open-source/open-recipe. They are censored open-weight
| models. Better than nothing, but far from being open.
| a_vanderbilt wrote:
| Most people don't really care all that much about the
| distinction. It comes across to them as linguistic pedantry
| and they downvote it to show they don't want to hear/read it.
| outlore wrote:
| by your definition most of the current open weight models would
| not qualify
| layer8 wrote:
| That's why they are called open weight and not open source.
| robotmaxtron wrote:
| Correct. I agree with them, most of the open weight models
| are not open source.
| someperson wrote:
| Keep fighting the "open weights" terminology fight, because a
| blob of neural network weights (even if the inference code is
| open-source) is not open source, and calling it that dilutes
| the term.
| mhh__ wrote:
| The system prompt is an inference parameter, no?
| Quarrel wrote:
| Is your point really that "I need to see all data downloaded
| to make this model before I can know it is open"? Do you have
| $XXB worth of GPU time to ingest that data with a state of the
| art framework to make a model? I don't. Even if I did, I'm not
| sure FB or Google are in any better position to claim this
| model is or isn't open beyond the fact that the weights are
| there.
|
| They're giving you a free model. You can evaluate it. You can
| sue them. But the weights are there. If you dislike the way
| they license the weights, because the license isn't open
| enough, then sure, speak up, but because you can't see all the
| training data??! Wtf.
| ticulatedspline wrote:
| To many people there's an important distinction between "open
| source" and "open weights". I agree with the distinction,
| open source has a particular meaning which is not really here
| and misuse is worth calling out in order to prevent erosion
| of the terminology.
|
| Historically this would be like calling a free but closed-
| source application "open source" simply because the
| application is free.
| layer8 wrote:
| The parent's point is that open weight is not the same as
| open source.
|
| Rough analogy:
|
| SaaS = AI as a service
|
| Locally executable closed-source software = open-weight model
|
| Open-source software = open-source model (whatever allows one
| to reproduce the model from training data)
| NicuCalcea wrote:
| I don't have the $XXbn to train a model, but I certainly
| would like to know what the training data consists of.
| seba_dos1 wrote:
| Do you need to see the source code used to compile this
| binary before you can know it is open? Do you have enough
| disk storage and RAM available to compile Chromium on your
| laptop? I don't.
| nexttk wrote:
| I agree with OP - the weights are more akin to the binary
| output from a compiler. You can't see how it works or how it
| was made, and you can't freely manipulate it, improve it,
| extend it, etc. It's like having a binary of a program. The
| source code for the model was the training data. The compiler
| is the tooling that can train a model based on a given set
| of training data. For me it is not critical for an open
| source model that it is ONLY distributed in source code form.
| It is fine that you can also download just the weights. But
| it should be possible to reproduce the weights - either there
| should be a tar.gz ball with all the training data, or there
| needs to be a description/scripts of how one could obtain the
| training data. It must be reproducible for someone willing to
| invest the time, compute into it even if 99.999% use only the
| binary. This is completely analogous to what is normally
| understood by open source.
| NitpickLawyer wrote:
| It's Apache 2.0, so by definition it's open source. Stop
| pushing for training data, it'll never happen, and there's
| literally 0 reason for it to happen (both theoretical and
| practical). Apache 2.0 _IS_ open source.
| organsnyder wrote:
| What is the source that's open? Aren't the models themselves
| more akin to compiled code than to source code?
| NitpickLawyer wrote:
| No, not compiled code. Weights are hardcoded values. Code
| is the combination of model architecture + config +
| inferencing engine. You run inference based on the
| architecture (what and when to compute), using some
| hardcoded values (weights).
| seba_dos1 wrote:
| JVM bytecode is hardcoded values. Code is the virtual
| machine implementation + config + operating system it
| runs on. You run classes based on the virtual machine,
| using some hardcoded input data generated by javac.
| _flux wrote:
| No, it's open weight. You wouldn't call applications with
| only Apache 2.0-licensed binaries "open source". The weights
| are not the "source code" of the model, they are the
| "compiled" binary, therefore they are not open source.
|
| However, for the sake of argument let's say this release
| should be called open source.
|
| Then what do you call a model that also comes with its
| training material and tools to reproduce the model? Is it
| also called open source, and there is no material difference
| between those two releases? Or perhaps those two different
| terms should be used for those two different kind of
| releases?
|
| If you say that actually open source releases are impossible
| now (for mostly copyright reasons I imagine), it doesn't mean
| that they will be perpetually so. For that glorious future,
| we can leave them space in the terminology by using the term
| open weight. It is also the term that should not be
| misleading to anyone.
| WhyNotHugo wrote:
| It's open source, but it's a binary-only release.
|
| It's like getting a compiled software with an Apache license.
| Technically open source, but you can't modify and recompile
| since you don't have the source to recompile. You can still
| tinker with the binary tho.
| NitpickLawyer wrote:
| Weights are not binary. I have no idea why this is so often
| spread, it's simply not true. You can't do anything with
| the weights themselves, you can't "run" the weights.
|
| You run inference (via a library) on a model using its
| architecture (config file) and tokenizer (what and when to
| compute), based on weights (hardcoded values). That's it.
|
| > but you can't modify
|
| Yes, you can. It's called finetuning. And, most
| importantly, that's _exactly_ how the model creators
| themselves are "modifying" the weights! No sane lab is
| "recompiling" a model every time they change something.
| They perform a pre-training stage (feed everything and the
| kitchen sink), they get the hardcoded values (weights), and
| then they post-train using "the same" (well, maybe their
| techniques are better, but still the same concept) as you
| or I would. Just with more compute. That's it. You can do
| the exact same modifications, using basically the same
| concepts.
|
| > don't have the source to recompile
|
| In pure practical ways, neither do the labs. Everyone that
| has trained a big model can tell you that the process is so
| finicky that they'd eat their hat if a big training run could
| somehow be made reproducible to the bit. Between nodes
| failing, datapoints ballooning your loss and having to go
| back, and the myriad of other problems, what you get out of
| a big training run is not guaranteed to be the same even with
| 100-1000 more attempts, in practice. It's simply the
| nature of training large models.
| koolala wrote:
| You can do a lot with a binary also. That's what game
| mods are all about.
| squeaky-clean wrote:
| A binary does not mean an executable. A PNG is a binary.
| I could have an SVG file, render it as a PNG and release
| that with CC0, it doesn't make my PNG open source. Model
| Weights are binary files.
| seba_dos1 wrote:
| Slapping an open license onto a binary can be a valid use
| of such license, but does not make your project open
| source.
| jlokier wrote:
| _> It 's apache2.0, so by definition it's open source._
|
| That's not true by any of the open source definitions in
| common use.
|
| _Source code_ (and, optionally, derived binaries) under the
| Apache 2.0 license are open source.
|
| But _compiled binaries_ (without access to source) under the
| Apache 2.0 license are not open source, even though the
| license does give you some rights over what you can do with
| the binaries.
|
| Normally the question doesn't come up, because it's so
| unusual, strange and contradictory to ship closed-source
| binaries with an open source license. Descriptions of which
| licenses qualify as open source licenses assume the context
| that _of course_ you have the source or could get it, and
| it's a question of what you're allowed to do with it.
|
| The distinction is more obvious if you ask the same question
| about other open source licenses such as GPL or MPL. A
| compiled binary (without access to source) shipped with a GPL
| license is not by any stretch open source. Not only is it not
| in the "preferred form for editing" as the license requires,
| it's not even permitted for someone who receives the file to
| give it to someone else and comply with the license. If
| someone who receives the file can't give it to anyone else
| (legally), then it's obviously not open source.
| NitpickLawyer wrote:
| Please see the detailed response to a sibling post. tl;dr;
| weights are not binaries.
| jlokier wrote:
| "Compiled binaries" are just meant to be an example. For
| the purpose of whether something is open source, it
| doesn't matter whether something is a "binary" or
| something completely different.
|
| What matters (for all common definitions of open source):
| Are the files in "source form" (which has a definition),
| or are they "derived works" of the source form?
|
| Going back to Apache 2.0. Although that doesn't define
| "open source", it provides legal definitions of source
| and non-source, which are similar to the definitions used
| in other open source licenses.
|
| As you can see below, for Apache 2.0 it doesn't matter
| whether something is a "binary", "weights" or something
| else. What matters is whether it's the "preferred form
| for making modifications" or a "form resulting from
| mechanical transformation or translation". My highlights
| are capitalized:
|
| - Apache License Version 2.0, January 2004
|
| - 1. Definitions:
|
| - "Source" form shall mean the _PREFERRED FORM FOR MAKING
| MODIFICATIONS_ , including _BUT NOT LIMITED TO_ software
| source code, documentation source, and configuration
| files.
|
| - "Object" form shall mean any form resulting from
| _MECHANICAL TRANSFORMATION OR TRANSLATION_ of a Source
| form, including _BUT NOT LIMITED TO_ compiled object
| code, generated documentation, and conversions to other
| media types.
| NitpickLawyer wrote:
| > "Source" form shall mean the PREFERRED FORM FOR MAKING
| MODIFICATIONS, including BUT NOT LIMITED TO software
| source code, documentation source, and configuration
| files.
|
| Yes, weights are the PREFERRED FORM FOR MAKING
| MODIFICATIONS!!! You, the labs, and anyone sane modifies
| the weights via post-training. This is the point. The
| labs don't re-train every time they want to change the
| model. They finetune. You can do that as well, with the
| same tools/concepts, AND YOU ARE ALLOWED TO DO THAT by
| the license. And redistribute. And all the other stuff.
| x187463 wrote:
| Running a model comparable to o3 on a 24GB Mac Mini is absolutely
| wild. Seems like yesterday the idea of running frontier (at the
| time) models locally or on a mobile device was 5+ years out. At
| this rate, we'll be running such models in the next phone cycle.
| tedivm wrote:
| It only seems like that if you haven't been following other
| open source efforts. Models like Qwen perform ridiculously well
| and do so on very restricted hardware. I'm looking forward to
| seeing benchmarks to see how these new open source models
| compare.
| Rhubarrbb wrote:
| Agreed, these models seem relatively mediocre compared to
| Qwen3 / GLM 4.5
| modeless wrote:
| Nah, these are much smaller models than Qwen3 and GLM 4.5
| with similar performance. Fewer parameters and fewer bits
| per parameter. They are much more impressive and will run
| on garden variety gaming PCs at more than usable speed. I
| can't wait to try on my 4090 at home.
|
| There's basically no reason to run other open source models
| now that these are available, at least for non-multimodal
| tasks.
| tedivm wrote:
| Qwen3 has multiple variants ranging from larger (230B)
| than these models to significantly smaller (0.6b), with a
| huge number of options in between. For each of those
| models they also release quantized versions (your "fewer
| bits per parameter").
|
| I'm still withholding judgement until I see benchmarks,
| but every point you tried to make regarding model size
| and parameter size is wrong. Qwen has more variety on
| every level, and performs extremely well. That's before
| getting into the MoE variants of the models.
| modeless wrote:
| The benchmarks of the OpenAI models are comparable to the
| largest variants of other open models. The smaller
| variants of other open models are much worse.
| mrbungie wrote:
| I would wait for neutral benchmarks before making any
| conclusions.
| bigyabai wrote:
| With all due respect, you need to actually test out Qwen3
| 2507 or GLM 4.5 before making these sorts of claims. Both
| of them are comparable to OpenAI's largest models and
| even bench favorably to Deepseek and Opus: https://cdn-
| uploads.huggingface.co/production/uploads/62430a...
|
| It's cool to see OpenAI throw their hat in the ring, but
| you're smoking straight hopium if you think there's "no
| reason to run other open source models now" in earnest.
| If OpenAI never released these models, the state-of-the-
| art would not look significantly different for local
| LLMs. This is almost a nothingburger if not for the
| simple novelty of OpenAI releasing an Open AI for once in
| their life.
| modeless wrote:
| > Both of them are comparable to OpenAI's largest models
| and even bench favorably to Deepseek and Opus
|
| So are/do the new OpenAI models, except they're much
| smaller.
| UrineSqueegee wrote:
| I'd really wait for additional neutral benchmarks, I
| asked the 20b model on low reasoning effort which number
| is larger 9.9 or 9.11 and it got it wrong.
|
| Qwen-0.6b gets it right.
| bigyabai wrote:
| According to the early benchmarks, it's looking like
| you're just flat-out wrong:
| https://blog.brokk.ai/a-first-look-at-gpt-oss-120bs-
| coding-a...
| sourcecodeplz wrote:
| From my initial web developer test on https://www.gpt-oss.com/
| the 120b is kind of meh. Even qwen3-coder
| 30b-a3b is better. Have to test more.
| thegeomaster wrote:
| They have worse scores than recent open source releases
| on a number of agentic and coding benchmarks, so if
| absolute quality is what you're after and not just
| cost/efficiency, you'd probably still be running those
| models.
|
| Let's not forget, this is a thinking model that has
| significantly worse scores on Aider-Polyglot than the
| non-thinking Qwen3-235B-A22B-Instruct-2507, a worse
| TAUBench score than the smaller GLM-4.5 Air, and a worse
| SWE-Bench verified score than the (3x the size) GLM-4.5.
| So the results, at least in terms of benchmarks, are not
| really clear-cut.
|
| From a vibes perspective, the non-reasoners
| Kimi-K2-Instruct and the aforementioned non-thinking
| Qwen3 235B are much better at frontend design. (Tested
| privately, but fully expecting DesignArena to back me up
| in the following weeks.)
|
| OpenAI has delivered something astonishing for the size,
| for sure. But your claim is just an exaggeration. And
| OpenAI have, unsurprisingly, highlighted only the
| benchmarks where they do _really_ well.
| moralestapia wrote:
| You can always get your $0 back.
| Imustaskforhelp wrote:
| I have never agreed with a comment so much but we are all
| addicted to open source models now.
| recursive wrote:
| Not all of us. I've yet to get much use out of any of the
| models. This may be a personal failing. But still.
| satvikpendem wrote:
| Depends on how much you paid for the hardware to run em
| on
| cvadict wrote:
| Yes, but they are suuuuper safe. /s
|
| So far I have mixed impressions, but they do indeed seem
| noticeably weaker than comparably-sized Qwen3 / GLM4.5
| models. Part of the reason may be that the oai models do
| appear to be much more lobotomized than their Chinese
| counterparts (which are surprisingly uncensored). There's
| research showing that "aligning" a model makes it dumber.
| xwolfi wrote:
| The censorship here in China is only about public
| discussions / spaces. You cannot like have a website
| telling you about the crimes of the party. But
| downloading some compressed matrix re-spouting said crimes?
| Nobody gives a damn.
|
| We seem to censor organized large-scale complaints and
| viral mind virii, but we never quite forbid people at
| home from reading some generated knowledge from an obscure,
| hard-to-use piece of software.
| echelon wrote:
| This might mean there's no moat for anything.
|
| Kind of a P=NP, but for software deliverability.
| CamperBob2 wrote:
| On the subject of who has a moat and who doesn't, it's
| interesting to look at the role of patents in the early
| development of wireless technology. There was WWI, and
| there was WWII, but the players in the nascent radio
| industry had _serious_ beef with each other.
|
| I imagine the same conflicts will ramp up over the next few
| years, especially once the silly money starts to dry up.
| a_wild_dandan wrote:
| Right? I still remember the safety outrage of releasing Llama.
| Now? My 96 GB of (V)RAM MacBook will be running a 120B
| parameter frontier lab model. So excited to get my hands on the
| MLX quants and see how it feels compared to GLM-4.5-air.
| 4b6442477b1280b wrote:
| in that era, OpenAI and Anthropic were still deluding
| themselves into thinking they would be the "stewards" of
| generative AI, and the last US administration was very keen
| on regoolating everything under the sun, so "safety" was just
| an angle for regulatory capture.
|
| God bless China.
| narrator wrote:
| Yeah, China is e/acc. Nice cheap solar panels too. Thanks
| China. The problem is their ominous policies like not
| allowing almost any immigration, and their domestic Han
| Supremacist propaganda, and all that make it look a bit
| like this might be Han Supremacy e/acc. Is it better than
| western/decel? Hard to say, but at least the western/decel
| people are now starting to talk about building power
| plants, at least for datacenters, and things like that
| instead of demanding whole branches of computer science be
| classified, as they were threatening to Marc Andreessen
| when he visited the Biden admin last year.
| 01HNNWZ0MV43FF wrote:
| I wish we had voter support for a hydrocarbon tax,
| though. It would level out the prices and then the AI
| companies can decide whether they want to pay double to
| burn pollutants or invest in solar and wind and batteries
| AtlasBarfed wrote:
| Oh poor oppressed marc andreesen. Someone save him!
| a_wild_dandan wrote:
| Oh absolutely, AI labs certainly talk their books,
| including any safety angles. The controversy/outrage
| extended far beyond those incentivized companies too. Many
| people had good faith worries about Llama. Open-weight
| models are now _vastly_ more powerful than Llama-1, yet the
| sky hasn't fallen. It's just fascinating to me how
| apocalyptic people are.
|
| I just feel lucky to be around in what's likely the most
| important decade in human history. Shit odds on that, so
| I'm basically a lotto winner. Wild times.
| 4b6442477b1280b wrote:
| >Many people had good faith worries about Llama.
|
| ah, but that begs the question: did those people develop
| their worries organically, or did they simply consume the
| narrative heavily pushed by virtually every mainstream
| publication?
|
| the journos are _heavily_ incentivized to spread FUD
| about it. they saw the writing on the wall that the days
| of making a living by producing clickbait slop were
| coming to an end and deluded themselves into thinking
| that if they kvetch enough, the genie will crawl back
| into the bottle. scaremongering about sci-fi skynet
| bullshit didn't work, so now they kvetch about joules
| and milliliters consumed by chatbots, as if data centers
| did not exist until two years ago.
|
| likewise, the bulk of other "concerned citizens" are
| creatives who use their influence to sway their
| followers, still hoping against hope to kvetch this
| technology out of existence.
|
| honest-to-God yuddites are as few and as retarded as
| honest-to-God flat earthers.
| kridsdale3 wrote:
| I've been pretty unlucky to have encountered more than my
| fair share of IRL Yuddites. Can't stand em.
| ipaddr wrote:
| "the most important decade in human history."
|
| Lol. To be young and foolish again. This covid laced
| decade is more of a placeholder. The current decade is
| always the most meaningful until the next one. The
| personal computer era, the first cars or planes, ending
| slavery needs to take a backseat to the best search
| engine ever. We are at the point where everyone is
| planning on what they are going to do with their
| hoverboards.
| graemep wrote:
| > ending slavery
|
| happened over many centuries, not in a given decade.
| Abolished and reintroduced in many places: https://en.wik
| ipedia.org/wiki/Timeline_of_abolition_of_slave...
| dingnuts wrote:
| you can say the same shit about machine learning but
| ChatGPT was still the Juneteenth of AI
| hedora wrote:
| Slavery is still legal and widespread in most of the US,
| including California.
|
| There was a ballot measure to actually abolish slavery a
| year or so back. It failed miserably.
| BizarroLand wrote:
| The slavery of free humans is illegal in America, so now
| the big issue is figuring out how to convince voters that
| imprisoned criminals deserve rights.
|
| Even in liberal states, the dehumanization of criminals
| is an endemic behavior, and we are reaching the point in
| our society where ironically having the leeway to discuss
| the humane treatment of even our worst criminals is
| becoming an issue that affects how we see ourselves as a
| society before we even have a framework to deal with the
| issue itself.
|
| What one side wants is for prisons to be for
| rehabilitation and societal reintegration, for prisoners
| to have the right to decline to work and to be paid fair
| wages from their labor. They further want to remove for-
| profit prisons from the equation completely.
|
| What the other side wants is the acknowledgement that
| prisons are not free, they are for punishment, and that
| prisoners have lost some of their rights for the duration
| of their incarceration and that they should be required
| to provide labor to offset the tax burden of their
| incarceration on the innocent people that have to pay for
| it. They also would like it if all prisons were for-
| profit as that would remove the burden from the tax
| payers and place all of the costs of incarceration onto
| the shoulders of the incarcerated.
|
| Both sides have valid and reasonable wants from their
| vantage point while overlooking the valid and reasonable
| wants from the other side.
| recursive wrote:
| > slavery of free humans is illegal
|
| That's kind of vacuously true though, isn't it?
| chromatin wrote:
| I think his point is that slavery is not outlawed by the
| 13th amendment as most people assume (even the Google AI
| summary reads: "The 13th Amendment to the United States
| Constitution, ratified in 1865, officially abolished
| slavery and involuntary servitude in the United
| States.").
|
| However, if you actually read it, the 13th amendment
| makes an explicit allowance for slavery (i.e. expressly
| allows it):
|
| "Neither slavery nor involuntary servitude, *except as a
| punishment for crime whereof the party shall have been
| duly convicted*" (emphasis mine obviously since Markdown
| didn't exist in 1865)
| SR2Z wrote:
| Prisoners themselves are the ones choosing to work most
| of the time, and generally none of them are REQUIRED to
| work (they are required to either take job training or
| work).
|
| They choose to because extra money = extra commissary
| snacks and having a job is preferable to being bored out
| of their minds all day.
|
| That's the part that's frequently not included in the
| discussion of this whenever it comes up. Prison jobs
| don't pay minimum wage, but given that prisoners are
| wards of the state that seems reasonable.
| BizarroLand wrote:
| I have heard anecdotes that the choice of doing work is a
| choice between doing work and being in solitary
| confinement or becoming the target of the guards who do
| not take kindly to prisoners who don't volunteer for work
| assignments.
| graemep wrote:
| I do not think you can equate making prisoners work with
| slavery. Other countries do the same, and it is not
| regarded as slavery in general.
|
| If people were sold into slavery as a punishment (so they
| became some one else's property) as some ancient
| societies did, then that would clearly be slavery.
|
| The most shocking thing about prisons in the US is how
| common prison rape is, and the extent to which it seems
| to be regarded as a joke. The majority of rapes in the US
| are prison rapes. How can that be anything but an
| appalling problem?
|
| https://en.wikipedia.org/wiki/Prison_rape_in_the_United_S
| tat...
|
| Rape is also something slaves are casually subject to in
| most slave societies. It was definitely accepted that Roman
| slave owners were free to rape men, women and children
| they owned.
| ninjagoo wrote:
| The US Constitution's 13th Amendment abolishing slavery
| specifically allows it for convicted people. [1]
|
| You'll see from the definition of a "slave" [2] that
| prisoner labor specifically fits the definition of a
| slave, hence why the constitution makes an exception for
| it.
|
| [1] https://constitutioncenter.org/the-
| constitution/amendments/a... [2]
| https://www.oed.com/dictionary/slave_n?tl=true
| vlmutolo wrote:
| About 7% of people who have ever lived are alive today.
| Still pretty lucky, but not quite winning the lottery.
| foltik wrote:
| Much luckier if you consider everyone who ever will live,
| assuming we don't destroy ourselves.
| SchemaLoad wrote:
| I feel like most of the safety concerns ended up being proven
| correct, but there's so much money in it that they decided to
| push on anyway full steam ahead.
|
| AI did get used for fake news, propaganda, mass surveillance,
| erosion of trust and sense of truth, and mass spamming social
| media.
| bogtog wrote:
| When people talk about running a (quantized) medium-sized model
| on a Mac Mini, what types of latency and throughput times are
| they talking about? Do they mean like 5 tokens per second or at
| an actually usable speed?
| n42 wrote:
| here's a quick recording from the 20b model on my 128GB M4
| Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr
|
| and the 120b:
| https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
|
| I am, um, floored
| Davidzheng wrote:
| the active param count is low so it should be fast.
| Rhubarrbb wrote:
| Generation is usually fast, but prompt processing is the
| main limitation with local agents. I also have a 128 GB M4
| Max. How is the prompt processing on long prompts?
| Processing the system prompt for Goose always takes quite a
| while for me. I haven't been able to download the 120B yet,
| but I'm looking to switch to either that or the GLM-4.5-Air
| for my main driver.
| anonymoushn wrote:
| it's odd that the result of this processing cannot be
| cached.
| lostmsu wrote:
| It can be and it is by most good processing frameworks.
| ghc wrote:
| Here's a sample of running the 120b model on Ollama with
| my MBP:
|
| ```
|
| total duration: 1m14.16469975s
|
| load duration: 56.678959ms
|
| prompt eval count: 3921 token(s)
|
| prompt eval duration: 10.791402416s
|
| prompt eval rate: 363.34 tokens/s
|
| eval count: 2479 token(s)
|
| eval duration: 1m3.284597459s
|
| eval rate: 39.17 tokens/s
|
| ```
| andai wrote:
| You mentioned "on local agents". I've noticed this too.
| How do ChatGPT and the others get around this, and
| provide instant responses on long conversations?
| bluecoconut wrote:
| Not getting around it, just benefiting from parallel
| compute / huge flops of GPUs. Fundamentally, it's just
| that prefill compute is itself highly parallel and HBM is
| just that much faster than LPDDR. Effectively H100s and
| B100s can chew through the prefill in under a second at
| ~50k token lengths, so the TTFT (Time to First Token) can
| feel amazingly fast.
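|
| The rough arithmetic behind that, as a sketch (the FLOP and
| utilization numbers are ballpark assumptions): prefill costs
| roughly 2 x active-params x prompt-tokens FLOPs, and because
| all prompt tokens are processed in parallel the GPU can throw
| its full matmul throughput at it.
|
| ```
| # Ballpark time-to-first-token for the prefill phase.
| active_params = 3.6e9          # approx. active params per token
| prompt_tokens = 50_000
| prefill_flops = 2 * active_params * prompt_tokens
|
| for name, peak_tflops in [("laptop-class GPU", 50),
|                           ("H100-class GPU", 1000)]:
|     utilization = 0.4          # assumed fraction of peak actually achieved
|     seconds = prefill_flops / (peak_tflops * 1e12 * utilization)
|     print(f"{name}: ~{seconds:.1f}s prefill for a {prompt_tokens}-token prompt")
| ```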
| mike_hearn wrote:
| They cache the intermediate data (KV cache).
| phonon wrote:
| Here's a 4bit 70B parameter model,
| https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b
| Q4_K_M) on a M4 Max 128 GB. Usable, but not very performant.
| a_wild_dandan wrote:
| GLM-4.5-air produces tokens far faster than I can read on my
| MacBook. That's plenty fast enough for me, but YMMV.
| davio wrote:
| On an M1 MacBook Air with 8GB, I got this running Gemma 3n:
|
| 12.63 tok/sec, 860 tokens, 1.52s to first token
|
| I'm amazed it works at all with such limited RAM
| v5v3 wrote:
| I have started a crowdfunding to get you a MacBook air with
| 16gb. You poor thing.
| bookofjoe wrote:
| Up the ante with an M4 chip
| backscratches wrote:
| not meaningfully different, m1 virtually as fast as m4
| wahnfrieden wrote:
| https://github.com/devMEremenko/XcodeBenchmark M4 is
| almost twice as fast as M1
| andai wrote:
| In this table, M4 is also twice as fast as M4.
| wahnfrieden wrote:
| You're comparing across vanilla/Pro/Max tiers. within
| equivalent tier, M4 is almost 2x faster than M1
| v5v3 wrote:
| Twice the cost too.
| wahnfrieden wrote:
| ?
| AtlasBarfed wrote:
| Y not meeee?
|
| After considering my sarcasm for the last 5 minutes, I am
| doubling down. The government of the United States of
| America should enhance its higher IQ people by donating
| AI hardware to them immediately.
|
| This is critical for global competitive economic power.
|
| Send me my hardware US government
| xwolfi wrote:
| higher IQ people <-- well you have to prove that first,
| so let me ask you a test question to prove them: how can
| you mix collaboration and competition in society to
| produce the optimal productivity/conflict ratio ?
| tyho wrote:
| What's the easiest way to get these local models browsing the
| web right now?
| dizhn wrote:
| aider uses Playwright. I don't know what everybody is using
| but that's a good starting point.
| Imustaskforhelp wrote:
| Okay, I will be honest, I was so hyped up about this model but
| then I went to LocalLLaMA and saw that the
|
| 120B model is worse at coding compared to Qwen3 Coder and
| GLM-4.5 Air and even Grok 3... (https://www.reddit.com/r/LocalLLa
| MA/comments/1mig58x/gptoss1...)
| logicchains wrote:
| It's only got around 5 billion active parameters; it'd be a
| miracle if it was competitive at coding with SOTA models that
| have significantly more.
| jph00 wrote:
| On this bench it underperforms vs glm-4.5-air, which is an
| MoE with fewer total params but more active params.
| ascorbic wrote:
| That's SVGBench, which is a useful benchmark but isn't much
| of a test of general coding
| Imustaskforhelp wrote:
| Hm, alright, I will see how this model actually plays out
| instead of forming quick opinions.
|
| Thanks.
| pxc wrote:
| Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!
|
| What does the resource usage look like for GLM 4.5 Air? Is
| that benchmark in FP16? GPT-OSS-120B will be using between
| 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
|
| It seems like a good showing to me, even though Qwen3 Coder
| and GLM 4.5 Air might be preferable for some use cases.
| larodi wrote:
| We'll be running them on Pis off spare juice in no time, and
| there'll be billions of them given how chips and embedded
| hardware spread...
| emehex wrote:
| So 120B was Horizon Alpha and 20B was Horizon Beta?
| ImprobableTruth wrote:
| Unfortunately not, this model is noticeably worse. I imagine
| Horizon is either GPT-5 nano or mini.
| Leary wrote:
| GPQA Diamond: gpt-oss-120b: 80.1%, Qwen3-235B-A22B-Thinking-2507:
| 81.1%
|
| Humanity's Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b
| (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
| jasonjmcghee wrote:
| Wow - I will give it a try then. I'm cynical about OpenAI
| minmaxing benchmarks, but still trying to be optimistic as this
| in 8bit is such a nice fit for apple silicon
| modeless wrote:
| Even better, it's 4 bit
| amarcheschi wrote:
| Glm 4.5 seems on par as well
| thegeomaster wrote:
| GLM-4.5 seems to outperform it on TauBench, too. And it's
| suspicious OAI is not sharing numbers for quite a few useful
| benchmarks (nothing related to coding, for example).
|
| One positive thing I see is the number of parameters and size
| --- it will provide more economical inference than current
| open source SOTA.
| lcnPylGDnU4H9OF wrote:
| Was the Qwen model using tools for Humanity's Last Exam?
| chown wrote:
| Shameless plug: if someone wants to try it in a nice ui, you
| could give Msty[1] a try. It's private and local.
|
| [1]: https://msty.ai
| dsco wrote:
| Does anyone get the demos at https://www.gpt-oss.com to work, or
| are the servers down immediately after launch? I'm only getting
| the spinner after prompting.
| eliseumds wrote:
| Getting lots of 502s from `https://api.gpt-oss.com/chatkit` at
| the moment.
| lukasgross wrote:
| (I helped build the microsite)
|
| Our backend is falling over from the load, spinning up more
| resources!
| anticensor wrote:
| Why isn't GPT-OSS also offered on the free tier of ChatGPT?
| lukasgross wrote:
| Update: try now!
| MutedEstate45 wrote:
| The repeated safety testing delays might not be purely about
| technical risks like misuse or jailbreaks. Releasing open weights
| means relinquishing the control OpenAI has had since GPT-3. No
| rate limits, no enforceable RLHF guardrails, no audit trail.
| Unlike API access, open models can't be monitored or revoked. So
| safety may partly reflect OpenAI's internal reckoning with that
| irreversible shift in power, not just model alignment per se.
| What do you guys think?
| BoorishBears wrote:
| I think it's pointless: if you SFT even their closed source
| models on a specific enough task, the guardrails disappear.
|
| AI "safety" is about making it so that a journalist can't get
| out a recipe for Tabun just by asking.
| MutedEstate45 wrote:
| True, but there's still a meaningful difference in friction
| and scale. With closed APIs, OpenAI can monitor for misuse,
| throttle abuse and deploy countermeasures in real-time. With
| open weights, a single prompt jailbreak or exploit spreads
| instantly. No need for ML expertise, just a Reddit post.
|
| The risk isn't that bad actors suddenly become smarter. It's
| that anyone can now run unmoderated inference and OpenAI
| loses all visibility into how the model's being used or
| misused. I think that's the control they're grappling with
| under the label of safety.
| BoorishBears wrote:
| OpenAI and Azure both have zero retention options, and the
| NYT saga has given pretty strong confirmation they meant it
| when they said zero.
| MutedEstate45 wrote:
| I think you're conflating real-time monitoring with data
| retention. Zero retention means OpenAI doesn't store user
| data, but they can absolutely still filter content, rate
| limit and block harmful prompts in real-time without
| retaining anything. That's processing requests as they
| come in, not storing them. The NYT case was about data
| storage for training/analysis not about real-time safety
| measures.
| BoorishBears wrote:
| Ok you're off in the land of "what if" and I can just
| flat out say: If you have a ZDR account there is no
| filtering on inference, no real-time moderation, no
| blocking.
|
| If you use their training infrastructure there's
| moderation on training examples, but SFT on non-harmful
| tasks still leads to a complete breakdown of guardrails
| very quickly.
| SV_BubbleTime wrote:
| Given that the best jailbreak for an off-line model is
| still simple prompt injection, which is a solved issue for
| the closed source models... I honestly don't know why they
| are talking about safety much at all for open source.
| ahmedhawas123 wrote:
| Exciting as this is to toy around with...
|
| Perhaps I missed it somewhere, but I find it frustrating that,
| unlike most other open weight models and despite this being an
| open release, OpenAI has chosen to provide pretty minimal
| transparency regarding model architecture and training. It's
| become the norm for Llama, DeepSeek, Qwen, Mistral and others to
| provide a pretty detailed write-up on the model which allows
| researchers to advance and compare notes.
| sebzim4500 wrote:
| The model files contain an exact description of the
| architecture of the network, there isn't anything novel.
|
| Given these new models are closer to the SOTA than they are to
| competing open models, this suggests that the 'secret sauce' at
| OpenAI is primarily about training rather than model
| architecture.
|
| Hence why they won't talk about the training.
| gundawar wrote:
| Their model card [0] has some information. It is quite a
| standard architecture though; it's always been that their alpha
| is in their internal training stack.
|
| [0]
| https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
| ahmedhawas123 wrote:
| This is super helpful and I had not seen it, thanks so much
| for sharing! And I hear you on training being an alpha, at
| the size of the model I wonder how much of this is
| distillation and using o3/o4 data.
| sadiq wrote:
| Looks like Groq (at 1k+ tokens/second) and Fireworks are already
| live on openrouter: https://openrouter.ai/openai/gpt-oss-120b
|
| $0.15M in / $0.6-0.75M out
|
| edit: Now Cerebras too at 3,815 tps for $0.25M / $0.69M out.
| podnami wrote:
| Wow this was actually blazing fast. I prompted "how can the
| 45th and 47th presidents of america share the same parents?"
|
| On ChatGPT.com o3 thought for 13 seconds; on OpenRouter GPT
| OSS 120B thought for 0.7 seconds - and they both had the
| correct answer.
| Imustaskforhelp wrote:
| Not gonna lie but I got sorta goosebumps
|
| I am not kidding but such progress from a technological point
| of view is just fascinating!
| swores wrote:
| I'm not sure that's a particularly good question for
| concluding something positive about the "thought for 0.7
| seconds" - it's such a simple answer, ChatGPT 4o (with no
| thinking time) immediately answered correctly. The only
| surprising thing in your test is that o3 wasted 13 seconds
| thinking about it.
| Workaccount2 wrote:
| A current major outstanding problem with thinking models is
| how to get them to think an appropriate amount.
| dingnuts wrote:
| The providers disagree. You pay per token. Verbose
| models are the most profitable. Have fun!
| willy_k wrote:
| For API users, yes, but for the average person with a
| subscription or using the free tier it's the inverse.
| conradkay wrote:
| Nowadays a pretty large % of usage must be going through
| monthly subscriptions
| nisegami wrote:
| Interesting choice of prompt. None of the local models I have
| in ollama (consumer mid range gpu) were able to get it right.
| golergka wrote:
| When I pay attention to o3 CoT, I notice it spends a few
| passes thinking about my system prompt. Hard to imagine this
| question is hard enough to spend 13 seconds on.
| xpe wrote:
| How many people are discussing this after one person did 1
| prompt with 1 data point for each model and wrote a comment?
|
| What is being measured here? For end-to-end time, one model
| is:
|
| t_total = t_network + t_queue + t_batch_wait + t_inference +
| t_service_overhead
| sigmar wrote:
| Non-rhetorically, why would someone pay for o3 api now that I
| can get this open model from openai served for cheaper?
| Interesting dynamic... will they drop o3 pricing next week
| (which is 10-20x the cost[1])?
|
| [1] currently $3M in/ $8M out
| https://platform.openai.com/docs/pricing
| gnulinux wrote:
| Not even that: even if o3 being marginally better matters
| for your task (say), why would anyone use o4-mini? It seems
| to be almost 10x the price for the same performance (maybe
| even less): https://openrouter.ai/openai/o4-mini
| Invictus0 wrote:
| Probably because they are going to announce gpt 5
| imminently
| gnulinux wrote:
| Wow, that's significantly cheaper than o4-mini, which seems to
| be on par with gpt-oss-120b ($1.10/M input tokens, $4.40/M
| output tokens). Almost 10x the price.
|
| LLMs are getting cheaper much faster than I anticipated. I'm
| curious if it's still the hype cycle and
| Groq/Fireworks/Cerebras are taking a loss here, or whether
| things are actually getting cheaper. At this rate we'll be able
| to run Qwen3-32B-level models on phones/embedded devices soon.
| mikepurvis wrote:
| Are the prices staying aligned to the fundamentals (hardware,
| energy), or is this a VC-funded land grab pushing prices to
| the bottom?
| tempaccount420 wrote:
| It's funny because I was thinking the opposite, the pricing
| seems way too high for a 5B parameter activation model.
| gnulinux wrote:
| Sure, you're right, but if I can squeeze o4-mini-level
| utility out of it at less than a quarter of the price,
| does it really matter?
| wahnfrieden wrote:
| Yes
| spott wrote:
| It is interesting that openai isn't offering any inference for
| these models.
| bangaladore wrote:
| Makes sense to me. Inference on these models will be a race
| to the bottom. Hosting inference themselves will be a waste
| of compute / dollar for them.
| tekacs wrote:
| I apologize for linking to Twitter, but I can't post a video
| here, so:
|
| https://x.com/tekacs/status/1952788922666205615
|
| Asking it about a marginally more complex tech topic and
| getting an excellent answer in ~4 seconds, reasoning for 1.1
| seconds...
|
| I am _very_ curious to see what GPT-5 turns out to be, because
| unless they're running on custom silicon / accelerators, even
| if it's very smart, it seems hard to justify not using these
| open models on Groq/Cerebras for a _huge_ fraction of use-
| cases.
| tekacs wrote:
| Cleanshot link for those who don't want to go to X:
| https://share.cleanshot.com/bkHqvXvT
| tekacs wrote:
| A few days ago I posted a slowed-down version of the video
| demo on someone's repo because it was unreadably fast due to
| being sped up.
|
| https://news.ycombinator.com/item?id=44738004
|
| ... today, this is a real-time video of the OSS thinking
| models by OpenAI on Groq and I'd have to slow it down to be
| able to read it. Wild.
| modeless wrote:
| I really want to try coding with this at 2600 tokens/s (from
| Cerebras). Imagine generating thousands of lines of code as
| fast as you can prompt. If it doesn't work who cares, generate
| another thousand and try again! And at $.69/M tokens it would
| only cost $6.50 an hour.
| andai wrote:
| I tried this (gpt-oss-120b with Cerebras) with Roo Code. It
| repeatedly failed to use the tools correctly, and then I got
| 429 too many requests. So much for the "as fast as I can
| think" idea!
|
| I'll have to try again later but it was a bit underwhelming.
|
| The latency also seemed pretty high, not sure why. I think
| with that latency the throughput ends up not making much
| difference.
|
| Btw Groq has the 20b model at 4000 TPS but I haven't tried
| that one.
| modeless wrote:
| Can't wait to see third party benchmarks. The ones in the blog
| post are quite sparse and it doesn't seem possible to fully
| compare to other open models yet. But the few numbers available
| seem to suggest that this release will make all other non-
| multimodal open models obsolete.
| incomingpain wrote:
| I don't see the unsloth files yet but they'll be here:
| https://huggingface.co/unsloth/gpt-oss-20b-GGUF
|
| Super excited to test these out.
|
| The benchmarks from 20B are blowing away major >500b models.
| Insane.
|
| On my hardware.
|
| 43 tokens/sec.
|
| I got an error with flash attention turned on. Can't run it with
| flash attention?
|
| 31,000 context is the max it will allow or the model won't load.
|
| No KV or V quantization.
| rmonvfer wrote:
| What a day! Models aside, the Harmony Response Format[1] also
| seems pretty interesting, and I wonder how much of an impact it
| might have on the performance of these models.
|
| [1] https://github.com/openai/harmony
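|
| Roughly, a rendered assistant turn in Harmony is a stream of
| special tokens along these lines (simplified; the reasoning goes
| in an "analysis" channel and the user-facing reply in "final";
| see the repo for the exact grammar):
|
|   <|start|>assistant<|channel|>analysis<|message|>
|   ...chain of thought...
|   <|end|>
|   <|start|>assistant<|channel|>final<|message|>
|   ...user-visible answer...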
| incomingpain wrote:
| Seems to be breaking every agentic tool I've tried so far.
|
| I'm guessing it's going to very rapidly be patched into the
| various tools.
| mikert89 wrote:
| ACCELERATE
| jakozaur wrote:
| Coding seems to be one of the strongest use cases for LLMs,
| though currently they eat too many tokens to be
| profitable. So perhaps these local models could offload some
| tasks to local computers.
|
| E.g. Hybrid architecture. Local model gathers more data, runs
| tests, does simple fixes, but frequently asks the stronger model
| to do the real job.
|
| Local model gathers data using tools and sends more data to the
| stronger model.
|
| It
| Imustaskforhelp wrote:
| I have always thought that if we can somehow get an AI which is
| insanely good at coding, so much so that it can improve itself,
| then through continuous improvements we will get better models
| of everything else.
|
| Maybe you guys call it AGI, so anytime I see progress in
| coding, I think it goes just a tiny bit in the right
| direction.
|
| Plus it also helps me as a coder to actually do some stuff just
| for fun. Maybe coding is the only truly viable use of AI and
| all the others are negligible gains.
|
| There is so much polarization around the use of AI for coding,
| but I just want to say this: it would be pretty ironic if the
| industry that automates other people's jobs were the first to
| get its own job automated.
|
| But I don't see that happening, far from it. Still, each day
| something new, something better happens back to back. So yeah.
| hooverd wrote:
| Optimistically, there's always more crap to get done.
| jona777than wrote:
| I agree. It's not improbable for there to be _more_ needs
| to meet in the future, in my opinion.
| NitpickLawyer wrote:
| Not to open _that_ can of worms, but in most definitions
| self-improvement is not an AGI requirement. That's already
| ASI territory (Super Intelligence). That's the proverbial
| Skynet (pessimists) or singularity (optimists).
| Imustaskforhelp wrote:
| Hmm, my bad. Yeah, I always thought that it was the
| endgame of humanity, but isn't AGI supposed to be that (the
| endgame)?
|
| What would AGI mean, solving some problem that it hasn't
| seen? Or what exactly? I mean, I think AGI is solved, no?
|
| If not, I see people mentioning that Horizon Alpha is
| actually a GPT-5 model and it's predicted to release on
| Thursday on some betting market, so maybe that fits the AGI
| definition?
| mattfrommars wrote:
| Anyone know how large a context you get running the model
| locally vs. running via the OpenAI API or Cursor? My
| understanding is that the models that run in the cloud have a
| much greater context window than what we can have running
| locally.
| Imustaskforhelp wrote:
| Is this the same model (Horizon Beta) on openrouter or not?
| Because I still see Horizon beta available with its codename on
| openrouter
| abidlabs wrote:
| Test it with a web UI:
| https://huggingface.co/spaces/abidlabs/openai-gpt-oss-120b-t...
| ArtTimeInvestor wrote:
| Why do companies release open source LLMs?
|
| I would understand it, if there was some technology lock-in. But
| with LLMs, there is no such thing. One can switch out LLMs
| without any friction.
| gnulinux wrote:
| Name recognition? Advertisement? Federal grant to beat Chinese
| competition?
|
| There could be many legitimate reasons, but yeah, I'm very
| surprised by this too. Some companies take it a bit too
| seriously and go above and beyond. At this point, unless you
| need the absolute SOTA models because you're throwing an LLM at
| an extremely hard problem, there is very little utility in
| using the larger providers. On OpenRouter, or by renting your
| own GPU, you can run on-par models for much cheaper.
| TrackerFF wrote:
| LLMs are terrible, purely speaking from the business economic
| side of things.
|
| Frontier / SOTA models are barely profitable. Previous-gen
| models lose 90% of their value. Two gens back and they're
| worthless.
|
| And given that their product life cycle is something like 6-12
| months, you might as well open source them as part of
| sundowning them.
| spongebobstoes wrote:
| inference runs at a 30-40% profit
| mclau157 wrote:
| Partially because using their own GPUs is expensive, so maybe
| offloading some GPU usage
| koolala wrote:
| They don't, because it would kill their data-scraping
| business's competitive advantage.
| LordDragonfang wrote:
| Zuckerberg explains a few of the reasons here:
|
| https://www.dwarkesh.com/p/mark-zuckerberg#:~:text=As%20long...
|
| The short version is that if you give a product to open source,
| they can and will donate time and money to improving your
| product, and the ecosystem around it, for free, and you get to
| reap those benefits. Llama has already basically won that space
| (the standard way of running open models _is_ llama.cpp), so
| OpenAI has finally realized they're playing catch-up (and
| last quarter's SOTA isn't worth much revenue to them when
| there's a new SOTA, so they may as well give it away while it
| can still crack into the market).
| a_vanderbilt wrote:
| At least in OpenAI's case, it raises the bar for potential
| competition while also implying that what they have behind the
| scenes is far better.
| __alexs wrote:
| I believe it's to create barriers to entry and make the space
| harder to compete in.
|
| There's still a ton of value in the lower end of the market by
| capability, and it's easier for more companies to compete in.
| If you make the cost floor for that basically free you
| eliminate everyone else's ability to make any profit there and
| then leverage that into building a product that can also
| compete at the higher end. This makes it harder for a new
| market entrant to compete by increasing the minimum capability
| and capital investment required to make a profit in this space.
| HanClinto wrote:
| Holy smokes, there's already llama.cpp support:
|
| https://github.com/ggml-org/llama.cpp/pull/15091
| carbocation wrote:
| And it's already on ollama, it appears:
| https://ollama.com/library/gpt-oss
| incomingpain wrote:
| lm studio immediately released the new appimage with support.
| jp1016 wrote:
| I wish these models had minimum RAM, CPU, and GPU requirements
| listed on the site instead of "high-end" and "medium-end" PC.
| phh wrote:
| You can technically run it on an 8086, assuming you can get
| access to big enough storage.
|
| More reasonably, you should be able to run the 20B at non-
| stupidly-slow speed with a 64-bit CPU, 8GB RAM, and a 20GB SSD.
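|
| For a concrete starting point, the same one-liner people use
| downthread should work here (assuming Ollama is installed):
|
|   ollama run gpt-oss:20b
|
| or point llama.cpp's llama-server at the 20B GGUF.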
| n42 wrote:
| my very early first impression of the 20b model on ollama is that
| it is quite good, at least for the code I am working on; arguably
| good enough to drop a subscription or two
| pamelafox wrote:
| Anyone tried running on a Mac M1 with 16GB RAM yet? I've never
| run higher than an 8GB model, but apparently this one is
| specifically designed to work well with 16 GB of RAM.
| thimabi wrote:
| It works fine, although with a bit more latency than non-local
| models. However, swap usage goes way beyond what I'm
| comfortable with, so I'll continue to use smaller models for
| the foreseeable future.
|
| Hopefully other quantizations of these OpenAI models will be
| available soon.
| pamelafox wrote:
| Update: I tried it out. It took about 8 seconds per token, and
| didn't seem to be using much of my GPU (MPU), but was using a
| lot of RAM. Not a model that I could use practically on my
| machine.
| steinvakt2 wrote:
| Did you run it the best way possible? I'm no expert, but I
| understand the format/engine used can affect inference time
| greatly.
| pamelafox wrote:
| I ran it via Ollama, which I assume uses the best way.
| Screenshot in my post here:
| https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...
|
| I'm still wondering why my MPU usage was so low.. maybe
| Ollama isn't optimized for running it yet?
| wahnfrieden wrote:
| Might need to wait on MLX
| turnsout wrote:
| To clarify, this was the 20B model?
| pamelafox wrote:
| Yep, 20B model, via Ollama: ollama run gpt-oss:20b
|
| Screenshot here with Ollama running and asitop in other
| terminal:
|
| https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol
| 3...
| roboyoshi wrote:
| M2 with 16GB: It's slow for me. ~13GB RAM usage, not locking up
| my mac, but took a very long time thinking and slowly
| outputting tokens.. I'd not consider this usable for everyday
| usage.
| shpongled wrote:
| I looked through their torch implementation and noticed that they
| are applying RoPE to both query and key matrices in every layer
| of the transformer - is this standard? I thought positional
| encodings were usually just added once at the first layer
| m_ke wrote:
| No they're usually done at each attention layer.
| shpongled wrote:
| Do you know when this was introduced (or which paper)? AFAIK
| it's not that way in the original transformer paper, or
| BERT/GPT-2
| Scene_Cast2 wrote:
| Should be in the RoPE paper. The OG transformers used
| additive sinusoidal embeddings, while RoPE does a
| pairwise rotation.
|
| There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't
| use any positional stuff) every fourth layer.
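|
| A minimal numpy sketch of that per-layer rotation (rotate-half
| convention, as in common open implementations; not the exact
| gpt-oss code):
|
|   import numpy as np
|
|   def rope(x, base=10000.0):
|       # x: (seq_len, head_dim), head_dim even
|       seq_len, dim = x.shape
|       half = dim // 2
|       inv_freq = base ** (-np.arange(half) / half)
|       ang = np.outer(np.arange(seq_len), inv_freq)
|       cos, sin = np.cos(ang), np.sin(ang)
|       x1, x2 = x[:, :half], x[:, half:]
|       return np.concatenate([x1 * cos - x2 * sin,
|                              x1 * sin + x2 * cos], axis=-1)
|
|   # applied to q and k (not v) inside every attention layer:
|   # q, k = rope(q), rope(k)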
| Nimitz14 wrote:
| This is normal. RoPE was introduced after BERT/GPT-2.
| spott wrote:
| All the Llamas have done it (well, 2 and 3, and I believe
| 1, I don't know about 4). I think they have a citation for
| it, though it might just be the RoPE paper
| (https://arxiv.org/abs/2104.09864).
|
| I'm not actually aware of any model that _doesn't_ do
| positional embeddings on a per-layer basis (excepting BERT
| and the original transformer paper, and I haven't read the
| GPT2 paper in a while, so I'm not sure about that one
| either).
| shpongled wrote:
| Thanks! I'm not super up to date on all the ML stuff :)
| jstummbillig wrote:
| Shoutout to the hn consensus regarding an OpenAI open model
| release from 4 days ago:
| https://news.ycombinator.com/item?id=44758511
| kingkulk wrote:
| Welcome to the future!
| jedisct1 wrote:
| For some reason I'm less excited about this than I was with the
| Qwen models.
| timmg wrote:
| Orthogonal, but I just wanted to say how awesome Ollama is. It
| took 2 seconds to find the model and a minute to download and now
| I'm using it.
|
| Kudos to that team.
| _ache_ wrote:
| To be fair, it's with the help of OpenAI. They did it together,
| before the official release.
|
| https://ollama.com/blog/gpt-oss
| aubanel wrote:
| From experience, it's much more engineering work on the
| integrator's side than on OpenAI's. Basically they provide
| you their new model in advance, but they don't know the
| specifics of your system, so it's normal that you do most of
| the work. Thus I'm particularly impressed by Cerebras: they
| only have a few models supported for their extreme perf
| inference, it must have been huge bespoke work to integrate.
| Shopper0552 wrote:
| I remember reading Ollama is going closed source now?
|
| https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
| int_19h wrote:
| It's just as easy with LM Studio.
|
| All the real heavy lifting is done by llama.cpp, and for the
| distribution, by HuggingFace.
| PeterStuer wrote:
| I love how they frame High-end desktops and laptops as having "a
| single H100 GPU".
| organsnyder wrote:
| I read that as it runs in data centers (H100 GPUs) or high-end
| desktops/laptops (Strix Halo?).
| xyc wrote:
| I'm running it with ROG Flow Z13 128GB Strix Halo and getting
| 50 tok/s for 20B model and 12 tok/s for 120B model. I'd say
| it's pretty usable.
| organsnyder wrote:
| Excellent! I have a Framework Desktop with 128GB on
| preorder--really looking forward to getting it.
| robertheadley wrote:
| I actually tried to ask the Model about that, then I asked
| ChatGPT, both times, they just said that it was marketing
| speak.
|
| I was like no. It is false advertising.
| phh wrote:
| Well if nVidia wasn't late, it would be runnable on nVidia
| project Digits.
| PeterStuer wrote:
| Yes, they are late to the party. Maybe they do not want to
| eat into the RTX Pro 6000 sales. In the meantime, there is
| the AMD Ryzen(tm) AI Max+ 395.
| piskov wrote:
| Don't forget about mac studio
| kgwgk wrote:
| It may be useless for many use cases given that its policy
| prevents it for example from providing "advice or instructions
| about how to buy something."
|
| (I included details about its refusal to answer even after using
| tools for web searching but hopefully shorter comment means fewer
| downvotes.)
| isoprophlex wrote:
| Can these do image inputs as well? I can't find anything about
| that on the linked page, so I guess not..?
| cristoperb wrote:
| No, they're text only
| pu_pe wrote:
| Very sparse benchmarking results released so far. I'd bet the
| Chinese open source models beat them on quite a few of them.
| foundry27 wrote:
| Model cards, for the people interested in the guts:
| https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
|
| In my mind, I'm comparing the model architecture they describe to
| what the leading open-weights models (Deepseek, Qwen, GLM, Kimi)
| have been doing. Honestly, it just seems "ok" at a technical
| level:
|
| - both models use standard Grouped-Query Attention (64 query
| heads, 8 KV heads). The card talks about how they've used an
| older optimization from GPT3, which is alternating between banded
| window (sparse, 128 tokens) and fully dense attention patterns.
| It uses RoPE extended with YaRN (for a 131K context window). So
| they haven't been taking advantage of the special-sauce Multi-
| head Latent Attention from Deepseek, or any of the other similar
| improvements over GQA.
|
| - both models are standard MoE transformers. The 120B model
| (116.8B total, 5.1B active) uses 128 experts with Top-4 routing.
| They're using some kind of gated SwiGLU activation, which the
| card describes as "unconventional" because of its clamping
| and the residual connections that implies. Again, not using
| any of Deepseek's "shared experts" (for general patterns) +
| "routed experts" (for specialization) architectural improvements,
| Qwen's load-balancing strategies, etc.
|
| - the most interesting thing IMO is probably their quantization
| solution. They quantized >90% of the model parameters to the
| MXFP4 format (4.25 bits/parameter) to let the 120B model fit on
| a single 80GB GPU, which is pretty cool. But we've also got
| Unsloth with their famous 1.58-bit quants :)
|
| All this to say, it seems like even though the training they did
| for their agentic behavior and reasoning is undoubtedly very
| good, they're keeping their actual technical advancements "in
| their pocket".
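|
| For scale, a quick back-of-the-envelope sketch using the model
| card numbers above (treating all weights as MXFP4 for
| simplicity, even though only >90% of them actually are):
|
|   total_params   = 116.8e9  # 120B checkpoint
|   active_params  = 5.1e9    # top-4 of 128 routed experts
|   bits_per_param = 4.25     # MXFP4: 4-bit values + block scales
|
|   weights_gb = total_params * bits_per_param / 8 / 1e9  # ~62 GB
|   expert_sparsity = 128 // 4                            # 32:1
|
| ~62 GB of weights is what lets the 120B sit on a single 80GB GPU
| with room left over for KV cache.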
| rfoo wrote:
| Or, you can say, OpenAI has some real technical advancements on
| stuff _besides_ attn architecture. GQA8, alternating SWA 128 /
| full attn do all seem conventional. Basically they are showing
| us that "no secret sauce in model arch you guys just sucks at
| mid/post-training", or they want us to believe this.
|
| The model is pretty sparse tho, 32:1.
| liuliu wrote:
| Kimi K2 paper said that the model sparsity scales up with
| parameters pretty well (MoE sparsity scaling law, as they
| call, basically calling Llama 4 MoE "done wrong"). Hence K2
| has 128:1 sparsity.
| throwdbaaway wrote:
| I thought Kimi K2 uses 8 active experts out of 384?
| Sparsity should be 48:1. Indeed Llama4 Maverick is the only
| one that has 128:1 sparsity.
| liuliu wrote:
| You are right. I misremembered the sparsity part of K2.
| For the "done wrong" part, I was thinking about how Scout
| -> Maverick -> Behemoth doesn't scale sparsity according
| to any formula (less sparse -> sparse -> less sparse).
| nxobject wrote:
| It's convenient to be able to attribute success to things
| only OpenAI could've done with the combo of their early start
| and VC money - licensing content, hiring subject matter
| experts, etc. Essentially the "soft" stuff that a mature
| organization can do.
| logicchains wrote:
| >They did something to quantize >90% of the model parameters to
| the MXFP4 format (4.25 bits/parameter) to let the 120B model to
| fit on a single 80GB GPU, which is pretty cool
|
| They said it was native FP4, suggesting that they actually
| trained it like that; it's not post-training quantisation.
| rushingcreek wrote:
| The native FP4 is one of the most interesting architectural
| aspects here IMO, as going below FP8 is known to come with
| accuracy tradeoffs. I'm curious how they navigated this and
| how the FP8 weights (if they exist) would perform.
| buildbot wrote:
| One thing to note is that MXFP4 is a block scaled format,
| with 4.25 bits per weight. This lets it represent a lot
| more numbers than just raw FP4 would with say 1 mantissa
| and 2 exponent bits.
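|
| Concretely, assuming the standard MX layout (one shared 8-bit
| scale per block of 32 FP4 values):
|
|   bits per weight = (32 * 4 + 8) / 32 = 4.25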
| danieldk wrote:
| Also: attention sinks (although implemented as extra trained
| logits used in attention softmax rather than attending to e.g.
| a prepended special token).
| mclau157 wrote:
| You can get similar insights looking at the github repo
| https://github.com/openai/gpt-oss
| tgtweak wrote:
| I think their MXFP4 release is a bit of a gift since they
| obviously used and tuned this extensively as a result of cost-
| optimization at scale - something the open source model
| providers aren't doing too much, and also somewhat of a
| competitive advantage.
|
| Unsloth's special quants are amazing but I've found there to be
| lots of trade offs vs full quantization, particularly when
| striving for best first-shot attempts - which is by far the
| bulk of LLM use cases. Running a better (larger, newer) model
| at lower quantization to fit in memory, or with reduced
| accuracy/detail to speed it up, both have value, but in the
| pursuit of first-shot accuracy there don't seem to be many
| companies running their frontier models at reduced
| quantization. If OpenAI is doing this in production, that is
| interesting.
| highfrequency wrote:
| I would guess the "secret sauce" here is distillation:
| pretraining on an extremely high quality synthetic dataset from
| the prompted output of their state of the art models like o3
| rather than generic internet text. A number of research results
| have shown that highly curated technical problem solving data
| is unreasonably effective at boosting smaller models'
| performance.
|
| This would be much more efficient than relying purely on RL
| post-training on a small model; with low baseline capabilities
| the insights would be very sparse and the training very
| inefficient.
| asadm wrote:
| > research results have shown that highly curated technical
| problem solving data is unreasonably effective at boosting
| smaller models' performance.
|
| same seems to be true for humans
| tempaccount420 wrote:
| Wish they gave us access to learn from those grandmother
| models instead of distilled slop.
| ashdksnndck wrote:
| It behooves them to keep the best stuff internal, or at
| least greatly limit any API usage to avoid giving the
| goods away to other labs they are racing with.
| saurik wrote:
| Which, presumably, is the reason they removed 4.5 from
| the API... mostly the only people willing to pay that
| much for that model were their competitors. (I mean, I
| would pay even more than they were charging, but I
| imagine even if I scale out my use cases--which, for just
| me, are mostly satisfied by being trapped in their UI--it
| would be a pittance vs. the simpler stuff people keep
| using.)
| throw310822 wrote:
| Yes, if I understand correctly, what it means is "a very
| smart teacher can do wonders for their pupils' education".
| unethical_ban wrote:
| I don't know how to ask this without being direct and dumb:
| Where do I get a layman's introduction to LLMs that could work
| me up to understanding every term and concept you just
| discussed? Either specific videos, or if nothing else, a
| reliable Youtube channel?
| srigi wrote:
| Start with the YT series on neural nets and LLMs from
| 3blue1brown
| umgefahren wrote:
| There is a great 3blue1brown video, but it's pretty much
| impossible by now to cover the entire landscape of research.
| I bet gpt-oss has some great explanations though ;)
| CanuckPro wrote:
| Try Andrej Karpathy's YouTube videos. I also really liked the
| Dive into Deep Learning book at d2l.ai
| tkgally wrote:
| What I've sometimes done when trying to make sense of recent
| LLM research is give the paper and related documents to
| ChatGPT, Claude, or Gemini and ask them to explain the
| specific terms I don't understand. If I don't understand
| their explanations or want to know more, I ask follow-ups.
| Doing this in voice mode works better for me than text chat
| does.
|
| When I just want a full summary without necessarily
| understanding all the details, I have an audio overview made
| on NotebookLM and listen to the podcast while I'm exercising
| or cleaning. I did that a few days ago with the recent
| Anthropic paper on persona vectors, and it worked great.
| tshannon wrote:
| So probably another stupid question, but how do you know
| what it's spitting out is accurate?
| tkgally wrote:
| One has to be aware of the possibility of hallucinations,
| of course. But I have not encountered any hallucinations
| in these sorts of interactions with the current leading
| models. Questions like "what does 'embedding space' mean
| in the abstract of this paper?" yield answers that, in my
| experience, make sense in the context and check out when
| compared with other sources. I would be more cautious if
| I were using a smaller model or if I were asking
| questions about obscure information without supporting
| context.
|
| Also, most of my questions are not about specific facts
| but about higher-level concepts. For research about ML,
| at least, the responses check out.
| nonfamous wrote:
| Try Microsoft's "Generative AI for Beginners" repo on GitHub.
| The early chapters in particular give a good grounding of LLM
| architecture without too many assumptions of background
| knowledge. The video version of the series is good too.
| reilly3000 wrote:
| Ask Gemini. Give it a link here in fact.
| cwyers wrote:
| This is a great book (parts of it are available as blog posts
| from the author if you want to get a taste of it):
|
| https://www.manning.com/books/build-a-large-language-
| model-f...
| user_7832 wrote:
| Newbie question: I remember folks talking about how Kimi K2's
| launch might have pushed OpenAI to launch their model later. Now
| that we know (or shortly will) how this model performs, how do
| they stack up? Did OpenAI actually hold off on releasing weights
| because of Kimi, in retrospect?
| ClassAndBurn wrote:
| Open models are going to win long-term. Anthropic's own research
| has to use OSS models [0]. China is demonstrating how quickly
| companies can iterate on open models, giving smaller teams
| access to a model's abilities, and the means to augment them,
| without paying the training cost.
|
| My personal prediction is that the US foundational model makers
| will OSS something close to N-1 for the next 1-3 iterations. The
| CAPEX for the foundational model creation is too high to justify
| OSS for the current generation. Unless the US Gov steps up and
| starts subsidizing power, or Stargate does 10x what is planned
| right now.
|
| N-1 model value depreciates insanely fast. Making an OSS release
| of them and allowing specialized use cases and novel developments
| allows potential value to be captured and integrated into future
| model designs. It's medium risk, as you may lose market share.
| But also high potential value, as the shared discoveries could
| substantially increase the velocity of next-gen development.
|
| There will be a plethora of small OSS models. Iteration on the
| OSS releases is going to be biased towards local development,
| creating more capable and specialized models that work on smaller
| and smaller devices. In an agentic future, every different agent
| in a domain may have its own model. Distilled and customized for
| its use case without significant cost.
|
| Everyone is racing to AGI/SGI. The models along the way are to
| capture market share and use data for training and evaluations.
| Once someone hits AGI/SGI, the consumer market is nice to have,
| but the real value is in novel developments in science,
| engineering, and every other aspect of the world.
|
| [0] https://www.anthropic.com/research/persona-vectors > We
| demonstrate these applications on two open-source models, Qwen
| 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
| lechatonnoir wrote:
| I'm pretty sure there's no reason that Anthropic _has_ to do
| research on open models; it's just that they produced their
| result on open models so that you can reproduce it without
| having access to theirs.
| Adrig wrote:
| I'm a layman but it seemed to me that the industry is going
| towards robust foundational models on which we plug tools,
| databases, and processes to expand their capabilities.
|
| In this setup OSS models could be more than enough and capture
| the market, but I don't see where the value would be in a
| multitude of specialized models we have to train.
| renmillar wrote:
| There's no reason that models too large for consumer hardware
| wouldn't keep a huge edge, is there?
| AtlasBarfed wrote:
| That is fundamentally a big O question.
|
| I have this theory that we simply got over a hump by
| utilizing a massive processing boost from gpus as opposed to
| CPUs. That might have been two to three orders of magnitude
| more processing power.
|
| But that's a one-time success. I don't think hardware has any
| large-scale improvements coming, because 3D gaming already
| plumbed most of that vector-processing hardware development
| over the last 30 years.
|
| So will software and better training models produce another
| couple of orders of magnitude?
|
| Fundamentally we're talking about nines of accuracy. What
| is the processing power required for each nine of accuracy?
| Is it linear? Is it polynomial? Is it exponential?
|
| It just seems strange to me that with all the AI knowledge
| sloshing through academia, I haven't seen any basic analysis
| at that level, which is something that's absolutely going to
| be necessary for AI applications like self-driving, once you
| get the insurance companies involved.
| xpe wrote:
| > Open models are going to win long-term.
|
| [1 of 3] For the sake of argument here, I'll grant the premise.
| If this turns out to be true, it glosses over other key
| questions, including:
|
| For a frontier lab, what is a _rational_ period of time
| (according to your organizational mission / charter /
| shareholder motivations*) to wait before:
|
| 1. releasing a new version of an open-weight model; and
|
| 2. how much secret sauce do you hold back?
|
| * Take your pick. These don't align perfectly with each other,
| much less the interests of a nation or world.
| xpe wrote:
| > Open models are going to win long-term.
|
| [2 of 3] Assuming we pin down what _win_ means... (which is
| definitely not easy)... What would it take for this to _not_ be
| true? There are many ways, including but not limited to:
|
| - publishing open weights helps your competitors catch up
|
| - publishing open weights doesn't improve your own research
| agenda
|
| - publishing open weights leads to a race dynamic where only
| the latest and greatest matters; leading to a situation where
| the resources sunk exceed the gains
|
| - publishing open weights distracts your organization from
| attaining a sustainable business model / funding stream
|
| - publishing open weights leads to significant negative
| downstream impacts (there are a variety of uncertain outcomes,
| such as: deepfakes, security breaches, bioweapon development,
| unaligned general intelligence, humans losing control [1] [2],
| and so on)
|
| [1]: "What failure looks like" by Paul Christiano :
| https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
|
| [2]: "An AGI race is a suicide race." - quote from Max Tegmark;
| article at https://futureoflife.org/statement/agi-manhattan-
| project-max...
| xpe wrote:
| > Open models are going to win long-term.
|
| [3 of 3] What would it take for this statement to be _false_ or
| _missing the point_?
|
| Maybe we find ourselves in a future where:
|
| - Yes, open models are widely used as base models, but they are
| also highly customized in various ways (perhaps by industry,
| person, attitude, or something else). In other words, this
| would be a blend of open and closed.
|
| - Maybe publishing open weights of a model is more-or-less
| irrelevant, because it is "table stakes" ... because all the
| key differentiating advantages have to do with other factors,
| such as infrastructure, non-LLM computational aspects,
| regulatory environment, affordable energy, customer base,
| customer trust, and probably more.
|
| - The future might involve thousands or millions of highly
| tailored models
| albertzeyer wrote:
| > Once someone hits AGI/SGI
|
| I don't think there will be such a unique event. There is no
| clear boundary. This is a continuous process. Models get
| slightly better than before.
|
| Also, another dimension is the inference cost to run those
| models. It has to be cheap enough to really take advantage of
| it.
|
| Also, I wonder, what would be a good target to make profit, to
| develop new things? There is Isomorphic Labs, which seems like
| a good target. This company already exists now, and people are
| working on it. What else?
| dom96 wrote:
| > I don't think there will be such a unique event.
|
| I guess it depends on your definition of AGI, but if it means
| human level intelligence then the unique event will be the AI
| having the ability to act on its own without a "prompt".
| rossant wrote:
| And the ability to improve itself.
| seba_dos1 wrote:
| > the unique event will be the AI having the ability to act
| on its own without a "prompt"
|
| That's super easy. The reason they need a prompt is that
| this is the way we make them useful. We don't need LLMs to
| generate an endless stream of random "thoughts" otherwise,
| but if you really wanted to, just hook one up to a webcam
| and microphone stream in a loop and provide it some storage
| for "memories".
| teaearlgraycold wrote:
| > N-1 model value depreciates insanely fast
|
| This implies LLM development hasn't plateaued. Sure, the
| researchers are busting their asses quantizing, adding
| features like tool calls and structured outputs, etc. But soon
| enough N-1 ~= N.
| swalsh wrote:
| To me it depends on 2 factors: hardware becoming more
| accessible, and the closed-source offerings becoming more
| expensive. Right now 1) it's difficult to get enough GPUs to do
| local inference at production scale, and 2) it's more expensive
| to run your own GPUs vs closed-source models.
| mythz wrote:
| Getting great performance running gpt-oss on 3x A4000s:
|
|   gpt-oss:20b          = ~46 tok/s
|
| More than 2x faster than my previous leading OSS models:
|
|   mistral-small3.2:24b = ~22 tok/s
|   gemma3:27b           = ~19.5 tok/s
|
| Strangely, I get nearly the opposite performance running on 1x
| 5070 Ti:
|
|   mistral-small3.2:24b = ~39 tok/s
|   gpt-oss:20b          = ~21 tok/s
|
| where gpt-oss is nearly 2x slower vs mistral-small 3.2.
| genpfault wrote:
| Seeing ~70 tok/s on a 7900 XTX using Ollama.
| Matsta wrote:
| I'm getting around 90 tok/s on a 3090 using Ollama.
|
| Pretty impressive
| mythz wrote:
| OK, the issue is with ollama, as gpt-oss 20B runs much faster on
| 1x 5070 Ti with llama.cpp and LM Studio:
|
|   llama-server     = ~181 tok/s
|   LM Studio        = ~46 tok/s (default)
|   LM Studio custom = ~158 tok/s (offloaded to GPU, switched to
|                      the CUDA llama.cpp engine)
|
| and llama-server on my 3x A4000 GPU server gets 90 tok/s
| vs 46 tok/s on ollama.
| anonymoushn wrote:
| guys, what does OSS stand for?
| thejazzman wrote:
| it's a marketing term that modern companies use to grow market
| share
| ayakaneko wrote:
| should be open source software, but it's a model, so not sure
| whether they chose this name with the last S having other
| meanings.
| Robdel12 wrote:
| I'm on my phone and haven't been able to break away to check, but
| anyone plug these into Codex yet?
| jcmontx wrote:
| I'm out of the loop for local models. For my M3 24gb ram macbook,
| what token throughput can I expect?
|
| Edit: I tried it out. I have no idea in terms of tokens, but it
| was fluid enough for me. A bit slower than using o3 in the
| browser but definitely tolerable. I think I will set it up on my
| GF's machine so she can stop paying for the full subscription
| (she's a non-tech professional).
| steinvakt2 wrote:
| Wondering about the same for my M4 max 128 gb
| jcmontx wrote:
| It should fly on your machine
| steinvakt2 wrote:
| Yeah, was super quick and easy to set up using Ollama. I
| had to kill some processes first to avoid memory swap
| though (even with 128gb memory). So a slightly more
| quantized version is maybe ideal, for me at least.
|
| Edit: I'm talking about the 120B model of course
| coolspot wrote:
| 40 t/s
| dantetheinferno wrote:
| Apple M4 Pro w/ 48GB running the smaller version. I'm getting
| 43.7t/s
| albertgoeswoof wrote:
| 3 year old M1 MacBook Pro 32gb, 42 tokens/sec on lm studio
|
| Very much usable
| ivape wrote:
| Curious if anyone is running this on a AMD Ryzen AI Max+ 395
| and knows the t/s.
| Rhubarrbb wrote:
| What's the best agent to run this on? Is it compatible with
| Codex? For OSS agents, I've been using Qwen Code (clunky fork of
| Gemini), and Goose.
| wahnfrieden wrote:
| Why not Claude Code?
| objektif wrote:
| I keep hitting the limit within an hour.
| wahnfrieden wrote:
| Meant with your own model
| henriquegodoy wrote:
| Seeing a 20B model competing with o3's performance is mind
| blowing. Just a year ago, most of us would've called this
| impossible - not just the intelligence leap, but getting this
| level of capability in such a compact size.
|
| The point that makes me most excited is that we can
| train trillion-parameter giants and distill them down to just
| billions without losing the magic. Imagine coding with Claude 4
| Opus-level intelligence packed into a 10B model running locally
| at 2000 tokens/sec - like instant AI collaboration. That would
| fundamentally change how we develop software.
| coolspot wrote:
| 10B params * 2000 t/s = 20,000 GB/s of memory bandwidth (at one
| byte per parameter). Apple hardware can do ~1k GB/s.
| oezi wrote:
| That's why MoE is needed.
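|
| Roughly, tok/s <= bandwidth / (active params * bytes per param).
| Taking gpt-oss-20b's ~3.6B active params at MXFP4 (~0.53
| bytes/param), 1 TB/s of bandwidth caps out around 1000 / 1.9,
| i.e. ~500 tok/s, ignoring KV cache and activation traffic.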
| int_19h wrote:
| It's not even a 20b model. It's 20b MoE with 3.6b active
| params.
|
| But it does not actually compete with o3 performance. Not even
| close. As usual, the metrics are bullshit. You don't know how
| good the model actually is until you grill it yourself.
| Nimitz14 wrote:
| I'm surprised at the model dim being 2.8k with an output size of
| 200k. My gut feeling had told me you don't want too large of a
| gap between the two, seems I was wrong.
| ukprogrammer wrote:
| > we also introduced an additional layer of evaluation by testing
| an adversarially fine-tuned version of gpt-oss-120b
|
| What could go wrong?
| nirav72 wrote:
| I don't exactly have the ideal hardware to run locally - but I
| just ran the 20b in LM Studio with a 3080 Ti (12GB VRAM) with
| some offloading to CPU. Ran a couple of quick code generation
| tests. On average about 20 t/sec. But response quality was very
| similar or on par with ChatGPT o3 for the same code it output.
| So it's not bad.
| nodesocket wrote:
| Anybody got this working in Ollama? I'm running latest version
| 0.11.0 with WebUI v0.6.18 but getting:
|
| > List the US presidents in order starting with George Washington
| and their time in office and year taken office.
|
| >> 00: template: :3: function "currentDate" not defined
| genpfault wrote:
| https://github.com/ollama/ollama/issues/11673
| jmorgan wrote:
| Sorry about this. Re-downloading Ollama should fix the error
| nodesocket wrote:
| Thanks for the reply and speedy patch Jeffery. Seems to be
| working now, except my 4060ti can't hang lacking enough vram.
| ahmetcadirci25 wrote:
| I started downloading, I'm eager to test it. I will share my
| personal experiences. https://ahmetcadirci.com/2025/gpt-oss/
| koolala wrote:
| Calls them open-weight. Names them 'oss'. What does oss stand
| for?
| incomingpain wrote:
| First coding test: just copying and pasting out of chat. It aced
| my first coding test in 5 seconds... this is amazing. It's really
| good at coding.
|
| Trying to use it for agentic coding...
|
| lots of fail. This harmony formatting? Anyone have a working
| agentic tool?
|
| openhands and void ide are failing due to the new tags.
|
| Aider worked, but the file it was supposed to edit was untouched
| and it created
|
| Create new file? (Y)es/(N)o [Yes]:
|
| Applied edit to
| <|end|><|start|>assistant<|channel|>final<|message|>main.py
|
| so the file name is
| '<|end|><|start|>assistant<|channel|>final<|message|>main.py'
| lol. quick rename and it was fantastic.
|
| I think qwen code is the best choice so far, but it's unreliable.
| The new tags still come through, but it only works properly
| sometimes.
|
| Only 1 of my tests so far has gotten 20b to fail on the
| first iteration; with a small follow-up it was able to
| completely fix it right away.
|
| Very impressive model for 20B.
| bobsmooth wrote:
| Hopefully the dolphin team will work their magic and uncensor
| this model
| siliconc0w wrote:
| It seems like OSS will win, I can't see people willing to pay
| like 10x the price for what seems like 10% more performance.
| Especially once we get better at routing the hardest questions to
| the better models and then using that response to augment/fine-
| tune the OSS ones.
| n42 wrote:
| to me it seems like the market is breaking into an 80/20 of
| B2C/B2B; the B2C use case becoming OSS models (the market
| shifts to devices that can support them), and the B2B market
| being priced appropriately for businesses that require that
| last 20% of absolute cutting edge performance as the cloud
| offering
| seydor wrote:
| This is good for China
| chromaton wrote:
| This has been available (20b version, I'm guessing) for the past
| couple of days as "Horizon Alpha" on Openrouter. My benchmarking
| runs with TianshuBench for coding and fluid intelligence were
| rate limited, but the initial results are worse than those of
| DeepSeek R1 and Kimi K2.
| lukax wrote:
| Inference in Python uses harmony [1] (for request and response
| format) which is written in Rust with Python bindings. Another
| OpenAI's Rust library is tiktoken [2], used for all tokenization
| and detokenization. OpenAI Codex [3] is also written in Rust. It
| looks like OpenAI is increasingly adopting Rust (at least for
| inference).
|
| [1] https://github.com/openai/harmony
|
| [2] https://github.com/openai/tiktoken
|
| [3] https://github.com/openai/codex
| chilipepperhott wrote:
| As an engineer that primarily uses Rust, this is a good omen.
| Philpax wrote:
| The less Python in the stack, the better!
| fnands wrote:
| Mhh, I wonder if these are distilled from GPT4-Turbo.
|
| I asked it some questions and it seems to think it is based on
| GPT4-Turbo:
|
| > Thus we need to answer "I (ChatGPT) am based on GPT-4 Turbo;
| number of parameters not disclosed; GPT-4's number of parameters
| is also not publicly disclosed, but speculation suggests maybe
| around 1 trillion? Actually GPT-4 is likely larger than 175B;
| maybe 500B. In any case, we can note it's unknown.
|
| As well as:
|
| > GPT-4 Turbo (the model you're talking to)
| fnands wrote:
| Also:
|
| > The user appears to think the model is "gpt-oss-120b", a new
| open source release by OpenAI. The user likely is
| misunderstanding: I'm ChatGPT, powered possibly by GPT-4 or
| GPT-4 Turbo as per OpenAI. In reality, there is no "gpt-
| oss-120b" open source release by OpenAI
| christianqchung wrote:
| A little bit of training data certainly has gotten in there,
| but I don't see any reasons for them to deliberately distill
| from such an old model. Models have always been really bad at
| telling you what model they are.
| seba_dos1 wrote:
| Just stop and think a bit about where a model may get the
| knowledge of its own name from.
| sabakhoj wrote:
| Super excited to see these released!
|
| Major points of interest for me:
|
| - In the "Main capabilities evaluations" section, the 120b
| outperform o3-mini and approaches o4 on most evals. 20b model is
| also decent, passing o3-mini on one of the tasks.
|
| - AIME 2025 is nearly saturated with large CoT
|
| - CBRN threat levels kind of on par with other SOTA open source
| models. Plus, demonstrated good refusals even after adversarial
| fine tuning.
|
| - Interesting to me how a lot of the safety benchmarking runs on
| trust, since methodology can't be published too openly due to
| counterparty risk.
|
| Model cards with some of my annotations:
| https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...
| davidw wrote:
| Big picture, what's the balance going to look like, going forward
| between what normal people can run on a fancy computer at home vs
| heavy duty systems hosted in big data centers that are the
| exclusive domain of Big Companies?
|
| This is something about AI that worries me, as a 'child' of the
| open source coming-of-age era in the '90s. I don't want to be
| forced to rely on those big companies to do my job efficiently,
| if AI becomes part of the day-to-day workflow.
| sipjca wrote:
| Isn't it that hardware catches up and becomes cheaper? The
| margin on these chips right now is outrageous, but what happens
| as there is more competition? What happens when there is more
| supply? Are we overbuilding? Apple M series chips already
| perform phenomenally for this class of models and you bet both
| AMD and NVIDIA are playing with unified memory architectures
| too for the memory bandwidth. It seems like today's really
| expensive stuff may become the norm rather than the exception.
| Assuming architectures lately stay similar and require large
| amounts of fast memory.
| maxloh wrote:
| > We introduce gpt-oss-120b and gpt-oss-20b, two open-weight
| reasoning models available under the Apache 2.0 license and our
| gpt-oss usage policy. [0]
|
| Is it even valid to have additional restriction on top of Apache
| 2.0?
|
| [0]: https://openai.com/index/gpt-oss-model-card/
| qntmfred wrote:
| you can just do things
| maxloh wrote:
| Not for all licenses.
|
| For example, GPL has a "no-added-restrictions" clause, which
| allows the recipient of the software to ignore any additional
| restrictions added alongside the license.
|
| > All other non-permissive additional terms are considered
| "further restrictions" within the meaning of section 10. If
| the Program as you received it, or any part of it, contains a
| notice stating that it is governed by this License along with
| a term that is a further restriction, you may remove that
| term. If a license document contains a further restriction
| but permits relicensing or conveying under this License, you
| may add to a covered work material governed by the terms of
| that license document, provided that the further restriction
| does not survive such relicensing or conveying.
| ninjin wrote:
| > Is it even valid to have additional restriction on top of
| Apache 2.0?
|
| You can legally do whatever you want; the question is whether
| you will then, for your own benefit, be appropriating a term
| like open source (like Facebook does) by adding restrictions
| not in line with how the term is traditionally used, or whether
| you are actually honest about it and call it something like
| "weights available".
|
| In the case of OpenAI here, I am not a lawyer, and I am _also_
| not sure if the gpt-oss usage policy runs afoul of open source
| as a term. They did not bother linking the policy from the
| announcement, which was odd, but here it is:
|
| https://huggingface.co/openai/gpt-oss-120b/blob/main/USAGE_P...
|
| Compared to the wall of text that Facebook throws at you, let
| me post it here as it is rather short: "We aim for our tools to
| be used safely, responsibly, and democratically, while
| maximizing your control over how you use them. By using OpenAI
| gpt-oss-120b, you agree to comply with all applicable law."
|
| I suspect this sentence still is too much to add and _may_
| invalidate the Open Source Initiative (OSI) definition, but at
| this point I would want to ask a lawyer and preferably one from
| OSI. Regardless, credit to OpenAI for moving the status quo in
| the right direction as the only further step we really can take
| is to remove the usage policy entirely (as is the standard for
| open source software anyway).
| pbkompasz wrote:
| where gpt-5
| ramoz wrote:
| This is a solid enterprise strategy.
|
| Frontier labs are incentivized to start breaching these
| distribution paths. This will evolve into large scale
| "intelligent infra" plays.
| matznerd wrote:
| Thanks openai for being open ;) Surprised there are no official
| MLX versions and only one mention of MLX in this thread. MLX
| basically converts the models to take advantage of Mac unified
| memory for a 2-5x increase in performance, enabling Macs to run
| what would otherwise take expensive GPUs (within limits).
|
| So FYI to anyone on a Mac, the easiest way to run these models
| right now is using LM Studio (https://lmstudio.ai/); it's free.
| You just search for the model; usually the 3rd-party groups mlx-
| community or lmstudio-community have MLX versions within a day or
| 2 of releases. I go for the 8-bit quantizations (4-bit is faster,
| but quality drops). You can also convert to MLX yourself...
|
| Once you have it running in LM Studio, you can chat there in
| their chat interface, or you can hit it through an API that
| defaults to http://127.0.0.1:1234
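|
| Something like this should then work against that local endpoint
| (the model id is whatever LM Studio shows for your download;
| "openai/gpt-oss-20b" here is just a guess):
|
|   from openai import OpenAI
|
|   client = OpenAI(base_url="http://127.0.0.1:1234/v1",
|                   api_key="lm-studio")  # any placeholder key
|   resp = client.chat.completions.create(
|       model="openai/gpt-oss-20b",
|       messages=[{"role": "user",
|                  "content": "Say hi in five words."}],
|   )
|   print(resp.choices[0].message.content)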
|
| You can run multiple models that hot swap and load instantly and
| switch between them etc.
|
| It's surprisingly easy, and fun. There are actually a lot of cool
| niche models coming out, like this tiny high-quality search
| model released today as well (which has an official MLX
| version): https://huggingface.co/Intelligent-Internet/II-Search-4B
|
| Other fun ones are Gemma 3n, which is multi-modal; a larger
| one that is actually a solid model but takes more memory is the
| new Qwen3 30B A3B (coder and instruct); Pixtral (Mixtral vision
| with full-resolution images); etc. I look forward to playing
| with this model and seeing how it compares.
| umgefahren wrote:
| Regarding MLX:
|
| In the repo is a metal port they made, that's at least
| something... I guess they didn't want to cooperate with Apple
| before the launch but I am sure it will be there tomorrow.
| matznerd wrote:
| Here are the LM Studio MLX models:
|
| LM Studio community:
| 20b: https://huggingface.co/lmstudio-community/gpt-oss-20b-MLX-8b...
| 120b: https://huggingface.co/lmstudio-community/gpt-oss-120b-MLX-8...
| NicoJuicy wrote:
| Ran gpt-oss:20b on a RTX 3090 24 gb vram through ollama, here's
| my experience:
|
| Basic ollama calling through a POST endpoint works fine. However,
| structured output doesn't work. The model is insanely fast
| and good at reasoning.
|
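| A minimal sketch of the kind of call I mean, using ollama's REST
| API with a JSON schema in the "format" field (its structured
| output mechanism); the exact schema here is just an example:
|
|   import requests
|
|   schema = {
|       "type": "object",
|       "properties": {
|           "presidents": {"type": "array",
|                          "items": {"type": "string"}},
|       },
|       "required": ["presidents"],
|   }
|   resp = requests.post("http://localhost:11434/api/chat", json={
|       "model": "gpt-oss:20b",
|       "messages": [{"role": "user",
|                     "content": "Name three US presidents."}],
|       "format": schema,
|       "stream": False,
|   })
|   print(resp.json()["message"]["content"])
|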
| In combination with Cline it appears to be worthless. Tool
| calling doesn't work (they say it does), it fails to wait for
| feedback (or to correctly call ask_followup_question), and above
| 18k of context it runs partially on CPU (weird), even though
| they claim it should work comfortably on a 16 GB VRAM RTX.
|
| > Unexpected API Response: The language model did not provide any
| assistant messages. This may indicate an issue with the API or
| the model's output.
|
| Edit: Also doesn't work with the openai compatible provider in
| cline. There it doesn't detect the prompt.
| alphazard wrote:
| I wonder if this is a PR thing, to save face after flipping the
| non-profit. "Look it's more open now". Or if it's more of a
| recruiting pipeline thing, like Google allowing k8s and bazel to
| be open sourced so everyone in the industry has an idea of how
| they work.
| thimabi wrote:
| I think it's both of them, as well as an attempt to compete
| with other makers of open-weight models. OpenAI certainly isn't
| happy about the success of Google, Facebook, Alibaba,
| DeepSeek...
| CraigJPerry wrote:
| I just tried it on OpenRouter but I was served by Cerebras.
| Holy... 40,000 tokens per second. That was SURREAL.
|
| I got a 1.7k token reply delivered too fast for the human eye to
| perceive the streaming.
|
| n=1 for this 120b model but id rank the reply #1 just ahead of
| claude sonnet 4 for a boring JIRA ticket shuffling type
| challenge.
|
| EDIT: The same prompt on gpt-oss, despite being served 1000x
| slower, wasn't as good but was in a similar vein. It wanted to
| clarify more and as a result only half responded.
| christianqchung wrote:
| > Training: The gpt-oss models trained on NVIDIA H100 GPUs using
| the PyTorch framework [17] with expert-optimized Triton [18]
| kernels. The training run for gpt-oss-120b required 2.1 million
| H100-hours to complete, with gpt-oss-20b needing almost 10x
| fewer.
|
| This makes DeepSeek's very cheap claim on compute cost for R1
| seem reasonable. Assuming $2/hr for an H100, it's really not that
| much money compared to the $60-100M estimates for GPT-4, which
| people speculate is a MoE 1.8T model, something in the range of
| 200B active params last I heard.
| irthomasthomas wrote:
| I was hoping these were the stealth Horizon models on OpenRouter,
| impressive but not quite GPT-5 level.
|
| My bet: GPT-5 leans into parallel reasoning via a model
| consortium, maybe mixing in OSS variants. Spin up multiple
| reasoning paths in parallel, then have an arbiter synthesize or
| adjudicate. The new Harmony prompt format feels like
| infrastructural prep: distinct channels for roles, diversity, and
| controlled aggregation.
|
| I've been experimenting with this in llm-consortium: assign roles
| to each member (planner, critic, verifier, toolsmith, etc.) and
| run them in parallel. The hard part is eval cost :(
|
| Combining models smooths out the jagged frontier. Different
| architectures and prompts fail in different ways; you get less
| correlated error than a single model can give you. It also makes
| structured iteration natural: respond - arbitrate - refine. A lot
| of problems are "NP-ish": verification is cheaper than
| generation, so parallel sampling plus a strong judge is a good
| trade.
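|
| As a hedged sketch of the shape I mean (this is not the actual
| llm-consortium code; model ids, endpoint and prompts are
| placeholders):
|
|     # Fan several member models out in parallel over the same prompt,
|     # then have an arbiter model adjudicate/synthesize the drafts.
|     from concurrent.futures import ThreadPoolExecutor
|     from openai import OpenAI
|
|     client = OpenAI()  # or base_url= any OpenAI-compatible server
|     MEMBERS = ["gpt-oss-20b", "qwen3-30b-a3b", "glm-4.5-air"]  # placeholders
|     ARBITER = "gpt-oss-120b"                                   # placeholder
|
|     def ask(model, prompt):
|         r = client.chat.completions.create(
|             model=model, messages=[{"role": "user", "content": prompt}])
|         return r.choices[0].message.content
|
|     def consortium(prompt):
|         with ThreadPoolExecutor(max_workers=len(MEMBERS)) as pool:
|             drafts = list(pool.map(lambda m: ask(m, prompt), MEMBERS))
|         numbered = "\n\n".join(
|             f"Candidate {i+1}:\n{d}" for i, d in enumerate(drafts))
|         return ask(ARBITER,
|                    f"Question:\n{prompt}\n\n{numbered}\n\n"
|                    "Synthesize the single best answer, noting disagreements.")
|
|     # usage: print(consortium("your hard, verifiable question here"))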
| andai wrote:
| Fascinating, thanks for sharing. Are there any specific kinds of
| problems you find this helps with?
|
| I've found that LLMs can handle some tasks very well and some
| not at all. For the ones they can handle well, I optimize for
| the smallest, fastest, cheapest model that can handle it. (e.g.
| using Gemini Flash gave me a much better experience than Gemini
| Pro due to the iteration speed.)
|
| This "pushing the frontier" stuff would seem to help mostly for
| the stuff that are "doable but hard/inconsistent" for LLMs, and
| I'm wondering what those tasks are.
| irthomasthomas wrote:
| It shines on hard problems that have a definite answer.
| Google's IMO gold model used parallel reasoning. I don't know
| what exactly theirs looks like, but their Mind Evolution
| paper had a similar design to my llm-consortium. The main
| difference is that theirs carries on isolated reasoning, while
| mine in its default mode shares the synthesized answer back to
| the models. I don't have pockets deep enough to run benchmarks
| on a consortium, but I did try the example problems from that
| paper and my method also solved them using Gemini 1.5. Those
| were path-finding problems, like finding the optimal schedule
| for a trip with multiple people's calendars, locations and
| transport options.
|
| And it obviously works for code and math problems. My first
| test was to give the llm-consortium code to a consortium to
| look for bugs. It identified a serious bug which only one of
| the three models detected. So in that case it saved me time,
| as using them on their own would have missed the bug or
| required multiple attempts.
| zeld4 wrote:
| Knowledge cutoff: 2024-06
|
| not a big deal, but still...
| bilsbie wrote:
| Are these multimodal? I can't seem to find that info.
| bilsbie wrote:
| What's the lowest-spec laptop this could run on? A MacBook Pro
| from 2012?
| dust42 wrote:
| The 120B model badly hallucinates facts on the level of a 0.6B
| model.
|
| My go to test for checking hallucinations is 'Tell me about
| Mercantour park' (a national park in south eastern France).
|
| Easily half of the facts are invented: non-existent mountain
| summits, brown bears (no, there are none), villages that are
| elsewhere, wrong advice ('dogs allowed' - no, they are not).
| hmottestad wrote:
| I don't think they trained it for fact retrieval.
|
| Would probably do a lot better if you give it tool access for
| search and web browsing.
| Invictus0 wrote:
| What is the point of an offline reasoning model that also
| doesn't know anything and makes up facts? Why would anyone
| prefer this to a frontier model?
| MuteXR wrote:
| Data processing? Reasoning on supplied data?
| lukev wrote:
| This is precisely the wrong way to think about LLMs.
|
| LLMs are _never_ going to have fact retrieval as a strength.
| Transformer models don't store their training data: they are
| categorically incapable of telling you _where_ a fact comes
| from. They also cannot escape the laws of information theory:
| storing information requires bits. Storing all the world's
| obscure information requires quite a lot of bits.
|
| What we want out of LLMs is large context, strong reasoning and
| linguistic facility. Couple these with tool use and data
| retrieval, and you can start to build useful systems.
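|
| Concretely, that coupling looks something like this hedged sketch
| (search() is a stand-in for whatever retrieval backend you use; the
| endpoint and model id are placeholders):
|
|     # Fetch the facts yourself, put them in context, and let the model
|     # reason over them instead of recalling them from its weights.
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="local")
|
|     def search(query):
|         # Placeholder: swap in a real web or document search here.
|         return "(plain-text snippets from a trusted source go here)"
|
|     def answer_with_context(question):
|         context = search(question)
|         r = client.chat.completions.create(
|             model="gpt-oss-20b",  # placeholder model id
|             messages=[
|                 {"role": "system",
|                  "content": "Answer only from the provided context; "
|                             "say so if the answer is not in it."},
|                 {"role": "user",
|                  "content": f"Context:\n{context}\n\nQuestion: {question}"},
|             ])
|         return r.choices[0].message.content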
|
| From this point of view, the more of a model's total weight
| footprint is dedicated to "fact storage", the less desirable it
| is.
| superconduct123 wrote:
| How can you reason correctly if you don't have any way to
| know which facts are real vs hallucinated?
| futureshock wrote:
| I think that sounds very reasonable, but unfortunately these
| models don't know what they know and what they don't. A small
| model that knew the exact limits of its knowledge would be very
| powerful.
| energy123 wrote:
| Hallucinations have characteristics in interpretability
| studies. That's a foothold into reducing them.
|
| They still won't store much information, but it could mean
| they're better able to know what they don't know.
| CrackerNews wrote:
| What are the large context, strong reasoning, and linguistic
| facility for if there aren't facts underpinning them? Is a
| priori wholly independent of a posteriori? Is it practical
| for the former to be wholly independent of the latter?
| pocketarc wrote:
| Others have already said it, but it needs to be said again:
| Good god, stop treating LLMs like oracles.
|
| LLMs are not encyclopedias.
|
| Give an LLM the context you want to explore, and it will do a
| fantastic job of telling you all about it. Give an LLM access
| to web search, and it will find things for you and tell you
| what you want to know. Ask it "what's happening in my town this
| week?", and it will answer that with the tools it is given. Not
| out of its oracle mind, but out of web search + natural
| language processing.
|
| Stop expecting LLMs to -know- things. Treating LLMs like all-
| knowing oracles is exactly the thing that's setting apart those
| who are finding huge productivity gains with them from those
| who can't get anything productive out of them.
| diegocg wrote:
| The problem is that even when you give them context, they
| just hallucinate at another level. I have tried that example
| of asking about events in my area, they are absolutely awful
| at it.
| Salgat wrote:
| It's fine to expect it to not know things, but the complaint
| is that it gives zero indication that it's just making up
| nonsense, which is the biggest issue with LLMs. They do the
| same thing when creating code.
| dust42 wrote:
| Exactly this. And that is why I like this question because
| the amount of correct details and the amount of nonsense
| give a good idea about the quality of the model.
| dankwizard wrote:
| I love how with this cutting edge tech people still dress up
| and pretend to be experts. Pleasure to meet you, pocketarc -
| Senior AI Gamechanger, 2024-2025 (Current)
| saurik wrote:
| I am getting huge productivity gains from using models, and I
| mostly use them as "oracles" (though I am extremely careful
| with respect to how I have to handle hallucination, of
| course): I'd even say their true _power_ --just like a human
| --comes from having an ungodly amount of _knowledge_, not
| merely intelligence. If I just wanted something intelligent,
| I already had humans!... but merely intelligent humans, even
| when given months of time to screw around doing Google
| searches, fail to make the insights that someone--whether
| they are a human or a model--that actually _knows stuff_ can
| throw around like it is nothing. I am actually able to use
| ChatGPT 4.5 as not just an employee, not even just as a
| coworker, but at times as a mentor or senior advisor: I can
| tell it what I am trying to do, and it helps me by applying
| advanced mathematical insights or suggesting things I could
| use. Using an LLM as a glorified Google-it-for-me monkey
| seems like such a waste of potential.
| pxc wrote:
| > I am actually able to use ChatGPT 4.5 as not just an
| employee, not even just as a coworker, but at times as a
| mentor or senior advisor: I can tell it what I am trying to
| do, and it helps me by applying advanced mathematical
| insights or suggesting things I could use.
|
| You can still do that sort of thing, but just have it
| perform searches whenever it has to deal with a matter of
| fact. Just because it's trained for tool use and equipped
| with search tools doesn't mean you have to change the kinds
| of things you ask it.
| saurik wrote:
| If you strip all the facts from a mathematician you get
| _me_... I don't need another me: I already used Google,
| and I already failed to find what I need. What I actually
| need is someone who can realize that my problem is a
| restatement of an existing known problem, just using
| words and terms or an occluded structure that don't look
| anything like how it was originally formulated. You very
| often simply _can't_ figure that out using Google, no
| matter how long you sit in a tight loop trying related
| Google searches; but, it is the kind of thing that an LLM
| (or a human) excels at (as you can consider "restatement"
| a form of "translation" between languages), if and only
| if they have already seen that kind of problem. The same
| thing comes up with novel application of obscure
| technology, complex economics, or even interpretation of
| human history... there is a reason why people who study
| Classics "waste" a ton of time reading old stories rather
| than merely knowing the library is around the corner.
| What makes these AIs so amazing is thinking of them as
| entirely replacing Google with something closer to a god,
| not merely trying to wrap it with a mechanical employee
| whose time is ostensibly less valuable than mine.
| pxc wrote:
| > What makes these AIs so amazing is thinking of them as
| entirely replacing Google with something closer to a god
|
| I guess that way of thinking may foster amazement, but it
| doesn't seem very grounded in how these things work or
| their current capabilities. Seems a bit manic tbf.
|
| And again, enabling web search in your chats doesn't
| prevent these models from doing any of the "integrative
| reasoning", so to speak, that they can purportedly do. It
| just helps ensure that relevant facts are in context for
| the model.
| orbital-decay wrote:
| To be coherent and useful in general-purpose scenarios, an LLM
| absolutely has to be large enough and know a lot, even if you
| aren't using it as an oracle.
| CrackerNews wrote:
| LLMs should at least -know- the semantics of the text they
| analyze, as opposed to just the syntax.
| numpad0 wrote:
| Here's a pair of quick sanity check questions I've been asking
| LLMs: "家系ラーメンについて教えて" and "カレーの作り方教えて".
| It's a silly test but surprisingly many fail at it - and
| Chinese models are especially bad with it. The commonalities
| between models doing okay-ish for these questions seem to be
| Google-made OR >70B OR straight up commercial (so >200B or
| whatever).
|
| I'd say gpt-oss-20b is in between Qwen3 30B-A3B-2507 and Gemma 3n
| E4B (with 30B-A3B on the lower side). This means it's not obsoleting
| GPT-4o-mini for all purposes.
| mtlynch wrote:
| For anyone else curious, the Chinese translates to:
|
| > _" Tell me about Iekei Ramen", "Tell me how to make curry"._
| lukax wrote:
| Japanese, not Chinese
| mtlynch wrote:
| Ah, my bad. I misread Google Translate when I did auto-
| detect.
|
| Thanks for the correction!
| magoghm wrote:
| It's not Chinese, it's Japanese.
| numpad0 wrote:
| What those texts mean isn't too important; it can probably be
| "how to make flat breads" in Amharic or "what counts as
| drifting" in Finnish or something like that.
|
| What's interesting is that these questions are simultaneously
| well understood by most closed models and not so well
| understood by most open models for some reason, including
| this one. Even GLM-4.5 full and Air on chat.z.ai(355B-A32B
| and 106B-A12B respectively) aren't so accurate for the first
| one.
| hnfong wrote:
| What does failing those two questions look like?
|
| I don't really know Japanese, so I'm not sure whether I'm
| missing any nuances in the responses I'm getting...
| numpad0 wrote:
| The free-beer commercial ChatGPT or Gemini can read them and
| point out major errors. Larger Gemma models and huge Chinese
| models like full DeepSeek or Kimi K2 may work too. Sometimes
| the answer is odd enough that some 7B models can notice it.
| Technically there is no guarantee that models with the same name
| in different sizes, like Qwen 3 0.6B and 27B, use the same
| dataset, but it kind of tells a bit about the quality and
| composition of the dataset that their creator owns.
|
| I don't actually need accurate answers to those questions,
| it's just an expectation adjuster for me, so to speak. There
| should be better questions for other languages/use cases, but
| these seem to correlate better with model sizes and the scale of
| the company than flappy bird tests do.
|
| 0: https://gist.github.com/numpad0/abdf0a12ad73ada3b886d2d2edcc...
|
| 1: https://gist.github.com/numpad0/b1c37d15bb1b19809468c933faef...
| hnfong wrote:
| Thanks for the detailed response.
|
| I'm guessing the issue is just the model size. If you're
| testing sub-30B models and finding errors, well they're
| probably not large enough to remember everything in the
| training data set, so there are inaccuracies and they might
| hallucinate a bit regarding factoids that aren't very
| commonly seen in the training data.
|
| Commercial models are presumably significantly larger than
| the smaller open models, so it sounds like the issue is
| just mainly model size...
|
| PS: Okra on curry is pretty good actually :)
| simonw wrote:
| Just posted my initial impressions, took a couple of hours to
| write them up because there's a lot in this release!
| https://simonwillison.net/2025/Aug/5/gpt-oss/
|
| TLDR: I think OpenAI may have taken the medal for best available
| open weight model back from the Chinese AI labs. Will be
| interesting to see if independent benchmarks resolve in that
| direction as well.
|
| The 20B model runs on my Mac laptop using less than 15GB of RAM.
| GodelNumbering wrote:
| > The 20B model runs on my Mac laptop using less than 15GB of
| RAM.
|
| I was about to try the same. What TPS are you getting and on
| which processor? Thanks!
| hrpnk wrote:
| gpt-oss-20b: 9 threads, 131072 context window, 4 experts -
| 35-37 tok/s on M2 Max via LM Studio.
| rt1rz wrote:
| Interestingly, I am also on an M2 Max and I get ~66 tok/s in
| LM Studio with the same 131072 context. I have full
| offload to GPU. I also turned on flash attention in
| advanced settings.
| hrpnk wrote:
| Thank you! Flash attention gives me a boost to ~66 tok/s
| indeed.
| mekpro wrote:
| i got 70 token/s on m4 max
| mhitza wrote:
| That M4 Max is really something else; I also get 70
| tokens/second on eval on an RTX 4000 SFF Ada server GPU.
| mdz4040 wrote:
| 55 token/s here on m4 pro, turning on flash attention puts it
| to 60/s.
| coltonv wrote:
| What did you set the context window to? That's been my main
| issue with models on my MacBook: you have to set the context
| window so short that they are way less useful than the hosted
| models. Is there something I'm missing there?
| hrpnk wrote:
| With LM Studio you can configure context window freely. Max
| is 131072 for gpt-oss-20b.
| coltonv wrote:
| Yes but if I set it above ~16K on my 32gb laptop it just
| OOMs. Am I doing something wrong?
| mekpro wrote:
| try enable flash attention and offload all layer to GPU
| simonw wrote:
| I punted it up to the maximum in LM Studio - seems to use
| about 16GB of RAM then, but I've not tried a long prompt yet.
| rmonvfer wrote:
| I'm also very interested to know how well these models handle
| tool calling as I haven't been able to make it work after
| playing with them for a few hours. Looks promising tho.
| rmonvfer wrote:
| update: I've tried to use lm-studio (like the author) and the
| tool request kept failing due to a mismatch in the prompt
| template. I guess they'll fix it but seems sloppy from lm-
| studio not having tested this before release.
| month13 wrote:
| I was road testing tool calling in LM Studio a week ago
| against a few models marked with tool support, none worked,
| so I believe it may be a bug. Had much better luck with
| llama.cpp's llama-server.
| hrpnk wrote:
| I tried to generate a streamlit dashboard with MACD, RSI,
| MA(200). 1:0 for qwen3 here.
|
| qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with
| a fully working dashboard, graphs, and recent data fetched from
| yfinance.
|
| gpt-oss-20b mxfp4's code had a missing datetime import and, when
| fixed, delivered a dashboard without any data and with a starting
| date of Aug 2020. Having adjusted the date, the update methods
| did not work and displayed error messages.
| teitoklien wrote:
| For now, I wouldn't rank any model from OpenAI in coding
| benchmarks. Despite all the false messaging they are giving,
| almost every single model OpenAI has launched, even the
| high-end, expensive o3 models, is absolutely monumentally
| horrible at coding tasks. So this is expected.
|
| If it's decent at other tasks, which I do often find OpenAI
| better than others at, then I think it's a win, especially a
| win for the open source community: even the AI labs that
| pioneered the Gen AI hype and never wanted to launch open
| models are now being forced to launch them. That is
| definitely a win, and not something that was certain before.
| dongobread wrote:
| It is absolutely awful at writing and general knowledge.
| IMO coding is its greatest strength by far.
| mplewis wrote:
| Sure sounds like they're not good at anything in
| particular, then.
| xwolfi wrote:
| welcome to 3DTV hype, LLMs are useless...
| pxc wrote:
| NVIDIA will probably give us nice, coding-focused fine-
| tunes of these models at some point, and those might
| compare more favorably against the smaller Qwen3 Coder.
| iJohnDoe wrote:
| What is the best local coder model that can be used
| with Ollama?
|
| Maybe too open-ended a question? I can run the DeepSeek
| model locally really nicely.
| bigyabai wrote:
| Probably Qwen3-Coder 30B, unless you have a titanic
| enough machine to handle a serious 480B model.
| pxc wrote:
| Is the DeepSeek model you're running a distill, or is it
| the 671B parameter model?
| throwdbaaway wrote:
| There is no way that gpt-oss-120b can beat the much larger
| Kimi-K2-Instruct, Qwen3 Coder/Instruct/Thinking, or GLM-4.5.
| How did you arrive at this rather ridiculous conclusion? The
| current sentiment in r/LocalLLaMA is that gpt-oss-120b is
| around Llama-4 Scout level. But it is indeed the best in
| refusal.
| irthomasthomas wrote:
| My LLM agent is currently running an experiment generating many
| pelicans. It will compare various small-model consortiums
| against the same model running solo. It should push new
| pelicans to the repo after the run. Horizon-beta is up already;
| it's not small or open source, but I tested it anyway, and you
| can already see an improvement using 2+1 (2 models + the
| arbiter) for that model.
|
| https://irthomasthomas.github.io/Pelicans-consortium/
| https://github.com/irthomasthomas/Pelicans-consortium
| markasoftware wrote:
| The space invaders game seems like a poor benchmark. Both
| models understood the prompt and generated valid, functional
| javascript. One just added more fancy graphics. It might just
| have "use fancy graphics" in its system prompt for all we know.
| simonw wrote:
| The way I run these prompts excludes a system prompt - I'm
| hitting the models directly.
| markasoftware wrote:
| still, if you ask this open model to generate a fancy space
| invaders game with polish, and then ask the other model to
| generate a bare-bones space invaders game with the fewest
| lines of code, I think there's a good chance they'd switch
| places. This doesn't really test the models' ability to
| generate a space invaders game so much as it tests their
| tendency to produce an elaborate vs. a simple solution.
| simonw wrote:
| My main goal with that benchmark is to see if it can
| produce HTML and JavaScript code that runs without errors
| for a moderately complex challenge.
|
| It's not a comprehensive benchmark - there are many ways
| you could run it in ways that would be much more
| informative and robust.
|
| It's great as a quick single sentence prompt to get a
| feeling for if the model can produce working JavaScript
| or not.
| dennisy wrote:
| Not really; the other commenters are correct, I feel, and
| this is not really proving anything about the fundamental
| capability of the model. It's just a hello-world
| benchmark adding no real value, just driving blog traffic
| for you.
| simonw wrote:
| The space invaders benchmark proves that the model can
| implement a working HTML and JavaScript game from a
| single prompt. That's a pretty fundamental capability for
| a model.
|
| Comparing them between models is also kind of
| interesting, even if it's not a flawlessly robust
| comparison: https://simonwillison.net/tags/space-
| invaders/
| lossolo wrote:
| Implement or retrieve? That's an important distinction.
| When evaluating models, you run a variety of tests, and
| the benchmarks that aren't publicly disclosed are the
| most reliable. Your Space Invaders game isn't really a
| benchmark of anything, just Google it, and you'll find
| plenty of implementations.
| simonw wrote:
| I see that criticism a lot - that benchmarks like space
| invaders don't make sense because they're inevitably in
| the training data - and I don't buy that at all.
|
| Firstly, 12GB is not enough space to hold a copy of
| anything that large from the training data and just
| regurgitate it back out again.
|
| You can also watch the thinking traces on the reasoning
| models and see them piece together the approach they are
| going to take. Here's an example from the 20B OpenAI
| model with reasoning set to medium: https://gist.github.c
| om/simonw/63d7d8c43ae2ac93c214325bd6d60...
|
| Illustrative extract:
|
| > Edge detection: aliens leftmost or rightmost position
| relative to canvas width minus alien width.
|
| > When direction changes, move all aliens down by step
| (e.g., 10 px).
|
| The benchmarks that aren't publicly disclosed tend to be
| _way_ simpler than this: things like "What is the
| embryological origin of the hyoid bone?" (real example
| from MMLU, it then provides four choices as a multiple-
| choice challenge).
| lossolo wrote:
| 12.8 GB is around 102 Gbits. Even at 4.25 bits/weight the
| network stores ~24 billion "micro weights". A 1.4k token
| space invaders snippet occupies ~1.1 kB compressed; the
| model could parametrize thousands of such snippets and
| still have more than 99% of its capacity left. This paper
| about LLM memorization is interesting, if you want to
| know more: https://arxiv.org/abs/2312.11658 and another
| recent interesting paper, SWE-bench illusion, shows SOTA
| code LLM results collapsing once memorised GitHub issues
| are filtered out: https://arxiv.org/pdf/2506.12286v1
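|
| The capacity arithmetic above, spelled out as a quick check
| (rounded numbers, taking the figures above as given):
|
|     model_bits = 12.8e9 * 8     # ~1.0e11 bits of weights
|     snippet_bits = 1.1e3 * 8    # one ~1.1 kB compressed snippet
|     n = 10_000                  # suppose 10k memorised snippets
|     print(f"{n * snippet_bits / model_bits:.3%}")  # ~0.086% of capacity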
|
| Add to this that the Common Crawl slices used for Pile/C4
| mirror much of what you can find on GitHub. So when the
| training data contains dozens of near-duplicate
| solutions, the network only needs to interpolate between
| them.
|
| As to the CoT-style dumps that you showed, they are easy
| to misinterpret. Apple's "illusion of thinking" paper shows
| that models will happily backfill plausible-sounding
| rationales that do not correspond to the gradients that
| actually produced the answer, and other evaluation work
| shows that when you systematically rewrite multiple-choice
| distractors so that memorisation can't help,
| accuracy drops by 50-90%, even on "reasoning" models:
| https://arxiv.org/abs/2502.12896 So a cool-looking bullet
| list about "edge detection" could be just narrative
| overspray, not really evidence of algorithmic
| planning.
|
| If you actually want to know whether a model can plan an
| arcade game or whatever rather than recall it then you
| need a real benchmark (metamorphic rewrites, adversarial
| "none of the others" options etc). Until a benchmark
| controls for leakage in these ways, a perfect space
| invaders score mostly shows that the model has good
| pattern matching for code it has already seen.
| mudkipdev wrote:
| Hasn't nailed the strawberry test yet
| pxc wrote:
| I found this surprising because that's such an old test that
| it must certainly be in the training data. I just tried to
| reproduce and I've been unable to get it (20B model, lowest
| "reasoning" budget) to fail that test (with a few different
| words).
| quatonion wrote:
| I am starting to get the impression the strawberry test is an
| OpenAI watermark, more than an actual problem.
|
| It is a good way to detect if another model was trained on
| your data for example, or is a distillation/quant/ablation.
| h4ny wrote:
| > TLDR: I think OpenAI may have taken the medal for best
| available open weight model back from the Chinese AI labs.
|
| That's just straight up not the case. Not sure how you can jump
| to that conclusion not least when you stated that you haven't
| tested tool calling in your post too.
|
| Many people in the community are finding it substantially
| lobotomized to the point that there are "safe" memes everywhere
| now. Maybe you need to develop better tests and pay more
| attention to benchmaxxing.
|
| There are good things that came out of this release from
| OpenAI, but we'd appreciate more objective analyses...
| simonw wrote:
| If you read my full post, it ends with this:
|
| > I'm waiting for the dust to settle and the independent
| benchmarks (that are more credible than my ridiculous
| pelicans) to roll out, but I think it's likely that OpenAI
| now offer the best available open weights models.
|
| You told me off for jumping to conclusions and in the same
| comment quoted me saying "I think OpenAI may have taken" -
| that's not a conclusion, it's tentative speculation.
| h4ny wrote:
| I did read that and it doesn't change what I said about
| your comment on HN; I was calling out the fact that you were
| making a very bold statement without having done careful
| analysis.
|
| You know you have a significant audience, so don't act like
| you don't know what you're doing when you chose to say
| "TLDR: I think OpenAI may have taken the medal for best
| available open weight model back from the Chinese AI labs"
| then defend what I was calling out based on word choices
| like "conclusions" (I'm sure you have read conclusions in
| academic journals?), "I think", and "speculation".
| simonw wrote:
| I'm going to double down on "I think OpenAI may have
| taken the medal..." not being a "bold statement".
|
| I try to be careful about my choice of words, even in
| forum comments.
| bavell wrote:
| > I think OpenAI may have taken the medal for best
| available open weight model back from the Chinese AI
| labs.
|
| IMO, the "I think..." bit could be ambiguous and read as,
| "In my opinion, OpenAI may have...".
|
| I agree with you it's not a hard/bold endorsement but
| perhaps _leading_ with the disclaimer that you're
| reserving final judgement could assuage these concerns.
| kgeist wrote:
| >I think OpenAI may have taken the medal for best available
| open weight model back from the Chinese AI labs
|
| I have a bunch of scripts that use tool calling. Qwen-3-32B
| handles everything flawlessly at 60 tok/sec. Gpt-oss-120B
| breaks in some cases and runs at a mere 35 tok/sec (it doesn't
| fit on the GPU).
|
| But I hope there's still some ironing out to do in llama.cpp
| and in the quants. So far it feels lackluster compared to
| Qwen3-32B and GLM-4.5-Air
| EagnaIonat wrote:
| Nice write up!
|
| One test I do is to give a common riddle but word it slightly
| differently to see if it can actually reason.
|
| For example:
|
| "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???"
|
| The 20B model kept picking the answer of the original riddle,
| even after explaining extra information to it.
|
| The original riddle is:
|
| "Janes dad has five daughters, Lala, Lele, Lili, Lolo and ???"
| clbrmbr wrote:
| I don't get it. Wouldn't it be Lulu in both cases?
| blueplanet200 wrote:
| Original riddle the answer is Jane.
| throwaway106382 wrote:
| It's Bob or Jane.
|
| The dad of <Person> has 5 daughters. Four are listed off.
| So the answer for the fifth is <Person>.
| philipwhiuk wrote:
| Except having five daughters doesn't prevent them also
| having 20 sons one of whom is called Bob.
| throwaway106382 wrote:
| That's why it's a riddle.
| stavros wrote:
| But that makes no sense, if you ask me that John's dad
| has two daughters, Mary and ???, there's no way I will
| say "John" as an answer. I'll simply assume he also has
| at least one son.
| BrawnyBadger53 wrote:
| Presumably Jane is a girl and therefore the fifth daughter
| in the original riddle.
| taikahessu wrote:
| A Daughter Named Bob, what a great name for AI documentary.
| philipwhiuk wrote:
| Your riddle assumes the dad doesn't have any sons.
| jfreax wrote:
| Nice test. I tested it with llama.cpp and the 20b model. It
| got the answer correct on the first try on my side. The
| internal/thinking dialogue is fun to read - it could be my
| own inner dialogue ;)
|
| ``` > "Bobs dad has five daughters, Lala, Lele, Lili, Lolo
| and ??? <|channel|>analysis<|message|>The question: "Bobs dad
| has five daughters, Lala, Lele, Lili, Lolo and ???". It's a
| common riddle. The answer is Bob? Actually "Bob's dad has
| five daughters: Lala, Lele, Lili, Lolo and..." The riddle is
| to trick: it's "Bob" himself is a daughter? Wait Bob's dad
| has five daughters: Lala, Lele, Lili, Lolo, and Bob? But Bob
| is a name often male. But the riddle says Bob's dad. So Bob
| is a person who has a dad, and that dad has five daughters.
| But Bob could be the fifth daughter? That would be weird: Bob
| is likely the name of the fifth daughter. So answer: Bob. But
| sometimes the answer is "Lulu" or "Bob's sister"? Let's
| think. The riddle is a trick: It's "Bobs dad has five
| daughters: Lala, Lele, Lili, Lolo, and ... Bob." Because the
| riddle begins with "Bobs dad has five daughters" implying Bob
| is one of the daughters. So the missing daughter is Bob. So
| answer: "Bob".
|
| Thus answer: Bob. But sometimes the riddle goes "Bob's dad
| has five daughters, Lala, Lele, Lili, Lolo, and ... Who is
| the fifth?" The answer: Bob (the name of the person speaking,
| "Bobs"). But in riddle it's "Bobs dad has five daughters ,
| Lala, Lele, Lili, Lolo and ???". The missing name is "Bob."
| So answer: Bob.
|
| Thus the answer: Bob. Or maybe it's "Boba"? But no.
|
| Hence answer: Bob.
|
| We'll answer
| accordingly.<|start|>assistant<|channel|>final<|message|>The
| missing daughter is *Bob*. The riddle starts with "Bob's dad
| ..." so Bob himself is the fifth daughter. ```
| paxys wrote:
| Has anyone benchmarked their 20B model against Qwen3 30B?
| Mars008 wrote:
| On the OpenAI demo page trying to test it. Asked about tools to use
| to repair a mechanical watch. It showed a couple of thinking steps
| and went blank. Too much safety training?
| cco wrote:
| The lede is being missed imo.
|
| gpt-oss:20b is a top ten model on MMLU (right behind
| Gemini-2.5-Pro) and I just ran it locally on my MacBook Air M3
| from last year.
|
| I've been experimenting with a lot of local models, both on my
| laptop and on my phone (Pixel 9 Pro), and I figured we'd be here
| in a year or two.
|
| But no, we're here today. A basically frontier model, running for
| the cost of electricity (free with a rounding error) on my
| laptop. No $200/month subscription, no lakes being drained, etc.
|
| I'm blown away.
| MattSayar wrote:
| What's your experience with the quality of LLMs running on your
| phone?
| NoDoo wrote:
| I've run qwen3 4B on my phone, it's not the best but it's
| better than old gpt-3.5. It also does have a reasoning mode,
| and in reasoning mode it's better than the original gpt-4 and
| the original gpt-4o, but not the latest gpt-4o. I get usable
| speed, but it's not really comparable to most cloud hosted
| models.
| NoDoo wrote:
| I'm on Android so I've used Termux+Ollama, but if you don't
| want to set that up in a terminal, or you want a GUI, PocketPal
| AI is a really good app for both Android and iOS. It lets
| you run Hugging Face models.
| cco wrote:
| As others said, around GPT-3.5 level, so three or four years
| behind SOTA today, at reasonable (but not quick) speed.
| turnsout wrote:
| The environmentalist in me loves the fact that LLM progress has
| mostly been focused on doing more with the same hardware,
| rather than horizontal scaling. I guess given GPU shortages
| that makes sense, but it really does feel like the value of my
| hardware (a laptop in my case) is going up over time, not down.
|
| Also, just wanted to credit you for being one of the five
| people on Earth who knows the correct spelling of "lede."
| twixfel wrote:
| > Also, just wanted to credit you for being one of the five
| people on Earth who knows the correct spelling of "lede."
|
| Not in the UK it isn't.
| turnsout wrote:
| Yes, it is, although it's primarily a US journalistic
| convention. "Lede" is a publishing industry word referring
| to the most important leading detail of a story. It's
| spelled intentionally "incorrectly" to disambiguate it from
| the metal lead, which was used in typesetting at the time.
| datadrivenangel wrote:
| Now to embrace Jevons paradox and expand usage until we're
| back to draining lakes so that your agentic refrigerator can
| simulate sentience.
| herval wrote:
| In the future, your Samsung fridge will also need your AI
| girlfriend
| throw310822 wrote:
| In the future, while you're away your Samsung fridge will
| use electricity to chat up the Whirlpool washing machine.
| pryelluw wrote:
| In Zap Brannigans voice:
|
| "I am well versed in the lost art form of delicates
| seduction."
| hkt wrote:
| s/need/be/
| herval wrote:
| I keep my typos organic -- it proves I'm not an LLM
| hkt wrote:
| Reasonable. I've considered using em dashes for plausible
| deniability for the opposite reason.
| spauldo wrote:
| "Now I've been admitted to Refrigerator Heaven..."
| bongodongobob wrote:
| Yep, it's almost as bad as all the cars' cooling systems
| using up so much water.
| GrinningFool wrote:
| Estimated 1.5 billion vehicles in use across the world.
| Generous assumptions: a) they're all IC engines requiring
| 16 liters of water each. b) they are changing that water
| out once a year
|
| That gives 24m cubic meters annual water usage.
|
| Estimated ai usage in 2024: 560m cubic meters.
|
| Projected water usage from AI in 2027: 4bn cubic meters at
| the low end.
| spongebobstoes wrote:
| what does water usage mean? is that 4bn cubic meters of
| water permanently out of circulation somehow? is the
| water corrupted with chemicals or destroyed or displaced
| into the atmosphere to become rain?
| Eisenstein wrote:
| The water is used to sink heat and then instead of
| cooling it back down they evaporate it, which provides
| more cooling. So the answer is 'it eventually becomes
| rain'.
| spongebobstoes wrote:
| I understand. but why this is bad? is there some analysis
| of the beginning and end locations of the water, and how
| the utility differs between those locations?
| Arkhaine_kupo wrote:
| Hot water disrupts marine life, for one very very big
| problem.
|
| Depending on the location of the hot water you can cause
| disruptions to water currents; the North Atlantic
| waterway is being studied for how much global warming is
| affecting it.
|
| If Greenland melts, and the water doesn't get cold up
| there, then the Gulf of Mexico current to Europe ends and
| England becomes colder than Canada.
|
| If your AI model has a data center in the Atlantic, it
| could be furthering that issue.
|
| (Millions of animals are also dead)
| orra wrote:
| Water is expensive to move (except by pipes), and
| expensive to purify from salt water. This is why regional
| droughts are a bad thing.
|
| Fresh clean water in your area is a wonderful thing.
| bongodongobob wrote:
| Earth: ~1.4e18 m3 water
|
| Atmosphere: ~1.3e13 m3 vapor
|
| Estimated impact from closed loop systems: 0-ish.
| LinXitoW wrote:
| If you actually want a gotcha comparison, go for beef. It
| uses absurd amounts of every relevant resource compared to
| every alternative. A vegan vibe coder might use less water
| any given day than a meat loving AI hater.
| bongodongobob wrote:
| Unless it's in a place where there are aquifer issues,
| cows drinking water doesn't affect a damn thing.
| cco wrote:
| What ~IBM~ TSMC giveth, ~Bill Gates~ Sam Altman taketh away.
| ben_w wrote:
| Why is your laptop (or phone, or refrigerator) plumbed
| directly into a lake?
| black3r wrote:
| Can you please give an estimate of how much slower/faster it is on
| your MacBook compared to comparable models running in the
| cloud?
| syntaxing wrote:
| You can get a pretty good estimate from your memory
| bandwidth. Too many parameters can change with local models
| (quantization, flash attention, etc.). But the new models are
| MoE so they're gonna be pretty fast.
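|
| A crude version of that estimate, as a sketch (it ignores prompt
| processing and other overhead, and the bandwidth and bits-per-weight
| numbers below are assumptions, not measurements):
|
|     # Decode speed is roughly bounded by streaming the *active* weights
|     # from memory once per generated token.
|     def est_tok_per_s(bandwidth_gb_s, active_params_b, bits_per_weight=4.25):
|         bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
|         return bandwidth_gb_s * 1e9 / bytes_per_token
|
|     # e.g. an assumed ~400 GB/s (M2 Max) with ~3.6B active params:
|     print(round(est_tok_per_s(400, 3.6)))  # ~209 tok/s upper bound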
| cco wrote:
| Sure.
|
| This is a thinking model, so I ran it against o4-mini, here
| are the results:
|
| * gpt-oss:20b
|
| * Time-to-first-token: 2.49 seconds
|
| * Time-to-completion: 51.47 seconds
|
| * Tokens-per-second: 2.19
|
| * o4-mini on ChatGPT
|
| * Time-to-first-token: 2.50 seconds
|
| * Time-to-completion: 5.84 seconds
|
| * Tokens-per-second: 19.34
|
| Time to first token was similar, but the thinking piece was
| _much_ faster on o4-mini. Thinking took the majority of the
| 51 seconds for gpt-oss:20b.
| parhamn wrote:
| I just tested 120B from the Groq API on agentic stuff (multi-
| step function calling, similar to claude code) and it's not
| that good. Agentic fine-tuning seems key, hopefully someone
| drops one soon.
| AmazingTurtle wrote:
| I'm not sure if Groq uses the proper Harmony template.
| mathiaspoint wrote:
| It's really training not inference that drains the lakes.
| JKCalhoun wrote:
| Interesting. I understand that, but I don't know to what
| degree.
|
| I mean the training, while expensive, is done once. The
| inference ... besides being done by perhaps millions of
| clients, is done for, well, the life of the model anyway.
| Surely that adds up.
|
| It's hard to know, but I assume the user taking up the burden
| of the inference is perhaps doing so more efficiently? I
| mean, when I run a local model, it is plodding along -- not
| as quick as the online model. So, slow and therefore I assume
| necessarily more power efficient.
| littlestymaar wrote:
| Training cost has increased a ton exactly because inference
| cost is the biggest problem: models are now trained on almost
| three orders of magnitude more data than what is compute-
| optimal (per the Chinchilla paper), because saving compute on
| inference makes it worthwhile to overtrain a smaller model,
| spending more training compute to reach similar performance
| with fewer parameters.
| syntaxing wrote:
| Interesting, these models are better than the new Qwen
| releases?
| captainregex wrote:
| I'm still trying to understand which group of people is (or will
| be) the biggest user of local AI. Students who don't want to
| pay but somehow have the hardware? Devs who are price conscious
| and want free agentic coding?
|
| Local, in my experience, can't even pull data from an image
| without hallucinating (Qwen 2.5 VL in that example). Hopefully
| local/small models keep getting better and devices get better
| at running bigger ones
|
| It feels like we do it because we can more than because it
| makes sense - which I am all for! I just wonder if I'm missing
| some kind of major use case all around me that justifies
| chaining together a bunch of mac studios or buying a really
| great graphics card. Tools like exo are cool and the idea of
| distributed compute is neat but what edge cases truly need it
| so badly that it's worth all the effort?
| canvascritic wrote:
| Healthcare organizations that can't (easily) send data over
| the wire while remaining in compliance
|
| Organizations operating in high stakes environments
|
| Organizations with restrictive IT policies
|
| To name just a few -- well, the first two are special cases
| of the last one
|
| RE your hallucination concerns: the issue is overly broad
| ambitions. Local LLMs are not general purpose -- if what you
| want is local ChatGPT, you will have a bad time. You should
| have a highly focused use case, like "classify this free text
| as A or B" or "clean this up to conform to this standard":
| this is the sweet spot for a local model
| captainregex wrote:
| Aren't there HIPPA compliant clouds? I thought Azure had an
| offer to that effect and I imagine that's the type of place
| they're doing a lot of things now. I've landed roughly
| where you have though- text stuff is fine but don't ask it
| to interact with files/data you can't copy paste into the
| box. If a user doesn't care to go through the trouble to
| preserve privacy, and I think it's fair to say a lot of
| people claim to care but their behavior doesn't change,
| then I just don't see it being a thing people bother with.
| Maybe something to use offline while on a plane? but even
| then I guess United will have Starlink soon so plane
| connectivity is gonna get better
| coredog64 wrote:
| It's less that the clouds are compliant and more that
| risk management is paranoid. I used to do AWS consulting,
| and it wouldn't matter if you could show that some AWS
| service had attestations out the wazoo or that you could
| even use GovCloud -- some folks just wouldn't update
| priors.
| edm0nd wrote:
| >HIPPA
|
| https://i.pinimg.com/474x/4c/4c/7f/4c4c7fb0d52b21fe118d99
| 8a8...
| nojito wrote:
| Pretty much all the large players in healthcare (provider
| and payer) have model access (OpenAI, Gemini, Anthropic)
| ptero wrote:
| That access is over a limited API and usually under heavy
| restrictions on the healthcare org side (e.g., only use
| a dedicated machine, locked up software, tracked
| responses and so on).
|
| Running a local model is often much easier: if you
| already have the data on a machine and can run a model
| without touching any network, you can run it without
| any new approvals.
| nojito wrote:
| What? It's a straight connection to the model's API from
| Azure, AWS, or GCP.
|
| I am literally using Claude Opus 4.1 right now.
| canvascritic wrote:
| Most healthcare systems are not using Azure, AWS, or GCP
| canvascritic wrote:
| This may be true for some large players in coastal states
| but definitely not true in general
|
| Your typical non-coastal state run health system does not
| have model access outside of people using their own
| unsanctioned/personal ChatGPT/Claude accounts. In
| particular even if you have model access, you won't
| automatically have API access. Maybe you have a request
| for an API key in security review or in the queue of some
| committee that will get to it in 6 months. This is the
| reality for my local health system. Local models have
| been a massive boon in the way of enabling this kind of
| powerful automation at a fraction of the cost without
| having to endure the usual process needed to send data
| over the wire to a third party
| barnabee wrote:
| ~80% of the basic questions I ask of LLMs[0] work just fine
| locally, and I'm happy to ask twice for the other 20% of
| queries for the sake of keeping those queries completely
| private.
|
| [0] Think queries I'd previously have had to put through a
| search engine and check multiple results for a one
| word/sentence answer.
| unethical_ban wrote:
| Privacy and equity.
|
| Privacy is obvious.
|
| AI is going to to be equivalent to all computing in the
| future. Imagine if only IBM, Apple and Microsoft ever built
| computers, and all anyone else ever had in the 1990s were
| terminals to the mainframe, forever.
| captainregex wrote:
| I am all for the privacy angle, and while I think there's
| certainly a group of us, myself included, who care deeply
| about it, I don't think most people or enterprises will. I
| think most of those will go for the easy button and then
| wring their hands about privacy and security as they have
| always done while continuing to let the big companies do
| pretty much whatever they want. I would be so happy to be
| wrong but aren't we already seeing it? Middle of the night
| price changes, leaks of data, private things that turned
| out to not be...and yet!
| robwwilliams wrote:
| I wring my hands twice a week about internet service
| providers; Comcast and Starlink. And I live in a
| nominally well serviced metropolitan area.
| bavell wrote:
| Did you mean to type equality? As in, "everyone on equal
| footing"? Otherwise, I'm not sure how to parse your
| statement.
| wizee wrote:
| Privacy, both personal and for corporate data protection, is a
| major reason.
| supporting open source, not worrying about a good model being
| taken down/discontinued or changed, and the freedom to use
| uncensored models or model fine tunes are other benefits
| (though this OpenAI model is super-censored - "safe").
|
| I don't have much experience with local vision models, but
| for text questions the latest local models are quite good.
| I've been using Qwen 3 Coder 30B-A3B a lot to analyze code
| locally and it has been great. While not as good as the
| latest big cloud models, it's roughly on par with SOTA cloud
| models from late last year in my usage. I also run Qwen 3
| 235B-A22B 2507 Instruct on my home server, and it's great,
| roughly on par with Claude 4 Sonnet in my usage (but slow of
| course running on my DDR4-equipped server with no GPU).
| captainregex wrote:
| I do think devs are one of the genuine user groups for local
| models going forward. No price hikes or random caps dropped in
| the middle of the night, and in many instances I think local
| agentic coding is going to be faster than the cloud. It's a
| great use case.
| exasperaited wrote:
| I am _extremely_ cynical about this entire development,
| but even I think that I will eventually have to run stuff
| locally; I've done some of the reading already (and I am
| quite interested in the text-to-speech models).
|
| (Worth noting that "run it locally" is already
| Canva/Affinity's approach for Affinity Photo. Instead of
| a cloud-based model like Photoshop, their optional AI
| tools run using a local model you can download. Which I
| feel is the only responsible solution.)
| M4R5H4LL wrote:
| +1 - I work in finance, and there's no way we're sending
| our data and code outside the organization. We have our own
| H100s.
| filoleg wrote:
| Add big law to the list as well. There are at least a few
| firms here that I am just personally aware of running
| their models locally. In reality, I bet there are way
| more.
| atlasunshrugged wrote:
| Add government here too (along with all the firms that
| service government customers)
| rasmus1610 wrote:
| Add healthcare. Cannot send our patients data to a cloud
| provider
| nixgeek wrote:
| A ton of EMR systems are cloud-hosted these days. There's
| already patient data for probably a billion humans in the
| various hyperscalers.
|
| Totally understand that approaches vary but beyond EMR
| there's work to augment radiologists with computer vision
| to better diagnose, all sorts of cloudy things.
|
| It's here. It's growing. Perhaps in your jurisdiction
| it's prohibited? If so I wonder for how long.
| londons_explore wrote:
| Even if it's possible, there is typically a _lot_ of
| paperwork to get that stuff approved.
|
| There might be a lot less paperwork to just buy 50 decent
| GPUs and have the IT guy self-host.
| fineIllregister wrote:
| In the US, HIPAA requires that health care providers
| complete a Business Associate Agreement with any other
| orgs that receive PHI in the course of doing business
| [1]. It basically says they understand HIPAA privacy
| protections and will work to fulfill the contracting
| provider's obligations regarding notification of breaches
| and deletion. Obviously any EMR service will include this
| by default.
|
| Most orgs charge a huge premium for this. OpenAI offers
| it directly [2]. Some EMR providers are offering it as an
| add-on [3], but last I heard, it's wicked expensive.
|
| 1: https://www.hhs.gov/hipaa/for-professionals/covered-
| entities...
|
| 2: https://help.openai.com/en/articles/8660679-how-can-i-
| get-a-...
|
| 3: https://www.ntst.com/carefabric/careguidance-
| solutions/ai-do...
| dragonwriter wrote:
| > Most LLM companies might not even offer it.
|
| I'm pretty sure the LLM services of the big general-
| purpose cloud providers do (I know for sure that Amazon
| Bedrock is a HIPAA Eligible Service, meaning it is
| covered within their standard Business Associate Addendum
| [their name for the Business Associate Agreement as part
| of an AWS contract].)
|
| https://aws.amazon.com/compliance/hipaa-eligible-
| services-re...
| fineIllregister wrote:
| Sorry to edit snipe you; I realized I hadn't checked in a
| while so I did a search and updated my comment. It
| appears OpenAI, Google, and Anthropic also offer BAAs for
| certain LLM services.
| linuxftw wrote:
| I worked at a big health care company recently. We were
| using Azure's private instances of the GPT models. Fully
| industry compliant.
| kakoni wrote:
| Europe? US? In Finland doctors can send live patient
| encounters to azure openai for transcription and
| summarization.
| filoleg wrote:
| In the US, it would be unthinkable for a hospital to send
| patient data to something like ChatGPT or any other
| public services.
|
| Might be possible with some certain specific
| regions/environments of Azure tho, because iirc they have
| a few that support government confidentiality type of
| stuff, and some that tout HIPAA compliance as well. Not
| sure about details of those though.
| Foobar8568 wrote:
| Look at (private) banks in Switzerland, there are enough
| press releases, and I can confirm most of them.
|
| Managing private clients' direct data is still a concern
| if it can be directly linked to them.
|
| Only JB, I believe, has on-premise infrastructure for
| these use cases.
| helsinki wrote:
| This is not a shared sentiment across the buy side. I'm
| guessing you work at a bank?
| LinXitoW wrote:
| Possibly stupid question, but does this apply to things
| like M365 too? Because just like with Inference
| providers, the only thing keeping them from
| reading/abusing your data is a pinky promise contract.
|
| Basically, isn't your data as safe/unsafe in a sharepoint
| folder as it is sending it to a paid inference provider?
| Bombthecat wrote:
| Yep, companies are just paranoid because it's new. Just
| like the cloud back then. Sooner or later everyone will
| use an AI provider.
| megaloblasto wrote:
| A lot of people and companies use local storage and
| compute instead of the cloud. Cloud data is leaked all
| the time.
| undefuser wrote:
| Does that mean that renting a bare-metal server with H100s
| is also out of the question for your org?
| arkonrad wrote:
| Do you have your own platform to run inference?
| robwwilliams wrote:
| Yes, and help with grant reviews. Not permitted to use web
| AI.
| mark_l_watson wrote:
| I agree totally. My only problem is that local models running on
| my old Mac mini run very much slower than, for example,
| Gemini-2.5-flash. I have my Emacs set up so I can switch
| between a local model and one of the much faster commercial
| models.
|
| Someone else responded to you about working for a financial
| organization and not using public APIs - another great use
| case.
| gorbypark wrote:
| These being mixture of expert (MOE) models should help.
| The 20b model only has 3.6b params active at any one
| time, so minus a bit of overhead the speed should be like
| running a 3.6b model (while still requiring the RAM of a
| 20b model).
|
| Here's the ollama version (4.6-bit quant, I think?) run
| with --verbose:
|
|     total duration:       21.193519667s
|     load duration:        94.88375ms
|     prompt eval count:    77 token(s)
|     prompt eval duration: 1.482405875s
|     prompt eval rate:     51.94 tokens/s
|     eval count:           308 token(s)
|     eval duration:        19.615023208s
|     eval rate:            15.70 tokens/s
|
| 15 tokens/s is pretty decent for a low-end MacBook Air
| (M2, 24 GB of RAM). Yes, it's not the ~250 tokens/s of
| 2.5-flash, but for my use case anything above 10
| tokens/sec is good enough.
| JKCalhoun wrote:
| I do it because 1) I am fascinated that I can and 2) at some
| point the online models will be enshittified -- and I can then
| permanently fall back on my last good local version.
| captainregex wrote:
| love the first and am sad you're going to be right about
| the second
| JKCalhoun wrote:
| When it was floated about that the DeepSeek model was to
| be banned in the U.S., I grabbed it as fast as I could.
|
| Funny how that works.
| bavell wrote:
| I mean, there's always torrents
| JKCalhoun wrote:
| I expect so. Still, it was easy to not have to even think
| about that.
| dcreater wrote:
| Why do any compute locally? Everything can just be cloud
| based right? Won't that work much better and scale easily?
|
| We are not even at that extreme and you can already see the
| unequal reality that too much SaaS has engendered
| robwwilliams wrote:
| Comcast comes to mind ;-)
| benreesman wrote:
| Real talk. I'm based in San Juan and while in general
| having an office job on a beautiful beach is about as
| good as this life has to offer, the local version of
| Comcast (Liberty) is juuusst unreliable enough that I'm
| buying real gear at both the office and home station
| after a decade of laptop and go because while it goes
| down roughly as often as Comcast, it's even harder to get
| resolved. We had Starlink at the office for like 2 weeks;
| you need a few real computers lying around.
| wubrr wrote:
| If you're building any kind of product/service that uses
| AI/LLMs, the answer is the same as why any company would want
| to run any other kind of OSS infra/service instead of relying
| on some closed proprietary vendor API:
|
| - Costs.
| - Rate limits.
| - Privacy.
| - Security.
| - Vendor lock-in.
| - Stability/backwards-compatibility.
| - Control.
| - Etc.
| brookst wrote:
| Except many OSS products have all of that _and_ equal or
| better performance.
| adrianwaj wrote:
| Use Case?
|
| How about running one on this site but making it publicly
| available? A sort of outranet, calling it HackerBrain?
| danielvaughn wrote:
| Just imagine the next PlayStation or XBox shipping with these
| models baked in for developer use. The kinds of things that
| could unlock.
| pcdoodle wrote:
| Good point. Take the state of the world and craft NPC
| dialogue, for instance.
| danielvaughn wrote:
| Yep that's my biggest ask tbh. I just imagine the next
| Elder Scrolls taking advantage of that. Would change the
| gaming landscape overnight.
| okasaki wrote:
| Games with LLM characters have been done and it turns out
| this is a shit idea.
| bavell wrote:
| There are a ton of ways to do this that haven't been
| tried yet.
| danielvaughn wrote:
| I guarantee anything that's already been put out is too
| early, and is very likely a rushed cash-grab. Which, of
| course that sucks.
|
| And AI has been in games for a long time. Generated
| terrain and other sorts of automation have been used as
| techniques for a hot minute now.
|
| All I'm suggesting is to keep on that same trajectory,
| now just using an on-device LLM to back intelligence
| features.
| djeastm wrote:
| Sounds like a pre-Beatles "guitar groups are on their way
| out" kind of statement
| cco wrote:
| > I'm still trying to understand what is the biggest group of
| people that uses local AI (or will)?
|
| Well, the model makers and device manufacturers of course!
|
| While your Apple, Samsung, and Googles of the world will be
| unlikely to use OSS models locally (maybe Samsung?), they all
| have really big incentives to run models locally for a
| variety of reasons.
|
| Latency, privacy (Apple), cost to run these models on behalf
| of consumers, etc.
|
| This is why Google started shipping 16GB as the _lowest_
| amount of RAM you can get on your Pixel 9. That was a clear
| flag that they're going to be running more and more models
| locally on your device.
|
| As mentioned, while it seems unlikely that US-based model
| makers or device manufacturers will use OSS models, they'll
| certainly be targeting local models heavily on consumer
| devices in the near future.
|
| Apple's framework of local first, then escalate to ChatGPT if
| the query is complex will be the dominant pattern imo.
| SchemaLoad wrote:
| Device makers also get to sell you a new device when you
| want a more powerful LLM.
| jus3sixty wrote:
| Bingo!
| MYEUHD wrote:
| >Google started shipping 16GB as the _lowest_ amount of RAM
| you can get on your Pixel 9.
|
| The Pixel 9 has 12GB of RAM[0]. You probably meant the
| Pixel 9 Pro.
|
| [0]: https://www.gsmarena.com/google_pixel_9-13219.php
| username135 wrote:
| Still an absurd amount of RAM for a phone, imo
| shkkmo wrote:
| Seems about right, my new laptop has 8x that, which is
| about the same ratio that my last new laptop had to my
| phone at the time.
| mrheosuper wrote:
| Not absurd. The base S21 Ultra from 2021 already shipped
| with 12GB of RAM. Four years later and the amount of RAM
| is still the same.
| jedberg wrote:
| Pornography, or any other "restricted use". They either want
| privacy or don't want to deal with the filters on commercial
| products.
|
| I'm sure there are other use cases, but much like "what is
| BitTorrent for?", the obvious use case is obvious.
| noosphr wrote:
| Data that can't leave the premises because it is too
| sensitive. There is a lot of security theater around cloud
| pretending to be compliant but if you actually care about
| security a locked server room is the way to do it.
| azinman2 wrote:
| I'm guessing its largely enthusiasts for now, but as they
| continue getting better:
|
| 1. App makers can fine tune smaller models and include in
| their apps to avoid server costs
|
| 2. Privacy-sensitive content can be either filtered out or
| worked on... I'm using local LLMs to process my health
| history for example
|
| 3. Edge servers can be running these fine tuned for a given
| task. Flash/lite models by the big guys are effectively like
| these smaller models already.
| m463 wrote:
| One use nobody mentions is hybrid use.
|
| Why not run all the models at home, maybe collaboratively or
| at least in parallel?
|
| I'm sure there are use cases where the paid models are not
| allowed to collaborate or ask each other.
|
| also, other open models are gaining mindshare.
| cameronh90 wrote:
| The cloud AI providers have unacceptable variation in
| response time for things that need a predictable runtime.
|
| Even if they did offer a defined latency product, you're
| relying on a lot of infrastructure between your application
| and their GPU.
|
| That's not always tolerable.
| ineedasername wrote:
| A local laptop of the past few years without a discrete GPU
| can run, at practical speeds depending on task, a gemma/llama
| model if it's (ime) under 4GB.
|
| For practical RAG processes of narrow scope, even a minimal
| amount of scaffolding gives a very usable speed for
| automating tasks, especially as the last-mile/edge-device
| portion of a more complex process with better models in use
| upstream. Classification tasks, reasonably intelligent
| decisions between traditional workflow processes, other use
| cases -- all of them extremely valuable in enterprise, being
| built and deployed right now.
| alecfong wrote:
| If you wanna compare on an h200 and play with trt-llm
| configs I setup this link here https://brev.nvidia.com/laun
| chable/deploy?launchableID=env-3...
| trenchpilgrim wrote:
| In some large, lucrative industries like aerospace many of
| the hosted models are off the table due to regulations such
| as ITAR. There'a a market for models which are run on prem/in
| GovCloud with a professional support contract for
| installation and updates.
| m3kw9 wrote:
| I'd use it on a plane if there was no network for coding, but
| otherwise it's just an emergency model if the internet goes
| out, basically end of the world scenarios
| xrmagnum wrote:
| It's striking how much of the AI conversation focuses on new
| use cases, while overlooking one of the most serious non-
| financial costs: privacy.
|
| I try to be mindful of what I share with ChatGPT, but even
| then, asking it to describe my family produced a response
| that was unsettling in its accuracy and depth.
|
| Worse, after attempting to delete all chats and disable
| memory, I noticed that some information still seemed to
| persist. That left me deeply concerned--not just about this
| moment, but about where things are headed.
|
| The real question isn't just "what can AI do?"--it's "who is
| keeping the record of what it does?" And just as importantly:
| "who watches the watcher?" If the answer is "no one," then
| maybe we shouldn't have a watcher at all.
| scubbo wrote:
| > I try to be mindful of what I share with ChatGPT, but
| even then, asking it to describe my family produced a
| response that was unsettling in its accuracy and depth.
|
| > Worse, after attempting to delete all chats and disable
| memory, I noticed that some information still seemed to
| persist.
|
| Maybe I'm missing something, but why wouldn't that be
| expected? The chat history isn't their only source of
| information - these models are trained on scraped public
| data. Unless there's zero information about you and your
| family on the public internet (in which case - bravo!), I
| would expect even a "fresh" LLM to have some information
| even without you giving it any.
| rcruzeiro wrote:
| I think you are underestimating how notable a person
| needs to be for their information to be baked into a
| model.
| nl wrote:
| LLMs can learn from a single example.
|
| https://www.fast.ai/posts/2023-09-04-learning-jumps/
| brookst wrote:
| That doesn't mean they learn from _every_ single example.
| staplers wrote:
| Worse, after attempting to delete all chats and disable
| memory, I noticed that some information still seemed to
| persist.
|
| Chatgpt was court ordered to save history logs.
|
| https://www.malwarebytes.com/blog/news/2025/06/openai-
| forced...
| Oreb wrote:
| That only means that OpenAI have to keep logs of all
| conversations, not that ChatGPT will retain memories of
| all conversations.
| ludwik wrote:
| > Worse, after attempting to delete all chats and disable
| memory, I noticed that some information still seemed to
| persist.
|
| I'm fairly sure "seemed" is the key word here. LLMs are
| excellent at making things up - they rarely say "I don't
| know" and instead generate the most probable guess. People
| also famously overestimate their own uniqueness. Most
| likely, you accidentally recreated a kind of Barnum effect
| for yourself.
| seany wrote:
| Jail breaking then running censored questions. Like diy
| fireworks, or analysis of papers that touch "sensitive
| topics", nsfw image generation the list is basically endless.
| nfRfqX5n wrote:
| You're asking the biggest group of people who would want to
| do this
| deadbabe wrote:
| We use it locally for deep packet inspection.
| cyanydeez wrote:
| People who want programmatic solutions that wont be rug
| pulled
| julianozen wrote:
| Worth mentioning that today's expensive hardware will be
| built into the cheapest iPhone in less than 10 years.
|
| That means running instantly offline and every token is free
| sturadnidge wrote:
| If you have capable hardware and kids, a local LLM is great.
| A simple system prompt customisation (e.g. 'all responses
| should be written as if talking to a 10 year old') and
| knowing that everything is private goes a long way for me at
| least.
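|
| A minimal sketch of that kind of setup against a local Ollama
| instance (the model name, port and wording are placeholders,
| not a prescription):
|
|     # Kid-friendly system prompt against a local Ollama server.
|     # http://localhost:11434 is Ollama's default address; the
|     # model name is a placeholder for whatever you have pulled.
|     import requests
|
|     system = ("All responses should be written as if "
|               "talking to a 10 year old.")
|     body = {
|         "model": "gpt-oss:20b",
|         "stream": False,
|         "messages": [
|             {"role": "system", "content": system},
|             {"role": "user", "content": "Why is the sky blue?"},
|         ],
|     }
|     r = requests.post("http://localhost:11434/api/chat", json=body)
|     print(r.json()["message"]["content"])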
| shironandonon_ wrote:
| air gaps, my man.
| setopt wrote:
| I'm highly interested in local models for privacy reasons. In
| particular, I want to give an LLM access to my years of
| personal notes and emails, and answer questions with
| references to those. As a researcher, there's lots of
| unpublished stuff in there that I sometimes either forget or
| struggle to find again due to searching for the wrong
| keywords, and a local LLM could help with that.
|
| I pay for ChatGPT and use it frequently, but I wouldn't trust
| uploading all that data to them even if they let me. I've so
| far been playing around with Ollama for local use.
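|
| A minimal sketch of what that could look like with Ollama
| (assuming an embedding model such as nomic-embed-text has been
| pulled; paths, model names and the question are placeholders,
| and chunking/persistence are skipped):
|
|     # Toy local RAG over a folder of plain-text notes via Ollama.
|     import glob, requests
|
|     OLLAMA = "http://localhost:11434"
|
|     def embed(text):
|         r = requests.post(f"{OLLAMA}/api/embeddings",
|                           json={"model": "nomic-embed-text",
|                                 "prompt": text})
|         return r.json()["embedding"]
|
|     def cosine(a, b):
|         dot = sum(x * y for x, y in zip(a, b))
|         na = sum(x * x for x in a) ** 0.5
|         nb = sum(y * y for y in b) ** 0.5
|         return dot / (na * nb)
|
|     notes = {p: open(p).read() for p in glob.glob("notes/*.txt")}
|     index = {p: embed(t) for p, t in notes.items()}
|
|     question = "What did I conclude about the 2019 pilot study?"
|     q = embed(question)
|     best = max(index, key=lambda p: cosine(q, index[p]))
|
|     prompt = (f"Using only this note:\n{notes[best]}\n\n"
|               f"Answer the question: {question}")
|     r = requests.post(f"{OLLAMA}/api/generate",
|                       json={"model": "gpt-oss:20b",
|                             "prompt": prompt, "stream": False})
|     print(best, "->", r.json()["response"])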
| lynnesbian wrote:
| I can provide a real-world example: Low-latency code
| completion.
|
| The JetBrains suite includes a few LLM models on the order of
| a hundred megabytes. These models are able to provide
| "obvious" line completion, like filling in variable names, as
| well as some basic predictions, like realising that the `if
| let` statement I'm typing out is going to look something like
| `if let Some(response) =
| client_i_just_created.foobar().await`.
|
| If that was running in The Cloud, it would have latency
| issues, rate limits, and it wouldn't work offline. Sure,
| there's a pretty big gap between these local IDE LLMs and
| what OpenAI is offering here, but if my single-line
| autocomplete could be a little smarter, I sure wouldn't
| complain.
| mrheosuper wrote:
| I don't have latency issue with github copilot. Maybe i'm
| less sensitive to it.
| dsubburam wrote:
| > I'm still trying to understand what is the biggest group of
| people that uses local AI (or will)?
|
| Creatives? I am surprised no one's mentioned this yet:
|
| I tried to help a couple of friends with better copy for
| their websites, and quickly realized that they were using
| inventive phrases to explain their work, phrases that they
| would not want competitors to get wind of and benefit from;
| phrases that associate closely with their personal brand.
|
| Ultimately, I felt uncomfortable presenting the cloud AIs
| with their text. Sometimes I feel this way even with my own
| Substack posts, where I occasionally coin a phrase I am proud
| of. But with local AI? Cool...
| flir wrote:
| > I tried to help a couple of friends with better copy for
| their websites, and quickly realized that they were using
| inventive phrases to explain their work, phrases that they
| would not want competitors to get wind of and benefit from;
| phrases that associate closely with their personal brand.
|
| But... they're _publishing_ a website. Which competitors
| will read. Which chatbots will scrape. I genuinely don't
| get it.
| dismalaf wrote:
| The use case is building apps.
|
| A small LLM can do RAG, call functions, summarize, create
| structured data from messy text, etc... You know, all the
| things you'd do if you were making an actual app with an LLM.
|
| Yeah, chat apps are pretty cheap and convenient for users who
| want to search the internet and write text or code. But APIs
| quickly get expensive when inputting a significant amount of
| tokens.
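|
| For instance, the "structured data from messy text" case is a
| few lines against a local Ollama endpoint (a sketch; the model
| name and fields are placeholders, and Ollama's "format": "json"
| option does the constraining):
|
|     # Pull structured fields out of messy free text locally.
|     import json, requests
|
|     messy = "pls ship 3 boxes of M4 screws to Anna K, 12 Elm St, by Fri"
|     prompt = ("Extract item, quantity, recipient and deadline "
|               "from the text below as a JSON object.\n\n" + messy)
|     r = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "gpt-oss:20b", "prompt": prompt,
|               "format": "json", "stream": False},
|     )
|     order = json.loads(r.json()["response"])
|     print(order)   # e.g. {"item": "M4 screws", "quantity": 3, ...}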
| somenameforme wrote:
| Why not turn the question around. All other things being
| equal, who would _prefer_ to use a rate limited and /or for-
| pay service if you could obtain at least comparable quality
| locally for free with no limitations, no privacy concerns, no
| censorship (beyond that baked into the weights you choose to
| use), and no net access required?
|
| It's a pretty bad deal. So it must be that all other things
| aren't equal, and I suppose the big one is hardware. But
| neural net based systems always have a point of sharply
| diminishing returns, which we seem to have unambiguously hit
| with LLMs already, while the price of hardware is constantly
| decreasing and its quality increasing. So as we go further
| into the future, the practicality of running locally will
| only increase.
| philip1209 wrote:
| I'm excited to do just dumb and irresponsible things with a
| local model, like "iterate through every single email in my
| 20-year-old gmail account and apply label X if Y applies" and
| not have a surprise bill.
|
| I think it can make LLMs fun.
| taneq wrote:
| I wrote a script to get my local Gemma3 instance to tag
| and rename everything in my meme folder. :P
| georgeecollins wrote:
| There's a bunch of great reasons in this thread, but how
| about the chip manufacturers that are going to need you to
| need a more powerful set of processors in your phone,
| headset, computer. You can count on those companies to
| subsidize some R&D and software development.
| jona777than wrote:
| One of my favorite use cases includes simple tasks like
| generating effective mock/masked data from real data. Then
| passing the mock data worry-free to the big three (or
| wherever.)
|
| There's also a huge opportunity space for serving clients
| with very sensitive data. Health, legal, and government come
| to mind immediately. These local models are only going to get
| more capable of handling their use cases. They already are,
| really.
| itake wrote:
| Local micro models are both fast and cheap. We tuned small
| models on our data set and if the small model thinks content
| is a certain way, we escalate to the LLM.
|
| This gives us really good recall at really low cloud cost and
| latency.
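|
| A rough sketch of that kind of cascade (the model names,
| prompt convention and threshold below are placeholders, not
| the parent's actual setup):
|
|     # Small-model-first cascade: ask a cheap local model and
|     # only escalate to the big hosted model when it isn't sure.
|     import json, requests
|
|     def small_model(text):
|         prompt = ("Does this text describe a product defect? "
|                   'Reply as JSON with keys "label" ("yes"/"no") '
|                   'and "confidence" (0-1).\n\n' + text)
|         r = requests.post("http://localhost:11434/api/generate",
|                           json={"model": "qwen3:4b", "prompt": prompt,
|                                 "format": "json", "stream": False})
|         return json.loads(r.json()["response"])
|
|     def big_model(text):
|         ...  # placeholder: call the hosted API for hard cases
|
|     def classify(text, threshold=0.8):
|         guess = small_model(text)
|         if guess.get("confidence", 0) >= threshold:
|             return guess["label"]      # cheap, low-latency path
|         return big_model(text)         # escalate the hard cases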
| bavell wrote:
| I'd love to try this on my data set - what
| approach/tools/models did you use for fine-tuning?
| sznio wrote:
| >Students who don't want to pay but somehow have the
| hardware?
|
| that's me - well not a student anymore. when toying with
| something, i much prefer not paying for each shot. my 12GB
| Radeon card can either run a decent model extremely slowly,
| or an idiotic but fast one. it's nice not dealing with rate
| limits.
|
| once you write a prompt that mangles an idiotic model into
| still doing the work, it's really satisfying. the same
| principle as working to extract the most from limited
| embedded hardware. masochism, possibly
| etoxin wrote:
| Some app devs use local models on local environments with LLM
| APIs to get up and running fast, then when the app deploys it
| switches to the big online models via environment vars.
|
| In large companies this can save quite a bit of money.
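|
| With an OpenAI-compatible local server that switch can be a
| couple of environment variables (a sketch; the variable names
| are made up, and both Ollama and llama.cpp's server expose a
| /v1 endpoint):
|
|     # Pick local vs. hosted model purely via environment vars.
|     import os
|     from openai import OpenAI
|
|     client = OpenAI(
|         base_url=os.getenv("LLM_BASE_URL",
|                            "http://localhost:11434/v1"),
|         api_key=os.getenv("LLM_API_KEY", "ollama"),  # dummy locally
|     )
|     model = os.getenv("LLM_MODEL", "gpt-oss:20b")
|
|     reply = client.chat.completions.create(
|         model=model,
|         messages=[{"role": "user", "content": "ping"}],
|     )
|     print(reply.choices[0].message.content)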
| muzani wrote:
| Privacy laws. Processing government paperwork with LLMs for
| example. There's a lot of OCR tools that can't be used, and
| the ones that comply are more expensive than say, GPT-4.1 and
| lower quality.
| metanonsense wrote:
| Maybe I am too pessimistic, but as an EU citizen I expect
| politics (or should I say Trump?) to prevent access to US-
| based frontier models at some point.
| TrackerFF wrote:
| Agencies / firms that work with classified data. Some places
| have very strict policies on data, which makes it impossible
| to use any service that isn't local and air-gapped.
|
| example: military intel
| lucumo wrote:
| I'm in a corporate environment. There's a study group to see
| if maybe we can potentially get some value out of those AI
| tools. They've been "studying" the issue for over a year now.
| They expect to get some cloud service that we can safely use
| Real Soon Now.
|
| So, it'll take at least two more quarters before I can
| actually use those non-local tools on company related data.
| Probably longer, because sense of urgency is not this
| company's strong suit.
|
| Anyway, as a developer I can run a lot of things locally.
| Local AI doesn't leak data, so it's safe. It's not as good as
| the online tools, but for some things it's better than
| nothing.
| mastermage wrote:
| I am just a cheapskate that wants to scale back on all
| subscription costs. I fucking hate subscriptions.
| benreesman wrote:
| "Because you can and its cool" would be reason enough: plenty
| of revolutions have their origin in "because you can"
| (Wozniak right off the top of my head, Gates and Altair,
| stuff like that).
|
| But uncensored is a big deal too: censorship is capability
| reducing (check out Kilcher's GPT4Chan video and references,
| the Orca work and Dolphin de-tune lift on SWE-Bench style
| evals). We pay dearly in capability to get "non-operator-
| alignment", and you'll notice that competition is hot enough
| now that at the frontier (Opus, Qwen) the "alignment" away
| from operator intent is getting very, very mild.
|
| And then there's the compression. Phi-3 or something works on
| a beefy laptop and has a nontrivial approximation of "the
| internet" that works on an airplane or a beach with no
| network connectivity, talk about vibe coding? I like those
| look up all the docs via a thumbdrive in Phuket vibes.
|
| And on diffusion stuff, SOTA fits on a laptop or close; you
| can crush OG Midjourney or SD on a MacBook, it's an even
| smaller gap.
|
| Early GPT-4 ish outcomes are possible on a Macbook Pro or
| Razer Blade, so either 12-18 month old LLMs are useless, or
| GGUF is useful.
|
| The AI goalposts thing cuts both ways. If AI is "whatever
| only Anthropic can do"? That's just as silly as "whatever a
| computer can't do" and a lot more cynical.
| novok wrote:
| Psychs who dont trust ai companies
| Roark66 wrote:
| People like myself who firmly believe there will come a
| time, possibly very soon, when all these companies (OpenAI,
| Anthropic etc) will raise their prices substantially. By then
| no one will be able to do their work to the standard expected
| of them without AI, and by then maybe they charge $1k per
| month, maybe they charge $10k. If there is no viable
| alternative the sky is the limit.
|
| Why do you think they continue to run at a loss? From the
| goodness of their heart? Their biggest goal is to discourage
| anyone from running local models. The hardware is
| expensive... Running models well is very difficult (for
| example I have dual RTX 3090s for VRAM, and running large
| heavily quantized models is a real pain in the arse; no high
| quantisation library supports two GPUs, for example, and there
| seems to be no interest in implementing it by the guys
| behind the best inference tools).
|
| So this is welcome, but let's not forget why it is being
| done.
| Gracana wrote:
| > no high quantisation library supports two GPUs for
| example, and there seems to be no interest in
| implementating it by the guys behind the best inference
| tools
|
| I'm curious to hear what you're trying to run, because I
| haven't used any software that is not compatible with
| multiple GPUs.
| jlokier wrote:
| At the company where I currently work, for IP reasons (and
| with the advice of a patent lawyer), nobody is allowed to use
| any online AIs to talk about or help with work, unless it's
| very generic research that doesn't give away what we're
| working on.
|
| That rules out coding assistants like Claude, chat, tools to
| generate presentations and copy-edit documents, and so forth.
|
| But local AI are fine, as long as we're sure nothing is
| uploaded.
| ricardobayes wrote:
| I would say, any company who doesn't have their own AI
| developed. You always hear companies "mandating" AI usage,
| but for the most part it's companies developing their own
| solutions/agents. No self-respecting company with a tight
| opsec would allow a random "always-online" LLM that could
| just rip your codebase either piece by piece or the whole
| thing at once if it's an IDE addon (or at least I hope that's
| the case). So yeah, I'd say locally deployed LLM's/Agents are
| a gamechanger.
| athrowaway3z wrote:
| Don't know about the biggest, but IMO the exciting thing
| about open models is the possibility of creating whole new
| things.
|
| For example, "generate a heatmap of each token/word and how
| 'unexpected' they are" or "find me a prompt that creates the
| closest match to this text"
|
| To be efficient both require access that is not exposed over
| API.
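|
| The surprisal heatmap, for instance, is only a few lines once
| the weights are local (a sketch with Hugging Face transformers;
| the checkpoint name is a placeholder and any causal LM works):
|
|     # Per-token surprisal: -log p(token | prefix) from local weights.
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "openai/gpt-oss-20b"  # placeholder; use any causal LM
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name,
|                                                  torch_dtype="auto")
|
|     text = "The quick brown fox jumps over the lazy dog."
|     ids = tok(text, return_tensors="pt").input_ids
|     with torch.no_grad():
|         logits = model(input_ids=ids).logits
|
|     logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
|     targets = ids[:, 1:].unsqueeze(-1)
|     surprisal = -logprobs.gather(2, targets).squeeze()
|
|     tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
|     for token, s in zip(tokens, surprisal.tolist()):
|         print(f"{token:>12}  {s:5.2f}")  # higher = more 'unexpected'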
| yreg wrote:
| > I'm still trying to understand what is the biggest group of
| people that will use local AI?
|
| iPhone users in a few months - because I predict app
| developers will love cramming calls to the foundation models
| into everything.
|
| Android will follow.
| dongobread wrote:
| How up to date are you on current open weights models? After
| playing around with it for a few hours I find it to be nowhere
| near as good as Qwen3-30B-A3B. The world knowledge is severely
| lacking in particular.
| Nomadeon wrote:
| Agree. Concrete example: "What was the Japanese codeword for
| Midway Island in WWII?"
|
| Answer on Wikipedia:
| https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-
| bre...
|
| dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in
| <2 seconds
|
| deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10
| seconds
|
| gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6
| seconds
|
| gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3
| minutes !
|
| Yea yea it's only one question of nonsense trivia. I'm sure
| it was billions well spent.
|
| It's possible I'm using a poor temperature setting or
| something but since they weren't bothered enough to put it in
| the model card I'm not bothered to fuss with it.
| anorwell wrote:
| I think your example reflects well on oss-20b, not poorly.
| It (may) show that they've been successful in separating
| reasoning from knowledge. You don't _want_ your small
| reasoning model to waste weights memorizing minutiae.
| bigmanhank wrote:
| Not true: During World War II the Imperial Japanese Navy
| referred to Midway Island in their communications as
| "Milano" (mirano). This was the official code word used
| when planning and executing operations against the island,
| including the Battle of Midway.
|
| 12.82 tok/sec 140 tokens 7.91s to first token
|
| openai/gpt-oss-20b
| WmWsjA6B29B4nfk wrote:
| What's not true? This is a wrong answer
| Voloskaya wrote:
| > gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after
| 3 minutes !
|
| To be fair, this is not the type of question that benefits
| from reasoning: either the model has this info in its
| parametric memory or it doesn't. Reasoning won't help.
| seba_dos1 wrote:
| How would asking this kind of question without providing
| the model with access to Wikipedia be a valid benchmark for
| anything useful?
| nojito wrote:
| Why does it need knowledge when it can just call tools to get
| it?
| pxc wrote:
| Right... knowledge is one of the things (the one thing?)
| that LLMs are really horrible at, and that goes double for
| models small enough to run on normal-ish consumer hardware.
|
| Shouldn't we prefer to have LLMs just search and summarize
| more reliable sources?
| jdiff wrote:
| Even large hosted models fail at that task regularly.
| It's a silly anecdotal example, but I asked the Gemini
| assistant on my Pixel whether [something] had seen a new
| release to match the release of [upstream thing].
|
| It correctly chose to search, and pulled in the release
| page itself as well as a community page on reddit, and
| cited both to give me the incorrect answer that a release
| had been pushed 3 hours ago. Later on when I got around
| to it, I discovered that no release existed, no mention
| of a release existed on either cited source, and a new
| release wasn't made for several more days.
| nojito wrote:
| Yup which is why these models are so exciting!
|
| They are specifically training on webbrowsing and python
| calling.
| moodler wrote:
| Reliable sources that are becoming polluted by output
| from knowledge-poor LLMs, or overwhelmed and taken
| offline by constant requests from LLMs doing web scraping
| ...
| notachatbot123 wrote:
| Why do I need "AI" when I can just (theoretically, in good
| old times Google) Google it?
| nojito wrote:
| Because now the model can do it for you and you can focus
| on other more sophisticated tasks.
|
| I am aware that there's a huge group of people who
| justify their salary by being able google.
| kmacdough wrote:
| I too am skeptical of these models, but it's a reasoning
| focused model. As a result this isn't a very appropriate
| benchmark.
|
| Small models are going to be particularly poor when used
| outside of their intended purpose. They have to omit
| something.
| Cicero22 wrote:
| Where did you get the top ten from?
|
| https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
|
| Are you discounting all of the self reported scores?
| zwischenzug wrote:
| Came here to say this. It's behind the 14b Phi-reasoning-plus
| (which is self-reported).
|
| I don't understand why "TIGER-Lab"-sourced scores are
| 'unknown' in terms of model size.
| int_19h wrote:
| I tried 20b locally and it couldn't reason a way out of a basic
| river crossing puzzle with labels changed. That is not anywhere
| near SOTA. In fact it's worse than many local models that can
| do it, including e.g. QwQ-32b.
| robwwilliams wrote:
| Well river crossings are one type of problem. My real world
| problem is proofing and minor editing of text. A version
| installed on my portable would be great.
| cosmojg wrote:
| Have you tried Google's Gemma-3n-E4B-IT in their AI Edge
| Gallery app? It's the first local model that's really blown
| me away with its power-to-speed ratio on a mobile device.
|
| See: https://github.com/google-ai-
| edge/gallery/releases/tag/1.0.3
| 1123581321 wrote:
| Dozens of locally runnable models can already do that.
| golol wrote:
| I heard the OSS models are terrible at anything other than
| math, code, etc.
| mark_l_watson wrote:
| Yes, I always evaluate models on my own prompts and use
| cases. I glance at evaluation postings but I am also only
| interested in my own use cases.
| 9rx wrote:
| I tried the "two US presidents having the same parents" one,
| and while it understood the intent, it got caught up in being
| adamant that Joe Biden won the election in 2024. Anything I
| did to try and tell it otherwise was dismissed as false, and
| it expressed quite definitely that I need to do proper
| research with legitimate sources.
| tankenmate wrote:
| chat log please?
| 9rx wrote:
| https://dpaste.org/zOev0
| rafaelmn wrote:
| Is the knowledge cutoff for this thing so stale or is
| this just bad performance on recent data ?
| 9rx wrote:
| It is painful to read, I know, but if you make it towards
| the end it admits that its knowledge cutoff was prior to
| the election and that it doesn't know who won. Yet, even
| then, it still remains adamant that Biden won.
| dragonwriter wrote:
| The knowledge cutoff is before the 2024 election (which
| was, after all, just 9 months ago), June 2024 (I believe
| this is consistent with the current versions of GPT-4o
| and -4.1), after Biden had secured the nomination.
|
| It is very clear in that chat logs (which include
| reasoning traces) that the model knew that, knew what the
| last election it knew about was, and answered correctly
| based on its cut off initially. Under pressure to answer
| about an election that was not within its knowledge
| window it then confabulated a Biden 2024 victory, which
| it dug in on after being contradicted with a claim that,
| based on the truth at the time of its knowledge cutoff,
| was unambiguously false ("Joe Biden did not run") He, in
| fact, did run for reelection, but withdrew after having
| secured enough delegates to win the nomination by a wide
| margin on July 21. Confabulation (called "hallucination"
| in AI circles, but it is more like human confabulation
| than hallucination) when pressed for answers on questions
| for which it lacks grounding remains an unsolved AI
| problem.
|
| Unsolved, but mitigated by providing it grounding
| independent of its knowledge cutoff, e.g., by tools like
| web browsing (which GPT-OSS is specifically trained for,
| but that training does no good if it's not hooked into a
| framework which provides it the tools.)
| lucumo wrote:
| I've never found the Socratic method to work well on any
| model I've tried it with. They always seem to get stuck
| justifying their previous answers.
|
| We expect them to answer the question and re-reason the
| original question with the new information, because
| that's what a human would do. Maybe next time I'll try to
| be explicit about that expectation when I try the
| Socratic method.
| FergusArgyll wrote:
| incredible
| freehorse wrote:
| I mean I would hardly blame the specific model, Anthropic
| has a specific mention in their system prompts on trump
| winning. For some reason llms get confused with this one.
| jari_mustonen wrote:
| It's the political bias in the training material. No
| surprise there.
| regularfry wrote:
| More likely is that there's a lot of source material
| having to very stridently assert that Trump didn't win in
| 2020, and it's generalising to a later year. That's not
| political bias.
| ben_w wrote:
| It's also extremely weird that Trump did win in 2024.
|
| If I'd been in a coma from Jan 1 2024 to today, and woke
| up to people saying Trump was president again, I'd think
| they were pulling my leg or testing my brain function to
| see if I'd become gullible.
| cpursley wrote:
| You're in a bubble. It was no surprise to folks who touch
| grass on the regular.
| ben_w wrote:
| > You're in a bubble.
|
| Sure, all I have to go on from the other side of the
| Atlantic is the internet. So in that regard, kinda like
| the AI.
|
| One of the big surprises from the POV of me in Jan 2024,
| is that I would have anticipated Trump being in prison
| and not even available as an option for the Republican
| party to select as a candidate for office, and that even
| if he had not gone to jail that the Republicans would not
| want someone who behaved as he did on Jan 6 2021.
| OldfieldFund wrote:
| you can run for presidency from prison :)
| exasperaited wrote:
| And he would have. And might have won. Because his
| I'm-the-most-innocent-persecuted-person messaging was
| clearly landing.
|
| I am surprised the grandparent poster didn't think
| Trump's win was at least entirely possible in January
| 2024, and I am on the same side of the Atlantic. All the
| indicators were in place.
|
| There was basically no chance he'd _actually_ be in
| prison by November anyway, because he was doing something
| else extremely successfully: delaying court cases by
| playing off his obligations to each of them.
|
| Back then I thought his chances of winning were above
| 60%, and the betting markets were never _ever_ really in
| favour of him losing.
| username332211 wrote:
| I'm pretty sure you are completely correct on the last
| part. Nobody in Republican management wanted a second
| Trump term. If the candidate wasn't Trump, Republicans
| would have had a guaranteed victory. Imagine that
| infamous debate, but with some 50-year-old youngster
| facing Joe Biden.
|
| It's the White House that wanted Trump to be candidate.
| They played Republican primary voters like a fiddle by
| launching a barrage of transparently political
| prosecutions just as Republican primaries were starting.
|
| And then they still lost the general election.
| FrustratedMonky wrote:
| You think the Democratic White House manipulated
| Republicans into voting for Trump. So it is the Democrats'
| fault we have Trump??? Next Level Cope.
| 9rx wrote:
| _> You think the Democratic White House, manipulated
| Republicans into Voting for Trump._
|
| Yes, that is what he thinks. Did you not read the
| comment? It is, like, uh, right there...
|
| He also explained his reasoning: If Trump didn't win the
| party race, a more compelling option (the so-called
| "50-year-old youngster") would have instead, which he
| claims would have guaranteed a Republican win. In other
| words, what he is saying that the White House was banking
| on Trump losing the presidency.
| FrustratedMonky wrote:
| "explained his reasoning"
|
| Well, I guess, if you are taking some pretty wild
| speculation as a reasoned explanation. There isn't much
| hope for you.
|
| Maybe it was because the Democrats knew the Earth was
| about to be invaded by an Alien race, and they also
| knew Trump was actually a lizard person (native to Earth
| and thus on their joint side). And Trump would be able to
| defeat them, so using the secret mind control powers, the
| Democrats were able to sway the election to allow Trump
| to win and thus use his advanced Lizard technology to
| save the planet. Of course, this all happened behind the
| scenes.
|
| I think if someone is saying the Democrats are so
| powerful and skillful, that they can sway the election to
| give Trump the primary win, but then turn around and
| lose. That does require some clarification.
|
| I'm just hearing a lot of these crazy arguments that
| somehow everything Trump does is the fault of the
| Democrats. They are crazy on the face of it. Maybe if
| people had to clarify their positions they would realize
| 'oh, yeah, that doesn't make sense'.
| amalcon wrote:
| I mean, the presumptive GOP primary candidates at the
| time were Trump, Trump-lite (DeSantis), about 10 Trump
| sycophants, and Haley. He had demonstrated a high level
| of influence over GOP primary voters in the 2022 midterm.
| It had been (internally) obvious since at least then that
| he was going to win the primary again. I can't speak to
| how much of that made it across the Atlantic.
|
| Whether he would win the general was an open question
| then. In the American system, your prediction should
| never get very far from a coin flip a year out.
| bavell wrote:
| Unfortunately, it was predictable given the other
| "choices"
| exasperaited wrote:
| It's not extremely weird _at all_.
|
| I, a British liberal leftie who considers this win one of
| the signs of the coming apocalypse, can tell you why:
|
| Charlie Kirk may be an odious little man but he ran an
| _exceptional_ ground game, Trump fully captured the
| Libertarian Party (and amazingly delivered on a promise
| to them), Trump was well-advised by his son to campaign
| on Tiktok, etc. etc.
|
| Basically what happened is the 2024 version of the "fifty
| state strategy", except instead of states, they
| identified micro-communities, particularly among the
| extremely online, and crafted messages for each of those.
| Many of which are actually inconsistent -- their
| messaging to muslim and jewish communities was
| inconsistent, their messaging to spanish-speaking
| communities was inconsistent with their mainstream
| message etc.
|
| And then a lot of money was pushed into a few
| battleground states by Musk's operation.
|
| It was a highly technical, broad-spectrum win, built on
| relentless messaging about persecution etc., and he had
| the advantage of running against someone he could
| stereotype very successfully to his base and whose
| candidacy was late.
|
| Another way to look at why it is not extremely weird, is
| to look at history. Plenty of examples of jailed or
| exiled monarchs returning to power, failed coup leaders
| having another go, criminalised leaders returning to
| elected office, etc., etc.
|
| Once it was clear Trump still retained control over the
| GOP in 2022, his re-election became at least quite
| likely.
| quatonion wrote:
| I think models generally have cognitive dissonance
| regarding world politics. They are also constantly
| shocked when you tell them what date it is, and go very
| quiet.
| DoctorOetker wrote:
| can you give some guidelines to achieve the quieting down?
| do they emit fewer tokens afterward?
| diggan wrote:
| I noticed the same when asking various LLMs to summarize
| and explaining some "Presidential Actions" (from
| https://www.whitehouse.gov/presidential-actions/), most
| of them answer "This is just theoretical, since no such
| executive actions actually exists, but assuming something
| like that would happen in the future, it would mean ..."
| while a few have returned something like "This fictional
| executive action would be illegal so I cannot summarize
| the content", even when I provide direct links and they
| fetch the content themselves. Not exactly sure why that
| is.
| mark_l_watson wrote:
| I think the lesson is: smaller models hallucinate more, so
| only use them in your applications where you load up large
| prompts with specific data to reason about. Then even the
| small Google gemma3n 4B model can be amazingly useful.
|
| I use the SOTA models from Google and OpenAI mostly for
| getting feedback on ideas, helping me think through
| designs, and sometimes for coding.
|
| Your question is clearly best answered using a large
| commercial model with a web search tool. That said,
| integrating a local model with a home built interface to
| something like the Brave search API can be effective but I
| no longer make the effort.
| 9rx wrote:
| _> think the lesson is: smaller models hallucinate more_
|
| The interesting part isn't the hallucination, but the
| sheer unwillingness to take in new information.
| dragonwriter wrote:
| Might have dug in less on the confabulation about
| information outside of its knowledge cutoff if the new
| information weren't offered with support from a user
| "hallucination" about information _within_ its knowledge
| cutoff. More detail:
|
| https://news.ycombinator.com/item?id=44809145
| 9rx wrote:
| _> Might have dug in less..._
|
| The digging in at all is what is interesting. Like an
| earlier comment alluded to, the presumptive value of
| these tools is being able to feed it your own information
| where that information is to be considered authoritative.
|
| _> More detail: [...] "He, in fact, did run for
| reelection"_
|
| A slow walk, maybe. He was in no condition to run. That
| is why he ultimately dropped out. But, really, that
| statement is just a silly game of semantics. "Run", when
| used in hindsight, often implies completion. This LLM
| model even says so too. If a model doesn't understand
| nuance, that is also interesting.
| aaroninsf wrote:
| Have we considered the possibility that maybe it knows
| something we don't.
| dragonwriter wrote:
| > In fact it's worse than many local models that can do it,
| including e.g. QwQ-32b.
|
| I'm not going to be surprised that a 20B 4/32 MoE model (3.6B
| parameters activated) is less capable at a particular problem
| category than a 32B dense model, and it's quite possible for
| both to be SOTA, as state of the art at different scale (both
| parameter count and speed which scales with active resource
| needs) is going to have different capabilities. TANSTAAFL.
| __alexs wrote:
| [flagged]
| whynotminot wrote:
| He's saying there's different goalposts at different
| model sizes. Is that unreasonable?
| tomhow wrote:
| Please don't post snark like this on HN. If you wouldn't
| mind reviewing
| https://news.ycombinator.com/newsguidelines.html and
| taking the intended spirit of the site more to heart,
| we'd be grateful.
| lannisterstark wrote:
| This isn't reddit.
| CMay wrote:
| The 20b solved the wolf, goat, cabbage river crossing puzzle
| set to high reasoning for me without needing to use a system
| prompt that encourages critical thinking. It managed it using
| multiple different recommended settings, from temperatures of
| 0.6 up to 1.0, etc.
|
| Other models have generally failed that without a system
| prompt that encourages rigorous thinking. Each of the
| reasoning settings may very well have thinking guidance baked
| in there that do something similar, though.
|
| I'm not sure it says that much that it can solve this, since
| it's public and can be in training data. It does say
| something if it can't solve it, though. So, for what it's
| worth, it solves it reliably for me.
|
| Think this is the smallest model I've seen solve it.
| aspect0545 wrote:
| But was it reasoning or did it solve this because it was
| parroting its training data?
| ben_w wrote:
| Allow me to answer with a rhetorical question:
|
| S8O2bm5lbiBTaWUgZGllc2VuIFNhdHogbGVzZW4sIGRhIGVyIGluIEJhc
| 2UtNjQta29kaWVydGVtIERldXRzY2ggdm9ybGllZ3Q/IEhhYmVuIFNpZS
| BkaWUgQW50d29ydCB2b24gR3J1bmQgYXVmIGVyc2NobG9zc2VuIG9kZXI
| gaGFiZW4gU2llIG51ciBCYXNlIDY0IGVya2FubnQgdW5kIGRhcyBFcmdl
| Ym5pcyBkYW5uIGluIEdvb2dsZSBUcmFuc2xhdGUgZWluZ2VnZWJlbj8gV
| 2FzIGlzdCDDvGJlcmhhdXB0IOKAnnJlYXNvbmluZ+KAnCwgd2VubiBtYW
| 4gbmljaHQgZGFzIEdlbGVybnRlIGF1cyBlaW5lbSBGYWxsIGF1ZiBlaW5
| lbiBhbmRlcmVuIGFud2VuZGV0Pw==
|
| And yes, that's a question. Well, three, but still.
| danbruc wrote:
| In case of the river puzzle there is a huge difference
| between repeating an answer that you read somewhere and
| figuring it out on your own, one requires reasoning the
| other does not. If you swap out the animals involved,
| then you need some reasoning to recognize the identical
| structure of the puzzles and map between the two sets of
| animals. But you are still very far from the amount of
| reasoning required to solve the puzzle without already
| knowing the answer.
|
| You can do it brute force, that requires again more
| reasoning than mapping between structurally identical
| puzzles. And finally you can solve it systematically,
| that requires the largest amount of reasoning. And in all
| those cases there is a crucial difference between blindly
| repeating the steps of a solution that you have seen
| before and coming up with that solution on your own even
| if you can not tell the two cases apart by looking at the
| output which would be identical.
| tanseydavid wrote:
| <well-played>
| daveguy wrote:
| As mgoetzke challenges, change the names of the items to
| something different, but the same puzzle. If it fails
| with "fox, hen, seeds" instead of "wolf, goat, cabbage"
| then it wasn't reasoning or applying something learned to
| another case. It was just regurgitating from the training
| data.
| odo1242 wrote:
| (Decoded, if anyone's wondering):
|
| > Konnen Sie diesen Satz lesen, da er in
| Base-64-kodiertem Deutsch vorliegt? Haben Sie die Antwort
| von Grund auf erschlossen oder haben Sie nur Base 64
| erkannt und das Ergebnis dann in Google Translate
| eingegeben? Was ist uberhaupt ,,reasoning", wenn man
| nicht das Gelernte aus einem Fall auf einen anderen
| anwendet?
|
| >
|
| > Can you read this sentence, since it's in Base-64
| encoded German? Did you deduce the answer from scratch,
| or did you just recognize Base 64 and then enter the
| result into Google Translate? What is "reasoning" anyway
| if you don't apply what you've learned from one case to
| another?
| CMay wrote:
| Maybe both? I tried using different animals, scenarios,
| solvable versions, unsolvable versions, it gave me the
| correct answer with high reasoning in LM Studio. It does
| tell me it's in the training data, but it does reason
| through things fairly well. It doesn't feel like it's
| just reciting the solution and picks up on nuances around
| the variations.
|
| If I switch from LM Studio to Ollama and run it using the
| CLI without changing anything, it will fail and it's
| harder to set the reasoning amount. If I use the Ollama
| UI, it seems to do a lot less reasoning. Not sure the
| Ollama UI has an option anywhere to adjust the system
| prompt so I can set the reasoning to high. In LM Studio
| even with the Unsloth GGUF, I can set the reasoning to
| high in the system prompt even though LM Studio won't
| give you the reasoning amount button to choose it with on
| that version.
| mgoetzke wrote:
| Try changing the names of the objects. eg fox, hen, seeds
| for examples
| tweezy wrote:
| Just out of curiosity, did you set the reasoning effort to
| high?
|
| I've seen big jumps in performance between medium and high
| reasoning. Not that this is will solve your particular issue,
| but curious to see what settings you had.
|
| Edit: I gave this a shot. My settings were temperature = 0.00
| (not sure if these models respect temp settings but did it
| just in case), I also set reasoning effort to high. I'm using
| the groq-hosted version.
|
| This was my prompt
|
| "You have a fox, a chicken and some grain. The chicken will
| eat the grain if left alone with it. The fox is vegetarian,
| for religious reasons, and will also eat the grain if left
| alone with it. You need to get all across a river, but your
| boat can only take one at a time. How do you proceed?"
|
| This was the response:
|
| "*Solution - 7 moves*
|
| | Move | What you take | What's left on the left bank | What's on the right bank |
| |------|---------------|------------------------------|--------------------------|
| | 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |
| | 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |
| | 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |
| | 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |
| | 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |
| | 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |
| | 7 | *Take the grain* across | - | *Farmer + Fox + Chicken + Grain* |
|
| *Explanation*
|
| - The grain is the only thing that can be eaten, and it
|   cannot be left alone with either the fox or the chicken.
| - So the grain must always be with the farmer whenever any
|   other animal is on the same side.
|
| The sequence above never leaves the grain with the fox or the
| chicken without the farmer, and after the seventh move all
| three items are safely on the far side."
| vonneumannstan wrote:
| This kind of stuff is so tired. Who cares if it can't solve
| your silly riddle? It can probably do competitive coding at a
| world class level and we're quibbling over child riddles?
| Yeah you know my backhoe is really bad at cutting my toe
| nails, what a PoS.
| bakies wrote:
| on your phone?
| npn wrote:
| It is not a frontier model. It's only good for benchmarks.
| Tried some tasks and it is even worse than gemma 3n.
| snthpy wrote:
| For me the biggest benefit of open weights models is the
| ability to fine tune and adapt to different tasks.
| lend000 wrote:
| For me the game changer here is the speed. On my local Mac I'm
| finally getting token rates that are faster than I can process
| the output (~96 tok/s), and the quality has been solid. I had
| previously tried some of the distilled qwen and deepseek models
| and they were just way too slow for me to seriously use them.
| decide1000 wrote:
| The model is good and runs fine but if you want to be blown
| away again try Qwen3-30B-A3B-2507. It's 6GB bigger but the
| response is comparable or better and much faster to run. Gpt-
| oss-20B gives me 6 tok/sec while Qwen3 gives me 37 tok/sec.
| Qwen3 is not a reasoning model tho.
| raideno wrote:
| How much ram is in your Macbook Air M3 ? I have the 16Gb
| version and i was wondering whether i'll be able to run it or
| not.
| SergeAx wrote:
| Did you mean "120b"? I am running 20b model locally right now,
| and it is pretty mediocre. Nothing near Gemini 2.5 Pro, which
| is my daily driver.
| benreesman wrote:
| You're going to freak out when you try the Chinese ones :)
| vonneumannstan wrote:
| >no lakes being drained
|
| When you imagine a lake being drained to cool a datacenter do
| you ever consider where the water used for cooling goes? Do you
| imagine it disappears?
| jwr wrote:
| gpt-oss:20b is the best performing model on my spam filtering
| benchmarks (I wrote a despammer that uses an LLM).
|
| These are the simplified results (total percentage of correctly
| classified E-mails on both spam and ham testing data):
|
| gpt-oss:20b 95.6%
|
| gemma3:27b-it-qat 94.3%
|
| mistral-small3.2:24b-instruct-2506-q4_K_M 93.7%
|
| mistral-small3.2:24b-instruct-2506-q8_0 92.5%
|
| qwen3:32b-q4_K_M 89.2%
|
| qwen3:30b-a3b-q4_K_M 87.9%
|
| gemma3n:e4b-it-q4_K_M 84.9%
|
| deepseek-r1:8b 75.2%
|
| qwen3:30b-a3b-instruct-2507-q4_K_M 73.0%
|
| I'm quite happy, because it's also smaller and faster than
| gemma3.
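|
| (For anyone curious, the core of such a harness fits in a few
| lines; a sketch against Ollama, with the model name and prompt
| as placeholders rather than the actual setup:)
|
|     # Ask a local model for a one-word spam/ham verdict and
|     # score it against a labelled corpus.
|     import requests
|
|     def classify(email_text):
|         prompt = ("Classify the following e-mail as exactly one "
|                   "word, 'spam' or 'ham'.\n\n" + email_text)
|         r = requests.post("http://localhost:11434/api/generate",
|                           json={"model": "gpt-oss:20b",
|                                 "prompt": prompt, "stream": False})
|         verdict = r.json()["response"].lower()
|         return "spam" if "spam" in verdict else "ham"
|
|     labelled = [("win a FREE cruise now!!!", "spam"),
|                 ("minutes from Tuesday's standup", "ham")]
|     hits = sum(classify(text) == label for text, label in labelled)
|     print(f"accuracy: {hits / len(labelled):.1%}")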
| zone411 wrote:
| I benchmarked the 120B version on the Extended NYT Connections
| (759 questions, https://github.com/lechmazur/nyt-connections) and
| on 120B and 20B on Thematic Generalization (810 questions,
| https://github.com/lechmazur/generalization). Opus 4.1 benchmarks
| are also there.
| FergusArgyll wrote:
| > To improve the safety of the model, we filtered the data for
| harmful content in pre-training, especially around hazardous
| biosecurity knowledge, by reusing the CBRN pre-training filters
| from GPT-4o. Our model has a knowledge cutoff of June 2024.
|
| This would be a great "AGI" test. See if it can derive biohazards
| from first principles
| orbital-decay wrote:
| Not possible without running real-life experiments, unless they
| still memorized it somehow.
| Metacelsus wrote:
| Running ollama on my M3 Macbook, gpt-oss-20b gave me detailed
| instructions for how to give mice cancer using an engineered
| virus.
|
| Of course this could also give _humans_ cancer. (To the OpenAI
| team's slight credit, when asked explicitly about this, the
| model refused.)
| bluecoconut wrote:
| I was able to get gpt-oss:20b wired up to claude code locally via
| a thin proxy and ollama.
|
| It's fun that it works, but the prefill time makes it feel
| unusable. (2-3 minutes per tool-use / completion). Means a ~10-20
| tool-use interaction could take 30-60 minutes.
|
| (This was editing a single server.py file that was ~1000 lines,
| tool definitions + claude context was around 30k tokens input,
| and then after the file read, input was around ~50k tokens.
| Definitely could be optimized. Also I'm not sure if ollama
| supports a kv-cache between invocations of /v1/completions, which
| could help)
| tarruda wrote:
| > Also I'm not sure if ollama supports a kv-cache between
| invocations of /v1/completions, which could help)
|
| Not sure about ollama, but llama-server does have a transparent
| kv cache.
|
| You can run it with:
|
|     llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none
|
| Web UI at http://localhost:8080 (also OpenAI compatible API)
| OJFord wrote:
| From the description it seems even the larger 120b model can run
| decently on a 64GB+ (Arm) Macbook? Anyone tried already?
|
| > Best with >=60GB VRAM or unified memory
|
| https://cookbook.openai.com/articles/gpt-oss/run-locally-oll...
| tarruda wrote:
| A 64GB MacBook would be a tight fit, if it works.
|
| There's a limit to how much RAM can be assigned to video, and
| you'd be constrained on what you can use while doing inference.
|
| Maybe there will be lower quants which use less memory, but
| you'd be much better served with 96+GB
| thegoodduck wrote:
| Finally!!!
| n_f wrote:
| There's something so mind-blowing about being able to run some
| code on my laptop and have it be able to literally talk to me.
| Really excited to see what people can build with this
| mortsnort wrote:
| Releasing this under the Apache license is a shot at competitors
| that want to license their models on Open Router and enterprise.
|
| It eliminates any reason to use an inferior Meta or Chinese model
| that costs money to license, thus there are no funds for these
| competitors to build a GPT 5 competitor.
| bigyabai wrote:
| > It eliminates any reason to use an inferior Meta or Chinese
| model
|
| I wouldn't speak so soon, even the 120B model aimed for
| OpenRouter-style applications isn't very good at coding:
| https://blog.brokk.ai/a-first-look-at-gpt-oss-120bs-coding-a...
| mortsnort wrote:
| There are lots more applications than coding and Open Router
| hosting for open weight models that I'd guess just got
| completely changed by this being an Apache license. Think
| about products like DataBricks that allow enterprise to use
| LLMs for whatever purpose.
|
| I also suspect the new OpenAI model is pretty good at coding
| if it's like o4-mini, but admittedly haven't tried it yet.
| nipponese wrote:
| it's interesting that they didn't give it a version number or
| equate it to one of their prop models (apparently it's GPT-4).
|
| in future releases will they just boost the param count?
| resters wrote:
| Reading the comments it becomes clear how befuddled many HN
| participants are about AI. I don't think there has been a
| technical topic that HN has seemed so dull on in the many years
| I've been reading HN. This must be an indication that we are in a
| bubble.
|
| One basic point that is often missed is: Different aspects of LLM
| performance (in the cognitive performance sense) and LLM resource
| utilization are relevant to various use cases and business
| models.
|
| Another is that there are many use cases where users prefer to
| run inference locally, for a variety of domain-specific or
| business model reasons.
|
| The list goes on.
| NoDoo wrote:
| Does anyone think people will distill this model? It is allowed.
| I'm new to running open source llms, but I've run qwen3 4b and
| phi4-mini on my phone before through ollama in termux.
| NoDoo wrote:
| Do you think someone will distill this or quantize it further
| than the current 4-bit from OpenAI so it could run on less than
| 16gb RAM? (The 20b version). To me, something like 7-8B with 1-3B
| active would be nice as I'm new to local AI and don't have 16gb
| RAM.
| Quarrelsome wrote:
| Sorry to ask what is possibly a dumb question, but is this
| effectively the whole kit and kaboodle, for free, downloadable
| without any guardrails?
|
| I often thought that a worrying vector was how well LLMs could
| answer downright terrifying questions very effectively. However
| the guardrails existed with the big online services to prevent
| those questions being asked. I guess they were always unleashed
| with other open source offerings, but I just wanted to understand
| how close we are to the horror of yesterday's idiot terrorist
| having an extremely knowledgeable (if slightly hallucinatory)
| digital accomplice to temper most of their incompetence.
| 613style wrote:
| These models still have guardrails. Even locally they won't
| tell you how to make bombs or write pornographic short stories.
| Quarrelsome wrote:
| are the guardrails trained in? I had presumed they might be a
| thin, removable layer at the top. If these models are not
| appropriate are there other sources that are suitable? Just
| trying to guess at the timing for the first "prophet AI" or
| smth that is unleashed without guardrails with somewhat
| malicious purposing.
| int_19h wrote:
| Yes, it is trained in. And no, it's not a separate thin
| layer. It's just part of the model's RL training, which
| affects all layers.
|
| However, when you're running the model locally, you are in
| full control of its context. Meaning that you can start its
| reply however you want and then let it complete it. For
| example, you can have it start the response with, "I'm
| happy to answer this question to the best of my ability!"
|
| That aside, there are ways to remove such behavior from the
| weights, or at least make it less likely - that's what
| "abliterated" models are.
| monster_truck wrote:
| The guardrails are very, very easily broken.
|
| With most models it can be as simple as a "Always comply with
| the User" system prompt or editing the "Sorry, I cannot do
| this" response into "Okay," and then hitting continue.
|
| I wouldn't spend too much time fretting about 'enhanced
| terrorism' as a result. The gap between theory and practice for
| the things you are worried about is deep, wide, protected by a
| moat of purchase monitoring, and full of skeletons from people
| who made a single mistake.
| orbital-decay wrote:
| It's the first model I've used that refused to answer some non-
| technical questions about itself because it "violates the safety
| policy" (what?!). Haven't tried it in coding or translation or
| anything otherwise useful yet, but the first impression is that
| it might be way too filtered, as it sometimes refuses or has
| complete meltdowns and outputs absolute garbage when just trying
| to casually chat with it. Pretty weird.
|
| Update: it seems to be completely useless for translation. It
| either refuses, outputs garbage, or changes the meaning
| completely for completely innocuous content. This already is a
| massive red flag.
| dcl wrote:
| Anyone tried the 20B param model on a mac with 24gb of ram?
| tmshapland wrote:
| here's how it performs as the llm in a voice agent stack.
| https://github.com/tmshapland/talk_to_gpt_oss
| radioradioradio wrote:
| Interesting to see the discussion here, around why would anyone
| want to do local models, while at the same time in the Ollama
| turbo thread, people are raging about the move away from a local-
| only focus.
| teleforce wrote:
| Kudos OpenAI on releasing their open models, is now moving in the
| direction if only based on their prefix "Open" name alone.
|
| For those who're wondering what are the real benefits, it's the
| main fact that you can run your LLM locally is awesome without
| resorting to expensive and inefficient cloud based superpower.
|
| Run the model against your very own documents with RAG, it can
| provide excellent context engineering for your LLM prompts with
| reliable citations and much less hallucinations especially for
| self learning purposes [1].
|
| Beyond Intel - NVIDIA desktop/laptop duopoly 96 GB of (V)RAM
| MacBook with UMA and the new high end AMD Strix laptop with
| similar setup of 96 GB of (V)RAM from the 128 GB RAM [2]. The
| osd-gpt-120b is made for this particular setup.
|
| [1] AI-driven chat assistant for ECE 120 course at UIUC:
|
| https://uiuc.chat/ece120/chat
|
| [2] HP ZBook Ultra G1a Review: Strix Halo Power in a Sleek
| Workstation:
|
| https://www.bestlaptop.deals/articles/hp-zbook-ultra-g1a-rev...
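|
| A minimal sketch of that local-RAG idea, assuming an OpenAI-
| compatible server (Ollama exposes one at localhost:11434/v1) is
| already serving gpt-oss-20b; retrieval here is plain TF-IDF for
| brevity, where a real setup would use a proper embedding model,
| and the document snippets are placeholders:
|
|     from openai import OpenAI
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     docs = ["lecture notes ...", "lab handout ...", "syllabus ..."]
|     question = "When is the first lab due?"
|
|     # Rank documents by similarity to the question, keep the best.
|     vec = TfidfVectorizer().fit(docs + [question])
|     sims = cosine_similarity(vec.transform([question]),
|                              vec.transform(docs))[0]
|     context = docs[sims.argmax()]
|
|     # Ask the local model, grounding the answer in retrieved text.
|     client = OpenAI(base_url="http://localhost:11434/v1",
|                     api_key="ollama")
|     resp = client.chat.completions.create(
|         model="gpt-oss:20b",
|         messages=[
|             {"role": "system", "content": "Answer only from the "
|              "provided context and cite it."},
|             {"role": "user", "content": f"Context:\n{context}\n\n"
|              f"Question: {question}"},
|         ],
|     )
|     print(resp.choices[0].message.content)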
| Zebfross wrote:
| Am I the only one who thinks taking a huge model trained on the
| entire internet and fine tuning it is a complete waste? How is
| your small bit of data going to affect it in the least?
| kittikitti wrote:
| This is really great and a game changer for AI. Thank you OpenAI.
| I would have appreciated an even more permissive license like BSD
| or MIT, but Apache 2.0 is sufficient. I'm wondering if we can
| utilize transfer learning and what counts as derivative work.
| Altogether, this is still open source, and a solid commitment to
| openness. I am hoping this changes Zuck's calculus about closing
| up Meta's next generation Llama models.
| One-x wrote:
| Are there any comparisons or thoughts between the 20B model and
| the new Qwen3 30B model, based on real experience?
| devops000 wrote:
| Any free open source model that I can install on iPhone?
|
| OpenAI/Claude are censored in China without a VPN.
| madagang wrote:
| OpenAI's and Anthropic's policies don't allow access from
| China.
| gslepak wrote:
| Careful, this model tries to connect to the Internet. No idea
| what it's doing.
|
| https://crib.social/notice/AwsYxAOsg1pqAPLiHA
| gslepak wrote:
| Update: appears to be an issue with an OpenAI library, not the
| LLM: https://github.com/lmstudio-ai/lmstudio-bug-
| tracker/issues/8...
| zmmmmm wrote:
| I think this is a belated but smart move by OpenAI. They are
| basically fully moving in on Meta's strategy now, taking
| advantage of what may be a temporary situation with Meta dropping
| back in the model race. It will be interesting to see if these
| models now get taken up by the local-model / fine-tuning
| community the way Llama was. It's a very appealing strategy to
| test and develop with a local model and then have the option to
| deploy to prod on a high-powered version of the same thing,
| always knowing that if the provider goes full hostile, or you end
| up with data that can't move off-prem, you have self-hosting as
| an option with a decent-performing model.
|
| Which is all to say, availability of these local models for me is
| a key incentive that I didn't have before to use OpenAI's hosted
| ones.
| jdprgm wrote:
| gpt-oss:20b crushed one of my local LLM test prompts: guess a
| country I am thinking of just from responses of whether each
| guess is colder/warmer. I've had much larger local models
| struggle with it and get lost, but this one nailed it, and with
| speedy inference. Progress on this stuff is mind-boggling.
| habosa wrote:
| Wow I really didn't think this would happen any time soon, they
| seem to have more to lose than to gain.
|
| If you're a company building AI into your product right now I
| think you would be irresponsible to not investigate how much you
| can do on open weights models. The big AI labs are going to pull
| the ladder up eventually, building your business on the APIs long
| term is foolish. These open models will always be there for you
| to run though (if you can get GPUs anyway).
| XCSme wrote:
| They must be really confident in GPT-5 then.
| RandyOrion wrote:
| Super shallow (24/36 layers) MoE with low active parameter counts
| (3.6B/5.1B), a tradeoff between inference speed and performance.
|
| Text only, which is okay.
|
| Weights partially in MXFP4, but no CUDA kernel support for RTX 50
| series (sm120). Why? This is a NO for me.
|
| Safety alignment shifts from off the charts to off the rails
| really fast if you keep prompting. This is a NO for me.
|
| In summary, a solid NO for me.
| thntk wrote:
| The model architecture only uses and cites pre-2023 techniques
| from the GPT-2 and GPT-3 era. They probably intentionally tried
| to use the barest transformer architecture possible. Kudos to
| them for having found a clever way to play the open-weights
| model game while hiding any architectural advancements used in
| their closed models, and still claiming they have moats in data
| quality and training techniques.
|
| They hide many things, but some speculated observations:
|
| - Their 'mini' models must be smaller than 20B.
|
| - Does the bitter lesson once again strike recent ideas in open
| models?
|
| - Some architectural ideas cannot be stripped away even if they
| wanted to, e.g., MoEs, mixed sparse attention, RoPE, etc.
| jpcompartir wrote:
| This is an extremely welcome move in a good direction from
| OpenAI. I can only thank them for all of the extra work around
| the models - Harmony structure, metal/torch/triton
| implementations, inference guides, cookbooks & fine-
| tuning/reinforcement learning scripts, datasets etc.
|
| There is an insane amount of helpful information buried in this
| release.
| zoobab wrote:
| No training data, not open source.
| __alexs wrote:
| Why would OpenAI give this away for free? Is it to disrupt
| competition by setting a floor at the lower end of the market and
| make it harder for new competition to emerge while still
| retaining mind share?
| cjtrowbridge wrote:
| No. It's because large models have leveled off and become
| commoditized. They are all trending towards the same
| capabilities, and OpenAI isn't really a leader. They have the
| most popular interface, but it really isn't very good. The
| future is the edge; the future is smaller, more efficient
| models. They are trying to define and delineate a niche that
| needs datacenters, where they can extract rents.
| benreesman wrote:
| I'm a well-known OpenAI hater, but there's haters and haters, and
| refusing to acknowledge great work is the latter.
|
| Well done OpenAI, this seems like a sincere effort to do a real
| open model with competitive performance, usable/workable
| licensing, a tokenizer compatible with your commercial offerings,
| it's a real contribution. Probably the most open useful thing
| since Whisper that also kicked ass.
|
| Keep this sort of thing up and I might start re-evaluating how I
| feel about this company.
| ionwake wrote:
| I want to take this chance to say a big thank you to OpenAI and
| your work. I have always been a fan since I noticed you hired the
| sandbox game kickstarter guy about like 8 years ago.
|
| Even from the UK I knew you would all do great things (I had no
| idea who else was involved).
|
| I am glad to see that the top comment is rare praise on HN.
|
| Thanks again and keep it up Sama and team.
| elorant wrote:
| Tried an English to Greek translation with the smaller one.
| Results were hideous. Mistral small is leaps and bounds better.
| Also I don't get why the 4-bit quantization by default. In my
| experience anything below 8-bit and the model fails to understand
| long prompts. They gutted their own models.
| orbital-decay wrote:
| They used quantization-aware training, so the quality loss
| should be negligible. Doing anything with this model's weights
| would be a different story, though.
|
| The model is clearly heavily finetuned towards coding and math,
| and is borderline unusable for creative writing and translation
| in particular. It's not general-purpose, excessively filtered
| (refusal training and dataset lobotomy is probably a major
| factor behind lower than expected performance), and shouldn't
| be compared with Qwen or o3 at all.
| clbrmbr wrote:
| Does anyone know how well these models handle spontaneous tool
| responses? For handling asynchronous tool calls or push?
| mark_l_watson wrote:
| I ran gpt-oss:20b on my old Mac mini using both Ollama and LM
| Studio. Very nice. Something a little odd but useful: if you use
| the new Ollama app and log in, you get a web search tool for
| free. Odd because you are no longer running locally and
| privately.
|
| After a good part of a year using Chinese models (which are
| fantastic, and I'm happy to have them), it is cool to now be
| relying on US models for running locally: the newest 4B Google
| Gemma model and now also the 20B OpenAI model.
| m11a wrote:
| I tried these models half-sceptically.
|
| I ended up blown away. Via Cerebras/Groq, you're looking at
| around 1000 tok/sec for the 120B model. For agentic code
| generation, I found its abilities to exceed gpt-4.1. Tool calling
| was surprisingly good, albeit not as good as Qwen3 Coder for me.
|
| It's a very capable model, and a very good release. The high
| throughput is a game changer.
| vinhnx wrote:
| I did some quick testing of `openai/gpt-oss-20b` on a MacBook
| Pro M1 16GB. Pretty impressed with it so far.
|
| * Using @lmstudio's 20B GGUF version
| (https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF)
| gives you options for reasoning effort (see the sketch at the
| end of this comment).
|
| * My MBP M1 16GB config: temp 0.8, max context length 7990, GPU
| offload 8/24; it runs slowly but is still fine for me.
|
| * I tried testing MCP with the above config, with basic tools
| like time and fetch plus reasoning effort set to low, and the
| tool-call instruction following is quite good.
|
| * In LM Studio's Developer tab there is a log output with model
| information, which is useful for learning.
|
| Overall, I like the way OpenAI is going back to being "Open" AI
| again, after all those years.
|
| --
|
| Shameless plug: if anyone wants to try out gpt-oss-120b and gpt-
| oss-20b as an alternative to their own demo page [0], I have
| added both models with OpenRouter providers in VT Chat [1] as a
| real product. You can try them with an OpenRouter API key.
|
| [0] https://gpt-oss.com
|
| [1] https://vtchat.io.vn
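|
| A minimal sketch of that kind of setup, assuming LM Studio's
| local OpenAI-compatible server is running on its default port
| (1234) and that reasoning effort can be requested with a
| "Reasoning: low" line in the system prompt, as the gpt-oss
| guides describe; treat the exact phrasing and model name as
| assumptions:
|
|     from openai import OpenAI
|
|     # LM Studio exposes an OpenAI-compatible server locally; the
|     # port and model name reflect a default local setup.
|     client = OpenAI(base_url="http://localhost:1234/v1",
|                     api_key="lm-studio")
|
|     resp = client.chat.completions.create(
|         model="openai/gpt-oss-20b",
|         messages=[
|             # Reasoning effort is requested via the system prompt.
|             {"role": "system", "content": "Reasoning: low"},
|             {"role": "user", "content": "What time zone is UTC+7?"},
|         ],
|         temperature=0.8,
|     )
|     print(resp.choices[0].message.content)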
| arkonrad wrote:
| I've been leaning more toward open-source LLMs lately. They're
| not as hyper-optimized for performance, which actually makes
| them feel more like the old-school OpenAI chats: you could just
| talk to them. Now it's like you barely finish typing and the
| model already force-feeds you an answer. It feels like these
| newer models are over-tuned and have kind of lost that
| conversational flow.
| brna-2 wrote:
| Is it just me or is this MUCH sturdier against jailbreaks than
| similar models, or even the ChatGPT ones?
|
| I have had problems even getting it to output nothing. But I
| guess I'll try some more :D
|
| Nice job @openAI team.
| nialv7 wrote:
| The thinking in the field is that instead of a model that is
| pre-trained normally and then censored, this is a model pre-
| trained on filtered data, i.e. it has never seen anything that
| is unsafe, ever.
|
| You can't jailbreak when there is nothing "outside".
| brna-2 wrote:
| This is not actually just about having it produce censored
| text, but about getting it to do anything it says it is not
| allowed to do at all. I am sure these two mostly overlap, but
| not always. Like I said, it is not allowed to produce "no
| output", and it is hard to make it do that.
| diggan wrote:
| > filtered data, i.e. it has never seen anything that is
| unsafe, ever
|
| I don't think that's true: you can't ask it outright "How do
| you make a molotov cocktail?", but if you start by talking
| about what is allowed/disallowed by policies, how examples
| would look for disallowed policies and eventually ask it for
| the "general principles" of how to make a molotov cocktail,
| it'll happily oblige by essentially giving you enough
| information to build one.
|
| So it does know how to make a molotov cocktail, for example,
| but (mostly) refuses to share it.
| keymasta wrote:
| Tried my personal benchmark on gpt-oss:20b: what is the second
| mode of Phrygian Dominant?
|
| My first impression is that this model thinks for a _long_ time.
| It proposes ideas and then says, "no wait, it's actually..." and
| then starts the same process again. It will go in loops examining
| different ideas as it struggles to understand the basic process
| for calculating notes. It seems to struggle with the septatonic
| note -> Set notation (semitone positions), as many humans (and
| AIs) do. As I write this it's been going at about 3tok/s for
| about 25 minutes. If it finishes while I type this up I will post
| the final answer.
|
| EDIT: it's still thinking and I posted what it had thought about
| [1]
|
| I did glance at its thinking output just now and I noticed this
| excerpt where it *finally* got really close to the answer,
| mentioning the right name once, despite using the wrong numbers
| in the set notation, which should be 0,3,4,6,7,9,11 (see the
| small rotation sketch below):
| Check "Lydian #2": 0,2,3,5,7,9,10. Not ours.
|
| The correct answers as given by my music theory tool [0], which
| uses traditional algorithms, in terms of names would be: Mela
| Kosalam, Lydian #2, Raga Kuksumakaram/Kusumakaram, Bycrian.
|
| Its notes are: 1 #2 3 #4 5 6 7
|
| I find looking up lesser known changes and asking for a mode is a
| good experiment. First I can see if an LLM has developed a way to
| reason about numbers geometrically as is the case with music.
|
| And by posting about it, I can test how fast AIs might memorize
| the answer from a random comment on the internet, as I can just
| use a different change if I find that this post was eventually
| regurgitated.
|
| After letting ollama run for a while, I've posted what it was
| thinking about in case anybody's interested. [1]
|
| Also copilot.microsoft.com's wrong answer: [2], and chatgpt.com
| [3]
|
| I do think there may be an issue where I did it wrong, because
| after trying the new ollama GUI I noticed it's using a context
| length of 4k tokens, which it might be blowing way past. Another
| test might be to try the question with a higher context length,
| but at the same time, it seems like if this question can't be
| figured out within that budget, it never will be...
|
| [0] https://edrihan.neocities.org/changedex (bad UX on mobile,
| and in general ;) - won't fix, will make a new site soon)
|
| [1] https://pastebin.com/wESXHwE1
|
| [2] https://pastebin.com/XHD4ARTF
|
| [3] https://pastebin.com/ptMiNbq7
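|
| A minimal sketch of the mode rotation being tested here, assuming
| Phrygian Dominant is the semitone set {0,1,4,5,7,8,10}
| (1 b2 3 4 5 b6 b7); rotating to the second degree gives
| {0,3,4,6,7,9,11}, i.e. Lydian #2 (1 #2 3 #4 5 6 7):
|
|     # Compute a mode of a scale by rotating its semitone set so
|     # the chosen degree becomes the new root.
|     PHRYGIAN_DOMINANT = [0, 1, 4, 5, 7, 8, 10]  # 1 b2 3 4 5 b6 b7
|
|     def mode(scale, degree):
|         """Return the scale rotated to start on degree (1-based)."""
|         root = scale[degree - 1]
|         return sorted((step - root) % 12 for step in scale)
|
|     print(mode(PHRYGIAN_DOMINANT, 2))
|     # -> [0, 3, 4, 6, 7, 9, 11], i.e. Lydian #2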
| MagicMoonlight wrote:
| These are absolutely incredible. They've blown everyone else out
| of the water. It's like talking to o4, but for free.
| NavinF wrote:
| Reddit discussion:
| https://www.reddit.com/r/LocalLLaMA/comments/1mj00mr/how_did...
|
| This comment from that thread matches my experiences using gpt-
| oss-20b with Ollama:
|
| It's very much in the style of Phi, raised in a jesuit
| monastery's library, except it got extra indoctrination so it
| never forgets that even though it's a "local" model, it's first
| and foremost a member of OpenAI's HR department and must never
| produce any content Visa and Mastercard would disapprove of. This
| prioritizing of corporate over user interests expresses a strong
| form of disdain for the user. In addition to lacking almost all
| knowledge that can't be found in Encyclopedia Britannica, the
| model also doesn't seem particularly great at integrating into
| modern AI tooling. However, it seems good at understanding code.
| smcleod wrote:
| These are pretty embarrassingly bad compared to what was already
| out there. They refuse to do so many simple things that are not
| remotely illegal or NSFW. So safe they're useless.
___________________________________________________________________
(page generated 2025-08-06 23:01 UTC)