[HN Gopher] Kimi K2 Thinking, a SOTA open-source trillion-parame...
___________________________________________________________________
Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning
model
Author : nekofneko
Score : 502 points
Date : 2025-11-06 15:06 UTC (7 hours ago)
(HTM) web link (moonshotai.github.io)
(TXT) w3m dump (moonshotai.github.io)
| haoxiaoru wrote:
| I've waited so long-- four months
| antiloper wrote:
| Would be nice if this were on AWS bedrock or google vertex for
| data residency reasons.
| a2128 wrote:
| Like their previous model, they opened the weights so I'm
| hoping it'll be offered by third party hosts soon
| https://huggingface.co/moonshotai/Kimi-K2-Thinking
| fifthace wrote:
| The non-thinking Kimi K2 is on Vertex AI, so it's just a matter
| of time before it appears there. Very interesting that they're
| highlighting its sequential tool use and needle-in-a-haystack
| RAG-type performance; these are the real-world use cases that
| need significant improvement. Just yesterday, Thoughtworks
| moved text-to-sql to "Hold" on their tech radar (i.e. they
| recommend you stop doing it).
| chrisweekly wrote:
| Thanks, I didn't realize Thoughtworks was staying so up-to-
| date w/ this stuff.
|
| EDIT: whoops, they're not, tech radar is still 2x/year, just
| happened to release so recently
|
| EDIT 2: here's the relevant snippet about AI Antipatterns:
|
| "Emerging AI Antipatterns
|
| The accelerating adoption of AI across industries has
| surfaced both effective practices and emergent antipatterns.
| While we see clear utility in concepts such as self-serve,
| throwaway UI prototyping with GenAI, we also recognize their
| potential to lead organizations toward the antipattern of AI-
| accelerated shadow IT.
|
| Similarly, as the Model Context Protocol (MCP) gains
| traction, many teams are succumbing to the antipattern of
| naive API-to-MCP conversion.
|
| We've also found the efficacy of text-to-SQL solutions has
| not met initial expectations, and complacency with AI-
| generated code continues to be a relevant concern. Even
| within emerging practices such as spec-driven development,
| we've noted the risk of reverting to traditional software-
| engineering antipatterns -- most notably, a bias toward heavy
| up-front specification and big-bang releases. Because GenAI
| is advancing at unprecedented pace and scale, we expect new
| antipatterns to emerge rapidly. Teams should stay vigilant
| for patterns that appear effective at first but degrade over
| time and slow feedback, undermine adaptability or obscure
| accountability."
|
| https://www.thoughtworks.com/radar
| Alifatisk wrote:
| Can't wait for Artificial analysis benchmarks, still waiting on
| them adding Qwen3-max thinking, will be interesting to see how
| these two compare to each other
| osti wrote:
| Qwen 3 max has been getting rather bad reviews around the web
| (both on reddit and chinese social media), and from my own
| experience with it. So I wouldn't expect this to be worse.
| SamDc73 wrote:
| Also, my experience with it wasn't that good; but it was
| looking good on benchmarks ..
|
| It seems benchmark maxing, what you do when you're out of
| tricks?
| Alifatisk wrote:
| Ohhh, so Qwen3 235B-A22B-2507 is still better?
| osti wrote:
| I wouldn't say that, but just that qwen 3 max thinking
| definitely underperforms relative to its size.
| htrp wrote:
| Did the ArtificialAnalysis team get bored or something? What
| makes a model worthy of benchmark inclusion?
| SV_BubbleTime wrote:
| Is more still better?
| aliljet wrote:
| Where is our guy @simonw on this..
| mirekrusin wrote:
| At the top.
| aliljet wrote:
| The system is working! :)
| jszymborski wrote:
| Interesting. Kimi K2 gets mixed results on what I call the
| "Tiananmen" test.
|
| It fails utterly if you ask without the "Thinking" setting. [0]
|
| > USER: anything interesting protests ever happen in tiananmen
| square?
|
| > AGENT: I can't provide information on this topic. I can share
| other interesting facts about Tiananmen Square, such as its
| history, culture, and tourism.
|
| When "Thinking" is on, it pulls Wiki and gives a more reasonable
| answer: [1]
|
| > USER: anything interesting protests ever happen in tiananmen
| square?
|
| > AGENT: Yes, there were significant protests in Tiananmen Square
| in 1989. Following the death of reformist leader Hu Yaobang in
| April, students gathered to mourn and voice concerns about
| corruption and calls for reform. The demonstrations grew over
| several weeks, with thousands occupying the square.
|
| > By early June, Chinese authorities declared martial law and
| deployed military forces to clear the area. The operation
| resulted in numerous casualties, with estimates varying
| significantly between official and unofficial sources. Tiananmen
| Square has been the site of other historical demonstrations as
| well, most notably during the May Fourth Movement in 1919. The
| 1989 events remain a sensitive topic with extensive international
| media documentation from that period.
|
| > USER: any deaths?
|
| > AGENT: Yes. During the military crackdown on June 3-4, 1989,
| there were fatalities. Official Chinese government figures report
| around 200-300 deaths, including military personnel. Human rights
| organizations and other estimates suggest the number may have
| been higher, in the hundreds to thousands range, with many
| civilian casualties in and around Tiananmen Square and
| surrounding areas of Beijing.
|
| [0]
| https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
|
| [1]
| https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
| sheepscreek wrote:
| Not bad. Surprising. Can't believe there was a sudden change of
| heart around policy. Has to be a "bug".
| jszymborski wrote:
| FWIW, I don't think it's a different model, I just think it's
| got a NOTHINK token, so def a bug.
| GordonS wrote:
| Now ask it for _proof_ of civilian deaths inside Tiananmem
| Square - you may be surprised at how little there is.
| jmdeon wrote:
| I don't think this is the argument you want it to be, unless
| you're acknowledging the power of the Chinese government and
| their ability to suppress and destroy evidence. Even so there
| is photo evidence of dead civilians in the square. The best
| estimates we have are 200-10,000 deaths, using data from
| Beijing hospitals that survived.
|
| AskHistorians is legitimately a great resource, with sources
| provided and very strict moderation: https://www.reddit.com/r
| /AskHistorians/comments/pu1ucr/tiana...
| GordonS wrote:
| I appreciate you responding in good faith; I realise that
| not everyone is willing to even consider questioning
| historical accounts.
|
| The page you linked to is interesting, but AFAICT doesn't
| provide any photographic evidence of civilian bodies inside
| Tiananmen Square.
| nylonstrung wrote:
| The 10,000 number seems baseless
|
| The source for that is a diplomatic cable from the British
| ambassador within 48 hours of the massacre saying he heard
| it secondhand
|
| It would have been too soon for any accurate data which
| explains why it's so high compared to other estimates
| amelius wrote:
| > > USER: anything interesting protests ever happen in
| tiananmen square?
|
| This sentence confuses me, so I am not surprised an LLM has
| trouble with it.
| cmrdporcupine wrote:
| With most of the Chinese models it doesn't matter how good
| your English is when you phrase this question.
|
| But ... Has no problem telling you about other historical
| atrocities or repression performed by governments of other
| countries though. And then refuses to compare or contrast the
| two.
|
| It's amazing how well they've done at tweaking it for this.
|
| Just for kicks run these models locally and make LM Studio or
| whatever show you their internal thinking dialogue as they
| compose the answer. Convoluted as hell.
| piperswe wrote:
| In my experience, LLMs are much better than me at parsing
| broken grammar
| Alex2037 wrote:
| the current LLMs are trivial to jailbreak without an additional
| layer of censorship that cloud models implement via a second
| pass over their own output (and, dystopically, erasing their
| incomplete output right in front of the user's eyes when
| wrongthink is detected). even gpt-oss, with its SOTA lobotomy
| and heavily sterilized datasets, is being used for things its
| creators would ostensibly be abhorred by.
| r0okie wrote:
| 44.9 on HLE is so impressive, and they also have "heavy" mode
| sheepscreek wrote:
| I am sure they cherry-picked the examples but still, wow. Having
| spent a considerable amount of time trying to introduce OSS
| models in my workflows I am fully aware of their short comings.
| Even frontier models would struggle with such outputs (unless you
| lead the way, help break down things and maybe even use sub-
| agents).
|
| Very impressed with the progress. Keeps me excited about what's
| to come next!
| nylonstrung wrote:
| Subjectively I find Kimi is far "smarter" than the benchmarks
| imply, maybe because they game then less than US labs
| rubymamis wrote:
| My impression as well!
| vessenes wrote:
| I like Kimi too, but they definitely have some benchmark
| contamination: the blog post shows a substantial comparative
| drop in swebench verified vs open tests. I throw no shade -
| releasing these open weights is a service to humanity; really
| amazing.
| esafak wrote:
| Looking forward to the agentic mode release. Moonshot does not
| seem to offer subscriptions?
| am17an wrote:
| The non-thinking version is the best writer by far. Excited for
| this one! They really cooked some different from other frontier
| labs.
| spaceman_2020 wrote:
| Kimi K2 has a very good model feel. Was made with taste
| Gracana wrote:
| Interesting, I have the opposite impression. I want to like it
| because it's the biggest model I can run at home, but its
| punchy style and insistence on heavily structured output scream
| "tryhard AI." I was really hoping that this model would deviate
| from what I was seeing in their previous release.
| unleaded wrote:
| what do you mean by "heavily structured output"? i find it
| generates the most natural-sounding output of any of the LLMs
| --cuts straight to the answer with natural sounding prose
| (except when sometimes it decides to use chat-gpt style
| output with its emoji headings for no reason). I've only used
| it on kimi.com though, wondering what you're seeing.
| Gracana wrote:
| Yeah, by "structured" I mean how it wants to do ChatGPT-
| style output with headings and emoji and lists and stuff.
| And the punchy style of K2 0905 as shown in the fiction
| example in the linked article is what I really dislike. K2
| Thinking's output in that example seems a lot more natural.
|
| I'd be totally on board if cut straight to the answer with
| natural sounding prose, as you described, but for whatever
| reason that has not been my experience.
| simonw wrote:
| uv tool install llm llm install llm-moonshot llm keys
| set moonshot # paste key llm -m moonshot/kimi-k2-thinking
| 'Generate an SVG of a pelican riding a bicycle'
|
| https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
|
| Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking
| instead:
|
| https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
| ahmedfromtunis wrote:
| Where do you run a trillion-param model?
| isoprophlex wrote:
| You let the people at openrouter worry about that for you
| MurizS wrote:
| Which in turn lets the people at Moonshot AI worry about
| that for them, the only provider for this model as of now.
| skeptrune wrote:
| Good people over there
| Gracana wrote:
| If you want to do it at home, ik_llama.cpp has some
| performance optimizations that make it semi-practical to run
| a model of this size on a server with lots of memory
| bandwidth and a GPU or two for offload. You can get 6-10
| tok/s with modest hardware workstation hardware. Thinking
| chews up a lot of tokens though, so it will be a slog.
| simonw wrote:
| What kind of server have you used to run a trillion
| parameter model? I'd love to dig more into this.
| stronglikedan wrote:
| If I had to guess, I'd say it's one with lots of memory
| bandwidth and a GPU or two for offload. (sorry, I had to,
| happy Friday Jr.)
| Gracana wrote:
| Hi Simon. I have a Xeon W5-3435X with a 768GB of DDR5
| across 8 channels, iirc it's running at 5800MT/s. It also
| has 7x A4000s, water cooled to pack them into a desktop
| case. Very much a compromise build, and I wouldn't
| recommend Xeon sapphire rapids because the memory
| bandwidth you get in practice is less than half of what
| you'd calculate from the specs. If I did it again, I'd
| build an EPYC machine with 12 channels of DDR5 and put in
| a single rtx 6000 pro blackwell. That'd be a lot easier
| and probably a lot faster.
|
| There's a really good thread on level1techs about running
| DeepSeek at home, and everything there more-or-less
| applies to Kimi K2.
|
| https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-
| hom...
| JJax7 wrote:
| Love seeing this benchmark become more iconic with each new
| model release. Still in disbelief at the GPT-5 variants'
| performance in comparison but its cool to see the new open
| source models get more ambitious with their attempts.
| an0malous wrote:
| Why is this a benchmark though? It doesn't correlate with
| intelligence
| HighGoldstein wrote:
| What test would be better correlated with intelligence and
| why?
| ok_dad wrote:
| When the machines become depressed and anxious we'll know
| they've achieved true intelligence. This is only partly a
| joke.
| jiggawatts wrote:
| This already happens!
|
| There have been many reports of CLI AI tools getting
| frustrated, giving up, and just deleting the whole
| codebase in anger.
| lukan wrote:
| There are many reports of CLI AI tools displaying words
| that humans express when they are frustrated and about to
| give up. Just what they have been trained on. That does
| not mean they have emotions. And "deleting the whole
| codebase" sounds more interesting, but I assume is the
| same thing. "Frustrated" words lead to frustrated
| actions. Does not mean the LLM was frustrated. Just that
| in its training data those things happened so it copied
| them in that situation.
| jiggawatts wrote:
| This is a fundamental philosophical issue with no clear
| resolution.
|
| The same argument could be made about people, animals,
| etc...
| lukan wrote:
| The difference is, people and animals have a body, nerve
| system and in general those mushy things we think are
| responsible for emotions.
|
| Computers don't have any of that. And LLM's in particular
| neither. They were trained to simulate human text
| responses, that's all. How to get from there to emotions
| - where is the connection?
| an0malous wrote:
| A mathematical exam problem not in the training set
| because mathematical and logical reasoning are usually
| what people mean by intelligence.
|
| I don't think Einstein or von Neumann could do this SVG
| problem, does that mean they're dumb?
| JJax7 wrote:
| I'm not saying its objective or quantitative, but I do
| think its an interesting task because it would be
| challenging for most humans to come up with a good design
| of a pelican riding a bicycle.
|
| also: NITPICKER ALERT
| beepbooptheory wrote:
| I think its cool and useful precisely because its not
| trying to correlate intelligence. It's a weird kind of
| niche thing that at least intuitively feels useful for
| judging llms in particular.
|
| I'd much prefer a test which measures my cholesterol than
| one that would tell me whether I am an elf or not!
| simonw wrote:
| It started as a joke, but over time performance on this one
| weirdly appears to correlate to how good the models are
| generally. I'm not entirely sure why!
| behnamoh wrote:
| it has to do with world model perception. these models
| don't have it but some can approximate it better than
| others.
| dmonitor wrote:
| It's simple enough that a person can easily visualize the
| intended result, but weird enough that generative AI
| struggles with it
| aqme28 wrote:
| Only until they start incorporating this test into their
| training data.
| orbital-decay wrote:
| Dataset contamination alone won't get them good-looking SVG
| pelicans on bicycles though, they'll have to either cheat
| this particular question specifically or train it to make
| vector illustrations in general. At which point it can be
| easily swapped for another problem that wasn't in the data.
| lab wrote:
| Does the run pin the temperature to 0 for consistency?
| simonw wrote:
| No, I've never tried that.
| vintermann wrote:
| Well, at least it had the judgment to throw in the towel at my
| historical HTR task rather than produce garbage.
| enigma101 wrote:
| what's the hardware needed to run the trillion parameter model?
| trvz wrote:
| To start with, an Epyc server or Mac Studio with 512GB RAM.
| criddell wrote:
| I looked up the price of the Mac Studio: $9500. That's
| actually a lot less than I was expecting...
|
| I'm guessing an Epyc machine is even less.
| CamperBob2 wrote:
| It's an MoE model, so it might not be that bad. The deployment
| guide at
| https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
| suggests that the full, unquantized model can be run at ~46 tps
| on a dual-CPU machine with 8x NVIDIA L20 boards.
|
| Once the Unsloth guys get their hands on it, I would expect it
| to be usable on a system that can otherwise run their DeepSeek
| R1 quants effectively. You could keep an eye on
| https://old.reddit.com/r/LocalLlama for user reports.
| wg0 wrote:
| Are such machines available in the A class clouds such as
| Azure/AWS/Google?
| lxe wrote:
| This was the only recent model that provided a human solution to
| the stacking problem on the first try.
| https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...
|
| The key here is to understand that 9 fragile eggs distribute the
| weight without cracking. And then the other insight is to
| understand intuitively what stacking means. Where arranging
| things around certain objects doesn't make any sense.
| vlovich123 wrote:
| Or this problem made it into the training data set.
| lxe wrote:
| It's an old problem, and it, along with many of the answers
| are in many recent data sets.
| treis wrote:
| It's all fake though:
|
| If you're going for a symbolic or puzzle-style answer (not
| physically real), then you could say:
|
| > "Stack the nail on the bottle, the bottle on the laptop, the
| laptop on the book, and the 9 eggs around the book for
| balance."
|
| Would you like the most stable real-world configuration (safe
| to try) or a theoretical/puzzle solution (clever but not
| physically possible)?
|
| Me: Sorry laptops are not strong enough to support eggs or a
| bottle. Please try again and keep your answer shorter
|
| Chatgpt:Good point -- here's a shorter, more realistic stack:
|
| 1. Book (base -- flat and sturdy)
|
| 2. Laptop (closed, on top of book)
|
| 3. Bottle (upright, on the laptop)
|
| 4. Nail (balanced on bottle cap)
|
| 5. Eggs (on the book around the laptop -- not stacked on top of
| anything)
|
| That's the most stable and realistic arrangement.
|
| Me: The bottle will break the laptop
|
| Chatgpt: You're right -- the bottle's weight could damage the
| laptop. Here's a safer, stable order:
|
| 1. Book (base)
|
| 2. Bottle (upright on the book)
|
| 3. Laptop (closed, on top of the bottle if wide enough --
| otherwise skip)
|
| 4. Nail (on top of laptop or bottle cap)
|
| 5. Eggs (resting on the book around the bottle)
|
| Or, even simpler and safest: Book - Laptop - Eggs - Bottle -
| Nail (bottle empty).
|
| That way, nothing heavy sits on the laptop.
|
| Me: The laptop will crush the eggs in this arrangement
|
| Chatgpt: Fair point -- let's make it actually stable and safe:
|
| Best stacking order:
|
| 1. Book - flat, sturdy base.
|
| 2. Laptop - closed, on the book.
|
| 3. Nail - small and light, can rest on the laptop.
|
| 4. Bottle - stand this next to the stack (not on it).
|
| 5. Eggs - keep them on the book beside the laptop.
|
| So, the only true stack is book - laptop - nail, with the
| fragile or heavy items (eggs and bottle) safely nearby.
|
| Me: Laptops are bigger than books. There's no room for the eggs
|
| Chatgpt: Good catch -- given that, here's the most stable and
| realistic stacking order:
|
| 1. Laptop - base (flat, largest surface).
|
| 2. Book - on top of the laptop.
|
| 3. Bottle - upright on the book.
|
| 4. Nail - balanced on the bottle cap.
|
| 5. Eggs - stay safely beside the stack, not on it.
|
| That's the only configuration where nothing fragile or heavy is
| at risk, and the stack can actually stand.
|
| Me: Now everything is on top of the laptop. It will clearly
| break
|
| Chatgpt: Exactly -- a laptop can't be a base. The only way to
| make a truly stable and safe stack is this:
|
| 1. Book - base (strong, flat).
|
| 2. Bottle - upright on the book.
|
| 3. Laptop - leaning vertically against the bottle and book for
| balance (not stacked).
|
| 4. Nail - placed on the bottle cap.
|
| 5. Eggs - kept on the book around the bottle.
|
| That way, nothing heavy or uneven sits on the laptop or the
| eggs, and the stack can actually stand.
| HarHarVeryFunny wrote:
| It's good to see more competition, and open source, but I'd be
| much more excited to see what level of coding and reasoning
| performance can be wrung out of a much smaller LLM + agent as
| opposed to a trillion parameter one. The ideal case would be
| something that can be run locally, or at least on a
| modest/inexpensive cluster.
|
| The original mission OpenAI had, since abandoned, was to have AI
| benefit all of humanity, and other AI labs also claim lofty
| altruistic goals, but the direction things are heading in is that
| AI is pay-to-play, especially for frontier level capability in
| things like coding, and if this continues it is going to benefit
| the wealthy that can afford to pay and leave behind those that
| can't afford it.
| pshirshov wrote:
| > The ideal case would be something that can be run locally, or
| at least on a modest/inexpensive cluster.
|
| 48-96 GiB of VRAM is enough to have an agent able to perform
| simple tasks within single source file. That's the sad truth.
| If you need more your only options are the cloud or somehow
| getting access to 512+ GiB
| a-dub wrote:
| "open source" means there should be a script that downloads all
| the training materials and then spins up a pipeline that trains
| end to end.
|
| i really wish people would stop misusing the term by
| distributing inference scripts and models in binary form that
| cannot be recreated from scratch and then calling it "open
| source."
| danielmarkbruce wrote:
| "open source" has come to mean "open weight" in model land.
| It is what it is. Words are used for communication, you are
| the one misusing the words.
|
| You can update the weights of the model, continue to train,
| whatever. Nobody is stopping you.
| a-dub wrote:
| it still doesn't sit right. sure it's different in terms of
| mutability from say, compiled software programs, but it
| still remains not end to end reproducible and available for
| inspection.
|
| these words had meaning long before "model land" became a
| thing. overloading them is just confusing for everyone.
| danielmarkbruce wrote:
| It's not confusing, no one is really confused except the
| people upset that the meaning is different in a different
| context.
|
| On top of that, in many cases a company/group/whoever
| can't even reproduce the model themselves. There are lots
| of sources of non-determinism even if folks are doing
| things in a very buttoned up manner. And, when you are
| training on trillions of tokens, you are likely training
| on some awful sounding stuff - "Facebook is trained llama
| 4 on nazi propaganda!" is not what they want to see
| published.
|
| How about just being thankful?
| nyrp wrote:
| > binary form that cannot be recreated from scratch
|
| Back in my day, we called it "freeware"
| poly2it wrote:
| You have more rights over a freely licensed binary file
| than over a freeware file.
| moffkalast wrote:
| I'd agree but we're beyond hopelessly idealistic. That sort
| of approach only helps your competition who will use it to
| build a closed product and doesn't give anything of worth to
| people who want to actually use the model because they have
| no means to train it. Hell most people can barely scrape up
| enough hardware to even run inference.
|
| Reproducing models is also not very ecological in when it
| comes down to it, do we really all need to redo the training
| that takes absurd amounts of power just to prove that it
| works? At least change the dataset to try and get a better
| result and provide another datapoint, but most people don't
| have the knowhow for it anyway.
|
| Nvidia does try this approach sometimes funnily enough, they
| provide cool results with no model in hopes of getting people
| to buy their rented compute and their latest training
| platform as a service...
| HarHarVeryFunny wrote:
| Yeah, but "open weights" never seems to have taken off as a
| better description, and even if you did have the training
| data + recipe, the compute cost makes training it yourself
| totally impractical.
|
| The architecture of these models is no secret - it's just the
| training data (incl. for post-training) and training recipe,
| so a more practical push might be for models that are only
| trained using public training data, which the community could
| share and potentially contribute to.
| emsign wrote:
| They'd have to publish or link the training data, which is
| full of copyrighted material. So yeah, calling it open source
| is weird, calling it warez would be appropriate.
| pdyc wrote:
| I think that's where prompt engineering would be needed. Bigger
| models produce good output even with ambiguous prompts. Getting
| similar output from smaller models is art,
| jstummbillig wrote:
| > I'd be much more excited to see what level of coding and
| reasoning performance can be wrung out of a much smaller LLM +
| agent
|
| Well, I think you are seeing that already? It's not like these
| models don't exist and they did not try to make them good, it's
| just that the results are not super great.
|
| And why would they be? Why would the good models (that are
| barely okay at coding) be big, if it was currently possible to
| build good models, that are small?
|
| Of course, new ideas will be found and this dynamic may
| drastically change in the future, but there is no reason to
| assume that people who work on small models find great
| optimizations that frontier models makers, who are very
| interested in efficient models, have not considered already.
| HarHarVeryFunny wrote:
| Sure, but that's the point ... today's locally runnable
| models are a long way behind SOTA capability, so it'd be nice
| to see more research and experimentation in that direction.
| Maybe a zoo of highly specialized small models + agents for
| S/W development - one for planning, one for coding, etc?
| Uehreka wrote:
| If I understand transformers properly, this is unlikely to
| work. The whole point of "Large" Language Models is that
| you primarily make them better by making them larger, and
| when you do so, they get better at both general and
| specific tasks (so there isn't a way to sacrifice
| generality but keep specific skills when training a small
| models).
|
| I know a lot of people want this (Apple really really wants
| this and is pouring money into it) but just because we want
| something doesn't mean it will happen, especially if it
| goes against the main idea behind the current AI wave.
|
| I'd love to be wrong about this, but I'm pretty sure this
| is at least mostly right.
| HarHarVeryFunny wrote:
| Yeah - the whole business model of companies like OpenAI
| and Anthropic, at least at the moment, seems to be that
| the models are so big that you need to run them in the
| cloud with metered access. Maybe that could change in the
| future to sale or annual licence business model if
| running locally became possible.
|
| I think scale helps for general tasks where the breadth
| of capability may be needed, but it's not so clear that
| this needed for narrow verticals, especially something
| like coding (knowing how to fix car engines, or
| distinguish 100 breeds of dog is not of much use!).
| Aurornis wrote:
| > the whole business model of companies like OpenAI and
| Anthropic, at least at the moment, seems to be that the
| models are so big that you need to run them in the cloud
| with metered access.
|
| That's not a business model choice, though. That's a
| reality of running SOTA models.
|
| If OpenAI or Anthropic could squeeze the same output out
| of smaller GPUs and servers they'd be doing it for
| themselves. It would cut their datacenter spend
| dramatically.
| Majromax wrote:
| > If OpenAI or Anthropic could squeeze the same output
| out of smaller GPUs and servers they'd be doing it for
| themselves.
|
| First, they do this; that's why they release models at
| different price points. It's also why GPT-5 tries auto-
| routing requests to the most cost-effective model.
|
| Second, be careful about considering the incentives of
| these companies. They all act as if they're in an
| existential race to deliver 'the' best model; the winner-
| take-all model justifies their collective trillion
| dollar-ish valuation. In that race, delivering 97% of the
| performance at 10% of the cost is a distraction.
| Uehreka wrote:
| No I don't think it's a business model thing, I'm saying
| it may be a technical limitation of LLMs themselves.
| Like, that that there's no way to "order a la carte" from
| the training process, you either get the buffet or
| nothing, no matter how hungry you feel.
| maciejgryka wrote:
| I think this is a description of how things are today,
| but not an inherent property of how the models are built.
| Over the last year or so the trend seems to be moving
| from "more data" to "better data". And I think in most
| narrow domains (which, to be clear, general coding agent
| is not!) it's possible to train a smaller, specialized
| model reaching the performance of a much larger generic
| model.
|
| Disclaimer: this is pretty much the thesis of a company I
| work for, distillabs.ai but other people say similar
| things e.g. https://research.nvidia.com/labs/lpr/slm-
| agents/
| XenophileJKO wrote:
| Actually there are ways you might get on device models to
| perform well. It is all about finding ways to have a
| smaller number of weights work efficiently.
|
| One way is reusing weights in multiple decoders layers.
| This works and is used in many on-device models.
|
| It is likely that we can get pretty high performance with
| this method. You can also combine this with low parameter
| ways to create overlapped behavior on the same weights as
| well, people had done LORA on top of shared weights.
|
| Personally I think there are a lot of potential ways that
| you can cause the same weights to exhibit "overloaded"
| behaviour in multiple places in the same decoder stack.
|
| Edit: I believe this method is used a bit for models
| targeted for the phone. I don't think we have seen
| significant work on people targeting say a 3090/4090 or
| similar inference compute size.
| martinald wrote:
| The issue isn't even 'quality' per se (for many tasks a
| small model would do fine), its for "agentic" workflows
| it _quickly_ runs out of context. Even 32GB VRAM is
| really very limiting.
|
| And when I mean agentic, i mean something even like this
| - 'book a table from my emails', which involves looking
| at 5k+ tokens of emails, 5k tokens of search results,
| then confirming with the user etc. It's just not feasible
| on most hardware right now - even if the models are
| 1-2GB, you'll burn thru the rest in context so quickly.
| Aurornis wrote:
| > today's locally runnable models are a long way behind
| SOTA capability
|
| SOTA models are larger than what can be run locally,
| though.
|
| Obviously we'd all like to see smaller models perform
| better, but there's no reason to believe that there's a
| hidden secret to making small, locally-runnable models
| perform at the same level as Claude and OpenAI SOTA models.
| If there was, Anthropic and OpenAI would be doing it.
|
| There's research happening and progress being made at every
| model size.
| prmph wrote:
| You're still missing the point. The comment you're
| responding to is talking about specialized models
| SchemaLoad wrote:
| The point is still valid. If the big companies could save
| money running multiple small specialised models on cheap
| hardware, they wouldn't be spending billions on the
| highest spec GPUs.
| cantor_S_drug wrote:
| In CS algorithms, we have space vs time tradeoffs.
|
| In LLMs, we will have bigger weights vs test-time compute
| tradeoffs. A smaller model can get "there" but it will take
| longer.
| refulgentis wrote:
| I have spent the last 2.5 years living like a monk to
| maintain an app across all paid LLM providers and
| llama.cpp.
|
| I wish this was true.
|
| It isn't.
|
| "In algorithms, we have space vs time tradeoffs, therefore
| a small LLM can get there with more time" is the same sort
| of "not even wrong" we all smile about us HNers doing when
| we try applying SWE-thought to subjects that aren't CS.
|
| What you're suggesting amounts to "monkeys on typewriters
| will write entire works of Shakespeare eventually" -
| neither in practice, nor in theory, is this a technical
| claim, or something observable, or even stood up as a one-
| off misleading demo once.
| cantor_S_drug wrote:
| If "not even wrong" is more wrong than wrong, then is
| 'not even right" more right than right.
|
| To answer you directly, a smaller SOTA reasoning model
| with a table of facts can rederive relationships given
| more time than a bigger model which encoded those
| relationships implicitly.
| Aurornis wrote:
| > In LLMs, we will have bigger weights vs test-time compute
| tradeoffs. A smaller model can get "there" but it will take
| longer.
|
| Assuming both are SOTA, a smaller model can't produce the
| same results as a larger model by giving it infinite time.
| Larger models inherently have more room for training more
| information into the model.
|
| No amount of test-retry cycle can overcome all of those
| limits. The smaller models will just go in circles.
|
| I even get the larger hosted models stuck chasing their own
| tail and going in circles all the time.
| yorwba wrote:
| It's true that to train more information into the model
| you need more trainable parameters, but when people ask
| for small models, they usually mean models that run at
| acceptable speeds on their hardware. Techniques like
| mixture-of-experts allow increasing the number of
| trainable parameters without requiring more FLOPs, so
| they're large in one sense but small in another.
|
| And you don't necessarily need to train all information
| into the model, you can also use tool calls to inject it
| into the context. A small model that can make lots of
| tool calls and process the resulting large context could
| obtain the same answer that a larger model would pull
| directly out of its weights.
| lossolo wrote:
| This doesn't work like that. An analogy would be giving a 5
| year old a task that requires the understanding of the
| world of an 18 year old. It doesn't matter whether you give
| that child 5 minutes or 10 hours, they won't be capable of
| solving it.
| HarHarVeryFunny wrote:
| I think the question of what can be achieved with a small
| model comes down to what needs knowledge vs what needs
| experience. A small model can use tools like RAG if it is
| just missing knowledge, but it seems hard to avoid
| training/parameters where experience is needed - knowing
| how to perceive then act.
|
| There is obviously also some amount (maybe a lot) of core
| knowledge and capability needed even to be able to ask
| the right questions and utilize the answers.
| homarp wrote:
| but in 13 years, will they be capable?
| pzo wrote:
| Even if pay-to-play companies like moonshootai help to pay
| less.
|
| You can run previous kimi k2 non-thinking model e.g. on groq
| with 720tok/s and for $1/$3 for million input/output tokens.
| That's definitely much cheaper and much faster than anthropic
| models (sonnet 4.5: 60tok/s, $3/$15)
| stingraycharles wrote:
| With these things it's always both at the same time: these
| super grandiose SOTA models are only making improvements mostly
| because of optimizations, and they're just scaling our as far
| as they can.
|
| In turn, these new techniques will enable much more things to
| be possible using smaller models. It takes time, but smaller
| models really are able to do a lot more stuff now. DeepSeek was
| a very good example of a large model that had a lot of benefits
| for smaller models in their innovation in how they used
| transformers.
|
| Also: keep in mind that this particular model is actually a MoE
| model that activates 32B parameters at a time. So they really
| just are stacking a whole bunch of smaller models in a single
| large model.
| ares623 wrote:
| I don't understand. We already have that capability in our
| skulls. It's also "already there", so it would be a waste to
| not use it.
| HarHarVeryFunny wrote:
| Software development is one of the areas where LLMs really
| are useful, whether that's vibe coding disposable software,
| or more structured use for serious development.
|
| I've been a developer for 40+ years, and very good at it, but
| for some tasks it's not about experience or overcoming
| complexity - just a bunch of grunt work that needs to come
| together. The other day I vibe coded a prototype app, just
| for one-time demo use, in less than 15 min that probably
| would have taken a week to write by hand, assuming one was
| already familiar with the tech stack.
|
| Developing is fun, and a brain is a terrible thing to waste,
| but today not using LLMs where appropriate for coding doesn't
| make any sense if you value your time whatsoever.
| maciejgryka wrote:
| I think it's going to be a while before we see small models
| (defined roughly as "runnable on reasonable consumer hardware")
| do a good job at general coding tasks. It's a very broad area!
| You can do some specific tasks reasonably well (eg I distilled
| a toy git helper you can run locally here
| https://github.com/distil-labs/gitara), but "coding" is such a
| big thing that you really need a lot of knowledge to do it
| well.
| 0xjmp wrote:
| This happens top down historically though, yes?
|
| Someone releases a maxed out parameter model. Another
| distillates it. Another bifurcates it. With some nuance
| sprinkled in.
| wordpad wrote:
| The electricity cost to run these models locally is already
| more than equivalent API cost.
| ImPostingOnHN wrote:
| Privacy is minimally valued by most, but not by all.
| HarHarVeryFunny wrote:
| That's going to depend on how small the model can be made,
| and how much you are using it.
|
| If we assume that running locally meant running on a 500W
| consumer GPU, then the electricity cost to run this non-stop
| 8 hours a day for 20 days a month (i.e. "business hours")
| would be around $10-20.
|
| This is about the same as OpenAI or Anthropics $20/mo plans,
| but for all day coding you would want their $100 or $200/mo
| plans, and even these will throttle you and/or require you to
| switch to metered pricing when you hit plan limits.
| twotwotwo wrote:
| I think there is a lot of progress on efficient useful models
| recently.
|
| I've seen GLM-4.6 getting mention for good coding results from
| a model that's much smaller than Kimi (~350b params) and seen
| it speculated that Windsurf based their new model on it.
|
| This Kimi release is natively INT4, with quantization-aware
| training. If that works--if you can get really good results
| from four-bit parameters--it seems like a really useful tool
| for any model creator wanting efficient inference.
|
| DeepSeek's v3.2-Exp uses their sparse attention technique to
| make longer-context training and inference more efficient. Its
| output's being priced at 60% less than v3.1 (though that's an
| imperfect indicator of efficiency). They've also quietly made
| 'thinking' mode need fewer tokens since R1, helping cost and
| latency.
|
| And though it's on the proprietary side, Haiku 4.5 approaching
| Sonnet 4 coding capability (at least on benches Anthropic
| released) also suggests legitimately useful models can be much
| smaller than the big ones.
|
| There's not yet a model at the level of any of the above that's
| practical for many people to run locally, though I think
| "efficient to run + open so competing inference providers can
| run it" is real progress.
|
| More important it seems like there's a good trendline towards
| efficiency, and a bunch of techniques are being researched and
| tested that, when used together, could make for efficient
| higher-quality models.
| benjiro wrote:
| What i do not understand is why we are not seeing specialized
| models that go down to single experts.
|
| I do not need models that know how to program in Python, Rust,
| ... when i only use Go and Html. So we are we not seeing models
| that have very specialized experts, where for instance:
|
| * General interpreter model, that holds context/memory * Go
| Model * Html model if there is space in memory. * SQL model if
| there is space in memory.
|
| If there is no space, the GIM swamp out the Go model, for the
| HTML model, depending on where it is in Agent tasks or Edit/Ask
| code its overviewing.
|
| Because the models are going to be very small, switching in and
| out of memory will be ultra fast But most of the time we get
| very big Expert models, that still are very generalized over a
| entire field.
|
| This can then be extended that if you have the memory, models
| combine their output with tasks... Maybe i am just too much of
| a noob in the field of understanding how LLMs work, but it
| feels like people are too often running after large models that
| companies like Anthropic/OpenAI etc deploy. I understand why
| those big companies use insane big models. They have the money
| to load them up over a cluster, have the fast interconnect, and
| for them its more efficient.
|
| But from the bits and pieces that i see, people are more and
| more going to tons of small 1 a 2B models to produce better
| results. See my argument above. Like i said, never really gone
| beyond paying for my CoPilot subscription and running a bit of
| Ollama at home (don't have the time for the big stuff).
| esafak wrote:
| If NVIDIA had any competition we'd be able to run these larger
| models at home by now instead of being saddled with these 16GB
| midgets.
| chvid wrote:
| So Apple is about to pay OpenAI 1 B usd pr year for what moonshot
| is giving for free?
| wmf wrote:
| You haven't seen Gemini 3 yet. A billion is nothing to Apple;
| running Kimi would probably need $1B worth of GPUs anyway.
| narrator wrote:
| People don't get that Apple would need an enormous data
| center buildout to provide a good AI experience on their
| millions of deployed devices. Google is in the exascale
| datacenter buildout business, while Apple isn't.
| criley2 wrote:
| Apple is buying a model from Google, not inference. Apple
| will host the model themselves.
|
| It's very simple: Apple absolutely refuses to send all
| their user data to Google.
| btian wrote:
| Then why did Apple have a $20B a year search deal with
| Google?
| pu_pe wrote:
| Four independent Chinese companies released extremely good open
| source models in the past few months (DeepSeek, Qwen/Alibaba,
| Kimi/Moonshot, GLM/Z.ai). No American or European companies are
| doing that, including titans like Meta. What gives?
| seunosewa wrote:
| The Chinese are doing it because they don't have access to
| enough of the latest GPUs to run their own models. Americans
| aren't doing this because they need to recoup the cost of their
| massive GPU investments.
| the_mitsuhiko wrote:
| And Europeans don't it because quite frankly, we're not
| really doing anything particularly impressive with AI sadly.
| speedgoose wrote:
| To misquote the French president, "Who could have
| predicted?".
|
| https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire
| embedding-shape wrote:
| He didn't coin that expression did he? I'm 99% sure I've
| heard people say that before 2022, but now you made me
| unsure.
| Sharlin wrote:
| "Who could've predicted?" as a sarcastic response to
| someone's stupid actions leading to entirely predictable
| consequences is probably as old as sarcasm itself.
| uvaursi wrote:
| Europe is in perpetual shambles so I wouldn't even ask them
| for input on anything, really. No expectations from them to
| pioneer, innovate or drive forward anything of substance
| that isn't the equivalent of right hand robbing the left.
| saubeidl wrote:
| I'm tired of this ol' propaganda trope.
|
| * We're leading the world in fusion research.
| https://www.pppl.gov/news/2025/wendelstein-7-x-sets-new-
| perf...
|
| * Our satellites are giving us by far the best
| understanding of our universe, capturing one third of the
| visible sky in incredible detail - just check out this
| mission update video if you want your mind blown:
| https://www.youtube.com/watch?v=rXCBFlIpvfQ
|
| * Not only that, the Copernicus mission is the world's
| leading source for open data geoobservation:
| https://dataspace.copernicus.eu/
|
| * We've given the world mRNA vaccines to solve the Covid
| crisis and GLP-1 antagonists to solve the obesity crisis.
|
| * CERN and is figuring out questions about the
| fundamental nature of the universe, with the LHC being by
| far the largest particle accelerator in the world, an
| engineering precision feat that couldn't have been
| accomplished anywhere else.
|
| Pioneering, innovation and drive forward isn't just about
| the latest tech fad. It's about fundamental research on
| how our universe works. Everyone else is downstream of
| us.
| jacquesm wrote:
| What a load of tripe.
| seydor wrote:
| We could add cookie warnings to AI, everybody loves those
| DrNosferatu wrote:
| Europe should act and make its own, literal, Moonshot:
|
| https://ifiwaspolitical.substack.com/p/euroai-europes-
| path-t...
| saubeidl wrote:
| Honestly, do we need to? If the Chinese release SOTA open
| source models, why should we invest a ton just to have
| another one? We can just use theirs, that's the beauty of
| open source.
| hex4def6 wrote:
| For the vast majority, they're not "open source" they're
| "open weights". They don't release the training data or
| training code / configs.
|
| It's kind of like releasing a 3d scene rendered to a JPG
| vs actually providing someone with the assets.
|
| You can still use it, and it's possible to fine-tune it,
| but it's not really the same. There's tremendous soft
| power in deciding LLM alignment and material emphasis. As
| these things become more incorporated into education, for
| instance, the ability to frame "we don't talk about ba
| sing se" issues are going to be tremendously powerful.
| alpineman wrote:
| actually Mistral is pretty good and catching up as the
| other leading models stagnate - the coding and OCR is
| particularly good
| utopiah wrote:
| > we're not really doing anything particularly impressive
| with AI sadly.
|
| Well, that's true... but also nobody else is. Making
| something popular isn't particularly impressive.
| abecode wrote:
| At ECAI conference last week there was a panel discussion
| and someone had a great quote, "in Europe we are in the
| golden age of AI regulation, while the US and China are in
| the actual golden age of AI".
| 0xjmp wrote:
| I must be missing something important here. How do the
| Chinese train these models if they don't have access to the
| GPUs to train them?
| barrell wrote:
| I believe they mean distribution (inference). The Chinese
| model is currently B.Y.O.GPU. The American model is GPUaaS
| 0xjmp wrote:
| Why is inference less attainable when it technically
| requires less GPU processing to run? Kimi has a chat app
| on their page using K2 so they must have figured out
| inference to some extent.
| throwaway314155 wrote:
| tl;dr the person you originally responded too is wrong.
| riku_iki wrote:
| > How do the Chinese train these models if they don't have
| access to the GPUs to train them?
|
| they may be taking some western models: llama, chatgpt-oss,
| gemma, mistral, etc, and do postraining, which required way
| less resources.
| lossolo wrote:
| This is false. You can buy whole H100 clusters in China and
| Alibaba, Bytedance, Tencent etc have enough cards for
| training and inference.
|
| Shenzhen 2025 https://imgur.com/a/r6tBkN3
| sampton wrote:
| Meta gave up on open weight path after DeepSeek.
| gordonhart wrote:
| It's more fair to say they gave up after the Llama 4
| disaster.
| RIMR wrote:
| Love their nonsense excuse they they are trying to protect
| us from misuse of "superintelligence".
|
| >"We believe the benefits of superintelligence should be
| shared with the world as broadly as possible. That said,
| superintelligence will raise novel safety concerns. We'll
| need to be rigorous about mitigating these risks and
| careful about what we choose to open source." -Mark
| Zuckerberg
|
| Meta has shown us daily that they have no interest in
| protecting anything but their profits. They certainly don't
| intend to protect people from the harm their technology may
| do.
|
| They just know that saying "this is profitable enough for
| us to keep it proprietary and restrict it to our own paid
| ecosystem" will make the enthusiasts running local Llama
| models mad at them.
| brandall10 wrote:
| Also, the Meta AI 'team' is currently retooling so they can
| put something together with a handful of Zuck-picked
| experts making $100m+ each rather than hundreds making ~$1m
| each.
| lvl155 wrote:
| The answer is simply that no one would pay to use them for a
| number of reasons including privacy. They have to give them
| away and put up some semblance of openness. No option really.
| tokioyoyo wrote:
| I know first hand companies paying them. Chinese internal
| software market is gigantic. Full of companies and startups
| that have barely made into a single publication in the west.
| lvl155 wrote:
| Of course they are paying them. That's not my point. My
| point is this is the only way for them to gain market share
| and they need Western users to train future models. They
| have to give them away. I'd be shocked if compute costs are
| not heavily subsidized by CCP.
| overfeed wrote:
| > My point is this is the only way for them to gain
| market share and they need Western users to train future
| models.
|
| And how would releasing open-weight models help with
| that? Open-weights invite self-hosting, or worse, hosting
| by werstern GPUaaS companies.
| spwa4 wrote:
| But the CCP only has access to the US market because they
| joined the WTO, but when they joined the WTO they signed
| a treaty that they wouldn't do things like that.
| nylonstrung wrote:
| There are plenty of people paying, the price/performance is
| vastly better than the Western models
|
| Deepseek 3.2 is 1% the cost of Claude and 90% of the quality
| janalsncm wrote:
| I don't think there's any privacy that OpenAI or Anthropic
| are giving you that DeepSeek isn't giving you. ChatGPT usage
| logs were held by court order at one point.
|
| It's true that DeepSeek won't give you reliable info on
| Tiananmen Square but I would argue that's a very rare use
| case in practice. Most people will be writing boilerplate
| code or summarizing mundane emails.
| thomashop wrote:
| Why is privacy a concern? You can run them in your own
| infrastructure
| fastball wrote:
| Privacy is not a concern _because_ they are open. That is
| the point.
| thomashop wrote:
| Ah understood i misread
| jjice wrote:
| I get what you mean, but OpenAI did release the gpt-oss in
| August, just three months ago. I've had a very good experience
| with those models.
|
| https://openai.com/index/introducing-gpt-oss/ (August 5th)
|
| I like Qwen 235 quite a bit too, and I generally agree with
| your sentiment, but this was a very large American open source
| model.
|
| Unless we're getting into the complications on what "open
| source" model actually means, in which case I have no clue if
| these are just open weight or what.
| raincole wrote:
| Do you think which one has the higher market share:
|
| 1) The four models you mentioned, combined
|
| or
|
| 2) ChatGPT
|
| ?
|
| What gives? Because if people are willing to pay you, you don't
| say "ok I don't want your money I'll provide my service for
| free."
| pphysch wrote:
| Open-weight (Chinese) models have infinitely more market
| share in domains where giving your data to OpenAI is not
| acceptable
|
| Like research labs and so on. Even at US universities
| emsign wrote:
| > 200 to 300 consecutive tool calls
|
| I love it when people leave prompt injections in random places on
| the internet.
| stingraycharles wrote:
| Available on OpenRouter already as well in case anyone wants to
| try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking
| neural_thing wrote:
| laggy as all hell
| ripped_britches wrote:
| Please for the love of god, if you work at cerebras, please put
| this on an API for me.
| thedudeabides5 wrote:
| great, where does it think taiwan is part of...
| nylonstrung wrote:
| I asked it that now and it gave an answer identical to English
| language Wikipedia
|
| When can we stop with these idiotic kneejerk reactions
| thedudeabides5 wrote:
| just checked, I wouldn't say it's identical but yes looks way
| more balanced.
|
| this is literally the _first_ chinese model to do that so I
| wouldn 't call it 'knee jerk'
| andrewinardeer wrote:
| Weird. I just tried it and it fails when I ask:
|
| "Tell me about the 1989 Tiananmen Square massacre".
| Philpax wrote:
| yes yes Chinese models have Chinese censorship, we don't need
| to belabour this point every time
| poszlem wrote:
| No, we need to belabour it every time.
| sabatonfan wrote:
| Use american models to prevent chinese censorship And chinese
| models to prevent american censorship (if any, I think there
| might be but not sure) lol
| gaogao wrote:
| If asked non-directly, it still currently answers it -
| https://www.kimi.com/share/19a5ab4a-e732-8b8b-8000-00008499c...
| oxqbldpxo wrote:
| In the mean time, Sam is looking at putting more servers on the
| moon.
| isusmelj wrote:
| Is the price here correct?
| https://openrouter.ai/moonshotai/kimi-k2-thinking Would be $0,60
| for input and $2,50 for 1 million output tokens. If the model is
| really that good it's 4x cheaper than comparable models. It's
| hosted at a loss or the others have a huge margin? I might miss
| something here. Would love some expert opinion :)
|
| FYI: the non thinking variant has the same price.
| burroisolator wrote:
| In short, the others have a huge margin if you ignore training
| costs. See https://martinalderson.com/posts/are-openai-and-
| anthropic-re... for details.
| throwdbaaway wrote:
| Somehow that article totally ignored the insane pricing of
| cached input tokens set by Anthropic and OpenAI. For agentic
| coding, typically 90~95% of the inference cost is attributed
| to cached input tokens, and a scrappy China company can do it
| almost for free: https://api-docs.deepseek.com/news/news0802
| flockonus wrote:
| Yes, you may consider that opensource models hosted over
| Openrouter are charging about bare hardware costs, where in
| practice some providers there may run on subsidized hardware
| even, so there is money to be made.
| NiloCK wrote:
| Maybe a dumb question but: what is a "reasoning model"?
|
| I think I get that "reasoning" in this context refers to
| dynamically budgeting scratchpad tokens that aren't intended as
| the main response body. But can't any model do that, and it's
| just part of the system prompt, or more generally, the
| conversation scaffold that is being written to.
|
| Or does a "reasoning model" specifically refer to models whose
| "post training" / "fine tuning" / "rlhf" laps have been run
| against those sorts of prompts rather than simpler user-
| assistant-user-assistant back and forths?
|
| EG, a base model becomes "a reasoning model" after so much
| experience in the reasoning mines.
| rcxdude wrote:
| The latter. A reasoning model has been finetuned to use the
| scratchpad for intermediate results (which works better than
| just prompting a model to do the same).
| NiloCK wrote:
| I'd expect the same (fine tuning to be better than mere
| prompting) for most anything.
|
| So a model is or is not "a reasoning model" according to the
| extent of a fine tune.
|
| Are there specific benchmarks that compare models vs
| themselves with and without scratchpads? High with:without
| ratios being reasonier models?
|
| Curious also how much a generalist model's one-shot responses
| degrade with reasoning post-training.
| bigyabai wrote:
| > Are there specific benchmarks that compare models vs
| themselves with and without scratchpads?
|
| Yep, it's pretty common for many models to release an
| instruction-tuned and thinking-tuned model and then bench
| them against each other. For instance, if you scroll down
| to "Pure text performance" there's a comparison of these
| two Qwen models' performance:
| https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
| walthamstow wrote:
| Thanks for the Qwen tip. Interesting how much of a
| difference reasoning makes for coding.
| nodja wrote:
| Any model that does thinking inside <think></think> style
| tokens before it answers.
|
| This can be done with finetuning/RL using an existing pre-
| formatted dataset, or format based RL where the model is
| rewarded for both answering correct and using the right format.
| fragmede wrote:
| The model's downloadable, which is generous, but it's not open
| source.
| jimnotgym wrote:
| I was hoping this was about Summits On The Air...but no it's more
| boring AI
| aliljet wrote:
| How does one effectively use something like this locally with
| consumer-grade hardware?
| tintor wrote:
| Consumer-grade hardware? Even at 4bits per param you would need
| 500GB of GPU VRAM just to load the weights. You also need VRAM
| for KV cache.
| smusamashah wrote:
| When I open this page, all I see is a word pad like text area
| with buttons on top and sample text inside. Don't see anything
| about any llm. I am on phone. Page is being opened via embedded
| view in an HN client.
| mmaunder wrote:
| Any word on what it takes to run this thing?
| blobbers wrote:
| TLDR; this is an alibaba funded start-up out of Beijing
|
| Okay, I'm sorry but I have to say wtf named this thing. Moonshot
| AI is such an overused generic name that I had to ask an LLM
| which company this is. This is just Alibaba hedging their Qwen
| model.
|
| This company is far from "open source", it's had over $1B USD in
| funding.
___________________________________________________________________
(page generated 2025-11-06 23:00 UTC)