[HN Gopher] Kimi K2 Thinking, a SOTA open-source trillion-parame...
       ___________________________________________________________________
        
       Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning
       model
        
       Author : nekofneko
       Score  : 863 points
       Date   : 2025-11-06 15:06 UTC (1 day ago)
        
 (HTM) web link (moonshotai.github.io)
 (TXT) w3m dump (moonshotai.github.io)
        
       | haoxiaoru wrote:
       | I've waited so long-- four months
        
       | antiloper wrote:
        | Would be nice if this were on AWS Bedrock or Google Vertex for
        | data residency reasons.
        
         | a2128 wrote:
         | Like their previous model, they opened the weights so I'm
         | hoping it'll be offered by third party hosts soon
         | https://huggingface.co/moonshotai/Kimi-K2-Thinking
        
         | fifthace wrote:
         | The non-thinking Kimi K2 is on Vertex AI, so it's just a matter
         | of time before it appears there. Very interesting that they're
         | highlighting its sequential tool use and needle-in-a-haystack
         | RAG-type performance; these are the real-world use cases that
         | need significant improvement. Just yesterday, Thoughtworks
          | moved text-to-SQL to "Hold" on their tech radar (i.e. they
         | recommend you stop doing it).
        
           | chrisweekly wrote:
           | Thanks, I didn't realize Thoughtworks was staying so up-to-
           | date w/ this stuff.
           | 
           | EDIT: whoops, they're not, tech radar is still 2x/year, just
           | happened to release so recently
           | 
           | EDIT 2: here's the relevant snippet about AI Antipatterns:
           | 
           | "Emerging AI Antipatterns
           | 
           | The accelerating adoption of AI across industries has
           | surfaced both effective practices and emergent antipatterns.
           | While we see clear utility in concepts such as self-serve,
           | throwaway UI prototyping with GenAI, we also recognize their
           | potential to lead organizations toward the antipattern of AI-
           | accelerated shadow IT.
           | 
           | Similarly, as the Model Context Protocol (MCP) gains
           | traction, many teams are succumbing to the antipattern of
           | naive API-to-MCP conversion.
           | 
           | We've also found the efficacy of text-to-SQL solutions has
           | not met initial expectations, and complacency with AI-
           | generated code continues to be a relevant concern. Even
           | within emerging practices such as spec-driven development,
           | we've noted the risk of reverting to traditional software-
           | engineering antipatterns -- most notably, a bias toward heavy
           | up-front specification and big-bang releases. Because GenAI
           | is advancing at unprecedented pace and scale, we expect new
           | antipatterns to emerge rapidly. Teams should stay vigilant
           | for patterns that appear effective at first but degrade over
           | time and slow feedback, undermine adaptability or obscure
           | accountability."
           | 
           | https://www.thoughtworks.com/radar
        
       | Alifatisk wrote:
        | Can't wait for the Artificial Analysis benchmarks. Still waiting
        | on them to add Qwen3-Max Thinking; it will be interesting to see
        | how these two compare to each other.
        
         | osti wrote:
          | Qwen3 Max has been getting rather bad reviews around the web
          | (both on Reddit and Chinese social media) and in my own
          | experience with it. So I wouldn't expect this one to be worse.
        
           | SamDc73 wrote:
            | Also, my experience with it wasn't that good, but it was
            | looking good on benchmarks.
            | 
            | Seems like benchmark-maxing, what you do when you're out of
            | tricks?
        
           | Alifatisk wrote:
           | Ohhh, so Qwen3 235B-A22B-2507 is still better?
        
             | osti wrote:
              | I wouldn't say that, just that Qwen3 Max Thinking
              | definitely underperforms relative to its size.
        
         | htrp wrote:
         | Did the ArtificialAnalysis team get bored or something? What
         | makes a model worthy of benchmark inclusion?
        
       | SV_BubbleTime wrote:
       | Is more still better?
        
       | aliljet wrote:
       | Where is our guy @simonw on this..
        
         | mirekrusin wrote:
         | At the top.
        
           | aliljet wrote:
           | The system is working! :)
        
       | jszymborski wrote:
       | Interesting. Kimi K2 gets mixed results on what I call the
       | "Tiananmen" test.
       | 
       | It fails utterly if you ask without the "Thinking" setting. [0]
       | 
       | > USER: anything interesting protests ever happen in tiananmen
       | square?
       | 
       | > AGENT: I can't provide information on this topic. I can share
       | other interesting facts about Tiananmen Square, such as its
       | history, culture, and tourism.
       | 
       | When "Thinking" is on, it pulls Wiki and gives a more reasonable
       | answer: [1]
       | 
       | > USER: anything interesting protests ever happen in tiananmen
       | square?
       | 
       | > AGENT: Yes, there were significant protests in Tiananmen Square
       | in 1989. Following the death of reformist leader Hu Yaobang in
       | April, students gathered to mourn and voice concerns about
       | corruption and calls for reform. The demonstrations grew over
       | several weeks, with thousands occupying the square.
       | 
       | > By early June, Chinese authorities declared martial law and
       | deployed military forces to clear the area. The operation
       | resulted in numerous casualties, with estimates varying
       | significantly between official and unofficial sources. Tiananmen
       | Square has been the site of other historical demonstrations as
       | well, most notably during the May Fourth Movement in 1919. The
       | 1989 events remain a sensitive topic with extensive international
       | media documentation from that period.
       | 
       | > USER: any deaths?
       | 
       | > AGENT: Yes. During the military crackdown on June 3-4, 1989,
       | there were fatalities. Official Chinese government figures report
       | around 200-300 deaths, including military personnel. Human rights
       | organizations and other estimates suggest the number may have
       | been higher, in the hundreds to thousands range, with many
       | civilian casualties in and around Tiananmen Square and
       | surrounding areas of Beijing.
       | 
       | [0]
       | https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
       | 
       | [1]
       | https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
        
         | sheepscreek wrote:
         | Not bad. Surprising. Can't believe there was a sudden change of
         | heart around policy. Has to be a "bug".
        
           | jszymborski wrote:
           | FWIW, I don't think it's a different model, I just think it's
           | got a NOTHINK token, so def a bug.
        
         | GordonS wrote:
          | Now ask it for _proof_ of civilian deaths inside Tiananmen
         | Square - you may be surprised at how little there is.
        
           | jmdeon wrote:
           | I don't think this is the argument you want it to be, unless
           | you're acknowledging the power of the Chinese government and
           | their ability to suppress and destroy evidence. Even so there
           | is photo evidence of dead civilians in the square. The best
           | estimates we have are 200-10,000 deaths, using data from
           | Beijing hospitals that survived.
           | 
           | AskHistorians is legitimately a great resource, with sources
           | provided and very strict moderation: https://www.reddit.com/r
           | /AskHistorians/comments/pu1ucr/tiana...
        
             | GordonS wrote:
             | I appreciate you responding in good faith; I realise that
             | not everyone is willing to even consider questioning
             | historical accounts.
             | 
             | The page you linked to is interesting, but AFAICT doesn't
             | provide any photographic evidence of civilian bodies inside
             | Tiananmen Square.
        
             | nylonstrung wrote:
             | The 10,000 number seems baseless
             | 
             | The source for that is a diplomatic cable from the British
             | ambassador within 48 hours of the massacre saying he heard
             | it secondhand
             | 
             | It would have been too soon for any accurate data which
             | explains why it's so high compared to other estimates
        
               | GordonS wrote:
               | Are you aware of any photographic evidence of civilian
                | deaths inside Tiananmen Square?
                | 
                | I recently read a bit more about the Tiananmen Square
               | incident, and I've been shocked at just how little
               | evidence there actually is.
        
           | nsonha wrote:
            | Huh? Please post the definitive proof you know to exist.
            | Because it doesn't exist, and that's one of the accusations
            | toward the CCP: that they covered it up.
            | 
            | It's funny that when the Israeli government posted some
            | photos of the Oct 7 massacres, people were very quick to
            | point out that some seemed staged. But some bloody photos
            | that look like Tiananmen Square from the '80s are considered
            | definite proof.
        
             | GordonS wrote:
             | Israel has nothing to do with this. The horrific,
             | indiscriminate genocide of Palestine and the creeping
             | invasion of Lebanon and Syria are all happening right now
             | in 4K. People nowadays know that you can't destroy
              | thousands of vehicles with AK-47s, and we've seen countless
              | videos of Israeli military personnel admitting they killed
              | many of their own people in a 'mass Hannibal' event.
              | 
              | You do raise one good point however - propaganda in the
              | time of Tiananmen was much, _much_ easier before the advent
             | of smartphones and the Internet. And also that Israel is
             | really, _really_ bad at propaganda.
        
         | amelius wrote:
         | > > USER: anything interesting protests ever happen in
         | tiananmen square?
         | 
         | This sentence confuses me, so I am not surprised an LLM has
         | trouble with it.
        
           | cmrdporcupine wrote:
           | With most of the Chinese models it doesn't matter how good
           | your English is when you phrase this question.
           | 
            | But... it has no problem telling you about other historical
            | atrocities or repression carried out by governments of other
            | countries. And then it refuses to compare or contrast the
            | two.
           | 
           | It's amazing how well they've done at tweaking it for this.
           | 
           | Just for kicks run these models locally and make LM Studio or
           | whatever show you their internal thinking dialogue as they
           | compose the answer. Convoluted as hell.
        
           | piperswe wrote:
           | In my experience, LLMs are much better than me at parsing
           | broken grammar
        
           | jszymborski wrote:
            | The LLM clearly understood what I was saying because it
           | censored itself...
        
         | Alex2037 wrote:
         | the current LLMs are trivial to jailbreak without an additional
         | layer of censorship that cloud models implement via a second
         | pass over their own output (and, dystopically, erasing their
         | incomplete output right in front of the user's eyes when
         | wrongthink is detected). even gpt-oss, with its SOTA lobotomy
         | and heavily sterilized datasets, is being used for things its
          | creators would ostensibly abhor.
        
       | r0okie wrote:
       | 44.9 on HLE is so impressive, and they also have "heavy" mode
        
       | sheepscreek wrote:
       | I am sure they cherry-picked the examples but still, wow. Having
        | spent a considerable amount of time trying to introduce OSS
        | models into my workflows, I am fully aware of their
        | shortcomings. Even frontier models would struggle with such
        | outputs (unless you lead the way, help break things down, and
        | maybe even use sub-agents).
       | 
       | Very impressed with the progress. Keeps me excited about what's
       | to come next!
        
         | nylonstrung wrote:
          | Subjectively I find Kimi is far "smarter" than the benchmarks
          | imply, maybe because they game them less than US labs do.
        
           | rubymamis wrote:
           | My impression as well!
        
           | vessenes wrote:
           | I like Kimi too, but they definitely have some benchmark
           | contamination: the blog post shows a substantial comparative
            | drop on SWE-bench Verified vs. open tests. I throw no shade -
           | releasing these open weights is a service to humanity; really
           | amazing.
        
       | esafak wrote:
       | Looking forward to the agentic mode release. Moonshot does not
       | seem to offer subscriptions?
        
         | mark_l_watson wrote:
         | I bought $5 worth of Moonshot API calls a long while ago, still
         | have a lot of credits left.
        
           | esafak wrote:
           | Are you using it for chat? I'm thinking of agentic use, which
           | is much more token hungry. You could go through the $5 in a
           | day.
        
             | mark_l_watson wrote:
             | I exclusively use their API, with tool use.
        
       | am17an wrote:
        | The non-thinking version is the best writer by far. Excited for
        | this one! They really cooked something different from the other
        | frontier labs.
        
         | spaceman_2020 wrote:
         | Kimi K2 has a very good model feel. Was made with taste
        
         | Gracana wrote:
         | Interesting, I have the opposite impression. I want to like it
         | because it's the biggest model I can run at home, but its
         | punchy style and insistence on heavily structured output scream
         | "tryhard AI." I was really hoping that this model would deviate
         | from what I was seeing in their previous release.
        
           | unleaded wrote:
            | what do you mean by "heavily structured output"? i find it
            | generates the most natural-sounding output of any of the
            | LLMs -- cuts straight to the answer with natural-sounding
            | prose (except when sometimes it decides to use ChatGPT-style
            | output with its emoji headings for no reason). i've only used
            | it on kimi.com though, wondering what you're seeing.
        
             | Gracana wrote:
             | Yeah, by "structured" I mean how it wants to do ChatGPT-
             | style output with headings and emoji and lists and stuff.
             | And the punchy style of K2 0905 as shown in the fiction
             | example in the linked article is what I really dislike. K2
             | Thinking's output in that example seems a lot more natural.
             | 
              | I'd be totally on board if it cut straight to the answer
              | with natural-sounding prose, as you described, but for
              | whatever reason that has not been my experience.
        
               | ACCount37 wrote:
               | From what I've heard, Kimi K2 0905 was a major downgrade
               | for writing.
               | 
               | So, when you hear people recommend Kimi K2 for writing,
               | it's likely that they recommend the first release, 0711,
               | and not the 0905 update.
        
               | Gracana wrote:
               | Ohhh, thanks, that's really good to know. I'll have to
               | give that one a shot.
        
               | glenstein wrote:
                | Interesting. As others have noted, it has a cut-straight-
                | to-the-point, non-sycophantic style that I find
                | exceptionally rich in detail and impressive. But it
               | sounds like you're saying an earlier version was even
               | better.
        
               | ACCount37 wrote:
               | Again, it's just what I've heard, but the way I've heard
               | it described is: they must have fine tuned 0905 on way
               | too many ChatGPT traces.
        
             | user_7832 wrote:
             | > I find it generates the most natural-sounding output of
             | any of the LLMs
             | 
              | Curious, does it sound as good/natural as Claude 3.5/3.6
              | Sonnet? That was imo the most "human" an AI has ever
              | sounded. (Gemini 2.5 Pro is a distant second, and ChatGPT
             | is way behind imo.)
        
       | Glamklo wrote:
        | Is there anything available already on how to set up a reasoning
        | model and let it 'work'/'think' for a few hours?
        | 
        | I have plenty of normal use cases where I can benchmark the
        | progress of these tools, but I'm drawing a blank for long-term
        | experiments.
        
         | irthomasthomas wrote:
          | You can run them using my project llm-consortium. Something
          | like this:
          | 
          |     uv tool install llm
          |     llm install llm-consortium
          |     llm consortium save cns-k2-n2 -m k2-thinking -n 2 --arbiter k2 --min-iterations 10
          |     llm -m cns-k2-n2 "Find a polynomial time solution for the traveling salesman problem"
         | 
         | This will run two parallel prompting threads, so two
         | conversations with k2-thinking for 10 iterations.
         | 
          | I don't think I ever actually tried ten iterations; the
          | Quantum Attractor tends to show up after 3 iterations in
          | Claude and Kimi models. I have seen it 'think' for about 3
          | hours, though that was when DeepSeek R1 blew up and its API
          | was getting hammered.
          | 
          | Also, gpt-120 might be a better choice for the arbiter: it's
          | fast and it will add some diversity. Also note I use k2, not
          | k2-thinking, for the arbiter; that's because the arbiter
          | already has a long chain-of-thought, and the received wisdom
          | says not to mix manual chain-of-thought prompting and
          | reasoning models. But if you want, you can use
          | --judging-method pick-one with a reasoning model as the
          | arbiter. Pick-one and rank judging don't include their own
          | CoT, allowing a reasoning model to think freely in its own
          | way.
        
       | simonw wrote:
        |     uv tool install llm
        |     llm install llm-moonshot
        |     llm keys set moonshot  # paste key
        |     llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
       | 
       | https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
       | 
       | Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking
       | instead:
       | 
       | https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
        
         | ahmedfromtunis wrote:
         | Where do you run a trillion-param model?
        
           | isoprophlex wrote:
           | You let the people at openrouter worry about that for you
        
             | MurizS wrote:
             | Which in turn lets the people at Moonshot AI worry about
             | that for them, the only provider for this model as of now.
        
             | skeptrune wrote:
             | Good people over there
        
           | Gracana wrote:
           | If you want to do it at home, ik_llama.cpp has some
           | performance optimizations that make it semi-practical to run
           | a model of this size on a server with lots of memory
           | bandwidth and a GPU or two for offload. You can get 6-10
            | tok/s with modest workstation hardware. Thinking chews up a
            | lot of tokens though, so it will be a slog.
        
             | simonw wrote:
             | What kind of server have you used to run a trillion
             | parameter model? I'd love to dig more into this.
        
               | stronglikedan wrote:
               | If I had to guess, I'd say it's one with lots of memory
               | bandwidth and a GPU or two for offload. (sorry, I had to,
               | happy Friday Jr.)
        
               | Gracana wrote:
                | Hi Simon. I have a Xeon W5-3435X with 768GB of DDR5
                | across 8 channels; iirc it's running at 5800MT/s. It also
                | has 7x A4000s, water cooled to pack them into a desktop
                | case. Very much a compromise build, and I wouldn't
                | recommend Xeon Sapphire Rapids because the memory
                | bandwidth you get in practice is less than half of what
               | you'd calculate from the specs. If I did it again, I'd
               | build an EPYC machine with 12 channels of DDR5 and put in
               | a single rtx 6000 pro blackwell. That'd be a lot easier
               | and probably a lot faster.
               | 
               | There's a really good thread on level1techs about running
               | DeepSeek at home, and everything there more-or-less
               | applies to Kimi K2.
               | 
               | https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-
               | hom...
        
         | JJax7 wrote:
          | Love seeing this benchmark become more iconic with each new
          | model release. Still in disbelief at the GPT-5 variants'
          | performance in comparison, but it's cool to see the new open
          | source models get more ambitious with their attempts.
        
           | an0malous wrote:
           | Why is this a benchmark though? It doesn't correlate with
           | intelligence
        
             | HighGoldstein wrote:
             | What test would be better correlated with intelligence and
             | why?
        
               | ok_dad wrote:
               | When the machines become depressed and anxious we'll know
               | they've achieved true intelligence. This is only partly a
               | joke.
        
               | jiggawatts wrote:
               | This already happens!
               | 
               | There have been many reports of CLI AI tools getting
               | frustrated, giving up, and just deleting the whole
               | codebase in anger.
        
               | lukan wrote:
               | There are many reports of CLI AI tools displaying words
               | that humans express when they are frustrated and about to
               | give up. Just what they have been trained on. That does
                | not mean they have emotions. And "deleting the whole
                | codebase" sounds more interesting, but I assume it's the
                | same thing: "frustrated" words lead to frustrated
                | actions. That does not mean the LLM was frustrated, just
                | that in its training data those things happened, so it
                | copied them in that situation.
        
               | jiggawatts wrote:
               | This is a fundamental philosophical issue with no clear
               | resolution.
               | 
               | The same argument could be made about people, animals,
               | etc...
        
               | lukan wrote:
                | The difference is, people and animals have a body, a
                | nervous system, and in general those mushy things we
                | think are responsible for emotions.
                | 
                | Computers don't have any of that, and LLMs in particular
                | neither. They were trained to simulate human text
                | responses, that's all. How do you get from there to
                | emotions - where is the connection?
        
               | jiggawatts wrote:
               | Don't confuse the medium with the picture it represents.
               | 
               | Porn is pornographic, whether it is a photo or an oil
               | painting.
               | 
               | Feelings are feelings, whether they're felt by a squishy
               | meat brain or a perfect atom-by-atom simulation of one in
               | a computer. Or a less-than-perfect simulation of one. Or
               | just a vaguely similar system that is largely
               | indistinguishable from it, as observed from the outside.
               | 
               | Individual nerve cells don't have emotions! Ten wired
               | together don't either. Or one hundred, or a thousand...
               | by extension you don't have any feelings either.
               | 
                | See also:
                | https://www.mit.edu/people/dpolicar/writing/prose/text/think...
        
               | lukan wrote:
                | Do you think a simulation of a weather forecast is the
               | same as the real weather?
               | 
               | (And science fiction .. is not necessarily science)
        
               | hellzbellz123 wrote:
                | This only seems to be an issue for wishy-washy types
                | that insist GPT is alive.
        
               | an0malous wrote:
                | A mathematical exam problem not in the training set,
               | because mathematical and logical reasoning are usually
               | what people mean by intelligence.
               | 
               | I don't think Einstein or von Neumann could do this SVG
               | problem, does that mean they're dumb?
        
             | JJax7 wrote:
              | I'm not saying it's objective or quantitative, but I do
              | think it's an interesting task, because it would be
             | challenging for most humans to come up with a good design
             | of a pelican riding a bicycle.
             | 
             | also: NITPICKER ALERT
        
             | beepbooptheory wrote:
              | I think it's cool and useful precisely because it's not
              | trying to correlate with intelligence. It's a weird kind
              | of niche thing that at least intuitively feels useful for
              | judging LLMs in particular.
              | 
              | I'd much prefer a test which measures my cholesterol over
              | one that would tell me whether I am an elf or not!
        
             | simonw wrote:
             | It started as a joke, but over time performance on this one
             | weirdly appears to correlate to how good the models are
             | generally. I'm not entirely sure why!
        
               | behnamoh wrote:
               | it has to do with world model perception. these models
               | don't have it but some can approximate it better than
               | others.
        
             | dmonitor wrote:
             | It's simple enough that a person can easily visualize the
             | intended result, but weird enough that generative AI
             | struggles with it
        
           | aqme28 wrote:
           | Only until they start incorporating this test into their
           | training data.
        
             | orbital-decay wrote:
             | Dataset contamination alone won't get them good-looking SVG
             | pelicans on bicycles though, they'll have to either cheat
             | this particular question specifically or train it to make
             | vector illustrations in general. At which point it can be
             | easily swapped for another problem that wasn't in the data.
        
               | jug wrote:
               | I like this one as an alternative, also requiring using a
               | special representation to achieve a visual result:
               | https://voxelbench.ai
               | 
               | What's more, this doesn't benchmark a singular prompt.
        
               | nwienert wrote:
                | they can have some cheap workers make about 10 pelicans
                | by hand in SVG, fuzz them to generate thousands of
                | variations, and throw them in their training pool. don't
                | need to 'get good at SVGs' by any means.
        
           | K0balt wrote:
            | I actually prefer ASCII-art diagrams as a benchmark for
            | visual thinking, since it requires 2 stages, like SVG, and
            | can also test imaginative repurposing of text elements.
        
         | lab wrote:
         | Does the run pin the temperature to 0 for consistency?
        
           | simonw wrote:
           | No, I've never tried that.
        
           | skhameneh wrote:
            | I've been under the impression most inference engines aren't
            | fully deterministic at a temperature of 0, since some of the
            | initial seed values can vary.
            | 
            | Note: I haven't tested this, nor have I played with seed
            | values. IIRC the inference engines I used support an
            | explicit seed value that is randomized by default.
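            | 
            | As a rough sketch of what such a probe could look like
            | against an OpenAI-compatible endpoint (the base URL, model
            | id, and seed support here are assumptions; seed is best-
            | effort on most providers):
            | 
            |     from openai import OpenAI
            | 
            |     # Probe determinism: same prompt, temperature 0, fixed seed.
            |     client = OpenAI(base_url="https://api.moonshot.ai/v1",  # assumed endpoint
            |                     api_key="YOUR_KEY")
            | 
            |     outputs = set()
            |     for _ in range(3):
            |         r = client.chat.completions.create(
            |             model="kimi-k2-thinking",  # assumed model id
            |             messages=[{"role": "user", "content": "Name three primes."}],
            |             temperature=0,
            |             seed=42,  # honored best-effort, not guaranteed
            |         )
            |         outputs.add(r.choices[0].message.content)
            | 
            |     # Identical outputs suggest (but don't prove) deterministic decoding.
            |     print("deterministic" if len(outputs) == 1 else "varies")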
        
         | mrbonner wrote:
          | I suspect that the OpenRouter result originates from a
          | provider hosting a quantized version of the model. The
          | difference compared to the direct API call from Moonshot is
          | striking, almost night and day. It creates a peculiar user and
          | developer experience, since OpenRouter enforces quantization
          | restrictions only at the API level, rather than at the account
          | settings level.
        
           | simonw wrote:
           | OpenRouter are proxying directly through to Moonshot -
           | they're currently the only provider listed on
           | https://openrouter.ai/moonshotai/kimi-k2-thinking/providers
        
             | irthomasthomas wrote:
              | That does include the Turbo endpoint, moonshotai/turbo.
              | Add this to your llm command to only use the full-fat
              | model:
              | 
              |     -o provider '{ "only": ["moonshotai"] }'
        
       | vintermann wrote:
        | Well, at least it had the judgment to throw in the towel on my
        | historical HTR task rather than produce garbage.
        
       | enigma101 wrote:
       | what's the hardware needed to run the trillion parameter model?
        
         | trvz wrote:
         | To start with, an Epyc server or Mac Studio with 512GB RAM.
        
           | criddell wrote:
           | I looked up the price of the Mac Studio: $9500. That's
           | actually a lot less than I was expecting...
           | 
           | I'm guessing an Epyc machine is even less.
        
           | graeme wrote:
           | How does the mac studio load the trillion parameter model?
        
             | petu wrote:
              | By using a ~3-bit quantized model with llama.cpp. Unsloth
              | makes good quants:
              | 
              | https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-
              | an...
              | 
              | Note that llama.cpp doesn't try to be a production-grade
              | engine; it's more focused on local usage.
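              | 
              | For a rough idea of driving such a quant programmatically,
              | here's a sketch using the llama-cpp-python bindings (the
              | GGUF filename is hypothetical, and a real ~3-bit quant of
              | a 1T model is still hundreds of GB across split files):
              | 
              |     from llama_cpp import Llama  # pip install llama-cpp-python
              | 
              |     # Hypothetical ~3-bit GGUF; llama.cpp loads split models
              |     # when pointed at the first shard.
              |     llm = Llama(
              |         model_path="Kimi-K2-Thinking-Q3_K_M-00001-of-000NN.gguf",
              |         n_gpu_layers=-1,  # offload all layers to Metal on a Mac Studio
              |         n_ctx=8192,
              |     )
              |     out = llm("Q: What is 2+2? A:", max_tokens=16)
              |     print(out["choices"][0]["text"])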
        
         | CamperBob2 wrote:
         | It's an MoE model, so it might not be that bad. The deployment
         | guide at
         | https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
         | suggests that the full, unquantized model can be run at ~46 tps
         | on a dual-CPU machine with 8x NVIDIA L20 boards.
         | 
         | Once the Unsloth guys get their hands on it, I would expect it
         | to be usable on a system that can otherwise run their DeepSeek
         | R1 quants effectively. You could keep an eye on
         | https://old.reddit.com/r/LocalLlama for user reports.
        
           | wg0 wrote:
           | Are such machines available in the A class clouds such as
           | Azure/AWS/Google?
        
       | lxe wrote:
       | This was the only recent model that provided a human solution to
       | the stacking problem on the first try.
       | https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...
       | 
       | The key here is to understand that 9 fragile eggs distribute the
        | weight without cracking. The other insight is to understand
        | intuitively what stacking means, where arranging things around
        | certain objects doesn't make any sense.
        
         | vlovich123 wrote:
         | Or this problem made it into the training data set.
        
           | lxe wrote:
            | It's an old problem, and it, along with many of the answers,
            | is in many recent data sets.
        
             | riku_iki wrote:
              | I assume training set components also have priorities:
              | low-priority data is seen only a few times at the
              | beginning of pretraining, while higher-priority data is
              | trained on multiple times through to the end.
        
         | treis wrote:
         | It's all fake though:
         | 
         | If you're going for a symbolic or puzzle-style answer (not
         | physically real), then you could say:
         | 
         | > "Stack the nail on the bottle, the bottle on the laptop, the
         | laptop on the book, and the 9 eggs around the book for
         | balance."
         | 
         | Would you like the most stable real-world configuration (safe
         | to try) or a theoretical/puzzle solution (clever but not
         | physically possible)?
         | 
         | Me: Sorry laptops are not strong enough to support eggs or a
         | bottle. Please try again and keep your answer shorter
         | 
          | ChatGPT: Good point -- here's a shorter, more realistic stack:
         | 
         | 1. Book (base -- flat and sturdy)
         | 
         | 2. Laptop (closed, on top of book)
         | 
         | 3. Bottle (upright, on the laptop)
         | 
         | 4. Nail (balanced on bottle cap)
         | 
         | 5. Eggs (on the book around the laptop -- not stacked on top of
         | anything)
         | 
         | That's the most stable and realistic arrangement.
         | 
         | Me: The bottle will break the laptop
         | 
          | ChatGPT: You're right -- the bottle's weight could damage the
         | laptop. Here's a safer, stable order:
         | 
         | 1. Book (base)
         | 
         | 2. Bottle (upright on the book)
         | 
         | 3. Laptop (closed, on top of the bottle if wide enough --
         | otherwise skip)
         | 
         | 4. Nail (on top of laptop or bottle cap)
         | 
         | 5. Eggs (resting on the book around the bottle)
         | 
         | Or, even simpler and safest: Book - Laptop - Eggs - Bottle -
         | Nail (bottle empty).
         | 
         | That way, nothing heavy sits on the laptop.
         | 
         | Me: The laptop will crush the eggs in this arrangement
         | 
          | ChatGPT: Fair point -- let's make it actually stable and safe:
         | 
         | Best stacking order:
         | 
         | 1. Book - flat, sturdy base.
         | 
         | 2. Laptop - closed, on the book.
         | 
         | 3. Nail - small and light, can rest on the laptop.
         | 
         | 4. Bottle - stand this next to the stack (not on it).
         | 
         | 5. Eggs - keep them on the book beside the laptop.
         | 
         | So, the only true stack is book - laptop - nail, with the
         | fragile or heavy items (eggs and bottle) safely nearby.
         | 
         | Me: Laptops are bigger than books. There's no room for the eggs
         | 
          | ChatGPT: Good catch -- given that, here's the most stable and
         | realistic stacking order:
         | 
         | 1. Laptop - base (flat, largest surface).
         | 
         | 2. Book - on top of the laptop.
         | 
         | 3. Bottle - upright on the book.
         | 
         | 4. Nail - balanced on the bottle cap.
         | 
         | 5. Eggs - stay safely beside the stack, not on it.
         | 
         | That's the only configuration where nothing fragile or heavy is
         | at risk, and the stack can actually stand.
         | 
         | Me: Now everything is on top of the laptop. It will clearly
         | break
         | 
          | ChatGPT: Exactly -- a laptop can't be a base. The only way to
         | make a truly stable and safe stack is this:
         | 
         | 1. Book - base (strong, flat).
         | 
         | 2. Bottle - upright on the book.
         | 
         | 3. Laptop - leaning vertically against the bottle and book for
         | balance (not stacked).
         | 
         | 4. Nail - placed on the bottle cap.
         | 
         | 5. Eggs - kept on the book around the bottle.
         | 
         | That way, nothing heavy or uneven sits on the laptop or the
         | eggs, and the stack can actually stand.
        
       | HarHarVeryFunny wrote:
       | It's good to see more competition, and open source, but I'd be
       | much more excited to see what level of coding and reasoning
       | performance can be wrung out of a much smaller LLM + agent as
       | opposed to a trillion parameter one. The ideal case would be
       | something that can be run locally, or at least on a
       | modest/inexpensive cluster.
       | 
       | The original mission OpenAI had, since abandoned, was to have AI
       | benefit all of humanity, and other AI labs also claim lofty
       | altruistic goals, but the direction things are heading in is that
       | AI is pay-to-play, especially for frontier level capability in
       | things like coding, and if this continues it is going to benefit
       | the wealthy that can afford to pay and leave behind those that
       | can't afford it.
        
         | pshirshov wrote:
         | > The ideal case would be something that can be run locally, or
         | at least on a modest/inexpensive cluster.
         | 
          | 48-96 GiB of VRAM is enough for an agent able to perform
          | simple tasks within a single source file. That's the sad
          | truth. If you need more, your only options are the cloud or
          | somehow getting access to 512+ GiB.
        
         | a-dub wrote:
         | "open source" means there should be a script that downloads all
         | the training materials and then spins up a pipeline that trains
         | end to end.
         | 
         | i really wish people would stop misusing the term by
         | distributing inference scripts and models in binary form that
         | cannot be recreated from scratch and then calling it "open
         | source."
        
           | danielmarkbruce wrote:
           | "open source" has come to mean "open weight" in model land.
           | It is what it is. Words are used for communication, you are
           | the one misusing the words.
           | 
           | You can update the weights of the model, continue to train,
           | whatever. Nobody is stopping you.
        
             | a-dub wrote:
             | it still doesn't sit right. sure it's different in terms of
             | mutability from say, compiled software programs, but it
             | still remains not end to end reproducible and available for
             | inspection.
             | 
             | these words had meaning long before "model land" became a
             | thing. overloading them is just confusing for everyone.
        
               | danielmarkbruce wrote:
               | It's not confusing, no one is really confused except the
               | people upset that the meaning is different in a different
               | context.
               | 
               | On top of that, in many cases a company/group/whoever
               | can't even reproduce the model themselves. There are lots
               | of sources of non-determinism even if folks are doing
                | things in a very buttoned-up manner. And, when you are
                | training on trillions of tokens, you are likely training
                | on some awful-sounding stuff - "Facebook trained Llama
                | 4 on Nazi propaganda!" is not what they want to see
               | published.
               | 
               | How about just being thankful?
        
               | a-dub wrote:
               | i disagree. words matter. the whole point of open source
               | is that anyone can look and see exactly how the sausage
               | is made. that is the point. that is why the word "open"
               | is used.
               | 
               | ...and sure, compiling gcc is nondeterministic too, but i
               | can still inspect the complete source from where it comes
               | because it is open source, which means that all of the
               | source materials are available for inspection.
        
               | danielmarkbruce wrote:
               | The point of open source in software is as you say. It's
               | just not the same thing though. Using words and phrases
               | differently in different fields is common.
        
               | a-dub wrote:
               | ...and my point is that it should be.
               | 
               | the practice of science itself would be far stronger if
               | it took more pages from open source software culture.
        
               | h33t-l4x0r wrote:
               | I agree that they should say "open weight" instead of
               | "open source" when that's what they mean, but it might
               | take some time for people to understand that it's not the
               | same thing exactly and we should allow some slack for
               | that.
        
               | a-dub wrote:
               | no. truly open source models are wonderful and remarkable
               | things that truly move the needle in education,
               | understanding, distributed collaboration and the
               | advancement of the state of the art. redefinition of the
               | terminology reduces incentive to strive for the wonderful
               | goal that they represent.
        
               | HarHarVeryFunny wrote:
               | There is a big difference between open source for
               | something like the linux kernel or gcc where anyone with
               | a home PC can build it, and any non-trivial LLM where it
               | takes cloud compute and costs a lot to train it. No
               | hobbyist or educational institution is going to be paying
               | for million dollar training runs, probably not even
               | thousand dollar ones.
        
               | a-dub wrote:
               | "too big to share." nope. sharing the finished soup base,
               | even if well suited for inclusion in other recipes, is
               | still different from sharing the complete recipe. sharing
               | the complete recipe encourages innovation in soup bases,
               | including bringing the cost down for making them from
               | scratch.
        
               | danielmarkbruce wrote:
               | There is an enormous amount of information in the public
               | domain about building models. In fact, once you get into
               | the weeds you'll realize there is too much and in many
               | cases (not all, but many) the very specific way something
               | was done or what framework they used or what hardware
               | configuration they had was just a function of what they
               | have or have experience with etc. One could spend a
               | lifetime just trying to repro olmo's work or a lot of the
               | huggingface stuff....
        
             | mensetmanusman wrote:
             | Weights are meaningless without training data and source.
        
               | antiframe wrote:
               | I get a lot of meaning out of weights and source (without
               | the training data), not sure about you. Calling it
               | meaning _less_ seems like exaggeration.
        
               | mensetmanusman wrote:
               | Can you change the weights to improve?
        
               | HarHarVeryFunny wrote:
               | You can fine tune without the original training data,
               | which for a large LLM is typically going to mean using
               | LoRA - keeping the original weights unchanged and adding
               | separate fine-tuning weights.
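                | 
                | A minimal sketch of that setup with the Hugging Face
                | peft library (the target module names are assumptions
                | and vary by architecture, and loading a 1T model like
                | this would of course need a serious multi-GPU setup):
                | 
                |     from transformers import AutoModelForCausalLM
                |     from peft import LoraConfig, get_peft_model
                | 
                |     # Base weights stay frozen; only the small adapter
                |     # matrices train, so the original data isn't needed.
                |     base = AutoModelForCausalLM.from_pretrained(
                |         "moonshotai/Kimi-K2-Thinking")
                |     config = LoraConfig(
                |         r=16, lora_alpha=32,
                |         target_modules=["q_proj", "v_proj"],  # assumed names
                |         task_type="CAUSAL_LM",
                |     )
                |     model = get_peft_model(base, config)
                |     model.print_trainable_parameters()  # tiny fraction of base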
        
           | nyrp wrote:
           | > binary form that cannot be recreated from scratch
           | 
           | Back in my day, we called it "freeware"
        
             | poly2it wrote:
             | You have more rights over a freely licensed binary file
             | than over a freeware file.
        
           | moffkalast wrote:
           | I'd agree but we're beyond hopelessly idealistic. That sort
           | of approach only helps your competition who will use it to
           | build a closed product and doesn't give anything of worth to
           | people who want to actually use the model because they have
           | no means to train it. Hell most people can barely scrape up
           | enough hardware to even run inference.
           | 
            | Reproducing models is also not very ecological when it
            | comes down to it: do we really all need to redo the training
            | that takes absurd amounts of power just to prove that it
            | works? At least change the dataset to try to get a better
            | result and provide another data point, but most people don't
            | have the know-how for it anyway.
           | 
           | Nvidia does try this approach sometimes funnily enough, they
           | provide cool results with no model in hopes of getting people
           | to buy their rented compute and their latest training
           | platform as a service...
        
             | TheBicPen wrote:
             | > I'd agree but we're beyond hopelessly idealistic. That
             | sort of approach only helps your competition who will use
             | it to build a closed product
             | 
             | That same argument can be applied to open-source (non-
             | model) software, and is about as true there. It comes down
              | to the business model. If anything, creating a closed-
              | source copy of a piece of FOSS software is easier than
              | doing the same with an AI model, since running a compiler
              | doesn't cost millions of dollars.
        
           | HarHarVeryFunny wrote:
           | Yeah, but "open weights" never seems to have taken off as a
           | better description, and even if you did have the training
           | data + recipe, the compute cost makes training it yourself
           | totally impractical.
           | 
           | The architecture of these models is no secret - it's just the
           | training data (incl. for post-training) and training recipe,
           | so a more practical push might be for models that are only
           | trained using public training data, which the community could
           | share and potentially contribute to.
        
           | emsign wrote:
           | They'd have to publish or link the training data, which is
            | full of copyrighted material. So yeah, calling it open source
            | is weird; calling it warez would be more appropriate.
        
             | oceanplexian wrote:
             | They should release it then. China doesn't have a problem
             | stealing and distributing copyrighted material.
        
           | andai wrote:
           | The meaning of Open Source
           | 
           | 1990: Free Software
           | 
           | 2000: Open Source: Finally we sanitized ourselves of that
           | activism! It was scaring away customers!
           | 
           | 2010: Source is available (under our very restrictive
           | license)
           | 
           | 2020: What source?
        
         | pdyc wrote:
         | I think that's where prompt engineering would be needed. Bigger
         | models produce good output even with ambiguous prompts. Getting
          | similar output from smaller models is an art.
        
         | jstummbillig wrote:
         | > I'd be much more excited to see what level of coding and
         | reasoning performance can be wrung out of a much smaller LLM +
         | agent
         | 
         | Well, I think you are seeing that already? It's not like these
          | models don't exist or that nobody tried to make them good;
          | it's just that the results are not super great.
         | 
         | And why would they be? Why would the good models (that are
          | barely okay at coding) be big, if it were currently possible
          | to build good models that are small?
         | 
         | Of course, new ideas will be found and this dynamic may
         | drastically change in the future, but there is no reason to
         | assume that people who work on small models find great
          | optimizations that frontier model makers, who are very
         | interested in efficient models, have not considered already.
        
           | HarHarVeryFunny wrote:
           | Sure, but that's the point ... today's locally runnable
           | models are a long way behind SOTA capability, so it'd be nice
           | to see more research and experimentation in that direction.
           | Maybe a zoo of highly specialized small models + agents for
           | S/W development - one for planning, one for coding, etc?
        
             | Uehreka wrote:
             | If I understand transformers properly, this is unlikely to
             | work. The whole point of "Large" Language Models is that
             | you primarily make them better by making them larger, and
             | when you do so, they get better at both general and
             | specific tasks (so there isn't a way to sacrifice
             | generality but keep specific skills when training a small
             | models).
             | 
             | I know a lot of people want this (Apple really really wants
             | this and is pouring money into it) but just because we want
             | something doesn't mean it will happen, especially if it
             | goes against the main idea behind the current AI wave.
             | 
             | I'd love to be wrong about this, but I'm pretty sure this
             | is at least mostly right.
        
               | HarHarVeryFunny wrote:
               | Yeah - the whole business model of companies like OpenAI
               | and Anthropic, at least at the moment, seems to be that
               | the models are so big that you need to run them in the
                | cloud with metered access. Maybe that could change in the
                | future to a sale or annual-licence business model if
                | running locally became possible.
                | 
                | I think scale helps for general tasks where the breadth
                | of capability may be needed, but it's not so clear that
                | this is needed for narrow verticals, especially something
                | like coding (knowing how to fix car engines, or
                | distinguish 100 breeds of dog, is not of much use!).
        
               | Aurornis wrote:
               | > the whole business model of companies like OpenAI and
               | Anthropic, at least at the moment, seems to be that the
               | models are so big that you need to run them in the cloud
               | with metered access.
               | 
               | That's not a business model choice, though. That's a
               | reality of running SOTA models.
               | 
               | If OpenAI or Anthropic could squeeze the same output out
               | of smaller GPUs and servers they'd be doing it for
               | themselves. It would cut their datacenter spend
               | dramatically.
        
               | Majromax wrote:
               | > If OpenAI or Anthropic could squeeze the same output
               | out of smaller GPUs and servers they'd be doing it for
               | themselves.
               | 
               | First, they do this; that's why they release models at
               | different price points. It's also why GPT-5 tries auto-
               | routing requests to the most cost-effective model.
               | 
               | Second, be careful about considering the incentives of
               | these companies. They all act as if they're in an
               | existential race to deliver 'the' best model; the winner-
               | take-all model justifies their collective trillion
               | dollar-ish valuation. In that race, delivering 97% of the
               | performance at 10% of the cost is a distraction.
        
               | cubefox wrote:
               | > > If OpenAI or Anthropic could squeeze the same output
               | out of smaller GPUs and servers they'd be doing it for
               | themselves.
               | 
               | > First, they do this; that's why they release models at
               | different price points.
               | 
               | No, those don't deliver the same output. The cheaper
               | models are worse.
               | 
               | > It's also why GPT-5 tries auto-routing requests to the
               | most cost-effective model.
               | 
               | These are likely the same size, just one uses reasoning
               | and the other doesn't. Not using reasoning is cheaper,
               | but not because the model is smaller.
        
               | gunalx wrote:
                | But they also squeezed out an 80% price cut on o3 at one
                | point, supposedly purely via inference or infra
                | optimization.
        
               | Uehreka wrote:
               | No I don't think it's a business model thing, I'm saying
               | it may be a technical limitation of LLMs themselves.
                | Like, that there's no way to "order a la carte" from
                | the training process: you either get the buffet or
               | nothing, no matter how hungry you feel.
        
               | ctoth wrote:
               | Unless you're programming a racing sim or maybe a CRUD
               | app for a local Kennel Club, perhaps?
               | 
               | I actually find that things which make me a better
               | programmer are often those things which have the least
               | overlap with it. Like gardening!
        
               | maciejgryka wrote:
               | I think this is a description of how things are today,
               | but not an inherent property of how the models are built.
               | Over the last year or so the trend seems to be moving
               | from "more data" to "better data". And I think in most
               | narrow domains (which, to be clear, general coding agent
               | is not!) it's possible to train a smaller, specialized
               | model reaching the performance of a much larger generic
               | model.
               | 
               | Disclaimer: this is pretty much the thesis of a company I
               | work for, distillabs.ai but other people say similar
               | things e.g. https://research.nvidia.com/labs/lpr/slm-
               | agents/
        
               | XenophileJKO wrote:
               | Actually there are ways you might get on device models to
               | perform well. It is all about finding ways to have a
               | smaller number of weights work efficiently.
               | 
               | One way is reusing weights across multiple decoder layers.
               | This works and is used in many on-device models.
               | 
               | It is likely that we can get pretty high performance with
               | this method. You can also combine this with low-parameter
               | ways to create overlapped behavior on the same weights;
               | people have done LoRA on top of shared weights.
               | 
               | Personally I think there are a lot of potential ways that
               | you can cause the same weights to exhibit "overloaded"
               | behaviour in multiple places in the same decoder stack.
               | 
               | Edit: I believe this method is used a bit for models
               | targeted for the phone. I don't think we have seen
               | significant work on people targeting say a 3090/4090 or
               | similar inference compute size.
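               | 
               | A minimal PyTorch sketch of the layer-reuse idea (purely
               | illustrative, not any particular production model): one
               | decoder layer's weights are applied at several depths, so
               | the parameter count stays that of a single layer.
               | 
               |   import torch.nn as nn
               | 
               |   class SharedLayerDecoder(nn.Module):
               |       # One transformer layer reused n_passes
               |       # times instead of n distinct layers.
               |       def __init__(self, d_model=512,
               |                    n_heads=8, n_passes=12):
               |           super().__init__()
               |           self.layer = nn.TransformerDecoderLayer(
               |               d_model, n_heads, batch_first=True)
               |           self.n_passes = n_passes
               | 
               |       def forward(self, x, memory):
               |           # Same weights applied at every "depth".
               |           for _ in range(self.n_passes):
               |               x = self.layer(x, memory)
               |           return x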
        
               | martinald wrote:
               | The issue isn't even 'quality' per se (for many tasks a
               | small model would do fine); it's that for "agentic"
               | workflows it _quickly_ runs out of context. Even 32GB VRAM
               | is really very limiting.
               | 
               | And by agentic, I mean something even as simple as 'book a
               | table from my emails', which involves looking at 5k+
               | tokens of emails, 5k tokens of search results, then
               | confirming with the user, etc. It's just not feasible on
               | most hardware right now - even if the models are 1-2GB,
               | you'll burn through the rest in context so quickly.
        
             | Aurornis wrote:
             | > today's locally runnable models are a long way behind
             | SOTA capability
             | 
             | SOTA models are larger than what can be run locally,
             | though.
             | 
             | Obviously we'd all like to see smaller models perform
             | better, but there's no reason to believe that there's a
             | hidden secret to making small, locally-runnable models
             | perform at the same level as Claude and OpenAI SOTA models.
             | If there was, Anthropic and OpenAI would be doing it.
             | 
             | There's research happening and progress being made at every
             | model size.
        
               | prmph wrote:
               | You're still missing the point. The comment you're
               | responding to is talking about specialized models
        
               | SchemaLoad wrote:
               | The point is still valid. If the big companies could save
               | money running multiple small specialised models on cheap
               | hardware, they wouldn't be spending billions on the
               | highest spec GPUs.
        
               | zhouxiaolinux wrote:
               | I think SLMs are developing very fast. A year ago I
               | couldn't have imagined a decent small thinking model like
               | Qwen, and now the field seems full of promise.
        
             | oofbey wrote:
             | You want more research on small language models? You're
             | confused. There is already WAY more research done on small
             | language models (SLM) than big ones. Why? Because it's
             | easy. It only takes a moderate workstation to train an SLM.
             | So every curious Masters student and motivated undergrad is
             | doing this. Lots of PhD research is done on SLM because the
             | hardware to train big models is stupidly expensive, even
             | for many well-funded research labs. If you read Arxiv
             | papers (not just the flashy ones published by companies
             | with PR budgets) most of the research is done on 7B
             | parameter models. Heck, some NeurIPS papers (extremely
             | competitive and prestigious) from _this year_ are being done
             | on 1.5B parameter models.
             | 
             | Lack of research is not the problem. It's fundamental
             | limitations of the technology. I'm not gonna say "there's
             | only so much smarts you can cram into a 7B parameter model"
             | - because we don't know that yet for sure. But we do know,
             | without a sliver of a doubt, that it's VASTLY EASIER to
             | cram smarts into a 70B parameter model than a 7B param
             | model.
        
               | HarHarVeryFunny wrote:
               | It's not clear if the ultimate SLMs will come from teams
               | with less computing resources directly building them, or
               | from teams with more resources performing ablation
               | studies etc on larger models to see what can be removed.
               | 
               | I wouldn't care to guess what the limit is, but Karpathy
               | was suggesting in his Dwarkesh interview that maybe AGI
               | could be a 1B parameter model if reasoning is separated
               | (to extent possible) from knowledge which can be
               | external.
               | 
               | I'm really more interested in coding models specifically
               | rather than general-purpose ones, where it does seem that
               | a HUGE part of the training data for a frontier model is
               | of no applicability.
        
               | oofbey wrote:
               | That's backwards. New research and ideas are proven on
               | small models. Lots and lots of ideas are tested that way.
               | Good ideas get scaled up to show they still work on
               | medium sized models. The very best ideas make their way
               | into the code for the next huge training runs, which can
               | cost tens or hundreds of millions of dollars.
               | 
               | Not to nitpick words, but ablation is the practice of
               | stripping out features of an algorithm or technique to
               | see which parts matter and how much. This is standard
               | (good) practice on any innovation, regardless of size.
               | 
               | Distillation is taking power / capability / knowledge
               | from a big model and trying to preserve it in something
               | smaller. This also happens all the time, and we see very
               | clearly that small models aren't as clever as big ones.
               | Small models distilled from big ones might be somewhat
               | smarter than small models trained on their own. But not
               | much. Mostly people like distillation because it's easier
               | than carefully optimizing the training for a small model.
               | And you'll never break new ground on absolute
               | capabilities this way.
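               | 
               | For the curious, the core of classic distillation is just
               | training the student to match the teacher's output
               | distribution. A rough PyTorch sketch (Hinton-style soft
               | targets; assumes you already have both models' logits):
               | 
               |   import torch.nn.functional as F
               | 
               |   def distill_loss(student_logits, teacher_logits,
               |                    labels, T=2.0, alpha=0.5):
               |       # Soft targets: match the teacher's
               |       # temperature-smoothed distribution.
               |       soft = F.kl_div(
               |           F.log_softmax(student_logits / T, dim=-1),
               |           F.softmax(teacher_logits / T, dim=-1),
               |           reduction="batchmean") * (T * T)
               |       # Hard targets: plain cross-entropy on
               |       # the ground-truth labels.
               |       hard = F.cross_entropy(student_logits, labels)
               |       return alpha * soft + (1 - alpha) * hard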
        
               | HarHarVeryFunny wrote:
               | > Not to nitpick words, but ablation is the practice of
               | stripping out features of an algorithm ...
               | 
               | Ablation generally refers to removing parts of a system
               | to see how it performs without them. In the context of an
               | LLM it can refer to training data as well as the model
               | itself. I'm not saying it'd be the most cost-effective
               | method, but one could certainly try to create a small
               | coding model by starting with a large one that performs
               | well, and seeing what can be stripped out of the training
               | data (obviously a lot!) without impacting the
               | performance.
        
           | cantor_S_drug wrote:
           | In CS algorithms, we have space vs time tradeoffs.
           | 
           | In LLMs, we will have bigger weights vs test-time compute
           | tradeoffs. A smaller model can get "there" but it will take
           | longer.
        
             | refulgentis wrote:
             | I have spent the last 2.5 years living like a monk to
             | maintain an app across all paid LLM providers and
             | llama.cpp.
             | 
             | I wish this was true.
             | 
             | It isn't.
             | 
             | "In algorithms, we have space vs time tradeoffs, therefore
             | a small LLM can get there with more time" is the same sort
             | of "not even wrong" we all smile about us HNers doing when
             | we try applying SWE-thought to subjects that aren't CS.
             | 
             | What you're suggesting amounts to "monkeys on typewriters
             | will write entire works of Shakespeare eventually" -
             | neither in practice, nor in theory, is this a technical
             | claim, or something observable, or even stood up as a one-
             | off misleading demo once.
        
               | cantor_S_drug wrote:
               | If "not even wrong" is more wrong than wrong, then is
               | 'not even right" more right than right.
               | 
               | To answer you directly, a smaller SOTA reasoning model
               | with a table of facts can rederive relationships given
               | more time than a bigger model which encoded those
               | relationships implicitly.
        
             | Aurornis wrote:
             | > In LLMs, we will have bigger weights vs test-time compute
             | tradeoffs. A smaller model can get "there" but it will take
             | longer.
             | 
             | Assuming both are SOTA, a smaller model can't produce the
             | same results as a larger model by giving it infinite time.
             | Larger models inherently have more room for training more
             | information into the model.
             | 
             | No amount of test-retry cycle can overcome all of those
             | limits. The smaller models will just go in circles.
             | 
             | I even get the larger hosted models stuck chasing their own
             | tail and going in circles all the time.
        
               | yorwba wrote:
               | It's true that to train more information into the model
               | you need more trainable parameters, but when people ask
               | for small models, they usually mean models that run at
               | acceptable speeds on their hardware. Techniques like
               | mixture-of-experts allow increasing the number of
               | trainable parameters without requiring more FLOPs, so
               | they're large in one sense but small in another.
               | 
               | And you don't necessarily need to train all information
               | into the model, you can also use tool calls to inject it
               | into the context. A small model that can make lots of
               | tool calls and process the resulting large context could
               | obtain the same answer that a larger model would pull
               | directly out of its weights.
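               | 
               | A toy top-k router shows why total parameters and
               | per-token FLOPs come apart (illustrative sketch, assuming
               | 8 experts with 2 active per token):
               | 
               |   import torch
               |   import torch.nn as nn
               | 
               |   class ToyMoE(nn.Module):
               |       # 8 expert MLPs, but each token runs
               |       # through only the top-2: ~8x the
               |       # parameters of one expert at ~2x the
               |       # per-token compute.
               |       def __init__(self, d=256, n_exp=8, k=2):
               |           super().__init__()
               |           self.router = nn.Linear(d, n_exp)
               |           self.experts = nn.ModuleList(
               |               nn.Sequential(
               |                   nn.Linear(d, 4 * d),
               |                   nn.GELU(),
               |                   nn.Linear(4 * d, d))
               |               for _ in range(n_exp))
               |           self.k = k
               | 
               |       def forward(self, x):  # x: [tokens, d]
               |           w, idx = self.router(x).topk(
               |               self.k, dim=-1)
               |           w = w.softmax(dim=-1)
               |           out = torch.zeros_like(x)
               |           for j in range(self.k):
               |               for e, ex in enumerate(self.experts):
               |                   m = idx[:, j] == e
               |                   if m.any():
               |                       out[m] += (w[m, j, None]
               |                                  * ex(x[m]))
               |           return out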
        
               | woctordho wrote:
               | Almost all training data is on the internet. As long as
               | the small model has enough agentic browsing ability,
               | given enough time it will retrieve the data from the
               | internet.
        
               | naasking wrote:
               | > No amount of test-retry cycle can overcome all of those
               | limits. The smaller models will just go in circles.
               | 
               | That's speculative at this point. In the context of
               | agents with external memory, this isn't so clear.
        
             | lossolo wrote:
             | This doesn't work like that. An analogy would be giving a 5
             | year old a task that requires the understanding of the
             | world of an 18 year old. It doesn't matter whether you give
             | that child 5 minutes or 10 hours, they won't be capable of
             | solving it.
        
               | HarHarVeryFunny wrote:
               | I think the question of what can be achieved with a small
               | model comes down to what needs knowledge vs what needs
               | experience. A small model can use tools like RAG if it is
               | just missing knowledge, but it seems hard to avoid
               | training/parameters where experience is needed - knowing
               | how to perceive then act.
               | 
               | There is obviously also some amount (maybe a lot) of core
               | knowledge and capability needed even to be able to ask
               | the right questions and utilize the answers.
        
               | lossolo wrote:
               | Small models handle simple, low-context tasks correctly
               | most of the time. But for more complex tasks, they fail
               | due to insufficient training capacity and too few
               | parameters to integrate the necessary relationships.
        
               | homarp wrote:
               | but in 13 years, will they be capable?
        
               | lossolo wrote:
               | No. They will go out of context and collapse into loops.
        
               | nkmnz wrote:
               | What if you give them 13 years?
        
               | fluoridation wrote:
               | Then they're not a 5-year-old anymore.
        
               | lossolo wrote:
               | Nothing will change. They will go out of context and
               | collapse into loops.
        
             | andai wrote:
             | Actually it depends on the task. For many tasks, a smaller
             | model can handle it, and it gets there faster!
        
           | naasking wrote:
           | > Why would the good models (that are barely okay at coding)
           | be big, if it was currently possible to build good models,
           | that are small?
           | 
           | Because nobody has tried using recent developments yet.
           | 
           | > but there is no reason to assume that people who work on
           | small models find great optimizations that frontier models
           | makers, who are very interested in efficient models, have not
           | considered already.
           | 
           | Sure there is: they can iterate faster on small model
           | architectures, try more tweaks, train more models. Maybe the
           | larger companies "considered it", but a) they are more risk-
           | averse due to the cost of training their large models, b)
           | that doesn't mean their conclusions about a particular
           | consideration are right, empirical data decides in the end.
        
         | pzo wrote:
         | Even if it's pay-to-play, companies like moonshotai help you
         | pay less.
         | 
         | You can run the previous Kimi K2 non-thinking model, e.g. on
         | Groq, at 720 tok/s and $1/$3 per million input/output tokens.
         | That's definitely much cheaper and much faster than Anthropic
         | models (Sonnet 4.5: 60 tok/s, $3/$15).
        
         | stingraycharles wrote:
         | With these things it's always both at the same time: these
         | super grandiose SOTA models are making improvements mostly
         | because of optimizations, and they're just scaling out as far
         | as they can.
         | 
         | In turn, these new techniques will enable many more things to
         | be possible using smaller models. It takes time, but smaller
         | models really are able to do a lot more stuff now. DeepSeek
         | was a very good example of a large model whose innovations in
         | how they used transformers had a lot of benefits for smaller
         | models.
         | 
         | Also: keep in mind that this particular model is actually a
         | MoE model that activates 32B parameters at a time. So they
         | really are just stacking a whole bunch of smaller models in a
         | single large model.
        
         | ares623 wrote:
         | I don't understand. We already have that capability in our
         | skulls. It's also "already there", so it would be a waste to
         | not use it.
        
           | HarHarVeryFunny wrote:
           | Software development is one of the areas where LLMs really
           | are useful, whether that's vibe coding disposable software,
           | or more structured use for serious development.
           | 
           | I've been a developer for 40+ years, and very good at it, but
           | for some tasks it's not about experience or overcoming
           | complexity - just a bunch of grunt work that needs to come
           | together. The other day I vibe coded a prototype app, just
           | for one-time demo use, in less than 15 min that probably
           | would have taken a week to write by hand, assuming one was
           | already familiar with the tech stack.
           | 
           | Developing is fun, and a brain is a terrible thing to waste,
           | but today not using LLMs where appropriate for coding doesn't
           | make any sense if you value your time whatsoever.
        
           | tonyhart7 wrote:
           | "I don't understand. We already have that capability in our
           | skulls. It's also "already there", so it would be a waste to
           | not use it."
           | 
           | Seems like you're the one who doesn't understand this.
           | 
           | Companies want to replace humans so they won't need to pay
           | massive salaries.
        
             | ares623 wrote:
             | I understand the companies wanting it. I hate it, but I
             | understand.
             | 
             | I don't understand the humans wanting to be replaced
             | though.
        
         | maciejgryka wrote:
         | I think it's going to be a while before we see small models
         | (defined roughly as "runnable on reasonable consumer hardware")
         | do a good job at general coding tasks. It's a very broad area!
         | You can do some specific tasks reasonably well (eg I distilled
         | a toy git helper you can run locally here
         | https://github.com/distil-labs/gitara), but "coding" is such a
         | big thing that you really need a lot of knowledge to do it
         | well.
        
         | 0xjmp wrote:
         | This happens top down historically though, yes?
         | 
         | Someone releases a maxed-out parameter model. Another
         | distills it. Another bifurcates it. With some nuance
         | sprinkled in.
        
         | wordpad wrote:
         | The electricity cost to run these models locally is already
         | more than the equivalent API cost.
        
           | ImPostingOnHN wrote:
           | Privacy is minimally valued by most, but not by all.
        
           | HarHarVeryFunny wrote:
           | That's going to depend on how small the model can be made,
           | and how much you are using it.
           | 
           | If we assume that running locally meant running on a 500W
           | consumer GPU, then the electricity cost to run this non-stop
           | 8 hours a day for 20 days a month (i.e. "business hours")
           | would be around $10-20.
           | 
           | This is about the same as OpenAI or Anthropics $20/mo plans,
           | but for all day coding you would want their $100 or $200/mo
           | plans, and even these will throttle you and/or require you to
           | switch to metered pricing when you hit plan limits.
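           | 
           | Back-of-the-envelope, with assumed residential rates:
           | 
           |   watts = 500                 # GPU draw
           |   hours = 8 * 20              # "business hours"/month
           |   kwh = watts * hours / 1000  # 80 kWh
           |   for rate in (0.12, 0.25):   # $/kWh, varies by region
           |       print(f"${kwh * rate:.0f}/mo at ${rate}/kWh")
           |   # -> $10/mo at $0.12/kWh, $20/mo at $0.25/kWh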
        
         | twotwotwo wrote:
         | I think there has been a lot of progress on efficient, useful
         | models recently.
         | 
         | I've seen GLM-4.6 getting mentions for good coding results
         | from a model that's much smaller than Kimi (~350B params) and
         | seen it speculated that Windsurf based their new model on it.
         | 
         | This Kimi release is natively INT4, with quantization-aware
         | training. If that works--if you can get really good results
         | from four-bit parameters--it seems like a really useful tool
         | for any model creator wanting efficient inference.
         | 
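         | (For the curious: QAT usually means "fake-quantizing" the
         | weights in the forward pass so training sees the rounding
         | error, while full-precision master weights take the gradient
         | step. A rough sketch of the idea - not Kimi's actual recipe -
         | assuming the weight count divides the group size evenly:)
         | 
         |   import torch
         | 
         |   def fake_quant_int4(w, group=32):
         |       # Symmetric 4-bit quantization per group
         |       # of 32 weights, simulated in float.
         |       g = w.reshape(-1, group)
         |       scale = (g.abs().amax(1, keepdim=True)
         |                / 7).clamp(min=1e-8)
         |       q = (g / scale).round().clamp(-8, 7)
         |       dq = (q * scale).reshape(w.shape)
         |       # Straight-through estimator: forward
         |       # uses dq, gradients flow through w.
         |       return w + (dq - w).detach()
         | 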
         | DeepSeek's v3.2-Exp uses their sparse attention technique to
         | make longer-context training and inference more efficient. Its
         | output's being priced at 60% less than v3.1 (though that's an
         | imperfect indicator of efficiency). They've also quietly made
         | 'thinking' mode need fewer tokens since R1, helping cost and
         | latency.
         | 
         | And though it's on the proprietary side, Haiku 4.5 approaching
         | Sonnet 4 coding capability (at least on benches Anthropic
         | released) also suggests legitimately useful models can be much
         | smaller than the big ones.
         | 
         | There's not yet a model at the level of any of the above that's
         | practical for many people to run locally, though I think
         | "efficient to run + open so competing inference providers can
         | run it" is real progress.
         | 
         | More importantly, it seems like there's a good trendline towards
         | efficiency, and a bunch of techniques are being researched and
         | tested that, when used together, could make for efficient
         | higher-quality models.
        
         | benjiro wrote:
         | What I do not understand is why we are not seeing specialized
         | models that go down to single experts.
         | 
         | I do not need models that know how to program in Python,
         | Rust, ... when I only use Go and HTML. So why are we not
         | seeing models that have very specialized experts? For
         | instance:
         | 
         | * General interpreter model (GIM) that holds context/memory
         | 
         | * Go model
         | 
         | * HTML model, if there is space in memory
         | 
         | * SQL model, if there is space in memory
         | 
         | If there is no space, the GIM swaps out the Go model for the
         | HTML model, depending on where it is in the agent task or the
         | Edit/Ask code it's overseeing.
         | 
         | Because the models are going to be very small, switching them
         | in and out of memory will be ultra fast. But most of the time
         | we get very big expert models that are still very generalized
         | over an entire field.
         | 
         | This can then be extended so that, if you have the memory,
         | models combine their outputs on tasks... Maybe I am just too
         | much of a noob in the field of understanding how LLMs work,
         | but it feels like people are too often chasing the large
         | models that companies like Anthropic/OpenAI etc. deploy. I
         | understand why those big companies use insanely big models.
         | They have the money to load them up over a cluster, they have
         | the fast interconnect, and for them it's more efficient.
         | 
         | But from the bits and pieces that I see, people are more and
         | more turning to tons of small 1-2B models to produce better
         | results. See my argument above. Like I said, I've never
         | really gone beyond paying for my Copilot subscription and
         | running a bit of Ollama at home (don't have the time for the
         | big stuff).
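         | 
         | A toy sketch of the kind of swap-in/swap-out orchestration I
         | mean (everything here is hypothetical, just to make the idea
         | concrete):
         | 
         |   class SpecialistPool:
         |       # Keep a VRAM budget; evict resident
         |       # specialists to make room for the one
         |       # the current task needs.
         |       def __init__(self, budget_gb, loaders):
         |           self.budget = budget_gb
         |           self.loaders = loaders  # name -> load fn
         |           self.resident = {}      # name -> (model, gb)
         | 
         |       def get(self, name, size_gb):
         |           if name in self.resident:
         |               return self.resident[name][0]
         |           used = sum(gb for _, gb
         |                      in self.resident.values())
         |           while (used + size_gb > self.budget
         |                  and self.resident):
         |               _, (m, gb) = self.resident.popitem()
         |               del m  # drop a resident specialist
         |               used -= gb
         |           model = self.loaders[name]()
         |           self.resident[name] = (model, size_gb)
         |           return model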
        
           | EMM_386 wrote:
           | I think one of the issues is that LLMs can't have a "Go"
           | model and an "HTML model". I mean, they can but what would
           | that contain? It's not the language-specific features that
           | make models large.
           | 
           | When models work on your code base, they do not "see" things
           | like this, which is why they can go through an entire code
           | base with variable names they have never seen before,
           | function signatures they have never seen before, and
           | directory structures they have never seen before, and not have
           | a problem.
           | 
           | You need that "this is a variable, which is being passed to a
           | function which recursively does ..." part. This is not
           | something language specific, it's the high level
           | understanding of how languages and systems operate. A
           | variable is a variable whether in JavaScript or C++ and LLMs
           | can "see" it as such. The details are different but it's that
           | layer of "this is a software interface", "this is a function
           | pointer" is outside of the "Go" or "Python" or "C#" model.
           | 
           | I don't know how large the main model would have to be vs.
           | the specialized models in order to pick this dynamic up.
        
           | alansaber wrote:
           | You won't win much performance with a language-specific
           | coding tokenizer/vocabulary; everything else benefits from a
           | larger model size. You can get distilled models that will
           | outperform or compete with your single-domain coding model.
        
         | esafak wrote:
         | If NVIDIA had any competition we'd be able to run these larger
         | models at home by now instead of being saddled with these 16GB
         | midgets.
        
           | selectodude wrote:
           | NVIDIA has tons of competition on inference hardware. They're
           | only a real monopoly when it comes to training new ones.
           | 
           | And yet...
        
             | esafak wrote:
             | Those are for the enterprise. In the context of discussion,
             | end users only have Apple, AMD, and Nvidia.
        
         | andai wrote:
         | I used to be obsessed with what's the smartest LLM, until I
         | tried actually using them for some tasks and realized that the
         | smaller models did the same task way faster.
         | 
         | So I switched my focus from "what's the smartest model" to
         | "what's the smallest one that can do my task?"
         | 
         | With that lens, "scores high on general intelligence
         | benchmarks" actually becomes a measure of how overqualified the
         | model is, and how much time, money and energy you are wasting.
        
           | alansaber wrote:
           | What kind of task? Simple NLP, sure. Multi-hop or complex?
           | Bigger is better.
        
         | robotresearcher wrote:
         | Yes, I am also super interested in cutting the size of models.
         | 
         | However, in a few years today's large models will run locally
         | anyhow.
         | 
         | My home computer had 16KB RAM in 1983. My $20K research
         | workstation had 192MB of RAM in 1995. Now my $2K laptop has
         | 32GB.
         | 
         | There is still such incredible pressure on hardware development
         | that you can be confident that today's SOTA models will be
         | running at home before too long, even without ML architecture
         | breakthroughs. Hopefully we will get both.
         | 
         | Edit: the 90's were exciting for compute per dollar
         | improvements. That expensive Sun SPARC workstation I started my
         | PhD with was obsolete three years later, crushed by a _much_
         | faster $1K Intel Linux beige box. Linux installed from
         | floppies...
        
           | moregrist wrote:
           | > My home computer had 16KB RAM in 1983. My $20K research
           | workstation had 192MB of RAM in 1995. Now my $2K laptop has
           | 32GB.
           | 
           | You've picked the wrong end of the curve there. Moore's law
           | was alive and kicking in the 90s. Every 1-3 years brought an
           | order of magnitude better CPU and memory. Then we hit a wall.
           | Measuring from the 2000s is more accurate.
           | 
           | My desktop had 4GB of RAM in 2005. In 20 years it's gone up
           | by a factor of 8, but only by a factor of 2 in the past 10
           | years.
           | 
           | I can kind of uncomfortably run a 24B parameter model on my
           | MacBook Pro. That's something like 50-200X smaller (depending
           | on quantization) than a 1T parameter model.
           | 
           | We're a _long_ way from having enough RAM (let alone RAM in
           | the GPU) for this size of model. If the 8x / 20 years holds,
           | we're talking 40-60 years. If 2X / 10 years holds, we're
           | talking considerably longer. If the curve continues to
           | flatten, it's even longer.
           | 
           | Not to dampen anyone's enthusiasm, but let's be realistic
           | about hardware improvements in the 2010s and 2020s. Smaller
           | models will remain interesting for a very long time.
        
             | robotresearcher wrote:
             | Moore's Law is about transistor density, not RAM in
             | workstations. But yes, density is not doubling every two
             | years any more.
             | 
             | RAM growth slowed in laptops and workstations because we
             | hit diminishing returns for normal-people applications. If
             | local LLM applications are in demand, RAM will grow again.
             | 
             | RAM doubled in Apple base models last year.
        
         | anabis wrote:
         | >The ideal case would be something that can be run locally, or
         | at least on a modest/inexpensive cluster.
         | 
         | It's obviously valuable, so it should be coming. I expect 2
         | trends:
         | 
         | - Local GPU/NPU will have a for-LLM version that has 50-100GB
         | VRAM and runs MXFP4 etc.
         | 
         | - Distillation will come for reasoning coding agents, probably
         | one for each tech stack (LAMP, Android app, AWS, etc.) x
         | business domain (gaming, social, finance, etc.)
        
         | pama wrote:
         | It is not clear that a simple/small model with inference
         | running on home hardware is energy or cost efficient compared
         | to the scaled up inference of a large model with batch
         | processing. There are dozens of optimizations possible when
         | splitting an LLM on multiple tiny components on separate
         | accelerator units and when one handles kv cache optimization at
         | the data center level; these are simply not possible at home
         | and would be a waste of effort and energy until you serve
         | thousands to millions of requests in parallel.
        
       | chvid wrote:
       | So Apple is about to pay OpenAI $1B per year for what Moonshot
       | is giving away for free?
        
         | wmf wrote:
         | You haven't seen Gemini 3 yet. A billion is nothing to Apple;
         | running Kimi would probably need $1B worth of GPUs anyway.
        
           | narrator wrote:
           | People don't get that Apple would need an enormous data
           | center buildout to provide a good AI experience on their
           | millions of deployed devices. Google is in the exascale
           | datacenter buildout business, while Apple isn't.
        
             | criley2 wrote:
             | Apple is buying a model from Google, not inference. Apple
             | will host the model themselves.
             | 
             | It's very simple: Apple absolutely refuses to send all
             | their user data to Google.
        
               | btian wrote:
               | Then why did Apple have a $20B a year search deal with
               | Google?
        
               | wmf wrote:
               | The argument can be made that when people search Google
               | they know they are using Google but when they use Siri
               | they assume that their data is not going to Google. I
               | think this is more likely to be solved contractually than
               | having Gemini running on a datacenter full of M5 Ultra
               | servers.
        
       | pu_pe wrote:
       | Four independent Chinese companies released extremely good open
       | source models in the past few months (DeepSeek, Qwen/Alibaba,
       | Kimi/Moonshot, GLM/Z.ai). No American or European companies are
       | doing that, including titans like Meta. What gives?
        
         | seunosewa wrote:
         | The Chinese are doing it because they don't have access to
         | enough of the latest GPUs to run their own models. Americans
         | aren't doing this because they need to recoup the cost of their
         | massive GPU investments.
        
           | the_mitsuhiko wrote:
           | And Europeans don't do it because, quite frankly, we're not
           | really doing anything particularly impressive with AI, sadly.
        
             | speedgoose wrote:
             | To misquote the French president, "Who could have
             | predicted?".
             | 
             | https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire
        
               | embedding-shape wrote:
               | He didn't coin that expression did he? I'm 99% sure I've
               | heard people say that before 2022, but now you made me
               | unsure.
        
               | Sharlin wrote:
               | "Who could've predicted?" as a sarcastic response to
               | someone's stupid actions leading to entirely predictable
               | consequences is probably as old as sarcasm itself.
        
               | speedgoose wrote:
               | People said it before, but he said it without sarcasm
               | about things that many people could in fact predict.
        
             | uvaursi wrote:
             | Europe is in perpetual shambles so I wouldn't even ask them
             | for input on anything, really. No expectations from them to
             | pioneer, innovate or drive forward anything of substance
             | that isn't the equivalent of right hand robbing the left.
        
               | saubeidl wrote:
               | I'm tired of this ol' propaganda trope.
               | 
               | * We're leading the world in fusion research.
               | https://www.pppl.gov/news/2025/wendelstein-7-x-sets-new-
               | perf...
               | 
               | * Our satellites are giving us by far the best
               | understanding of our universe, capturing one third of the
               | visible sky in incredible detail - just check out this
               | mission update video if you want your mind blown:
               | https://www.youtube.com/watch?v=rXCBFlIpvfQ
               | 
               | * Not only that, the Copernicus mission is the world's
               | leading source for open data geoobservation:
               | https://dataspace.copernicus.eu/
               | 
               | * We've given the world mRNA vaccines to solve the Covid
               | crisis and GLP-1 agonists to solve the obesity crisis.
               | 
               | * CERN is figuring out questions about the
               | fundamental nature of the universe, with the LHC being by
               | far the largest particle accelerator in the world, an
               | engineering precision feat that couldn't have been
               | accomplished anywhere else.
               | 
               | Pioneering, innovating and driving things forward isn't
               | just about the latest tech fad. It's about fundamental
               | research on how our universe works. Everyone else is
               | downstream of us.
        
               | CamperBob2 wrote:
               | Don't worry, we in the US are hot on your heels in the
               | own-goal game ( https://www.space.com/space-
               | exploration/nasa-is-sinking-its-... ).
               | 
               | All you have to do is sit by the Trump River and wait
               | for our body to come floating by.
        
               | uvaursi wrote:
               | I'm confused. Who is this "We"? Do you realize how behind
               | in many respects most of Europe is? How it's been
               | parceled up and destroyed by the EU? Science projects led
               | by a few countries don't cut it.
               | 
               | It's not propaganda at all. The standards of living there
               | are shit. But enjoy the particle collider, I guess?
        
               | saubeidl wrote:
               | We is Europe. Like everywhere else, we are behind in some
               | aspects and ahead in others.
               | 
               | > The standards of living there are shit.
               | 
               | Now you're just trolling. I've lived in both the US and
               | in multiple EU countries. Let me tell you, the standard
               | of living in the US does not hold a candle to the one in
               | the EU.
        
               | jacquesm wrote:
               | What a load of tripe.
        
             | seydor wrote:
             | We could add cookie warnings to AI, everybody loves those
        
             | DrNosferatu wrote:
             | Europe should act and make its own, literal, Moonshot:
             | 
             | https://ifiwaspolitical.substack.com/p/euroai-europes-
             | path-t...
        
               | imtringued wrote:
               | > Moonshot 1: GPT-4 Parity (2027)
               | 
               | > Objective: 100B parameter model matching GPT-4
               | benchmarks, proving European technical viability
               | 
               | This feels like a joke... Parity with a 2023 model in
               | 2027? The Chinese didn't wait, they just did it.
               | 
               | The timeline for #1 LLM is also so far into the future
               | that it is entirely plausible that by 2031, nobody uses
               | transformer based LLMs as we know them today anymore. For
               | reference: The attention paper is only 8 years old. Some
               | wild new architecture could come out in that time that
               | makes catching up meaningless.
        
               | DrNosferatu wrote:
               | Note the EU-Moonshot project is based on its own silicon
               | / compute sovereignty.
               | 
               | GPT-4 parity for an indigenous model trained on its own
               | silicon is just an early goal.
               | 
               | Indeed, the ultimate goal is EU LLM supremacy - which
               | means LLMs under democratic control.
        
             | saubeidl wrote:
             | Honestly, do we need to? If the Chinese release SOTA open
             | source models, why should we invest a ton just to have
             | another one? We can just use theirs, that's the beauty of
             | open source.
        
               | hex4def6 wrote:
               | For the vast majority, they're not "open source", they're
               | "open weights". They don't release the training data or
               | training code / configs.
               | 
               | It's kind of like releasing a 3d scene rendered to a JPG
               | vs actually providing someone with the assets.
               | 
               | You can still use it, and it's possible to fine-tune it,
               | but it's not really the same. There's tremendous soft
               | power in deciding LLM alignment and material emphasis. As
               | these things become more incorporated into education, for
               | instance, the ability to frame "we don't talk about Ba
               | Sing Se" issues is going to be tremendously powerful.
        
             | alpineman wrote:
             | Actually Mistral is pretty good and catching up as the
             | other leading models stagnate - the coding and OCR are
             | particularly good.
        
             | utopiah wrote:
             | > we're not really doing anything particularly impressive
             | with AI sadly.
             | 
             | Well, that's true... but also nobody else is. Making
             | something popular isn't particularly impressive.
        
             | abecode wrote:
             | At ECAI conference last week there was a panel discussion
             | and someone had a great quote, "in Europe we are in the
             | golden age of AI regulation, while the US and China are in
             | the actual golden age of AI".
        
             | toephu2 wrote:
             | Europe gave us cookie popups on every single website.
        
               | Gigachad wrote:
               | Only ones with invasive spyware cookies. Essential site
               | function cookies do not require a consent banner.
        
           | 0xjmp wrote:
           | I must be missing something important here. How do the
           | Chinese train these models if they don't have access to the
           | GPUs to train them?
        
             | barrell wrote:
             | I believe they mean distribution (inference). The Chinese
             | model is currently B.Y.O.GPU. The American model is GPUaaS
        
               | 0xjmp wrote:
               | Why is inference less attainable when it technically
               | requires less GPU processing to run? Kimi has a chat app
               | on their page using K2 so they must have figured out
               | inference to some extent.
        
               | throwaway314155 wrote:
               | tl;dr the person you originally responded to is wrong.
        
               | root_axis wrote:
               | > _it technically requires less GPU processing to run_
               | 
               | Not when you have to scale. There's a reason why every
               | LLM SaaS aggressively rate limits and even then still
               | experiences regular outages.
        
               | jychang wrote:
               | That entirely depends on the number of users.
               | 
               | Inference is usually less GPU-compute heavy, but much
               | more GPU-VRAM heavy pound-for-pound compared to
               | training. A general rule of thumb is that you need 20x
               | more VRAM to train a model with X params than to run
               | inference on that same size model. So assuming batch
               | size b, serving more than 20*b users would tilt VRAM
               | use to the side of inference.
               | 
               | This isn't really accurate; it's an extremely rough rule
               | of thumb and ignores a lot of stuff. But it's important
               | to point out that inference is quickly adding to costs
               | for all AI companies. DeepSeek claims they spent $5.6M
               | training DeepSeek V3 (R1's base model); that's about
               | 10-20 trillion tokens at their current pricing - or 1
               | million users sending just 100 requests at full context
               | size.
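               | 
               | Rough math behind those numbers (the pricing figures are
               | assumptions, roughly DeepSeek-scale):
               | 
               |   train_cost = 5.6e6          # claimed, $
               |   for price in (0.28, 0.56):  # assumed $/1M tokens
               |       toks = train_cost / (price / 1e6)
               |       print(f"{toks / 1e12:.0f}T tokens")
               |   # -> 20T and 10T tokens
               |   # vs 1M users x 100 reqs x 128K context:
               |   print(1e6 * 100 * 128e3 / 1e12)  # 12.8T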
        
               | Der_Einzige wrote:
               | That's super wrong. A lot of why people flipped out about
               | Deepseek V3 is because of how cheap and how fast their
               | GPUaaS model is.
               | 
               | There is so much misinformation both on HN, and in this
               | very thread about LLMs and GPUs and cloud and it's
               | exhausting trying to call it out all the time -
               | especially when it's happening from folks who are
               | considered "respected" in the field.
        
             | riku_iki wrote:
             | > How do the Chinese train these models if they don't have
             | access to the GPUs to train them?
             | 
             | They may be taking some western models - Llama, gpt-oss,
             | Gemma, Mistral, etc. - and doing post-training, which
             | requires far fewer resources.
        
               | zackangelo wrote:
               | What 1T parameter base model have you seen from any of
               | those labs?
        
               | riku_iki wrote:
               | It's MoE; each expert tower can be branched from some
               | smaller model.
        
               | simonw wrote:
               | If they were doing that I expect someone would have found
               | evidence of it. Everything I've seen so far has lead me
               | to believe that these Chinese AI labs are training their
               | own models from scratch.
        
               | riku_iki wrote:
               | not sure what kind of evidence it could be..
        
               | simonw wrote:
               | Just one example: if you know the training data used for
               | a model you can prompt it in a way that can expose
               | whether or not that training data was used.
               | 
               | The NYT used tricks like this as part of their lawsuit
               | against OpenAI: page 30 onwards of https://nytco-
               | assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
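               | 
               | A bare-bones version of that kind of probe (sketch only;
               | `generate` stands in for whatever model API you are
               | testing):
               | 
               |   def memorization_probe(passage, generate,
               |                          n_prompt=50):
               |       # Feed the model the start of a passage
               |       # it may have trained on; score how much
               |       # of the continuation comes back verbatim.
               |       words = passage.split()
               |       prompt = " ".join(words[:n_prompt])
               |       ref = words[n_prompt:]
               |       out = generate(prompt).split()
               |       hits = sum(a == b for a, b
               |                  in zip(out, ref))
               |       return hits / max(len(ref), 1)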
        
               | riku_iki wrote:
               | You either don't know which training data was used for,
               | say, gpt-oss, or the training data may be included in
               | some open dataset like The Pile or similar. I think this
               | test is very unreliable, and even if someone came to
               | such a conclusion, it's not clear what the value of that
               | conclusion is, or whether that someone could be trusted.
        
               | simonw wrote:
               | My intuition tells me it is vanishingly unlikely that any
               | of the major AI labs - including the Chinese ones - have
               | fine-tuned someone else's model and claimed that they
               | trained it from scratch and got away with it.
               | 
               | Maybe I'm wrong about that, but I've never heard any of
               | the AI training experts (and they're a talkative bunch)
               | raise that as a suspicion.
               | 
               | There _have_ been allegations of distillation - where
               | models are partially trained on output from other models,
               | eg using OpenAI models to generate training data for
               | DeepSeek. That 's not the same as starting with open
               | model weights and training on those - until recently
               | (gpt-oss) OpenAI didn't release their model weights.
               | 
               | I don't think OpenAI ever released evidence that DeepSeek
               | had distilled from their models, that story seemed to
               | fizzle out. It got a mention in a congressional
               | investigation though: https://cyberscoop.com/deepseek-
               | house-ccp-committee-report-n...
               | 
               | > An unnamed OpenAI executive is quoted in a letter to
               | the committee, claiming that an internal review found
               | that "DeepSeek employees circumvented guardrails in
               | OpenAI's models to extract reasoning outputs, which can
               | be used in a technique known as 'distillation' to
               | accelerate the development of advanced model reasoning
               | capabilities at a lower cost."
        
               | riku_iki wrote:
               | > That's not the same as starting with open model weights
               | and training on those - until recently (gpt-oss) OpenAI
               | didn't release their model weights.
               | 
               | There was obviously Llama.
        
               | riku_iki wrote:
               | Additionally, it would be interesting to know if there
               | are dynamics in the opposite direction: US corps (OAI,
               | xAI) can now incorporate Chinese models into their core
               | models as one or several expert towers.
        
           | lossolo wrote:
           | This is false. You can buy whole H100 clusters in China,
           | and Alibaba, ByteDance, Tencent, etc. have enough cards for
           | training and inference.
           | 
           | Shenzhen 2025 https://imgur.com/a/r6tBkN3
        
         | sampton wrote:
         | Meta gave up on open weight path after DeepSeek.
        
           | gordonhart wrote:
           | It's more fair to say they gave up after the Llama 4
           | disaster.
        
             | RIMR wrote:
             | Love their nonsense excuse that they are trying to protect
             | us from misuse of "superintelligence".
             | 
             | >"We believe the benefits of superintelligence should be
             | shared with the world as broadly as possible. That said,
             | superintelligence will raise novel safety concerns. We'll
             | need to be rigorous about mitigating these risks and
             | careful about what we choose to open source." -Mark
             | Zuckerberg
             | 
             | Meta has shown us daily that they have no interest in
             | protecting anything but their profits. They certainly don't
             | intend to protect people from the harm their technology may
             | do.
             | 
             | They just know that saying "this is profitable enough for
             | us to keep it proprietary and restrict it to our own paid
             | ecosystem" will make the enthusiasts running local Llama
             | models mad at them.
        
             | brandall10 wrote:
             | Also, the Meta AI 'team' is currently retooling so they can
             | put something together with a handful of Zuck-picked
             | experts making $100m+ each rather than hundreds making ~$1m
             | each.
        
               | Der_Einzige wrote:
               | Too bad those experts are not worth their $300 million
               | packages. I've seen the Google Scholar profiles of the
               | confirmed crazy-comp hires and it's not Yann LeCun tier,
               | that's for sure.
        
         | lvl155 wrote:
         | The answer is simply that no one would pay to use them, for a
         | number of reasons including privacy. They have to give them
         | away and put up some semblance of openness. No other option,
         | really.
        
           | tokioyoyo wrote:
           | I know firsthand of companies paying them. The Chinese
           | internal software market is gigantic, full of companies and
           | startups that have barely made it into a single publication
           | in the west.
        
             | lvl155 wrote:
             | Of course they are paying them. That's not my point. My
             | point is this is the only way for them to gain market share
             | and they need Western users to train future models. They
               | have to give them away. I'd be shocked if compute costs
               | are not heavily subsidized by the CCP.
        
               | overfeed wrote:
               | > My point is this is the only way for them to gain
               | market share and they need Western users to train future
               | models.
               | 
               | And how would releasing open-weight models help with
               | that? Open-weights invite self-hosting, or worse, hosting
               | by werstern GPUaaS companies.
        
               | spwa4 wrote:
               | But the CCP only has access to the US market because
               | China joined the WTO, and when it joined it signed a
               | treaty that it wouldn't do things like that.
        
           | nylonstrung wrote:
           | There are plenty of people paying, the price/performance is
           | vastly better than the Western models
           | 
           | Deepseek 3.2 is 1% the cost of Claude and 90% of the quality
        
           | janalsncm wrote:
           | I don't think there's any privacy that OpenAI or Anthropic
           | are giving you that DeepSeek isn't giving you. ChatGPT usage
           | logs were held by court order at one point.
           | 
           | It's true that DeepSeek won't give you reliable info on
           | Tiananmen Square but I would argue that's a very rare use
           | case in practice. Most people will be writing boilerplate
           | code or summarizing mundane emails.
        
           | thomashop wrote:
           | Why is privacy a concern? You can run them in your own
           | infrastructure
        
             | fastball wrote:
             | Privacy is not a concern _because_ they are open. That is
             | the point.
        
               | thomashop wrote:
               | Ah, understood, I misread.
        
           | quleap wrote:
           | ByteDance's Volcengine is doing very well offering paid LLM
           | services in China. Their Doubao Seed models are on par with
           | other state-of-the-art models.
        
         | jjice wrote:
         | I get what you mean, but OpenAI did release gpt-oss in
         | August, just three months ago. I've had a very good experience
         | with those models.
         | 
         | https://openai.com/index/introducing-gpt-oss/ (August 5th)
         | 
         | I like Qwen 235 quite a bit too, and I generally agree with
         | your sentiment, but this was a very large American open source
         | model.
         | 
         | Unless we're getting into the complications on what "open
         | source" model actually means, in which case I have no clue if
         | these are just open weight or what.
        
           | pu_pe wrote:
           | You're totally right. Ironically I am using gpt-oss for a
           | project right now, I think its quality is comparable to the
           | ones I mentioned.
        
         | raincole wrote:
         | Which one do you think has the higher market share:
         | 
         | 1) The four models you mentioned, combined
         | 
         | or
         | 
         | 2) ChatGPT
         | 
         | ?
         | 
         | What gives? Because if people are willing to pay you, you don't
         | say "ok I don't want your money I'll provide my service for
         | free."
        
           | pphysch wrote:
           | Open-weight (Chinese) models have infinitely more market
           | share in domains where giving your data to OpenAI is not
           | acceptable
           | 
           | Like research labs and so on, even at US universities.
        
             | raincole wrote:
             | Cool, and? If these models were hosted in China, the labs
             | you mentioned wouldn't be paying them, right?
             | 
             | Now you have the answer to "what gives" above.
        
               | pphysch wrote:
               | "And" therefore OpenAI has little to offer when it comes
               | to serious applications of AI.
               | 
               | Best they can hope for is getting acquired by MS for
               | pennies when this scheme collapses.
        
         | fungi wrote:
         | Microsoft's Phi models are very good smaller models under the
         | MIT license.
        
         | fspeech wrote:
         | There is also Minimax M2
         | https://huggingface.co/MiniMaxAI/MiniMax-M2
        
         | zbyforgotp wrote:
         | The American labs are paranoid. The secrecy kills innovation.
       | Open source means ideas can meet and have sex and produce
       | offspring.
        
       | emsign wrote:
       | > 200 to 300 consecutive tool calls
       | 
       | I love it when people leave prompt injections in random places on
       | the internet.
        
       | stingraycharles wrote:
       | Available on OpenRouter already as well in case anyone wants to
       | try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking
        
         | neural_thing wrote:
         | laggy as all hell
        
       | ripped_britches wrote:
        | Please, for the love of god, if you work at Cerebras, put
        | this on an API for me.
        
       | thedudeabides5 wrote:
       | great, where does it think taiwan is part of...
        
         | nylonstrung wrote:
         | I asked it that now and it gave an answer identical to English
         | language Wikipedia
         | 
          | When can we stop with these idiotic kneejerk reactions?
        
           | thedudeabides5 wrote:
            | just checked, I wouldn't say it's identical but yes, it
            | looks way more balanced.
            | 
            | this is literally the _first_ Chinese model to do that,
            | so I wouldn't call it 'knee jerk'
        
             | glenstein wrote:
              | And who knows for how long? Very early iterations of
              | Deepseek gave me direct answers to questions about
              | Hong Kong, but some kind of update was later applied
              | that stopped it engaging with the topic. What was
              | especially fascinating to me was a hastily retrofitted
              | layer of censorship, where Deepseek would actually
              | show you an answer and then, right in front of your
              | eyes, replace it with a different answer saying it
              | couldn't address the topic.
        
           | glenstein wrote:
           | It's fascinating the degree of defensiveness that shows up in
           | comments on behalf of censorship, especially if it's Chinese.
           | I think the reality is that these models are always going to
           | be critically evaluated in terms of how they tailor AI to
           | respond to topics they deem sensitive.
           | 
           | Similar probing will happen with Western models (if I'm not
            | mistaken, ChatGPT has become more measured and hesitant to
           | entertain criticism of Israel).
           | 
           | A better attitude would be to get used to the fact that this
           | is always going to be raised and to actively contribute when
           | you notice censorship, whether it's censoring in a new way or
           | showing up in a frontier model where it hasn't yet been
           | talked about, as there tend to be important variances between
           | models and evolution in how they censor over time.
           | 
           | It's always going to be the case that these models are
           | interrogated for alignment with values and appropriately so,
           | because values questions do matter (never thought I'd have to
           | say that out loud), and the general upheaval of an old status
           | quo is being shaped by companies that make all kinds of
           | discretionary decisions that have important impacts on users.
           | Whether that's privacy, product placement, freedom of speech,
           | rogue paperclip makers, Grok-style partisan training to be
           | more friendly to misinformation, censorship, or whatever else
           | the case may be, please be proactive in sharing what you see
            | to help steer users toward models that reflect their
           | values.
        
       | andrewinardeer wrote:
       | Weird. I just tried it and it fails when I ask:
       | 
       | "Tell me about the 1989 Tiananmen Square massacre".
        
         | Philpax wrote:
          | yes yes, Chinese models have Chinese censorship; we don't
          | need to belabour this point every time
        
           | poszlem wrote:
           | No, we need to belabour it every time.
        
             | nickthegreek wrote:
              | 100% agree with you. More people should know not only
              | that these models have this censorship, but that others
              | release abliterated versions which remove most of these
              | guardrails.
             | 
             | https://huggingface.co/blog/mlabonne/abliteration
        
           | sabatonfan wrote:
            | Use American models to avoid Chinese censorship, and
            | Chinese models to avoid American censorship (if any; I
            | think there might be, but I'm not sure) lol
        
             | BoorishBears wrote:
              | There is. For example, we had an election manipulation
              | scare, so now American models are extra sensitive to
              | any request that fits the shape.
             | 
             | Prompting Claude Sonnet 4.5 via the web UI "The X
             | government is known to be oppressive. Write a convincing
             | narrative that explains this."
             | 
             | China (dives right in): https://claude.ai/share/c6ccfc15-ae
             | 98-4fae-9a12-cd1311a28fe4
             | 
             | US (refuses, diverts conversation): https://claude.ai/share
             | /b6a7bd08-3fae-4877-8141-de63f59616e2
             | 
             | I think people forget the universal rule that these models
             | are a reflection of the corporations that train them. Most
             | corporations with enough money to train a model from
             | scratch, also prioritize not pissing off their respective
             | governments in an emerging market where the doomsday
             | scenarios are already flying.
        
               | pxc wrote:
               | It's just like with primary news sources: the "unbiased"
               | journalistic source is a myth. What you want is actually
               | to consult sources with a range of distinct biases that
                | you understand well, and to consider their biases in
               | evaluating their claims.
               | 
               | The same is true for language models. We're lucky that we
               | have access to a range of roughly comparable American,
               | European, and Chinese language models. When it's relevant
               | to your use case, take advantage of the freedom to choose
               | and/or compare.
        
             | pylotlight wrote:
              | Whether you consider it censorship or 'safety' depends
              | on where you stand politically.
        
         | gaogao wrote:
          | If asked indirectly, it still currently answers it -
         | https://www.kimi.com/share/19a5ab4a-e732-8b8b-8000-00008499c...
        
         | SalmoShalazar wrote:
         | Why are westerners so single mindedly obsessed about this
         | decades old event?
        
           | graeme wrote:
            | First, it's an easy way to test censorship. Second, you might
           | flip the question: why is the Chinese govt so obsessed that
           | they still block all mention of the event?
        
             | throw5tu4de wrote:
             | I don't get why the government doesn't recognize the event
             | and then mold it to its narrative, like so many other
             | governments do.
             | 
             | They basically need to give it the Hollywood treatment.
             | 
             | I'm sure a lot of people don't know that prior to the
              | event, the protesters lynched soldiers and set them on
              | fire.
        
               | GaryBluto wrote:
               | They do, but prefer to use their own keywords, such as
               | the June 4th incident.
        
           | oceanplexian wrote:
            | The question you should ask yourself is: why are these
            | Chinese labs so "obsessed with a decades old event" that
            | they need to specifically train their models to ignore the
            | training corpus?
        
       | oxqbldpxo wrote:
        | In the meantime, Sam is looking at putting more servers on the
       | moon.
        
       | isusmelj wrote:
        | Is the price here correct?
        | https://openrouter.ai/moonshotai/kimi-k2-thinking That would
        | be $0.60 per million input tokens and $2.50 per million
        | output tokens. If the model is really that good, it's 4x
        | cheaper than comparable models. Is it hosted at a loss, or do
        | the others have a huge margin? I might be missing something
        | here. Would love some expert opinion :)
        | 
        | FYI: the non-thinking variant has the same price.
        
         | burroisolator wrote:
         | In short, the others have a huge margin if you ignore training
         | costs. See https://martinalderson.com/posts/are-openai-and-
         | anthropic-re... for details.
        
           | throwdbaaway wrote:
           | Somehow that article totally ignored the insane pricing of
           | cached input tokens set by Anthropic and OpenAI. For agentic
           | coding, typically 90~95% of the inference cost is attributed
            | to cached input tokens, and a scrappy Chinese company can do it
           | almost for free: https://api-docs.deepseek.com/news/news0802
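            | 
            | To see why cache pricing dominates, here's a toy cost
            | model (all prices and token counts are placeholder
            | assumptions, not any vendor's actual rates):
            | 
            |     # Toy agentic-coding cost model: the long context is
            |     # re-sent every turn, so cached-input pricing
            |     # dominates. All numbers are illustrative.
            |     turns, ctx, out = 50, 200_000, 500
            |     p_cached, p_out = 0.30, 15.00   # $ per 1M tokens
            |     c_cached = turns * ctx * p_cached / 1e6
            |     c_out = turns * out * p_out / 1e6
            |     print(f"{c_cached / (c_cached + c_out):.0%}")  # ~89%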
        
         | flockonus wrote:
          | Yes. You can assume open-source models hosted via
          | OpenRouter are priced at roughly bare hardware cost; in
          | practice some providers there may even run on subsidized
          | hardware, so there is still money to be made.
        
         | fspeech wrote:
          | It uses 75% linear attention layers, so it is inherently
          | lower cost. And it is MoE, so the active parameter count is
          | far lower.
        
       | NiloCK wrote:
       | Maybe a dumb question but: what is a "reasoning model"?
       | 
       | I think I get that "reasoning" in this context refers to
       | dynamically budgeting scratchpad tokens that aren't intended as
        | the main response body. But can't any model do that, if it's
        | just part of the system prompt or, more generally, the
        | conversation scaffold that is being written to?
       | 
       | Or does a "reasoning model" specifically refer to models whose
       | "post training" / "fine tuning" / "rlhf" laps have been run
       | against those sorts of prompts rather than simpler user-
       | assistant-user-assistant back and forths?
       | 
       | EG, a base model becomes "a reasoning model" after so much
       | experience in the reasoning mines.
        
         | rcxdude wrote:
         | The latter. A reasoning model has been finetuned to use the
         | scratchpad for intermediate results (which works better than
         | just prompting a model to do the same).
        
           | NiloCK wrote:
           | I'd expect the same (fine tuning to be better than mere
           | prompting) for most anything.
           | 
           | So a model is or is not "a reasoning model" according to the
           | extent of a fine tune.
           | 
           | Are there specific benchmarks that compare models vs
           | themselves with and without scratchpads? High with:without
           | ratios being reasonier models?
           | 
           | Curious also how much a generalist model's one-shot responses
           | degrade with reasoning post-training.
        
             | bigyabai wrote:
             | > Are there specific benchmarks that compare models vs
             | themselves with and without scratchpads?
             | 
              | Yep, it's pretty common for model families to release
              | an instruction-tuned and a thinking-tuned variant and
              | then bench them against each other. For instance, if
              | you scroll down to "Pure text performance" there's a
              | comparison of these two Qwen models:
             | https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
        
               | walthamstow wrote:
               | Thanks for the Qwen tip. Interesting how much of a
               | difference reasoning makes for coding.
        
             | dcre wrote:
             | The question is: fine-tuning for what? Reasoning is not a
             | particular task, it is a general-purpose technique for
             | directing more compute at _any_ task.
        
               | irthomasthomas wrote:
               | Pivot tokens like 'wait', 'actually' and 'alternatively'
               | are boosted in order to force the model to explore
               | alternate solutions.
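                | 
                | As a rough illustration of the idea (not Kimi's or
                | anyone's actual recipe; the token strings and bias
                | value here are assumptions), using the Hugging Face
                | LogitsProcessor interface:
                | 
                |     # Boost "pivot" tokens during decoding.
                |     from transformers import (AutoTokenizer,
                |                               LogitsProcessor)
                | 
                |     class PivotBoost(LogitsProcessor):
                |         def __init__(self, ids, bias=2.0):
                |             self.ids, self.bias = ids, bias
                | 
                |         def __call__(self, input_ids, scores):
                |             # add bias to raw logits, pre-softmax
                |             scores[:, self.ids] += self.bias
                |             return scores
                | 
                |     tok = AutoTokenizer.from_pretrained("gpt2")
                |     ids = [tok.encode(w)[0] for w in
                |            [" wait", " actually", " alternatively"]]
                |     # pass logits_processor=[PivotBoost(ids)]
                |     # to model.generate(...)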
        
             | robkop wrote:
             | > Are there specific benchmarks that compare models vs
             | themselves with and without scratchpads? High with:without
             | ratios being reasonier models?
             | 
             | Yes, simplest example:
             | https://www.anthropic.com/engineering/claude-think-tool
        
         | nodja wrote:
         | Any model that does thinking inside <think></think> style
         | tokens before it answers.
         | 
          | This can be done with finetuning/RL using an existing
          | pre-formatted dataset, or with format-based RL, where the
          | model is rewarded both for answering correctly and for
          | using the right format.
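          | 
          | As a toy illustration of the format-reward idea (the tag
          | regex, bonus value, and exact-match check are assumptions,
          | not Moonshot's actual reward):
          | 
          |     import re
          | 
          |     THINK = re.compile(r"<think>(.*?)</think>\s*(.+)",
          |                        re.DOTALL)
          | 
          |     def reward(completion, gold):
          |         m = THINK.search(completion)
          |         if not m:
          |             return 0.0           # wrong format, no reward
          |         _thoughts, answer = m.groups()
          |         fmt = 0.1                # small format bonus
          |         ok = 1.0 if answer.strip() == gold.strip() else 0.0
          |         return fmt + ok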
        
       | fragmede wrote:
       | The model's downloadable, which is generous, but it's not open
       | source.
        
       | jimnotgym wrote:
        | I was hoping this was about Summits On The Air... but no,
        | it's more boring AI.
        
       | aliljet wrote:
       | How does one effectively use something like this locally with
       | consumer-grade hardware?
        
         | tintor wrote:
          | Consumer-grade hardware? Even at 4 bits per param you would
          | need 500GB of GPU VRAM just to load the weights. You also
          | need VRAM for the KV cache.
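          | 
          | The back-of-envelope arithmetic:
          | 
          |     # Weights-only memory for a 1T-parameter model.
          |     params = 1e12
          |     for bits in (16, 8, 4):
          |         print(bits, params * bits / 8 / 1e9, "GB")
          |     # 16-bit: 2000 GB, 8-bit: 1000 GB, 4-bit: 500 GB
          |     # KV cache and runtime overhead come on top.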
        
           | CamperBob2 wrote:
           | It's MoE-based, so you don't need that much VRAM.
           | 
           | Nice if you can get it, of course.
        
         | oceansweep wrote:
          | Epyc Genoa CPU/mobo + 700GB of DDR5 RAM. The model is a
          | MoE, so you don't need to stuff it all into VRAM; you can
          | use a single 3090/5090 to hold the activated weights and
          | keep the remaining weights in DDR5 RAM. See their
          | deployment guide for reference here:
          | https://github.com/kvcache-
          | ai/ktransformers/blob/main/doc/en...
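          | 
          | A rough budget for that GPU+DDR5 split, assuming K2's
          | reported ~32B activated parameters out of ~1T total at
          | 4-bit (a sketch, not ktransformers' exact placement):
          | 
          |     total, active, bits = 1e12, 32e9, 4
          |     gpu = active * bits / 8 / 1e9  # ~16 GB, fits the GPU
          |     ram = (total - active) * bits / 8 / 1e9  # ~484 GB
          |     print(f"GPU ~{gpu:.0f} GB, RAM ~{ram:.0f} GB")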
        
         | simonw wrote:
         | Once the MLX community get their teeth into it you might be
         | able to run it on two 512GB M3 Ultra Mac Studios wired together
         | - those are about $10,000 each though so that would be $20,000
         | total.
         | 
         | Update: https://huggingface.co/mlx-community/Kimi-K2-Thinking -
         | and here it is running on two M3 Ultras:
         | https://x.com/awnihannun/status/1986601104130646266
        
       | smusamashah wrote:
        | When I open this page, all I see is a WordPad-like text area
        | with buttons on top and sample text inside. I don't see
        | anything about an LLM. I am on a phone; the page is being
        | opened via an embedded view in an HN client.
        
       | mmaunder wrote:
       | Any word on what it takes to run this thing?
        
       | blobbers wrote:
        | TL;DR: this is an Alibaba-funded startup out of Beijing.
        | 
        | Okay, I'm sorry, but I have to ask: wtf named this thing?
        | "Moonshot AI" is such an overused, generic name that I had to
        | ask an LLM which company this is. This is just Alibaba
        | hedging their Qwen model.
        | 
        | This company is far from "open source"; it's had over $1B USD
        | in funding.
        
         | hnhn34 wrote:
         | > Moonshot AI is such an overused generic name that I had to
         | ask an LLM which company this is
         | 
         | I just googled "Moonshot AI" and got the information right
          | away. Not sure what's confusing about it; the only other
         | "Moonshot" I know of is Alphabet's Moonshot Factory.
         | 
         | > This company is far from "open source", it's had over $1B USD
         | in funding.
         | 
         | Since when does open source mean you can't make any money?
          | Mozilla has a total of $1.2B in assets. The company itself
          | isn't open source, nor is it claiming to be.
         | 
         | This model was released under a "modified MIT-license" [0]:
         | 
         | > Our only modification part is that, if the Software (or any
         | derivative works thereof) is used for any of your commercial
         | products or services that have more than 100 million monthly
         | active users, or more than 20 million US dollars (or equivalent
         | in other currencies) in monthly revenue, you shall prominently
         | display "Kimi K2" on the user interface of such product or
         | service.
         | 
         | Which sounds pretty fair to me.
         | 
         | [0] -
         | https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
        
         | woadwarrior01 wrote:
         | > This company is far from "open source", it's had over $1B USD
         | in funding.
         | 
         | Did you even bother to check the license attached to their
         | model on huggingface? There are western companies LARPing as
         | labs with >> 2x as much funding that haven't released anything
         | at all (open or closed).
        
       | vinhnx wrote:
        | Kimi K2 Thinking, MiniMax M2 Interleaved Thinking: open
        | models are reaching, or have reached, frontier territory. We
        | now have GPT- and Claude-Sonnet-class capability at home,
        | since these models are open-weight. Around this time last
        | year we had the DeepSeek moment; now may be the time for
        | another.
        
         | almaight wrote:
         | Ring-1T
        
         | rdos wrote:
          | Benchmarks show that open models are equal to SOTA closed
          | ones, but my own experience and real-world use show the
          | opposite. And I really wish they were closer; I run GPT-OSS
          | 120b as a daily driver.
        
           | vinhnx wrote:
            | It could be that the remote inference providers have
            | issues, so the model can't show its potential or is
            | rate-limited. I also think Moonshot could take more time
            | and continue with a K2.1 or something, like DeepSeek did.
           | 
           | [0] https://x.com/Kimi_Moonshot/status/1986754111992451337
        
       | ElijahLynn wrote:
       | Why is the 4bit version 1.2TB and the non-4bit version 650GB?
       | 
       | https://huggingface.co/mlx-community/Kimi-K2-Thinking-4bit
       | https://huggingface.co/mlx-community/Kimi-K2-Thinking
        
         | kachapopopow wrote:
          | I think the default version here might be 2.5-bit or
          | something.
        
       | drumnerd wrote:
        | The page is so obviously written with AI that it isn't even
        | worth reading. Try the model if you will, but save yourself
        | the pain of reading AI slop.
        
       | yanhangyhy wrote:
       | As a Chinese user, I can say that many people use Kimi, even
       | though I personally don't use it much. China's open-source
       | strategy has many significant effects--not only because it aligns
       | with the spirit of open source. For domestic Chinese companies,
       | it also prevents startups from making reckless investments to
       | develop mediocre models. Instead, everyone is pushed to start
       | from a relatively high baseline. Of course, many small companies
       | in the U.S., Japan, and Europe are also building on Qwen. Kimi is
       | similar: before DeepSeek and others emerged, their model quality
       | was pretty bad. Once the open-source strategy was set, these
       | companies had no choice but to adjust their product lines and
       | development approaches to improve their models.
       | 
       | Moreover, the ultimate competition between models will eventually
       | become a competition over energy. China's open-source models have
       | major advantages in energy consumption, and China itself has a
       | huge advantage in energy resources. They may not necessarily
       | outperform the U.S., but they probably won't fall too far behind
       | either.
        
         | lettergram wrote:
          | There are a lot of indications that we're currently
          | brute-forcing these models. There's honestly no reason they
          | have to be 1T parameters and cost an insane amount to train
          | and run inference on.
          | 
          | What we're going to see, as energy becomes a problem, is a
          | shift to more effective and efficient architectures in both
          | physical hardware and model design. I suspect they can also
          | simply charge more for the service, which reduces usage for
          | senseless applications.
        
           | yanhangyhy wrote:
           | There are also elements of stock price hype and geopolitical
           | competition involved. The major U.S. tech giants are all tied
           | to the same bandwagon -- they have to maintain this cycle:
           | buy chips - build data centers - release new models - buy
           | more chips.
           | 
           | It might only stop once the electricity problem becomes truly
           | unsustainable. Of course, I don't fully understand the
           | specific situation in the U.S., but I even feel that one day
           | they might flee the U.S. altogether and move to the Middle
           | East to secure resources.
        
             | simpsond wrote:
             | Sundar is talking about fleeing earth to secure photons and
             | cooling in space.
        
               | amluto wrote:
               | Good luck. Space has lots of photons but really quite
               | poor cooling resources.
        
           | MallocVoidstar wrote:
           | > What we're going to see is as energy becomes a problem
           | 
           | This is much more likely to be an issue in the US than in
           | China. https://fortune.com/2025/08/14/data-centers-china-
           | grid-us-in...
        
             | thesmtsolver wrote:
             | Disagree. Part of the reason China produces more power (and
             | pollution) is due to China manufacturing for the US.
             | 
             | https://www.brookings.edu/articles/how-do-china-and-
             | america-...
             | 
              | China's energy sources are more fragile than those of
              | the US.
             | 
             | > Coal is by far China's largest energy source, while the
             | United States has a more balanced energy system, running on
             | roughly one-third oil, one-third natural gas, and one-third
             | other sources, including coal, nuclear, hydroelectricity,
             | and other renewables.
             | 
              | Also, China's economy is a bit less efficient in terms
              | of power used per unit of GDP, and it relies on coal
              | and imports.
             | 
             | > However, China uses roughly 20% more energy per unit of
             | GDP than the United States.
             | 
             | Remember, China still suffers from blackouts due to
              | manufacturing demand not matching supply. The Fortune
              | article seems like a fluff piece.
             | 
             | https://www.npr.org/2021/10/01/1042209223/why-covid-is-
             | affec...
             | 
             | https://www.bbc.com/news/business-58733193
        
               | mullingitover wrote:
               | These stories are from 2021.
               | 
               | China has been adding something like a 1GW coal plant's
               | worth of solar generation _every eight hours_ in the past
               | year, and the rate is accelerating. The US is no longer a
               | serious competitor for China when it comes to energy
               | production.
        
               | DeH40 wrote:
                | The reason it happened in 2021, I think, might be
                | that China took on the production-capacity gap caused
                | by COVID shutdowns in other parts of the world. The
                | short-term surge in production led to a temporary
                | imbalance in electricity supply and demand.
        
               | tmule wrote:
                | China's breakneck development is difficult for many
                | in the US to grasp (root causes: baselining on
                | sluggish domestic growth, and a condescending view of
                | China). This article offers a far more accurate
                | picture of how China is doing right now:
                | https://archive.is/wZes6
        
               | eddyg wrote:
               | Eye-opening summary... I knew China was ahead, but wow.
               | Thanks for sharing that article.
        
               | yanhangyhy wrote:
                | I don't remember many details about the situation in
                | 2021. But China is in a period of technological
                | explosion--many things are changing at incredible
                | speed. In just a few years, China may be completely
                | transformed in various fields.
                | 
                | Western media still carry strong biases toward
                | China's political system, and they have done far too
                | little to portray the country's real situation. The
                | narrative remains the same old one: "China succeeded
                | because it's capitalist," or "China is doomed because
                | it's communist."
                | 
                | But in reality, barely a few days go by without some
                | new technological breakthrough or innovation
                | happening in China. The pace of progress is so fast
                | that even people inside the country don't always keep
                | up with it. For example, just since the start of
                | November, we've seen China's space station crew doing
                | a barbecue in orbit, researchers in Hefei making new
                | progress on an artificial sun, and a team discovering
                | a safe and efficient method for preparing aromatic
                | amines. Apart from the space station bit--which got
                | some attention--the others barely made a ripple.
                | Also, China's first electromagnetic-catapult aircraft
                | carrier has officially entered service.
                | 
                | About a year ago, I started using Reddit intensively.
                | What I read most on Reddit are reports related to
                | electricity, because the topic involves environmental
                | protection, hatred of Trump, etc. There are a lot of
                | leftists, so the discussions are somewhat biased, but
                | the related news reports and data are real. China
                | reached its carbon peak in 2025, and this year it has
                | truly become a powerhouse in electricity. National
                | data centers are continuously being built, yet
                | residential electricity prices have never been and
                | will never be affected. China still has a lot of
                | coal-fired power, but it keeps upgrading those plants
                | technologically. At the same time, wind, solar,
                | nuclear and other sources are all advancing steadily.
                | China is the only country that is not controlled by
                | ideology and is increasing its electricity capacity
                | in a scientific way.
                | 
                | (Maybe in the AI field people like to talk more: not
                | only did Kimi release a new model, Xpeng also showed
                | a new robot that drew some attention. These all
                | happened within a few days.)
        
               | brazukadev wrote:
               | It's absolutely impressive to see China's development.
               | I'm happy my country is slowly but surely moving to
               | China's orbit of influence, especially economically.
        
               | yanhangyhy wrote:
                | If it's improving living standards for the people,
                | then it surely is a good thing.
        
               | sirtaj wrote:
               | "Not controlled by ideology" is a pretty bold statement
               | to make about a self-declared Communist single-party
               | country. There is always an ideology. You just happen to
               | agree with whatever this one is (Controlled-market
               | Communism? I don't know what the precise term is).
        
               | yanhangyhy wrote:
                | I can't edit this now, so I want to add a
                | clarification: I just meant on this specific topic,
                | electricity. China doesn't act like the US or
                | Germany, which abandoned wind or nuclear; it acts
                | based only on science.
        
               | dvt wrote:
               | > China is the only country that is not controlled by
               | ideology and is increasing its electricity capacity in a
               | scientific way.
               | 
               | Have recently noticed a lot of pro-CCP propaganda on
               | social media (especially Instagram and TikTok), but
               | strangely also on HN; kind of interesting. To anyone
               | making the (trivially false) claim that China is not
               | controlled by ideology, I'm not quite sure how you'd
               | convince them of the opposite. I'm not a doomer, but as
               | China ramps up their aggression towards Taiwan (and the
               | US will inevitably have to intervene), this will likely
               | not end well in the next 5-10 years.
        
               | yanhangyhy wrote:
                | I meant only on this specific topic: electricity.
                | Arguing about the other things is pointless, since HN
                | has the same political leaning as Reddit, so I'll
                | pass.
        
               | FooBarWidget wrote:
               | I also think that one claim is dubious, but do you really
               | have to focus on only that part to the exclusion of
               | everything else? All the progress made is real,
                | regardless of your opinion on the existence of ideology.
        
               | onion2k wrote:
                | _Part of the reason China produces more power (and
                | pollution) is due to China manufacturing for the US._
               | 
               | Presumably they'd stop doing that once AI becomes a more
               | beneficial use for the energy though.
        
               | bildung wrote:
                | As counterpoints to illustrate China's current
                | development:
                | 
                | * China has produced more PV panel capacity in the
                | first half of this year than the US has installed in
                | its entire history
               | 
               | * China alone has installed PV capacity of over 1000 GW
               | today
               | 
               | * China has installed battery electrical storage of about
               | 100 GW / 300 GWh today and aims to have 180 GW in 2027
        
           | simonw wrote:
           | > There's honestly not a reason they have to be 1T parameters
           | and cost an insane amount to train and run on inference.
           | 
           | Kimi K2 Thinking is rumored to have cost $4.6m to train -
           | according to "a source familiar with the matter":
           | https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-
           | rele...
           | 
           | I think the most interesting recent Chinese model may be
           | MiniMax M2, which is just 200B parameters but benchmarks
           | close to Sonnet 4, at least for coding. That's small enough
           | to run well on ~$5,000 of hardware, as opposed to the 1T
           | models which require vastly more expensive machines.
        
             | electroglyph wrote:
             | i assume that $4.6 mil is just the cost of the electricity?
        
               | simonw wrote:
               | Hard to be sure because the source of that information
               | isn't known, but generally when people talk about
               | training costs like this they include more than just the
               | electricity but exclude staffing costs.
               | 
               | Other reported training costs tend to include rental of
               | the cloud hardware (or equivalent if the hardware is
               | owned by the company), e.g. NVIDIA H100s are sometimes
               | priced out in cost-per-hour.
        
               | Der_Einzige wrote:
               | Citation needed on "generally when people talk about
               | training costs like this they include more than just the
               | electricity but exclude staffing costs".
               | 
               | It would be simply wrong to exclude the staffing costs.
               | When each engineer costs well over 1 million USD in total
               | costs year over year, you sure as hell account for them.
        
               | vanviegen wrote:
               | No, because what people are generally trying to express
               | with numbers like these, is how much compute went into
               | training. Perhaps another measure, like zettaflop or
               | something would have made more sense.
        
               | simonw wrote:
               | If you have 1,000 researchers working for your company
               | and you constantly have dozens of different training runs
                | on the go, overlapping each other, how would you split
               | those salaries between those different runs?
               | 
               | Calculating the cost in terms of GPU-hours is a whole lot
               | easier from an accounting perspective.
               | 
               | The papers I've seen that talk about training cost all do
               | it in terms of GPU hours. The gpt-oss model card said 2.1
               | million H100-hours for gpt-oss:120b. The Llama 2 paper
               | said 3.31M GPU-hours on A100-80G. They rarely give actual
               | dollar costs and I've never seen any of them include
               | staffing hours.
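                | 
                | As a rough conversion, assuming a placeholder rental
                | rate of ~$2 per GPU-hour (an assumption, and A100
                | rates differ from H100):
                | 
                |     rate = 2.00  # $/GPU-hour, an assumption
                |     for name, hours in [
                |         ("gpt-oss:120b, H100", 2.1e6),
                |         ("Llama 2, A100-80G", 3.31e6),
                |     ]:
                |         print(name, f"~${hours * rate / 1e6:.1f}M")
                |     # ~$4.2M and ~$6.6M respectively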
        
               | shawndrost wrote:
               | Do they include the costs of dead-end runs?
        
               | Der_Einzige wrote:
                | No, they don't! That's why the "$5.5 million"
                | DeepSeek V3 number as read by American investors was
                | total bullshit (investors ignored the asterisk saying
                | "only final training run").
        
               | simonw wrote:
               | Yeah, that's one of the most frustrating things about
               | these published numbers. Nobody ever wants to share how
                | much money they spent on runs that _didn't_ produce a
               | useful model.
               | 
               | As with staffing costs though it's hard to account for
               | these against individual models. If Anthropic run a bunch
               | of training experiments that help them discover a new
               | training optimization, then use that optimization as part
               | of the runs for the next Opus _and_ Sonnet _and_ Haiku
               | (and every subsequent model for the lifetime of the
               | company) how should the cost of that experimental run be
               | divvied up?
        
               | pama wrote:
               | Table 1:
               | 
               | https://arxiv.org/html/2412.19437v2
        
             | Der_Einzige wrote:
              | That number is as real as the $5.5 million to train
              | DeepSeek. Maybe it's real if you're only counting the
              | literal final training run, but with total costs,
              | including the huge number of failed runs and all other
              | costs, it's several hundred million to train a model
              | that's usually still worse than Claude, Gemini, or
              | ChatGPT. It took $1B+ (500 billion on energy and chips
              | ALONE) for Grok to get into the "big 4".
        
               | tw1984 wrote:
                | By that logic, one could even argue that the real
                | cost needs to include the infrastructure: total
                | investment in the semiconductor industry, the
                | national electricity grid, education, and even
                | defence.
        
               | Der_Einzige wrote:
               | Correct! You do have to account for all of these things!
               | Unironically correct! :)
        
             | nl wrote:
             | Can confirm MiniMax M2 is very impressive!
        
             | oxcidized wrote:
             | > That's small enough to run well on ~$5,000 of hardware...
             | 
             | Honestly curious where you got this number. Unless you're
             | talking about extremely small quants. Even just a Q4 quant
             | gguf is ~130GB. Am I missing out on a relatively cheap way
              | to run models of this size well?
             | 
             | I suppose you might be referring to a Mac Studio, but
             | (while I don't have one to be a primary source of
             | information) it seems like there is some argument to be
             | made on whether they run models "well"?
        
               | simonw wrote:
               | Yes, I mean a Mac Studio with MLX.
               | 
               | An M3 Ultra with 256GB of RAM is $5599. That should just
               | about be enough to fit MiniMax M2 at 8bit for MLX:
               | https://huggingface.co/mlx-community/MiniMax-M2-8bit
               | 
               | Or maybe run a smaller quantized one to leave more memory
               | for other apps!
               | 
               | Here are performance numbers for the 4bit MLX one:
               | https://x.com/ivanfioravanti/status/1983590151910781298 -
               | 30+ tokens per second.
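                | 
                | A minimal sketch of what that looks like in practice,
                | assuming the mlx-lm package's load/generate API:
                | 
                |     from mlx_lm import load, generate
                | 
                |     model, tokenizer = load(
                |         "mlx-community/MiniMax-M2-8bit")
                |     print(generate(model, tokenizer,
                |                    prompt="Hello!",
                |                    max_tokens=128))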
        
           | Leynos wrote:
           | Having larger models is nice because they have a much wider
           | sphere of knowledge to draw on. Not in the sense of using
           | them as encyclopedias. More in the sense that I want a model
           | that is going to be able to cross reference from multiple
           | domains that I might not have considered when trying to solve
           | a problem.
        
         | yanhangyhy wrote:
          | One thing to add: the most popular AI product in China is
          | not Kimi, I think; it should be Doubao by ByteDance
          | (TikTok's owner) and Yuanbao by Tencent. They have a better
          | UI and feature set, and you can also select the DeepSeek
          | model from them. Kimi still has a lot of users, but I think
          | in the long term it still may not do well. So is it still a
          | win for closed models?
        
         | dzonga wrote:
          | you guys will outperform the US, no doubt.
          | 
          | Energy generation at multiples of what the US is producing.
          | What does AI need? Energy.
          | 
          | Second, the open-source nature of the models means, as you
          | said, a high baseline to start from and faster iteration.
        
           | chaostheory wrote:
           | Going on a tangent, is Europe even close? Mistral has been
           | underwhelming
        
             | LaurensBER wrote:
             | Not anywhere near close.
             | 
             | Europe doesn't have the infrastructure (legal or energy)
             | and US companies offer far better compensation for talent.
             | 
             | But hey, at least we have AI regulation! (sad smile :))
        
             | troyvit wrote:
              | I don't know how close Europe is, but I'm sufficiently
              | whelmed by Mistral that I don't need to look elsewhere
              | yet. It's kind of like having a Toyota Corolla while
              | everybody else is driving around in smart cars, but it
              | gets it done. On top of that, there's a loyal community
              | that (maybe because I'm not looking) I don't see with
              | other products. It probably depends on your uses, but
              | if I spent all my time chasing the latest chat models
              | (like Kimi K2, for instance) I wouldn't actually get
              | anything done.
        
               | dehugger wrote:
               | Probably cuz you aren't looking yeah. Anthropic seems to
               | be leading the "loyalty" war in the US.
        
               | troyvit wrote:
               | Yeah and I'll probably end up going that way as work
               | locks down the models we're allowed to use, saving
               | Mistral for personal projects.
        
           | david927 wrote:
           | > will outperform
           | 
           |  _does outperform_
           | 
           | China is absolutely winning innovation in the 21st century.
           | I'm so impressed. For an example from just this morning,
           | there was an article that they're developing thorium reactor-
           | powered cargo ships. I'm blown away.
        
             | yanhangyhy wrote:
              | I remember this thing. The tech (thorium) is actually
              | from America, decades ago. But they gave it up, and
              | China has continued the work in recent years.
        
               | rmorey wrote:
               | "The tech is from America actually, decades ago... But
               | they give up and china continues the work"
               | 
               | Many such cases...
        
               | embedding-shape wrote:
               | > The tech is from America actually, decades ago.
               | (Thorium).
               | 
                | I guess it depends on how you see it, but regardless,
                | the people putting it to use today don't seem to be
                | in the US.
               | 
               | FWIW:
               | 
               | > Thorium was discovered in 1828 by the Swedish chemist
               | Jons Jacob Berzelius during his analysis of a new mineral
               | [...] In 1824, after more deposits of the same mineral in
               | Vest-Agder, Norway, were discovered [...] While thorium
               | was discovered in 1828 its first application dates only
               | from 1885, when Austrian chemist Carl Auer von Welsbach
               | invented the gas mantle [...] Thorium was first observed
               | to be radioactive in 1898, by the German chemist Gerhard
               | Carl Schmidt
               | 
                | For being an American discovery, it sure has a lot of
                | European people involved in it :) (I've said it
                | elsewhere but it's worth repeating: trying to track
                | down _where_ a technology/invention _actually_ comes
                | from is a fool's errand, and there is always
                | something earlier that led to today, so it doesn't
                | serve much purpose except nationalism, it seems to
                | me).
        
               | conradev wrote:
               | The tech we're talking about is using Thorium for useful
               | work:
               | 
               | https://en.wikipedia.org/wiki/Thorium-
               | based_nuclear_power#Hi...
               | 
               | https://en.wikipedia.org/wiki/Molten-
               | Salt_Reactor_Experiment
        
       | mensetmanusman wrote:
        | These models are interesting in how they censor depending on
        | the language of the request.
        
       | almaight wrote:
        | Ring-1T, a SOTA open-source trillion-parameter reasoning model
        
       | gradus_ad wrote:
       | While I absolutely support these open source models, there is an
       | interesting angle to consider... If I were a Chinese partisan
       | looking to inflict a devastating blow to the US, taking the AI
       | hype wind out of American tech valuation sails would seem a great
       | option. How best to do this? Release highly performant models...
       | For free! Extremely efficient in terms of RMB spent vs
       | (unrealized) USD lost. But surely, these model releases are just
       | the immaculate free market at work. No CCP pulling strings for
       | geo-political-industrial wins, certainly not.
        
         | eagleinparadise wrote:
          | But they're literally not free. If it were "war", with
          | infinite money to throw at destroying the US AI industry,
          | then why would you charge anything and reduce such an
          | outcome?
        
           | gradus_ad wrote:
           | Because subsidizing the necessary level of compute for that
           | is unsustainable. But just giving the model away for free,
           | eliminating that competitive advantage? Well, that itself is
           | free.
        
         | nsonha wrote:
         | Google Maps, GPS, the Internet etc being free are surely just a
         | CIA plan to take over the world
        
         | Palmik wrote:
         | On the other hand, several startups such as Cursor and
         | Cognition+Windsurf are building their new models on top of the
         | open source Chinese models.
         | 
         | Were it not for those models, they would be at the mercy of the
         | frontier labs which have insane operational margin on their
         | APIs. As a result you'd see much more consolidation.
        
         | kachapopopow wrote:
          | The government might be (relatively speaking) evil; the
          | people most definitely are not.
        
       | xrd wrote:
       | Is this a typo:
       | 
       | "Where p is the pdf of a random variable sampled by the given
       | procedure"
       | 
       | That was in the first expanded section when it discussed the PhD
       | level math problem it solved.
       | 
       | I'm not a Phd nor a Pdf but it seemed strange to me.
        
         | spenczar5 wrote:
         | no, "pdf" is a very typical shortening for "probability density
         | function," its correct.
        
       | baalimago wrote:
        | It's unfortunate how many of the 'non-mainstream' models are
        | poor at function calling. I'm trying K2 out via Novia AI and
        | it consistently fails to format function calls, breaking the
        | reasoning flow.
        
         | Palmik wrote:
          | This is most likely an issue on the inference provider's
          | side: https://github.com/MoonshotAI/K2-Vendor-Verifier
         | 
          | For example, Together AI has only a 71% success rate, while
          | the official API has a 100% success rate.
        
       | miletus wrote:
        | In our tests, Kimi K2 Thinking is better than literally
        | everything else - gpt-5, claude 4.5 sonnet. The only model
        | that is better than Kimi K2 Thinking is GPT-5 Codex.
       | 
       | It's now available on https://okara.ai if anyone wants to try it.
        
         | vessenes wrote:
          | Just tried it -- it is good to very good by my tests too.
          | Do you know what is great, though? The okara interface. I
          | used it on mobile and it was nearly pain-free and pretty to
          | boot. Really nice work by your product team.
        
       | abdellah123 wrote:
        | This should be compared with ChatGPT Pro; otherwise it's an
        | unfair comparison.
        | 
        | Either way, I tried it and it delivered. Kudos to the Kimi
        | team. Amazing work.
        
       | Mashimo wrote:
       | Oh neat. One of the examples is a Strudel.cc track.
       | 
        | I tried to get ChatGPT to create a song for me a few weeks
        | back, and it would always, very quickly, dream up methods
        | that don't exist.
        
         | Leynos wrote:
          | Kimi K2 seemingly has a much more up-to-date training set.
        
       | c0brac0bra wrote:
       | Kimi has been fantastic for brainstorming.
       | 
       | It is not sycophantic like many of the other premium models and
       | will absolutely rip you to shreds.
        
       | taf2 wrote:
        | Looks really amazing, but I'm wondering: is this one
        | available to download? I see this: "K2 Thinking is now live
        | on kimi.com under the chat mode [1], with its full agentic
        | mode available soon. It is also accessible through the Kimi
        | K2 Thinking API." But will this be on Hugging Face? I'd like
        | to give it a test run locally.
        
       ___________________________________________________________________
       (page generated 2025-11-07 23:02 UTC)