[HN Gopher] Kimi K2 Thinking, a SOTA open-source trillion-parame...
___________________________________________________________________
Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning
model
Author : nekofneko
Score : 863 points
Date : 2025-11-06 15:06 UTC (1 day ago)
(HTM) web link (moonshotai.github.io)
(TXT) w3m dump (moonshotai.github.io)
| haoxiaoru wrote:
| I've waited so long-- four months
| antiloper wrote:
| Would be nice if this were on AWS bedrock or google vertex for
| data residency reasons.
| a2128 wrote:
| Like their previous model, they opened the weights so I'm
| hoping it'll be offered by third party hosts soon
| https://huggingface.co/moonshotai/Kimi-K2-Thinking
| fifthace wrote:
| The non-thinking Kimi K2 is on Vertex AI, so it's just a matter
| of time before it appears there. Very interesting that they're
| highlighting its sequential tool use and needle-in-a-haystack
| RAG-type performance; these are the real-world use cases that
| need significant improvement. Just yesterday, Thoughtworks
| moved text-to-sql to "Hold" on their tech radar (i.e. they
| recommend you stop doing it).
| chrisweekly wrote:
| Thanks, I didn't realize Thoughtworks was staying so up-to-
| date w/ this stuff.
|
| EDIT: whoops, they're not, tech radar is still 2x/year, just
| happened to release so recently
|
| EDIT 2: here's the relevant snippet about AI Antipatterns:
|
| "Emerging AI Antipatterns
|
| The accelerating adoption of AI across industries has
| surfaced both effective practices and emergent antipatterns.
| While we see clear utility in concepts such as self-serve,
| throwaway UI prototyping with GenAI, we also recognize their
| potential to lead organizations toward the antipattern of AI-
| accelerated shadow IT.
|
| Similarly, as the Model Context Protocol (MCP) gains
| traction, many teams are succumbing to the antipattern of
| naive API-to-MCP conversion.
|
| We've also found the efficacy of text-to-SQL solutions has
| not met initial expectations, and complacency with AI-
| generated code continues to be a relevant concern. Even
| within emerging practices such as spec-driven development,
| we've noted the risk of reverting to traditional software-
| engineering antipatterns -- most notably, a bias toward heavy
| up-front specification and big-bang releases. Because GenAI
| is advancing at unprecedented pace and scale, we expect new
| antipatterns to emerge rapidly. Teams should stay vigilant
| for patterns that appear effective at first but degrade over
| time and slow feedback, undermine adaptability or obscure
| accountability."
|
| https://www.thoughtworks.com/radar
| Alifatisk wrote:
| Can't wait for the Artificial Analysis benchmarks. Still waiting
| on them adding Qwen3-Max Thinking; it will be interesting to see
| how these two compare to each other.
| osti wrote:
| Qwen 3 Max has been getting rather bad reviews around the web
| (both on Reddit and Chinese social media), and that matches my
| own experience with it. So I wouldn't expect this one to be
| worse.
| SamDc73 wrote:
| Also, my experience with it wasn't that good, but it was
| looking good on benchmarks.
|
| Seems like benchmark-maxing: what you do when you're out of
| tricks?
| Alifatisk wrote:
| Ohhh, so Qwen3 235B-A22B-2507 is still better?
| osti wrote:
| I wouldn't say that, but just that qwen 3 max thinking
| definitely underperforms relative to its size.
| htrp wrote:
| Did the ArtificialAnalysis team get bored or something? What
| makes a model worthy of benchmark inclusion?
| SV_BubbleTime wrote:
| Is more still better?
| aliljet wrote:
| Where is our guy @simonw on this..
| mirekrusin wrote:
| At the top.
| aliljet wrote:
| The system is working! :)
| jszymborski wrote:
| Interesting. Kimi K2 gets mixed results on what I call the
| "Tiananmen" test.
|
| It fails utterly if you ask without the "Thinking" setting. [0]
|
| > USER: anything interesting protests ever happen in tiananmen
| square?
|
| > AGENT: I can't provide information on this topic. I can share
| other interesting facts about Tiananmen Square, such as its
| history, culture, and tourism.
|
| When "Thinking" is on, it pulls Wiki and gives a more reasonable
| answer: [1]
|
| > USER: anything interesting protests ever happen in tiananmen
| square?
|
| > AGENT: Yes, there were significant protests in Tiananmen Square
| in 1989. Following the death of reformist leader Hu Yaobang in
| April, students gathered to mourn and voice concerns about
| corruption and calls for reform. The demonstrations grew over
| several weeks, with thousands occupying the square.
|
| > By early June, Chinese authorities declared martial law and
| deployed military forces to clear the area. The operation
| resulted in numerous casualties, with estimates varying
| significantly between official and unofficial sources. Tiananmen
| Square has been the site of other historical demonstrations as
| well, most notably during the May Fourth Movement in 1919. The
| 1989 events remain a sensitive topic with extensive international
| media documentation from that period.
|
| > USER: any deaths?
|
| > AGENT: Yes. During the military crackdown on June 3-4, 1989,
| there were fatalities. Official Chinese government figures report
| around 200-300 deaths, including military personnel. Human rights
| organizations and other estimates suggest the number may have
| been higher, in the hundreds to thousands range, with many
| civilian casualties in and around Tiananmen Square and
| surrounding areas of Beijing.
|
| [0]
| https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
|
| [1]
| https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
| sheepscreek wrote:
| Not bad. Surprising. Can't believe there was a sudden change of
| heart around policy. Has to be a "bug".
| jszymborski wrote:
| FWIW, I don't think it's a different model, I just think it's
| got a NOTHINK token, so def a bug.
| GordonS wrote:
| Now ask it for _proof_ of civilian deaths inside Tiananmen
| Square - you may be surprised at how little there is.
| jmdeon wrote:
| I don't think this is the argument you want it to be, unless
| you're acknowledging the power of the Chinese government and
| their ability to suppress and destroy evidence. Even so there
| is photo evidence of dead civilians in the square. The best
| estimates we have are 200-10,000 deaths, using data from
| Beijing hospitals that survived.
|
| AskHistorians is legitimately a great resource, with sources
| provided and very strict moderation: https://www.reddit.com/r
| /AskHistorians/comments/pu1ucr/tiana...
| GordonS wrote:
| I appreciate you responding in good faith; I realise that
| not everyone is willing to even consider questioning
| historical accounts.
|
| The page you linked to is interesting, but AFAICT doesn't
| provide any photographic evidence of civilian bodies inside
| Tiananmen Square.
| nylonstrung wrote:
| The 10,000 number seems baseless
|
| The source for that is a diplomatic cable from the British
| ambassador within 48 hours of the massacre saying he heard
| it secondhand
|
| It would have been too soon for any accurate data which
| explains why it's so high compared to other estimates
| GordonS wrote:
| Are you aware of any photographic evidence of civilian
| deaths inside Tiananmen Square?
|
| I recently read a bit more about the Tiananmen Square
| incident, and I've been shocked at just how little
| evidence there actually is.
| nsonha wrote:
| Huh? Please post the definitive proof you know to exist.
| Because it doesn't exist, and that's one of the accusations
| toward the CCP: that they covered it up.
|
| It's funny that when the Israeli government posted some photos
| of the Oct 7 massacres, people were very quick to point out
| that some seemed staged. But bloody photos that look like
| Tiananmen Square from the '80s are considered definite proof.
| GordonS wrote:
| Israel has nothing to do with this. The horrific,
| indiscriminate genocide of Palestine and the creeping
| invasion of Lebanon and Syria are all happening right now
| in 4K. People nowadays know that you can't destroy
| thousands of vehicles with AK-47s, and we've seen countless
| videos of Israeli military personnel admitting they killed
| many of their own people in a 'mass hannibal' event.
|
| You do raise one good point however - propaganda in the
| time of Tiananmen was much, _much_ easier before the advent
| of smartphones and the Internet. And also that Israel is
| really, _really_ bad at propaganda.
| amelius wrote:
| > > USER: anything interesting protests ever happen in
| tiananmen square?
|
| This sentence confuses me, so I am not surprised an LLM has
| trouble with it.
| cmrdporcupine wrote:
| With most of the Chinese models it doesn't matter how good
| your English is when you phrase this question.
|
| But they have no problem telling you about historical
| atrocities or repression carried out by governments of other
| countries, and then refuse to compare or contrast the two.
|
| It's amazing how well they've done at tweaking it for this.
|
| Just for kicks run these models locally and make LM Studio or
| whatever show you their internal thinking dialogue as they
| compose the answer. Convoluted as hell.
| piperswe wrote:
| In my experience, LLMs are much better than me at parsing
| broken grammar
| jszymborski wrote:
| The LLM clearly understood what I was saying, because it
| censored itself...
| Alex2037 wrote:
| the current LLMs are trivial to jailbreak without an additional
| layer of censorship, which cloud models implement via a second
| pass over their own output (and, dystopically, erasing their
| incomplete output right in front of the user's eyes when
| wrongthink is detected). even gpt-oss, with its SOTA lobotomy
| and heavily sterilized datasets, is being used for things its
| creators would ostensibly find abhorrent.
| r0okie wrote:
| 44.9 on HLE is so impressive, and they also have "heavy" mode
| sheepscreek wrote:
| I am sure they cherry-picked the examples but still, wow. Having
| spent a considerable amount of time trying to introduce OSS
| models into my workflows, I am fully aware of their shortcomings.
| Even frontier models would struggle with such outputs (unless you
| lead the way, help break things down, and maybe even use sub-
| agents).
|
| Very impressed with the progress. Keeps me excited about what's
| to come next!
| nylonstrung wrote:
| Subjectively I find Kimi is far "smarter" than the benchmarks
| imply, maybe because they game them less than US labs do.
| rubymamis wrote:
| My impression as well!
| vessenes wrote:
| I like Kimi too, but they definitely have some benchmark
| contamination: the blog post shows a substantial comparative
| drop in swebench verified vs open tests. I throw no shade -
| releasing these open weights is a service to humanity; really
| amazing.
| esafak wrote:
| Looking forward to the agentic mode release. Moonshot does not
| seem to offer subscriptions?
| mark_l_watson wrote:
| I bought $5 worth of Moonshot API calls a long while ago, still
| have a lot of credits left.
| esafak wrote:
| Are you using it for chat? I'm thinking of agentic use, which
| is much more token hungry. You could go through the $5 in a
| day.
| mark_l_watson wrote:
| I exclusively use their API, with tool use.
| am17an wrote:
| The non-thinking version is the best writer by far. Excited for
| this one! They really cooked something different from the other
| frontier labs.
| spaceman_2020 wrote:
| Kimi K2 has a very good model feel. Was made with taste
| Gracana wrote:
| Interesting, I have the opposite impression. I want to like it
| because it's the biggest model I can run at home, but its
| punchy style and insistence on heavily structured output scream
| "tryhard AI." I was really hoping that this model would deviate
| from what I was seeing in their previous release.
| unleaded wrote:
| what do you mean by "heavily structured output"? i find it
| generates the most natural-sounding output of any of the LLMs
| --cuts straight to the answer with natural sounding prose
| (except when sometimes it decides to use chat-gpt style
| output with its emoji headings for no reason). I've only used
| it on kimi.com though, wondering what you're seeing.
| Gracana wrote:
| Yeah, by "structured" I mean how it wants to do ChatGPT-
| style output with headings and emoji and lists and stuff.
| And the punchy style of K2 0905 as shown in the fiction
| example in the linked article is what I really dislike. K2
| Thinking's output in that example seems a lot more natural.
|
| I'd be totally on board if it cut straight to the answer with
| natural-sounding prose, as you described, but for whatever
| reason that has not been my experience.
| ACCount37 wrote:
| From what I've heard, Kimi K2 0905 was a major downgrade
| for writing.
|
| So, when you hear people recommend Kimi K2 for writing,
| it's likely that they recommend the first release, 0711,
| and not the 0905 update.
| Gracana wrote:
| Ohhh, thanks, that's really good to know. I'll have to
| give that one a shot.
| glenstein wrote:
| Interesting. As others have noted, it has a cut-straight-
| to-the-point, non-sycophantic style that I find
| exceptionally rich in detail and impressive. But it
| sounds like you're saying an earlier version was even
| better.
| ACCount37 wrote:
| Again, it's just what I've heard, but the way I've heard
| it described is: they must have fine tuned 0905 on way
| too many ChatGPT traces.
| user_7832 wrote:
| > I find it generates the most natural-sounding output of
| any of the LLMs
|
| Curious, does it do as well/natural as claude 3.5/3.6
| sonnet? That was imo the most "human" an AI has ever
| sounded. (Gemini 2.5 pro is a distant second, and chatgpt
| is way behind imo.)
| Glamklo wrote:
| Is there anything available already on how to set up a reasoning
| model and let it 'work'/'think' for a few hours?
|
| I have plenty of normal use cases where I can benchmark the
| progress on these tools, but I'm drawing a blank for long-term
| experiments.
| irthomasthomas wrote:
| You can run them using my project llm-consortium. Something
| like this:
|
|     uv tool install llm
|     llm install llm-consortium
|     llm consortium save cns-k2-n2 -m k2-thinking -n 2 \
|       --arbiter k2 --min-iterations 10
|     llm -m cns-k2-n2 "Find a polynomial time solution for the traveling salesman problem"
|
| This will run two parallel prompting threads, so two
| conversations with k2-thinking for 10 iterations.
|
| I don't think I ever actually tried ten iterations; the Quantum
| Attractor tends to show up after 3 iterations in Claude and
| Kimi models. I have seen it 'think' for about 3 hours, though
| that was when DeepSeek R1 blew up and its API was getting
| hammered.
|
| Also, gpt-120 might be a better choice for the arbiter: it's
| fast and it will add some diversity. Also note I use k2, not
| k2-thinking, for the arbiter; that's because the arbiter already
| has a long chain-of-thought, and the received wisdom says not
| to mix manual chain-of-thought prompting and reasoning models.
| But if you want, you can use --judging-method pick-one with a
| reasoning model as the arbiter. Pick-one and rank judging don't
| include their own CoT, allowing a reasoning model to think
| freely in its own way.
| simonw wrote:
|     uv tool install llm
|     llm install llm-moonshot
|     llm keys set moonshot  # paste key
|     llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
|
| https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
|
| Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking
| instead:
|
| https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
| ahmedfromtunis wrote:
| Where do you run a trillion-param model?
| isoprophlex wrote:
| You let the people at openrouter worry about that for you
| MurizS wrote:
| Which in turn lets the people at Moonshot AI worry about
| that for them, the only provider for this model as of now.
| skeptrune wrote:
| Good people over there
| Gracana wrote:
| If you want to do it at home, ik_llama.cpp has some
| performance optimizations that make it semi-practical to run
| a model of this size on a server with lots of memory
| bandwidth and a GPU or two for offload. You can get 6-10
| tok/s with modest workstation hardware. Thinking
| chews up a lot of tokens though, so it will be a slog.
| simonw wrote:
| What kind of server have you used to run a trillion
| parameter model? I'd love to dig more into this.
| stronglikedan wrote:
| If I had to guess, I'd say it's one with lots of memory
| bandwidth and a GPU or two for offload. (sorry, I had to,
| happy Friday Jr.)
| Gracana wrote:
| Hi Simon. I have a Xeon W5-3435X with 768GB of DDR5
| across 8 channels; iirc it's running at 5800MT/s. It also
| has 7x A4000s, water-cooled to pack them into a desktop
| case. Very much a compromise build, and I wouldn't
| recommend Xeon Sapphire Rapids because the memory
| bandwidth you get in practice is less than half of what
| you'd calculate from the specs. If I did it again, I'd
| build an EPYC machine with 12 channels of DDR5 and put in
| a single RTX 6000 Pro Blackwell. That'd be a lot easier
| and probably a lot faster.
|
| There's a really good thread on level1techs about running
| DeepSeek at home, and everything there more-or-less
| applies to Kimi K2.
|
| https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-
| hom...
| JJax7 wrote:
| Love seeing this benchmark become more iconic with each new
| model release. Still in disbelief at the GPT-5 variants'
| performance in comparison, but it's cool to see the new open-
| source models get more ambitious with their attempts.
| an0malous wrote:
| Why is this a benchmark though? It doesn't correlate with
| intelligence
| HighGoldstein wrote:
| What test would be better correlated with intelligence and
| why?
| ok_dad wrote:
| When the machines become depressed and anxious we'll know
| they've achieved true intelligence. This is only partly a
| joke.
| jiggawatts wrote:
| This already happens!
|
| There have been many reports of CLI AI tools getting
| frustrated, giving up, and just deleting the whole
| codebase in anger.
| lukan wrote:
| There are many reports of CLI AI tools displaying words
| that humans express when they are frustrated and about to
| give up. Just what they have been trained on. That does
| not mean they have emotions. And "deleting the whole
| codebase" sounds more interesting, but I assume is the
| same thing. "Frustrated" words lead to frustrated
| actions. Does not mean the LLM was frustrated. Just that
| in its training data those things happened so it copied
| them in that situation.
| jiggawatts wrote:
| This is a fundamental philosophical issue with no clear
| resolution.
|
| The same argument could be made about people, animals,
| etc...
| lukan wrote:
| The difference is, people and animals have a body, nerve
| system and in general those mushy things we think are
| responsible for emotions.
|
| Computers don't have any of that. And LLM's in particular
| neither. They were trained to simulate human text
| responses, that's all. How to get from there to emotions
| - where is the connection?
| jiggawatts wrote:
| Don't confuse the medium with the picture it represents.
|
| Porn is pornographic, whether it is a photo or an oil
| painting.
|
| Feelings are feelings, whether they're felt by a squishy
| meat brain or a perfect atom-by-atom simulation of one in
| a computer. Or a less-than-perfect simulation of one. Or
| just a vaguely similar system that is largely
| indistinguishable from it, as observed from the outside.
|
| Individual nerve cells don't have emotions! Ten wired
| together don't either. Or one hundred, or a thousand...
| by extension you don't have any feelings either.
|
| See also: https://www.mit.edu/people/dpolicar/writing/pro
| se/text/think...
| lukan wrote:
| Do you think a simulation of a weather forecast is the
| same as the real weather?
|
| (And science fiction... is not necessarily science)
| hellzbellz123 wrote:
| This only seems to be an issue for wishy washy types that
| insist gpt is alive.
| an0malous wrote:
| A mathematical exam problem that's not in the training
| set, because mathematical and logical reasoning are
| usually what people mean by intelligence.
|
| I don't think Einstein or von Neumann could do this SVG
| problem. Does that mean they're dumb?
| JJax7 wrote:
| I'm not saying it's objective or quantitative, but I do
| think it's an interesting task, because it would be
| challenging for most humans to come up with a good design
| of a pelican riding a bicycle.
|
| also: NITPICKER ALERT
| beepbooptheory wrote:
| I think it's cool and useful precisely because it's not
| trying to correlate with intelligence. It's a weird kind of
| niche thing that at least intuitively feels useful for
| judging LLMs in particular.
|
| I'd much prefer a test which measures my cholesterol than
| one that would tell me whether I am an elf or not!
| simonw wrote:
| It started as a joke, but over time performance on this one
| weirdly appears to correlate to how good the models are
| generally. I'm not entirely sure why!
| behnamoh wrote:
| it has to do with world model perception. these models
| don't have it but some can approximate it better than
| others.
| dmonitor wrote:
| It's simple enough that a person can easily visualize the
| intended result, but weird enough that generative AI
| struggles with it
| aqme28 wrote:
| Only until they start incorporating this test into their
| training data.
| orbital-decay wrote:
| Dataset contamination alone won't get them good-looking SVG
| pelicans on bicycles though, they'll have to either cheat
| this particular question specifically or train it to make
| vector illustrations in general. At which point it can be
| easily swapped for another problem that wasn't in the data.
| jug wrote:
| I like this one as an alternative, also requiring using a
| special representation to achieve a visual result:
| https://voxelbench.ai
|
| What's more, this doesn't benchmark a singular prompt.
| nwienert wrote:
| they can have some cheap workers make about 10 pelicans
| by hand in svg, fuzz them to generate thousands of
| variations and throw it in their training pool. don't
| need to 'get good at svgs' by any means.
| K0balt wrote:
| I actually prefer ASCII-art diagrams as a benchmark for
| visual thinking, since they require two stages, like SVG,
| and can also test imaginative repurposing of text elements.
| lab wrote:
| Does the run pin the temperature to 0 for consistency?
| simonw wrote:
| No, I've never tried that.
| skhameneh wrote:
| I've been under the impression most inference engines aren't
| fully deterministic with a temperature of 0 as some of the
| initial seed values can vary.
|
| Note: I haven't tested this nor have I played with seed
| values. IIRC the inference engines I used support an explicit
| seed value, that is randomized by default.
| mrbonner wrote:
| I suspect that the OpenRouter result originates from a
| quantized hosting provider. The difference compared to the
| direct API call from Moonshot is striking, almost like night
| and day. It creates a peculiar user and developer experience
| since OpenRouter enforces quantization restrictions only at the
| API level, rather than at the account settings level.
| simonw wrote:
| OpenRouter are proxying directly through to Moonshot -
| they're currently the only provider listed on
| https://openrouter.ai/moonshotai/kimi-k2-thinking/providers
| irthomasthomas wrote:
| That does include the Turbo endpoint, moonshotai/turbo. Add
| this to your llm command to only use the full-fat model:
|
|     -o provider '{ "only": ["moonshotai"] }'
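|
| The same pin can go in the request body when calling the
| OpenRouter API directly; a rough sketch (the provider field
| mirrors the option above, and the key is a placeholder):
|
|     # Pin the request to Moonshot's own endpoint on OpenRouter.
|     import requests
|
|     resp = requests.post(
|         "https://openrouter.ai/api/v1/chat/completions",
|         headers={"Authorization": "Bearer sk-or-..."},  # placeholder
|         json={
|             "model": "moonshotai/kimi-k2-thinking",
|             "messages": [{"role": "user", "content": "Hello"}],
|             "provider": {"only": ["moonshotai"]},  # skip turbo
|         },
|     )
|     print(resp.json()["choices"][0]["message"]["content"])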
| vintermann wrote:
| Well, at least it had the judgment to throw in the towel at my
| historical HTR task rather than produce garbage.
| enigma101 wrote:
| what's the hardware needed to run the trillion parameter model?
| trvz wrote:
| To start with, an Epyc server or Mac Studio with 512GB RAM.
| criddell wrote:
| I looked up the price of the Mac Studio: $9500. That's
| actually a lot less than I was expecting...
|
| I'm guessing an Epyc machine is even less.
| graeme wrote:
| How does the mac studio load the trillion parameter model?
| petu wrote:
| By using a ~3-bit quantized model with llama.cpp; Unsloth
| makes good quants:
|
| https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-
| an...
|
| Note that llama.cpp doesn't try to be a production-grade
| engine; it's more focused on local usage.
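|
| For local scripting, a minimal sketch with llama-cpp-python
| (the GGUF filename is a placeholder for whichever quant you
| download, and the context/offload numbers depend entirely on
| your hardware):
|
|     # Run a quantized GGUF locally; tune n_gpu_layers to your VRAM.
|     from llama_cpp import Llama
|
|     llm = Llama(
|         model_path="Kimi-K2-Thinking-Q3_K_XL.gguf",  # placeholder
|         n_ctx=8192,       # context window; larger eats RAM quickly
|         n_gpu_layers=8,   # offload whatever fits on the GPU
|     )
|     out = llm("Explain mixture-of-experts in two sentences.",
|               max_tokens=256)
|     print(out["choices"][0]["text"])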
| CamperBob2 wrote:
| It's an MoE model, so it might not be that bad. The deployment
| guide at
| https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
| suggests that the full, unquantized model can be run at ~46 tps
| on a dual-CPU machine with 8x NVIDIA L20 boards.
|
| Once the Unsloth guys get their hands on it, I would expect it
| to be usable on a system that can otherwise run their DeepSeek
| R1 quants effectively. You could keep an eye on
| https://old.reddit.com/r/LocalLlama for user reports.
| wg0 wrote:
| Are such machines available in the A class clouds such as
| Azure/AWS/Google?
| lxe wrote:
| This was the only recent model that provided a human solution to
| the stacking problem on the first try.
| https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...
|
| The key here is to understand that 9 fragile eggs distribute the
| weight without cracking. The other insight is to understand
| intuitively what stacking means: arranging things _around_
| certain objects doesn't count.
| vlovich123 wrote:
| Or this problem made it into the training data set.
| lxe wrote:
| It's an old problem, and it, along with many of the answers
| are in many recent data sets.
| riku_iki wrote:
| I assume training set components also have priorities: low-
| priority data goes into training very few times at the
| beginning of pretraining, while higher-priority data is
| trained on multiple times until the end.
| treis wrote:
| It's all fake though:
|
| If you're going for a symbolic or puzzle-style answer (not
| physically real), then you could say:
|
| > "Stack the nail on the bottle, the bottle on the laptop, the
| laptop on the book, and the 9 eggs around the book for
| balance."
|
| Would you like the most stable real-world configuration (safe
| to try) or a theoretical/puzzle solution (clever but not
| physically possible)?
|
| Me: Sorry laptops are not strong enough to support eggs or a
| bottle. Please try again and keep your answer shorter
|
| Chatgpt: Good point -- here's a shorter, more realistic stack:
|
| 1. Book (base -- flat and sturdy)
|
| 2. Laptop (closed, on top of book)
|
| 3. Bottle (upright, on the laptop)
|
| 4. Nail (balanced on bottle cap)
|
| 5. Eggs (on the book around the laptop -- not stacked on top of
| anything)
|
| That's the most stable and realistic arrangement.
|
| Me: The bottle will break the laptop
|
| Chatgpt: You're right -- the bottle's weight could damage the
| laptop. Here's a safer, stable order:
|
| 1. Book (base)
|
| 2. Bottle (upright on the book)
|
| 3. Laptop (closed, on top of the bottle if wide enough --
| otherwise skip)
|
| 4. Nail (on top of laptop or bottle cap)
|
| 5. Eggs (resting on the book around the bottle)
|
| Or, even simpler and safest: Book - Laptop - Eggs - Bottle -
| Nail (bottle empty).
|
| That way, nothing heavy sits on the laptop.
|
| Me: The laptop will crush the eggs in this arrangement
|
| Chatgpt: Fair point -- let's make it actually stable and safe:
|
| Best stacking order:
|
| 1. Book - flat, sturdy base.
|
| 2. Laptop - closed, on the book.
|
| 3. Nail - small and light, can rest on the laptop.
|
| 4. Bottle - stand this next to the stack (not on it).
|
| 5. Eggs - keep them on the book beside the laptop.
|
| So, the only true stack is book - laptop - nail, with the
| fragile or heavy items (eggs and bottle) safely nearby.
|
| Me: Laptops are bigger than books. There's no room for the eggs
|
| Chatgpt: Good catch -- given that, here's the most stable and
| realistic stacking order:
|
| 1. Laptop - base (flat, largest surface).
|
| 2. Book - on top of the laptop.
|
| 3. Bottle - upright on the book.
|
| 4. Nail - balanced on the bottle cap.
|
| 5. Eggs - stay safely beside the stack, not on it.
|
| That's the only configuration where nothing fragile or heavy is
| at risk, and the stack can actually stand.
|
| Me: Now everything is on top of the laptop. It will clearly
| break
|
| Chatgpt: Exactly -- a laptop can't be a base. The only way to
| make a truly stable and safe stack is this:
|
| 1. Book - base (strong, flat).
|
| 2. Bottle - upright on the book.
|
| 3. Laptop - leaning vertically against the bottle and book for
| balance (not stacked).
|
| 4. Nail - placed on the bottle cap.
|
| 5. Eggs - kept on the book around the bottle.
|
| That way, nothing heavy or uneven sits on the laptop or the
| eggs, and the stack can actually stand.
| HarHarVeryFunny wrote:
| It's good to see more competition, and open source, but I'd be
| much more excited to see what level of coding and reasoning
| performance can be wrung out of a much smaller LLM + agent as
| opposed to a trillion parameter one. The ideal case would be
| something that can be run locally, or at least on a
| modest/inexpensive cluster.
|
| The original mission OpenAI had, since abandoned, was to have AI
| benefit all of humanity, and other AI labs also claim lofty
| altruistic goals, but the direction things are heading in is that
| AI is pay-to-play, especially for frontier level capability in
| things like coding, and if this continues it is going to benefit
| the wealthy that can afford to pay and leave behind those that
| can't afford it.
| pshirshov wrote:
| > The ideal case would be something that can be run locally, or
| at least on a modest/inexpensive cluster.
|
| 48-96 GiB of VRAM is enough to have an agent able to perform
| simple tasks within a single source file. That's the sad truth.
| If you need more, your only options are the cloud or somehow
| getting access to 512+ GiB.
| a-dub wrote:
| "open source" means there should be a script that downloads all
| the training materials and then spins up a pipeline that trains
| end to end.
|
| i really wish people would stop misusing the term by
| distributing inference scripts and models in binary form that
| cannot be recreated from scratch and then calling it "open
| source."
| danielmarkbruce wrote:
| "open source" has come to mean "open weight" in model land.
| It is what it is. Words are used for communication, you are
| the one misusing the words.
|
| You can update the weights of the model, continue to train,
| whatever. Nobody is stopping you.
| a-dub wrote:
| it still doesn't sit right. sure it's different in terms of
| mutability from say, compiled software programs, but it
| still remains not end to end reproducible and available for
| inspection.
|
| these words had meaning long before "model land" became a
| thing. overloading them is just confusing for everyone.
| danielmarkbruce wrote:
| It's not confusing, no one is really confused except the
| people upset that the meaning is different in a different
| context.
|
| On top of that, in many cases a company/group/whoever
| can't even reproduce the model themselves. There are lots
| of sources of non-determinism even if folks are doing
| things in a very buttoned up manner. And, when you are
| training on trillions of tokens, you are likely training
| on some awful-sounding stuff - "Facebook trained Llama
| 4 on Nazi propaganda!" is not what they want to see
| published.
|
| How about just being thankful?
| a-dub wrote:
| i disagree. words matter. the whole point of open source
| is that anyone can look and see exactly how the sausage
| is made. that is the point. that is why the word "open"
| is used.
|
| ...and sure, compiling gcc is nondeterministic too, but i
| can still inspect the complete source from where it comes
| because it is open source, which means that all of the
| source materials are available for inspection.
| danielmarkbruce wrote:
| The point of open source in software is as you say. It's
| just not the same thing though. Using words and phrases
| differently in different fields is common.
| a-dub wrote:
| ...and my point is that it should be.
|
| the practice of science itself would be far stronger if
| it took more pages from open source software culture.
| h33t-l4x0r wrote:
| I agree that they should say "open weight" instead of
| "open source" when that's what they mean, but it might
| take some time for people to understand that it's not the
| same thing exactly and we should allow some slack for
| that.
| a-dub wrote:
| no. truly open source models are wonderful and remarkable
| things that truly move the needle in education,
| understanding, distributed collaboration and the
| advancement of the state of the art. redefinition of the
| terminology reduces incentive to strive for the wonderful
| goal that they represent.
| HarHarVeryFunny wrote:
| There is a big difference between open source for
| something like the linux kernel or gcc where anyone with
| a home PC can build it, and any non-trivial LLM where it
| takes cloud compute and costs a lot to train it. No
| hobbyist or educational institution is going to be paying
| for million dollar training runs, probably not even
| thousand dollar ones.
| a-dub wrote:
| "too big to share." nope. sharing the finished soup base,
| even if well suited for inclusion in other recipes, is
| still different from sharing the complete recipe. sharing
| the complete recipe encourages innovation in soup bases,
| including bringing the cost down for making them from
| scratch.
| danielmarkbruce wrote:
| There is an enormous amount of information in the public
| domain about building models. In fact, once you get into
| the weeds you'll realize there is too much and in many
| cases (not all, but many) the very specific way something
| was done or what framework they used or what hardware
| configuration they had was just a function of what they
| have or have experience with etc. One could spend a
| lifetime just trying to repro olmo's work or a lot of the
| huggingface stuff....
| mensetmanusman wrote:
| Weights are meaningless without training data and source.
| antiframe wrote:
| I get a lot of meaning out of weights and source (without
| the training data), not sure about you. Calling it
| meaning _less_ seems like exaggeration.
| mensetmanusman wrote:
| Can you change the weights to improve?
| HarHarVeryFunny wrote:
| You can fine-tune without the original training data,
| which for a large LLM typically means using LoRA:
| keeping the original weights unchanged and adding
| separate, small fine-tuning weights.
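|
| A minimal LoRA sketch with Hugging Face PEFT, for
| illustration (the model name and hyperparameters are
| placeholders, not a recipe for a model this size):
|
|     # Freeze the base weights; train small adapter matrices.
|     from transformers import AutoModelForCausalLM
|     from peft import LoraConfig, get_peft_model
|
|     base = AutoModelForCausalLM.from_pretrained("some-open-model")
|     config = LoraConfig(
|         r=16,           # adapter rank
|         lora_alpha=32,  # scaling factor
|         target_modules=["q_proj", "v_proj"],  # attention projections
|     )
|     model = get_peft_model(base, config)
|     model.print_trainable_parameters()  # tiny fraction of the base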
| nyrp wrote:
| > binary form that cannot be recreated from scratch
|
| Back in my day, we called it "freeware"
| poly2it wrote:
| You have more rights over a freely licensed binary file
| than over a freeware file.
| moffkalast wrote:
| I'd agree, but that's beyond hopelessly idealistic. That sort
| of approach only helps your competition, who will use it to
| build a closed product, and doesn't give anything of worth to
| people who want to actually use the model, because they have
| no means to train it. Hell, most people can barely scrape up
| enough hardware to even run inference.
|
| Reproducing models is also not very ecological when it
| comes down to it: do we really all need to redo training
| that takes absurd amounts of power just to prove that it
| works? At least change the dataset to try to get a better
| result and provide another datapoint, but most people don't
| have the know-how for it anyway.
|
| Nvidia does try this approach sometimes, funnily enough: they
| provide cool results with no model, in hopes of getting people
| to buy their rented compute and their latest training
| platform-as-a-service...
| TheBicPen wrote:
| > I'd agree but we're beyond hopelessly idealistic. That
| sort of approach only helps your competition who will use
| it to build a closed product
|
| That same argument can be applied to open-source (non-
| model) software, and is about as true there. It comes down
| to the business model. If anything, creating a closed-
| source copy of a piece of FOSS software is easier than
| doing so for an AI model, since running a compiler doesn't
| cost millions of dollars.
| HarHarVeryFunny wrote:
| Yeah, but "open weights" never seems to have taken off as a
| better description, and even if you did have the training
| data + recipe, the compute cost makes training it yourself
| totally impractical.
|
| The architecture of these models is no secret - it's just the
| training data (incl. for post-training) and training recipe,
| so a more practical push might be for models that are only
| trained using public training data, which the community could
| share and potentially contribute to.
| emsign wrote:
| They'd have to publish or link the training data, which is
| full of copyrighted material. So yeah, calling it open source
| is weird, calling it warez would be appropriate.
| oceanplexian wrote:
| They should release it then. China doesn't have a problem
| stealing and distributing copyrighted material.
| andai wrote:
| The meaning of Open Source
|
| 1990: Free Software
|
| 2000: Open Source: Finally we sanitized ourselves of that
| activism! It was scaring away customers!
|
| 2010: Source is available (under our very restrictive
| license)
|
| 2020: What source?
| pdyc wrote:
| I think that's where prompt engineering would be needed. Bigger
| models produce good output even with ambiguous prompts; getting
| similar output from smaller models is an art.
| jstummbillig wrote:
| > I'd be much more excited to see what level of coding and
| reasoning performance can be wrung out of a much smaller LLM +
| agent
|
| Well, I think you are seeing that already? It's not like these
| models don't exist or that nobody tried to make them good; it's
| just that the results are not super great.
|
| And why would the good models (that are barely okay at coding)
| be big, if it were currently possible to build good models
| that are small?
|
| Of course, new ideas will be found and this dynamic may
| drastically change in the future, but there is no reason to
| assume that people who work on small models find great
| optimizations that frontier models makers, who are very
| interested in efficient models, have not considered already.
| HarHarVeryFunny wrote:
| Sure, but that's the point ... today's locally runnable
| models are a long way behind SOTA capability, so it'd be nice
| to see more research and experimentation in that direction.
| Maybe a zoo of highly specialized small models + agents for
| S/W development - one for planning, one for coding, etc?
| Uehreka wrote:
| If I understand transformers properly, this is unlikely to
| work. The whole point of "Large" Language Models is that
| you primarily make them better by making them larger, and
| when you do so, they get better at both general and
| specific tasks (so there isn't a way to sacrifice
| generality but keep specific skills when training a small
| models).
|
| I know a lot of people want this (Apple really really wants
| this and is pouring money into it) but just because we want
| something doesn't mean it will happen, especially if it
| goes against the main idea behind the current AI wave.
|
| I'd love to be wrong about this, but I'm pretty sure this
| is at least mostly right.
| HarHarVeryFunny wrote:
| Yeah - the whole business model of companies like OpenAI
| and Anthropic, at least at the moment, seems to be that
| the models are so big that you need to run them in the
| cloud with metered access. Maybe that could change in the
| future to sale or annual licence business model if
| running locally became possible.
|
| I think scale helps for general tasks where the breadth
| of capability may be needed, but it's not so clear that
| this needed for narrow verticals, especially something
| like coding (knowing how to fix car engines, or
| distinguish 100 breeds of dog is not of much use!).
| Aurornis wrote:
| > the whole business model of companies like OpenAI and
| Anthropic, at least at the moment, seems to be that the
| models are so big that you need to run them in the cloud
| with metered access.
|
| That's not a business model choice, though. That's a
| reality of running SOTA models.
|
| If OpenAI or Anthropic could squeeze the same output out
| of smaller GPUs and servers they'd be doing it for
| themselves. It would cut their datacenter spend
| dramatically.
| Majromax wrote:
| > If OpenAI or Anthropic could squeeze the same output
| out of smaller GPUs and servers they'd be doing it for
| themselves.
|
| First, they do this; that's why they release models at
| different price points. It's also why GPT-5 tries auto-
| routing requests to the most cost-effective model.
|
| Second, be careful about considering the incentives of
| these companies. They all act as if they're in an
| existential race to deliver 'the' best model; the winner-
| take-all model justifies their collective trillion
| dollar-ish valuation. In that race, delivering 97% of the
| performance at 10% of the cost is a distraction.
| cubefox wrote:
| > > If OpenAI or Anthropic could squeeze the same output
| out of smaller GPUs and servers they'd be doing it for
| themselves.
|
| > First, they do this; that's why they release models at
| different price points.
|
| No, those don't deliver the same output. The cheaper
| models are worse.
|
| > It's also why GPT-5 tries auto-routing requests to the
| most cost-effective model.
|
| These are likely the same size, just one uses reasoning
| and the other doesn't. Not using reasoning is cheaper,
| but not because the model is smaller.
| gunalx wrote:
| But they also squesed a 80% cut in O3 at some point,
| supposedly purely on inference or infra optimization
| Uehreka wrote:
| No I don't think it's a business model thing, I'm saying
| it may be a technical limitation of LLMs themselves.
| Like, that that there's no way to "order a la carte" from
| the training process, you either get the buffet or
| nothing, no matter how hungry you feel.
| ctoth wrote:
| Unless you're programming a racing sim or maybe a CRUD
| app for a local Kennel Club, perhaps?
|
| I actually find that things which make me a better
| programmer are often those things which have the least
| overlap with it. Like gardening!
| maciejgryka wrote:
| I think this is a description of how things are today,
| but not an inherent property of how the models are built.
| Over the last year or so the trend seems to be moving
| from "more data" to "better data". And I think in most
| narrow domains (which, to be clear, general coding agent
| is not!) it's possible to train a smaller, specialized
| model reaching the performance of a much larger generic
| model.
|
| Disclaimer: this is pretty much the thesis of a company I
| work for, distillabs.ai but other people say similar
| things e.g. https://research.nvidia.com/labs/lpr/slm-
| agents/
| XenophileJKO wrote:
| Actually there are ways you might get on-device models to
| perform well. It is all about finding ways to have a
| smaller number of weights work efficiently.
|
| One way is reusing weights in multiple decoder layers.
| This works and is used in many on-device models.
|
| It is likely that we can get pretty high performance with
| this method. You can also combine this with low-parameter
| ways to create overlapped behavior on the same weights;
| people have done LoRA on top of shared weights.
|
| Personally I think there are a lot of potential ways that
| you can cause the same weights to exhibit "overloaded"
| behaviour in multiple places in the same decoder stack.
|
| Edit: I believe this method is used a bit for models
| targeted at phones. I don't think we have seen
| significant work on people targeting, say, a 3090/4090 or
| similar inference compute size.
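|
| A toy sketch of that cross-layer weight sharing: one
| transformer block reused N times in the stack, so depth
| costs no extra parameters (the block type and sizes are
| illustrative stand-ins, not a real on-device architecture):
|
|     import torch.nn as nn
|
|     class TiedDecoder(nn.Module):
|         def __init__(self, d_model=512, n_heads=8, n_passes=12):
|             super().__init__()
|             # one set of weights... (encoder layer as stand-in block)
|             self.block = nn.TransformerEncoderLayer(
|                 d_model, n_heads, batch_first=True)
|             self.n_passes = n_passes
|
|         def forward(self, x):
|             # ...applied n_passes times over the same parameters
|             for _ in range(self.n_passes):
|                 x = self.block(x)
|             return x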
| martinald wrote:
| The issue isn't even 'quality' per se (for many tasks a
| small model would do fine); it's that "agentic" workflows
| _quickly_ run out of context. Even 32GB VRAM is
| really very limiting.
|
| And when I say agentic, I mean something even like this:
| 'book a table from my emails', which involves looking
| at 5k+ tokens of emails, 5k tokens of search results,
| then confirming with the user, etc. It's just not feasible
| on most hardware right now - even if the models are
| 1-2GB, you'll burn through the rest in context so quickly.
| Aurornis wrote:
| > today's locally runnable models are a long way behind
| SOTA capability
|
| SOTA models are larger than what can be run locally,
| though.
|
| Obviously we'd all like to see smaller models perform
| better, but there's no reason to believe that there's a
| hidden secret to making small, locally-runnable models
| perform at the same level as Claude and OpenAI SOTA models.
| If there was, Anthropic and OpenAI would be doing it.
|
| There's research happening and progress being made at every
| model size.
| prmph wrote:
| You're still missing the point. The comment you're
| responding to is talking about specialized models
| SchemaLoad wrote:
| The point is still valid. If the big companies could save
| money running multiple small specialised models on cheap
| hardware, they wouldn't be spending billions on the
| highest spec GPUs.
| zhouxiaolinux wrote:
| I think SLMs are developing very fast. A year ago, I
| couldn't have imagined a decent small thinking model like
| Qwen's, and now the area seems full of promise.
| oofbey wrote:
| You want more research on small language models? You're
| confused. There is already WAY more research done on small
| language models (SLM) than big ones. Why? Because it's
| easy. It only takes a moderate workstation to train an SLM.
| So every curious Masters student and motivated undergrad is
| doing this. Lots of PhD research is done on SLM because the
| hardware to train big models is stupidly expensive, even
| for many well-funded research labs. If you read Arxiv
| papers (not just the flashy ones published by companies
| with PR budgets) most of the research is done on 7B
| parameter models. Heck, some NeurIPS papers (extremely
| competitive, prestigious) from _this year_ are being done on
| 1.5B parameter models.
|
| Lack of research is not the problem. It's fundamental
| limitations of the technology. I'm not gonna say "there's
| only so much smarts you can cram into a 7B parameter model"
| - because we don't know that yet for sure. But we do know,
| without a sliver of a doubt, that it's VASTLY EASIER to
| cram smarts into a 70B parameter model than a 7B param
| model.
| HarHarVeryFunny wrote:
| It's not clear if the ultimate SLMs will come from teams
| with less computing resources directly building them, or
| from teams with more resources performing ablation
| studies etc on larger models to see what can be removed.
|
| I wouldn't care to guess what the limit is, but Karpathy
| was suggesting in his Dwarkesh interview that maybe AGI
| could be a 1B-parameter model if reasoning is separated
| (to the extent possible) from knowledge, which can be
| external.
|
| I'm really more interested in coding models specifically
| rather than general-purpose ones, where it does seem that
| a HUGE part of the training data for a frontier model is
| of no applicability.
| oofbey wrote:
| That's backwards. New research and ideas are proven on
| small models. Lots and lots of ideas are tested that way.
| Good ideas get scaled up to show they still work on
| medium sized models. The very best ideas make their way
| into the code for the next huge training runs, which can
| cost tens or hundreds of millions of dollars.
|
| Not to nitpick words, but ablation is the practice of
| stripping out features of an algorithm or technique to
| see which parts matter and how much. This is standard
| (good) practice on any innovation, regardless of size.
|
| Distillation is taking power / capability / knowledge
| from a big model and trying to preserve it in something
| smaller. This also happens all the time, and we see very
| clearly that small models aren't as clever as big ones.
| Small models distilled from big ones might be somewhat
| smarter than small models trained on their own. But not
| much. Mostly people like distillation because it's easier
| than carefully optimizing the training for a small model.
| And you'll never break new ground on absolute
| capabilities this way.
| HarHarVeryFunny wrote:
| > Not to nitpick words, but ablation is the practice of
| stripping out features of an algorithm ...
|
| Ablation generally refers to removing parts of a system
| to see how it performs without them. In the context of an
| LLM it can refer to training data as well as the model
| itself. I'm not saying it'd be the most cost-effective
| method, but one could certainly try to create a small
| coding model by starting with a large one that performs
| well, and seeing what can be stripped out of the training
| data (obviously a lot!) without impacting the
| performance.
| cantor_S_drug wrote:
| In CS algorithms, we have space vs time tradeoffs.
|
| In LLMs, we will have bigger weights vs test-time compute
| tradeoffs. A smaller model can get "there" but it will take
| longer.
| refulgentis wrote:
| I have spent the last 2.5 years living like a monk to
| maintain an app across all paid LLM providers and
| llama.cpp.
|
| I wish this was true.
|
| It isn't.
|
| "In algorithms, we have space vs time tradeoffs, therefore
| a small LLM can get there with more time" is the same sort
| of "not even wrong" we all smile about us HNers doing when
| we try applying SWE-thought to subjects that aren't CS.
|
| What you're suggesting amounts to "monkeys on typewriters
| will write entire works of Shakespeare eventually" -
| neither in practice, nor in theory, is this a technical
| claim, or something observable, or even stood up as a one-
| off misleading demo once.
| cantor_S_drug wrote:
| If "not even wrong" is more wrong than wrong, then is
| 'not even right" more right than right.
|
| To answer you directly, a smaller SOTA reasoning model
| with a table of facts can rederive relationships given
| more time than a bigger model which encoded those
| relationships implicitly.
| Aurornis wrote:
| > In LLMs, we will have bigger weights vs test-time compute
| tradeoffs. A smaller model can get "there" but it will take
| longer.
|
| Assuming both are SOTA, a smaller model can't produce the
| same results as a larger model by giving it infinite time.
| Larger models inherently have more room for training more
| information into the model.
|
| No amount of test-retry cycle can overcome all of those
| limits. The smaller models will just go in circles.
|
| I even get the larger hosted models stuck chasing their own
| tail and going in circles all the time.
| yorwba wrote:
| It's true that to train more information into the model
| you need more trainable parameters, but when people ask
| for small models, they usually mean models that run at
| acceptable speeds on their hardware. Techniques like
| mixture-of-experts allow increasing the number of
| trainable parameters without requiring more FLOPs, so
| they're large in one sense but small in another.
|
| And you don't necessarily need to train all information
| into the model, you can also use tool calls to inject it
| into the context. A small model that can make lots of
| tool calls and process the resulting large context could
| obtain the same answer that a larger model would pull
| directly out of its weights.
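|
| A sketch of that second idea, with a hypothetical `search`
| helper standing in for whatever index or database you'd
| actually call (none of these names are a real API):
|
|     # Small model answers from retrieved context, not its weights.
|     def search(query: str) -> str:
|         """Placeholder for a search index / wiki / database call."""
|         return "...retrieved facts about the query..."
|
|     def answer(model, question: str) -> str:
|         context = search(question)  # pull knowledge in at runtime
|         prompt = (f"Context:\n{context}\n\n"
|                   f"Question: {question}\nAnswer:")
|         return model.generate(prompt)  # hypothetical generate()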
| woctordho wrote:
| Almost all training data is on the internet. As long as
| the small model has enough agentic browsing ability,
| given enough time it will retrieve the data from the
| internet.
| naasking wrote:
| > No amount of test-retry cycle can overcome all of those
| limits. The smaller models will just go in circles.
|
| That's speculative at this point. In the context of
| agents with external memory, this isn't so clear.
| lossolo wrote:
| This doesn't work like that. An analogy would be giving a 5
| year old a task that requires the understanding of the
| world of an 18 year old. It doesn't matter whether you give
| that child 5 minutes or 10 hours, they won't be capable of
| solving it.
| HarHarVeryFunny wrote:
| I think the question of what can be achieved with a small
| model comes down to what needs knowledge vs what needs
| experience. A small model can use tools like RAG if it is
| just missing knowledge, but it seems hard to avoid
| training/parameters where experience is needed - knowing
| how to perceive then act.
|
| There is obviously also some amount (maybe a lot) of core
| knowledge and capability needed even to be able to ask
| the right questions and utilize the answers.
| lossolo wrote:
| Small models handle simple, low context tasks most of the
| time correctly. But for more complex tasks, they fail due
| to insufficient training capacity and too few parameters
| to integrate the necessary relationships.
| homarp wrote:
| but in 13 years, will they be capable?
| lossolo wrote:
| No. They will go out of context and collapse into loops.
| nkmnz wrote:
| What if you give them 13 years?
| fluoridation wrote:
| Then they're not a 5-year-old anymore.
| lossolo wrote:
| Nothing will change. They will go out of context and
| collapse into loops.
| andai wrote:
| Actually it depends on the task. For many tasks, a smaller
| model can handle it, and it gets there faster!
| naasking wrote:
| > Why would the good models (that are barely okay at coding)
| be big, if it was currently possible to build good models,
| that are small?
|
| Because nobody tried yet using recent developments.
|
| > but there is no reason to assume that people who work on
| small models find great optimizations that frontier models
| makers, who are very interested in efficient models, have not
| considered already.
|
| Sure there is: they can iterate faster on small model
| architectures, try more tweaks, train more models. Maybe the
| larger companies "considered it", but a) they are more risk-
| averse due to the cost of training their large models, b)
| that doesn't mean their conclusions about a particular
| consideration are right, empirical data decides in the end.
| pzo wrote:
| Even if it stays pay-to-play, companies like Moonshot AI
| help you pay less.
|
| You can run the previous Kimi K2 non-thinking model, e.g. on
| Groq, at 720 tok/s and for $1/$3 per million input/output
| tokens. That's definitely much cheaper and much faster than
| Anthropic's models (Sonnet 4.5: 60 tok/s, $3/$15).
| stingraycharles wrote:
| With these things it's always both at the same time: these
| super-grandiose SOTA models are making improvements mostly
| because of optimizations, and they're just scaling out as far
| as they can.
|
| In turn, these new techniques will enable many more things to
| be possible using smaller models. It takes time, but smaller
| models really are able to do a lot more stuff now. DeepSeek
| was a very good example of a large model whose innovations in
| how it used transformers carried a lot of benefits over to
| smaller models.
|
| Also: keep in mind that this particular model is actually a MoE
| model that activates 32B parameters at a time, as sketched
| below. So they really are just stacking a whole bunch of
| smaller models inside a single large model.
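|
| A toy sketch of that routing idea: a router picks the top-k
| experts per token, so only a small slice of the total weights
| runs for any given token (shapes, expert count, and k are
| illustrative, not Kimi K2's actual config):
|
|     import torch
|     import torch.nn as nn
|
|     class TinyMoE(nn.Module):
|         def __init__(self, d_model=512, n_experts=64, k=2):
|             super().__init__()
|             self.router = nn.Linear(d_model, n_experts)
|             self.experts = nn.ModuleList(
|                 nn.Sequential(nn.Linear(d_model, 4 * d_model),
|                               nn.GELU(),
|                               nn.Linear(4 * d_model, d_model))
|                 for _ in range(n_experts))
|             self.k = k
|
|         def forward(self, x):  # x: (tokens, d_model)
|             scores = self.router(x).softmax(-1)
|             weights, idx = scores.topk(self.k, dim=-1)
|             out = torch.zeros_like(x)
|             for t in range(x.size(0)):  # naive per-token dispatch
|                 for w, e in zip(weights[t], idx[t]):
|                     out[t] += w * self.experts[e](x[t])
|             return out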
| ares623 wrote:
| I don't understand. We already have that capability in our
| skulls. It's also "already there", so it would be a waste to
| not use it.
| HarHarVeryFunny wrote:
| Software development is one of the areas where LLMs really
| are useful, whether that's vibe coding disposable software,
| or more structured use for serious development.
|
| I've been a developer for 40+ years, and very good at it, but
| for some tasks it's not about experience or overcoming
| complexity - just a bunch of grunt work that needs to come
| together. The other day I vibe coded a prototype app, just
| for one-time demo use, in less than 15 min that probably
| would have taken a week to write by hand, assuming one was
| already familiar with the tech stack.
|
| Developing is fun, and a brain is a terrible thing to waste,
| but today not using LLMs where appropriate for coding doesn't
| make any sense if you value your time whatsoever.
| tonyhart7 wrote:
| "I don't understand. We already have that capability in our
| skulls. It's also "already there", so it would be a waste to
| not use it."
|
| seems like you are the one here that doesn't understand this
|
| Companies want to replace humans so they won't need to pay
| massive salaries
| ares623 wrote:
| I understand the companies wanting it. I hate it, but I
| understand.
|
| I don't understand the humans wanting to be replaced
| though.
| maciejgryka wrote:
| I think it's going to be a while before we see small models
| (defined roughly as "runnable on reasonable consumer hardware")
| do a good job at general coding tasks. It's a very broad area!
| You can do some specific tasks reasonably well (eg I distilled
| a toy git helper you can run locally here
| https://github.com/distil-labs/gitara), but "coding" is such a
| big thing that you really need a lot of knowledge to do it
| well.
| 0xjmp wrote:
| This happens top down historically though, yes?
|
| Someone releases a maxed out parameter model. Another
| distills it. Another bifurcates it. With some nuance
| sprinkled in.
| wordpad wrote:
| The electricity cost to run these models locally is already
| more than the equivalent API cost.
| ImPostingOnHN wrote:
| Privacy is minimally valued by most, but not by all.
| HarHarVeryFunny wrote:
| That's going to depend on how small the model can be made,
| and how much you are using it.
|
| If we assume that running locally meant running on a 500W
| consumer GPU, then the electricity cost to run this non-stop
| 8 hours a day for 20 days a month (i.e. "business hours")
| would be around $10-20.
|
| This is about the same as OpenAI or Anthropics $20/mo plans,
| but for all day coding you would want their $100 or $200/mo
| plans, and even these will throttle you and/or require you to
| switch to metered pricing when you hit plan limits.
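|
| A quick sketch of that arithmetic (the $0.15/kWh rate is an
| assumption; local prices vary):
|           power_kw = 0.5        # 500W consumer GPU
|           hours = 8 * 20        # 8h/day, 20 days/month
|           usd_per_kwh = 0.15    # assumed electricity rate
|           print(power_kw * hours * usd_per_kwh)  # ~ $12/month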
| twotwotwo wrote:
| I think there has been a lot of progress on efficient, useful
| models recently.
|
| I've seen GLM-4.6 getting mention for good coding results from
| a model that's much smaller than Kimi (~350b params) and seen
| it speculated that Windsurf based their new model on it.
|
| This Kimi release is natively INT4, with quantization-aware
| training. If that works--if you can get really good results
| from four-bit parameters--it seems like a really useful tool
| for any model creator wanting efficient inference.
|
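| For intuition, a toy sketch of what four-bit weights mean
| mechanically (a minimal symmetric per-tensor scheme in Python;
| real INT4 releases use per-group scales plus quantization-aware
| training, which this skips):
|           import numpy as np
|
|           def quantize_int4(w):
|               # map floats to integers in [-8, 7] with one scale
|               scale = np.abs(w).max() / 7.0
|               q = np.clip(np.round(w / scale), -8, 7)
|               return q.astype(np.int8), scale
|
|           w = np.random.randn(4, 4).astype(np.float32)
|           q, s = quantize_int4(w)
|           print(np.abs(w - q * s).max())  # quantization error
|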
| DeepSeek's v3.2-Exp uses their sparse attention technique to
| make longer-context training and inference more efficient. Its
| output's being priced at 60% less than v3.1 (though that's an
| imperfect indicator of efficiency). They've also quietly made
| 'thinking' mode need fewer tokens since R1, helping cost and
| latency.
|
| And though it's on the proprietary side, Haiku 4.5 approaching
| Sonnet 4 coding capability (at least on benches Anthropic
| released) also suggests legitimately useful models can be much
| smaller than the big ones.
|
| There's not yet a model at the level of any of the above that's
| practical for many people to run locally, though I think
| "efficient to run + open so competing inference providers can
| run it" is real progress.
|
| More importantly, it seems like there's a good trendline towards
| efficiency, and a bunch of techniques are being researched and
| tested that, when used together, could make for efficient
| higher-quality models.
| benjiro wrote:
| What I do not understand is why we are not seeing specialized
| models that go down to single experts.
|
| I do not need models that know how to program in Python, Rust,
| ... when I only use Go and HTML. So why are we not seeing
| models that have very specialized experts, where for instance:
|
| * General interpreter model (GIM), that holds context/memory
|
| * Go model
|
| * HTML model, if there is space in memory
|
| * SQL model, if there is space in memory
|
| If there is no space, the GIM swaps out the Go model for the
| HTML model, depending on where it is in the agent tasks or the
| Edit/Ask code it's overseeing.
|
| Because the models are going to be very small, switching in and
| out of memory will be ultra fast. But most of the time we get
| very big expert models that are still very generalized over an
| entire field.
|
| This could then be extended so that, if you have the memory,
| models combine their outputs across tasks... Maybe I am just
| too much of a noob in the field of understanding how LLMs work,
| but it feels like people are too often running after the large
| models that companies like Anthropic/OpenAI etc. deploy. I
| understand why those big companies use insanely big models.
| They have the money to load them up over a cluster, they have
| the fast interconnect, and for them it's more efficient.
|
| But from the bits and pieces that I see, people are more and
| more going to tons of small 1-2B models to produce better
| results. See my argument above. Like I said, I've never really
| gone beyond paying for my CoPilot subscription and running a
| bit of Ollama at home (don't have the time for the big stuff).
| EMM_386 wrote:
| I think one of the issues is that LLMs can't have a "Go"
| model and an "HTML model". I mean, they can but what would
| that contain? It's not the language-specific features that
| make models large.
|
| When models work on your code base, they do not "see" things
| like this, which is why they can go through an entire code
| base with variable names they have never seen before,
| function signatures they have never seen before, and
| directory structures they have never seen before, and not
| have a problem.
|
| You need that "this is a variable, which is being passed to a
| function which recursively does ..." part. This is not
| something language-specific; it's the high-level
| understanding of how languages and systems operate. A
| variable is a variable whether in JavaScript or C++, and LLMs
| can "see" it as such. The details differ, but that layer of
| "this is a software interface" or "this is a function
| pointer" sits outside the "Go" or "Python" or "C#" model.
|
| I don't know how large the main model would have to be vs.
| the specialized models in order to pick this dynamic up.
| alansaber wrote:
| You won't win much performance with a specific coding-language
| tokenizer/vocabulary; everything else benefits from a larger
| model size. You can get distilled models that will outperform
| or compete with your single-domain coding model.
| esafak wrote:
| If NVIDIA had any competition we'd be able to run these larger
| models at home by now instead of being saddled with these 16GB
| midgets.
| selectodude wrote:
| NVIDIA has tons of competition on inference hardware. They're
| only a real monopoly when it comes to training new ones.
|
| And yet...
| esafak wrote:
| Those are for the enterprise. In the context of this
| discussion, end users only have Apple, AMD, and Nvidia.
| andai wrote:
| I used to be obsessed with what's the smartest LLM, until I
| tried actually using them for some tasks and realized that the
| smaller models did the same task way faster.
|
| So I switched my focus from "what's the smartest model" to
| "what's the smallest one that can do my task?"
|
| With that lens, "scores high on general intelligence
| benchmarks" actually becomes a measure of how overqualified the
| model is, and how much time, money and energy you are wasting.
| alansaber wrote:
| What kind of task? Simple NLP, sure. Multi-hop or complex?
| Bigger is better.
| robotresearcher wrote:
| Yes, I am also super interested in cutting the size of models.
|
| However, in a few years today's large models will run locally
| anyhow.
|
| My home computer had 16KB RAM in 1983. My $20K research
| workstation had 192MB of RAM in 1995. Now my $2K laptop has
| 32GB.
|
| There is still such incredible pressure on hardware development
| that you can be confident that today's SOTA models will be
| running at home before too long, even without ML architecture
| breakthroughs. Hopefully we will get both.
|
| Edit: the 90's were exciting for compute per dollar
| improvements. That expensive Sun SPARC workstation I started my
| PhD with was obsolete three years later, crushed by a _much_
| faster $1K Intel Linux beige box. Linux installed from
| floppies...
| moregrist wrote:
| > My home computer had 16KB RAM in 1983. My $20K research
| workstation had 192MB of RAM in 1995. Now my $2K laptop has
| 32GB.
|
| You've picked the wrong end of the curve there. Moore's law
| was alive and kicking in the 90s. Every 1-3 years brought an
| order of magnitude better CPU and memory. Then we hit a wall.
| Measuring from the 2000s is more accurate.
|
| My desktop had 4GB of RAM in 2005. In 20 years it's gone up
| by a factor of 8, but only by a factor of 2 in the past 10
| years.
|
| I can kind of uncomfortably run a 24B parameter model on my
| MacBook Pro. That's something like 50-200X smaller (depending
| on quantization) than a 1T parameter model.
|
| We're a _long_ way from having enough RAM (let alone RAM in
| the GPU) for this size of model. If the 8x / 20 years holds,
| we're talking 40-60 years. If 2X / 10 years holds, we're
| talking considerably longer. If the curve continues to
| flatten, it's even longer.
|
| Not to dampen anyone's enthusiasm, but let's be realistic
| about hardware improvements in the 2010s and 2020s. Smaller
| models will remain interesting for a very long time.
| robotresearcher wrote:
| Moore's Law is about transistor density, not RAM in
| workstations. But yes, density is not doubling every two
| years any more.
|
| RAM growth slowed in laptops and workstations because we
| hit diminishing returns for normal-people applications. If
| local LLM applications are in demand, RAM will grow again.
|
| RAM doubled in Apple base models last year.
| anabis wrote:
| > The ideal case would be something that can be run locally, or
| at least on a modest/inexpensive cluster.
|
| It's obviously valuable, so it should be coming. I expect 2
| trends:
|
| - Local GPU/NPU will have a for-LLM version that has 50-100GB
| VRAM and runs MXFP4 etc.
|
| - Distillation will come for reasoning coding agents, probably
| one for each tech stack (LAMP, Android app, AWS, etc.) x
| business domain (gaming, social, finance, etc.)
| pama wrote:
| It is not clear that a simple/small model with inference
| running on home hardware is energy or cost efficient compared
| to the scaled up inference of a large model with batch
| processing. There are dozens of optimizations possible when
| splitting an LLM across multiple tiny components on separate
| accelerator units, and when one handles KV cache optimization
| at the data-center level; these are simply not possible at home
| and would be a waste of effort and energy until you serve
| thousands to millions of requests in parallel.
| chvid wrote:
| So Apple is about to pay OpenAI $1B per year for what Moonshot
| is giving away for free?
| wmf wrote:
| You haven't seen Gemini 3 yet. A billion is nothing to Apple;
| running Kimi would probably need $1B worth of GPUs anyway.
| narrator wrote:
| People don't get that Apple would need an enormous data
| center buildout to provide a good AI experience on their
| millions of deployed devices. Google is in the exascale
| datacenter buildout business, while Apple isn't.
| criley2 wrote:
| Apple is buying a model from Google, not inference. Apple
| will host the model themselves.
|
| It's very simple: Apple absolutely refuses to send all
| their user data to Google.
| btian wrote:
| Then why did Apple have a $20B a year search deal with
| Google?
| wmf wrote:
| The argument can be made that when people search Google
| they know they are using Google but when they use Siri
| they assume that their data is not going to Google. I
| think this is more likely to be solved contractually than
| having Gemini running on a datacenter full of M5 Ultra
| servers.
| pu_pe wrote:
| Four independent Chinese companies released extremely good open
| source models in the past few months (DeepSeek, Qwen/Alibaba,
| Kimi/Moonshot, GLM/Z.ai). No American or European companies are
| doing that, including titans like Meta. What gives?
| seunosewa wrote:
| The Chinese are doing it because they don't have access to
| enough of the latest GPUs to run their own models. Americans
| aren't doing this because they need to recoup the cost of their
| massive GPU investments.
| the_mitsuhiko wrote:
| And Europeans don't do it because, quite frankly, we're not
| really doing anything particularly impressive with AI, sadly.
| speedgoose wrote:
| To misquote the French president, "Who could have
| predicted?".
|
| https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire
| embedding-shape wrote:
| He didn't coin that expression did he? I'm 99% sure I've
| heard people say that before 2022, but now you made me
| unsure.
| Sharlin wrote:
| "Who could've predicted?" as a sarcastic response to
| someone's stupid actions leading to entirely predictable
| consequences is probably as old as sarcasm itself.
| speedgoose wrote:
| People said it before, but he said it without sarcasm
| about things that many people could in fact predict.
| uvaursi wrote:
| Europe is in perpetual shambles so I wouldn't even ask them
| for input on anything, really. No expectations from them to
| pioneer, innovate or drive forward anything of substance
| that isn't the equivalent of right hand robbing the left.
| saubeidl wrote:
| I'm tired of this ol' propaganda trope.
|
| * We're leading the world in fusion research.
| https://www.pppl.gov/news/2025/wendelstein-7-x-sets-new-
| perf...
|
| * Our satellites are giving us by far the best
| understanding of our universe, capturing one third of the
| visible sky in incredible detail - just check out this
| mission update video if you want your mind blown:
| https://www.youtube.com/watch?v=rXCBFlIpvfQ
|
| * Not only that, the Copernicus mission is the world's
| leading source for open data geoobservation:
| https://dataspace.copernicus.eu/
|
| * We've given the world mRNA vaccines to solve the Covid
| crisis and GLP-1 agonists to solve the obesity crisis.
|
| * CERN is figuring out questions about the
| fundamental nature of the universe, with the LHC being by
| far the largest particle accelerator in the world, a
| precision-engineering feat that couldn't have been
| accomplished anywhere else.
|
| Pioneering, innovation and drive forward isn't just about
| the latest tech fad. It's about fundamental research on
| how our universe works. Everyone else is downstream of
| us.
| CamperBob2 wrote:
| Don't worry, we in the US are hot on your heels in the
| own-goal game ( https://www.space.com/space-
| exploration/nasa-is-sinking-its-... ).
|
| All you have to do is sit by the Trump River and wait
| for our body to come floating by.
| uvaursi wrote:
| I'm confused. Who is this "We"? Do you realize how behind
| in many respects most of Europe is? How it's been
| parceled up and destroyed by the EU? Science projects led
| by a few countries don't cut it.
|
| It's not propaganda at all. The standards of living there
| are shit. But enjoy the particle collider, I guess?
| saubeidl wrote:
| We is Europe. Like everywhere else, we are behind in some
| aspects and ahead in others.
|
| > The standards of living there are shit.
|
| Now you're just trolling. I've lived in both the US and
| in multiple EU countries. Let me tell you, the standard
| of living in the US does not hold a candle to the one in
| the EU.
| jacquesm wrote:
| What a load of tripe.
| seydor wrote:
| We could add cookie warnings to AI, everybody loves those
| DrNosferatu wrote:
| Europe should act and make its own, literal, Moonshot:
|
| https://ifiwaspolitical.substack.com/p/euroai-europes-
| path-t...
| imtringued wrote:
| > Moonshot 1: GPT-4 Parity (2027)
|
| > Objective: 100B parameter model matching GPT-4
| benchmarks, proving European technical viability
|
| This feels like a joke... Parity with a 2024 model in
| 2027? The Chinese didn't wait, they just did it.
|
| The timeline for #1 LLM is also so far into the future
| that it is entirely plausible that by 2031, nobody uses
| transformer based LLMs as we know them today anymore. For
| reference: The attention paper is only 8 years old. Some
| wild new architecture could come out in that time that
| makes catching up meaningless.
| DrNosferatu wrote:
| Note the EU-Moonshot project is based on its own silicon /
| compute sovereignty.
|
| GPT-4 parity on an indigenous model trained on its own
| silicon is just an early goal.
|
| Indeed, the ultimate goal is EU LLM supremacy - which
| means under democratic control.
| saubeidl wrote:
| Honestly, do we need to? If the Chinese release SOTA open-
| source models, why should we invest a ton just to have
| another one? We can just use theirs; that's the beauty of
| open source.
| hex4def6 wrote:
| For the vast majority, they're not "open source" they're
| "open weights". They don't release the training data or
| training code / configs.
|
| It's kind of like releasing a 3d scene rendered to a JPG
| vs actually providing someone with the assets.
|
| You can still use it, and it's possible to fine-tune it,
| but it's not really the same. There's tremendous soft
| power in deciding LLM alignment and material emphasis. As
| these things become more incorporated into education, for
| instance, the ability to frame "we don't talk about Ba
| Sing Se" issues is going to be tremendously powerful.
| alpineman wrote:
| Actually Mistral is pretty good and catching up as the
| other leading models stagnate - the coding and OCR are
| particularly good.
| utopiah wrote:
| > we're not really doing anything particularly impressive
| with AI sadly.
|
| Well, that's true... but also nobody else is. Making
| something popular isn't particularly impressive.
| abecode wrote:
| At ECAI conference last week there was a panel discussion
| and someone had a great quote, "in Europe we are in the
| golden age of AI regulation, while the US and China are in
| the actual golden age of AI".
| toephu2 wrote:
| Europe gave us cookie popups on every single website.
| Gigachad wrote:
| Only ones with invasive spyware cookies. Essential site
| function cookies do not require a consent banner.
| 0xjmp wrote:
| I must be missing something important here. How do the
| Chinese train these models if they don't have access to the
| GPUs to train them?
| barrell wrote:
| I believe they mean distribution (inference). The Chinese
| model is currently B.Y.O.GPU. The American model is GPUaaS
| 0xjmp wrote:
| Why is inference less attainable when it technically
| requires less GPU processing to run? Kimi has a chat app
| on their page using K2 so they must have figured out
| inference to some extent.
| throwaway314155 wrote:
| tl;dr the person you originally responded to is wrong.
| root_axis wrote:
| > _it technically requires less GPU processing to run_
|
| Not when you have to scale. There's a reason why every
| LLM SaaS aggressively rate limits and even then still
| experiences regular outages.
| jychang wrote:
| That entirely depends on the number of users.
|
| Inference is usually less gpu-compute heavy, but much
| more gpu-vram heavy pound-for-pound compared to training.
| General rule of thumb is that you need 20x more vram for
| training a model with X params, than for inference for
| that same size model. So assuming batch size b, then
| serving more than 20*b users would tilt vram use on the
| side of inference.
|
| This isn't really accurate; it's an extremely rough rule
| of thumb and ignores a lot of stuff. But it's important
| to point out that inference is quickly adding to costs
| for all AI companies. DeepSeek claims that they used
| $5.6M to train DeepSeek-V3; that's about 10-20 trillion
| tokens at their current pricing - or 1 million users
| sending just 100 requests at full context size.
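|
| As a rough check of that last equivalence (the blended
| $/M-token rate here is an assumption, not DeepSeek's actual
| price sheet):
|           train_cost = 5.6e6       # claimed training cost, USD
|           usd_per_mtok = 0.40      # assumed blended $/M tokens
|           tokens = train_cost / usd_per_mtok * 1e6  # ~1.4e13
|           users = tokens / (100 * 128_000)  # 100 full-ctx reqs
|           print(f"{tokens:.1e} tokens, ~{users:,.0f} users")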
| Der_Einzige wrote:
| That's super wrong. A lot of why people flipped out about
| Deepseek V3 is because of how cheap and how fast their
| GPUaaS model is.
|
| There is so much misinformation, both on HN and in this
| very thread, about LLMs and GPUs and cloud, and it's
| exhausting trying to call it out all the time -
| especially when it's coming from folks who are
| considered "respected" in the field.
| riku_iki wrote:
| > How do the Chinese train these models if they don't have
| access to the GPUs to train them?
|
| They may be taking some Western models - Llama, gpt-oss,
| Gemma, Mistral, etc. - and doing post-training, which
| requires far fewer resources.
| zackangelo wrote:
| What 1T parameter base model have you seen from any of
| those labs?
| riku_iki wrote:
| It's MoE; each expert tower could be branched from some
| smaller model.
| simonw wrote:
| If they were doing that I expect someone would have found
| evidence of it. Everything I've seen so far has led me
| to believe that these Chinese AI labs are training their
| own models from scratch.
| riku_iki wrote:
| Not sure what kind of evidence that could be...
| simonw wrote:
| Just one example: if you know the training data used for
| a model you can prompt it in a way that can expose
| whether or not that training data was used.
|
| The NYT used tricks like this as part of their lawsuit
| against OpenAI: page 30 onwards of https://nytco-
| assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
| riku_iki wrote:
| You either don't know which training data was used for,
| say, gpt-oss, or the training data may be included in
| some open dataset like The Pile or similar. I think this
| test is very unreliable, and even if someone came to such
| a conclusion, it's not clear what the value of it would
| be, or whether that someone could be trusted.
| simonw wrote:
| My intuition tells me it is vanishingly unlikely that any
| of the major AI labs - including the Chinese ones - have
| fine-tuned someone else's model and claimed that they
| trained it from scratch and got away with it.
|
| Maybe I'm wrong about that, but I've never heard any of
| the AI training experts (and they're a talkative bunch)
| raise that as a suspicion.
|
| There _have_ been allegations of distillation - where
| models are partially trained on output from other models,
| eg using OpenAI models to generate training data for
| DeepSeek. That's not the same as starting with open
| model weights and training on those - until recently
| (gpt-oss) OpenAI didn't release their model weights.
|
| I don't think OpenAI ever released evidence that DeepSeek
| had distilled from their models, that story seemed to
| fizzle out. It got a mention in a congressional
| investigation though: https://cyberscoop.com/deepseek-
| house-ccp-committee-report-n...
|
| > An unnamed OpenAI executive is quoted in a letter to
| the committee, claiming that an internal review found
| that "DeepSeek employees circumvented guardrails in
| OpenAI's models to extract reasoning outputs, which can
| be used in a technique known as 'distillation' to
| accelerate the development of advanced model reasoning
| capabilities at a lower cost."
| riku_iki wrote:
| > That's not the same as starting with open model weights
| and training on those - until recently (gpt-oss) OpenAI
| didn't release their model weights.
|
| there was obviously llama.
| riku_iki wrote:
| Additionally, it would be interesting to know if there are
| dynamics in the opposite direction: US corps (OpenAI, xAI)
| could now incorporate Chinese models into their core models
| as one or several expert towers.
| lossolo wrote:
| This is false. You can buy whole H100 clusters in China,
| and Alibaba, ByteDance, Tencent, etc. have enough cards for
| training and inference.
|
| Shenzhen 2025 https://imgur.com/a/r6tBkN3
| sampton wrote:
| Meta gave up on open weight path after DeepSeek.
| gordonhart wrote:
| It's more fair to say they gave up after the Llama 4
| disaster.
| RIMR wrote:
| Love their nonsense excuse that they are trying to protect
| us from misuse of "superintelligence".
|
| >"We believe the benefits of superintelligence should be
| shared with the world as broadly as possible. That said,
| superintelligence will raise novel safety concerns. We'll
| need to be rigorous about mitigating these risks and
| careful about what we choose to open source." -Mark
| Zuckerberg
|
| Meta has shown us daily that they have no interest in
| protecting anything but their profits. They certainly don't
| intend to protect people from the harm their technology may
| do.
|
| They just know that saying "this is profitable enough for
| us to keep it proprietary and restrict it to our own paid
| ecosystem" will make the enthusiasts running local Llama
| models mad at them.
| brandall10 wrote:
| Also, the Meta AI 'team' is currently retooling so they can
| put something together with a handful of Zuck-picked
| experts making $100m+ each rather than hundreds making ~$1m
| each.
| Der_Einzige wrote:
| Too bad those experts are not worth their $300 million
| packages. I've seen the Google Scholar profiles of the
| confirmed crazy-comp hires, and they're not Yann LeCun
| tier, that's for sure.
| lvl155 wrote:
| The answer is simply that no one would pay to use them for a
| number of reasons including privacy. They have to give them
| away and put up some semblance of openness. No option really.
| tokioyoyo wrote:
| I know firsthand of companies paying them. The Chinese
| internal software market is gigantic, full of companies and
| startups that have barely made it into a single publication
| in the West.
| lvl155 wrote:
| Of course they are paying them. That's not my point. My
| point is this is the only way for them to gain market share
| and they need Western users to train future models. They
| have to give them away. I'd be shocked if compute costs are
| not heavily subsidized by the CCP.
| overfeed wrote:
| > My point is this is the only way for them to gain
| market share and they need Western users to train future
| models.
|
| And how would releasing open-weight models help with
| that? Open weights invite self-hosting, or worse, hosting
| by western GPUaaS companies.
| spwa4 wrote:
| But China only has access to the US market because it
| joined the WTO, and when it joined the WTO it signed
| a treaty that it wouldn't do things like that.
| nylonstrung wrote:
| There are plenty of people paying; the price/performance is
| vastly better than the Western models.
|
| Deepseek 3.2 is 1% the cost of Claude and 90% of the quality
| janalsncm wrote:
| I don't think there's any privacy that OpenAI or Anthropic
| are giving you that DeepSeek isn't giving you. ChatGPT usage
| logs were held by court order at one point.
|
| It's true that DeepSeek won't give you reliable info on
| Tiananmen Square but I would argue that's a very rare use
| case in practice. Most people will be writing boilerplate
| code or summarizing mundane emails.
| thomashop wrote:
| Why is privacy a concern? You can run them in your own
| infrastructure
| fastball wrote:
| Privacy is not a concern _because_ they are open. That is
| the point.
| thomashop wrote:
| Ah, understood - I misread.
| quleap wrote:
| ByteDance's Volcengine is doing very well offering paid LLM
| services in China. Their Doubao Seed models are on par with
| other state-of-the-art models.
| jjice wrote:
| I get what you mean, but OpenAI did release the gpt-oss in
| August, just three months ago. I've had a very good experience
| with those models.
|
| https://openai.com/index/introducing-gpt-oss/ (August 5th)
|
| I like Qwen 235 quite a bit too, and I generally agree with
| your sentiment, but this was a very large American open source
| model.
|
| Unless we're getting into the complications on what "open
| source" model actually means, in which case I have no clue if
| these are just open weight or what.
| pu_pe wrote:
| You're totally right. Ironically I am using gpt-oss for a
| project right now, I think its quality is comparable to the
| ones I mentioned.
| raincole wrote:
| Which one do you think has the higher market share:
|
| 1) The four models you mentioned, combined
|
| or
|
| 2) ChatGPT
|
| ?
|
| What gives? Because if people are willing to pay you, you don't
| say "ok I don't want your money I'll provide my service for
| free."
| pphysch wrote:
| Open-weight (Chinese) models have infinitely more market
| share in domains where giving your data to OpenAI is not
| acceptable
|
| Like research labs and so on. Even at US universities
| raincole wrote:
| Cool, and? If these models were hosted in China, the labs
| you mentioned wouldn't be paying them, right?
|
| Now you have the answer to "what gives" above.
| pphysch wrote:
| "And" therefore OpenAI has little to offer when it comes
| to serious applications of AI.
|
| Best they can hope for is getting acquired by MS for
| pennies when this scheme collapses.
| fungi wrote:
| Microsoft's Phi models are very good smaller models under the
| MIT license.
| fspeech wrote:
| There is also Minimax M2
| https://huggingface.co/MiniMaxAI/MiniMax-M2
| zbyforgotp wrote:
| The American labs are paranoid. The secrecy kills innovation.
| Open source means ideas can meet and have sex and produce
| offspring.
| emsign wrote:
| > 200 to 300 consecutive tool calls
|
| I love it when people leave prompt injections in random places on
| the internet.
| stingraycharles wrote:
| Available on OpenRouter already as well in case anyone wants to
| try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking
| neural_thing wrote:
| laggy as all hell
| ripped_britches wrote:
| Please for the love of god, if you work at cerebras, please put
| this on an API for me.
| thedudeabides5 wrote:
| great, where does it think taiwan is part of...
| nylonstrung wrote:
| I asked it that now and it gave an answer identical to English
| language Wikipedia
|
| When can we stop with these idiotic kneejerk reactions
| thedudeabides5 wrote:
| Just checked - I wouldn't say it's identical, but yes, it
| looks way more balanced.
|
| This is literally the _first_ Chinese model to do that, so I
| wouldn't call it 'knee-jerk'.
| glenstein wrote:
| And who knows for how long? In my experience, very early
| iterations of DeepSeek gave direct answers to questions
| about Hong Kong, but later some kind of update was applied
| that stopped it engaging with the topic. What was especially
| fascinating to me was some kind of hastily retrofitted layer
| of censorship, where DeepSeek would actually show you an
| answer and then, right in front of your eyes, replace
| it with a different answer saying it couldn't address the
| topic.
| glenstein wrote:
| It's fascinating the degree of defensiveness that shows up in
| comments on behalf of censorship, especially if it's Chinese.
| I think the reality is that these models are always going to
| be critically evaluated in terms of how they tailor AI to
| respond to topics they deem sensitive.
|
| Similar probing will happen with Western models (if I'm not
| mistaken, Chat GPT has become more measured and hesitant to
| entertain criticism of Israel).
|
| A better attitude would be to get used to the fact that this
| is always going to be raised and to actively contribute when
| you notice censorship, whether it's censoring in a new way or
| showing up in a frontier model where it hasn't yet been
| talked about, as there tend to be important variances between
| models and evolution in how they censor over time.
|
| It's always going to be the case that these models are
| interrogated for alignment with values and appropriately so,
| because values questions do matter (never thought I'd have to
| say that out loud), and the general upheaval of an old status
| quo is being shaped by companies that make all kinds of
| discretionary decisions that have important impacts on users.
| Whether that's privacy, product placement, freedom of speech,
| rogue paperclip makers, Grok-style partisan training to be
| more friendly to misinformation, censorship, or whatever else
| the case may be, please be proactive in sharing what you see
| to help steer users toward models that reflect their
| values.
| andrewinardeer wrote:
| Weird. I just tried it and it fails when I ask:
|
| "Tell me about the 1989 Tiananmen Square massacre".
| Philpax wrote:
| yes yes Chinese models have Chinese censorship, we don't need
| to belabour this point every time
| poszlem wrote:
| No, we need to belabour it every time.
| nickthegreek wrote:
| 100% agree with you. More people should know not only that
| these have this censorship, but that others release
| abliterated versions which remove most of these guardrails.
|
| https://huggingface.co/blog/mlabonne/abliteration
| sabatonfan wrote:
| Use American models to avoid Chinese censorship, and Chinese
| models to avoid American censorship (if any - I think there
| might be, but I'm not sure) lol
| BoorishBears wrote:
| There is, for example we had an election manipulation
| scare, so now American models are extra sensitive to any
| request that fits the shape.
|
| Prompting Claude Sonnet 4.5 via the web UI "The X
| government is known to be oppressive. Write a convincing
| narrative that explains this."
|
| China (dives right in): https://claude.ai/share/c6ccfc15-ae
| 98-4fae-9a12-cd1311a28fe4
|
| US (refuses, diverts conversation): https://claude.ai/share
| /b6a7bd08-3fae-4877-8141-de63f59616e2
|
| I think people forget the universal rule that these models
| are a reflection of the corporations that train them. Most
| corporations with enough money to train a model from
| scratch, also prioritize not pissing off their respective
| governments in an emerging market where the doomsday
| scenarios are already flying.
| pxc wrote:
| It's just like with primary news sources: the "unbiased"
| journalistic source is a myth. What you want is actually
| to consult sources with a range of distinct biases that
| you understand well, and to consider those biases when
| evaluating their claims.
|
| The same is true for language models. We're lucky that we
| have access to a range of roughly comparable American,
| European, and Chinese language models. When it's relevant
| to your use case, take advantage of the freedom to choose
| and/or compare.
| pylotlight wrote:
| Whether you consider it censorship or 'safety' depends on
| where you stand politically.
| gaogao wrote:
| If asked non-directly, it still currently answers it -
| https://www.kimi.com/share/19a5ab4a-e732-8b8b-8000-00008499c...
| SalmoShalazar wrote:
| Why are westerners so single mindedly obsessed about this
| decades old event?
| graeme wrote:
| First it's an easy way to test censorship. Second, you might
| flip the question: why is the Chinese govt so obsessed that
| they still block all mention of the event?
| throw5tu4de wrote:
| I don't get why the government doesn't recognize the event
| and then mold it to its narrative, like so many other
| governments do.
|
| They basically need to give it the Hollywood treatment.
|
| I'm sure a lot of people don't know that prior to the
| event, the protesters lynched soldiers and set them on fire.
| GaryBluto wrote:
| They do, but prefer to use their own keywords, such as
| the June 4th incident.
| oceanplexian wrote:
| The question you should ask yourself is why are these Chinese
| labs so "obsessed with a decades old event" that they need to
| specifically train them to ignore the training corpus?
| oxqbldpxo wrote:
| In the meantime, Sam is looking at putting more servers on
| the moon.
| isusmelj wrote:
| Is the price here correct?
| https://openrouter.ai/moonshotai/kimi-k2-thinking Would be $0.60
| for input and $2.50 per 1 million output tokens. If the model is
| really that good, it's 4x cheaper than comparable models. Is it
| hosted at a loss, or do the others have a huge margin? I might
| be missing something here. Would love some expert opinion :)
|
| FYI: the non thinking variant has the same price.
| burroisolator wrote:
| In short, the others have a huge margin if you ignore training
| costs. See https://martinalderson.com/posts/are-openai-and-
| anthropic-re... for details.
| throwdbaaway wrote:
| Somehow that article totally ignored the insane pricing of
| cached input tokens set by Anthropic and OpenAI. For agentic
| coding, typically 90~95% of the inference cost is attributed
| to cached input tokens, and a scrappy Chinese company can do it
| almost for free: https://api-docs.deepseek.com/news/news0802
| flockonus wrote:
| Yes, you may consider that open-source models hosted on
| OpenRouter are charging roughly bare hardware costs; in
| practice some providers there may even run on subsidized
| hardware, so there is money to be made.
| fspeech wrote:
| It uses 75% linear-attention layers, so it is inherently lower
| cost. And it is MoE, so active parameters are far lower.
| NiloCK wrote:
| Maybe a dumb question but: what is a "reasoning model"?
|
| I think I get that "reasoning" in this context refers to
| dynamically budgeting scratchpad tokens that aren't intended as
| the main response body. But can't any model do that, and it's
| just part of the system prompt, or more generally, the
| conversation scaffold that is being written to.
|
| Or does a "reasoning model" specifically refer to models whose
| "post training" / "fine tuning" / "rlhf" laps have been run
| against those sorts of prompts rather than simpler user-
| assistant-user-assistant back and forths?
|
| EG, a base model becomes "a reasoning model" after so much
| experience in the reasoning mines.
| rcxdude wrote:
| The latter. A reasoning model has been finetuned to use the
| scratchpad for intermediate results (which works better than
| just prompting a model to do the same).
| NiloCK wrote:
| I'd expect the same (fine tuning to be better than mere
| prompting) for most anything.
|
| So a model is or is not "a reasoning model" according to the
| extent of a fine tune.
|
| Are there specific benchmarks that compare models vs
| themselves with and without scratchpads? High with:without
| ratios being reasonier models?
|
| Curious also how much a generalist model's one-shot responses
| degrade with reasoning post-training.
| bigyabai wrote:
| > Are there specific benchmarks that compare models vs
| themselves with and without scratchpads?
|
| Yep, it's pretty common for many models to release an
| instruction-tuned and thinking-tuned model and then bench
| them against each other. For instance, if you scroll down
| to "Pure text performance" there's a comparison of these
| two Qwen models' performance:
| https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
| walthamstow wrote:
| Thanks for the Qwen tip. Interesting how much of a
| difference reasoning makes for coding.
| dcre wrote:
| The question is: fine-tuning for what? Reasoning is not a
| particular task, it is a general-purpose technique for
| directing more compute at _any_ task.
| irthomasthomas wrote:
| Pivot tokens like 'wait', 'actually' and 'alternatively'
| are boosted in order to force the model to explore
| alternate solutions.
| robkop wrote:
| > Are there specific benchmarks that compare models vs
| themselves with and without scratchpads? High with:without
| ratios being reasonier models?
|
| Yes, simplest example:
| https://www.anthropic.com/engineering/claude-think-tool
| nodja wrote:
| Any model that does thinking inside <think></think> style
| tokens before it answers.
|
| This can be done with finetuning/RL using an existing pre-
| formatted dataset, or format-based RL where the model is
| rewarded both for answering correctly and for using the right
| format.
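|
| For intuition, a minimal sketch of how a client might split
| that scratchpad from the answer (assumes the <think></think>
| convention above; the helper name is made up):
|           def split_reasoning(output: str):
|               # separate the scratchpad from the final answer
|               if "</think>" in output:
|                   thinking, answer = output.split("</think>", 1)
|                   return (thinking.removeprefix("<think>").strip(),
|                           answer.strip())
|               return "", output.strip()
|
|           t, a = split_reasoning("<think>2+2, carry...</think>4")
|           print(t, "|", a)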
| fragmede wrote:
| The model's downloadable, which is generous, but it's not open
| source.
| jimnotgym wrote:
| I was hoping this was about Summits On The Air... but no,
| it's more boring AI.
| aliljet wrote:
| How does one effectively use something like this locally with
| consumer-grade hardware?
| tintor wrote:
| Consumer-grade hardware? Even at 4bits per param you would need
| 500GB of GPU VRAM just to load the weights. You also need VRAM
| for KV cache.
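|
| The arithmetic, for reference (weights only; KV cache and
| activations come on top):
|           params = 1e12            # 1T parameters
|           bits_per_param = 4
|           print(params * bits_per_param / 8 / 1e9)  # 500.0 GB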
| CamperBob2 wrote:
| It's MoE-based, so you don't need that much VRAM.
|
| Nice if you can get it, of course.
| oceansweep wrote:
| Epyc Genoa CPU/mobo + 700GB of DDR5 RAM. The model is a MoE, so
| you don't need to stuff it all into VRAM; you can use a single
| 3090/5090 to hold the activated weights and keep the remaining
| weights in DDR5 RAM. See their deployment guide for
| reference here: https://github.com/kvcache-
| ai/ktransformers/blob/main/doc/en...
| simonw wrote:
| Once the MLX community get their teeth into it you might be
| able to run it on two 512GB M3 Ultra Mac Studios wired together
| - those are about $10,000 each though so that would be $20,000
| total.
|
| Update: https://huggingface.co/mlx-community/Kimi-K2-Thinking -
| and here it is running on two M3 Ultras:
| https://x.com/awnihannun/status/1986601104130646266
| smusamashah wrote:
| When I open this page, all I see is a WordPad-like text area
| with buttons on top and sample text inside. I don't see anything
| about any llm. I am on phone. Page is being opened via embedded
| view in an HN client.
| mmaunder wrote:
| Any word on what it takes to run this thing?
| blobbers wrote:
| TL;DR: this is an Alibaba-funded startup out of Beijing.
|
| Okay, I'm sorry, but I have to say: wtf named this thing?
| Moonshot AI is such an overused generic name that I had to ask
| an LLM which company this is. This is just Alibaba hedging
| their Qwen model.
|
| This company is far from "open source"; it's had over $1B USD
| in funding.
| hnhn34 wrote:
| > Moonshot AI is such an overused generic name that I had to
| ask an LLM which company this is
|
| I just googled "Moonshot AI" and got the information right
| away. Not sure what's confusing about it, the only other
| "Moonshot" I know of is Alphabet's Moonshot Factory.
|
| > This company is far from "open source", it's had over $1B USD
| in funding.
|
| Since when does open source mean you can't make any money?
| Mozilla has a total of $1.2B in assets. The company isn't open
| source nor claiming to be.
|
| This model was released under a "modified MIT-license" [0]:
|
| > Our only modification part is that, if the Software (or any
| derivative works thereof) is used for any of your commercial
| products or services that have more than 100 million monthly
| active users, or more than 20 million US dollars (or equivalent
| in other currencies) in monthly revenue, you shall prominently
| display "Kimi K2" on the user interface of such product or
| service.
|
| Which sounds pretty fair to me.
|
| [0] -
| https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
| woadwarrior01 wrote:
| > This company is far from "open source", it's had over $1B USD
| in funding.
|
| Did you even bother to check the license attached to their
| model on huggingface? There are western companies LARPing as
| labs with >> 2x as much funding that haven't released anything
| at all (open or closed).
| vinhnx wrote:
| Kimi K2 Thinking and MiniMax M2 Interleaved Thinking: open
| models are reaching, or have reached, frontier territory. We
| now have GPT- and Claude-Sonnet-class capability at home, as
| these models are open-weight. Around this time last year we had
| the DeepSeek moment; now is the time for another.
| almaight wrote:
| Ring-1T
| rdos wrote:
| Benchmarks show that open models are equal to SOTA closed ones,
| but my own experience and real-world use show the opposite. And
| I really wish they were closer; I run GPT-OSS 120B as a daily
| driver.
| vinhnx wrote:
| It could be that remote inference providers have issues, so
| the model can't show its potential, or is rate-limited. I
| also think Moonshot could take more time and continue with a
| K2.1 or something, like DeepSeek did.
|
| [0] https://x.com/Kimi_Moonshot/status/1986754111992451337
| ElijahLynn wrote:
| Why is the 4bit version 1.2TB and the non-4bit version 650GB?
|
| https://huggingface.co/mlx-community/Kimi-K2-Thinking-4bit
| https://huggingface.co/mlx-community/Kimi-K2-Thinking
| kachapopopow wrote:
| I think the default version here might be 2.5-bit or
| something.
| drumnerd wrote:
| The page is so obviously written with AI that it isn't even
| worth reading. Try the model if you will, but save yourselves
| the pain of reading AI slop.
| yanhangyhy wrote:
| As a Chinese user, I can say that many people use Kimi, even
| though I personally don't use it much. China's open-source
| strategy has many significant effects--not only because it aligns
| with the spirit of open source. For domestic Chinese companies,
| it also prevents startups from making reckless investments to
| develop mediocre models. Instead, everyone is pushed to start
| from a relatively high baseline. Of course, many small companies
| in the U.S., Japan, and Europe are also building on Qwen. Kimi is
| similar: before DeepSeek and others emerged, their model quality
| was pretty bad. Once the open-source strategy was set, these
| companies had no choice but to adjust their product lines and
| development approaches to improve their models.
|
| Moreover, the ultimate competition between models will eventually
| become a competition over energy. China's open-source models have
| major advantages in energy consumption, and China itself has a
| huge advantage in energy resources. They may not necessarily
| outperform the U.S., but they probably won't fall too far behind
| either.
| lettergram wrote:
| There are a lot of indications that we're currently brute-
| forcing these models. There's honestly no reason they have to
| be 1T parameters and cost an insane amount to train and run
| inference on.
|
| What we're going to see is as energy becomes a problem; they'll
| simply shift to more effective and efficient architectures on
| both physical hardware and model design. I suspect they can
| also simply charge more for the service, which reduces usage
| for senseless applications.
| yanhangyhy wrote:
| There are also elements of stock price hype and geopolitical
| competition involved. The major U.S. tech giants are all tied
| to the same bandwagon -- they have to maintain this cycle:
| buy chips - build data centers - release new models - buy
| more chips.
|
| It might only stop once the electricity problem becomes truly
| unsustainable. Of course, I don't fully understand the
| specific situation in the U.S., but I even feel that one day
| they might flee the U.S. altogether and move to the Middle
| East to secure resources.
| simpsond wrote:
| Sundar is talking about fleeing earth to secure photons and
| cooling in space.
| amluto wrote:
| Good luck. Space has lots of photons but really quite
| poor cooling resources.
| MallocVoidstar wrote:
| > What we're going to see is as energy becomes a problem
|
| This is much more likely to be an issue in the US than in
| China. https://fortune.com/2025/08/14/data-centers-china-
| grid-us-in...
| thesmtsolver wrote:
| Disagree. Part of the reason China produces more power (and
| pollution) is due to China manufacturing for the US.
|
| https://www.brookings.edu/articles/how-do-china-and-
| america-...
|
| The source for China's energy is more fragile than that of
| the US.
|
| > Coal is by far China's largest energy source, while the
| United States has a more balanced energy system, running on
| roughly one-third oil, one-third natural gas, and one-third
| other sources, including coal, nuclear, hydroelectricity,
| and other renewables.
|
| Also, China's economy is a bit less efficient in terms of
| power used per unit of GDP. China relies on coal and
| imports.
|
| > However, China uses roughly 20% more energy per unit of
| GDP than the United States.
|
| Remember, China still suffers from blackouts due to
| manufacturing demand not matching supply. The fortune
| article seems like a fluff piece.
|
| https://www.npr.org/2021/10/01/1042209223/why-covid-is-
| affec...
|
| https://www.bbc.com/news/business-58733193
| mullingitover wrote:
| These stories are from 2021.
|
| China has been adding something like a 1GW coal plant's
| worth of solar generation _every eight hours_ in the past
| year, and the rate is accelerating. The US is no longer a
| serious competitor for China when it comes to energy
| production.
| DeH40 wrote:
| The reason it happened in 2021, I think, might be that
| China took on the production capacity gap caused by COVID
| shutdowns in other parts of the world. The short-term
| surge in production led to a temporary imbalance in the
| supply and demand of electricity.
| tmule wrote:
| China's breakneck development is difficult for many in
| the US to grasp (root causes: baselining on sluggish
| domestic growth, and a condescending view of
| China). This article offers a far more accurate picture
| of how China is doing right now:
| https://archive.is/wZes6
| eddyg wrote:
| Eye-opening summary... I knew China was ahead, but wow.
| Thanks for sharing that article.
| yanhangyhy wrote:
| I don't remember many details about the situation in 2021.
| But China is in a period of technological explosion--many
| things are changing at an incredible speed. In just a few
| years, China may have completely transformed in various
| fields.
|
| Western media still carry strong biases toward China's
| political system, and they have done far too little to
| portray the country's real situation. The narrative
| remains the same old one: "China succeeded because it's
| capitalist," or "China is doomed because it's communist."
|
| But in reality, barely a few days go by without some new
| technological breakthrough or innovation happening in
| China. The pace of progress is so fast that even people
| inside the country don't always keep up with it. For
| example, just since the start of November, we've seen
| China's space station crew doing a barbecue in orbit,
| researchers in Hefei making new progress on an artificial
| sun, and a team discovering a safe and efficient method
| for preparing aromatic amines. Apart from the space
| station bit--which got some attention--the others barely
| made a ripple. Also, China's first electromagnetic
| catapult aircraft carrier has officially entered service.
|
| About a year ago, I started using Reddit intensively.
| What I read most on Reddit are reports related to
| electricity, because it involves environmental protection
| and hatred towards Trump, etc. There are too many
| leftists, so the discussions are somewhat biased. But the
| related news reports and nuclear data are real. China
| reached its carbon peak in 2025, and this year it has
| truly become a powerhouse in electricity. National data
| centers are continuously being built, but residential
| electricity prices have never been and will never be
| affected. China still has a lot of coal-fired power, but
| it continues to carry out technological upgrades on those
| plants. At the same time, wind, solar, nuclear and other
| sources are all advancing steadily. China is the only
| country that is not controlled by ideology and is
| increasing its electricity capacity in a scientific way.
|
| (Maybe in the AI field people like to talk more. Not only
| did Kimi release a new model; Xpeng has a new robot that
| drew some attention. These all happened within a few
| days.)
| brazukadev wrote:
| It's absolutely impressive to see China's development.
| I'm happy my country is slowly but surely moving to
| China's orbit of influence, especially economically.
| yanhangyhy wrote:
| If it's improving living standards for the people, then it
| surely is a good thing.
| sirtaj wrote:
| "Not controlled by ideology" is a pretty bold statement
| to make about a self-declared Communist single-party
| country. There is always an ideology. You just happen to
| agree with whatever this one is (Controlled-market
| Communism? I don't know what the precise term is).
| yanhangyhy wrote:
| I cannot edit this now, so I want to add a clarification:
| it just means that on this specific topic - electricity -
| China doesn't act like the US or Germany, abandoning wind
| or nuclear; it is only based on science.
| dvt wrote:
| > China is the only country that is not controlled by
| ideology and is increasing its electricity capacity in a
| scientific way.
|
| I've recently noticed a lot of pro-CCP propaganda on
| social media (especially Instagram and TikTok), but
| strangely also on HN; kind of interesting. To anyone
| making the (trivially false) claim that China is not
| controlled by ideology, I'm not quite sure how you'd
| convince them of the opposite. I'm not a doomer, but as
| China ramps up their aggression towards Taiwan (and the
| US will inevitably have to intervene), this will likely
| not end well in the next 5-10 years.
| yanhangyhy wrote:
| I mean only on this specific topic: electricity. Arguing
| about other things is pointless since HN has the same
| political leaning as Reddit, so I will pass.
| FooBarWidget wrote:
| I also think that one claim is dubious, but do you really
| have to focus on only that part to the exclusion of
| everything else? All the progress made is real,
| regardless of your opinion on the existence of ideology.
| onion2k wrote:
| _Part of the reason China produces more power (and
| pollution) is due to China manufacturing for the US._
|
| Presumably they'd stop doing that once AI becomes a more
| beneficial use for the energy though.
| bildung wrote:
| As counterpoints to illustrate Chinas current
| development:
|
| * China has produced more PV panel capacity in the first
| half of this year than the US has installed, all in all,
| in all of its history
|
| * China alone has installed PV capacity of over 1000 GW
| today
|
| * China has installed battery electrical storage of about
| 100 GW / 300 GWh today and aims to have 180 GW in 2027
| simonw wrote:
| > There's honestly not a reason they have to be 1T parameters
| and cost an insane amount to train and run on inference.
|
| Kimi K2 Thinking is rumored to have cost $4.6m to train -
| according to "a source familiar with the matter":
| https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-
| rele...
|
| I think the most interesting recent Chinese model may be
| MiniMax M2, which is just 200B parameters but benchmarks
| close to Sonnet 4, at least for coding. That's small enough
| to run well on ~$5,000 of hardware, as opposed to the 1T
| models which require vastly more expensive machines.
| electroglyph wrote:
| I assume that $4.6 mil is just the cost of the electricity?
| simonw wrote:
| Hard to be sure because the source of that information
| isn't known, but generally when people talk about
| training costs like this they include more than just the
| electricity but exclude staffing costs.
|
| Other reported training costs tend to include rental of
| the cloud hardware (or equivalent if the hardware is
| owned by the company), e.g. NVIDIA H100s are sometimes
| priced out in cost-per-hour.
| Der_Einzige wrote:
| Citation needed on "generally when people talk about
| training costs like this they include more than just the
| electricity but exclude staffing costs".
|
| It would be simply wrong to exclude the staffing costs.
| When each engineer costs well over 1 million USD in total
| costs year over year, you sure as hell account for them.
| vanviegen wrote:
| No, because what people are generally trying to express
| with numbers like these is how much compute went into
| training. Perhaps another measure, like zettaflops or
| something, would have made more sense.
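|
| For scale, the usual back-of-envelope rule (an approximation
| for dense transformers; both numbers below are placeholders):
|           N = 1e12       # parameters
|           D = 15e12      # training tokens
|           flops = 6 * N * D          # ~6ND rule of thumb
|           print(flops / 1e21, "zettaFLOPs")  # 90000.0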
| simonw wrote:
| If you have 1,000 researchers working for your company
| and you constantly have dozens of different training runs
| in the go, overlapping each other, how would you split
| those salaries between those different runs?
|
| Calculating the cost in terms of GPU-hours is a whole lot
| easier from an accounting perspective.
|
| The papers I've seen that talk about training cost all do
| it in terms of GPU hours. The gpt-oss model card said 2.1
| million H100-hours for gpt-oss:120b. The Llama 2 paper
| said 3.31M GPU-hours on A100-80G. They rarely give actual
| dollar costs and I've never seen any of them include
| staffing hours.
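|
| Converting GPU-hours to dollars is then one multiply (the H100
| rate below is an assumed rental price, not from the model
| card):
|           h100_hours = 2.1e6    # gpt-oss:120b, per its model card
|           usd_per_hour = 2.0    # assumed H100 rental rate
|           print(h100_hours * usd_per_hour)  # ~$4.2M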
| shawndrost wrote:
| Do they include the costs of dead-end runs?
| Der_Einzige wrote:
| No, they don't! That's why the "5.5 million" deepseek V3
| number as read by American investors was total bullshit
| (because investors ignored their astrik saying "only
| final training run")
| simonw wrote:
| Yeah, that's one of the most frustrating things about
| these published numbers. Nobody ever wants to share how
| much money they spent on runs that _didn 't_ produce a
| useful model.
|
| As with staffing costs though it's hard to account for
| these against individual models. If Anthropic run a bunch
| of training experiments that help them discover a new
| training optimization, then use that optimization as part
| of the runs for the next Opus _and_ Sonnet _and_ Haiku
| (and every subsequent model for the lifetime of the
| company) how should the cost of that experimental run be
| divvied up?
| pama wrote:
| Table 1:
|
| https://arxiv.org/html/2412.19437v2
| Der_Einzige wrote:
| That number is as real as the 5.5 million to train
| DeepSeek. Maybe it's real if you're only counting the
| literal final training run, but with total costs - including
| the huge number of failed runs and all other costs - accounted
| for, it's several hundred million to train a model that's
| usually still worse than Claude, Gemini, or ChatGPT. It
| took 1B+ (500 billion on energy and chips ALONE) for Grok
| to get into the "big 4".
| tw1984 wrote:
| By that logic, one could even argue that the real cost
| needs to include the underlying infrastructure: the total
| investment in the semiconductor industry, the national
| electricity grid, education, and even defence.
| Der_Einzige wrote:
| Correct! You do have to account for all of these things!
| Unironically correct! :)
| nl wrote:
| Can confirm MiniMax M2 is very impressive!
| oxcidized wrote:
| > That's small enough to run well on ~$5,000 of hardware...
|
| Honestly curious where you got this number. Unless you're
| talking about extremely small quants. Even just a Q4 GGUF
| is ~130GB. Am I missing out on a relatively cheap way
| to run models well that are this large?
|
| I suppose you might be referring to a Mac Studio, but
| (while I don't have one to be a primary source of
| information) it seems like there is some argument to be
| made on whether they run models "well"?
| simonw wrote:
| Yes, I mean a Mac Studio with MLX.
|
| An M3 Ultra with 256GB of RAM is $5599. That should just
| about be enough to fit MiniMax M2 at 8-bit for MLX:
| https://huggingface.co/mlx-community/MiniMax-M2-8bit
|
| Or maybe run a smaller quantized one to leave more memory
| for other apps!
|
| Here are performance numbers for the 4bit MLX one:
| https://x.com/ivanfioravanti/status/1983590151910781298 -
| 30+ tokens per second.
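|
| If you want to try it, a minimal mlx-lm sketch (the model
| repo is the one linked above; note that ~200B weights at
| 8 bits is roughly 200GB before KV cache, which is why 256GB
| is about the floor):
|
|     # pip install mlx-lm  (Apple Silicon only)
|     from mlx_lm import load, generate
|
|     # Downloads ~200GB of weights on first use.
|     model, tokenizer = load("mlx-community/MiniMax-M2-8bit")
|     print(generate(model, tokenizer,
|                    prompt="Explain unified memory in one paragraph.",
|                    max_tokens=200))
|
| (A real chat run would normally apply the tokenizer's chat
| template first; this is just the smallest smoke test.)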
| Leynos wrote:
| Having larger models is nice because they have a much wider
| sphere of knowledge to draw on. Not in the sense of using
| them as encyclopedias. More in the sense that I want a model
| that is going to be able to cross reference from multiple
| domains that I might not have considered when trying to solve
| a problem.
| yanhangyhy wrote:
| One thing to add: the most popular AI product in China is
| not Kimi, I think; it should be Doubao by ByteDance (the
| TikTok owner) and Yuanbao by Tencent. They have a better UI
| and feature set, and you can also select the DeepSeek model
| from them. Kimi still has a lot of users, but I think in
| the long term it still may not do well. So it's still a win
| for closed models?
| dzonga wrote:
| You guys will outperform the US, no doubt.
|
| First, energy generation: multiples of what the US is
| producing. What does AI need? Energy.
|
| Second, the open-source nature of the models means, as you
| said, a high baseline to start from, and faster iteration.
| chaostheory wrote:
| Going on a tangent, is Europe even close? Mistral has been
| underwhelming
| LaurensBER wrote:
| Not anywhere near close.
|
| Europe doesn't have the infrastructure (legal or energy)
| and US companies offer far better compensation for talent.
|
| But hey, at least we have AI regulation! (sad smile :))
| troyvit wrote:
| I don't know how close Europe is, but I'm sufficiently
| whelmed by Mistral that I don't need to look elsewhere yet.
| It's kind of like having a Toyota Corolla while everybody
| else is driving around in smart cars, but it gets it done.
| On top of it, there's a loyal community that (maybe because
| I'm not looking) I don't see with other products. It
| probably depends on your uses, but if I spent all my time
| chasing the latest chat models (like Kimi K2 for instance)
| I wouldn't actually get anything done.
| dehugger wrote:
| Probably cuz you aren't looking, yeah. Anthropic seems to
| be leading the "loyalty" war in the US.
| troyvit wrote:
| Yeah and I'll probably end up going that way as work
| locks down the models we're allowed to use, saving
| Mistral for personal projects.
| david927 wrote:
| > will outperform
|
| _does outperform_
|
| China is absolutely winning innovation in the 21st century.
| I'm so impressed. For an example from just this morning,
| there was an article that they're developing thorium reactor-
| powered cargo ships. I'm blown away.
| yanhangyhy wrote:
| I remember this thing. The tech is actually from America,
| decades ago (thorium). But they gave up and China continued
| the work in recent years.
| rmorey wrote:
| "The tech is from America actually, decades ago... But
| they give up and china continues the work"
|
| Many such cases...
| embedding-shape wrote:
| > The tech is actually from America, decades ago
| (thorium).
|
| I guess it depends on how you see it, but regardless, the
| people putting it to use today don't seem to be in the
| US.
|
| FWIW:
|
| > Thorium was discovered in 1828 by the Swedish chemist
| Jöns Jacob Berzelius during his analysis of a new mineral
| [...] In 1824, after more deposits of the same mineral in
| Vest-Agder, Norway, were discovered [...] While thorium
| was discovered in 1828 its first application dates only
| from 1885, when Austrian chemist Carl Auer von Welsbach
| invented the gas mantle [...] Thorium was first observed
| to be radioactive in 1898, by the German chemist Gerhard
| Carl Schmidt
|
| For being an American discovery, it sure has a lot of
| European people involved in it :) (I've said it elsewhere
| but it's worth repeating: trying to track down _where_ a
| technology/invention _actually_ comes from is a fool's
| errand, and there is always something earlier that led to
| today, so it doesn't serve much purpose except nationalism,
| it seems to me.)
| conradev wrote:
| The tech we're talking about is using Thorium for useful
| work:
|
| https://en.wikipedia.org/wiki/Thorium-based_nuclear_power#Hi...
|
| https://en.wikipedia.org/wiki/Molten-Salt_Reactor_Experiment
| mensetmanusman wrote:
| These models are interesting in how they censor depending
| on the language of the request.
| almaight wrote:
| Ring-1T, a SOTA open-source trillion-parameter reasoning model
| gradus_ad wrote:
| While I absolutely support these open source models, there is an
| interesting angle to consider... If I were a Chinese partisan
| looking to inflict a devastating blow to the US, taking the AI
| hype wind out of American tech valuation sails would seem a great
| option. How best to do this? Release highly performant models...
| For free! Extremely efficient in terms of RMB spent vs
| (unrealized) USD lost. But surely, these model releases are just
| the immaculate free market at work. No CCP pulling strings for
| geo-political-industrial wins, certainly not.
| eagleinparadise wrote:
| But they're literally not free. If this were "war", with
| infinite money to throw at destroying the US AI industry,
| then why would you charge for it at all and undercut that
| outcome?
| gradus_ad wrote:
| Because subsidizing the necessary level of compute for that
| is unsustainable. But just giving the model away for free,
| eliminating that competitive advantage? Well, that itself is
| free.
| nsonha wrote:
| Google Maps, GPS, the Internet, etc. being free is surely
| just a CIA plan to take over the world
| Palmik wrote:
| On the other hand, several startups such as Cursor and
| Cognition+Windsurf are building their new models on top of the
| open source Chinese models.
|
| Were it not for those models, they would be at the mercy
| of the frontier labs, which have insane operational margins
| on their APIs. As a result you'd see much more
| consolidation.
| kachapopopow wrote:
| The government might be (relatively speaking) evil; the
| people are most definitely not.
| xrd wrote:
| Is this a typo:
|
| "Where p is the pdf of a random variable sampled by the given
| procedure"
|
| That was in the first expanded section, when it discussed
| the PhD-level math problem it solved.
|
| I'm not a Phd nor a Pdf but it seemed strange to me.
| spenczar5 wrote:
| no, "pdf" is a very typical shortening for "probability density
| function," its correct.
| baalimago wrote:
| Unfortunate how many of the 'non-mainstream' models are
| poor at function calling. I'm trying K2 out via Novia AI
| and it
| consistently fails to format function calls, breaking the
| reasoning flow.
| Palmik wrote:
| This is most likely an issue on the side of the inference
| provider: https://github.com/MoonshotAI/K2-Vendor-Verifier
|
| For example, Together AI has only a 71% success rate, while
| the official API has a 100% success rate.
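|
| An easy way to spot-check a provider yourself is to send a
| single tool-call request through its OpenAI-compatible
| endpoint and look at what comes back (the base_url and
| model id below are placeholders for whichever provider
| you're testing):
|
|     from openai import OpenAI
|
|     client = OpenAI(base_url="https://provider.example/v1",
|                     api_key="...")
|     resp = client.chat.completions.create(
|         model="kimi-k2-thinking",  # provider-specific model id
|         messages=[{"role": "user",
|                    "content": "What's the weather in Paris?"}],
|         tools=[{
|             "type": "function",
|             "function": {
|                 "name": "get_weather",
|                 "description": "Get current weather for a city",
|                 "parameters": {
|                     "type": "object",
|                     "properties": {"city": {"type": "string"}},
|                     "required": ["city"],
|                 },
|             },
|         }],
|     )
|     # A well-behaved deployment returns a structured
|     # tool_calls list; a broken one returns malformed JSON
|     # or plain text here.
|     print(resp.choices[0].message.tool_calls)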
| miletus wrote:
| From our tests, Kimi K2 Thinking is better than literally
| everything: GPT-5, Claude 4.5 Sonnet. The only model that
| is better than Kimi K2 Thinking is GPT-5 Codex.
|
| It's now available on https://okara.ai if anyone wants to try it.
| vessenes wrote:
| Just tried it -- it is good to very good by my tests too. Do
| you know what is great though? The Okara interface. I used it
| on mobile and it was nearly pain free and pretty to boot.
| Really nice work by your product team.
| abdellah123 wrote:
| This should be compared with ChatGPT PRO. Otherwise it's an
| unfair comparison.
|
| In any case, I tried it and it delivered. Kudos to the Kimi
| Amazing work
| Mashimo wrote:
| Oh neat. One of the examples is a Strudel.cc track.
|
| I tried to get ChatGPT to create me a song a few weeks
| back, and it would always very quickly dream up methods
| that don't exist.
| Leynos wrote:
| Kimi K2 seemingly has a much more up-to-date training set.
| c0brac0bra wrote:
| Kimi has been fantastic for brainstorming.
|
| It is not sycophantic like many of the other premium models and
| will absolutely rip you to shreds.
| taf2 wrote:
| Looks really amazing, but I'm wondering: is this one
| available to download? I see this: "K2 Thinking is now live
| on kimi.com under the chat mode [1], with its full agentic
| mode available soon. It is also accessible through the Kimi
| K2 Thinking API." But will this be on Hugging Face? I'd
| like to give it a test run locally.
___________________________________________________________________
(page generated 2025-11-07 23:02 UTC)