[HN Gopher] Kimi K2
___________________________________________________________________
Kimi K2
Author : c4pt0r
Score : 118 points
Date : 2025-07-11 15:38 UTC (7 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| gs17 wrote:
| > 1T total / 32B active MoE model
|
| Is this the largest open-weight model?
| bigeagle wrote:
| I believe so.
|
| Grok-1 is 314B, DeepSeek-V3 is 671B, and recent new
| open-weight models are around 70B~300B.
| adt wrote:
| No.
|
| At 1T MoE on 15.5T tokens, K2 is one of the largest open source
| models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens:
| https://huggingface.co/CofeAI/Tele-FLM-1T
|
| You can always check here: https://lifearchitect.ai/models-table/
| simonw wrote:
| Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct
| model weights are 958.52 GB
| c4pt0r wrote:
| Paired with programming tools like Claude Code, it could be a
| low-cost, open-source replacement for Sonnet.
| kkzz99 wrote:
| According to the benchmarks it's closer to Opus, though I'd
| venture primarily for English and Chinese.
| martin_ wrote:
| how do you run a 1T param model at low cost?
| maven29 wrote:
| 32B active parameters with a single shared expert.
| JustFinishedBSG wrote:
| This doesn't change the VRAM usage, only the compute
| requirements.
| maven29 wrote:
| You can probably run this on CPU if you have a 4090D for
| prompt processing, since 1TB of DDR4 only comes out to
| around $600.
|
| For GPU inference at scale, I think token-level batching
| is used.
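
For the curious, a minimal sketch of that CPU-plus-GPU split using
llama-cpp-python. The GGUF file name and quantisation level here are
assumptions (no such file is part of the official release), and
n_gpu_layers controls how much gets offloaded to the GPU:

    # pip install llama-cpp-python (built with CUDA support)
    from llama_cpp import Llama

    # Hypothetical quantised K2 file; path and quant level are assumptions.
    llm = Llama(
        model_path="kimi-k2-instruct-q2_k.gguf",
        n_gpu_layers=4,   # offload only a few layers; the rest stays in system RAM
        n_ctx=8192,       # context window
        n_threads=32,     # CPU threads used for generation
    )

    out = llm("Write a haiku about mixture-of-experts models.", max_tokens=64)
    print(out["choices"][0]["text"])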
| t1amat wrote:
| With 32B active parameters it would be ridiculously slow
| at generation.
| selfhoster11 wrote:
| DDR3 workstation here - R1 generates at 1 token per
| second. In practice, this means that for complex queries,
| the speed of replying is closer to an email response than
| a chat message, but this is acceptable to me for
| confidential queries or queries where I need the model to
| be steerable. I can always hit the R1 API from a provider
| instead, if I want to.
|
| Given that R1 uses 37B active parameters (compared to 32B
| for K2), K2 should be slightly faster than that - around
| 1.15 tokens/second.
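
The arithmetic behind that estimate, assuming generation is purely
memory-bandwidth-bound (speed scales inversely with the active
parameters read per token):

    r1_active = 37e9   # DeepSeek R1 active parameters per token
    k2_active = 32e9   # Kimi K2 active parameters per token
    r1_speed = 1.0     # tokens/second observed for R1 on the DDR3 box

    # Bandwidth-bound estimate: speed scales with 1 / active parameters
    k2_speed = r1_speed * (r1_active / k2_active)
    print(f"~{k2_speed:.2f} tokens/second")   # ~1.16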
| zackangelo wrote:
| Typically a combination of expert level parallelism and
| tensor level parallelism is used.
|
| For the big MLP tensors they would be split across GPUs
| in a cluster. Then for the MoE parts you would spread the
| experts across the GPUs and route to them based on which
| experts are active (there would likely be more than one
| if the batch size is > 1).
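
A toy illustration of that routing idea (not any particular serving
stack): experts are spread across devices, and each token in the batch
is dispatched to whichever devices host the experts its router picked.

    # Toy sketch of expert-parallel routing, for illustration only.
    from collections import defaultdict

    NUM_EXPERTS = 16
    NUM_DEVICES = 4
    expert_to_device = {e: e % NUM_DEVICES for e in range(NUM_EXPERTS)}

    # Router output: for each token in the batch, its top-k expert ids (k=2 here)
    batch_topk = {0: [3, 7], 1: [7, 12], 2: [0, 3]}

    dispatch = defaultdict(list)   # device -> list of (token, expert) work items
    for token, experts in batch_topk.items():
        for e in experts:
            dispatch[expert_to_device[e]].append((token, e))

    for device, work in sorted(dispatch.items()):
        print(f"GPU {device}: {work}")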
| selfhoster11 wrote:
| It does not have to be VRAM, it could be system RAM, or
| weights streamed from SSD storage. Reportedly, the latter
| method achieves around 1 token per second on computers
| with 64 GB of system RAM.
|
| R1 (and K2) is MoE, whereas Llama 3 is a dense model
| family. MoE actually makes these models practical to run
| on cheaper hardware. DeepSeek R1 is more comfortable for
| me than Llama 3 70B for exactly that reason - if it
| spills out of the GPU, you take a large performance hit.
|
| If you need to spill into CPU inference, you really want
| to be multiplying a different set of 32B weights for
| every token compared to the same 70B (or more) instead,
| simply because the computation takes so long.
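
A rough way to see why the active-parameter count dominates CPU decode
speed: every generated token has to stream all active weights through
the memory bus, so an upper bound on tokens/second is bandwidth divided
by active bytes per token. The bandwidth figure below is a ballpark
assumption, not a measurement.

    # Rough upper bound on CPU decode speed: bandwidth / active bytes per token.
    def max_tokens_per_sec(active_params, bytes_per_param, bandwidth_gb_s):
        active_bytes = active_params * bytes_per_param
        return bandwidth_gb_s * 1e9 / active_bytes

    ddr4_bw = 50    # GB/s, typical multi-channel workstation (assumption)
    print(f"MoE, 32B active @ 4-bit: {max_tokens_per_sec(32e9, 0.5, ddr4_bw):.1f} tok/s")
    print(f"Dense, 70B @ 4-bit:      {max_tokens_per_sec(70e9, 0.5, ddr4_bw):.1f} tok/s")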
| refulgentis wrote:
| The number of people who will be using it at 1 token/sec
| because there's no better option, _and_ who have 64 GB of
| RAM, is _vanishingly_ small.
|
| IMHO it sets the local LLM community back when we lean on
| extreme quantization & streaming weights from disk to say
| something is possible*, because when people try it out,
| it turns out it's an awful experience.
|
| * the implication being, _anything_ is possible in that
| scenario
| homarp wrote:
| An agentic loop can run all night long. It's just a
| different way to work: prepare your prompt queue, set it
| up, check the results in the morning, adjust. 'Local
| vibe coding' in 10 hours instead of 10 minutes is still
| better than 10 days of manual coding on the side.
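
What that overnight queue can look like in practice, as a rough sketch:
the endpoint URL, model id, and file names below are placeholders, and
any OpenAI-compatible server (llama.cpp, vLLM, a hosted API) could sit
behind it.

    # Overnight prompt queue: work through tasks one by one and save the results.
    import requests

    API = "http://localhost:8000/v1/chat/completions"   # placeholder endpoint
    MODEL = "kimi-k2-instruct"                           # placeholder model id

    with open("queue.txt") as f:
        tasks = [line.strip() for line in f if line.strip()]

    for i, task in enumerate(tasks):
        resp = requests.post(API, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": task}],
        }, timeout=None)
        answer = resp.json()["choices"][0]["message"]["content"]
        with open(f"result_{i:03d}.md", "w") as out:
            out.write(f"# Task\n{task}\n\n# Answer\n{answer}\n")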
| cyanf wrote:
| This is both the largest oss model release thus far, and the
| largest Muon training run.
| wiradikusuma wrote:
| I've only started using Claude, Gemini, etc in the last few
| months (I guess it comes with age, I'm no longer interested in
| trying the latest "tech"). I assume those are "non-agentic"
| models.
|
| From reading articles online, "agentic" means like you have a
| "virtual" Virtual Assistant with "hands" that can google, open
| apps, etc, on their own.
|
| Why not use existing "non-agentic" models and "orchestrate" them
| using LangChain, MCP, etc.? Why create a new breed of model?
|
| I'm sorry if my questions sound silly. Following AI world is like
| following JavaScript world.
| ozten wrote:
| It is not a silly question. The various flavors of LLM have
| issues with reliability. In software we expect five 9s; LLMs
| aren't even at one 9. Early on it was the reliability of
| writing JSON output. Then instruction following. Then tool use.
| Now it's "computer use" and orchestration.
|
| Creating models for this specific problem domain will have a
| better chance at reliability, which is not a solved problem.
|
| Jules is the Gemini coder that links to GitHub. Half the time
| it doesn't create a pull request, and it forgets and assumes
| I'll do some testing or something. It's wild.
| simonw wrote:
| "Agentic" and "agent" can mean pretty much anything, there are
| a ton of different definitions out there.
|
| When an LLM is described as "agentic" it usually means that it
| has been optimized for tool use. Pretty much _all_ the big
| models (and most of the small ones) are designed for tool use
| these days; it's an incredibly valuable feature for a model to
| offer.
|
| I don't think this new model is any more "agentic" than o3,
| o4-mini, Gemini 2.5 or Claude 4. All of those models are
| trained for tools, all of them are very competent at running
| tool calls in a loop to try to achieve a goal they have been
| given.
| dcre wrote:
| Reasonable question, simple answer: "new breed of model" is
| overstating it -- for years all these models have been fine-
| tuned using reinforcement learning on a variety of tasks. It's
| just that the set of tasks (and maybe the amount of RL) has
| changed over time to include more tool-use tasks, and this has
| made them much, much better at tool use. The explosion of
| tools like Claude Code this year is driven by the models just
| being more effective at it. The orchestration external to the
| model that you mention is what people did before this year,
| and it did not work as well.
| selfhoster11 wrote:
| > I'm sorry if my questions sound silly. Following AI world is
| like following JavaScript world.
|
| You are more right than you could possibly imagine.
|
| TL;DR: "agentic" just means "can call tools it's been given
| access to, autonomously, and then access the output" combined
| with an infinite loop in which the model runs over and over
| (compared to a one-off interaction like you'd see in ChatGPT).
| MCP is essentially one of the methods to expose the tools to
| the model.
|
| Is this something the models could do for a long while with a
| wrapper? Yup. "Agentic" is the current term for it, that's all.
| There's some hype around "agentic AI" that's unwarranted, but
| part of the reason for the hype is that models have become
| better at tool calling and using data in their context since
| the early days.
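
The "loop plus tools" idea in skeleton form. call_model below is a
stand-in for whatever chat API is used (it is not a real library
function), and read_file is the only tool wired up:

    # Skeleton of an "agentic" loop: call the model, run any tool it asks for,
    # feed the output back, repeat until it answers.
    def read_file(path: str) -> str:
        with open(path) as f:
            return f.read()

    TOOLS = {"read_file": read_file}

    def call_model(messages):
        # Stand-in: a real implementation would hit an LLM API and return either
        # {"tool": name, "args": {...}} or {"answer": text}.
        return {"answer": "done"}

    def run_agent(task: str, max_steps: int = 20):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):                 # bounded, not truly infinite
            reply = call_model(messages)
            if "answer" in reply:
                return reply["answer"]
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        return "step limit reached"

    print(run_agent("Summarise README.md"))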
| simonw wrote:
| Pelican on a bicycle result:
| https://simonwillison.net/2025/Jul/11/kimi-k2/
| _alex_ wrote:
| wow!
| ebiester wrote:
| At this point, they _have_ to be training for it. At what
| point will you start using something else?
| simonw wrote:
| Once I get a picture that genuinely looks like a pelican
| riding a bicycle!
| MaxPock wrote:
| Would be hilarious if Zuck, with his billion-dollar poaching,
| failed to beat budget Chinese models.
| physix wrote:
| That reminds me of a thought I had about the poachings.
|
| The poaching was probably aimed more at hamstringing Meta's
| competition, because the disruption caused by those people
| leaving in droves is probably more severe than the benefit of
| having them on board. Unless they are gods, of course.
| rfoo wrote:
| Wikipedia lists a FAIR alumnus as a cofounder of this "Moonshot
| AI". Makes it even funnier, probably.
| aliljet wrote:
| If the SWE Bench results are to be believed... this looks best in
| class right now for a local LLM. To be fair, show me the guy who
| is running this locally...
| selfhoster11 wrote:
| It's challenging, but not impossible. With 2-bit quantisation,
| only about 250-ish gigabytes of RAM is required. It doesn't
| have to be VRAM either, and you can mix and match GPU+CPU
| inference.
|
| In addition, some people on /r/localLlama are having success
| with streaming the weights off SSD storage at 1 token/second,
| which is about the rate I get for DeepSeek R1.
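
Where the ~250 GB figure comes from, for reference: at 2 bits per
weight, 1T parameters works out to roughly 250 GB before KV cache and
runtime overhead.

    params = 1e12          # ~1T total parameters
    bits_per_param = 2     # 2-bit quantisation
    gigabytes = params * bits_per_param / 8 / 1e9
    print(f"~{gigabytes:.0f} GB")   # ~250 GB, before KV cache and overhead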
| helloericsf wrote:
| How does it stack up against the new Grok 4 model?
| Imustaskforhelp wrote:
| I really, really want to try this model for free, since I just
| don't have a GPU.
|
| Is there any way that I could do so?
|
| OpenRouter? Or does Kimi have their own website? Just curious
| to really try it out!
___________________________________________________________________
(page generated 2025-07-11 23:01 UTC)