[HN Gopher] Kimi K2
       ___________________________________________________________________
        
       Kimi K2
        
       Author : c4pt0r
       Score  : 118 points
       Date   : 2025-07-11 15:38 UTC (7 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | gs17 wrote:
       | > 1T total / 32B active MoE model
       | 
       | Is this the largest open-weight model?
        
         | bigeagle wrote:
         | I believe so.
         | 
          | Grok-1 is 314B, DeepSeek-V3 is 671B, and recent new
          | open-weights models are around 70B~300B.
        
         | adt wrote:
         | No.
         | 
          | At 1T MoE on 15.5T tokens, K2 is one of the largest open source
          | models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens:
          | https://huggingface.co/CofeAI/Tele-FLM-1T
          | 
          | You can always check here:
          | https://lifearchitect.ai/models-table/
        
       | simonw wrote:
       | Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct
       | model weights are 958.52 GB
        
         | c4pt0r wrote:
         | Paired with programming tools like Claude Code, it could be a
         | low-cost/open-source replacement for Sonnet
        
           | kkzz99 wrote:
            | According to the benchmarks it's closer to Opus, but I'd
            | venture that's primarily for English and Chinese.
        
           | martin_ wrote:
            | How do you run a 1T-param model at low cost?
        
             | maven29 wrote:
             | 32B active parameters with a single shared expert.
        
               | JustFinishedBSG wrote:
               | This doesn't change the VRAM usage, only the compute
               | requirements.
        
               | maven29 wrote:
               | You can probably run this on CPU if you have a 4090D for
               | prompt processing, since 1TB of DDR4 only comes out to
               | around $600.
               | 
               | For GPU inference at scale, I think token-level batching
               | is used.
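                | 
                | As rough sizing behind that, assuming exactly 1T
                | parameters and counting weights only (no KV cache or
                | activations):
                | 
                |     params = 1.0e12
                |     for name, b in [("fp16", 2), ("int8", 1),
                |                     ("int4", 0.5)]:
                |         print(name, int(params * b / 1e9), "GB")
                |     # fp16 2000 GB, int8 1000 GB, int4 500 GB,
                |     # so ~1 TB of DDR4 holds the weights at int8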
        
               | t1amat wrote:
               | With 32B active parameters it would be ridiculously slow
               | at generation.
        
               | selfhoster11 wrote:
               | DDR3 workstation here - R1 generates at 1 token per
               | second. In practice, this means that for complex queries,
               | the speed of replying is closer to an email response than
               | a chat message, but this is acceptable to me for
               | confidential queries or queries where I need the model to
               | be steerable. I can always hit the R1 API from a provider
               | instead, if I want to.
               | 
               | Given that R1 uses 37B active parameters (compared to 32B
               | for K2), K2 should be slightly faster than that - around
               | 1.15 tokens/second.
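                | 
                | A rough check of that figure, assuming decode is purely
                | memory-bandwidth-bound, so tokens/sec scale inversely
                | with the active parameter count (and that both models
                | run at the same precision):
                | 
                |     r1_active = 37e9   # DeepSeek R1 active params
                |     k2_active = 32e9   # Kimi K2 active params
                |     r1_speed = 1.0     # tokens/sec observed above
                |     k2_speed = r1_speed * r1_active / k2_active
                |     print(round(k2_speed, 2))   # ~1.16 tokens/sec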
        
               | zackangelo wrote:
                | Typically a combination of expert parallelism and tensor
                | parallelism is used.
                | 
                | The big MLP tensors would be split across the GPUs in a
                | cluster. Then for the MoE parts you would spread the
               | experts across the GPUs and route to them based on which
               | experts are active (there would likely be more than one
               | if the batch size is > 1).
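                | 
                | A rough sketch of the routing idea in plain NumPy
                | (single process; the expert-to-GPU placement is only
                | simulated with a lookup table, and the gating weights
                | and shared expert are left out):
                | 
                |     import numpy as np
                | 
                |     n_exp, n_gpu, top_k, d = 8, 4, 2, 16
                |     rng = np.random.default_rng(0)
                |     W = [rng.standard_normal((d, d))
                |          for _ in range(n_exp)]
                |     gpu_of = {e: e % n_gpu for e in range(n_exp)}
                | 
                |     def moe(x, router):
                |         # pick the top-k experts for each token
                |         logits = x @ router
                |         top = np.argsort(-logits, 1)[:, :top_k]
                |         y = np.zeros_like(x)
                |         for e in range(n_exp):
                |             sel = (top == e).any(1)
                |             if sel.any():
                |                 # a cluster ships x[sel] to gpu_of[e]
                |                 y[sel] += x[sel] @ W[e]
                |         return y
                | 
                |     x = rng.standard_normal((5, d))
                |     router = rng.standard_normal((d, n_exp))
                |     print(moe(x, router).shape)   # (5, 16)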
        
               | selfhoster11 wrote:
               | It does not have to be VRAM, it could be system RAM, or
               | weights streamed from SSD storage. Reportedly, the latter
               | method achieves around 1 token per second on computers
               | with 64 GB of system RAM.
               | 
               | R1 (and K2) is MoE, whereas Llama 3 is a dense model
               | family. MoE actually makes these models practical to run
               | on cheaper hardware. DeepSeek R1 is more comfortable for
               | me than Llama 3 70B for exactly that reason - if it
               | spills out of the GPU, you take a large performance hit.
               | 
                | If you need to spill into CPU inference, you'd much
                | rather be multiplying a different ~32B subset of the
                | weights for each token than the same 70B (or more)
                | every time, simply because the computation takes so
                | long.
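                | 
                | One rough way to see the difference, assuming 4-bit
                | weights, about 50 GB/s of usable RAM bandwidth, and
                | purely bandwidth-bound decoding (compute, KV cache and
                | any caching effects ignored):
                | 
                |     bw = 50      # assumed GB/s of RAM bandwidth
                |     bpp = 0.5    # bytes per 4-bit weight
                |     for name, active in [("32B-active MoE", 32e9),
                |                          ("70B dense", 70e9)]:
                |         gb = active * bpp / 1e9   # GB read per token
                |         print(name, round(bw / gb, 1), "tok/s max")
                |     # 32B-active MoE ~3.1, 70B dense ~1.4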
        
               | refulgentis wrote:
                | The number of people who will use it at 1 token/sec
                | because there's no better option, _and_ who have 64 GB
                | of RAM, is _vanishingly_ small.
               | 
               | IMHO it sets the local LLM community back when we lean on
               | extreme quantization & streaming weights from disk to say
               | something is possible*, because when people try it out,
               | it turns out it's an awful experience.
               | 
               | * the implication being, _anything_ is possible in that
               | scenario
        
               | homarp wrote:
                | An agentic loop can run all night long. It's just a
                | different way to work: prepare your prompt queue, set
                | it up, check the results in the morning, adjust.
                | 'Local vibe coding' in 10 hours instead of 10 minutes
                | is still better than 10 days of coding manually on the
                | side.
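                | 
                | A minimal sketch of that overnight queue against a
                | local OpenAI-compatible server (the URL, model name
                | and file names below are just placeholders):
                | 
                |     import json, requests
                | 
                |     api = "http://localhost:8080/v1/chat/completions"
                |     prompts = open("queue.txt").read().splitlines()
                |     out = []
                |     for p in prompts:
                |         r = requests.post(api, timeout=None, json={
                |             "model": "kimi-k2-instruct",
                |             "messages": [
                |                 {"role": "user", "content": p}],
                |         })
                |         r.raise_for_status()
                |         msg = r.json()["choices"][0]["message"]
                |         out.append({"prompt": p,
                |                     "answer": msg["content"]})
                |         # write after every prompt; check the file
                |         # in the morning and adjust the queue
                |         with open("results.json", "w") as f:
                |             json.dump(out, f, indent=2)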
        
       | cyanf wrote:
       | This is both the largest oss model release thus far, and the
       | largest Muon training run.
        
       | wiradikusuma wrote:
       | I've only started using Claude, Gemini, etc in the last few
       | months (I guess it comes with age, I'm no longer interested in
       | trying the latest "tech"). I assume those are "non-agentic"
       | models.
       | 
        | From reading articles online, "agentic" means something like
        | having a virtual assistant with "hands" that can google, open
        | apps, etc., on its own.
        | 
        | Why not use existing "non-agentic" models and "orchestrate" them
        | using LangChain, MCP, etc.? Why create a new breed of model?
       | 
       | I'm sorry if my questions sound silly. Following AI world is like
       | following JavaScript world.
        
         | ozten wrote:
          | It is not a silly question. The various flavors of LLM have
          | issues with reliability. In software we expect five 9s; LLMs
          | aren't even at one 9. Early on it was the reliability of their
          | JSON output. Then instruction following. Then tool use. Now
          | it's "computer use" and orchestration.
          | 
          | Models created for this specific problem domain have a better
          | chance at reliability, which is not a solved problem.
          | 
          | Jules is the Gemini coder that links to GitHub. Half the time
          | it doesn't create a pull request, and it forgets and assumes
          | I'll do some testing or something. It's wild.
        
         | simonw wrote:
         | "Agentic" and "agent" can mean pretty much anything, there are
         | a ton of different definitions out there.
         | 
         | When an LLM says it's "agentic" it usually means that it's been
         | optimized for tool use. Pretty much _all_ the big models (and
          | most of the small ones) are designed for tool use these days;
          | it's an incredibly valuable feature for a model to offer.
         | 
         | I don't think this new model is any more "agentic" than o3,
         | o4-mini, Gemini 2.5 or Claude 4. All of those models are
         | trained for tools, all of them are very competent at running
         | tool calls in a loop to try to achieve a goal they have been
         | given.
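          | 
          | The basic shape of "tool calls in a loop", sketched against
          | the common OpenAI-style chat API (the client setup, model
          | name and the single read_file tool are illustrative
          | placeholders, not anything specific to Kimi K2):
          | 
          |     import json
          |     from openai import OpenAI
          | 
          |     client = OpenAI()   # or any compatible endpoint
          |     tools = [{"type": "function", "function": {
          |         "name": "read_file",
          |         "description": "Read a text file",
          |         "parameters": {
          |             "type": "object",
          |             "properties": {"path": {"type": "string"}},
          |             "required": ["path"]}}}]
          | 
          |     msgs = [{"role": "user",
          |              "content": "Summarise README.md"}]
          |     while True:
          |         resp = client.chat.completions.create(
          |             model="kimi-k2-instruct",
          |             messages=msgs, tools=tools)
          |         m = resp.choices[0].message
          |         if not m.tool_calls:   # no more tools: done
          |             print(m.content)
          |             break
          |         msgs.append(m)         # keep the assistant turn
          |         for call in m.tool_calls:
          |             args = json.loads(call.function.arguments)
          |             result = open(args["path"]).read()
          |             msgs.append({"role": "tool",
          |                          "tool_call_id": call.id,
          |                          "content": result})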
        
         | dcre wrote:
         | Reasonable question, simple answer: "New breed of model" is
          | overstating it -- these models have been fine-tuned with
          | reinforcement learning on a variety of tasks for years; it's
          | just that the set of tasks (and maybe the amount of RL) has
          | changed over time to include more tool-use tasks, and this has
          | made the models much, much better at tool use. The explosion
          | of tools like Claude Code this year is driven by the models
          | just being more effective at it. The orchestration external to
          | the model that you mention is what people did before this
          | year, and it did not work as well.
        
         | selfhoster11 wrote:
         | > I'm sorry if my questions sound silly. Following AI world is
         | like following JavaScript world.
         | 
         | You are more right than you could possibly imagine.
         | 
         | TL;DR: "agentic" just means "can call tools it's been given
         | access to, autonomously, and then access the output" combined
         | with an infinite loop in which the model runs over and over
         | (compared to a one-off interaction like you'd see in ChatGPT).
         | MCP is essentially one of the methods to expose the tools to
         | the model.
         | 
         | Is this something the models could do for a long while with a
         | wrapper? Yup. "Agentic" is the current term for it, that's all.
         | There's some hype around "agentic AI" that's unwarranted, but
         | part of the reason for the hype is that models have become
         | better at tool calling and using data in their context since
         | the early days.
        
       | simonw wrote:
       | Pelican on a bicycle result:
       | https://simonwillison.net/2025/Jul/11/kimi-k2/
        
         | _alex_ wrote:
         | wow!
        
         | ebiester wrote:
         | At this point, they _have_ to be training it. At what point
         | will you start using something else?
        
           | simonw wrote:
           | Once I get a picture that genuinely looks like a pelican
           | riding a bicycle!
        
       | MaxPock wrote:
        | Would be hilarious if Zuck, with his billion-dollar poaching
        | spree, failed to beat budget Chinese models.
        
         | physix wrote:
         | That reminds me of a thought I had about the poachings.
         | 
         | The poaching was probably more aimed at hamstringing Meta's
         | competition.
         | 
         | Because the disruption caused by them leaving in droves is
         | probably more severe than the benefits of having them on board.
         | Unless they are gods, of course.
        
         | rfoo wrote:
          | Wikipedia lists a FAIR alumnus as a cofounder of this
          | "Moonshot AI". That probably makes it even funnier.
        
       | aliljet wrote:
        | If the SWE-bench results are to be believed... this looks best in
       | class right now for a local LLM. To be fair, show me the guy who
       | is running this locally...
        
         | selfhoster11 wrote:
         | It's challenging, but not impossible. With 2-bit quantisation,
          | only about 250 gigabytes of RAM are required. It doesn't
         | have to be VRAM either, and you can mix and match GPU+CPU
         | inference.
         | 
         | In addition, some people on /r/localLlama are having success
         | with streaming the weights off SSD storage at 1 token/second,
         | which is about the rate I get for DeepSeek R1.
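          | 
          | The rough arithmetic behind both figures (assuming exactly
          | 1T total / 32B active parameters, about 7 GB/s of sequential
          | NVMe reads, and no caching of the active weights in RAM, so
          | this is a ceiling rather than a prediction):
          | 
          |     total, active, bits = 1.0e12, 32e9, 2
          |     footprint = total * bits / 8 / 1e9    # ~250 GB
          |     per_token = active * bits / 8 / 1e9   # ~8 GB/token
          |     print(round(footprint), "GB of weights")
          |     print(round(7 / per_token, 1), "tok/s off SSD")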
        
       | helloericsf wrote:
       | How does it stack up against the new Grok 4 model?
        
       | Imustaskforhelp wrote:
        | I really, really want to try this model for free, since I just
        | don't have a GPU.
       | 
       | Is there any way that I could do so?
       | 
        | OpenRouter? Or does Kimi have its own website? Just curious to
       | really try it out!
        
       ___________________________________________________________________
       (page generated 2025-07-11 23:01 UTC)