[HN Gopher] Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model
       ___________________________________________________________________
        
       Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model
        
       Author : nekofneko
       Score  : 473 points
       Date   : 2026-01-27 05:42 UTC (1 day ago)
        
 (HTM) web link (www.kimi.com)
 (TXT) w3m dump (www.kimi.com)
        
       | billyellow wrote:
       | Cool
        
       | mangolie wrote:
       | they cooked
        
       | jumploops wrote:
       | > For complex tasks, Kimi K2.5 can self-direct an agent swarm
       | with up to 100 sub-agents, executing parallel workflows across up
       | to 1,500 tool calls.
       | 
       | > K2.5 Agent Swarm improves performance on complex tasks through
       | parallel, specialized execution [..] leads to an 80% reduction in
       | end-to-end runtime
       | 
       | Not just RL on tool calling, but RL on agent orchestration, neat!
        
         | mohsen1 wrote:
         | Parallel agents are such a simple, yet powerful hack. Using it
         | in Claude Code with TeammateTool and getting lots of good
         | results!
        
           | esperent wrote:
           | > TeammateTool
           | 
           | What is this?
        
             | jlu wrote:
              | Claude Code hidden feature currently under a feature flag:
             | 
             | https://github.com/mikekelly/claude-sneakpeek
        
             | frimmy wrote:
             | https://x.com/kieranklaassen/status/2014830266515382693 -
             | agent swarms tool shipping w/ cc soon..
        
         | XCSme wrote:
         | > Kimi K2.5 can self-direct an agent swarm
         | 
         | Is this within the model? Or within the IDE/service that runs
         | the model?
         | 
          | Because tool calling is mostly just the agent outputting "call
          | tool X", and the IDE does it and returns the data back to the
          | AI's context.
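          | 
          | A minimal sketch of that loop (the llm/tools objects and their
          | methods here are illustrative, not any particular IDE's API):
          | 
          |   # hypothetical harness loop: the model only emits
          |   # "call tool X with args"; the harness executes it
          |   def run_agent(llm, tools, task):
          |       messages = [{"role": "user", "content": task}]
          |       while True:
          |           reply = llm.chat(messages)   # one inference
          |           if not reply.tool_calls:
          |               return reply.text        # final answer
          |           for call in reply.tool_calls:
          |               out = tools[call.name](**call.args)
          |               messages.append({"role": "tool",
          |                                "name": call.name,
          |                                "content": str(out)})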
        
           | mzl wrote:
            | An LLM only outputs tokens, so this could be seen as an
            | extension of tool calling where the model has been trained on
            | the knowledge and use cases for "tool-calling" itself as a
            | sub-agent.
        
             | XCSme wrote:
              | Ok, so agent swarm = tool calling where the tool is an LLM
              | call and the argument is the prompt.
        
               | dcre wrote:
               | Sort of. It's not necessarily a single call. In the
               | general case it would be spinning up a long-running agent
               | with various kinds of configuration -- prompts, but also
               | coding environment and which tools are available to it --
               | like subagents in Claude Code.
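                | 
                | As a rough sketch, the "tool" can itself construct and run
                | an agent; Agent here is a made-up class standing in for
                | whatever the harness provides:
                | 
                |   # hypothetical: spawning a configured sub-agent
                |   def spawn_subagent(prompt, tools, cwd="."):
                |       sub = Agent(system_prompt=prompt,
                |                   tools=tools, workdir=cwd)
                |       return sub.run()  # runs with its own context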
        
               | IanCal wrote:
               | Yes largely, although they've trained a model
               | specifically for this task rather than using the base
               | model and a bit of prompting.
        
         | storystarling wrote:
         | 1,500 tool calls per task sounds like a nightmare for unit
         | economics though. I've been optimizing my own agent workflows
         | and even a few dozen steps makes it hard to keep margins
         | positive, so I'm not sure how this is viable for anyone not
         | burning VC cash.
        
           | zozbot234 wrote:
           | "tool call" is just a reference to any elementary interaction
           | with the outside system. It's not calling third-party APIs or
           | anything like that.
        
             | storystarling wrote:
             | True, but that's still 1,500 inference cycles. Even without
             | external API fees, the latency and compute burden seems
             | huge. I don't see how the economics work there without
             | significant subsidies.
        
               | darrinm wrote:
               | FWIW many tool calls can be and often are made in one
               | inference cycle.
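                | 
                | Illustrative shape only (the tool names and the async
                | harness are assumptions, not K2.5's actual API):
                | 
                |   import asyncio
                | 
                |   # one model reply can carry several tool calls;
                |   # the harness fans them out concurrently
                |   calls = [("read_file", {"path": "a.py"}),
                |            ("read_file", {"path": "b.py"}),
                |            ("grep", {"pattern": "TODO"})]
                | 
                |   async def fan_out(tools, calls):
                |       return await asyncio.gather(
                |           *(tools[name](**args)
                |             for name, args in calls))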
        
       | DeathArrow wrote:
       | Those are some impressive benchmark results. I wonder how well it
       | does in real life.
       | 
       | Maybe we can get away with something cheaper than Claude for
       | coding.
        
         | oneneptune wrote:
         | I'm curious about the "cheaper" claim -- I checked Kimi
         | pricing, and it's a $200/mo subscription too?
        
           | NitpickLawyer wrote:
            | On OpenRouter, K2.5 is at $0.60/$3 per Mtok. That's Haiku
            | pricing.
        
             | storystarling wrote:
             | The unit economics seem tough at that price for a 1T
             | parameter model. Even with MoE sparsity you are still VRAM
             | bound just keeping the weights resident, which is a much
             | higher baseline cost than serving a smaller model like
             | Haiku.
        
           | mrklol wrote:
           | They also have a $20 and $40 tier.
        
             | Alifatisk wrote:
             | If you bargain with their bot Kimmmmy (not joking), you can
             | even get lower pricing.
        
               | mohsen1 wrote:
               | tell me more...
        
               | Alifatisk wrote:
                | Go to Kimi chat and multiple use-case suggestions will
                | come up. One of them is the bargain robot. If you
                | download their mobile app, the bargain challenge will
                | probably pop up too!
                | 
                | Depending on how well you bargain with the robot, you can
                | go as low as $0.99 (difficult). Either way, their
                | moderate plan doesn't have to be $20. The agent wants a
                | good reason why it should lower the price for you.
               | 
               | Here's the direct link to Kimmmmy:
               | 
               | https://www.kimi.com/kimiplus/sale
               | 
               | I'll send an invite link too if you don't mind:
               | 
               | https://www.kimi.com/kimiplus/sale?activity_enter_method=
               | h5_...
        
               | mohsen1 wrote:
               | omg this is so funny!
        
             | esafak wrote:
             | https://www.kimi.com/code
        
       | spaceman_2020 wrote:
       | Kimi was already one of the best writing models. Excited to try
       | this one out
        
         | Alifatisk wrote:
          | To me, Kimi has been the best at writing and conversing; it's
          | way more human-like!
        
       | Tepix wrote:
       | Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5
       | 
       | 1T parameters, 32b active parameters.
       | 
       | License: MIT with the following modification:
       | 
       |  _Our only modification part is that, if the Software (or any
       | derivative works thereof) is used for any of your commercial
       | products or services that have more than 100 million monthly
       | active users, or more than 20 million US dollars (or equivalent
       | in other currencies) in monthly revenue, you shall prominently
       | display "Kimi K2.5" on the user interface of such product or
       | service._
        
         | Imustaskforhelp wrote:
          | Hey, have they open-sourced all of Kimi K2.5 (thinking,
          | instruct, agent, agent swarm [beta])?
          | 
          | Because I feel like they mentioned that agent swarm is
          | available via their API, and that made me feel as if it wasn't
          | open (weights). Please let me know if all are open source or
          | not.
        
           | XenophileJKO wrote:
            | I'm assuming the swarm part is all harness. Well, I mean a
            | harness and a way of thinking that the weights have just been
            | fine-tuned to use.
        
             | mccoyb wrote:
             | It's not in the harness today, it's a special RL technique
             | they discuss in https://www.kimi.com/blog/kimi-k2-5.html
             | (see "2. Agent Swarm")
             | 
             | I looked through the harness and all I could find is a
             | `Task` tool.
        
         | dheera wrote:
         | > or more than 20 million US dollars (or equivalent in other
         | currencies) in monthly revenue, you shall prominently display
         | "Kimi K2.5" on the user interface of such product or service.
         | 
         | Why not just say "you shall pay us 1 million dollars"?
        
           | clayhacks wrote:
           | I assume this allows them to sue for different amounts. And
           | not discourage too many people from using it.
        
           | vessenes wrote:
            | ? They prefer the branding. The license just says you have
            | to say it was them if you make > $240mm a year on the model.
        
           | viraptor wrote:
           | Companies with $20M revenue will not normally have spare $1M
           | available. They'd get more money by charging reasonable
           | subscriptions than by using lawyers to chase sudden company-
           | ending fees.
        
             | laurentb wrote:
              | it's monthly :) $240M revenue companies will absolutely
              | find a way to fork over $1M if they need to. Kimi most
              | likely sees the eyeballs of free advertising as more
              | profitable in the grander scheme of things.
        
         | endymi0n wrote:
         | One. Trillion. Even on native int4 that's... half a terabyte of
         | vram?!
         | 
          | Technical awe at this marvel that cracks the 50th percentile
          | of HLE aside, the snarky part of me says there's only half
          | the danger in giving away something nobody can run at home
          | anyway...
        
           | Davidzheng wrote:
           | that's what intelligence takes. Most of intelligence is just
           | compute
        
           | wongarsu wrote:
           | Which conveniently fits on one 8xH100 machine. With 100-200
           | GB left over for overhead, kv-cache, etc.
        
             | storystarling wrote:
             | The unit economics seem pretty rough though. You're locking
             | up 8xH100s for the compute of ~32B active parameters. I
             | guess memory is the bottleneck but hard to see how the
             | margins work on that.
        
           | johndough wrote:
           | The model absolutely can be run at home. There even is a big
           | community around running large models locally:
           | https://www.reddit.com/r/LocalLLaMA/
           | 
           | The cheapest way is to stream it from a fast SSD, but it will
           | be quite slow (one token every few seconds).
           | 
           | The next step up is an old server with lots of RAM and many
           | memory channels with maybe a GPU thrown in for faster prompt
            | processing (low double-digit tokens/second).
           | 
           | At the high end, there are servers with multiple GPUs with
           | lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
           | 
           | The key enabler here is that the models are MoE (Mixture of
           | Experts), which means that only a small(ish) part of the
           | model is required to compute the next token. In this case,
           | there are 32B active parameters, which is about 16GB at 4 bit
           | per parameter. This only leaves the question of how to get
           | those 16GB to the processor as fast as possible.
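            | 
            | Back-of-the-envelope, using the numbers above:
            | 
            |   total  = 1.0e12   # 1T parameters
            |   active = 32e9     # 32B active per token
            |   b      = 0.5      # bytes per 4-bit weight
            |   print(total * b / 1e9)    # ~500 GB resident
            |   print(active * b / 1e9)   # ~16 GB per token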
        
             | 1dom wrote:
             | > The model absolutely can be run at home. There even is a
             | big community around running large models locally
             | 
              | IMO 1tln parameters and 32bln active seems like a different
              | scale to what most are talking about when they say local
              | LLMs. Totally agree there will be people messing with this,
              | but the real value in local LLMs is that you can actually
              | use them and get value from them on standard consumer
              | hardware. I don't think that's really possible with this
              | model.
        
               | zozbot234 wrote:
               | 32B active is nothing special, there's local setups that
               | will easily support that. 1T total parameters ultimately
               | requires keeping the bulk of them on SSD. This need not
               | be an issue if there's enough locality in expert choice
               | for any given workload; the "hot" experts will simply be
               | cached in available spare RAM.
        
               | 1dom wrote:
               | I never said it was special.
               | 
               | I was trying to correct the record that a lot of people
               | will be using models of this size locally because of the
               | local LLM community.
               | 
               | The most commonly downloaded local LLMs are normally <30b
               | (e.g.
               | https://huggingface.co/unsloth/models?sort=downloads).
               | The things you're saying, especially when combined
               | together, make it not usable by a lot of people in the
               | local LLM community at the moment.
        
               | spmurrayzzz wrote:
               | When I've measured this myself, I've never seen a medium-
               | to-long task horizon that would have expert locality such
               | that you wouldn't be hitting the SSD constantly to swap
               | layers (not to say it doesn't exist, just that in the
               | literature and in my own empirics, it doesn't seem to be
               | observed in a way you could rely on it for cache
               | performance).
               | 
                | Over any task that has enough prefill input diversity and
                | a decode phase that's more than a few tokens, it's at
                | least intuitive that experts activate nearly uniformly in
                | the aggregate, since they're activated per token. This is
                | why when you do something more than bs=1, you see forward
                | passes light up the whole network.
        
               | zozbot234 wrote:
               | > hitting the SSD constantly to swap layers
               | 
               | Thing is, people in the local llm community are already
               | doing that to run the largest MoE models, using mmap such
               | that spare-RAM-as-cache is managed automatically by the
               | OS. It's a drag on performance to be sure but still
               | somewhat usable, if you're willing to wait for results.
               | And it unlocks these larger models on what's effectively
               | semi-pro if not true consumer hardware. On the enterprise
               | side, high bandwidth NAND Flash is just around the corner
               | and perfectly suited for storing these large read-only
               | model parameters (no wear and tear issues with the NAND
               | storage) while preserving RAM-like throughput.
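                | 
                | Rough illustration of that pattern (numpy memmap as a
                | stand-in for a real GGUF loader; the file name is a
                | placeholder):
                | 
                |   import numpy as np
                | 
                |   # weights stay on disk; the OS page cache keeps
                |   # recently touched ("hot") experts in spare RAM
                |   w = np.memmap("experts.bin", dtype=np.uint8,
                |                 mode="r")
                |   chunk = np.array(w[:16 * 2**20])  # read -> cached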
        
               | spmurrayzzz wrote:
               | I've tested this myself often (as an aside: I'm in said
               | community, I run 2x RTX Pro 6000 locally, 4x 3090 before
               | that), and I think what you said re: "willing to wait" is
               | probably the difference maker for me.
               | 
               | I can run Minimax 2.1 in 5bpw at 200k context fully
               | offloaded to GPU. The 30-40 tk/s feels like a lifetime
               | for long horizon tasks, especially with subagent
               | delegation etc, but it's still fast enough to be a daily
               | driver.
               | 
               | But that's more or less my cutoff. Whenever I've tested
               | other setups that dip into the single and sub-single
               | digit throughput rates, it becomes maddening and entirely
               | unusable (for me).
        
               | zamadatix wrote:
               | Local LLMs are just LLMs people run locally. It's not a
               | definition of size, feature set, or what's most popular.
               | What the "real" value is for local LLMs will depend on
               | each person you ask. The person who runs small local LLMs
               | will tell you the real value is in small models, the
               | person who runs large local LLMs will tell you it's large
               | ones, those who use cloud will say the value is in shared
               | compute, and those who don't like AI will say there is no
               | value in any.
               | 
                | LLMs whose weights aren't available are an example of
                | what's not a local LLM; a model merely being large is
                | not.
        
               | 1dom wrote:
                | > LLMs whose weights aren't available are an example
                | of what's not a local LLM; a model merely being large
                | is not.
               | 
               | I agree. My point was that most aren't thinking of models
               | this large when they're talking about local LLMs. That's
               | what I said, right? This is supported by the download
               | counts on hf: the most downloaded local models are
               | significantly smaller than 1tln, normally 1 - 12bln.
               | 
               | I'm not sure I understand what point you're trying to
               | make here?
        
               | zamadatix wrote:
               | Mostly a "We know local LLMs as being this, and all of
               | the mentioned variants of this can provide real value
               | regardless of which is most commonly referenced" point.
               | I.e. large local LLMs aren't only something people mess
               | with, they often provide a lot of value for a relative
               | few people rather than a little value for a relative lot
               | of people as small local LLMs do. Who thinks which
               | modality and type brings the most value is largely a
               | matter of opinion of the user getting the value, not just
               | the option which runs on consumer hardware or etc alone.
               | 
               | You're of course accurate that smaller LLMs are more
               | commonly deployed, it's just not the part I was really
               | responding to.
        
               | GeorgeOldfield wrote:
               | do you guys understand that different experts are loaded
               | PER TOKEN?
        
             | dev_l1x_be wrote:
             | How do you split the model between multiple GPUs?
        
               | evilduck wrote:
               | With "only" 32B active params, you don't necessarily need
               | to. We're straying from common home users to serious
               | enthusiasts and professionals but this seems like it
               | would run ok on a workstation with a half terabyte of RAM
               | and a single RTX6000.
               | 
               | But to answer your question directly, tensor parallelism.
                | https://github.com/ggml-org/llama.cpp/discussions/8735
                | https://docs.vllm.ai/en/latest/configuration/conserving_memo...
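                | 
                | For the multi-GPU route, a minimal vLLM-style sketch
                | (whether this exact model is supported out of the box,
                | and what extra flags it needs, is an assumption):
                | 
                |   from vllm import LLM
                | 
                |   # shard the weights across 8 GPUs
                |   llm = LLM(model="moonshotai/Kimi-K2.5",
                |             tensor_parallel_size=8)
                |   print(llm.generate(["hello"]))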
        
             | WhitneyLand wrote:
              | It's often pointed out in the first sentence of a comment
              | how a model can be run at home, then (maybe) towards the
              | end of the comment it's mentioned how it's quantized.
             | 
             | Back when 4k movies needed expensive hardware, no one was
             | saying they could play 4k on a home system, then later
             | mentioning they actually scaled down the resolution to make
             | it possible.
             | 
             | The degree of quality loss is not often characterized.
             | Which makes sense because it's not easy to fully quantify
             | quality loss with a few simple benchmarks.
             | 
             | By the time it's quantized to 4 bits, 2 bits or whatever,
             | does anyone really have an idea of how much they've gained
             | vs just running a model that is sized more appropriately
             | for their hardware, but not lobotomized?
        
               | selfhoster11 wrote:
               | Except the parent comment said you can stream the weights
               | from an SSD. The full weights, uncompressed. It takes a
               | little longer (a lot longer), but the model at least
               | works without lossy pre-processing.
        
               | FuckButtons wrote:
               | From my own usage, the former is almost always better
               | than the latter. Because it's less like a lobotomy and
               | more like a hangover, though I have run some quantized
               | models that seem still drunk.
               | 
               | Any model that I can run in 128 gb in full precision is
               | far inferior to the models that I can just barely get to
               | run after reap + quantization for actually useful work.
               | 
               | I also read a paper a while back about improvements to
               | model performance in contrastive learning when
               | quantization was included during training as a form of
               | perturbation, to try to force the model to reach a
               | smoother loss landscape, it made me wonder if something
               | similar might work for llms, which I think might be what
               | the people over at minimax are doing with m2.1 since they
               | released it in fp8.
               | 
               | In principle, if the model has been effective during its
               | learning at separating and compressing concepts into
               | approximately orthogonal subspaces (and assuming the
               | white box transformer architecture approximates what
               | typical transformers do), quantization should really only
               | impact outliers which are not well characterized during
               | learning.
        
               | WhitneyLand wrote:
               | Interesting.
               | 
               | If this were the case however, why would labs go through
               | the trouble of distilling their smaller models rather
               | than releasing quantized versions of the flagships?
        
               | dabockster wrote:
               | Hanlon's razor.
               | 
               | "Never attribute to malice that which is adequately
               | explained by stupidity."
               | 
               | Yes, I'm calling labs that don't distill smaller sized
               | models stupid for not doing so.
        
               | zozbot234 wrote:
               | > ...Back when 4k movies needed expensive hardware, no
               | one was saying they could play 4k on a home system, then
               | later mentioning they actually scaled down the resolution
               | to make it possible. ...
               | 
               | int4 quantization is the original release in this case;
               | it's not been quantized after the fact. It's a bit of a
               | nuisance when running on hardware that doesn't natively
               | support the format (might waste some fraction of memory
               | throughput on padding, specifically on NPU hw that can't
               | do the unpacking on its own) but no one here is reducing
               | quality to make the model fit.
        
               | WhitneyLand wrote:
               | Good point thanks for the clarification.
               | 
                | The broader point remains, though, which is: "you can run
                | this model at home..." when actually the caveats are
                | potentially substantial.
               | 
               | It would be so incredibly slow...
        
               | Gracana wrote:
               | The level of deceit you're describing is kind of
               | ridiculous. Anybody talking about their specific setup is
               | going to be happy to tell you the model and quant they're
               | running and the speeds they're getting, and if you want
               | to understand the effects of quantization on model
               | quality, it's really easy to spin up a GPU server
               | instance and play around.
        
               | jasonjmcghee wrote:
               | > if you want to understand the effects of quantization
               | on model quality, it's really easy to spin up a GPU
               | server instance and play around
               | 
               | Fwiw, not necessarily. I've noticed quantized models have
               | strange and surprising failure modes where everything
               | seems to be working well and then does a death spiral
               | repeating a specific word or completely failing on one
               | task of a handful of similar tasks.
               | 
               | 8-bit vs 4-bit can be almost imperceptible or night and
               | day.
               | 
               | This isn't something you'd necessarily see playing
               | around, but when trying to do something specific
        
               | codexon wrote:
               | Didn't this paper demonstrate that you only need 1.58
               | bits to be equivalent to 16 bits in performance?
               | 
               | https://arxiv.org/abs/2402.17764
        
               | WhitneyLand wrote:
               | Iirc the paper was solid, but it still hasn't been
               | adopted/proven out at large scale. Harder to adapt
               | hardware and code kernels to something like this compared
               | to int4.
        
               | Ey7NFZ3P0nzAe wrote:
                | This technique showed that there are ways _during
                | training_ to optimize weights to neatly quantize while
                | remaining performant. This isn't a _post training_
                | quantization like int4.
        
             | PlatoIsADisease wrote:
             | >The model absolutely can be run at home.
             | 
             | There is a huge difference between "look I got it to answer
             | the prompt: '1+1='"
             | 
             | and actually using it for anything of value.
             | 
              | I remember early on people bought Macs (or some marketing
              | team was shoveling it) and proposed that people could
              | reasonably run the 70B+ models on them.
              | 
              | They were talking about 'look, it gave an answer', not
              | 'look, this is useful'.
              | 
              | While it was a bit obvious that an 'integrated GPU' is not
              | Nvidia VRAM, we did have one Mac laptop at work that
              | validated this.
              | 
              | It's cool these models are out in the open, but it's going
              | to be a decade before people are running them at a useful
              | level locally.
        
               | esafak wrote:
               | Hear, hear. Even if the model fits, a few tokens per
               | second make no sense. Time is money too.
        
               | tempoponet wrote:
               | Maybe for a coding agent, but a daily/weekly report on
               | sensitive info?
               | 
                | If it were 2016 and this technology existed but only at
                | 1 t/s, every company would find a way to extract the
                | most leverage out of it.
        
               | esafak wrote:
               | But it's 2026 and 'secure' (by executive standards)
               | hosted options exist.
        
               | dabockster wrote:
               | > 'secure' (by executive standards)
               | 
               | "Secure" in the sense that they can sue someone after the
               | fact, instead of preventing data from leaking in the
               | first place.
        
               | michaellee8 wrote:
                | If they had figured out it could be this useful in 2016
                | running at 1 t/s, they would have made it run at least
                | 20 t/s by 2019.
        
               | hex4def6 wrote:
               | If I can start an agent and be able to walk away for 8
               | hours, and be confident it's 'smart' enough to complete a
               | task unattended, that's still useful.
               | 
               | At 3 tk/s, that's still 100-150 pages of a book, give or
               | take.
        
               | esafak wrote:
               | True, that's still faster than a human, but they're not
               | nearly that reliable yet.
        
             | dabockster wrote:
             | You can run AI models on unified/shared memory specifically
             | on Windows, not Linux (unfortunately). It uses the same
             | memory sharing system that Microsoft originally had built
             | for gaming when a game would run out of vram. If you:
             | 
             | - have an i5 or better or equivalent manufactured within
             | the last 5-7 years
             | 
             | - have an nvidia consumer gaming GPU (RTX 3000 series or
             | better) with at least 8 GB vram
             | 
             | - have at least 32 GB system ram (tested with DDR4 on my
             | end)
             | 
             | - build llama-cpp yourself with every compiler optimization
             | flag possible
             | 
             | - pair it with a MoE model compatible with your unified
             | memory amount
             | 
             | - and configure MoE offload to the CPU to reduce memory
             | pressure on the GPU
             | 
              | then you can honestly get to about 85-90% of cloud AI
              | capability totally on-device, depending on what program you
              | use to interface with the model.
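              | 
              | Roughly, with llama-cpp-python (the model path and layer
              | count are placeholders; MoE/CPU offload options vary by
              | build, so check your version):
              | 
              |   from llama_cpp import Llama
              | 
              |   # keep what fits on the 8 GB GPU, rest in system RAM
              |   llm = Llama(model_path="model-q4_k_m.gguf",
              |               n_gpu_layers=12, n_ctx=8192)
              |   print(llm("1+1=", max_tokens=8)["choices"])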
             | 
             | And here's the shocking idea: those system specs can be met
             | by an off the shelf gaming computer from, for example, Best
             | Buy or Costco today and right now. You can literally buy a
             | CyberPower or iBuyPower model, again for example, download
             | the source, run the compilation, and have that level of AI
             | inference available to you.
             | 
             | Now, the reason why it won't work on Linux is that the
             | Linux kernel and Linux distros both leave that unified
             | memory capability up to the GPU driver to implement. Which
             | Nvidia hasn't done yet. You can code it somewhat into
             | source code, but it's still super unstable and flaky from
             | what I've read.
             | 
             | (In fact, that lack of unified memory tech on Linux is
             | probably why everyone feels the need to build all these
             | data centers everywhere.)
        
             | side_up_down wrote:
             | I'd take "running at home" to mean running on reasonably
             | available consumer hardware, which your setup is not. You
             | can obviously build custom, but who's actually going to do
             | that? OP's point is valid
        
           | the_sleaze_ wrote:
            | $3,998.99 for 500GB of RAM on Amazon
           | 
           | "Good Luck" - Kimi <Taken voice>
        
           | mrinterweb wrote:
           | VRAM is the new moat, and controlling pricing and access to
           | VRAM is part of it. There will be very few hobbyists who can
           | run models of this size. I appreciate the spirit of making
           | the weights open, but realistically, it is impractical for
           | >99.999% of users to run locally.
        
           | segmondy wrote:
            | I run Kimi K2 at home, most of it on system RAM with a few
            | layers offloaded to old 3090s. This is a cheap budget build.
           | 
           | Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf Generation -
           | 5,231 tokens 604.63s 8.65 tokens/s
        
             | mapkkk wrote:
             | Could I trouble you for the specifics of your build? I'd
             | love to see if it would be a viable upgrade for me.
             | 
             | I currently have a 3970x with a bunch of 3090s.
        
               | segmondy wrote:
                | 4x 3090s, Epyc MB with 8-channel memory, 7352 CPU, slow
                | 2400 MHz DDR4 RAM.
        
         | redox99 wrote:
         | Cursor devs, who go out of their way to not mention their
         | Composer model is based on GLM, are not going to like that.
        
           | msp26 wrote:
           | Source? I've heard this rumour twice but never seen proof. I
           | assume it would be based on tokeniser quirks?
        
       | lrvick wrote:
       | Actually open source, or yet another public model, which is the
       | equivalent of a binary?
       | 
       | URL is down so cannot tell.
        
         | Tepix wrote:
         | It's open weights, not open source.
        
         | typ wrote:
          | The label 'open source' has become a reputation-reaping and
          | marketing vehicle rather than an informative term since the
          | Hugging Face benchmark race started. With the weights only, we
          | cannot actually audit whether a model is a) contaminated by
          | benchmarks, b) built with deliberate biases, or c) trained on
          | copyrighted/private data, let alone allow other vendors to
          | replicate the results. Anyway, people still love free stuff.
        
           | Der_Einzige wrote:
           | Just accept that IP laws don't matter and the old "free
           | software" paradigm is dead. Aaron Swartz died so that GenAI
           | may live. RMS and his model of "copyleft" are so Web 1.0 (not
           | even 2.0). No one in GenAI cares AT ALL about the true
           | definition of open source. Good.
        
             | duskdozer wrote:
             | Good?
        
       | Reubend wrote:
       | I've read several people say that Kimi K2 has a better "emotional
       | intelligence" than other models. I'll be interested to see
       | whether K2.5 continues or even improves on that.
        
         | storystarling wrote:
          | Yes, though this is highly subjective - it 'feels' like that to
          | me as well (compared to Gemini 3, GPT 5.2, Opus 4.5).
        
         | Alifatisk wrote:
         | Yup, I experience the same. I don't know what they do to
         | achieve this but it gives them this edge, really curious to
         | learn more about what makes it so good at it.
        
           | in-silico wrote:
           | A lot of people point to the Muon optimizer that Moonshot
           | (the creators of Kimi) pioneered. Compared to the standard
           | optimizer AdamW, Muon amplifies low-magnitude gradient
           | directions which makes the model learn faster (and maybe
           | gives Kimi its unique qualities).
           | 
           | Muon paper: https://arxiv.org/abs/2502.16982
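            | 
            | Very roughly, the core move is orthogonalizing the
            | momentum/update matrix so weak directions get the same
            | weight as strong ones. A toy Newton-Schulz sketch (not
            | Moonshot's exact iteration or coefficients):
            | 
            |   import numpy as np
            | 
            |   def orthogonalize(g, steps=5):
            |       # push all singular values of g toward 1
            |       x = g / np.linalg.norm(g)  # sing. values <= 1
            |       for _ in range(steps):
            |           x = 1.5 * x - 0.5 * x @ x.T @ x
            |       return x
            | 
            |   # weight update ~ lr * orthogonalize(momentum)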
        
             | Alifatisk wrote:
             | Wow! Thank you
        
         | mohsen1 wrote:
         | I'll test it out on mafia-arena.com once it is available on
         | Open Router
        
       | pplonski86 wrote:
        | There are so many models; is there any website with a list of
        | all of them and a comparison of performance on different tasks?
        
         | coffeeri wrote:
         | There is https://artificialanalysis.ai
        
           | pplonski86 wrote:
           | Thank you! Exactly what I was looking for
        
           | XCSme wrote:
           | There are many lists, but I find all of them outdated or
           | containing wrong information or missing the actual benchmarks
           | I'm looking for.
           | 
            | I was thinking that maybe it's better to make my own
            | benchmarks with the questions/things I'm interested in, and
            | whenever a new model comes out, run those tests against it
            | using OpenRouter.
        
         | Reubend wrote:
         | The post actually has great benchmark tables inside of it. They
         | might be outdated in a few months, but for now, it gives you a
         | great summary. Seems like Gemini wins on image and video perf,
         | Claude is the best at coding, ChatGPT is the best for general
         | knowledge.
         | 
         | But ultimately, you need to try them yourself on the tasks you
         | care about and just see. My personal experience is that right
         | now, Gemini Pro performs the best at everything I throw at it.
         | I think it's superior to Claude and all of the OSS models by a
         | small margin, even for things like coding.
        
           | Imustaskforhelp wrote:
            | I like Gemini Pro's UI over Claude so much but honestly I
            | might start using Kimi K2.5 if it's open source and just +/-
            | Gemini Pro/ChatGPT/Claude, because at that point I feel like
            | the differences are negligible and we are getting SOTA open
            | source models again.
        
             | wobfan wrote:
              | > honestly I might start using Kimi K2.5 if it's open
              | source and just +/- Gemini Pro/ChatGPT/Claude, because at
              | that point I feel like the differences are negligible and
              | we are getting SOTA open source models again.
             | 
             | Me too!
             | 
             | > I like Gemini Pro's UI over Claude so much
             | 
             | This I don't understand. I mean, I don't see a lot of
             | difference in both UIs. Quite the opposite, apart from some
             | animations, round corners and color gradings, they seem to
             | look very alike, no?
        
               | Imustaskforhelp wrote:
                | Y'know, I ended up buying Kimi's moderato plan, which is
                | $19, but they had this unique idea where you can talk to
                | a bot and it can reduce the price.
                | 
                | I got it to reduce the price of the first month to $1.49
                | (it could go to $0.99 and my frugal mind wanted it haha,
                | but I just couldn't get it to do that lol).
                | 
                | Anyways, afterwards, for privacy purposes (I am a minor
                | so I don't have a card), I ended up going to g2a to get
                | a $10 Visa gift card essentially and used it. (I had to
                | pay $1 extra but sure.)
                | 
                | Installed Kimi Code on my Mac and am trying it out.
                | Honestly, I am kind of liking it.
                | 
                | My internal benchmark is creating pomodoro web apps in
                | Golang... Gemini 3 Pro has nailed it; I just tried the
                | Kimi version and it does have some bugs, but it feels
                | like it added more features.
                | 
                | Gonna have to try it out for a month.
                | 
                | I mean, I just wish it was this cheap for the whole year
                | :< (as I could then move from, say, using the completely
                | free models).
                | 
                | Gonna have to try it out more!
        
       | striking wrote:
       | https://archive.is/P98JR
        
       | zmmmmm wrote:
       | Curious what would be the most minimal reasonable hardware one
       | would need to deploy this locally?
        
         | NitpickLawyer wrote:
         | I parsed "reasonable" as in having reasonable speed to actually
         | use this as intended (in agentic setups). In that case, it's a
         | minimum of 70-100k for hardware (8x 6000 PRO + all the other
         | pieces to make it work). The model comes with native INT4
         | quant, so ~600GB for the weights alone. An 8x 96GB setup would
         | give you ~160GB for kv caching.
         | 
         | You can of course "run" this on cheaper hardware, but the
         | speeds will not be suitable for actual use (i.e. minutes for a
         | simple prompt, tens of minutes for high context sessions per
         | turn).
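          | 
          | The arithmetic behind those numbers, roughly:
          | 
          |   vram   = 8 * 96     # 8x 96 GB cards = 768 GB
          |   w_int4 = 600        # ~600 GB of INT4 weights
          |   print(vram - w_int4)  # ~168 GB left for KV cache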
        
         | simonw wrote:
         | Models of this size can usually be run using MLX on a pair of
         | 512GB Mac Studio M3 Ultras, which are about $10,000 each so
         | $20,000 for the pair.
        
           | PlatoIsADisease wrote:
           | You might want to clarify that this is more of a "Look it
           | technically works"
           | 
           | Not a "I actually use this"
           | 
           | The difference between waiting 20 minutes to answer the
           | prompt '1+1='
           | 
           | and actually using it for something useful is massive here. I
           | wonder where this idea of running AI on CPU comes from. Was
           | it Apple astroturfing? Was it Apple fanboys? I don't see
           | people wasting time on non-Apple CPUs. (Although, I did do
           | this for a 7B model)
        
             | tucnak wrote:
              | The Mac Studio route is not "AI on CPU": the M2/M4 are
              | complex SoCs that include a GPU with unified memory access.
        
               | PlatoIsADisease wrote:
               | If it worked IRL for anything useful, I'd be more
               | interested in the technical differences. But it was a
               | mere toy for a few tests at my fortune 20 company.
               | 
                | Language is full of issues of particulars vs universals,
                | and you could debate whether it's just an integrated GPU
                | with different marketing.
                | 
                | Whatever the case, we couldn't use it in production, and
                | Nvidia's stock price reflects the reality on the ground.
        
               | tucnak wrote:
               | Well, I've been using a fine-tuned variant of Gemma 3
               | model since it came out, and some embedding models, on a
               | laptop. It's not "useless" by any means, in fact it still
               | beats the latest Claude for my use-case in Ukrainian. Not
               | to mention that if you travel by train a lot, you will
               | find it quite useful. I own a Mac studio M2 Max (96 GB)
               | variant at home, and I'm routinely using the larger
               | models for the kind of stuff I don't wish to share with
               | model providers.
               | 
               | My 2 cents
        
             | mholm wrote:
             | The reason Macs get recommended is the unified memory,
             | which is usable as VRAM for the GPU. People are similarly
             | using the AMD Strix Halo for AI which also has a similar
             | memory architecture. Time to first token for something like
             | '1+1=' would be seconds, and then you'd be getting ~20
             | tokens per second, which is absolutely plenty fast for
             | regular use. Token/s slows down at the higher end of
             | context, but it's absolutely still practical for a lot of
             | usecases. Though I agree that agentic coding, especially
             | over large projects, would likely get too slow to be
             | practical.
        
               | zozbot234 wrote:
               | Not too slow if you just let it run overnight/in the
               | background. But the biggest draw would be no rate limits
               | whatsoever compared to the big proprietary APIs,
               | especially Claude's. No risk of sudden rugpulls either,
               | and the model will have very consistent performance.
        
               | PlatoIsADisease wrote:
               | We are getting into a debate between particulars and
               | universals. To call the 'unified memory' VRAM is quite a
               | generalization. Whatever the case, we can tell from stock
                | prices that whatever this VRAM is, it's nothing compared to
               | to NVIDIA.
               | 
               | Anyway, we were trying to run a 70B model on a
               | macbook(can't remember which M model) at a fortune 20
               | company, it never became practical. We were trying to
               | compare strings of character length ~200. It was like
               | 400-ish characters plus a pre-prompt.
               | 
               | I can't imagine this being reasonable on a 1T model, let
               | alone the 400B models of deepseek and LLAMA.
        
               | simonw wrote:
               | Here's a video of a previous 1T K2 model running using
                | MLX on a pair of Mac Studios:
               | https://twitter.com/awnihannun/status/1943723599971443134
               | - performance isn't terrible.
        
               | PlatoIsADisease wrote:
               | Is there a catch? I was not getting anything like this on
               | a 70B model.
               | 
                | EDIT: oh, it's a marketing account and the program never
                | finished... who knows the validity.
        
               | simonw wrote:
               | I don't think Awni should be dismissed as a "marketing
               | account" - they're an engineer at Apple who's been
               | driving the MLX project for a couple of years now,
               | they've earned a lot of respect from me.
        
               | PlatoIsADisease wrote:
                | Given how secretive Apple is, oh my, it's a super duper
                | marketing account.
        
               | mholm wrote:
               | Jeff Geerling and a few others also got access to
               | similarly specced mac clusters. They replicated this
               | performance.
               | 
               | The tooling involved has improved significantly over the
               | past year.
        
               | Gracana wrote:
               | With 32B active parameters, Kimi K2.5 will run faster
               | than your 70B model.
        
             | simonw wrote:
             | MLX uses the GPU.
             | 
             | That said, I wouldn't necessarily recommend spending
             | $20,000 on a pair of Mac Studios to run models like this.
             | The performance won't be nearly as good as the server-class
             | GPU hardware that hosted models run on.
        
         | tosh wrote:
          | I think you can put a bunch of Apple silicon Macs with enough
          | RAM together
          | 
          | e.g. in an office or coworking space
          | 
          | 800-1000 GB of RAM, perhaps?
        
       | rvz wrote:
       | The chefs at Moonshot have cooked once again.
        
       | Jackson__ wrote:
       | As your local vision nut, their claims about "SOTA" vision are
       | absolutely BS in my tests.
       | 
       | Sure it's SOTA at standard vision benchmarks. But on tasks that
       | require proper image understanding, see for example BabyVision[0]
       | it appears very much lacking compared to Gemini 3 Pro.
       | 
       | [0] https://arxiv.org/html/2601.06521v1
        
         | nostrebored wrote:
         | Gemini remains the only usable vision fm :(
        
       | Topfi wrote:
        | K2 0905, and K2 Thinking shortly after that, have done
        | impressively well in my personal use cases and were severely
        | slept on. Faster, more accurate, less expensive, more flexible in
        | terms of hosting, and available months before Gemini 3 Flash; I
        | really struggle to understand why Flash got such positive
        | attention at launch.
       | 
       | Interested in the dedicated Agent and Agent Swarm releases,
       | especially in how that could affect third party hosting of the
       | models.
        
         | msp26 wrote:
         | K2 thinking didn't have vision which was a big drawback for my
         | projects.
        
       | bertili wrote:
        | The "DeepSeek moment" was just one year ago today!
        | 
        | Coincidence or not, let's just marvel for a second at the amount
        | of magic/technology being given away for free... and how
        | liberating and different this is from OpenAI and others that
        | stayed closed to "protect us all".
        
         | motoboi wrote:
          | What amazes me is why someone would spend millions to train
          | this model and give it away for free. What is the business
          | here?
        
           | testfrequency wrote:
           | Curious to hear what "OpenAI" thinks the answer to this is
        
           | YetAnotherNick wrote:
            | Hosting the model gets cheaper per token the more batched
            | tokens you serve. So they have a big advantage here.
        
           | whizzter wrote:
            | The Chinese state maybe sees open collaboration as the way
            | to nullify any US lead in the field; concurrently, if the next
            | "search winner" is built upon their model, it carries the
            | Chinese worldview that Taiwan belongs to China and the
            | Tiananmen Square massacre never happened.
           | 
            | Also, their license says that if you have a big product you
            | need to promote them; remember how Google "gave away" site
            | search widgets, and that was perhaps one of the major ways
            | they gained recognition for being the search leader.
           | 
           | OpenAI/NVidia is the Pets.com/Sun of our generation, insane
           | valuations, stupid spend, expensive options, expensive
           | hardware and so on.
           | 
            | Sun hardware bought for 50k USD to run websites in 2000 is
            | less capable than perhaps a 5 dollar/month VPS today?
           | 
           | "Scaling to AGI/ASI" was always a fools errand, best case
           | OpenAI should've squirreled away money to have a solid
           | engineering department that could focus on algorithmic
            | innovations, but considering that Anthropic, Google and
            | Chinese firms have caught up or surpassed them, it seems they
            | didn't.
           | 
            | Once things blow up, the closed options that had somewhat
            | sane/solid model research that handles things better will be
            | left, plus a ton of new competitors running modern/cheaper
            | hardware and just using models as building blocks.
        
             | zozbot234 wrote:
             | > "Scaling to AGI/ASI" was always a fools errand
             | 
             | Scaling depends on hardware, so cheaper hardware on a
             | compute-per-watt basis only makes scaling easier. There is
             | no clear definition of AGI/ASI but AI has already scaled to
             | be quite useful.
        
             | dev_l1x_be wrote:
             | > Taiwan belongs to China
             | 
             | So they are on the same page as the UN and US?
             | 
             | The One China policy refers to a United States policy of
             | strategic ambiguity regarding Taiwan.[1] In a 1972 joint
             | communique with the PRC, the United States "acknowledges
             | that all Chinese on either side of the Taiwan Strait
             | maintain there is but one China and that Taiwan is a part
             | of China" and "does not challenge that position."
             | 
             | https://en.wikipedia.org/wiki/One_China
             | https://en.wikipedia.org/wiki/Taiwan_and_the_United_Nations
        
               | 9cb14c1ec0 wrote:
               | The One China policy is a fiction of foreign policy
               | statecraft, designed to sideline the issue without having
               | to actually deal with it. It is quite clear that apart
               | from the official fiction there is a real policy that is
                | not One China. This is made clear by the weapons sales to
                | Taiwan that are specifically calibrated to make a Chinese
                | military action harder.
        
               | pqtyw wrote:
               | Existence of an independent and effectively sovereign
               | state on the island of Taiwan (however one calls it) is a
               | fact. Whatever doublespeak governments of other countries
               | or international organizations engage in due to political
               | reasons does not change that.
        
             | two_tasty wrote:
             | I love how Tiananmen square is always brought up as some
             | unique and tragic example of disinformation that could
             | never occur in the west, as though western governments
             | don't do the exact same thing with our worldview. Your
             | veneer of cynicism scarcely hides the structure of naivety
             | behind.
        
               | igneo676 wrote:
               | The difference is that, in the west, there's an
               | acceptable counter narrative. I can tell you that Ruby
               | Ridge and Waco never should've happened and were examples
                | of government overreach and massacre of its own
                | citizens. Or <insert pet issue with the government here>
               | 
               | You can't with Tiananmen square in China
        
               | mannanj wrote:
               | I still see/hear cynicism with a hidden structure of
               | naivety behind.
        
           | ggdG wrote:
           | I think this fits into some "Commoditize The Complement"
           | strategy.
           | 
           | https://gwern.net/complement
        
           | Balinares wrote:
           | Speculating: there are two connected businesses here,
           | creating the models, and serving the models. Outside of a few
           | moneyed outliers, no one is going to run this at home. So at
           | worst opening this model allows mid-sized competitors to
           | serve it to customers from their own infra -- which helps
           | Kimi gain mindshare, particularly against the large
           | incumbents who are definitely _not_ going to be serving Kimi
            | and so don't benefit from its openness.
           | 
           | Given the shallowness of moats in the LLM market, optimizing
           | for mindshare would not be the worst move.
        
           | tokioyoyo wrote:
           | Moonshot's (Kimi's owner) investors are Alibaba/Tencent et
           | al. Chinese market is stupidly competitive, and there's a
           | general attitude of "household name will take it all".
           | However getting there requires having a WeChat-esque user
           | base, through one way or another. If it's paid, there'll be
           | friction and it won't work. Plus, it undermines a lot of
           | other companies, which is a win for a lot of people.
        
           | WarmWash wrote:
           | It's another state project funded at the discretion of the
           | party.
           | 
            | If you look at past state projects, profitability wasn't
            | really considered much. They are notorious for a "money hose
            | until a diamond is found in the mountains of waste" approach.
        
           | deskamess wrote:
           | I think there is a book (Chip War) about how the USSR did not
           | effectively participate in staying at the edge of the
           | semiconductor revolution. And they have suffered for it.
           | 
           | China has decided they are going to participate in the
           | LLM/AGI/etc revolution at any cost. So it is a sunk cost; the
           | models are just an end product, and any revenue is validation
           | and great, but not essential. The cheaper price points keep
           | their models used and relevant. It challenges the other (US,
           | EU) models to innovate and keep ahead to justify their higher
           | valuations (both monthly-plan and investor). Once those
           | advances are made, they can be brought back to their own
           | models. In effect, the currently leading models are
           | running from a second place candidate who never gets tired
           | and eventually does what they do at a lower price point.
        
             | kaibee wrote:
             | In some way, the US won the cold war by spending so much on
             | military that the USSR, in trying to keep up, collapsed. I
             | don't see any parallels between that and China providing
             | infinite free compute to their AI labs, why do you ask?
        
           | culi wrote:
           | All economically transformative technologies have done
           | something similar. If it's privatized, it's not gonna be
           | transformative across the industry. GPS, the internet,
           | touchscreens, AI voice assistants, microchips, LCDs, etc.
           | were all publicly funded (or made by Bell Labs, which had a
           | state-mandated monopoly that forced them to open up their
           | patents).
           | 
           | The economist Mariana Mazzucato wrote a great book about this
           | called _The Entrepreneurial State: Debunking Public vs.
           | Private Sector Myths_
        
           | overfeed wrote:
           | > What amazes me is why would someone spend millions to train
           | this model and give it away for free. What is the business
           | here?
           | 
           | How many millions did Google spend on Android (acquisition
           | and salaries), only to give it away for free?
           | 
           | Usually, companies do this to break into a monopolized market
           | (or one that's at risk of becoming one), with openness as a
           | sweetener. IBM with Linux to break UNIX-on-big-iron
           | domination, Google with Android vs. iPhone, Sun with
           | OpenSolaris vs. Linux-on-x86.
        
         | jimmydoe wrote:
         | It's not a coincidence. Chinese companies tend to do big
         | releases before Chinese New Year. So expect more to come
         | before Feb 17.
        
         | catigula wrote:
         | I mean, there are credible safety issues here. A Kimi fine-tune
          | will absolutely be able to help people do cybersecurity-related
         | attacks - very good ones.
         | 
         | In a few years, or less, biological attacks and other sorts of
         | attacks will be plausible with the help of these agents.
         | 
         | Chinese companies aren't humanitarian endeavors.
        
         | PlatoIsADisease wrote:
         | I am convinced that was mostly just marketing. No one uses
         | deepseek as far as I can tell. People are not running it
         | locally. People choose GPT/Gemini/Claude/Grok if you are giving
         | your data away anyway.
         | 
          | The biggest source of my conspiracy theory is that I made a
          | reddit thread asking a question: "Why all the deepseek hype"
          | or something like that. And to this day, I get odd, 'pro
          | deepseek' comments from accounts only used every few months.
          | It's not like this was some highly upvoted topic that is in
          | the 'Top'.
         | 
         | I'd put that deepseek marketing on-par with an Apple marketing
         | campaign.
        
           | mekpro wrote:
           | Except that, on OpenRouter, DeepSeek always stays in the top
           | 10 ranking. Although I don't use it personally, I believe
           | that their main advantage over other models is
           | price/performance.
        
             | culi wrote:
             | Fifth in market share in fact!
             | 
             | https://openrouter.ai/rankings
             | 
             | There are a lot of applications where you really just want
             | a cheap and efficient model that's still somewhat
             | competitive and that's exactly the niche DeepSeek fulfills
             | the best.
        
           | logicprog wrote:
           | I don't use DeepSeek, but I prefer Kimi and GLM to closed
           | models for most of my work.
        
         | segmondy wrote:
         | There have been so many moments that folks not really heavy
         | into LLMs have missed. DeepSeek R1 was great, but so were all
         | the "incremental" improvements: v3-0324, v3.1, v3.1-terminus,
         | and now v3.2-speciale. With that, this is the 3rd great Kimi
         | model, and GLM has been awesome since 4.5, with 4.5, 4.5-air,
         | 4.6, 4.7 and now 4.7 flash. Minimax-M2 has also been making
         | waves lately... and I'm just talking about the Chinese models,
         | without adding the 10+ Qwen models. Outside of Chinese models,
         | mistral-small/devstral, gemma-27b-it, gpt-oss-120b, seed-os
         | have been great, and I'm still talking about just LLMs, not
         | image, audio or special-domain models like deepseek-prover and
         | deepseek-math. It's really a marvel what we have at home. I
         | cancelled my OpenAI and Anthropic subscriptions 2 years ago
         | once they started calling for regulation of open models and I
         | haven't missed them one bit.
        
       | pu_pe wrote:
       | I don't get this "agent swarm" concept. You set up a task and
       | they boot up 100 LLMs to try to do it in parallel, and then one
       | "LLM judge" puts it all together? Is there anywhere I can read
       | more about it?
        
         | jonkoops wrote:
         | The datacenters yearn for the chips.
        
         | rvnx wrote:
         | You have a team lead that establishes a list of tasks that are
         | needed to achieve your mission.
         | 
         | Then it creates a list of employees, each of them specialized
         | for a task, and they work in parallel.
         | 
         | Essentially hiring a team of people who get specialized on one
         | problem.
         | 
         | Do one thing and do it well.
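         | 
         | A minimal sketch of that pattern in Python (call_llm here is a
         | hypothetical helper standing in for any chat-completions
         | client; none of these names come from Kimi's actual API):
         | 
         |     import asyncio
         | 
         |     async def call_llm(system: str, user: str) -> str:
         |         # Stand-in for a real chat call; swap in your
         |         # provider's client.
         |         await asyncio.sleep(0)  # no real call here
         |         return f"[response to: {user[:40]}]"
         | 
         |     async def run_swarm(mission: str) -> str:
         |         # 1. "Team lead": split the mission into subtasks.
         |         plan = await call_llm(
         |             "You are a planner. List independent subtasks,"
         |             " one per line.", mission)
         |         subtasks = [t for t in plan.splitlines() if t.strip()]
         |         # 2. "Employees": one specialist per subtask,
         |         #    run in parallel.
         |         results = await asyncio.gather(
         |             *(call_llm(f"You are a specialist for: {t}", t)
         |               for t in subtasks))
         |         # 3. The team lead merges the partial results.
         |         merged = await call_llm(
         |             "Merge these partial results.",
         |             "\n\n".join(results))
         |         return merged
         | 
         |     print(asyncio.run(run_swarm("Build a small CLI tool")))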
        
           | XCSme wrote:
           | But in the end, isn't this the same idea as MoE?
           | 
           | Where we have more specialized "jobs", which the model is
           | actually trained for.
           | 
           | I think the main difference with agent swarms is the ability
           | to run them in parallel. I don't see how this adds much
           | compared to simply sending multiple API calls in parallel
           | with your desired tasks. I guess the only difference is that
           | you let the AI decide how to split those requests and what
           | each task should be.
        
             | zozbot234 wrote:
             | Nope. MoE is strictly about model parameter sparsity.
             | Agents are about running multiple small-scale tasks in
             | parallel and aggregating the results for further processing
             | - it saves a lot of context length compared to having it
             | all in a single session, and context length has quadratic
             | compute overhead so this matters. You can have both.
             | 
             | One positive side effect of this is that if subagent tasks
             | can be dispatched to cheaper and more efficient edge-
             | inference hardware that can be deployed at scale (think
             | Nvidia Jetsons or even Apple Macs or AMD APUs), even though
             | such hardware is highly limited in what can fit on a single
             | node, then complex coding tasks ultimately become a lot
             | _cheaper_ per token than generic chat.
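             | 
             | A rough back-of-envelope on the quadratic point (numbers
             | below are illustrative, not measured):
             | 
             |     # Self-attention cost grows roughly with n^2, so
             |     # splitting one long session into k shorter ones
             |     # cuts that term by about a factor of k.
             |     def attn_cost(tokens: int) -> int:
             |         return tokens ** 2  # ignore constant factors
             | 
             |     single = attn_cost(200_000)      # one big session
             |     swarm = 10 * attn_cost(20_000)   # 10 subagents
             |     print(single / swarm)            # -> 10.0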
        
               | XCSme wrote:
               | Yes, I know you can have both.
               | 
               | My point was that this is just a different way of
               | creating specialised task solvers, the same as with MoE.
               | 
               | And, as you said, with MoE it's about the model itself,
               | and it's done at training level so that's not something
               | we can easily do ourselves.
               | 
               | But with agent swarm, isn't it simply splitting a task in
               | multiple sub-tasks and sending each one in a different
               | API call? So this can be done with any of the previous
               | models too, only that the user has to manually define
               | those tasks/contexts for each query.
               | 
               | Or is this at a much more granular level than this, which
               | would not be feasible to be done by hand?
               | 
               | I was already doing this in n8n, creating different
               | agents with different system prompts for different tasks.
               | I am not sure if automating this (with a swarm) would work
               | well in most of my cases; I don't see how this fully
               | complements Tools or Skills.
        
               | zozbot234 wrote:
               | MoE has nothing whatsoever to do with specialized task
               | solvers. It always operates per token within a single
               | task, you can think of it perhaps as a kind of learned
               | "attention" for model parameters as opposed to context
               | data.
        
               | XCSme wrote:
               | Yes, specific weights/parameters have been trained to solve
               | specific tasks (trained on different data).
               | 
               | Or did I misunderstand the concept of MoE, and it's not
               | about having specific parts of the model (parameters) do
               | better on specific input contexts?
        
         | vessenes wrote:
         | You can read about this basically everywhere - the term of art
         | is agent orchestration. Gas town, Claude's secret swarm mode,
         | or people who like to use phrases like "Wiggum loop" will get
         | you there.
         | 
         | If you're really lazy - the quick summary is that you can
         | benefit from the sweet spot of context length and reduce
         | instruction overload while getting some parallelism benefits
         | from farming tasks out to LLMs with different instructions. The
         | way this is generally implemented today is through tool
         | calling, although Claude also has a skills interface it has
         | been trained against.
         | 
         | So the idea would be for software development, why not have a
         | project/product manager spin out tasks to a bunch of agents
         | that are primed to be good at different things? E.g. an
         | architect, a designer, and so on. Then you just need something
         | that can rectify GitHub PRs and Bob's your uncle.
         | 
         | Gas town takes a different approach and parallelizes on coding
         | tasks of any sort at the base layer, and uses the orchestration
         | infrastructure to keep those coders working constantly,
         | optimizing for minimal human input.
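         | 
         | For a concrete picture, one common way to expose "spawn a
         | subagent" to an orchestrator model is as an ordinary tool
         | definition (OpenAI-compatible schema shown; the tool name and
         | fields below are illustrative, not Kimi's or Anthropic's
         | actual spec):
         | 
         |     SPAWN_SUBAGENT_TOOL = {
         |         "type": "function",
         |         "function": {
         |             "name": "spawn_subagent",
         |             "description": "Run one focused subtask and "
         |                            "return its final report.",
         |             "parameters": {
         |                 "type": "object",
         |                 "properties": {
         |                     "role": {"type": "string"},
         |                     "instructions": {"type": "string"},
         |                     "allowed_tools": {
         |                         "type": "array",
         |                         "items": {"type": "string"},
         |                     },
         |                 },
         |                 "required": ["role", "instructions"],
         |             },
         |         },
         |     }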
        
           | IanCal wrote:
            | I'm not sure whether there are parts of this done for Claude
           | but those other ones are layers on top of the usual LLMs we
           | see. This seems to be a bit different, in that there's a
           | different model trained specifically for splitting up and
           | managing the workload.
        
         | Rebuff5007 wrote:
         | I've also been quite skeptical, and I became even _more_
         | skeptical after hearing a tech talk from a startup in this
         | space [1].
         | 
          | I think the best way to think about it is that it's an
         | engineering hack to deal with a shortcoming of LLMs: for
         | complex queries LLMs are unable to directly compute a SOLUTION
         | given a PROMPT, but are instead able to break down the prompt
          | into intermediate solutions and eventually solve the original
         | prompt. These "orchestrator" / "swarm" agents add some
         | formalism to this and allow you to distribute compute, and then
         | also use specialized models for some of the sub problems.
         | 
         | [1] https://www.deepflow.com/
        
       | vinhnx wrote:
       | One thing that caught my eye is that besides the K2.5 model,
       | Moonshot AI also launched Kimi Code (https://www.kimi.com/code),
       | evolved from Kimi CLI. It is a terminal coding agent; I've been
       | using it for the last month with a Kimi subscription, and it's a
       | capable agent with a stable harness.
       | 
       | GitHub: https://github.com/MoonshotAI/kimi-cli
        
         | Imanari wrote:
         | How does it fare against CC?
        
         | forgotpwd16 wrote:
         | >Kimi Code CLI is not only a coding agent, but also a shell.
         | 
         | That's cool. It also has a zsh hook, allowing you to switch to
          | agent mode wherever you are.
        
           | vinhnx wrote:
           | It is. Kimi Code CLI supports Zed's Agent Client Protocol
           | (http://agentclientprotocol.com/), so it can act as an
           | external agent that can run in any ACP-compatible client,
           | e.g. Zed, JetBrains, Toad CLI, Minano Notebook. Also, it
           | supports Agent Skills. The Moonshot AI developers actively
           | update the agent and are very active. I really like their CLI.
        
         | esafak wrote:
         | Does it support the swarm feature? Does Opencode?
        
       | monkeydust wrote:
       | Is this actually good or just optimized heavily for benchmarks? I
       | am hopeful it's the former based on the writeup but need to put
       | it through its paces.
        
         | kurtis_reed wrote:
         | Quite good in my testing
        
       | Barathkanna wrote:
       | A realistic setup for this would be a 16x H100 80GB with NVLink.
       | That comfortably handles the ~32B active parameters plus KV cache
       | without extreme quantization. Cost-wise we are looking at roughly
       | $500k-$700k upfront or $40-60/hr on-demand, which makes it clear
       | this model is aimed at serious infra teams, not casual single-GPU
       | deployments. I'm curious how API providers will price tokens on
       | top of that hardware reality.
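       | 
       | A back-of-envelope check on those numbers (the parameter count
       | and overhead budget are assumptions, not published figures):
       | 
       |     # ~1T total params at int4 (~0.5 bytes/param) plus a rough
       |     # allowance for KV cache and activations.
       |     weights_gb = 1.0e12 * 0.5 / 1e9    # ~500 GB
       |     total_gb = weights_gb + 200        # ~700 GB with overhead
       | 
       |     for name, gpus, vram in [("8x H100", 8, 80),
       |                              ("16x H100", 16, 80),
       |                              ("8x H200", 8, 141)]:
       |         have = gpus * vram
       |         fits = "fits" if have >= total_gb else "tight"
       |         print(f"{name}: {have} GB vs ~{total_gb:.0f} GB"
       |               f" -> {fits}")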
        
         | bertili wrote:
         | The other realistic setup is $20k, for a small company that
         | needs a private AI for coding or other internal agentic use
          | with two Mac Studios connected over Thunderbolt 5 RDMA.
        
           | zozbot234 wrote:
           | That's great for affordable local use but it'll be slow: even
           | with the proper multi-node inference setup, the thunderbolt
           | link will be a comparative bottleneck.
        
           | embedding-shape wrote:
           | I'd love to see the prompt processing speed difference
           | between 16x H100 and 2x Mac Studio.
        
             | zozbot234 wrote:
             | Prompt processing/prefill can even get some speedup from
             | local NPU use most likely: when you're ultimately limited
             | by thermal/power limit throttling, having more efficient
             | compute available means more headroom.
        
             | Barathkanna wrote:
             | I asked GPT for a rough estimate to benchmark prompt
             | prefill on an 8,192-token input:
             | 
             |   * 16x H100: 8,192 / (20k to 80k tokens/sec)
             |     ≈ 0.10 to 0.41s
             |   * 2x Mac Studio (M3 Max): 8,192 / (150 to 700
             |     tokens/sec) ≈ 12 to 55s
             | 
             | These are order-of-magnitude numbers, but the takeaway is
             | that multi H100 boxes are plausibly ~100x faster than
             | workstation Macs for this class of model, especially for
             | long-context prefill.
        
               | ffsm8 wrote:
               | You do realize that's entirely made up, right?
               | 
               | Could be true, could be fake - the only thing we can be
               | sure of is that it's made up with no basis in reality.
               | 
               | This is not how you use LLMs effectively; that's how you
               | give everyone that's using them a bad name by
               | association.
        
           | Barathkanna wrote:
           | That won't realistically work for this model. Even with only
           | ~32B active params, a 1T-scale MoE still needs the full
           | expert set available for fast routing, which means hundreds
           | of GB to TBs of weights resident. Mac Studios don't share
           | unified memory across machines, Thunderbolt isn't remotely
           | comparable to NVLink for expert exchange, and bandwidth
           | becomes the bottleneck immediately. You could maybe load
           | fragments experimentally, but inference would be
           | impractically slow and brittle. It's a very different class
           | of workload than private coding models.
        
             | zozbot234 wrote:
             | If "fast" routing is per-token, the experts can just reside
             | on SSDs. The performance is good enough these days. You
             | don't need to globally share unified memory across the
             | nodes, you'd just run distributed inference.
             | 
             | Anyway, in the future your local model setups will just be
             | downloading experts on the fly from experts-exchange. That
             | site will become as important to AI as downloadmoreram.com.
        
             | bertili wrote:
             | People are running the previous Kimi K2 on 2 Mac Studios at
             | 21 tokens/s or 4 Macs at 30 tokens/s. It's still premature,
             | but not a completely crazy proposition for the near future,
             | given the rate of progress.
        
               | NitpickLawyer wrote:
               | > 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s
               | 
               | Keep in mind that most people posting speed benchmarks
               | try them with basically 0 context. Those speeds will not
               | hold at 32/64/128k context length.
        
             | YetAnotherNick wrote:
             | Depends on whether you are using tensor parallelism or
             | pipeline parallelism; in the second case you don't need any
             | sharing.
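             | 
             | A toy sketch of the pipeline-parallel case (sizes and layer
             | counts are made up, purely illustrative):
             | 
             |     import numpy as np
             | 
             |     HIDDEN = 1024  # assumed hidden size
             |     # Each node holds only its own layers; no weights are
             |     # shared or exchanged between nodes.
             |     node_a = [np.random.randn(HIDDEN, HIDDEN) * 0.01
             |               for _ in range(4)]
             |     node_b = [np.random.randn(HIDDEN, HIDDEN) * 0.01
             |               for _ in range(4)]
             | 
             |     def run_stage(layers, x):
             |         for w in layers:
             |             x = np.tanh(x @ w)  # stand-in for a block
             |         return x
             | 
             |     x = np.random.randn(1, HIDDEN)
             |     x = run_stage(node_a, x)  # runs on machine A
             |     # Only this small activation crosses the Thunderbolt
             |     # link, not the (much larger) expert weights.
             |     x = run_stage(node_b, x)  # runs on machine B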
        
             | omneity wrote:
             | RDMA over Thunderbolt is a thing now.
        
         | reissbaker wrote:
         | Generally speaking, 8xH200s will be a lot cheaper than
         | 16xH100s, and faster too. But both should technically work.
        
           | pama wrote:
           | You can do it, and it may be OK for a single user with idle
           | waiting times, but performance/throughput will be roughly
           | halved (closer to 2/3) and free context will be more limited
           | with 8xH200 vs 16xH100 (assuming a decent interconnect).
           | Depending a bit on use case and workload, 16xH100 (or
           | 16xB200) may be a better config for cost optimization. Often
           | there is a huge economy of scale with such large mixture-of-
           | experts models, so that it can even be cheaper to use 96 GPUs
           | instead of just 8 or 16. The reasons are complicated and
           | involve better prefill cache and less memory transfer per
           | node.
        
         | wongarsu wrote:
         | The weights are int4, so you'd only need 8xH100
        
         | a2128 wrote:
         | You don't need to wait and see, Kimi K2 has the same hardware
         | requirements and has several providers on OpenRouter:
         | 
         | https://openrouter.ai/moonshotai/kimi-k2-thinking
         | https://openrouter.ai/moonshotai/kimi-k2-0905
         | https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
         | https://openrouter.ai/moonshotai/kimi-k2
         | 
         | Generally it seems to be in the neighborhood of $0.50/1M for
         | input and $2.50/1M for output
        
       | hmate9 wrote:
       | About 600GB is needed for weights alone, so on AWS you need a
       | p5.48xlarge (8x H100), which costs $55/hour.
        
       | Alifatisk wrote:
       | Have you all noticed that the latest releases (Qwen3 max thinking,
       | now Kimi k2.5) from Chinese companies are benching against Claude
       | opus now and not Sonnet? They are truly catching up, almost at
       | the same pace?
        
         | zozbot234 wrote:
         | The benching is sus, it's way more important to look at real
         | usage scenarios.
        
         | conception wrote:
         | https://clocks.brianmoore.com
         | 
         | K2 is one of the only models to nail the clock face test as
         | well. It's a great model.
        
           | DJBunnies wrote:
           | Cool comparison, but none of them get both the face and the
           | time correct when I look at it.
        
             | conception wrote:
             | Refresh. It's not every time but k2 hits a perfect clock
             | for me about 7/10 or so.
        
           | culi wrote:
           | Kimi 2 is remarkably consistently the best. I wonder if it's
           | somehow been trained specifically on tasks like these. It
           | seems too consistent to be coincidence
           | 
           | Also shocking is how the most common runner up I've seen is
           | DeepSeek
        
         | WarmWash wrote:
         | They distill the major western models, so anytime a new SOTA
         | model drops, you can expect the Chinese labs to update their
         | models within a few months.
        
           | zozbot234 wrote:
           | This is just a conspiracy theory/urban legend. How do you
           | "distill" a proprietary model with no access to the original
           | weights? Just doing the equivalent of training on chat/API
           | logs has terrible effectiveness (you're trying to drink from
           | a giant firehose through a tiny straw) and gives you no
           | underlying improvements.
        
           | Alifatisk wrote:
           | Yes, they do distill. But just saying all they do is distill
           | is not correct and actually kind of unfair. These Chinese
           | labs have done lots of research in this field and publish it
           | to the public, and some, if not the majority, contribute
           | open-weight models, making a future of local LLMs possible!
           | Deepseek, Moonshot, Minimax, Z.ai, Alibaba (Qwen).
           | 
           | They are not just leeching here; they took this innovation,
           | refined it and improved it further. This is what the Chinese
           | labs are good at.
        
           | Balinares wrote:
           | Source?
        
         | esafak wrote:
         | They are, in benchmarks. In practice Anthropic's models are
         | ahead of where their benchmarks suggest.
        
           | HNisCIS wrote:
           | Bear in mind that lead may be, in large part, from the
           | tooling rather than the model
        
       | simonw wrote:
       | Pretty cute pelican https://tools.simonwillison.net/svg-
       | render#%3Csvg%20viewBox%...
        
         | mythz wrote:
         | doesn't work, looks like the link or SVG was cropped.
        
         | bavell wrote:
         | No pelican for me :(
        
         | simonw wrote:
         | Oops, here's a working link:
         | https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0...
        
       | throwaw12 wrote:
       | Congratulations, great work Kimi team.
       | 
       | Why is it that Claude is still at the top in coding? Are they
       | heavily focused on training for coding, or is their general
       | training so good that it performs well in coding?
       | 
       | Someone please beat the Opus 4.5 in coding, I want to replace it.
        
         | MattRix wrote:
         | Opus 4.5 only came out two months ago, and yes Anthropic spends
         | a lot of effort making it particularly good at coding.
        
         | Balinares wrote:
         | I replaced Opus with Gemini Pro and it's just plain a better
         | coder IMO. It'll restructure code to enable support for new
         | requirements where Opus seems to just pile on more indirection
         | layers by default, when it doesn't outright hardcode special
         | cases inside existing functions, or drop the cases it's failing
         | to support from the requirements while smugly informing you you
         | don't need that anyway.
        
         | pokot0 wrote:
         | I don't think that kind of difference in benchmarks has any
         | meaning at all. Your agentic coding tool and the task you are
         | working on introduce a lot more "noise" than that small delta.
         | 
         | Also consider they are all overfitting on the benchmark itself
         | so there might be that as well (which can go in either
          | direction).
         | 
         | I consider the top models practically identical for coding
         | applications (just personal experience with heavy use of both
         | GPT5.2 and Opus 4.5).
         | 
         | Excited to see how this model compares in real applications.
         | It's 1/5th of the price of top models!!
        
         | symisc_devel wrote:
         | Gemini 3 pro is way better than Opus especially for large
         | codebases.
        
           | redox99 wrote:
           | My experience is the total opposite.
        
           | rubslopes wrote:
           | Do you use it only for code editing, or also for running bash
           | commands? My experience is that it is very bad at the latter.
        
       | jdeng wrote:
       | Glad to see open-source models catching up and treating vision
       | as a first-class citizen (a.k.a. a native multimodal agentic
       | model). GLM and Qwen models take a different approach, having a
       | base model and a vision variant (glm-4.6 vs glm-4.6v).
       | 
       | I guess after Kimi K2.5, other vendors will go the same route?
       | 
       | Can't wait to see how this model performs on computer automation
       | use cases like VITA AI Coworker.
       | 
       | https://www.vita-ai.net/
        
       | teiferer wrote:
       | Can we please stop calling those models "open source"? Yes the
       | weights are open. So, "open weight" maybe. But the source isn't
       | open, the thing that allows you to re-create it. That's what "open
       | source" used to mean. (Together with a license that allows you to
       | use that source for various things.)
        
         | Onavo wrote:
         | No major AI lab will admit to training on proprietary or
         | copyrighted data so what you are asking is an impossibility.
         | You can make a pretty good LLM if you train on Anna's Archive
         | but it will either be released anonymously, or with a research
         | only non commercial license.
         | 
         | There aren't enough public domain data to create good LLMs,
         | especially once you get into the newer benchmarks that expect
         | PhD level of domain expertise in various niche verticals.
         | 
         | It's also a logical impossibility to create a zero knowledge
         | proof that will allow you to attribute to specific training
         | data without admitting to usage.
         | 
         | I can think of a few technical options but none would hold
         | water legally.
         | 
          | You can use a Sigma-protocol OR-composition to prove that it was
         | trained either on a copyrighted dataset or a non copyrighted
         | dataset without admitting to which one (technically
         | interesting, legally unsound).
         | 
          | You can prove that a model trained on copyrighted data is
          | statistically indistinguishable from one trained on non-
          | copyrighted data (an information-theoretic impossibility
          | unless there exists as much public domain data as copyrighted
          | data, in similar distributions).
         | 
          | You can prove a public domain and a copyrighted dataset are
          | equivalent if the model performance they produce is
          | indistinguishable.
         | 
         | All the proofs fail irl, ignoring the legal implications,
         | because there's less public domain information, so given the
         | lemma that more training data == improved model performance,
         | all the above are close to impossible.
        
       | dev_l1x_be wrote:
       | I've had these weird situations where some models refuse to use
       | SSH as a tool. Not sure if it was a coding-tool limitation or if
       | it is baked into some of the models.
        
       | erichocean wrote:
       | Running on Apple Silicon:
       | https://x.com/awnihannun/status/2016221496084205965
        
       | stopachka wrote:
       | Is there a startup that takes models like this, and effectively
       | gives you a secure setup, where you have (a) a mobile app that
       | (b) talks to some giant machine that only you have access to?
       | 
       | If a $10K computer could run this, it may be worth it to have a
       | "fully on prem" version of ChatGPT running for you.
        
       | 2001zhaozhao wrote:
       | The directionally interesting part is that according to the
       | announcement, K2.5 seems to be trained specifically to create
       | sub-agents and work in an agent swarm usefully. The key part is
       | that you don't need to manually create or prompt sub-agents, K2.5
       | creates them automatically, so from the looks of things it's
       | similar to Claude Code dynamic sub-agents except the model is
       | trained to scale to many more agents autonomously.
       | 
       | I wonder whether Claude is doing the same kind of training and
       | it's coming with the next model, and that's why the agent swarm
       | mode in Claude Code is hidden for now. We might be getting very
       | very good agent orchestrators/swarms very soon.
        
       | culi wrote:
       | I posted this elsewhere but thought I'd repost here:
       | 
       | * https://lmarena.ai/leaderboard -- crowd-sourced head-to-head
       | battles between models using ELO
       | 
       | * https://dashboard.safe.ai/ -- CAIS' incredible dashboard
       | 
       | * https://clocks.brianmoore.com/ -- a visual comparison of how
       | well models can draw a clock. A new clock is drawn every minute
       | 
       | * https://eqbench.com/ -- emotional intelligence benchmarks for
       | LLMs
       | 
       | * https://www.ocrarena.ai/battle -- OCR battles, ELO
       | 
       | * https://mafia-arena.com/ -- LLMs playing the social deduction
       | game Mafia
       | 
       | * https://openrouter.ai/rankings -- marketshare based on
       | OpenRouter
        
       | enricoros wrote:
       | CCP-bench has gotten WAY better on K2.5!
       | 
       | https://big-agi.com/static/kimi-k2.5-less-censored.jpg
        
       ___________________________________________________________________
       (page generated 2026-01-28 07:01 UTC)