[HN Gopher] Kimi Released Kimi K2.5, Open-Source Visual SOTA-Age...
___________________________________________________________________
Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model
Author : nekofneko
Score : 473 points
Date : 2026-01-27 05:42 UTC (1 day ago)
(HTM) web link (www.kimi.com)
(TXT) w3m dump (www.kimi.com)
| billyellow wrote:
| Cool
| mangolie wrote:
| they cooked
| jumploops wrote:
| > For complex tasks, Kimi K2.5 can self-direct an agent swarm
| with up to 100 sub-agents, executing parallel workflows across up
| to 1,500 tool calls.
|
| > K2.5 Agent Swarm improves performance on complex tasks through
| parallel, specialized execution [..] leads to an 80% reduction in
| end-to-end runtime
|
| Not just RL on tool calling, but RL on agent orchestration, neat!
| mohsen1 wrote:
| Parallel agents are such a simple, yet powerful hack. Using it
| in Claude Code with TeammateTool and getting lots of good
| results!
| esperent wrote:
| > TeammateTool
|
| What is this?
| jlu wrote:
| Claude Code hidden feature, currently under a feature flag:
|
| https://github.com/mikekelly/claude-sneakpeek
| frimmy wrote:
| https://x.com/kieranklaassen/status/2014830266515382693 -
| agent swarms tool shipping w/ cc soon..
| XCSme wrote:
| > Kimi K2.5 can self-direct an agent swarm
|
| Is this within the model? Or within the IDE/service that runs
| the model?
|
| Because tool calling is mostly just the agent outputting "call
| tool X", and the IDE does it and returns the data back to the
| AI's context.
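A minimal sketch of the loop described here, assuming a generic chat-completions client; chat(), the message format, and the read_file tool are hypothetical stand-ins, not any particular vendor's API:

    import json

    # toy tool registry; real harnesses expose file, shell, and search tools
    TOOLS = {"read_file": lambda path: open(path).read()}

    def run_agent(chat, messages):
        while True:
            reply = chat(messages)            # model emits text or a tool call
            messages.append(reply)
            if "tool_call" not in reply:
                return reply["content"]       # final answer, loop ends
            call = reply["tool_call"]
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            # the harness runs the tool and feeds the result back into context
            messages.append({"role": "tool", "content": str(result)})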
| mzl wrote:
| An LLM only outputs tokens, so this could be seen as an
| extension of tool calling, where the model has been trained on
| the knowledge and use cases for "tool-calling" itself as a
| sub-agent.
| XCSme wrote:
| Ok, so agent swarm = tool calling where the tool is a LLM
| call and the argument is the prompt
| dcre wrote:
| Sort of. It's not necessarily a single call. In the
| general case it would be spinning up a long-running agent
| with various kinds of configuration -- prompts, but also
| coding environment and which tools are available to it --
| like subagents in Claude Code.
| IanCal wrote:
| Yes largely, although they've trained a model
| specifically for this task rather than using the base
| model and a bit of prompting.
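Building on the hypothetical loop sketched above, a sub-agent is just a tool whose handler starts a fresh agent loop with its own system prompt and tool set; the names here are illustrative, not Kimi's or Claude Code's actual interface:

    def make_task_tool(chat):
        # a "task" tool: the orchestrator model calls it with a prompt, and
        # the handler runs a whole nested agent loop to produce the answer
        def task(prompt: str) -> str:
            messages = [{"role": "system", "content": "You are a focused sub-agent."},
                        {"role": "user", "content": prompt}]
            return run_agent(chat, messages)   # run_agent from the sketch above
        return task

    # TOOLS["task"] = make_task_tool(chat)     # registered like any other tool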
| storystarling wrote:
| 1,500 tool calls per task sounds like a nightmare for unit
| economics though. I've been optimizing my own agent workflows
| and even a few dozen steps makes it hard to keep margins
| positive, so I'm not sure how this is viable for anyone not
| burning VC cash.
| zozbot234 wrote:
| "tool call" is just a reference to any elementary interaction
| with the outside system. It's not calling third-party APIs or
| anything like that.
| storystarling wrote:
| True, but that's still 1,500 inference cycles. Even without
| external API fees, the latency and compute burden seems
| huge. I don't see how the economics work there without
| significant subsidies.
| darrinm wrote:
| FWIW many tool calls can be and often are made in one
| inference cycle.
| DeathArrow wrote:
| Those are some impressive benchmark results. I wonder how well it
| does in real life.
|
| Maybe we can get away with something cheaper than Claude for
| coding.
| oneneptune wrote:
| I'm curious about the "cheaper" claim -- I checked Kimi
| pricing, and it's a $200/mo subscription too?
| NitpickLawyer wrote:
| On OpenRouter, K2.5 is at $0.60/$3 per Mtok. That's Haiku
| pricing.
| storystarling wrote:
| The unit economics seem tough at that price for a 1T
| parameter model. Even with MoE sparsity you are still VRAM
| bound just keeping the weights resident, which is a much
| higher baseline cost than serving a smaller model like
| Haiku.
| mrklol wrote:
| They also have a $20 and $40 tier.
| Alifatisk wrote:
| If you bargain with their bot Kimmmmy (not joking), you can
| even get lower pricing.
| mohsen1 wrote:
| tell me more...
| Alifatisk wrote:
| Go to Kimi chat and multiple use-case suggestions will come
| up. One of them will be the bargain robot. If you download
| their mobile app, the bargain challenge will probably pop up
| too!
|
| Depending on how well you bargain with the robot, you can
| go as low as $0.99 (difficult). Either way, their moderate
| plan doesn't have to be $20. The agent wants a good reason
| why it should lower the price for you.
|
| Here's the direct link to Kimmmmy:
|
| https://www.kimi.com/kimiplus/sale
|
| I'll send an invite link too if you don't mind:
|
| https://www.kimi.com/kimiplus/sale?activity_enter_method=h5_...
| mohsen1 wrote:
| omg this is so funny!
| esafak wrote:
| https://www.kimi.com/code
| spaceman_2020 wrote:
| Kimi was already one of the best writing models. Excited to try
| this one out
| Alifatisk wrote:
| To me, Kimi has been the best at writing and conversing; it's
| way more human-like!
| Tepix wrote:
| Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5
|
| 1T parameters, 32b active parameters.
|
| License: MIT with the following modification:
|
| _Our only modification part is that, if the Software (or any
| derivative works thereof) is used for any of your commercial
| products or services that have more than 100 million monthly
| active users, or more than 20 million US dollars (or equivalent
| in other currencies) in monthly revenue, you shall prominently
| display "Kimi K2.5" on the user interface of such product or
| service._
| Imustaskforhelp wrote:
| Hey, have they open sourced all of Kimi K2.5
| (thinking, instruct, agent, agent swarm [beta])?
|
| Because I feel like they mentioned that agent swarm is
| available on their API, and that made me feel as if it wasn't
| open (weights). Please let me know if all are open source or
| not.
| XenophileJKO wrote:
| I'm assuming the swarm part is all harness. Well I mean a
| harness and way of thinking that the weights have just been
| fine tuned to use.
| mccoyb wrote:
| It's not in the harness today, it's a special RL technique
| they discuss in https://www.kimi.com/blog/kimi-k2-5.html
| (see "2. Agent Swarm")
|
| I looked through the harness and all I could find is a
| `Task` tool.
| dheera wrote:
| > or more than 20 million US dollars (or equivalent in other
| currencies) in monthly revenue, you shall prominently display
| "Kimi K2.5" on the user interface of such product or service.
|
| Why not just say "you shall pay us 1 million dollars"?
| clayhacks wrote:
| I assume this allows them to sue for different amounts. And
| not discourage too many people from using it.
| vessenes wrote:
| ? They prefer the branding. The license just says you have to
| say it was them if you make > $250mm a year on the model.
| viraptor wrote:
| Companies with $20M revenue will not normally have spare $1M
| available. They'd get more money by charging reasonable
| subscriptions than by using lawyers to chase sudden company-
| ending fees.
| laurentb wrote:
| it's monthly :) $240M revenue companies will absolutely
| find a way to fork $1M if they need to. Kimi most likely
| sees the eyeballs of free advertising as more profitable in
| the grander scheme of things
| endymi0n wrote:
| One. Trillion. Even on native int4 that's... half a terabyte of
| vram?!
|
| Technical awe aside at this marvel that cracks the 50th
| percentile of HLE, the snarky part of me says there's only half
| the danger in giving away something nobody can run at home
| anyway...
| Davidzheng wrote:
| that's what intelligence takes. Most of intelligence is just
| compute
| wongarsu wrote:
| Which conveniently fits on one 8xH100 machine. With 100-200
| GB left over for overhead, kv-cache, etc.
| storystarling wrote:
| The unit economics seem pretty rough though. You're locking
| up 8xH100s for the compute of ~32B active parameters. I
| guess memory is the bottleneck but hard to see how the
| margins work on that.
| johndough wrote:
| The model absolutely can be run at home. There even is a big
| community around running large models locally:
| https://www.reddit.com/r/LocalLLaMA/
|
| The cheapest way is to stream it from a fast SSD, but it will
| be quite slow (one token every few seconds).
|
| The next step up is an old server with lots of RAM and many
| memory channels with maybe a GPU thrown in for faster prompt
| processing (low two digits tokens/second).
|
| At the high end, there are servers with multiple GPUs with
| lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
|
| The key enabler here is that the models are MoE (Mixture of
| Experts), which means that only a small(ish) part of the
| model is required to compute the next token. In this case,
| there are 32B active parameters, which is about 16GB at 4 bit
| per parameter. This only leaves the question of how to get
| those 16GB to the processor as fast as possible.
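The arithmetic behind those speed tiers, as a back-of-envelope sketch (the bandwidth figures are rough, illustrative numbers, not measurements):

    active_params = 32e9
    bytes_per_param = 0.5                                  # int4
    gb_per_token = active_params * bytes_per_param / 1e9   # ~16 GB moved per token

    for tier, gb_per_s in [("PCIe 4.0 SSD", 7), ("8-channel DDR4 server", 170),
                           ("M3 Ultra unified memory", 819)]:
        print(f"{tier}: ~{gb_per_s / gb_per_token:.1f} tokens/s upper bound")
    # roughly 0.4, 11, and 51 tokens/s respectively, matching the tiers above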
| 1dom wrote:
| > The model absolutely can be run at home. There even is a
| big community around running large models locally
|
| IMO 1T parameters and 32B active is a different scale from
| what most people are talking about when they say local LLMs.
| Totally agree there will be people messing with this, but the
| real value in local LLMs is that you can actually use them and
| get value from them with standard consumer hardware. I don't
| think that's really possible with this model.
| zozbot234 wrote:
| 32B active is nothing special, there's local setups that
| will easily support that. 1T total parameters ultimately
| requires keeping the bulk of them on SSD. This need not
| be an issue if there's enough locality in expert choice
| for any given workload; the "hot" experts will simply be
| cached in available spare RAM.
| 1dom wrote:
| I never said it was special.
|
| I was trying to push back on the suggestion that a lot of
| people will be using models of this size locally because of
| the local LLM community.
|
| The most commonly downloaded local LLMs are normally <30b
| (e.g.
| https://huggingface.co/unsloth/models?sort=downloads).
| The things you're saying, especially when combined
| together, make it not usable by a lot of people in the
| local LLM community at the moment.
| spmurrayzzz wrote:
| When I've measured this myself, I've never seen a medium-
| to-long task horizon that would have expert locality such
| that you wouldn't be hitting the SSD constantly to swap
| layers (not to say it doesn't exist, just that in the
| literature and in my own empirics, it doesn't seem to be
| observed in a way you could rely on it for cache
| performance).
|
| Over any task that has enough prefill input diversity and
| a decode phase that's more than a few tokens, it's at least
| intuitive that experts activate nearly uniformly in the
| aggregate, since they're activated per token. This is why
| when you do something more than bs=1, you see forward
| passes light up the whole network.
| zozbot234 wrote:
| > hitting the SSD constantly to swap layers
|
| Thing is, people in the local llm community are already
| doing that to run the largest MoE models, using mmap such
| that spare-RAM-as-cache is managed automatically by the
| OS. It's a drag on performance to be sure but still
| somewhat usable, if you're willing to wait for results.
| And it unlocks these larger models on what's effectively
| semi-pro if not true consumer hardware. On the enterprise
| side, high bandwidth NAND Flash is just around the corner
| and perfectly suited for storing these large read-only
| model parameters (no wear and tear issues with the NAND
| storage) while preserving RAM-like throughput.
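A bare-bones illustration of the mmap approach described here; the file name and offsets are hypothetical, and real runtimes (llama.cpp and friends) handle this internally:

    import mmap, os

    fd = os.open("kimi-k2.5-int4.gguf", os.O_RDONLY)   # hypothetical weight file
    weights = mmap.mmap(fd, 0, prot=mmap.PROT_READ)    # map, don't load, the file

    def read_expert(offset: int, nbytes: int) -> bytes:
        # first access faults pages in from SSD; frequently used ("hot")
        # experts stay in the OS page cache, i.e. in whatever spare RAM exists
        return weights[offset:offset + nbytes]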
| spmurrayzzz wrote:
| I've tested this myself often (as an aside: I'm in said
| community, I run 2x RTX Pro 6000 locally, 4x 3090 before
| that), and I think what you said re: "willing to wait" is
| probably the difference maker for me.
|
| I can run Minimax 2.1 in 5bpw at 200k context fully
| offloaded to GPU. The 30-40 tk/s feels like a lifetime
| for long horizon tasks, especially with subagent
| delegation etc, but it's still fast enough to be a daily
| driver.
|
| But that's more or less my cutoff. Whenever I've tested
| other setups that dip into the single and sub-single
| digit throughput rates, it becomes maddening and entirely
| unusable (for me).
| zamadatix wrote:
| Local LLMs are just LLMs people run locally. It's not a
| definition of size, feature set, or what's most popular.
| What the "real" value is for local LLMs will depend on
| each person you ask. The person who runs small local LLMs
| will tell you the real value is in small models, the
| person who runs large local LLMs will tell you it's large
| ones, those who use cloud will say the value is in shared
| compute, and those who don't like AI will say there is no
| value in any.
|
| LLMs whose weights aren't available are an example of what
| isn't a local LLM; a model merely happening to be large
| isn't.
| 1dom wrote:
| > LLMs which the weights aren't available are an example
| of when it's not local LLMs, not when the model happens
| to be large.
|
| I agree. My point was that most aren't thinking of models
| this large when they're talking about local LLMs. That's
| what I said, right? This is supported by the download
| counts on hf: the most downloaded local models are
| significantly smaller than 1tln, normally 1 - 12bln.
|
| I'm not sure I understand what point you're trying to
| make here?
| zamadatix wrote:
| Mostly a "We know local LLMs as being this, and all of
| the mentioned variants of this can provide real value
| regardless of which is most commonly referenced" point.
| I.e. large local LLMs aren't only something people mess
| with, they often provide a lot of value for a relative
| few people rather than a little value for a relative lot
| of people as small local LLMs do. Who thinks which
| modality and type brings the most value is largely a
| matter of opinion of the user getting the value, not just
| the option which runs on consumer hardware or etc alone.
|
| You're of course accurate that smaller LLMs are more
| commonly deployed, it's just not the part I was really
| responding to.
| GeorgeOldfield wrote:
| do you guys understand that different experts are loaded
| PER TOKEN?
| dev_l1x_be wrote:
| How do you split the model between multiple GPUs?
| evilduck wrote:
| With "only" 32B active params, you don't necessarily need
| to. We're straying from common home users to serious
| enthusiasts and professionals but this seems like it
| would run ok on a workstation with a half terabyte of RAM
| and a single RTX6000.
|
| But to answer your question directly, tensor parallelism.
| https://github.com/ggml-org/llama.cpp/discussions/8735
| https://docs.vllm.ai/en/latest/configuration/conserving_memo...
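For the vLLM route, the tensor-parallel split is a constructor argument. A hedged sketch, assuming a node with enough aggregate VRAM and the weights already downloaded:

    from vllm import LLM, SamplingParams

    # shard the weights across 8 GPUs on one node
    llm = LLM(model="moonshotai/Kimi-K2.5", tensor_parallel_size=8)

    out = llm.generate(["Summarize mixture-of-experts routing in two sentences."],
                       SamplingParams(max_tokens=128))
    print(out[0].outputs[0].text)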
| WhitneyLand wrote:
| It's often pointed out in the first sentence of a comment
| how a model can be run at home, then (maybe) towards the
| end of the comment it's mentioned how it's quantized.
|
| Back when 4k movies needed expensive hardware, no one was
| saying they could play 4k on a home system, then later
| mentioning they actually scaled down the resolution to make
| it possible.
|
| The degree of quality loss is not often characterized.
| Which makes sense because it's not easy to fully quantify
| quality loss with a few simple benchmarks.
|
| By the time it's quantized to 4 bits, 2 bits or whatever,
| does anyone really have an idea of how much they've gained
| vs just running a model that is sized more appropriately
| for their hardware, but not lobotomized?
| selfhoster11 wrote:
| Except the parent comment said you can stream the weights
| from an SSD. The full weights, uncompressed. It takes a
| little longer (a lot longer), but the model at least
| works without lossy pre-processing.
| FuckButtons wrote:
| From my own usage, the former is almost always better
| than the latter. Because it's less like a lobotomy and
| more like a hangover, though I have run some quantized
| models that seem still drunk.
|
| Any model that I can run in 128 gb in full precision is
| far inferior to the models that I can just barely get to
| run after reap + quantization for actually useful work.
|
| I also read a paper a while back about improvements to
| model performance in contrastive learning when
| quantization was included during training as a form of
| perturbation, to try to force the model toward a smoother
| loss landscape. It made me wonder if something similar
| might work for LLMs, which I think might be what the people
| over at MiniMax are doing with M2.1, since they released it
| in fp8.
|
| In principle, if the model has been effective during its
| learning at separating and compressing concepts into
| approximately orthogonal subspaces (and assuming the
| white box transformer architecture approximates what
| typical transformers do), quantization should really only
| impact outliers which are not well characterized during
| learning.
| WhitneyLand wrote:
| Interesting.
|
| If this were the case however, why would labs go through
| the trouble of distilling their smaller models rather
| than releasing quantized versions of the flagships?
| dabockster wrote:
| Hanlon's razor.
|
| "Never attribute to malice that which is adequately
| explained by stupidity."
|
| Yes, I'm calling labs that don't distill smaller sized
| models stupid for not doing so.
| zozbot234 wrote:
| > ...Back when 4k movies needed expensive hardware, no
| one was saying they could play 4k on a home system, then
| later mentioning they actually scaled down the resolution
| to make it possible. ...
|
| int4 quantization is the original release in this case;
| it's not been quantized after the fact. It's a bit of a
| nuisance when running on hardware that doesn't natively
| support the format (might waste some fraction of memory
| throughput on padding, specifically on NPU hw that can't
| do the unpacking on its own) but no one here is reducing
| quality to make the model fit.
| WhitneyLand wrote:
| Good point thanks for the clarification.
|
| The broader point remains though, which is that "you can run
| this model at home..." when actually the caveats are
| potentially substantial.
|
| It would be so incredibly slow...
| Gracana wrote:
| The level of deceit you're describing is kind of
| ridiculous. Anybody talking about their specific setup is
| going to be happy to tell you the model and quant they're
| running and the speeds they're getting, and if you want
| to understand the effects of quantization on model
| quality, it's really easy to spin up a GPU server
| instance and play around.
| jasonjmcghee wrote:
| > if you want to understand the effects of quantization
| on model quality, it's really easy to spin up a GPU
| server instance and play around
|
| Fwiw, not necessarily. I've noticed quantized models have
| strange and surprising failure modes where everything
| seems to be working well and then does a death spiral
| repeating a specific word or completely failing on one
| task of a handful of similar tasks.
|
| 8-bit vs 4-bit can be almost imperceptible or night and
| day.
|
| This isn't something you'd necessarily see just playing
| around, but it shows up when trying to do something specific.
| codexon wrote:
| Didn't this paper demonstrate that you only need 1.58
| bits to be equivalent to 16 bits in performance?
|
| https://arxiv.org/abs/2402.17764
| WhitneyLand wrote:
| Iirc the paper was solid, but it still hasn't been
| adopted/proven out at large scale. Harder to adapt
| hardware and code kernels to something like this compared
| to int4.
| Ey7NFZ3P0nzAe wrote:
| This technique showed that there are ways _during
| training_ to optimize weights so they quantize neatly while
| remaining performant. This isn't a _post-training_
| quantization like int4.
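As a generic illustration of training-time quantization (a straight-through-estimator sketch, not necessarily what BitNet or Moonshot actually do):

    import torch

    def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
        # symmetric int4 quantize-dequantize applied in the forward pass
        scale = w.abs().max() / 7
        w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
        # straight-through estimator: forward sees w_q, backward sees identity,
        # so training is pushed toward weights that survive the rounding
        return w + (w_q - w).detach()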
| PlatoIsADisease wrote:
| >The model absolutely can be run at home.
|
| There is a huge difference between "look I got it to answer
| the prompt: '1+1='"
|
| and actually using it for anything of value.
|
| I remember early on people bought Macs (or some marketing
| team was shoveling it), and proposing people could
| reasonably run the 70B+ models on it.
|
| They were talking about 'look it gave an answer', not 'look
| this is useful'.
|
| While it was a bit obvious that 'integrated GPU' is not
| Nvidia VRAM, we did have 1 mac laptop at work that
| validated this.
|
| It's cool these models are out in the open, but it's going to
| be a decade before people are running them at a useful
| level locally.
| esafak wrote:
| Hear, hear. Even if the model fits, a few tokens per
| second make no sense. Time is money too.
| tempoponet wrote:
| Maybe for a coding agent, but a daily/weekly report on
| sensitive info?
|
| If it were 2016 and this technology existed but only in 1
| t/s, every company would find a way to extract the most
| leverage out of it.
| esafak wrote:
| But it's 2026 and 'secure' (by executive standards)
| hosted options exist.
| dabockster wrote:
| > 'secure' (by executive standards)
|
| "Secure" in the sense that they can sue someone after the
| fact, instead of preventing data from leaking in the
| first place.
| michaellee8 wrote:
| If they figured out it can be this useful in 2016 running
| 1 t/s, they would make it run at least 20 t/s by 2019
| hex4def6 wrote:
| If I can start an agent and be able to walk away for 8
| hours, and be confident it's 'smart' enough to complete a
| task unattended, that's still useful.
|
| At 3 tk/s, that's still 100-150 pages of a book, give or
| take.
| esafak wrote:
| True, that's still faster than a human, but they're not
| nearly that reliable yet.
| dabockster wrote:
| You can run AI models on unified/shared memory specifically
| on Windows, not Linux (unfortunately). It uses the same
| memory sharing system that Microsoft originally had built
| for gaming when a game would run out of vram. If you:
|
| - have an i5 or better or equivalent manufactured within
| the last 5-7 years
|
| - have an nvidia consumer gaming GPU (RTX 3000 series or
| better) with at least 8 GB vram
|
| - have at least 32 GB system ram (tested with DDR4 on my
| end)
|
| - build llama-cpp yourself with every compiler optimization
| flag possible
|
| - pair it with a MoE model compatible with your unified
| memory amount
|
| - and configure MoE offload to the CPU to reduce memory
| pressure on the GPU
|
| then you can honestly get to about 85-90% of cloud AI
| capability totally on-device, depending on what program you
| interface with the model.
|
| And here's the shocking idea: those system specs can be met
| by an off the shelf gaming computer from, for example, Best
| Buy or Costco today and right now. You can literally buy a
| CyberPower or iBuyPower model, again for example, download
| the source, run the compilation, and have that level of AI
| inference available to you.
|
| Now, the reason why it won't work on Linux is that the
| Linux kernel and Linux distros both leave that unified
| memory capability up to the GPU driver to implement. Which
| Nvidia hasn't done yet. You can code it somewhat into
| source code, but it's still super unstable and flaky from
| what I've read.
|
| (In fact, that lack of unified memory tech on Linux is
| probably why everyone feels the need to build all these
| data centers everywhere.)
| side_up_down wrote:
| I'd take "running at home" to mean running on reasonably
| available consumer hardware, which your setup is not. You
| can obviously build custom, but who's actually going to do
| that? OP's point is valid
| the_sleaze_ wrote:
| $3,998.99 for 500GB of RAM on Amazon
|
| "Good Luck" - Kimi <Taken voice>
| mrinterweb wrote:
| VRAM is the new moat, and controlling pricing and access to
| VRAM is part of it. There will be very few hobbyists who can
| run models of this size. I appreciate the spirit of making
| the weights open, but realistically, it is impractical for
| >99.999% of users to run locally.
| segmondy wrote:
| I run Kimi K2 at home, most of it in system RAM with a few
| layers offloaded to old 3090s. This is a cheap budget build.
|
| Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf -- generation:
| 5,231 tokens in 604.63s (8.65 tokens/s)
| mapkkk wrote:
| Could I trouble you for the specifics of your build? I'd
| love to see if it would be a viable upgrade for me.
|
| I currently have a 3970x with a bunch of 3090s.
| segmondy wrote:
| 4x 3090s, an Epyc motherboard with 8-channel memory, a 7352
| CPU, and slow 2400MHz DDR4 RAM.
| redox99 wrote:
| Cursor devs, who go out of their way to not mention their
| Composer model is based on GLM, are not going to like that.
| msp26 wrote:
| Source? I've heard this rumour twice but never seen proof. I
| assume it would be based on tokeniser quirks?
| lrvick wrote:
| Actually open source, or yet another public model, which is the
| equivalent of a binary?
|
| URL is down so cannot tell.
| Tepix wrote:
| It's open weights, not open source.
| typ wrote:
| The label 'open source' has become a reputation reaping and
| marketing vehicle rather than an informative term since the
| Hugging Face benchmark race started. With the weights only, we
| cannot actually audit whether a model is a) contaminated by
| benchmarks, b) built with deliberate biases, or c) trained on
| copyrighted/private data, let alone allow other vendors to
| replicate the results. Anyway, people still love free stuff.
| Der_Einzige wrote:
| Just accept that IP laws don't matter and the old "free
| software" paradigm is dead. Aaron Swartz died so that GenAI
| may live. RMS and his model of "copyleft" are so Web 1.0 (not
| even 2.0). No one in GenAI cares AT ALL about the true
| definition of open source. Good.
| duskdozer wrote:
| Good?
| Reubend wrote:
| I've read several people say that Kimi K2 has a better "emotional
| intelligence" than other models. I'll be interested to see
| whether K2.5 continues or even improves on that.
| storystarling wrote:
| yes, though this is highly subjective - it 'feels' like that to
| me as well (compared to Gemini 3, GPT 5.2, Opus 4.5).
| Alifatisk wrote:
| Yup, I experience the same. I don't know what they do to
| achieve this but it gives them this edge, really curious to
| learn more about what makes it so good at it.
| in-silico wrote:
| A lot of people point to the Muon optimizer that Moonshot
| (the creators of Kimi) pioneered. Compared to the standard
| optimizer AdamW, Muon amplifies low-magnitude gradient
| directions which makes the model learn faster (and maybe
| gives Kimi its unique qualities).
|
| Muon paper: https://arxiv.org/abs/2502.16982
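The mechanism can be sketched roughly: Muon-style optimizers orthogonalize the momentum of the gradient for each weight matrix, which evens out its singular values and thereby boosts weak directions. A simplified cubic Newton-Schulz iteration, not Moonshot's exact coefficients or implementation:

    import torch

    def orthogonalize(update: torch.Tensor, steps: int = 8) -> torch.Tensor:
        # push all singular values of the update toward 1, so low-magnitude
        # gradient directions get amplified relative to dominant ones
        X = update / (update.norm() + 1e-7)    # scale singular values into (0, 1]
        for _ in range(steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X    # cubic Newton-Schulz step
        return X

    # rough usage: W -= lr * orthogonalize(momentum_buffer_for_W)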
| Alifatisk wrote:
| Wow! Thank you
| mohsen1 wrote:
| I'll test it out on mafia-arena.com once it is available on
| OpenRouter
| pplonski86 wrote:
| There are so many models; is there any website with a list of
| all of them and a comparison of their performance on different
| tasks?
| coffeeri wrote:
| There is https://artificialanalysis.ai
| pplonski86 wrote:
| Thank you! Exactly what I was looking for
| XCSme wrote:
| There are many lists, but I find all of them outdated or
| containing wrong information or missing the actual benchmarks
| I'm looking for.
|
| I was thinking that maybe it's better to make my own
| benchmarks with the questions/things I'm interested in, and
| whenever a new model comes out, run those tests against it
| using OpenRouter.
| Reubend wrote:
| The post actually has great benchmark tables inside of it. They
| might be outdated in a few months, but for now, it gives you a
| great summary. Seems like Gemini wins on image and video perf,
| Claude is the best at coding, ChatGPT is the best for general
| knowledge.
|
| But ultimately, you need to try them yourself on the tasks you
| care about and just see. My personal experience is that right
| now, Gemini Pro performs the best at everything I throw at it.
| I think it's superior to Claude and all of the OSS models by a
| small margin, even for things like coding.
| Imustaskforhelp wrote:
| I like Gemini Pro's UI over Claude so much but honestly I
| might start using Kimi K2.5 if its open source & just +/-
| Gemini Pro/Chatgpt/Claude because at that point I feel like
| the results are negligible and we are getting SOTA open
| source models again.
| wobfan wrote:
| > honestly I might start using Kimi K2.5 if its open source
| & just +/- Gemini Pro/Chatgpt/Claude because at that point
| I feel like the results are negligible and we are getting
| SOTA open source models again.
|
| Me too!
|
| > I like Gemini Pro's UI over Claude so much
|
| This I don't understand. I mean, I don't see a lot of
| difference in both UIs. Quite the opposite, apart from some
| animations, round corners and color gradings, they seem to
| look very alike, no?
| Imustaskforhelp wrote:
| Y'know, I ended up buying Kimi's moderato plan, which is
| $19, but they had this unique idea where you can talk to a
| bot and it could reduce the price.
|
| I made it reduce the price of the first month to $1.49 (it
| could go to $0.99 and my frugal mind wanted it haha, but I
| just couldn't have it do that lol)
|
| Anyways, afterwards, for privacy purposes (I am a minor
| so don't have a card), I ended up going to G2A to get a $10
| Visa gift card essentially and used it. (I had to pay $1
| extra, but sure.)
|
| Installed kimi code on my mac and trying it out.
| Honestly, I am kind of liking it.
|
| My internal benchmark is creating pomodoro apps in golang
| web... Gemini 3 pro has nailed it, I just tried the kimi
| version and it does have some bugs but it feels like it
| added more features.
|
| Gonna have to try it out for a month.
|
| I mean I just wish it was this cheap for the whole year
| :< (As I could then move from, say using the completely
| free models)
|
| Gonna have to try it out more!
| striking wrote:
| https://archive.is/P98JR
| zmmmmm wrote:
| Curious what would be the most minimal reasonable hardware one
| would need to deploy this locally?
| NitpickLawyer wrote:
| I parsed "reasonable" as in having reasonable speed to actually
| use this as intended (in agentic setups). In that case, it's a
| minimum of 70-100k for hardware (8x 6000 PRO + all the other
| pieces to make it work). The model comes with native INT4
| quant, so ~600GB for the weights alone. An 8x 96GB setup would
| give you ~160GB for kv caching.
|
| You can of course "run" this on cheaper hardware, but the
| speeds will not be suitable for actual use (i.e. minutes for a
| simple prompt, tens of minutes for high context sessions per
| turn).
| simonw wrote:
| Models of this size can usually be run using MLX on a pair of
| 512GB Mac Studio M3 Ultras, which are about $10,000 each so
| $20,000 for the pair.
| PlatoIsADisease wrote:
| You might want to clarify that this is more of a "Look it
| technically works"
|
| Not a "I actually use this"
|
| The difference between waiting 20 minutes to answer the
| prompt '1+1='
|
| and actually using it for something useful is massive here. I
| wonder where this idea of running AI on CPU comes from. Was
| it Apple astroturfing? Was it Apple fanboys? I don't see
| people wasting time on non-Apple CPUs. (Although, I did do
| this for a 7B model)
| tucnak wrote:
| Mac studio way is not "AI on CPU," as M2/M4 are complex
| SoC, that includes a GPU with unified memory access.
| PlatoIsADisease wrote:
| If it worked IRL for anything useful, I'd be more
| interested in the technical differences. But it was a
| mere toy for a few tests at my fortune 20 company.
|
| Language is full of issues of particulars vs universals,
| and you could debate whether it's just an integrated GPU
| with different marketing.
|
| Whatever the case, we couldn't use it in production, and
| NVIDIAs stock price reflects the reality on the ground.
| tucnak wrote:
| Well, I've been using a fine-tuned variant of Gemma 3
| model since it came out, and some embedding models, on a
| laptop. It's not "useless" by any means, in fact it still
| beats the latest Claude for my use-case in Ukrainian. Not
| to mention that if you travel by train a lot, you will
| find it quite useful. I own a Mac studio M2 Max (96 GB)
| variant at home, and I'm routinely using the larger
| models for the kind of stuff I don't wish to share with
| model providers.
|
| My 2 cents
| mholm wrote:
| The reason Macs get recommended is the unified memory,
| which is usable as VRAM for the GPU. People are similarly
| using the AMD Strix Halo for AI which also has a similar
| memory architecture. Time to first token for something like
| '1+1=' would be seconds, and then you'd be getting ~20
| tokens per second, which is absolutely plenty fast for
| regular use. Token/s slows down at the higher end of
| context, but it's absolutely still practical for a lot of
| usecases. Though I agree that agentic coding, especially
| over large projects, would likely get too slow to be
| practical.
| zozbot234 wrote:
| Not too slow if you just let it run overnight/in the
| background. But the biggest draw would be no rate limits
| whatsoever compared to the big proprietary APIs,
| especially Claude's. No risk of sudden rugpulls either,
| and the model will have very consistent performance.
| PlatoIsADisease wrote:
| We are getting into a debate between particulars and
| universals. To call the 'unified memory' VRAM is quite a
| generalization. Whatever the case, we can tell from stock
| prices that whatever this VRAM is, it's nothing compared
| to NVIDIA.
|
| Anyway, we were trying to run a 70B model on a
| MacBook (can't remember which M model) at a fortune 20
| company, and it never became practical. We were trying to
| compare strings of character length ~200. It was like
| 400-ish characters plus a pre-prompt.
|
| I can't imagine this being reasonable on a 1T model, let
| alone the 400B models of deepseek and LLAMA.
| simonw wrote:
| Here's a video of a previous 1T K2 model running using
| MLX on a a pair of Mac Studios:
| https://twitter.com/awnihannun/status/1943723599971443134
| - performance isn't terrible.
| PlatoIsADisease wrote:
| Is there a catch? I was not getting anything like this on
| a 70B model.
|
| EDIT: oh, it's a marketing account and the program never
| finished... who knows the validity.
| simonw wrote:
| I don't think Awni should be dismissed as a "marketing
| account" - they're an engineer at Apple who's been
| driving the MLX project for a couple of years now,
| they've earned a lot of respect from me.
| PlatoIsADisease wrote:
| Given how secretive Apple is, oh my, it's a super duper
| marketing account.
| mholm wrote:
| Jeff Geerling and a few others also got access to
| similarly specced mac clusters. They replicated this
| performance.
|
| The tooling involved has improved significantly over the
| past year.
| Gracana wrote:
| With 32B active parameters, Kimi K2.5 will run faster
| than your 70B model.
| simonw wrote:
| MLX uses the GPU.
|
| That said, I wouldn't necessarily recommend spending
| $20,000 on a pair of Mac Studios to run models like this.
| The performance won't be nearly as good as the server-class
| GPU hardware that hosted models run on.
| tosh wrote:
| I think you can put a bunch of Apple Silicon Macs with enough
| RAM together
|
| e.g. in an office or coworking space
|
| 800-1000 GB RAM perhaps?
| rvz wrote:
| The chefs at Moonshot have cooked once again.
| Jackson__ wrote:
| As your local vision nut, their claims about "SOTA" vision are
| absolutely BS in my tests.
|
| Sure it's SOTA at standard vision benchmarks. But on tasks that
| require proper image understanding (see for example
| BabyVision [0]), it appears very much lacking compared to
| Gemini 3 Pro.
|
| [0] https://arxiv.org/html/2601.06521v1
| nostrebored wrote:
| Gemini remains the only usable vision fm :(
| Topfi wrote:
| K2 0905 and K2 Thinking shortly after it have done impressively
| well in my personal use cases and were severely slept on. Faster,
| more accurate, less expensive, more flexible in terms of hosting,
| and available months before Gemini 3 Flash; I really struggle to
| understand why Flash got such positive attention at launch.
|
| Interested in the dedicated Agent and Agent Swarm releases,
| especially in how that could affect third party hosting of the
| models.
| msp26 wrote:
| K2 thinking didn't have vision which was a big drawback for my
| projects.
| bertili wrote:
| The "Deepseek moment" is just one year ago today!
|
| Coincidence or not, let's just marvel for a second over this
| amount of magic/technology that's being given away for free...
| and how liberating and different this is from OpenAI and others
| that were closed to "protect us all".
| motoboi wrote:
| What amazes me is why would someone spend millions to train
| this model and give it away for free. What is the business
| here?
| testfrequency wrote:
| Curious to hear what "OpenAI" thinks the answer to this is
| YetAnotherNick wrote:
| Hosting the model gets cheaper per token the more batched
| tokens you serve, so they have a big advantage here.
| whizzter wrote:
| The Chinese state maybe sees open collaboration as the way
| to nullify any US lead in the field; concurrently, if the next
| "search winner" is built upon their model, it will carry the
| Chinese worldview that Taiwan belongs to China and that the
| Tiananmen Square massacre never happened.
|
| Also, their license says that if you have a big product you
| need to promote them; remember how Google "gave away" site
| search widgets, and that was perhaps one of the major ways
| they gained recognition for being the search leader.
|
| OpenAI/NVidia is the Pets.com/Sun of our generation, insane
| valuations, stupid spend, expensive options, expensive
| hardware and so on.
|
| Sun hardware bought for $50k to run websites in 2000 is
| less capable than perhaps a $5/month VPS today?
|
| "Scaling to AGI/ASI" was always a fools errand, best case
| OpenAI should've squirreled away money to have a solid
| engineering department that could focus on algorithmic
| innovations but considering that Antrophic, Google and
| Chinese firms have caught up or surpassed them it seems they
| didn't.
|
| Once things blow up, what will be left are the closed options
| that had somewhat sane/solid model research and handle things
| better, plus a ton of new competitors running modern/cheaper
| hardware and just using models as building blocks.
| zozbot234 wrote:
| > "Scaling to AGI/ASI" was always a fools errand
|
| Scaling depends on hardware, so cheaper hardware on a
| compute-per-watt basis only makes scaling easier. There is
| no clear definition of AGI/ASI but AI has already scaled to
| be quite useful.
| dev_l1x_be wrote:
| > Taiwan belongs to China
|
| So they are on the same page as the UN and US?
|
| The One China policy refers to a United States policy of
| strategic ambiguity regarding Taiwan.[1] In a 1972 joint
| communique with the PRC, the United States "acknowledges
| that all Chinese on either side of the Taiwan Strait
| maintain there is but one China and that Taiwan is a part
| of China" and "does not challenge that position."
|
| https://en.wikipedia.org/wiki/One_China
| https://en.wikipedia.org/wiki/Taiwan_and_the_United_Nations
| 9cb14c1ec0 wrote:
| The One China policy is a fiction of foreign policy
| statecraft, designed to sideline the issue without having
| to actually deal with it. It is quite clear that apart
| from the official fiction there is a real policy that is
| not One China. This is made clear by the weapons sales to
| Taiwan that specifically calibrated to make a Chinese
| military action harder.
| pqtyw wrote:
| Existence of an independent and effectively sovereign
| state on the island of Taiwan (however one calls it) is a
| fact. Whatever doublespeak governments of other countries
| or international organizations engage in due to political
| reasons does not change that.
| two_tasty wrote:
| I love how Tiananmen square is always brought up as some
| unique and tragic example of disinformation that could
| never occur in the west, as though western governments
| don't do the exact same thing with our worldview. Your
| veneer of cynicism scarcely hides the structure of naivety
| behind.
| igneo676 wrote:
| The difference is that, in the west, there's an
| acceptable counter narrative. I can tell you that Ruby
| Ridge and Waco never should've happened and were examples
| of government overreach and massacre of its own
| citizens. Or <insert pet issue with the government here>
|
| You can't with Tiananmen square in China
| mannanj wrote:
| I still see/hear cynicism with a hidden structure of
| naivety behind.
| ggdG wrote:
| I think this fits into some "Commoditize The Complement"
| strategy.
|
| https://gwern.net/complement
| Balinares wrote:
| Speculating: there are two connected businesses here,
| creating the models, and serving the models. Outside of a few
| moneyed outliers, no one is going to run this at home. So at
| worst opening this model allows mid-sized competitors to
| serve it to customers from their own infra -- which helps
| Kimi gain mindshare, particularly against the large
| incumbents who are definitely _not_ going to be serving Kimi
| and so don't benefit from its openness.
|
| Given the shallowness of moats in the LLM market, optimizing
| for mindshare would not be the worst move.
| tokioyoyo wrote:
| Moonshot's (Kimi's owner) investors are Alibaba/Tencent et
| al. Chinese market is stupidly competitive, and there's a
| general attitude of "household name will take it all".
| However getting there requires having a WeChat-esque user
| base, through one way or another. If it's paid, there'll be
| friction and it won't work. Plus, it undermines a lot of
| other companies, which is a win for a lot of people.
| WarmWash wrote:
| It's another state project funded at the discretion of the
| party.
|
| If you look at past state projects, profitability wasn't
| really considered much. They are notorious for a "Money hose
| until a diamond is found in the mountains of waste"
| deskamess wrote:
| I think there is a book (Chip War) about how the USSR did not
| effectively participate in staying at the edge of the
| semiconductor revolution. And they have suffered for it.
|
| China has decided they are going to participate in the
| LLM/AGI/etc revolution at any cost. So it is a sunk cost, and
| the models are just an end product and any revenue is
| validation and great, but not essential. The cheaper price
| points keep their models used and relevant. It challenges the
| other (US, EU) models to innovate and keep ahead to justify
| their higher valuations (both monthly plan, and investor).
| Once those advances are made, they can be brought back into
| their own models. In effect, the currently leading models are
| running from a second place candidate who never gets tired
| and eventually does what they do at a lower price point.
| kaibee wrote:
| In some way, the US won the cold war by spending so much on
| military that the USSR, in trying to keep up, collapsed. I
| don't see any parallels between that and China providing
| infinite free compute to their AI labs, why do you ask?
| culi wrote:
| All economically transformative technologies have done
| similar. If it's privatized, it's not gonna be transformative
| across the industry. The GPS, the internet, touchscreens, AI
| voice assistants, microchips, LCDs, etc were all publicly
| funded (or made by Bell Labs which had a state-mandated
| monopoly that forced them to open up their patents).
|
| The economist Mariana Mazzucato wrote a great book about this
| called _The Entrepreneurial State: Debunking Public vs.
| Private Sector Myths_
| overfeed wrote:
| > What amazes me is why would someone spend millions to train
| this model and give it away for free. What is the business
| here?
|
| How many millions did Google spend on Android (acquisition
| and salaries), only to give it away for free?
|
| Usually, companies do this to break into a monopolized market
| (or one that's at risk of becoming one), with openness as a
| sweetener. IBM with Linux to break UNIX-on-big-iron
| domination, Google with Android vs. iPhone, Sun with
| OpenSolaris vs. Linux-on-x86.
| jimmydoe wrote:
| It's not coincidence. Chinese companies tend to do big releases
| before Chinese new year. So expect more to come before Feb 17.
| catigula wrote:
| I mean, there are credible safety issues here. A Kimi fine-tune
| will absolutely be able to help people do cybersecurity related
| attacks - very good ones.
|
| In a few years, or less, biological attacks and other sorts of
| attacks will be plausible with the help of these agents.
|
| Chinese companies aren't humanitarian endeavors.
| PlatoIsADisease wrote:
| I am convinced that was mostly just marketing. No one uses
| deepseek as far as I can tell. People are not running it
| locally. People choose GPT/Gemini/Claude/Grok if you are giving
| your data away anyway.
|
| The biggest source of my conspiracy theory is that I made a
| reddit thread asking "Why all the DeepSeek hype?" or
| something like that. And to this day, I get odd 'pro-DeepSeek'
| comments from accounts only used every few months. It's not
| like this was some highly upvoted topic sitting in the 'Top'.
|
| I'd put that deepseek marketing on-par with an Apple marketing
| campaign.
| mekpro wrote:
| Except that, on OpenRouter, DeepSeek always maintains a top-10
| ranking. Although I don't use it personally, I believe their
| main advantage over other models is price/performance.
| culi wrote:
| Fifth in market share in fact!
|
| https://openrouter.ai/rankings
|
| There are a lot of applications where you really just want
| a cheap and efficient model that's still somewhat
| competitive and that's exactly the niche DeepSeek fulfills
| the best.
| logicprog wrote:
| I don't use DeepSeek, but I prefer Kimi and GLM to closed
| models for most of my work.
| segmondy wrote:
| There have been so many moments that folks not really heavy into
| LLMs have missed. DeepSeek R1 was great, but so were all the
| "incremental" improvements: v3-0324, v3.1, v3.1-terminus, and
| now v3.2-speciale. With that, this is the 3rd great Kimi model,
| and GLM has been awesome since 4.5, with 4.5, 4.5-air, 4.6,
| 4.7 and now 4.7 flash. Minimax-M2 has also been making waves
| lately... and I'm just talking about the Chinese models
| without adding the 10+ Qwen models. Outside of Chinese models,
| mistral-small/devstral, gemma-27b-it, gpt-oss-120b, seed-os
| have been great, and I'm still talking about just LLM, not
| image, audio or special domain models like deepseek-prover and
| deepseek-math. It's really a marvel what we have at home. I
| cancelled OpenAI and Anthropic subscription 2 years ago once
| they started calling for regulation of open models and I
| haven't missed them one bit.
| pu_pe wrote:
| I don't get this "agent swarm" concept. You set up a task and
| they boot up 100 LLMs to try to do it in parallel, and then one
| "LLM judge" puts it all together? Is there anywhere I can read
| more about it?
| jonkoops wrote:
| The datacenters yearn for the chips.
| rvnx wrote:
| You have a team lead that establishes a list of tasks needed
| to achieve your mission.
|
| Then it creates a list of employees, each of them specialized
| for a task, and they work in parallel.
|
| Essentially hiring a team of people who get specialized on one
| problem.
|
| Do one thing and do it well.
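A toy sketch of that team-lead / specialist pattern; call_llm is a stand-in for any chat client, and the task split is hard-coded where a real orchestrator model would produce it itself:

    import asyncio

    async def call_llm(role: str, prompt: str) -> str:
        await asyncio.sleep(0.1)                   # pretend network latency
        return f"<{role}> result for: {prompt}"

    async def run_swarm(mission: str) -> str:
        # the lead would normally produce this split itself; hard-coded here
        subtasks = [f"{mission}: backend", f"{mission}: frontend", f"{mission}: tests"]
        # specialized workers run concurrently
        results = await asyncio.gather(*(call_llm("specialist", t) for t in subtasks))
        # a final call merges the partial results into one answer
        return await call_llm("team lead", "merge:\n" + "\n".join(results))

    print(asyncio.run(run_swarm("build a pomodoro app")))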
| XCSme wrote:
| But in the end, isn't this the same idea with the MoE?
|
| Where we have more specialized "jobs", which the model is
| actually trained for.
|
| I think the main difference with agent swarms is the ability
| to run them in parallel. I don't see how this adds much
| compared to simply sending multiple API calls in parallel
| with your desired tasks. I guess the only difference is that
| you let the AI decide how to split those requests and what
| each task should be.
| zozbot234 wrote:
| Nope. MoE is strictly about model parameter sparsity.
| Agents are about running multiple small-scale tasks in
| parallel and aggregating the results for further processing
| - it saves a lot of context length compared to having it
| all in a single session, and context length has quadratic
| compute overhead so this matters. You can have both.
|
| One positive side effect of this is that if subagent tasks
| can be dispatched to cheaper and more efficient edge-
| inference hardware that can be deployed at scale (think
| nVidia Jetsons or even Apple Macs or AMD APU's) even though
| it might be highly limited in what can fit on the single
| node, then complex coding tasks ultimately become a lot
| _cheaper_ per token than generic chat.
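The context-length saving is easy to put rough numbers on, counting pairwise-attention operations only and ignoring everything else:

    single_context = 100_000 ** 2             # one agent holding 100k tokens
    split_contexts = 10 * (10_000 ** 2)       # ten sub-agents with 10k each
    print(single_context / split_contexts)    # -> 10.0, i.e. ~10x fewer attention ops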
| XCSme wrote:
| Yes, I know you can have both.
|
| My point was that this is just a different way of
| creating specialised task solvers, the same as with MoE.
|
| And, as you said, with MoE it's about the model itself,
| and it's done at training level so that's not something
| we can easily do ourselves.
|
| But with agent swarm, isn't it simply splitting a task into
| multiple sub-tasks and sending each one in a different
| API call? So this can be done with any of the previous
| models too, only that the user has to manually define
| those tasks/contexts for each query.
|
| Or is this at a much more granular level than this, which
| would not be feasible to be done by hand?
|
| I was already doing this in n8n, creating different
| agents with different system prompts for different tasks.
| I am not sure if automating this (with a swarm) would work
| well in most of my cases, and I don't see how this fully
| complements Tools or Skills.
| zozbot234 wrote:
| MoE has nothing whatsoever to do with specialized task
| solvers. It always operates per token within a single
| task, you can think of it perhaps as a kind of learned
| "attention" for model parameters as opposed to context
| data.
| XCSme wrote:
| Yes, specific weights/parameters have been trained to solve
| specific tasks (trained on different data).
|
| Or did I misunderstand the concept of MoE, and it's not
| about having specific parts of the model (parameters) do
| better on specific input contexts?
| vessenes wrote:
| You can read about this basically everywhere - the term of art
| is agent orchestration. Gas town, Claude's secret swarm mode,
| or people who like to use phrases like "Wiggum loop" will get
| you there.
|
| If you're really lazy - the quick summary is that you can
| benefit from the sweet spot of context length and reduce
| instruction overload while getting some parallelism benefits
| from farming tasks out to LLMs with different instructions. The
| way this is generally implemented today is through tool
| calling, although Claude also has a skills interface it has
| been trained against.
|
| So the idea would be for software development, why not have a
| project/product manager spin out tasks to a bunch of agents
| that are primed to be good at different things? E.g. an
| architect, a designer, and so on. Then you just need something
| that can rectify GitHub PRs and bob's your uncle.
|
| Gas town takes a different approach and parallelizes on coding
| tasks of any sort at the base layer, and uses the orchestration
| infrastructure to keep those coders working constantly,
| optimizing for minimal human input.
| IanCal wrote:
| I'm not sure whether there are parts of this done for claude
| but those other ones are layers on top of the usual LLMs we
| see. This seems to be a bit different, in that there's a
| different model trained specifically for splitting up and
| managing the workload.
| Rebuff5007 wrote:
| I've also been quite skeptical, and I became even _more_
| skeptical after hearing a tech talk from a startup in this
| space [1].
|
| I think the best way to think about it is that its an
| engineering hack to deal with a shortcoming of LLMs: for
| complex queries LLMs are unable to directly compute a SOLUTION
| given a PROMPT, but are instead able to break down the prompt
| to intermediate solutions and eventually solve the original
| prompt. These "orchestrator" / "swarm" agents add some
| formalism to this and allow you to distribute compute, and then
| also use specialized models for some of the sub problems.
|
| [1] https://www.deepflow.com/
| vinhnx wrote:
| One thing that caught my eye is that besides the K2.5 model,
| Moonshot AI also launched Kimi Code (https://www.kimi.com/code),
| evolved from Kimi CLI. It is a terminal coding agent; I've been
| using it for the last month with a Kimi subscription, and it is
| a capable agent with a stable harness.
|
| GitHub: https://github.com/MoonshotAI/kimi-cli
| Imanari wrote:
| How does it fare against CC?
| forgotpwd16 wrote:
| >Kimi Code CLI is not only a coding agent, but also a shell.
|
| That's cool. It also has a zsh hook, allowing you to switch to
| agent mode wherever you are.
| vinhnx wrote:
| It is. Kimi Code CLI supports Zed's Agent Client Protocol
| (http://agentclientprotocol.com/), so it can act as an
| external agent that runs in any ACP-compatible client,
| e.g. Zed, JetBrains, Toad CLI, Minano Notebook. Also, it
| supports Agent Skills. The Moonshot AI developers actively
| update the agent and are very active. I really like their CLI.
| esafak wrote:
| Does it support the swarm feature? Does Opencode?
| monkeydust wrote:
| Is this actually good or just optimized heavily for benchmarks? I
| am hopeful it's the former based on the writeup, but I need to put
| it through its paces.
| kurtis_reed wrote:
| Quite good in my testing
| Barathkanna wrote:
| A realistic setup for this would be a 16x H100 80GB with NVLink.
| That comfortably handles the active 32B experts plus KV cache
| without extreme quantization. Cost-wise we are looking at roughly
| $500k-$700k upfront or $40-60/hr on-demand, which makes it clear
| this model is aimed at serious infra teams, not casual single-GPU
| deployments. I'm curious how API providers will price tokens on
| top of that hardware reality.
| bertili wrote:
| The other realistic setup is $20k, for a small company that
| needs a private AI for coding or other internal agentic use,
| with two Mac Studios connected over Thunderbolt 5 RDMA.
| zozbot234 wrote:
| That's great for affordable local use but it'll be slow: even
| with the proper multi-node inference setup, the thunderbolt
| link will be a comparative bottleneck.
| embedding-shape wrote:
| I'd love to see the prompt processing speed difference
| between 16x H100 and 2x Mac Studio.
| zozbot234 wrote:
| Prompt processing/prefill can even get some speedup from
| local NPU use most likely: when you're ultimately limited
| by thermal/power limit throttling, having more efficient
| compute available means more headroom.
| Barathkanna wrote:
| I asked GPT for a rough estimate to benchmark prompt
| prefill on an 8,192-token input.
| * 16x H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
| * 2x Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
|
| These are order-of-magnitude numbers, but the takeaway is
| that multi-H100 boxes are plausibly ~100x faster than
| workstation Macs for this class of model, especially for
| long-context prefill.
| ffsm8 wrote:
| You do realize that's entirely made up, right?
|
| Could be true, could be fake - the only thing we can be
| sure of is that it's made up with no basis in reality.
|
| This is not how you use llms effectively, that's how you
| give everyone that's using them a bad name from
| association
| Barathkanna wrote:
| That won't realistically work for this model. Even with only
| ~32B active params, a 1T-scale MoE still needs the full
| expert set available for fast routing, which means hundreds
| of GB to TBs of weights resident. Mac Studios don't share
| unified memory across machines, Thunderbolt isn't remotely
| comparable to NVLink for expert exchange, and bandwidth
| becomes the bottleneck immediately. You could maybe load
| fragments experimentally, but inference would be
| impractically slow and brittle. It's a very different class
| of workload than private coding models.
| zozbot234 wrote:
| If "fast" routing is per-token, the experts can just reside
| on SSDs; the performance is good enough these days. You
| don't need to globally share unified memory across the
| nodes, you'd just run distributed inference.
|
| Anyway, in the future your local model setups will just be
| downloading experts on the fly from experts-exchange. That
| site will become as important to AI as downloadmoreram.com.
| bertili wrote:
| People are running the previous Kimi K2 on 2 Mac Studios at
| 21 tokens/s or 4 Macs at 30 tokens/s. It's still premature,
| but not a completely crazy proposition for the near future,
| given the rate of progress.
| NitpickLawyer wrote:
| > 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s
|
| Keep in mind that most people posting speed benchmarks
| try them with basically 0 context. Those speeds will not
| hold at 32/64/128k context length.
| YetAnotherNick wrote:
| Depends on whether you are using tensor parallelism or pipeline
| parallelism; in the second case you don't need any sharing.
| omneity wrote:
| RDMA over Thunderbolt is a thing now.
| reissbaker wrote:
| Generally speaking, 8xH200s will be a lot cheaper than
| 16xH100s, and faster too. But both should technically work.
| pama wrote:
| You can do it, and it may be OK for a single user with idle
| waiting times, but performance/throughput will be roughly
| halved (closer to 2/3) and free context will be more limited
| with 8xH200 vs 16xH100 (assuming decent interconnect).
| Depending a bit on use case and workload, 16xH100 (or
| 16xB200) may be a better config for cost optimization. Often
| there is a huge economy of scale with such large mixture-of-
| experts models, so that it can even be cheaper to use 96 GPUs
| instead of just 8 or 16. The reasons are complicated and
| involve better prefill cache and less memory transfer per
| node.
| wongarsu wrote:
| The weights are int4, so you'd only need 8xH100
| a2128 wrote:
| You don't need to wait and see, Kimi K2 has the same hardware
| requirements and has several providers on OpenRouter:
|
| https://openrouter.ai/moonshotai/kimi-k2-thinking
| https://openrouter.ai/moonshotai/kimi-k2-0905
| https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
| https://openrouter.ai/moonshotai/kimi-k2
|
| Generally it seems to be in the neighborhood of $0.50/1M for
| input and $2.50/1M for output
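|
| At those rates even long agentic runs stay cheap in absolute
| terms; a quick sketch (the step counts and per-step token sizes
| below are made-up examples, and this ignores prompt caching):
|
|     # Rough per-run cost at ~$0.50/M input and $2.50/M output tokens.
|     PRICE_IN = 0.50 / 1e6    # $ per input token
|     PRICE_OUT = 2.50 / 1e6   # $ per output token
|
|     def run_cost(steps: int, in_tok: int, out_tok: int) -> float:
|         """Cost of an agentic run with `steps` inference calls."""
|         return steps * (in_tok * PRICE_IN + out_tok * PRICE_OUT)
|
|     # e.g. ~8k tokens of context in and ~500 tokens out per step
|     print(f"${run_cost(100, 8_000, 500):.2f}")    # ~ $0.53
|     print(f"${run_cost(1_000, 8_000, 500):.2f}")  # ~ $5.25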
| hmate9 wrote:
| About 600GB needed for the weights alone, so on AWS you need
| a p5.48xlarge (8x H100), which costs $55/hour.
| Alifatisk wrote:
| Have you all noticed that the latest releases (Qwen3 Max
| Thinking, now Kimi K2.5) from Chinese companies are benching
| against Claude Opus now and not Sonnet? They are truly catching
| up, almost at the same pace.
| zozbot234 wrote:
| The benching is sus, it's way more important to look at real
| usage scenarios.
| conception wrote:
| https://clocks.brianmoore.com
|
| K2 is one of the only models to nail the clock face test as
| well. It's a great model.
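|
| The probe itself is easy to reproduce; a minimal sketch via
| OpenRouter (the prompt wording is my guess, not necessarily what
| the site uses):
|
|     # Ask a model for an SVG clock showing the current time.
|     from datetime import datetime
|     from openai import OpenAI  # OpenAI-compatible client
|
|     client = OpenAI(base_url="https://openrouter.ai/api/v1",
|                     api_key="...")
|     now = datetime.now().strftime("%H:%M")
|     resp = client.chat.completions.create(
|         model="moonshotai/kimi-k2-thinking",
|         messages=[{"role": "user",
|                    "content": f"Draw an analog clock face showing {now}. "
|                               "Reply with a single valid <svg> element "
|                               "and nothing else."}],
|     )
|     with open("clock.svg", "w") as f:
|         f.write(resp.choices[0].message.content)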
| DJBunnies wrote:
| Cool comparison, but none of them get both the face and the
| time correct when I look at it.
| conception wrote:
| Refresh. It's not every time but k2 hits a perfect clock
| for me about 7/10 or so.
| culi wrote:
| Kimi K2 is remarkably consistently the best. I wonder if it's
| somehow been trained specifically on tasks like these. It
| seems too consistent to be a coincidence.
|
| Also shocking is how the most common runner-up I've seen is
| DeepSeek.
| WarmWash wrote:
| They distill the major western models, so anytime a new SOTA
| model drops, you can expect the Chinese labs to update their
| models within a few months.
| zozbot234 wrote:
| This is just a conspiracy theory/urban legend. How do you
| "distill" a proprietary model with no access to the original
| weights? Just doing the equivalent of training on chat/API
| logs has terrible effectiveness (you're trying to drink from
| a giant firehose through a tiny straw) and gives you no
| underlying improvements.
| Alifatisk wrote:
| Yes, they do distill. But saying all they do is distill is
| not correct and actually kind of unfair. These Chinese labs
| have done lots of research in this field and publish it to
| the public, and some if not most contribute open-weight
| models, making a future of local LLMs possible: DeepSeek,
| Moonshot, MiniMax, Z.ai, Alibaba (Qwen).
|
| They are not just leeching here; they took this innovation,
| refined it, and improved it further. This is what the
| Chinese labs are good at.
| Balinares wrote:
| Source?
| esafak wrote:
| They are, in benchmarks. In practice Anthropic's models are
| ahead of where their benchmarks suggest.
| HNisCIS wrote:
| Bear in mind that lead may be, in large part, from the
| tooling rather than the model
| simonw wrote:
| Pretty cute pelican https://tools.simonwillison.net/svg-
| render#%3Csvg%20viewBox%...
| mythz wrote:
| doesn't work, looks like the link or SVG was cropped.
| bavell wrote:
| No pelican for me :(
| simonw wrote:
| Oops, here's a working link:
| https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0...
| throwaw12 wrote:
| Congratulations, great work Kimi team.
|
| Why is it that Claude is still at the top in coding? Are they
| heavily focused on training for coding, or is their general
| training so good that it performs well in coding?
|
| Someone please beat the Opus 4.5 in coding, I want to replace it.
| MattRix wrote:
| Opus 4.5 only came out two months ago, and yes Anthropic spends
| a lot of effort making it particularly good at coding.
| Balinares wrote:
| I replaced Opus with Gemini Pro and it's just plain a better
| coder IMO. It'll restructure code to enable support for new
| requirements, where Opus seems to just pile on more indirection
| layers by default, when it doesn't outright hardcode special
| cases inside existing functions, or drop the cases it's failing
| to support from the requirements while smugly informing you
| that you don't need them anyway.
| pokot0 wrote:
| I don't think that kind of difference in benchmarks has any
| meaning at all. Your agentic coding tool and the task you are
| working on introduce a lot more "noise" than that small delta.
|
| Also consider they are all overfitting on the benchmark
| itself, so there might be that as well (which can go in
| either direction).
|
| I consider the top models practically identical for coding
| applications (just personal experience with heavy use of both
| GPT5.2 and Opus 4.5).
|
| Excited to see how this model compares in real applications.
| It's 1/5th of the price of top models!!
| symisc_devel wrote:
| Gemini 3 pro is way better than Opus especially for large
| codebases.
| redox99 wrote:
| My experience is the total opposite.
| rubslopes wrote:
| Do you use it only for code editing, or also for running bash
| commands? My experience is that it is very bad at the latter.
| jdeng wrote:
| Glad to see open-source models catching up and treating vision
| as a first-class citizen (a.k.a. a native multimodal agentic
| model). GLM and Qwen models take a different approach, with a
| base model and a vision variant (glm-4.6 vs glm-4.6v).
|
| I guess after Kimi K2.5, other vendors will go the same route?
|
| Can't wait to see how this model performs on computer automation
| use cases like VITA AI Coworker.
|
| https://www.vita-ai.net/
| teiferer wrote:
| Can we please stop calling these models "open source"? Yes, the
| weights are open, so "open weight" maybe. But the source isn't
| open: the thing that allows you to re-create it. That's what
| "open source" used to mean (together with a license that allows
| you to use that source for various things).
| Onavo wrote:
| No major AI lab will admit to training on proprietary or
| copyrighted data so what you are asking is an impossibility.
| You can make a pretty good LLM if you train on Anna's Archive
| but it will either be released anonymously, or with a research
| only non commercial license.
|
| There isn't enough public-domain data to create good LLMs,
| especially once you get into the newer benchmarks that expect
| PhD-level domain expertise in various niche verticals.
|
| It's also a logical impossibility to create a zero-knowledge
| proof that would let you attribute the model to specific
| training data without admitting to using it.
|
| I can think of a few technical options but none would hold
| water legally.
|
| You can use a Σ-protocol OR-composition to prove that it was
| trained either on a copyrighted dataset or a non-copyrighted
| dataset, without admitting which one (technically
| interesting, legally unsound).
|
| You can prove that a model trained on copyrighted data is
| statistically indistinguishable from one trained on non-
| copyrighted data (an information-theoretic impossibility
| unless there exists as much public-domain data as copyrighted
| data, in similar distributions).
|
| You can prove a public-domain and a copyrighted dataset are
| equivalent if the model performance they produce is
| indistinguishable.
|
| All the proofs fail irl, ignoring the legal implications,
| because there's less public domain information, so given the
| lemma that more training data == improved model performance,
| all the above are close to impossible.
| dev_l1x_be wrote:
| I had these weird situations where some models refuse to use
| SSH as a tool. Not sure if it was a coding-tool limitation or
| if it's baked into some of the models.
| erichocean wrote:
| Running on Apple Silicon:
| https://x.com/awnihannun/status/2016221496084205965
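|
| For anyone who wants to try that, a minimal mlx-lm sketch (the
| quantized repo name is a placeholder, not a confirmed upload, and
| even at 4-bit you need on the order of 600GB of unified memory):
|
|     # Requires `pip install mlx-lm` on Apple Silicon.
|     from mlx_lm import load, generate
|
|     # Placeholder repo id for an MLX quantization of K2.5.
|     model, tokenizer = load("mlx-community/Kimi-K2.5-4bit")
|
|     messages = [{"role": "user",
|                  "content": "Write a haiku about agent swarms."}]
|     prompt = tokenizer.apply_chat_template(messages,
|                                            add_generation_prompt=True)
|     print(generate(model, tokenizer, prompt=prompt, max_tokens=100))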
| stopachka wrote:
| Is there a startup that takes models like this and effectively
| gives you a secure setup, where you have (a) a mobile app that
| (b) talks to some giant machine that only you have access to?
|
| If a $10k computer could run this, it may be worth it to have a
| "fully on-prem" version of ChatGPT running for you.
| 2001zhaozhao wrote:
| The directionally interesting part is that according to the
| announcement, K2.5 seems to be trained specifically to create
| sub-agents and work in an agent swarm usefully. The key part is
| that you don't need to manually create or prompt sub-agents, K2.5
| creates them automatically, so from the looks of things it's
| similar to Claude Code dynamic sub-agents except the model is
| trained to scale to many more agents autonomously.
|
| I wonder whether Claude is doing the same kind of training and
| it's coming with the next model, and that's why the agent swarm
| mode in Claude Code is hidden for now. We might be getting very
| very good agent orchestrators/swarms very soon.
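|
| Mechanically it's presumably still tool calling underneath; a
| hypothetical sketch of what a "spawn sub-agent" tool could look
| like (the name, schema, and loop are mine, not Kimi's or
| Anthropic's actual API):
|
|     # Hypothetical "spawn sub-agent" tool exposed to an orchestrator model.
|     SPAWN_SUBAGENT_TOOL = {
|         "type": "function",
|         "function": {
|             "name": "spawn_subagent",
|             "description": "Start a specialized sub-agent, return its result.",
|             "parameters": {
|                 "type": "object",
|                 "properties": {
|                     "role": {"type": "string"},   # e.g. "test-writer"
|                     "task": {"type": "string"},   # what the sub-agent should do
|                     "tools": {"type": "array", "items": {"type": "string"}},
|                 },
|                 "required": ["role", "task"],
|             },
|         },
|     }
|
|     def handle_tool_call(call, run_agent):
|         # The orchestrator emits spawn_subagent like any other tool call;
|         # the harness runs each sub-agent (possibly many in parallel) and
|         # feeds the results back as tool outputs.
|         if call["name"] == "spawn_subagent":
|             args = call["arguments"]
|             return run_agent(role=args["role"], task=args["task"],
|                              tools=args.get("tools", []))
|         raise ValueError(f"unknown tool {call['name']}")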
| culi wrote:
| I posted this elsewhere but thought I'd repost here:
|
| * https://lmarena.ai/leaderboard -- crowd-sourced head-to-head
| battles between models using ELO
|
| * https://dashboard.safe.ai/ -- CAIS' incredible dashboard
|
| * https://clocks.brianmoore.com/ -- a visual comparison of how
| well models can draw a clock. A new clock is drawn every minute
|
| * https://eqbench.com/ -- emotional intelligence benchmarks for
| LLMs
|
| * https://www.ocrarena.ai/battle -- OCR battles, ELO
|
| * https://mafia-arena.com/ -- LLMs playing the social deduction
| game Mafia
|
| * https://openrouter.ai/rankings -- marketshare based on
| OpenRouter
| enricoros wrote:
| CCP-bench has gotten WAY better on K2.5!
|
| https://big-agi.com/static/kimi-k2.5-less-censored.jpg
___________________________________________________________________
(page generated 2026-01-28 07:01 UTC)