[HN Gopher] Kimi K2 is a state-of-the-art mixture-of-experts (Mo...
___________________________________________________________________
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language
model
GitHub: https://github.com/MoonshotAI/Kimi-K2
Author : c4pt0r
Score : 289 points
Date : 2025-07-11 15:38 UTC (2 days ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| gs17 wrote:
| > 1T total / 32B active MoE model
|
| Is this the largest open-weight model?
| bigeagle wrote:
| I believe so.
|
| Grok-1 is 314B, DeepSeek-V3 is 671B, and recent new open-weight
| models are around 70B~300B.
| adt wrote:
| No.
|
| At 1T MoE on 15.5T tokens, K2 is one of the largest open source
| models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens:
| https://huggingface.co/CofeAI/Tele-FLM-1T
|
| You can always check here: https://lifearchitect.ai/models-table/
| simonw wrote:
| Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct
| model weights are 958.52 GB
| c4pt0r wrote:
| Paired with programming tools like Claude Code, it could be a
| low-cost/open-source replacement for Sonnet
| kkzz99 wrote:
| According to the benchmarks it's closer to Opus, but I venture
| primarily for English and Chinese.
| martin_ wrote:
| how do you run a 1T param model at low cost?
| maven29 wrote:
| 32B active parameters with a single shared expert.
| JustFinishedBSG wrote:
| This doesn't change the VRAM usage, only the compute
| requirements.
| maven29 wrote:
| You can probably run this on CPU if you have a 4090D for
| prompt processing, since 1TB of DDR4 only comes out to
| around $600.
|
| For GPU inference at scale, I think token-level batching
| is used.
| t1amat wrote:
| With 32B active parameters it would be ridiculously slow
| at generation.
| selfhoster11 wrote:
| DDR3 workstation here - R1 generates at 1 token per
| second. In practice, this means that for complex queries,
| the speed of replying is closer to an email response than
| a chat message, but this is acceptable to me for
| confidential queries or queries where I need the model to
| be steerable. I can always hit the R1 API from a provider
| instead, if I want to.
|
| Given that R1 uses 37B active parameters (compared to 32B
| for K2), K2 should be slightly faster than that - around
| 1.15 tokens/second.
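|
| Back-of-the-envelope, since CPU decode is basically memory-
| bandwidth bound (a rough Python sketch; the only inputs are the
| active-parameter counts and my measured R1 speed):
|
|     # tokens/sec scales roughly with 1 / (active weight bytes per token)
|     r1_active = 37e9   # DeepSeek R1 active parameters
|     k2_active = 32e9   # Kimi K2 active parameters
|     r1_tps = 1.0       # measured on this DDR3 workstation
|
|     k2_tps = r1_tps * (r1_active / k2_active)
|     print(f"estimated K2 speed: {k2_tps:.2f} tok/s")  # ~1.16, i.e. the ~1.15 above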
| CamperBob2 wrote:
| That's pretty good. Are you running the real 600B+
| parameter R1, or a distill, though?
| zackangelo wrote:
| Typically a combination of expert level parallelism and
| tensor level parallelism is used.
|
| For the big MLP tensors they would be split across GPUs
| in a cluster. Then for the MoE parts you would spread the
| experts across the GPUs and route to them based on which
| experts are active (there would likely be more than one
| if the batch size is > 1).
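|
| A toy sketch of the routing step (numbers invented, not K2's
| actual config; in a real deployment each expert's weights live
| on whichever GPU the expert-parallel layout assigns them to):
|
|     import numpy as np
|
|     n_experts, top_k, d_model, n_tokens = 8, 2, 16, 4
|     rng = np.random.default_rng(0)
|
|     router_w = rng.standard_normal((d_model, n_experts))
|     tokens = rng.standard_normal((n_tokens, d_model))
|
|     logits = tokens @ router_w                        # router score per (token, expert)
|     chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k experts per token
|     gates = np.take_along_axis(logits, chosen, axis=-1)
|     gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen
|
|     for t in range(n_tokens):
|         # with expert parallelism, token t's hidden state is sent to the
|         # devices holding the experts in chosen[t], then gathered back
|         print(f"token {t}: experts {chosen[t].tolist()}, gates {gates[t].round(2).tolist()}")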
| selfhoster11 wrote:
| It does not have to be VRAM, it could be system RAM, or
| weights streamed from SSD storage. Reportedly, the latter
| method achieves around 1 token per second on computers
| with 64 GB of system RAM.
|
| R1 (and K2) is MoE, whereas Llama 3 is a dense model
| family. MoE actually makes these models practical to run
| on cheaper hardware. DeepSeek R1 is more comfortable for
| me than Llama 3 70B for exactly that reason - if it
| spills out of the GPU, you take a large performance hit.
|
| If you need to spill into CPU inference, you really want
| to be multiplying a different set of 32B weights for
| every token compared to the same 70B (or more) instead,
| simply because the computation takes so long.
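|
| To put rough numbers on it (a sketch; assumes 8-bit weights and
| that decode speed is limited purely by how fast the active
| weights stream out of system RAM - the bandwidth figure is an
| assumption, plug in your own):
|
|     bandwidth_gb_s = 50   # assumed dual-channel DDR4-ish figure
|     bytes_per_param = 1   # ~8-bit quantization
|
|     for name, active in [("dense 70B", 70e9), ("MoE, 32B active", 32e9)]:
|         active_gb = active * bytes_per_param / 1e9
|         print(f"{name}: ~{bandwidth_gb_s / active_gb:.2f} tok/s upper bound")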
| refulgentis wrote:
| The number of people who will be using it at 1 token/sec
| because there's no better option, _and_ have 64 GB of
| RAM, is _vanishingly_ small.
|
| IMHO it sets the local LLM community back when we lean on
| extreme quantization & streaming weights from disk to say
| something is possible*, because when people try it out,
| it turns out it's an awful experience.
|
| * the implication being, _anything_ is possible in that
| scenario
| homarp wrote:
| An agentic loop can run all night long. It's just a
| different way to work: prepare your prompt queue, set it
| up, check the results in the morning, adjust. 'Local vibe' in
| 10h instead of 10min is still better than 10 days of
| manual side coding.
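|
| The queue can be as dumb as a folder of prompt files drained by
| a script (a sketch; the OpenAI-compatible server at
| localhost:8080 and the "kimi-k2" model name are assumptions -
| adjust for whatever you actually serve it with):
|
|     import glob, json, pathlib, urllib.request
|
|     URL = "http://localhost:8080/v1/chat/completions"
|
|     for path in sorted(glob.glob("tasks/*.md")):
|         prompt = pathlib.Path(path).read_text()
|         body = json.dumps({"model": "kimi-k2",
|                            "messages": [{"role": "user", "content": prompt}]}).encode()
|         req = urllib.request.Request(URL, data=body,
|                                      headers={"Content-Type": "application/json"})
|         with urllib.request.urlopen(req) as resp:
|             answer = json.load(resp)["choices"][0]["message"]["content"]
|         pathlib.Path(path + ".answer.md").write_text(answer)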
| hereme888 wrote:
| Right on! Especially if its coding abilities are better
| than Claude 4 Opus. I spent thousands on my PC in
| anticipation of this rather than to play fancy video
| games.
|
| Now, where's that spare SSD...
| selfhoster11 wrote:
| Good. Vanishingly small is still more than zero. Over
| time, running such models will become easier too, as
| people slowly upgrade to better hardware. It's not like
| there aren't options for the compute-constrained either.
| There are lots of Chinese models in the 3-32B range, and
| Gemma 3 is particularly good too.
|
| I will also point out that having three API-based
| providers deploying an impractically-large open-weights
| model beats the pants off having just one. Back in the
| day, this was called second-sourcing IIRC. With
| proprietary models, you're at the mercy of one
| corporation and their Kafkaesque ToS enforcement.
| refulgentis wrote:
| You said "Good." then wrote a nice stirring bit about how
| having a bad experience with a 1T model will force people
| to try 4B/32B models.
|
| That seems separate from the post it was replying to,
| about 1T param models.
|
| If it is intended to be a reply, it hand waves about how
| having a bad experience with it will teach them to buy
| more expensive hardware.
|
| Is that "Good."?
|
| The post points out that if people are taught they need
| an expensive computer to get 1 token/second, much less
| try it and find out it's a horrible experience (let's
| talk about prefill), it will turn them off against local
| LLMs unnecessarily.
|
| Is that "Good."?
| jimjimwii wrote:
| Had you posted this comment in the early 90s about linux
| instead of local models, it would have made about the
| same amount of sense but aged just as poorly as this
| comment will.
|
| I'll remain here, happily using my 2.something tokens per
| second model.
| cyanf wrote:
| This is both the largest oss model release thus far, and the
| largest Muon training run.
| wiradikusuma wrote:
| I've only started using Claude, Gemini, etc in the last few
| months (I guess it comes with age, I'm no longer interested in
| trying the latest "tech"). I assume those are "non-agentic"
| models.
|
| From reading articles online, "agentic" means like you have a
| "virtual" Virtual Assistant with "hands" that can google, open
| apps, etc, on their own.
|
| Why not use existing "non-agentic" model and "orchestrate" them
| using LangChain, MCP etc? Why create a new breed of model?
|
| I'm sorry if my questions sound silly. Following AI world is like
| following JavaScript world.
| ozten wrote:
| It is not a silly question. The various flavors of LLM have
| issues with reliability. In software we expect five 9s, LLMs
| aren't even at one 9. Early on it was reliability of them
| writing JSON output. Then instruction following. Then tool use.
| Now it's "computer use" and orchestration.
|
| Creating models for this specific problem domain will have a
| better chance at reliability, which is not a solved problem.
|
| Jules is the Gemini coder that links to GitHub. Half the time
| it doesn't create a pull request and forgets and assumes I'll
| do some testing or something. It's wild.
| simonw wrote:
| "Agentic" and "agent" can mean pretty much anything, there are
| a ton of different definitions out there.
|
| When an LLM says it's "agentic" it usually means that it's been
| optimized for tool use. Pretty much _all_ the big models (and
| most of the small ones) are designed for tool use these days;
| it's an incredibly valuable feature for a model to offer.
|
| I don't think this new model is any more "agentic" than o3,
| o4-mini, Gemini 2.5 or Claude 4. All of those models are
| trained for tools, all of them are very competent at running
| tool calls in a loop to try to achieve a goal they have been
| given.
| dcre wrote:
| Reasonable question, simple answer: "New breed of model" is
| overstating it -- all these models for years have been fine-
| tuned using reinforcement learning on a variety of tasks, it's
| just that the set of tasks (and maybe the amount of RL) has
| changed over time to include more tool use tasks, and this has
| made them much, much better at the latter. The explosion of
| tools like Claude Code this year is driven by the models just
| being more effective at it. The orchestration external to the
| model you mention is what people did before this year and it
| did not work as well.
| selfhoster11 wrote:
| > I'm sorry if my questions sound silly. Following AI world is
| like following JavaScript world.
|
| You are more right than you could possibly imagine.
|
| TL;DR: "agentic" just means "can call tools it's been given
| access to, autonomously, and then access the output" combined
| with an infinite loop in which the model runs over and over
| (compared to a one-off interaction like you'd see in ChatGPT).
| MCP is essentially one of the methods to expose the tools to
| the model.
|
| Is this something the models could do for a long while with a
| wrapper? Yup. "Agentic" is the current term for it, that's all.
| There's some hype around "agentic AI" that's unwarranted, but
| part of the reason for the hype is that models have become
| better at tool calling and using data in their context since
| the early days.
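|
| In code, the "tools in a loop" part is about this much (a sketch
| using the OpenAI-style tool-calling API; the base_url, the
| "kimi-k2" model name and the single read_file tool are all made
| up for the example):
|
|     import json
|     from openai import OpenAI  # any OpenAI-compatible endpoint works
|
|     client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
|     tools = [{"type": "function", "function": {
|         "name": "read_file", "description": "Read a local file",
|         "parameters": {"type": "object",
|                        "properties": {"path": {"type": "string"}},
|                        "required": ["path"]}}}]
|
|     messages = [{"role": "user", "content": "Summarise notes.txt"}]
|     while True:
|         reply = client.chat.completions.create(model="kimi-k2",
|                                                messages=messages, tools=tools)
|         msg = reply.choices[0].message
|         if not msg.tool_calls:          # no tool requested: final answer
|             print(msg.content)
|             break
|         messages.append(msg)            # keep the assistant turn in context
|         for call in msg.tool_calls:     # run each tool, feed the output back
|             args = json.loads(call.function.arguments)
|             result = open(args["path"]).read()
|             messages.append({"role": "tool", "tool_call_id": call.id,
|                              "content": result})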
| simonw wrote:
| Pelican on a bicycle result:
| https://simonwillison.net/2025/Jul/11/kimi-k2/
| _alex_ wrote:
| wow!
| ebiester wrote:
| At this point, they _have_ to be training it. At what point
| will you start using something else?
| simonw wrote:
| Once I get a picture that genuinely looks like a pelican
| riding a bicycle!
| qmmmur wrote:
| I'm glad we are looking to build nuclear reactors so we can do
| more of this...
| sergiotapia wrote:
| me too - we must energymaxx. i want a nuclear reactor in my
| backyard powering everything. I want ac units in every room
| and my open door garage while i workout.
| GenerWork wrote:
| You're saying this in jest, but I would LOVE to have a
| nuclear reactor in my backyard that produced enough power
| to where I could have a minisplit for every room in my
| house, including the garage so I could work out in there.
| CaptainFever wrote:
| Related: https://en.wikipedia.org/wiki/Kardashev_scale
|
| > The Kardashev scale (Russian: шкала Кардашёва,
| romanized: shkala Kardashyova) is a method of measuring a
| civilization's level of technological advancement based
| on the amount of energy it is capable of harnessing and
| using.
|
| > Under this scale, the sum of human civilization does
| not reach Type I status, though it continues to approach
| it.
| sergiotapia wrote:
| I am not joking
| 1vuio0pswjnm7 wrote:
| "I'm glad we are looking to build nuclear reactors so we can
| do more of this..."
|
| Does this actually mean "they" not "we"
| csomar wrote:
| Much better than that of Grok 4.
| jug wrote:
| That's perhaps the best one I've seen yet! For an open weight
| model, this performance is of course particularly remarkable
| and impactful.
| MaxPock wrote:
| Would be hilarious if Zuck with his billion dollar poaching
| failed to beat budget Chinese models.
| physix wrote:
| That reminds me of a thought I had about the poachings.
|
| The poaching was probably more aimed at hamstringing Meta's
| competition.
|
| Because the disruption caused by them leaving in droves is
| probably more severe than the benefits of having them on board.
| Unless they are gods, of course.
| stogot wrote:
| I thought that too
| rfoo wrote:
| Wikipedia lists a FAIR alumnus as a cofounder of this "Moonshot
| AI". Makes it funnier, probably.
| jug wrote:
| I can't tell if Kimi is quite top tier, but since Llama 4
| performed so poorly, then yes, this did in fact just happen.
| aliljet wrote:
| If the SWE Bench results are to be believed... this looks best in
| class right now for a local LLM. To be fair, show me the guy who
| is running this locally...
| selfhoster11 wrote:
| It's challenging, but not impossible. With 2-bit quantisation,
| only about 250-ish gigabytes of RAM is required. It doesn't
| have to be VRAM either, and you can mix and match GPU+CPU
| inference.
|
| In addition, some people on /r/localLlama are having success
| with streaming the weights off SSD storage at 1 token/second,
| which is about the rate I get for DeepSeek R1.
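|
| The 250-ish GB number is just the quantisation arithmetic (a
| sketch; ignores KV cache and the fact that real 2-bit GGUF
| quants keep some tensors at higher precision):
|
|     total_params = 1.0e12   # ~1T parameters
|     bits_per_weight = 2.0   # idealised 2-bit quantisation
|
|     gb = total_params * bits_per_weight / 8 / 1e9
|     print(f"~{gb:.0f} GB of weights")   # ~250 GB before overhead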
| helloericsf wrote:
| How does it stack up against the new Grok 4 model?
| Imustaskforhelp wrote:
| I really really want to try this model for free since I just
| don't have a gpu.
|
| Is there any way that I could do so?
|
| Open Router? Or does kimi have their own website? Just curious to
| really try it out!
| blahgeek wrote:
| Kimi.com
| Alifatisk wrote:
| Quite impressive benchmark, how come I don't see Kimi in
| Artificial analysis benchmarks?
| viraptor wrote:
| How well separated are experts per domain in a model like that?
| Specifically, if I'm interested in a programming use only, could
| we possibly strip it to one or two of them? Or should I assume a
| much wider spread? (And there would be some overlap anyway from
| the original root model)
| orbital-decay wrote:
| Inseparable; routing is done per token in a statistically
| optimal way, not per request on a knowledge-domain basis.
| viraptor wrote:
| Sure, it's done per token, but the question is: how much do
| the knowledge domains match up with experts. I could not find
| hard data on this.
| boroboro4 wrote:
| Check out the DeepSeek-V3 model paper. They changed the way
| they train experts (went from an aux loss to a different kind
| of expert-separation training). It did improve expert domain
| specialization; they have neat graphics on it in the paper.
| renonce wrote:
| My experience is that experts are not separated in any
| intuitive way. I would be very interested (and surprised) if
| someone manages to prune a majority of experts in a way that
| preserves model capabilities in a specific domain but not
| others.
|
| See https://github.com/peteryuqin/Kimi-K2-Mini, a project that
| keeps a small portion of the experts and layers and tries to
| keep the model's capabilities across multiple domains.
| viraptor wrote:
| Sounds like dumping the routing information from programming
| questions would answer that... I guess I can do a dump from
| Qwen or DeepSeek locally. You'd think someone would have created
| that kind of graph already, but I couldn't find one.
|
| What I did find instead is that some MoE models are
| explicitly domain-routed (MoDEM), but it doesn't apply to
| deepseek which is just equally load balanced, so it's
| unlikely to apply to Kimi. On the other hand,
| https://arxiv.org/html/2505.21079v1 shows modality
| preferences between experts, even in mostly random training.
| So maybe there's something there.
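|
| The dump itself is mostly just counting router picks per domain
| (a sketch; `route` here is a fake stand-in for whatever hook
| your inference stack exposes for per-token expert ids):
|
|     from collections import Counter
|
|     def route(prompt):
|         # stand-in: pretend there are 64 experts and hash tokens to them
|         return [hash(tok) % 64 for tok in prompt.split()]
|
|     def expert_histogram(prompts):
|         counts = Counter()
|         for p in prompts:
|             counts.update(route(p))
|         return counts
|
|     code_hist = expert_histogram(["def add(a, b): return a + b"])
|     prose_hist = expert_histogram(["the quick brown fox jumps over the lazy dog"])
|     shared = sum((code_hist & prose_hist).values())
|     print(f"expert activations shared between the two domains: {shared}")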
| brcmthrowaway wrote:
| Is Kimi the new deep seek?
| Alifatisk wrote:
| It kinda feels like it, but Moonshot's deliveries have been like
| this before as well; it's just that their new release got way
| more of the spotlight than usual. When they released Kimi k1.5,
| those benchmarks were impressive at the time! But everyone was
| busy with DeepSeek v3 and QwQ-32B.
| ozgune wrote:
| This is a very impressive general-purpose LLM (in the GPT-4o /
| DeepSeek-V3 class). It's also open source.
|
| I think it hasn't received much attention because the frontier
| shifted to reasoning and multi-modal AI models. In accuracy
| benchmarks, all the top models are reasoning ones:
|
| https://artificialanalysis.ai/
|
| If someone took Kimi k2 and trained a reasoning model with it,
| I'd be curious how that model performs.
| GaggiX wrote:
| >If someone took Kimi k2 and trained a reasoning model with it
|
| I imagine that's what they are doing at MoonshotAI right now.
| Alifatisk wrote:
| Why haven't Kimi's current and older models been benchmarked and
| added to Artificial Analysis yet?
| awestroke wrote:
| This is the model release that made Sam Altman go "Oh wait
| actually we can't release the new open source model this week,
| sorry. Something something security concerns".
|
| Perhaps their open source model release doesn't look so good
| compared to this one
| data_maan wrote:
| "Open source" lol
|
| Open-weight. As usual, you don't get the dataset, training
| scripts, etc.
| mistercheph wrote:
| Won't happen under the current copyright regime; it is
| impossible to train SOTA models without copyrighted text, so how
| do you propose distributing that?
| irthomasthomas wrote:
| List the titles.
| mixel wrote:
| But they probably don't have the rights to actually train
| on them, and that's why they don't publish the list.
| Otherwise it may be laziness, who knows.
| msk-lywenn wrote:
| Bibtex
| CaptainFever wrote:
| It's not even open-weight. It's weight-available. It uses a
| "modified MIT license": Modified MIT License
| Copyright (c) 2025 Moonshot AI Permission is
| hereby granted, free of charge, to any person obtaining a copy
| of this software and associated documentation files (the
| "Software"), to deal in the Software without
| restriction, including without limitation the rights to
| use, copy, modify, merge, publish, distribute, sublicense,
| and/or sell copies of the Software, and to permit
| persons to whom the Software is furnished to do so,
| subject to the following conditions: The above
| copyright notice and this permission notice shall be included
| in all copies or substantial portions of the Software.
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
| EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
| WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
| PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
| OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
| TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
| CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
| THE SOFTWARE. Our only modification
| part is that, if the Software (or any derivative works
| thereof) is used for any of your commercial products or
| services that have more than 100 million monthly active
| users, or more than 20 million US dollars (or
| equivalent in other currencies) in monthly revenue, you shall
| prominently display "Kimi K2" on the user interface of
| such product or service.
| mitthrowaway2 wrote:
| This seems significantly more permissive than GPL. I think
| it's reasonable to consider it open-weight.
| MallocVoidstar wrote:
| 4-clause BSD is considered open source by Debian and the FSF
| and has a similar requirement.
| weitendorf wrote:
| So "MIT with attribution" (but only for huge commercial use
| cases making tons of money off the product) is not open-
| weight? Do you consider CC BY photos on Wikipedia to be Image
| Available or GPL licensed software to be code-available too?
|
| Tangent: I don't understand the contingent that gets upset
| about open LLMs not shipping with their full training regimes
| or source data. The software a company spent hundreds of
| millions of dollars creating, which you are now free to use
| and distribute with essentially no restrictions, is open
| source. It has weights in it, and a bunch of related software
| for actually running a model with those weights. How dare
| they!
| spookie wrote:
| We really need to stop diluting the meaning of open source
| data_maan wrote:
| Open source" lol
|
| It's open-weight. As usual, you don't get the dataset, training
| scripts, etc.
| vessenes wrote:
| I tried Kimi on a few coding problems that Claude was spinning
| on. It's good. It's huge, way too big to be a "local" model -- I
| think you need something like 16 H200s to run it - but it has a
| slightly different vibe than some of the other models. I liked
| it. It would definitely be useful in ensemble use cases at the
| very least.
| summarity wrote:
| Reasonable speeds are possible with 4bit quants on 2 512GB Mac
| Studios (MLX TB4 Ring - see
| https://x.com/awnihannun/status/1943723599971443134) or even a
| single socket Epyc system with >1TB of RAM (about the same real
| world memory throughput as the M Ultra). So $20k-ish to play
| with it.
|
| For real-world speeds though yeah, you'd need serious hardware.
| This is more of a "deploy your own stamp" model, less a "local"
| model.
| refulgentis wrote:
| I write a local LLM client, but sometimes, I hate that local
| models have enough knobs to turn that people can advocate
| they're reasonable in _any_ scenario - in yesterday's post
| re: Kimi k2, multiple people spoke up that you can "just"
| stream the active expert weights out of 64 GB of RAM, and use
| the lowest GGUF quant, and then you get something that rounds
| to 1 token/s, and that is reasonable for use.
|
| Good on you for not exaggerating.
|
| I am very curious what exactly they see in that, 2-3 people
| hopped in to handwave that you just have it do agent stuff
| overnight and it's well worth it. I can't even begin to
| imagine unless you have a metric **-ton of easily solved
| problems that aren't coding. Even a 90% success rate gets you
| into "useless" territory quick when one step depends on the
| other, and you're running it autonomously for hours.
| segmondy wrote:
| I run DeepSeek at 5 tk/sec at home and I'm happy with it. I
| don't need to do agent stuff to gain from it. I was saving up
| to eventually build out enough to run it at 10 tk/sec, but
| with Kimi K2 the plan has changed, and the savings continue
| with a goal of running it at 5 tk/sec at home.
| fzzzy wrote:
| I agree, 5 tokens per second is plenty fast for casual
| use.
| refulgentis wrote:
| Cosign for chat, that's my bar for usable on mobile phone
| (and correlates well with avg. reading speed)
| overfeed wrote:
| Also works perfectly fine in fire-and-forget, non-
| interactive agentic workflows. My dream scenario is that
| I create a bunch of kanban tickets and assign them to one
| or more AI personas[1], and wake up to some Pull Requests
| the next morning. I'd be more concerned about tickets-
| per-day, and not tk/s as I have no interest in watching
| the inner-workings of the model.
|
| 1. Some more creative than others, with slightly
| different injected prompts or perhaps even different
| models entirely.
| numpad0 wrote:
| > I create a bunch of kanban tickets and assign them to
| one or more AI personas[1],
|
| Yeah, that. Why can't we just `find ./tasks/ | grep \\.md$
| | xargs llm`? Can't we just write up a government-proposal
| style document, have the LLM recurse down into
| sub-sub-projects and back up until the original proposal
| document can be translated into a completion report?
| Constantly correcting a humongous LLM with infinite
| context length that can keep everything in its head
| doesn't feel like the right approach.
| londons_explore wrote:
| In my experience, this sort of thing _nearly_ works...
| But never quite works well enough and errors and
| misunderstandings build at every stage and the output is
| garbage.
|
| Maybe with bigger models it'll work well.
| SV_BubbleTime wrote:
| It was, last year 5tk/s was reasonable. If you wanted to
| proof read a paragraph or rewrite some bullet points into
| a PowerPoint slide.
|
| Now, with agentic coding, thinking models, a "chat with
| my pdf" or whatever artifacts are being called now, no, I
| don't think 5/s is enough.
| gpm wrote:
| > or even a single socket Epyc system with >1TB of RAM
|
| How many tokens/second would this likely achieve?
| neuroelectron wrote:
| 1
| kachapopopow wrote:
| around 1 by the time you try to do anything useful with it
| (>10000 tokens)
| tuananh wrote:
| looks very much usable for local usage.
| wongarsu wrote:
| Reasonable speeds are possible if you pay someone else to run
| it. Right now both NovitaAI and Parasail are running it, both
| available through Openrouter and both promising not to store
| any data. I'm sure the other big model hosters will follow if
| there's demand.
|
| I may not be able to reasonably run it myself, but at least I
| can choose who I trust to run it and can have inference
| pricing determined by a competitive market. According to
| their benchmarks the model is roughly in a class with Claude 4
| Sonnet, yet already costs less than one third of Sonnet's
| inference pricing.
| winter_blue wrote:
| I'm actually finding Claude 4 Sonnet's thinking model to be
| too slow to meet my needs. It literally takes several
| minutes per query on Cursor.
|
| So running it locally is the exact opposite of what I'm
| looking for.
|
| Rather, I'm willing to pay more, to have it be run on a
| faster than normal cloud inference machine.
|
| Anthropic is already too slow.
|
| Since this model is open source, maybe someone could offer
| it at a "premium" pay per use price, where the response
| rate / inference is done a lot faster, with more resources
| thrown at it.
| terhechte wrote:
| Anthropic isn't slow. I'm running Claude Max and it's
| pretty fast. The problem is that Cursor slowed down their
| responses in order to optimize their costs. At least a
| ton of people are experiencing this.
| satvikpendem wrote:
| > It literally takes several minutes per query on
| _Cursor._
|
| There's your issue. Use Claude Code or the API directly
| and compare the speeds. Cursor is slowing down requests
| to maintain costs.
| spaceman_2020 wrote:
| This is fairly affordable if you're a business honestly
| moffkalast wrote:
| Still pretty good, someone with enough resources could distil
| it down to a more manageable size for the rest of us.
| handzhiev wrote:
| I tried it a couple of times in comparison to Claude. Kimi
| wrote much simpler and more readable code than Claude's over-
| engineered solutions. It missed a few minor subtle edge cases
| that Claude took care of though.
| airstrike wrote:
| Claude what? Sonnet? 3.7? 3.5? Opus? 4?
| nathan_compton wrote:
| The first question I gave it (a sort of pretty simple
| recreational math question I asked it to code up for me) it got
| outrageously wrong. In fairness, and to my surprise, OpenAI's
| model also failed at this task, although with some prompting it
| sort of got it.
| Xmd5a wrote:
| I asked it to give me its opinion on a mail I'm writing. 95% of
| its content is quotes from famous authors, and the 5% I wrote
| is actually minimal glue in-between.
|
| All the models I tested, which include Sonnet 4, DeepSeek R1,
| 4o and Gemini 2.5, understand this isn't your normal email and
| that what I'm asking for is literary/philosophical criticism, not
| remarks about conventions, formatting or how to convey my
| message in a more impactful way. Quick fix:
|
|     - Replace half the commas with periods.
|     - Let one metaphor bloom and then stop.
|     - Read it aloud; wherever you inhale, insert a line break.
|
|     Before:
|     > And if the collective imagination is so seized by it, it's
|     > because we are already--not in some other world--but in
|     > this very life, in a state of socio-, photo-, and
|     > videosynthesis.
|
|     After:
|     > The collective imagination is seized.
|     > We are not in another world.
|     > We are in this life, in a state of videosynthesis.
|
| Yes, this quote is by Baudrillard. None of the other models
| fixated on the fact that it's an email (I only used the word
| once in the prompt). My gut feeling is that this reflects not
| so much a lack of intelligence as a difference in model
| personality. Here's what it replied when I shared Gemini's
| analysis. The point was to have it understand that:
|
|     Gemini, second sentence: It is less a simple email than an
|     intellectual and personal exploration still in the making.
|
| Point deflected as soon as understood:
|
|     **Gemini is polite and pedagogical; I'm going to be blunt.**
|     1. **This isn't an email, it's a trip.** ...
|     2. **The content is good, the format is not.** ...
|     3. **Gemini is right about one thing:** ...
|        * cut 80% of the material,
|        * ask **one clear question**
|        * put the rest in a separate Google Doc or PDF for posterity.
|     4. **Brutal summary:** Keep your draft for yourself -- it's
|        valuable. But for the email, write a new one in **three
|        sentences**:
|        * thank them,
|        * recall the context in one line,
|        * ask the question.
|     You've already got Gemini's answer to stroke your ego -- now
|     give me the short version you're actually going to send.
|
| The solution to sycophancy is not disparagement (misplaced
| criticism). The classical true/false positive/negative dilemma
| is at play here. I guess the bot got caught in the crossfire of
| 1° its no-bullshit attitude (it can only be an attitude),
| 2° a preference for delivering blunt criticism over insincere
| flattery, and 3° being a helpful assistant. Remove point 3°,
| and it could have replied: "I'm not engaging in this nonsense".
| Preserve it and it will politely suggest that you condense your
| bullshit text, because shorter explanations are better than
| long winding rants (it's probably in the prompt).
| jug wrote:
| I like new, solid non-reasoning models that push the frontier.
| These still have nice use cases (basically anything where logic
| puzzles or STEM subjects don't apply) where you don't want to
| spend cash on reasoning tokens.
| fzysingularity wrote:
| If I had to guess, the OpenAI open-source model got delayed
| because Kimi K2 stole their thunder and beat their numbers.
| tempaccount420 wrote:
| Time to RL the hell out of it so it looks better on
| benchmarks... It's going to be fried.
| fzysingularity wrote:
| If I had to guess, the OpenAI open-source model got delayed
| because Kimi K2 stole their thunder and beat their numbers.
| irthomasthomas wrote:
| Someone at OpenAI did say it was too big to host at home, so
| you could be right. They will probably be benchmaxxing right
| now, searching for a few evals they can beat.
| johnb231 wrote:
| These are all "too big to host at home". I don't think that
| is the issue here.
|
| https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_...
|
| "The smallest deployment unit for Kimi-K2 FP8 weights with
| 128k seqlen on mainstream H200 or H20 platform is a cluster
| with 16 GPUs with either Tensor Parallel (TP) or "data
| parallel + expert parallel" (DP+EP)."
|
| 16 GPUs costing ~$30k each. No one is running a ~$500k server
| at home.
| pxc wrote:
| I think what GP means is that because the (hopefully)
| pending OpenAI release is also "too big to run at home",
| these two models may be close enough in size that they seem
| more directly comparable, meaning that it's even more
| important for OpenAI to outperform Kimi K2 on some key
| benchmarks.
| ls612 wrote:
| This is a dumb question I know, but how expensive is model
| distillation? How much training hardware do you need to
| take something like this and create a 7B and 12B version
| for consumer hardware?
| johnb231 wrote:
| The process involves running the original model. You can
| rent these big GPUs for ~$10 per GPU-hour, so that is ~$160
| per hour for the 16-GPU cluster, for as long as it takes.
| qeternity wrote:
| You can rent H100s for $1.50/gpu/hr these days.
| weitendorf wrote:
| For most people, before it makes sense to just buy all the
| hardware yourself, you probably should be renting GPUs by
| the hour from the various providers serving that need. On
| Modal, I think should cost about $72/hr to serve Kimi K2
| https://modal.com/pricing
|
| Once that's running it can serve the needs of many
| users/clients simultaneously. It'd be too expensive and
| underutilized for almost any individual to use regularly,
| but it's not unreasonable for them to do it in short
| intervals just to play around with it. And it might
| actually be reasonable for a small number of students or
| coworkers to share a $70/hr deployment for ~40hr/week in a
| lot of cases; in other cases, that $70/hr expense could be
| shared across a large number of coworkers or product users
| if they use it somewhat infrequently.
|
| So maybe you won't host it at home, but it's actually quite
| feasible to self-host, and is it ever really worth
| physically hosting anything at home except as a hobby?
| spaceman_2020 wrote:
| The real users for these open source models are businesses
| that want something on premises for data privacy reasons
|
| Not sure if they'll trust a Chinese model but dropping
| $50-100k for a quantized model that replaces, say, 10
| paralegals is good enough for a law firm
| MaxPock wrote:
| An on-premises, open-source Chinese model for my
| business, or a closed-source American model from a company
| that's a defense contractor. Shouldn't be too difficult a
| decision to make.
| cubefox wrote:
| According to the benchmarks, Kimi K2 beats GPT-4.1 in many
| ways. So to "compete", OpenAI would have to release the GPT-4.1
| weights, or a similar model. Which, I guess, they likely won't
| do.
| satvikpendem wrote:
| This is not open source; they have a "modified MIT license" with
| additional restrictions on users over a certain threshold:
|
|     Our only modification part is that, if the Software (or any
|     derivative works thereof) is used for any of your commercial
|     products or services that have more than 100 million monthly
|     active users, or more than 20 million US dollars (or equivalent
|     in other currencies) in monthly revenue, you shall prominently
|     display "Kimi K2" on the user interface of such product or
|     service.
| diggan wrote:
| That seems like a combination of Llama's "prominently display
| "Built with Llama"" and "greater than 700 million monthly
| active users" terms but put into one and masquerading as
| "slightly changed MIT".
| mrob wrote:
| The difference is it doesn't include Llama's usage
| restrictions that disqualify it from being an Open Source
| license.
| kragen wrote:
| I feel like those restrictions don't violate the OSD (or the
| FSF's Free Software Definition, or Debian's); there are similar
| restrictions in the GPLv2, the GPLv3, the 4-clause BSD license,
| and so on. They just don't have user or revenue thresholds. The
| GPLv2, for example, says:
|
| > _c) If the modified program normally reads commands
| interactively when run, you must cause it, when started running
| for such interactive use in the most ordinary way, to print or
| display an announcement including an appropriate copyright
| notice and a notice that there is no warranty (or else, saying
| that you provide a warranty) and that users may redistribute
| the program under these conditions, and telling the user how to
| view a copy of this License. (Exception: if the Program itself
| is interactive but does not normally print such an
| announcement, your work based on the Program is not required to
| print an announcement.)_
|
| And the 4-clause BSD license says:
|
| > _3. All advertising materials mentioning features or use of
| this software must display the following acknowledgement: This
| product includes software developed by_ the organization.
|
| Both of these licenses are not just non-controversially open-
| source licenses; they're such central open-source licenses that
| IIRC much of the debate on the adoption of the OSD was centered
| on ensuring that they, or the more difficult Artistic license,
| were not excluded.
|
| It's sort of nonsense to talk about neural networks being "open
| source" or "not open source", because there isn't source code
| that they could be built from. The nearest equivalent would be
| the training materials and training procedure, which isn't
| provided, but running that is not very similar to
| recompilation: it costs millions of dollars and doesn't produce
| the same results every time.
|
| But that's not a question about the _license_.
| mindcrime wrote:
| It may not violate the OSD, but I would still argue that this
| license is a Bad Idea. Not because what they're trying to do
| is inherently bad in any way, but simply because it's yet
| another new, unknown, not-fully-understood license to deal
| with. The fact that we're having this conversation
| illustrates that very fact.
|
| My personal feeling is that almost every project (I'll hedge
| a little because life is complicated) should prefer an OSI
| certified license and NOT make up their own license (even if
| that new license is "just" a modification of an existing
| license). License proliferation[1] is generally considered a
| Bad Thing for good reason.
|
| [1]: https://en.wikipedia.org/wiki/License_proliferation
| wongarsu wrote:
| Aren't most licenses "not fully understood" in any
| reasonable legal sense? To my knowledge only the Artistic
| License and the GPL have seen the inside of a court room.
| And yet to this day nobody really knows how the GPL works
| with languages that don't follow C's model of a compile and
| a link step. And the boundaries of what's a derivative work
| in the GPL are still mostly set by convention, not a legal
| framework.
|
| What makes us comfortable with the "traditional open source
| licenses" is that people have been using them for decades
| and nothing bad has happened. But that's mostly because
| breaking an open source license is rarely litigated
| against, not because we have some special knowledge of what
| those licenses mean and how to abide by them.
| mindcrime wrote:
| _Aren't most licenses "not fully understood" in any
| reasonable legal sense?_
|
| OK, fair enough. Pretend I said "not well understood"
| instead. The point is, the long-standing, well known
| licenses that have been around for decades are better
| understood than some random "I made up my own thing"
| license. And yes, some of that may be down to just norms
| and conventions, and yes, not all of these licenses have
| been tested in court. But I think most people would feel
| more comfortable using an OSI approved license, and are
| hesitant to foster the creation of even more licenses.
|
| If nothing else, license proliferation is bad because of
| the combinatorics of understanding license compatibility
| issues. Every new license makes the number of
| permutations that much bigger, and creates more unknown
| situations.
| user_7832 wrote:
| I'm of the personal opinion that it's quite reasonable for
| the creators to want attribution in case you manage to
| build a "successful product" off their work. The fact that
| it's a new or different license is a much smaller thing.
|
| A lot of open source, copyleft things already have
| attribution clauses. You're allowed commercial use of
| someone else's work already, regardless of scale.
| Attribution is a very benign ask.
| mindcrime wrote:
| I personally have no (or at least little) problem with
| attribution. As you say, quite a few licenses have some
| degree of attribution required. There's even a whole
| dedicated (and OSI-approved) license whose raison d'être is
| attribution:
|
| https://en.wikipedia.org/wiki/Common_Public_Attribution_Lice...
|
| What I'm saying, if I'm saying anything at all, is that
| it might have been better to pick one of these existing
| licenses that has some attribution requirement, rather
| than adding to the license proliferation problem.
| hnfong wrote:
| You speak as if "license proliferation" is actually a
| problem.
|
| But is it really?
|
| Sure, it may make some licenses incompatible with each
| other, but that's basically equivalent to whining about
| somebody releasing their code in GPL and it can't be used
| in a project that uses MIT...
|
| And your argument that the terms are "less understood"
| really doesn't matter. It's not like people know the
| Common Public Attribution License in and out either. (I'm
| going to argue that 99% of devs don't even know the GPL
| well.) Poor drafting could be an issue, but I don't think
| this is the case here.
|
| And on an ideological standpoint, I don't think people
| should be shamed into releasing their code under terms
| they aren't 100% comfortable with.
| ensignavenger wrote:
| The OSD does not allow for discrimination:
|
| "The license must not discriminate against any person or
| group of persons."
|
| "The license must not restrict anyone from making use of the
| program in a specific field of endeavor. For example, it may
| not restrict the program from being used in a business, or
| from being used for genetic research."
|
| By having a clause that discriminates based on revenue, it
| cannot be Open Source.
|
| If they had required everyone to provide attribution in the
| same manner, then we would have to examine the specifics of
| the attribution requirement to determine if it is
| compatible... but since they discriminate, it violates the
| open source definition, and no further analysis is necessary.
| sophiebits wrote:
| This license with the custom clause seems equivalent to
| dual-licensing the product under the following licenses
| combined:
|
| * Small companies may use it without attribution
|
| * Anyone may use it with attribution
|
| The first may not be OSI compatible, but if the second
| license is then it's fair to call the offering open
| weights, in the same way that dual-licensing software under
| GPL and a commercial license is a type of open source.
|
| Presumably the restriction on discrimination relates to
| license terms which grant _no_ valid open source license to
| some group of people.
| moffkalast wrote:
| That's basically less restrictive than OpenStreetMap.
| echelon wrote:
| > This is not open source
|
| OSI purism is deleterious and has led to industry capture.
|
| Non-viral open source is simply a license for hyperscalers to
| take advantage. To co-opt offerings and make hundreds of
| millions without giving anything back.
|
| We need more "fair source" licensing to support sustainable
| engineering that rewards the small ICs rather than mega
| conglomerate corporations with multi-trillion dollar market
| caps. The same companies that are destroying the open web.
|
| This license isn't even that protective of the authors. It just
| asks for credit if you pass a MAU/ARR threshold. They should
| honestly ask for money if you hit those thresholds and should
| blacklist the Mag7 from usage altogether.
|
| The resources put into building this are significant and
| they're giving it to you for free. We should applaud it.
| teiferer wrote:
| > small ICs
|
| The majority of open source code is contributed by companies,
| typically very large corporations. The thought of the open
| source ecosystem being largely carried by lone hobbyist
| contributors in their spare time after work is a myth. There
| are such folks (heck I'm one of them) and they are
| appreciated and important, but their perception far exceeds
| their real role in the open source ecosystem.
| wredcoll wrote:
| I've heard people go back and forth on this before, but you
| seem pretty certain about it; can you share some stats so I
| can see them too?
| satvikpendem wrote:
| That's great, nothing wrong with giving away something for
| free, just don't call it open source.
| Intermernet wrote:
| Yep, awesome stuff. Call it "fair source" if you want to.
| Don't call it open source. I'm an absolutist about very few
| things, but the definition of open source is one of them.
| Every bit of variation given in the definition is a win for
| those who have ulterior motives for polluting the definition.
| Open source isn't a vague concept, it's a defined term with a
| legally accepted meaning. Very much like "fair use". It's
| dangerous to allow this definition to be altered. OpenAI (A
| deliberate misnomer if ever there was one) and friends would
| really love to co-opt the term.
| alt187 wrote:
| What part of this goes against the four fundamental freedoms?
| Can you point at it?
| Alifatisk wrote:
| Exactly, I wouldn't mind adding that text to our service if
| we made $20M; the parent made it sound like a huge clause.
| tonyhart7 wrote:
| Yeah, it's fair for them if they want a little bit of credit
|
| nothing gucci there
| simonw wrote:
| "The freedom to run the program as you wish, for any purpose
| (freedom 0)."
|
| Being required to display branding in that way contradicts
| "run the program as you wish".
| a2128 wrote:
| Being required to store the GPL license notice on my hard
| drive is contradicting my wishes. And I'm not even earning
| $20 million US dollars per month off GPL software!
| weitendorf wrote:
| You are still free to run the program as you wish, you just
| have to provide attribution to the end user. It's
| essentially CC BY but even more permissive, because the
| attribution only kicks in once specific, relatively
| uncommon conditions are met.
|
| I think basically everybody considers CC BY to be open
| source, so a strictly more permissive license should be
| too, I think.
| owebmaster wrote:
| This freedom might be against the freedom of others to get
| your modifications.
| drawnwren wrote:
| It's silly, but in the LLM world - "open source" is usually
| used to mean "weights are published". This is not to be
| confused with the software licensing meaning of "open source".
| simonw wrote:
| The more tasteful corners of the LLM world use "open weights"
| instead of "open source" for licenses that aren't OSI.
| randomNumber7 wrote:
| This is just so Google doesn't build a woke version of it and
| calls it gemini-3.0-pro
| bhouston wrote:
| Impressive benchmarks!
| emacdona wrote:
| To me, K2 is a mountain and SOTA is "summits on the air". I saw
| that headline and thought "holy crap" :-)
| esafak wrote:
| To me K2 is the Kotlin 2.0 compiler.
| https://blog.jetbrains.com/kotlin/2023/02/k2-kotlin-2-0/
| 38 wrote:
| The web chat has extremely low limits FYI. I ran into the limit
| twice before getting a sane answer and gave up
| awestroke wrote:
| You can use it on OpenRouter without limits (paid API calls)
| exegeist wrote:
| Technical strengths aside, I've been impressed with how non-
| robotic Kimi K2 is. Its personality is closer to Anthropic's
| best: pleasant, sharp, and eloquent. A small victory over botslop
| prose.
| orbital-decay wrote:
| I have a different experience in chatting/creative writing. It
| tends to overuse certain speech patterns without repeating them
| verbatim, and is strikingly close to the original R1 writing,
| without being "chaotic" like R1 - unexpected and overly
| dramatic sci-fi and horror story turns, "somewhere, X happens"
| at the end etc.
|
| Interestingly enough, EQ-Bench/Creative Writing Bench doesn't
| spot this despite clearly having it in their samples. This
| makes me trust it even less.
| pxc wrote:
| So far, I like the answer quality and its voice (a bit less
| obsequious than either ChatGPT or DeepSeek, more direct), but it
| seems to badly mangle the format of its answers more often than
| I've seen with SOTA models (I'd include DeepSeek in that
| category, or close enough).
| irthomasthomas wrote:
| Which host did you use? I noticed the same using parasail.
| Switching to novita and temp 0.4 solved it.
| pxc wrote:
| The host was Moonshot AI at Kimi dot com :)
| jacooper wrote:
| The problem with Chinese models is finding decent hosting. The
| best you can find right now for kimi k2 is only 30 tps, not
| great.
| sagarpatil wrote:
| All the AI models are now using em-dashes. ChatGPT keeps using
| them even after explicitly told not to. Anybody know what's up
| with these models?
| cristoperb wrote:
| I don't know, but as someone who likes using em-dashes in my
| writing it is disappointing that they have become a marker of
| LLM slop.
| ksec wrote:
| _Kimi K2 is the large language model series developed by Moonshot
| AI team._
|
| _Moonshot AI [1] (Moonshot; Chinese: 月之暗面; pinyin:
| Yuè Zhī Ànmiàn) is an artificial intelligence (AI) company based
| in Beijing, China. As of 2024, it has been dubbed one of China's
| "AI Tiger" companies by investors with its focus on developing
| large language models._
|
| I guess everyone is up to date with AI stuff but this is the
| first time I heard of Kimi and Moonshot and was wondering where
| it is from. And it wasn't obvious from a quick glance of
| comments.
|
| [1] https://en.wikipedia.org/wiki/Moonshot_AI
| RandyOrion wrote:
| This is an open weight model, which is in contrast with closed-
| source models.
|
| However, 1T parameters makes local inference nearly impossible,
| let alone fine-tuning.
| lvl155 wrote:
| I love the fact that I can use this right away and test it out in
| practice. The ecosystem around LLM is simply awesome and
| improving by the day.
___________________________________________________________________
(page generated 2025-07-13 23:01 UTC)