[HN Gopher] Cerebras Code now supports GLM 4.6 at 1000 tokens/sec
___________________________________________________________________
Cerebras Code now supports GLM 4.6 at 1000 tokens/sec
Author : nathabonfim59
Score : 166 points
Date : 2025-11-08 00:00 UTC (23 hours ago)
(HTM) web link (www.cerebras.ai)
(TXT) w3m dump (www.cerebras.ai)
| alyxya wrote:
| It would be nice if there was more information provided on that
| page. I assume this is just the output token generation speed. Is
| it using speculative decoding to get to 1000 tokens/sec? Is there
 | lossy quantization being used to speed things up? I tend to
 | consider the number of tokens per second a model can generate
 | relatively low on the list of things I care about, since things
 | like model/inference quality and the harness play a much bigger
 | role in how I feel about using a coding agent.
| cschneid wrote:
| Yes this is the output speed. Code just flashes onto the page,
| it's pretty impressive.
|
| They've claimed repeatedly in their discord that they don't
| quantize models.
|
| The speed of things does change how you interact with it I
| think. I had this new GLM model hooked up to opencode as the
| harness with their $50/mo subscription plan. It was seriously
| fast to answer questions, although there are still big pauses
| in workflow when the per-minute request cap is hit.
|
| I got a meaningful refactor done, maybe a touch faster than I
| would have in claude code + sonnet? But my human interaction
| with it felt like the slow part.
| alyxya wrote:
| The human interaction part is one of the main limitations to
| speed, where the more autonomous a model can be, the faster
| it is for me.
| behnamoh wrote:
| If they don't quantize the model, how do they achieve these
| speeds? Groq also says they don't quantize models (and I want to
| believe them) but we literally have no way to prove they're
| right.
|
 | This is important because their premium $50 price (as opposed to
 | $20 for Claude Pro or ChatGPT Plus) should be justified by the
 | speed. GLM 4.6 is fine, but I still don't think it's at the
 | GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on
 | Cerebras it should be mainly because of the speed.
|
| What kind of workflow justifies this? I'm genuinely curious.
| cschneid wrote:
 | So apparently they have custom hardware: basically gigantic
 | chips, each built from an entire wafer. Presumably they keep the
 | whole model right on chip, in what is effectively SRAM/L3 cache,
 | so the memory bandwidth is absurdly high, allowing very fast
 | inference.
 |
 | It's more expensive to get the same raw compute as a cluster of
 | Nvidia chips, but an Nvidia cluster can't match this kind of
 | single-stream throughput.
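 |
 | (Rough back-of-the-envelope numbers, assuming GLM 4.6 is a
 | mixture-of-experts model with roughly 32B active parameters per
 | generated token stored at ~2 bytes each - both figures are
 | assumptions, not Cerebras specs:)
 |
 |     active_params = 32e9     # assumed active params per token (MoE)
 |     bytes_per_param = 2      # bf16 weights, i.e. no quantization
 |     tokens_per_sec = 1000
 |     weight_reads = active_params * bytes_per_param * tokens_per_sec
 |     print(f"{weight_reads / 1e12:.0f} TB/s of weight traffic")  # ~64 TB/s
 |
 | That's an order of magnitude beyond a single GPU's HBM, but
 | plausible if the weights never leave on-wafer SRAM.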
|
| As far as price as a coder, I am giving a month of the $50 plan
| a shot. I haven't figured out how to adapt my workflow yet to
| faster speeds (also learning and setting up opencode).
| bigyabai wrote:
| For $50/month, it's a non-starter. I hope they can find a way
| to use all this excess bandwidth to put out a $10 equivalent
| to Claude Code instead of a 1000 tok/s party trick I can't
| use properly.
| typpilol wrote:
| I feel the same and it's also why I can't understand all
| these people using small local models.
|
 | Every local model I've used, and even most open-source ones,
 | are just not good
| behnamoh wrote:
 | the only good-enough models I still use are gpt-
 | oss-120b-mxfp4 (not 20b) and glm-4.6 at q8 (not q4).
|
| quantization ruins models and some models aren't that
| smart to begin with.
| csomar wrote:
| GLM-4.6 is on par with Sonnet 4.5. Sometimes it is
| better, sometimes it is worse. Give it a shot. It's the
| only model that made me (almost) ditch Claude. The only
| problem is, Claude Code is still the best agentic program
| in town and search doesn't function without a proper
| subscription.
| mcpeepants wrote:
| z.ai hosted GLM 4.6 works great with claude code, drops
| right in
| esafak wrote:
| Have you tried opencode?
| DeathArrow wrote:
| Have you tried Claude Code Router with GLM 4.6?
|
| https://github.com/musistudio/claude-code-router
| xadhominemx wrote:
| $600 per year is a trivial cost for a professional tool
| nine_k wrote:
| > _What kind of workflow justifies this?_
|
| Think about waiting for compilation to complete: the difference
| between 5 minutes and 15 seconds is dramatic.
|
| Same applies to AI-based code-wrangling tasks. The preserved
| concentration may be well worth the $50, especially when paid
| by your employer.
| behnamoh wrote:
 | they should offer a free trial so we can build confidence in the
| model quality (e.g., to make sure it's not
| nerfed/quantized/limited-context/etc.).
| conception wrote:
| A trial is literally front and center on their website.
| NitpickLawyer wrote:
| You can usually use them with things like openrouter. Load
| some credits there and use the API in your preferred IDE
 | like you'd use any provider. A few quick coding sessions will
 | probably cost less than $5, which is enough to check out the
 | capabilities and see if it's worth it for you.
| behnamoh wrote:
| openrouter charges me $12 on a $100 credit...
| NitpickLawyer wrote:
| > What kind of workflow justifies this? I'm genuinely curious.
|
| Any workflow where verification is faster / cheaper than
 | generation. If you have a well tested piece of code and want to
 | "refactor it to use such and such paradigm", you can run n
 | queries against the faster model and pick the best result.
|
| My colleagues that do frontend use faster models (not this one
| specifically, but they did try fast-code-1) to build
| components. Someone worked out a workflow w/ worktrees where
| the model generates n variants of a component, and displays
| them next to each other. A human can "at a glance" choose which
| one they like. And sometimes pick and choose from multiple
 | variants (something like passing it to Claude and saying "keep
 | the styling of component A but the data management of component
 | B"), and at the end of the day it's faster / cheaper than having
 | cc do all that work.
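 |
 | (A minimal sketch of that kind of fan-out, assuming an OpenAI-
 | compatible endpoint; the URL, model name and prompt below are
 | placeholders, not anything Cerebras-specific:)
 |
 |     import concurrent.futures, pathlib
 |     from openai import OpenAI
 |
 |     client = OpenAI(base_url="https://api.example.com/v1",
 |                     api_key="...")
 |     prompt = "Write a React card component for a user profile."
 |
 |     def gen_variant(i: int) -> None:
 |         r = client.chat.completions.create(
 |             model="some-fast-model",   # placeholder
 |             messages=[{"role": "user", "content": prompt}],
 |             temperature=1.0,           # we want the variants to differ
 |         )
 |         out = pathlib.Path(f"variants/{i}/Card.tsx")
 |         out.parent.mkdir(parents=True, exist_ok=True)
 |         out.write_text(r.choices[0].message.content)
 |
 |     # n variants, generated in parallel, ready to eyeball side by side
 |     with concurrent.futures.ThreadPoolExecutor() as pool:
 |         list(pool.map(gen_variant, range(4)))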
| msp26 wrote:
| Groq does quantise. Look at this benchmark from moonshotai for
| K2 where they compare their official implementation to third
| party providers.
|
| https://github.com/MoonshotAI/K2-Vendor-Verifier
|
 | Groq is one of the lowest-rated providers in that table.
| threeducks wrote:
| > but we literally have no way to prove they're right
|
 | Of course we do. Just run a benchmark with Cerebras/Groq and
 | compare to the results produced in a trusted environment. If the
 | scores are equal, the model is either unquantized, or quantized
 | so well that we cannot tell, in which case it does not matter.
 |
 | For example, here is a comparison of different providers for
 | gpt-oss-120b, with differences of over 10% between the best and
 | worst providers.
| provider.
|
| https://artificialanalysis.ai/models/gpt-oss-120b/providers#...
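 |
 | (A crude version of that check, assuming two OpenAI-compatible
 | endpoints and your own prompt set; every name below is a
 | placeholder. Graded benchmark scores are more robust than raw
 | string agreement, but the idea is the same:)
 |
 |     from openai import OpenAI
 |
 |     trusted = OpenAI(base_url="https://trusted.example/v1",
 |                      api_key="...")
 |     fast = OpenAI(base_url="https://fast.example/v1",
 |                   api_key="...")
 |     prompts = ["What is 17 * 23?"]  # replace with a real eval set
 |
 |     def answer(client, p):
 |         r = client.chat.completions.create(
 |             model="same-open-weights-model",  # identical model on both
 |             messages=[{"role": "user", "content": p}],
 |             temperature=0,  # greedy, so differences point at the serving stack
 |         )
 |         return r.choices[0].message.content.strip()
 |
 |     agree = sum(answer(trusted, p) == answer(fast, p) for p in prompts)
 |     print(f"agreement: {agree}/{len(prompts)}")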
| xadhominemx wrote:
| It's because the model weights and KV cache are stored in SRAM.
| It's extremely expensive per token.
| niklassheth wrote:
| This is more evidence that Cognition's SWE-1.5 is a GLM-4.6
| finetune
| prodigycorp wrote:
| Can you provide more context for this? (eg Was SWE-1.5 released
| recently? Is it considered good? Is it considered fast? Was
| there speculation about what the underlying model was? How does
| this prove that it's a GLM finetune?)
| mhuffman wrote:
| I suspect they are referencing the 950tok/s claim on
| Cognition's page.
| prodigycorp wrote:
| Ah. Thx. Blogpost for others:
| https://cognition.ai/blog/swe-1-5
|
 | Takeaway is that this is a Sonnet-ish model at 10x the speed.
| NitpickLawyer wrote:
 | People saw Chinese characters in generations made by SWE-1.5
 | (Windsurf's model) and also in the one made by Cursor. This led
 | to suspicions that the models are finetunes of Chinese models
 | (which makes sense, as there aren't many strong US/EU coding
 | models out there). GLM 4.5/4.6 are the "strongest" coding models
 | atm (with dsv3.2 and Qwen somewhat behind), so that's where the
 | speculation came from. Cerebras serving them at roughly the same
 | speeds kinda adds to that story (e.g. if it were something
 | heavier like dsv3 or Kimi K2 it would be slower).
| prodigycorp wrote:
| Really appreciate this context. Thank you!
| nl wrote:
 | Not at all. Any model with a somewhat-similar architecture and
 | roughly similar size should run at the same speed on Cerebras.
 |
 | It's like saying Llama 3.2 3B and Gemma 4B are fine-tunes of
 | each other because they run at similar speeds on Nvidia
 | hardware.
| gatienboquet wrote:
| Vibe Slopping at 1000 tokens per second
| mmaunder wrote:
| Yeah honestly having max cognitive capability is #1 for me.
| Faster tokens is a distant second. I think anyone working on
| creating valuable unique IP feels this way.
| conception wrote:
 | This is where agents actually shine. Having a smart model
 | write code and plan is great, and then having Cerebras do the
 | command-line work, write documents effectively instantly, and
 | handle other simple tasks does speed things up quite a bit.
| lordofgibbons wrote:
| At what quantization? And if it is in fact quantized below fp8,
| how is the performance impacted on all the various benchmarks?
| antonvs wrote:
| They claim they don't use quantization.
|
| The reason for their speed is this chip:
| https://www.cerebras.ai/chip
| renewiltord wrote:
| Unfortunately for me, the models on Cerebras weren't as good as
| Claude Code. Speedy but I needed to iterate more. Codex is
| trustworthy and slow. Claude is better at iterating. But none of
| the Cerebras models at the $50 tier were worth anything for me.
| They would have been something if they'd just come out but we
| have these alternatives now.
| elzbardico wrote:
| I don't care. I want LLMs to help with the boring stuff, the
| toil. It may not be as intelligent as Claude, but if it takes
| care of the boring stuff, and it is fast while doing it, I am
| happy. Use it surgically, do the top-down design, and just let
| it fill the blanks.
| renewiltord wrote:
| Give it a crack. It took a lot of iteration for it to write
| decent code. If you figure out differences in prompting
| technique, do share. I was really hoping for the speed to
| improve a lot of execution - because that's genuinely the
| primary problem for me. Unfortunately, speed is great but
| quality wasn't great for me.
|
| Good luck. Maybe it'll do well in some self-directed agent
| loop.
| Flux159 wrote:
| Was able to sign up for the Max plan & start using it via
| opencode. It does a way better job than Qwen3 Coder in my
| opinion. Still extremely fast, but in less than 1 hour I was able
 | to use 7M input tokens, so with a single agent running I would
 | easily be able to pass that 120M daily token limit. The speed
 | difference compared to Claude Code is significant though - to the
 | point where I'm not waiting for generation most of the time, I'm
 | waiting for my tests to run.
|
| For reference, each new request needs to send all previous
| messages - tool calls force new requests too. So it's essentially
| cumulative when you're chatting with an agent - my opencode
| agent's context window is only 50% used at 72k tokens, but
| Cerebra's tracking online shows that I've used 1M input tokens
| and 10k output tokens already.
| NitpickLawyer wrote:
| > For reference, each new request needs to send all previous
| messages - tool calls force new requests too. So it's
| essentially cumulative when you're chatting with an agent - my
| opencode agent's context window is only 50% used at 72k tokens,
| but Cerebra's tracking online shows that I've used 1M input
| tokens and 10k output tokens already.
|
| This is how every "chatbot" / "agentic flow" / etc works behind
| the scenes. That's why I liked that "you should build an agent"
| post a few days ago. It gets people to really understand what's
| behind the curtain. It's requests all the way down, sometimes
| with more context added, sometimes with less (subagents & co).
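 |
 | (A stripped-down version of that loop, assuming an OpenAI-
 | compatible chat endpoint; names are placeholders. The point is
 | that the full history is re-sent on every turn, so billed input
 | tokens grow much faster than the visible context:)
 |
 |     from openai import OpenAI
 |
 |     client = OpenAI(base_url="https://api.example.com/v1",
 |                     api_key="...")
 |     messages = [{"role": "system", "content": "You are a coding agent."}]
 |     billed_input = 0
 |
 |     for turn in range(10):
 |         messages.append({"role": "user", "content": f"step {turn}: ..."})
 |         r = client.chat.completions.create(model="some-model",
 |                                            messages=messages)
 |         messages.append({"role": "assistant",
 |                          "content": r.choices[0].message.content})
 |         billed_input += r.usage.prompt_tokens  # whole history billed again
 |
 |     # without prefix caching this grows roughly quadratically with turns
 |     print(billed_input)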
| zaptrem wrote:
| They don't have prefix caching? Claude and Codex have this.
| versteegen wrote:
| At those speeds, it's probably impossible. It would require
| enormous amounts of memory (which the chip simply doesn't
| have, there's no room for it) or rather a lot of bandwidth
| off-chip to storage, and again they wouldn't want to waste
| surface area on the wiring. Bit of a drawback of increasing
| density.
| elzbardico wrote:
 | The 50 dollars/month Cerebras Code plan, first with Qwen3 Coder
 | 480B, now with GLM, is my secret weapon.
 |
 | Stalin used to say that in war "quantity has a quality all its
 | own". And I think that for coding agents, speed is a quality all
 | its own too.
 |
 | Maybe not for blind vibe coding, but if you are a developer and
 | are able to understand the code the agent generates and change
 | it, the fast feedback of fast inference is a game changer. I
 | don't care if Claude is better than GLM 4.6; fast iterations are
 | king for me now.
|
| It is like moving from DSL to gigabit fiber FTTH
| divmain wrote:
| I have been an AI-coding skeptic for some time. I always
| acknowledged LLMs as useful for solving specific problems and
| making certain things possible that weren't possible before. But
| I've not been surprised to see AI fail to live up to the hype.
| And I never had a personally magical moment - an experience that
| shifted my perspective a la the peak end rule.
|
| I've been using GLM 4.6 on Cerebras for the last week or so,
| since they began the transition, and I've been blown away.
|
| I'm not a vibe coder; when I use AI coding tools, they're in the
| hot path. They save me time when whipping up a bash script and I
| can't remember the exact syntax, or for finding easily
| falsifiable answers that would otherwise take me a few minutes of
| reading. But, even though GLM 4.6 is not as smart as Sonnet 4.5,
| it is smart enough. And because it is so fast on Cerebras, I
| genuinely feel that it augments my own ability and productivity;
| the raw speed has considerably shifted the tipping point of time-
| savings for me.
|
| YMMV, of course. I'm very precise with the instructions I
| provide. And I'm constantly interleaving my own design choices
| into the process - I usually have a very clear idea in my mind of
| what the end result should look like - so, in the end, the code
| ends up how I would have written it without AI. But building
| happens much faster.
|
| No affiliation with Cerebras, just a happy customer. Just
| upgraded to the $200/mo plan - and I'll admit that I was one that
| scoffed when folks jumped on the original $200/mo Claude plan. I
| think this particular way of working with LLMs just fits well
| with how I think and work.
| ramraj07 wrote:
| Your post has inspired me to check them out. How do you use it,
 | with their UI or to power some other open source tool?
|
 | Do you suggest that this thing is so fast it's simpler now to
 | quickly work on one thing at a time instead of the 5 background
| tools running in parallel which might have been a pattern we
| invented because these things are so slow?
| bananapub wrote:
| you want to be using something like opencode in a terminal,
| not the web ui.
|
| you'll need to try it and see what the speed does to your
| workflow.
| divmain wrote:
| I've been using the crush TUI primarily. I like that I have
| the flexibility to switch to a smarter model on occasion -
 | for a while I hesitated to pick up AI coding at all, simply
| because I didn't want to be locked into a model that could be
| immediately surpassed. It's also customizable enough with
| sane defaults.
| realo wrote:
 | I was an AI skeptic too a year ago, but recently I wanted a
 | Windows exe program to do the same as a complicated bash script
 | on Linux.
 |
 | I gave the bash script to Claude Code, which immediately
 | started implementing something in the Zig language. After a few
 | iterations, I had Zig source code that compiled on Linux,
 | produced a Windows exe and perfectly mimicked the bash script.
|
| I know nothing about zig programming.
| aurareturn wrote:
| I've been maintaining my company's Go repos using Claude
| after our Go developer left. I don't know anything about Go.
| mythz wrote:
| AI moves so fast that Vibe Coding still has a negative stigma
| attached to it, but even after 25 years of development, I'm not
| able to match the productivity of getting AI to implement the
 | features I want. It's basically like sending multiple devs off
 | to do work for you: you just tell them what you want and provide
 | iterative feedback until they implement all the features you
 | want, in the way you want, and fix all the issues you find along
 | the way - and they can write the tests and all the automation
 | and deployment scripts too.
|
| This is clearly the future of Software Development, but the
| models are so good atm that the future is possible now. I'm still
| getting used to and having to rethink my entire dev workflow for
| maximum productivity, and whilst I wouldn't unleash AI Agents on
| a decade old code base, all my new Web Apps will likely end up
| being AI-first unless there's a very good reason why it wouldn't
| provide a net benefit.
| namanyayg wrote:
| Exactly, Codex gpt-5-high is quite like sending smart devs. It
| still makes mistakes, and when it does they're extremely stupid
 | ones, but I am now treating the code it generates as throwaway
 | and I just reroll when it does something dumb.
| dust42 wrote:
| It just depends on what you are doing. A green field react app
| in typescript with a CRUD API behind? The LLMs are a mind
| blowing assistant and 1000t/s is crazy.
|
| You are doing embedded development or anything else not as
| mainstream as web dev? LLMs are still useful but no longer mind
| blowing and often produce hallucinations. You need to read
| every line of their output. 1000t/s is crazy but no longer
| always in a good way.
|
| You are doing stuff which the LLMs haven't seen yet? You are on
| your own. There is quite a bit of irony in the fact that the
| devs of llama.cpp barely use AI - just have a look at the
| development of support for Qwen3-Next-80B [1].
|
| [1] https://github.com/ggml-org/llama.cpp/pull/16095
| almostgotcaught wrote:
| I've said it before but no one takes it seriously: LLMs are
| only useful if you're building something that's already in
| the training set ie _already commodity_. In which case _why
| are you building it_???
| whiterook6 wrote:
| It's not that the product you're building is a commodity.
 | It's that the tools you're using to build it are. Why not
| build a landing page using HTML and CSS and tailwind? Why
| not use swift to make an app? Why not write an AWS lambda
| using JavaScript?
| mythz wrote:
| "LLMs are only useful..."
|
| Is likely why no one takes you seriously, as it's a good
| indication you don't have much experience with them.
| nurettin wrote:
| Do you avoid writing anything that the programming
| community has ever built? How are you alive???
| antonvs wrote:
| The obvious point that you're missing is that there are
| literally infinite ways to assemble software systems from
| the pieces that an LLM is able to manipulate due to its
| training. With minor guidance, LLMs can put together an
| unlimited number of novel combinations. The idea that the
| entire end product has to be in the training set is
| trivially false.
| becquerel wrote:
| Because I'm getting paid to.
| philipp-gayret wrote:
| It's true, when I was working with LLMs on a novel idea it
| said sorry I can't help you with that!
| dboreham wrote:
| Historically big AI skeptic here: what you say is very not
| true now. LLMs aren't just regurgitating their training
| data per se. I've used LLMs on _languages_ the LLM has not
 | seen, and it performed well. I've used LLMs on code that
| is about as far from a React todo app as it's possible to
| get.
| energy123 wrote:
| Bell Labs should have fired all their toilet cleaners.
| Nothing innovative about a toilet.
| lifthrasiir wrote:
| There aren't many things that LLMs haven't really seen yet,
| however. I have successfully used LLMs to develop a large
 | portion of a WebAssembly 3.0 interpreter [1], which surely
 | isn't in their training set because WebAssembly 3.0 was only
| released months ago. Sure, it took me tons of guidance but it
| was useful enough for me.
|
| Even llama.cpp is not a truly novel thing to LLMs, there are
| several performant machine learning model executors available
| in their training sets anyway, and I'm sure llama.cpp _can_
| benefit from LLMs if they want; they just chose not to.
|
| [1] https://github.com/lifthrasiir/wah/
| koito17 wrote:
| > You are doing embedded development or anything else not as
| mainstream as web dev? LLMs are still useful but no longer
| mind blowing and often produce hallucinations.
|
| I experienced this with Claude 4 Sonnet and, to some extent,
| gpt-5-mini-high.
|
| When able to run tests against its output, Claude produces
| pretty good Rust backend and TypeScript frontend code.
| However, Claude became borderline _unproductive_ once I
| started experimenting with uefi-rs. Other LLMs, like
| gpt-5-mini-high, did not fare much better, but they were at
| least capable of admitting _lack of knowledge_. In
| particular, GPT-5 would provide output akin to "here is some
| pseudocode that you may be able to adapt to your choice of
| UEFI bindings".
|
| Testing in a UEFI environment is quite difficult; the LLM
| can't just run `cargo test` and verify its output. Things get
| worse in embedded, because crates like embedded_hal made
| massive API changes between 0.2 and 1.0 (the latest version),
| and each LLM I've tried seems to only have knowledge of 0.2
| releases. Also, for embedded, forget even thinking about
| testing harnesses (which at least exist in _some form_ with
 | UEFI, it's just difficult to automate the execution and
| output for an LLM). In this case, you cannot really trust the
| output of the LLM. To minimize risk of hallucination, I would
| try maintaining data sheets and library code in context, but
| at that point, it took more time to prompt an LLM than
 | to handwrite the code.
|
| I've been writing a lot of embedded Rust over the past two
| weeks, and my usage of LLMs _in general_ decreased because of
| that. Currently planning to resume development on some of my
| "easier" projects, since I have about 300 Claude prompts
| remaining in my Zed subscription, and I don't want them to go
| to waste.
| RealityVoid wrote:
| > Also, for embedded, forget even thinking about testing
| harnesses (which at least exist in some form with UEFI,
| it's just difficult to automate the execution and output
| for an LLM).
|
 | I think it doesn't have to be this way, and we can do better
 | here. If LLMs keep this up, good testing infrastructure might
 | become even more important.
| koito17 wrote:
| One of my expectations for the future is the development
| of testing tools whose output is "optimized" in some way
| for LLM consumption. This is already occurring with Bun's
| test runner, for instance.[0] They are implementing a
| flag in the test runner so that the output is structured
| _and_ optimized for token count.
|
| Overall, I agree with your point. LLMs feel a lot more
| reliable when a codebase has thorough, easy-to-run tests.
| For a similar reason, I have been drifting towards
| strong, statically-typed languages. Both Rust and
| TypeScript have rich type systems that can express many
| kinds of runtime behavior with just types. When a
| compiler can make strong guarantees about a program's
| behavior, I assume that helps nudge the quality of LLM
| output a bit higher. Tests then help prevent silly
| regressions from occurring. I have no evidence for this
| besides my anecdotal experience using LLMs across several
| programming languages.
|
| In general, I've had the best experience with LLMs when
| there's plenty of static analysis (and tests) on the
| codebase. When a codebase can't be easily tested, then I
| get much less productivity gains from LLMs. So yeah, I'm
| all for improving testing infrastructure.
|
| [0] https://x.com/jarredsumner/status/1944948478184186366
| miki123211 wrote:
| This is where Rust's "if it compiles, it's probably
| correct" philosophy may come in handy.
|
| "Shifting bugs left" is even more important for LLMs than
| it is for humans. There are certain tests LLMs can't run,
| so if we can detect bugs at compile time and run the LLM in
| a loop until things compile, that's a significant benefit.
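 |
 | (The loop itself is simple enough to sketch; `ask_llm_for_patch`
 | below is a hypothetical function standing in for whatever agent
 | applies a model-suggested fix to the working tree:)
 |
 |     import subprocess
 |
 |     def check() -> tuple[bool, str]:
 |         # `cargo check` is cheaper than a full build and still
 |         # surfaces type and borrow errors
 |         r = subprocess.run(["cargo", "check"],
 |                            capture_output=True, text=True)
 |         return r.returncode == 0, r.stderr
 |
 |     for _ in range(5):             # bounded retries
 |         ok, errors = check()
 |         if ok:
 |             break
 |         ask_llm_for_patch(errors)  # hypothetical: feed compiler errors
 |                                    # back to the model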
| nathan_compton wrote:
| My recent experience is that llms are dogshit at rust,
| though, unable to correct bugs without inserting new
| ones, going back and forth fixing and breaking the same
| thing, etc.
| energy123 wrote:
| A while ago I gathered every HN comment going back a year
| that contains Rust and LLM and about half are positive
| and half are negative.
| manmal wrote:
| Aren't we all though?
| RealityVoid wrote:
| > You are doing embedded development or anything else not as
| mainstream as web dev?
|
| Counterpoint, but also kind of reinforcing your point. It
| depends on the kind of embedded development. I did a small
 | utility PCB with an ESP32, and their libs are good, there is an
 | active community, and they have test frameworks. LLMs did a
 | great job there.
|
 | On the other hand, I wanted to drive a timer, a PWM module and
 | a DMA engine to generate some precise pulses. The way
| I chained hw was... Not typical, but it was what I needed and
| the hw could do it. At that, Claude failed miserably and it
| only led me to waste my time, so I had to spend the time to
| do it manually.
| ffsm8 wrote:
| The industry of "software" is so large... While I agree with
| web development going this route, I'm not sure about
| "everything else".
|
| You could argue that that's the bulk of all software jobs in
| tech, and you'd likely be correct... But depending on what your
 | actual challenge is, LLM assistance is more of a hindrance than
| help. However creating a web platform without external
| constraints makes LLM assistance shine, that's true
| killerstorm wrote:
| Well, there are certainly kinds of code LLMs would struggle
| with, but people generally underestimate what LLMs are
| capable of.
|
 | E.g. Victor Taelin is implementing an ultra-advanced programming
 | language/runtime, writing almost all the code with LLMs now. The
 | runtime (HVM) is based on the Interaction Calculus model, which
 | was only an obscure academic curiosity until Taelin started
 | working on it. So the hypothesis that LLMs are only capable of
 | copying bits of code from Stack Overflow can be dismissed.
| thesz wrote:
| I took a look at the Taelin's work [1].
|
| [1] https://github.com/HigherOrderCO/HVM
|
 | From my understanding, the main problem there is compilation
 | into (optimal) CUDA code and the CUDA runtime, not the language
 | or internal representation per se. CUDA is hard to debug, so
 | some help can be warranted.
|
| BTW, this HVM thing smells strange. The PAPER does not
| provide any description of experiments where linear
| parallel speedups were achieved. What were these 16K cores?
| What were these tasks?
| killerstorm wrote:
| Taelin is experimenting with possible applications of
| interaction calculus. That CUDA thing was one of
| experiments, and it didn't quite work out.
|
| Currently he's working on a different thing: a code
| synthesis tool. AFAIK he got something better than
| anything else in this category, but whether it's useful
| is another question.
| thesz wrote:
| > something better than anything else in this category
|
| That is a strong statement.
|
| [1]
| https://en.wikipedia.org/wiki/Id_(programming_language)
|
| Id [1] was run on the CM-5 (then) supercomputer and
| demonstrated superlinear parallel speedups on some of the
| tasks. That superlinear speedup was due to better cache
| utilization on individual nodes.
|
| In some of the tasks the amount of parallel execution
| discovered by Id90 would lead to overflow of content-
| addressable memory and Id90's runtime implemented
 | throttling _to reduce available parallelism_ to get
 | things done at all.
|
 | Does the HVM paper refer to Id (Id90 to be precise)?
 | No, it does not.
 |
 | This is serious negligence on Taelin's part.
| keithba wrote:
 | I've also experimented with using Rust to create a new
 | programming language that I vibe coded (i.e. I never wrote any
 | of it myself). My opinion is that it's quite capable with
 | disciplined management.
|
| https://github.com/GoogleCloudPlatform/aether
|
| Note: the syntax is ugly as a trade-off to make it explicit
| and unambiguous for LLMs to use.
| kaspermarstal wrote:
 | We need a new term for LLMs actually solving hard problems.
| When I help Claude Code solve a nasty bug it doesn't feel like
| "vibing" as in "I tell the model what I want the website to
| look like". It feels like sniping as in "I spot for Claude
| Code, telling how to adjust for wind, range, and elevation so
| it can hit my far away target".
| scosman wrote:
| From what I recall of the original Karpathy definition, it's
| only "vibe coding" if you aren't reading the code it produces
| lukan wrote:
 | Yes, I vote for keeping that definition and not throwing it
 | all into one box. LLM-assisted coding is not vibe coding.
| kaspermarstal wrote:
| Cool, I did not know that. That makes perfect sense.
| JimDabell wrote:
| You're right. It's explicitly about not caring about the
| code:
|
| > There's a new kind of coding I call "vibe coding", where
| you fully give in to the vibes, embrace exponentials, and
| forget that the code even exists. It's possible because the
| LLMs (e.g. Cursor Composer w Sonnet) are getting too good.
| Also I just talk to Composer with SuperWhisper so I barely
| even touch the keyboard. I ask for the dumbest things like
| "decrease the padding on the sidebar by half" because I'm
| too lazy to find it. I "Accept All" always, I don't read
| the diffs anymore. When I get error messages I just copy
| paste them in with no comment, usually that fixes it. The
| code grows beyond my usual comprehension, I'd have to
| really read through it for a while. Sometimes the LLMs
| can't fix a bug so I just work around it or ask for random
| changes until it goes away. It's not too bad for throwaway
| weekend projects, but still quite amusing. I'm building a
| project or webapp, but it's not really coding - I just see
| stuff, say stuff, run stuff, and copy paste stuff, and it
| mostly works.
|
| -- https://x.com/karpathy/status/1886192184808149383
| fuzzy_biscuit wrote:
| So we're the spotter in that metaphor. I like it!
| riskable wrote:
| "spotter coding" or perhaps "checker coding"?
|
| "verified vivisection development" when you're working with
| older code :D
| demarq wrote:
| - backseat engineer
|
| - keyboard princess
|
| - Robin to the Batman
|
| - meatstack engineer
|
| - artificial manager
| cyanydeez wrote:
 | Do you think people will read your flowery prose and suspect
 | you're just part of the dead Internet? We are still waiting for
 | all these AI-enhanced apps to flood the market.
| odie5533 wrote:
| I find the fast models good for rapidly iterating UI changes with
| voice chat. Like "add some padding above the text box" or "right
| align the button". But I find the fast models useless for deep
| coding work. But a fast model has its place. Not $50/month
 | though. Cursor has Composer 1 and Grok Code Fast for free. Not
| sure what $50/month gets me that those don't. I liked the stealth
| supernova model a lot too.
| gardnr wrote:
| GLM 4.6 isn't a "fast" model. It does well in benchmarks vs
| Sonnet 4.5.
|
| Cerebras makes a giant chip that runs inference at unreal
| speeds. I suspect they run their cloud service more as an
| advertising mechanism for their core business: hardware. You
| can hear the founder describing their journey:
|
| https://podcasts.apple.com/us/podcast/launching-the-fastest-...
| bn-l wrote:
| Composer and grok fast are not free.
| versteegen wrote:
| grok-code-fast-1 is free (for a limited time) in opencode
| zen, has been for a while. (Originally it was billed as a
| test/to gather training data for xAI.) But right now GLM 4.6
| is also temporarily free there (hosted by opencode
| themselves; they call it "big pickle", and there's no data
| collection), has been for weeks, and GLM 4.6 is _far_ better
| (better than Haiku and not very far off Sonnet), and still
| very fast, so I have no use for gcf1 anymore.
| odie5533 wrote:
| They are both free in Cursor right now.
| seduerr wrote:
| It's just amazing to have a reliable model at the speed of light.
| Was waiting for such a great model for a long time!
| dust42 wrote:
| 1000 tokens/s is pretty fancy. I just wonder how sustainable the
| pricing is or if they are VC-fueled drug dealers trying to
| convert us into AI-coholics...
|
| It is definitely fun playing with these models at these speeds.
| The question is just how far from real pricing is 500M tokens for
| $50?
|
| Either way the LLM usage will grow for some time to come and so
| will grow energy usage. Good times for renewables and probably
| fusion and fission.
|
| Selling shovels in a gold rush was always reliable business.
 | Cerebras is only valued at $8.1B as of one month ago. Compared to
| Nvidia that seems pocket change.
| ojosilva wrote:
 | I've been a customer of the $200 Max plan for 2 months. I fell in
| love with the Qwen3 Coder 480B model, Q3C, _that_ was fast, twice
| the speed of GLM. GLM 4.6 is just meh, I mean, way faster than
| competitors, and practically at Sonnet 4.x level in coding and
| tool use, but not a life-changing difference.
|
| Yes, Qwen3 made more mistakes than GLM, around 15% more in my
| quick throwaway evals, but it was a more professional model
| overall, more polished in some aspects, better with international
| languages, and being non-reasoning, ideal for a lot of tasks
 | through the API that could be run instantaneously. I think the
| Qwen line of models is a more consistent offering, with other
| versions of the model for 32B and VL, now a 80B one, etc. I guess
| the problem was that Qwen Max was closed source, signalling that
| Qwen may not have a way forward for Cerebras to evolve. GLM 4.6
| covers precisely that hole. Not that Cerebras is a model provider
| of any kind, their service levels are "buggy" (right now it's
| been down for 1h and probably won't be fixed until California
| wakes up at 9am PST). So it does feel like we are not the
| customers, but the product, a marketing stunt for them to get
| visibility for their tech.
|
| GLM feels like they (Z.ai) are just distilling whatever they can
| get into it. GLM switches to Chinese sometimes, or just cuts off.
| It does have a bit of more "intelligence" than Q3C, but not
| enough to say it solves the toughest problems. Regardless, for
| tough nuts to crack I use my Codex Plus plan.
|
| Ex: In one of my evals, it took 15 turns to solve an issue using
 | Cerebras Q3C. It took 12 turns with GLM, but overall GLM takes 2x
| the time, so instead of doing a full task from zero-to-commit in
| say 15 minutes, it takes 24 minutes.
|
| In another eval (Next.js CSS editing), my task with Q3C coder was
| done in 1:30 minutes. GLM 4.6 took 2:24. The same task in Codex
| took 5:37 minutes, with maybe 1 or 2 turns. Codex DX is that of
| working unattended: prompt it and go do something else, there's a
| good chance it will get it right after 0, 1 or 2 nudges. With
| CC+Cerebras it's a completely different DX, given the speed it
| feels just like programming, but super-fast. Prompt, read the
| change, accept (or don't), accept, accept, accept, test it out,
| accept, prompt, accept, interrupt, prompt, accept, and 1:30 min
| later we're done.
|
| Like I said I use Claude Code + a proxy (llmux). The coding agent
| makes a HUGE difference, and CC is hands-down the best agent out
| there.
| andai wrote:
| Asked GLM-4.6 to introduce itself. "Hello! I'm glad you asked.
| I'm a large language model, trained by Google. (...)"
|
| It seems to have been fine-tuned on Claude Code interactions as
| well. Though unfortunately not much of Claude's coding style
| itself? (I wish!)
| KronisLV wrote:
| Been using Cerebras for quite a while now, previously with their
| Qwen3 Coder and now GLM 4.6, overall the new model feels better
| at tool calls and code in general. Fewer tool call failures with
| RooCode (should also apply to Cline and others too), but
| obviously still not perfect.
|
| Currently on the 50 USD tier, very much worth the money, am kinda
| considering going for the 200 USD tier, BUT GPT-5 and Sonnet 4.5
| and Gemini 2.5 Pro still feel needed occasionally, so it'd be
| stupid to go for the 200 USD tier and not use it fully and still
| have to pay up to around 100 USD for tokens in the other models
| per month. Maybe that will change in the future, when dealing
| with lots of changes (e.g. needing to make a component showcase
 | of 90 components, but with enough differences between them to
| make codegen unviable) Cerebras is already invaluable.
|
| Plus the performance actually makes iterating faster, to a degree
| where I believe that other models should also eventually run this
| fast. Oddly enough, the other day their 200 USD plan showed as
| "Sold out", maybe they're scaling up the capacity gradually. I
| really hope they never axe the Code plans, they literally have no
| competition for this mode of use. Maybe they'll also have a 100
| USD plan some day, one can hope, but maybe offering _just_ the
| 200 plan is better from an upsell perspective for them.
|
| Oh also, when I spill over my daily 24M limits, please let me use
| the Pay2Go thing on top of that, if instead of 24M tokens some
| day I need 40M, I'd pay for those additional ones.
| andai wrote:
| I have been using Z.ai's (creators of GLM) "Coding Plan" with
| GLM-4.6. $3/month and 3x higher limits than Claude Pro, they say.
|
| (I have both and haven't run into any limits yet, so I'm probably
| not a very heavy user.)
|
| I'm quite impressed with the model. I have been using GLM-4.6 in
| Claude Code instead of Sonnet, and finding it fine for my use
| cases. (Simple scripting and web stuff.)
|
| (Note: Z.ai's GLM doesn't seem to support web search (fails) or
| image recognition (hallucinates). To fix that I use Claude Code
| Router and hooked up those two features to Gemini (free)
| instead.)
|
| I find that Sonnet produces much nicer code. I often find myself
| asking Sonnet to clean up GLM's code. More recently, I just got
| the Pro plan for Claude so I'm mostly just using Sonnet directly
| now. (Haven't had the rate limits yet but we'll see!)
|
| So in my experience if you're not too fussy you can currently get
| "80% of Claude Code" for like $3/month, which is pretty nuts.
|
| GLM also works well in Charm Crush, though it seems to be better
| optimized for Claude Code (I think they might have fine tuned
| it.)
|
| ---
|
| I have tested Kimi K2 at 1000 tok/s via OpenRouter, and it's
| bloody amazing, so I imagine this supercharged GLM will be great
| too. Alas, $50!
| realo wrote:
| I just asked glm-4.6 how to setup a z.ai api key with claude
| code and it kept on saying it has no idea what claude code
| is...
|
| Quite funny, actually.
| andai wrote:
| Ask and you shall receive!
|
| https://docs.z.ai/devpack/tool/claude
|
 | tl;dr, in your Claude Code settings:
 |
 |     "env": {
 |       "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key",
 |       "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic"
 |     }
|
| Although if you want an Actually Good Experience I recommend
| using Claude Code Router
|
| https://github.com/musistudio/claude-code-router
|
| because it allows you to intercept the requests and forward
| them to other models. (e.g. GLM doesn't seem to support
| search or images, so I use Gemini's free tier for that.)
|
| (CCR just launches Claude with the base url set to a local
| proxy. The more adventurous reader can also set up his own
| proxy... :)
| realo wrote:
| Benchmark score
|
| Humans : 1
|
| AI : 0
| brianjking wrote:
| Does this allow me to use Claude Code as the orchestration
| harness with GLM 4.6 as the LLM along with other LLMs?
| Seems so based on your description, thanks for the link.
| andai wrote:
| Explain like I'm 5?
| andai wrote:
| I have created a "semi-interactive" AI coding workflow.
|
| I write what I want, the LLM responds with edits, my 100 lines of
| Python implement them in the project. It can edit any number of
| files in one LLM call, which is very nice (and very cheap and
| fast).
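 |
 | (The general shape, for anyone curious - the model is asked to
 | reply in a fixed format, e.g. a FILE: header followed by the new
 | file body, and the wrapper writes it out; the exact format here
 | is made up for illustration:)
 |
 |     import pathlib, re
 |
 |     # expect the model to reply with blocks like:
 |     #   FILE: src/app.py
 |     #   <entire new file contents>
 |     #   END FILE
 |     BLOCK = re.compile(r"FILE: (.+?)\n(.*?)\nEND FILE", re.DOTALL)
 |
 |     def apply_edits(llm_reply: str) -> None:
 |         for path, body in BLOCK.findall(llm_reply):
 |             p = pathlib.Path(path.strip())
 |             p.parent.mkdir(parents=True, exist_ok=True)
 |             p.write_text(body)  # one call can touch any number of files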
|
 | I tested this with Kimi K2 on Groq (also 1000 tok/s) and was
 | very impressed.
|
| I want to say this is the best use case for fast models -- that
| frictionlessness in your workflow -- though it burns tokens
| pretty fast working like that! (Agentic is even more nuts with
| how fast it burns tokens on fast models, so the $50 is actually
| pretty great value.)
| lvl155 wrote:
 | I want to use Cerebras but it's just not production-ready. I
 | will root for them from the sidelines for now.
| hereme888 wrote:
| So basically "change this in the UI" and you see it happen almost
| real time.
| sheepscreek wrote:
| I used their $50 plan and with the previously offered Qwen3 coder
| 480B. While fast - none of the "supported" tools I tried were
| able to use it in a way that didn't hit the per minute request
| limit in a few seconds. It was incredibly frustrating. For the
| record, I tried OpenCoder, VSCode, Quen Coder CLI, octofriend and
| a few others I don't remember.
|
| Fast forward to now, when GLM 4.6 has replaced Qwen3 coder in
| their subscription plan. My subscription was still active so I
| wanted to give this setup another shot. This time though, I
| decided to give Cline a try. I've got to say, I was very
| pleasantly surprised - it worked really well out of the box. I
| guess whatever Cline does behind the scenes is more conducive to
| Cerebra's API. I used Claude 4.5 + Thinking for "Plan" mode and
| Cerebras/GLM 4.6 for "Act".
|
| The combo feels solid. Much better than GPT-5 Codex alone. I
| found codex to be very high quality but so godawful slow for long
| interactive coding sessions. The worst part is I cannot see what
| it's "thinking" to stop it in its tracks when it's going in the
| wrong direction.
|
 | In essence, Cerebras + GLM 4.6 feels like Grok Fast 1 on
| steroids. Just couple it with a frontier + thinking model for
| planning (Claude 4.5/GPT-5/Gemini Pro 2.5).
|
| One caveat: sometimes the Cerebras API starts choking "because of
| high demand" which has nothing to do with hitting subscription
| limits. Just an FYI.
|
| Note: For the record, I was coding on a semi-complex Rust
 | application tuned for a low-latency mix of IO + CPU workload. The
| application is multi-threaded and makes extensive use of locking
| primitives and explicit reference counting (Arc). All models were
| able to handle the code really well given the constraints.
|
| Note2: I am also evaluating Synthetic's (synthetic.new) open-
| source model inference subscription and I like it a lot. There's
| a large number of models to choose from, including gpt-oss-120
| and their usage limits are very very generous. To the point that
| I don't think I will ever hit them.
| anonzzzies wrote:
 | I run opencode with Cerebras and it goes on and on. No issues
 | so far. It is not Codex or Claude Code, but it's so fast that it
 | allows for a much more interactive experience. Our results with
 | Claude Code and Codex are much better quality, but when just
 | vibing (let's use the term), opencode with Cerebras is more fun.
| zixuanlimit wrote:
| Have you tried Z.ai's official coding plan? Any differences in
| performance?
| Alifatisk wrote:
 | I don't know about GLM 4.6; some have said they are bench-maxing,
 | so I kinda lost interest in trying them out. Does it live up
 | to its reputation? Is it really as good as Sonnet 4.5?
| jauntywundrkind wrote:
| I'm curious to know what the cost is to switch contexts. Pure
| performance is amazing, but how long does it take to get a system
 | going, to load the model and build context? Which systems can
 | switch contexts while keeping the model loaded non-destructively,
 | vs. when is switching destructive?
 |
 | I have a lot of questions about how models are run at scale; so
 | curious to know more. With such a massive wafer-as-chip as
 | Cerebras's, it feels like switching might be even more costly.
 | Or maybe there's some brilliant strategy to keep multiple
 | contexts loaded that it can flip between! Inventorying and using
 | so much RAM spread out that widely is its own challenge!
| w-m wrote:
| I wanted to try GLM 4.6 through their API with Cline, before
| spending the $50. But I'm getting hit with API limits. And now
| I'm noticing a red banner "GLM4.6 Temporarily Sold Out. Check
| back soon." at cloud.cerebras.ai. HN hug of death, or was this
| there before?
| vladgur wrote:
 | Where I'm unfortunately hitting a wall with these assistants is
 | developing a desktop app in, say, Java Swing - Claude cannot
 | verify that the UI it produced is actually functional.
| NaomiLehman wrote:
| what is your AI dev stack? have you tried Kilo Code?
___________________________________________________________________
(page generated 2025-11-08 23:01 UTC)