[HN Gopher] Cerebras Code now supports GLM 4.6 at 1000 tokens/sec
       ___________________________________________________________________
        
       Cerebras Code now supports GLM 4.6 at 1000 tokens/sec
        
       Author : nathabonfim59
       Score  : 166 points
       Date   : 2025-11-08 00:00 UTC (23 hours ago)
        
 (HTM) web link (www.cerebras.ai)
 (TXT) w3m dump (www.cerebras.ai)
        
       | alyxya wrote:
        | It would be nice if there was more information provided on that
        | page. I assume this is just the output token generation speed. Is
        | it using speculative decoding to get to 1000 tokens/sec? Is there
        | lossy quantization being used to speed things up? I tend to think
        | the number of tokens per second a model can generate is
        | relatively low on the list of things I care about, when things
        | like model/inference quality and the harness play a much bigger
        | role in how I feel about using a coding agent.
        
         | cschneid wrote:
         | Yes this is the output speed. Code just flashes onto the page,
         | it's pretty impressive.
         | 
         | They've claimed repeatedly in their discord that they don't
         | quantize models.
         | 
          | The speed does change how you interact with it, I think. I had
          | this new GLM model hooked up to opencode as the harness with
          | their $50/mo subscription plan. It was seriously fast at
          | answering questions, although there are still big pauses in the
          | workflow when the per-minute request cap is hit.
          | 
          | I got a meaningful refactor done, maybe a touch faster than I
          | would have in Claude Code + Sonnet? But my human interaction
          | with it felt like the slow part.
        
           | alyxya wrote:
            | The human interaction part is one of the main limitations on
            | speed: the more autonomous a model can be, the faster it is
            | for me.
        
       | behnamoh wrote:
       | If they don't quantize the model, how do they achieve these
       | speeds? Groq also says they don't quantize models (and I want to
       | believe them) but we literally have no way to prove they're
       | right.
       | 
        | This is important because their premium $50 price (as opposed to
        | $20 for Claude Pro or ChatGPT Plus) should be justified by the
        | speed. GLM 4.6 is fine, but I still don't think it's at the
        | GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on
        | Cerebras it should be mainly because of speed.
       | 
       | What kind of workflow justifies this? I'm genuinely curious.
        
         | cschneid wrote:
          | So apparently they have custom hardware: basically gigantic
          | chips, at the scale of a whole wafer at a time. Presumably they
          | keep the entire model right on chip, effectively in L3 cache or
          | whatever. So the memory bandwidth is absurdly fast, allowing
          | very fast inference.
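          | 
          | As a rough back-of-envelope (with made-up placeholder numbers,
          | not published Cerebras or GLM specs), single-stream decode is
          | memory-bound, so the ceiling is roughly bandwidth divided by
          | the bytes of weights touched per token:
          | 
          |     # Roofline sketch for memory-bound decode; every number here
          |     # is an illustrative placeholder, not a real spec.
          |     active_params = 30e9    # hypothetical "active" params per token (MoE)
          |     bytes_per_param = 2     # bf16, i.e. no quantization
          |     bandwidth = 100e12      # hypothetical on-chip SRAM bandwidth, bytes/s
          | 
          |     bytes_per_token = active_params * bytes_per_param
          |     print(bandwidth / bytes_per_token)  # ~1667 tok/s upper bound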
         | 
          | It's more expensive to get the same raw compute as from a
          | cluster of Nvidia chips, but the Nvidia clusters don't have the
          | same peak throughput.
         | 
         | As far as price as a coder, I am giving a month of the $50 plan
         | a shot. I haven't figured out how to adapt my workflow yet to
         | faster speeds (also learning and setting up opencode).
        
           | bigyabai wrote:
           | For $50/month, it's a non-starter. I hope they can find a way
           | to use all this excess bandwidth to put out a $10 equivalent
           | to Claude Code instead of a 1000 tok/s party trick I can't
           | use properly.
        
             | typpilol wrote:
              | I feel the same, and it's also why I can't understand all
              | these people using small local models.
              | 
              | Every local model I've used, and even most open-source
              | ones, is just not good.
        
               | behnamoh wrote:
               | the only good-enough model I still use it gpt-
               | oss-120b-mxfp4 (not 20b) and glm-4.6 at q8 (not q4).
               | 
               | quantization ruins models and some models aren't that
               | smart to begin with.
        
               | csomar wrote:
               | GLM-4.6 is on par with Sonnet 4.5. Sometimes it is
               | better, sometimes it is worse. Give it a shot. It's the
               | only model that made me (almost) ditch Claude. The only
               | problem is, Claude Code is still the best agentic program
               | in town and search doesn't function without a proper
               | subscription.
        
               | mcpeepants wrote:
               | z.ai hosted GLM 4.6 works great with claude code, drops
               | right in
        
               | esafak wrote:
               | Have you tried opencode?
        
               | DeathArrow wrote:
               | Have you tried Claude Code Router with GLM 4.6?
               | 
               | https://github.com/musistudio/claude-code-router
        
             | xadhominemx wrote:
             | $600 per year is a trivial cost for a professional tool
        
         | nine_k wrote:
         | > _What kind of workflow justifies this?_
         | 
         | Think about waiting for compilation to complete: the difference
         | between 5 minutes and 15 seconds is dramatic.
         | 
         | Same applies to AI-based code-wrangling tasks. The preserved
         | concentration may be well worth the $50, especially when paid
         | by your employer.
        
           | behnamoh wrote:
           | they should offer a free trial so we build confidence in the
           | model quality (e.g., to make sure it's not
           | nerfed/quantized/limited-context/etc.).
        
             | conception wrote:
             | A trial is literally front and center on their website.
        
             | NitpickLawyer wrote:
             | You can usually use them with things like openrouter. Load
             | some credits there and use the API in your preferred IDE
             | like you'd use any provider. For some quick tests it's
             | probably be <5$ for a few coding sessions so you can check
             | out the capabilities and see if it's worth it for you.
        
               | behnamoh wrote:
               | openrouter charges me $12 on a $100 credit...
        
         | NitpickLawyer wrote:
         | > What kind of workflow justifies this? I'm genuinely curious.
         | 
          | Any workflow where verification is faster / cheaper than
          | generation. If you have a well tested piece of code and want to
          | "refactor it to use such and such paradigm", you can run n
          | queries against a faster model and pick the best result.
          | 
          | My colleagues that do frontend use faster models (not this one
          | specifically, but they did try fast-code-1) to build
          | components. Someone worked out a workflow w/ worktrees where
          | the model generates n variants of a component and displays
          | them next to each other. A human can "at a glance" choose which
          | one they like, and sometimes pick and choose from multiple
          | variants (something like passing it to Claude and saying "keep
          | the styling of component A but the data management of component
          | B"). At the end of the day it's faster / cheaper than having
          | CC do all that work.
        
         | msp26 wrote:
         | Groq does quantise. Look at this benchmark from moonshotai for
         | K2 where they compare their official implementation to third
         | party providers.
         | 
         | https://github.com/MoonshotAI/K2-Vendor-Verifier
         | 
         | It's one of the lowest rated on that table.
        
         | threeducks wrote:
         | > but we literally have no way to prove they're right
         | 
          | Of course we do. Just run a benchmark with Cerebras/Groq and
          | compare to the results produced in a trusted environment. If
          | the scores are equal, the model is either unquantized, or
          | quantized so well that we cannot tell, in which case it does
          | not matter.
          | 
          | For example, here is a comparison of different providers for
          | gpt-oss-120b, with differences of over 10% between the best and
          | worst providers.
         | 
         | https://artificialanalysis.ai/models/gpt-oss-120b/providers#...
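          | 
          | A minimal version of that check looks something like this (the
          | base URLs, model name, and the one-item eval set are
          | placeholders; a real comparison would use a proper benchmark):
          | 
          |     # Run the same tiny eval against two OpenAI-compatible
          |     # providers and compare pass rates; a large gap suggests a
          |     # degraded (e.g. quantized) deployment.
          |     from openai import OpenAI
          | 
          |     providers = {
          |         "provider_a": OpenAI(base_url="https://a.example/v1", api_key="..."),
          |         "provider_b": OpenAI(base_url="https://b.example/v1", api_key="..."),
          |     }
          |     evals = [("What is 17 * 23? Answer with just the number.", "391")]
          | 
          |     for name, client in providers.items():
          |         passed = 0
          |         for prompt, expected in evals:
          |             resp = client.chat.completions.create(
          |                 model="some-open-weights-model",
          |                 messages=[{"role": "user", "content": prompt}],
          |             )
          |             passed += expected in resp.choices[0].message.content
          |         print(f"{name}: {passed}/{len(evals)} passed")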
        
         | xadhominemx wrote:
         | It's because the model weights and KV cache are stored in SRAM.
         | It's extremely expensive per token.
        
       | niklassheth wrote:
       | This is more evidence that Cognition's SWE-1.5 is a GLM-4.6
       | finetune
        
         | prodigycorp wrote:
         | Can you provide more context for this? (eg Was SWE-1.5 released
         | recently? Is it considered good? Is it considered fast? Was
         | there speculation about what the underlying model was? How does
         | this prove that it's a GLM finetune?)
        
           | mhuffman wrote:
           | I suspect they are referencing the 950tok/s claim on
           | Cognition's page.
        
             | prodigycorp wrote:
             | Ah. Thx. Blogpost for others:
             | https://cognition.ai/blog/swe-1-5
             | 
              | Takeaway is that this is a Sonnet-ish model at 10x the speed.
        
           | NitpickLawyer wrote:
            | People saw Chinese characters in generations made by SWE-1.5
            | (Windsurf's model) and also in the one made by Cursor. This
            | led to suspicions that the models are finetunes of Chinese
            | models (which makes sense, as there aren't many strong US/EU
            | coding models out there). GLM 4.5/4.6 are the "strongest"
            | coding models atm (with DSv3.2 and Qwen somewhat behind), so
            | that's where the speculation came from. Cerebras serving them
            | at roughly the same speeds kinda adds to that story (e.g. if
            | it were something heavier like DSv3 or Kimi K2 it would be
            | slower).
        
             | prodigycorp wrote:
             | Really appreciate this context. Thank you!
        
         | nl wrote:
          | Not at all. Any model with somewhat-similar architecture and
          | roughly similar size should run at the same speed on Cerebras.
          | 
          | It's like saying Llama 3.2 3B and Gemma 4B are fine-tunes of
          | each other because they run at similar speeds on Nvidia
          | hardware.
        
       | gatienboquet wrote:
       | Vibe Slopping at 1000 tokens per second
        
         | mmaunder wrote:
         | Yeah honestly having max cognitive capability is #1 for me.
         | Faster tokens is a distant second. I think anyone working on
         | creating valuable unique IP feels this way.
        
           | conception wrote:
            | This is where agents actually shine. Having a smart model
            | write code and plan is great, and then having Cerebras do
            | the command-line work, write documents effectively instantly,
            | and handle other simple tasks speeds things up quite a bit.
        
       | lordofgibbons wrote:
       | At what quantization? And if it is in fact quantized below fp8,
       | how is the performance impacted on all the various benchmarks?
        
         | antonvs wrote:
         | They claim they don't use quantization.
         | 
         | The reason for their speed is this chip:
         | https://www.cerebras.ai/chip
        
       | renewiltord wrote:
       | Unfortunately for me, the models on Cerebras weren't as good as
       | Claude Code. Speedy but I needed to iterate more. Codex is
       | trustworthy and slow. Claude is better at iterating. But none of
       | the Cerebras models at the $50 tier were worth anything for me.
       | They would have been something if they'd just come out but we
       | have these alternatives now.
        
         | elzbardico wrote:
         | I don't care. I want LLMs to help with the boring stuff, the
         | toil. It may not be as intelligent as Claude, but if it takes
         | care of the boring stuff, and it is fast while doing it, I am
         | happy. Use it surgically, do the top-down design, and just let
         | it fill the blanks.
        
           | renewiltord wrote:
           | Give it a crack. It took a lot of iteration for it to write
           | decent code. If you figure out differences in prompting
           | technique, do share. I was really hoping for the speed to
           | improve a lot of execution - because that's genuinely the
           | primary problem for me. Unfortunately, speed is great but
           | quality wasn't great for me.
           | 
           | Good luck. Maybe it'll do well in some self-directed agent
           | loop.
        
       | Flux159 wrote:
       | Was able to sign up for the Max plan & start using it via
       | opencode. It does a way better job than Qwen3 Coder in my
       | opinion. Still extremely fast, but in less than 1 hour I was able
       | to use 7M input tokens, so with a single agent running I would be
       | able easily to pass that 120M daily token limit. The speed
       | difference between Claude Code is significant though - to the
       | point where I'm not waiting for generation most of the time, I'm
       | waiting for my tests to run.
       | 
       | For reference, each new request needs to send all previous
       | messages - tool calls force new requests too. So it's essentially
       | cumulative when you're chatting with an agent - my opencode
       | agent's context window is only 50% used at 72k tokens, but
        | Cerebras's tracking online shows that I've used 1M input tokens
       | and 10k output tokens already.
        
         | NitpickLawyer wrote:
         | > For reference, each new request needs to send all previous
         | messages - tool calls force new requests too. So it's
         | essentially cumulative when you're chatting with an agent - my
         | opencode agent's context window is only 50% used at 72k tokens,
          | but Cerebras's tracking online shows that I've used 1M input
         | tokens and 10k output tokens already.
         | 
         | This is how every "chatbot" / "agentic flow" / etc works behind
         | the scenes. That's why I liked that "you should build an agent"
         | post a few days ago. It gets people to really understand what's
         | behind the curtain. It's requests all the way down, sometimes
         | with more context added, sometimes with less (subagents & co).
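          | 
          | A toy version of that loop, which also shows why billed input
          | tokens balloon even when the visible context stays half empty
          | (the client, model, and messages are placeholders):
          | 
          |     # Every turn re-sends the full history, so billed input
          |     # tokens grow roughly with the square of the number of turns.
          |     from openai import OpenAI
          | 
          |     client = OpenAI(base_url="https://provider.example/v1", api_key="...")
          |     messages = [{"role": "user", "content": "Refactor the billing module."}]
          |     total_input = 0
          | 
          |     for turn in range(20):
          |         resp = client.chat.completions.create(model="some-model",
          |                                               messages=messages)
          |         total_input += resp.usage.prompt_tokens  # whole history billed again
          |         messages.append({"role": "assistant",
          |                          "content": resp.choices[0].message.content})
          |         # In a real agent, tool calls run here and their output is
          |         # appended, making the next request's prompt even longer.
          |         messages.append({"role": "user", "content": "tool output / next step"})
          | 
          |     print(total_input)  # far larger than the final context length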
        
         | zaptrem wrote:
         | They don't have prefix caching? Claude and Codex have this.
        
           | versteegen wrote:
           | At those speeds, it's probably impossible. It would require
           | enormous amounts of memory (which the chip simply doesn't
           | have, there's no room for it) or rather a lot of bandwidth
           | off-chip to storage, and again they wouldn't want to waste
           | surface area on the wiring. Bit of a drawback of increasing
           | density.
        
       | elzbardico wrote:
        | The 50 dollars/month Cerebras Code plan, first with Qwen3 Coder
        | 480B, now with GLM, is my secret weapon.
        | 
        | Stalin used to say that in war "quantity has a quality all its
        | own". And I think that in terms of coding agents, speed is a
        | quality all its own too.
        | 
        | Maybe not for blind vibe coding, but if you are a developer, and
        | are able to understand the code the agent generates and change
        | it, the fast feedback of fast inference is a game changer. I
        | don't care if Claude is better than GLM 4.6, fast iterations are
        | king for me now.
        | 
        | It is like moving from DSL to gigabit fiber FTTH.
        
       | divmain wrote:
       | I have been an AI-coding skeptic for some time. I always
       | acknowledged LLMs as useful for solving specific problems and
       | making certain things possible that weren't possible before. But
       | I've not been surprised to see AI fail to live up to the hype.
       | And I never had a personally magical moment - an experience that
       | shifted my perspective a la the peak end rule.
       | 
       | I've been using GLM 4.6 on Cerebras for the last week or so,
       | since they began the transition, and I've been blown away.
       | 
       | I'm not a vibe coder; when I use AI coding tools, they're in the
       | hot path. They save me time when whipping up a bash script and I
       | can't remember the exact syntax, or for finding easily
       | falsifiable answers that would otherwise take me a few minutes of
       | reading. But, even though GLM 4.6 is not as smart as Sonnet 4.5,
       | it is smart enough. And because it is so fast on Cerebras, I
       | genuinely feel that it augments my own ability and productivity;
       | the raw speed has considerably shifted the tipping point of time-
       | savings for me.
       | 
       | YMMV, of course. I'm very precise with the instructions I
       | provide. And I'm constantly interleaving my own design choices
       | into the process - I usually have a very clear idea in my mind of
       | what the end result should look like - so, in the end, the code
       | ends up how I would have written it without AI. But building
       | happens much faster.
       | 
       | No affiliation with Cerebras, just a happy customer. Just
       | upgraded to the $200/mo plan - and I'll admit that I was one that
       | scoffed when folks jumped on the original $200/mo Claude plan. I
       | think this particular way of working with LLMs just fits well
       | with how I think and work.
        
         | ramraj07 wrote:
          | Your post has inspired me to check them out. How do you use it:
          | with their UI, or to power some other open source tool?
          | 
          | Do you suggest that this thing is so fast it's simpler now to
          | quickly work on one thing at a time, instead of the 5
          | background tools running in parallel, which might have been a
          | pattern we invented because these things are so slow?
        
           | bananapub wrote:
           | you want to be using something like opencode in a terminal,
           | not the web ui.
           | 
           | you'll need to try it and see what the speed does to your
           | workflow.
        
           | divmain wrote:
           | I've been using the crush TUI primarily. I like that I have
           | the flexibility to switch to a smarter model on occasion -
            | for a while I hesitated to pick up AI coding at all, simply
           | because I didn't want to be locked into a model that could be
           | immediately surpassed. It's also customizable enough with
           | sane defaults.
        
         | realo wrote:
          | I was an AI skeptic too a year ago, but recently I wanted a
          | Windows exe program to do the same as a complicated bash script
          | on Linux.
          | 
          | I gave the bash script to Claude Code, which immediately
          | started implementing something in the Zig language. After a few
          | iterations, I had Zig source code that compiled on Linux,
          | produced a Windows exe, and perfectly mimicked the bash script.
          | 
          | I know nothing about Zig programming.
        
           | aurareturn wrote:
           | I've been maintaining my company's Go repos using Claude
           | after our Go developer left. I don't know anything about Go.
        
       | mythz wrote:
        | AI moves so fast that Vibe Coding still has a negative stigma
        | attached to it, but even after 25 years of development, I'm not
        | able to match the productivity of getting AI to implement the
        | features I want. It's basically like sending multiple devs out to
        | do work for you: you just tell them what you want and provide
        | iterative feedback until they implement all the features you
        | want, in the way you want, and fix all the issues you find along
        | the way, with the tests and all the automation and deployment
        | scripts to go with it.
       | 
       | This is clearly the future of Software Development, but the
       | models are so good atm that the future is possible now. I'm still
       | getting used to and having to rethink my entire dev workflow for
        | maximum productivity, and whilst I wouldn't unleash AI agents on
        | a decade-old code base, all my new web apps will likely end up
        | being AI-first unless there's a very good reason why it wouldn't
        | provide a net benefit.
        
         | namanyayg wrote:
          | Exactly, Codex gpt-5-high is quite like sending out smart devs.
          | It still makes mistakes, and when it does they're extremely
          | stupid ones, but I now treat the code it generates as throwaway
          | and just reroll when it does something dumb.
        
         | dust42 wrote:
          | It just depends on what you are doing. A greenfield React app
          | in TypeScript with a CRUD API behind it? The LLMs are a mind-
          | blowing assistant, and 1000 t/s is crazy.
          | 
          | You are doing embedded development or anything else not as
          | mainstream as web dev? LLMs are still useful but no longer
          | mind-blowing, and they often produce hallucinations. You need
          | to read every line of their output. 1000 t/s is still crazy,
          | but no longer always in a good way.
         | 
         | You are doing stuff which the LLMs haven't seen yet? You are on
         | your own. There is quite a bit of irony in the fact that the
         | devs of llama.cpp barely use AI - just have a look at the
         | development of support for Qwen3-Next-80B [1].
         | 
         | [1] https://github.com/ggml-org/llama.cpp/pull/16095
        
           | almostgotcaught wrote:
           | I've said it before but no one takes it seriously: LLMs are
           | only useful if you're building something that's already in
           | the training set ie _already commodity_. In which case _why
           | are you building it_???
        
             | whiterook6 wrote:
              | It's not that the product you're building is a commodity.
              | It's that the tools you're using to build it are. Why not
              | build a landing page using HTML and CSS and Tailwind? Why
              | not use Swift to make an app? Why not write an AWS Lambda
              | using JavaScript?
        
             | mythz wrote:
             | "LLMs are only useful..."
             | 
             | Is likely why no one takes you seriously, as it's a good
             | indication you don't have much experience with them.
        
             | nurettin wrote:
             | Do you avoid writing anything that the programming
             | community has ever built? How are you alive???
        
             | antonvs wrote:
             | The obvious point that you're missing is that there are
             | literally infinite ways to assemble software systems from
             | the pieces that an LLM is able to manipulate due to its
             | training. With minor guidance, LLMs can put together an
             | unlimited number of novel combinations. The idea that the
             | entire end product has to be in the training set is
             | trivially false.
        
             | becquerel wrote:
             | Because I'm getting paid to.
        
             | philipp-gayret wrote:
             | It's true, when I was working with LLMs on a novel idea it
             | said sorry I can't help you with that!
        
             | dboreham wrote:
              | Historically a big AI skeptic here: what you say is just
              | not true now. LLMs aren't just regurgitating their training
              | data per se. I've used LLMs on _languages_ the LLM has not
              | seen, and it performed well. I've used LLMs on code that
              | is about as far from a React todo app as it's possible to
              | get.
        
             | energy123 wrote:
             | Bell Labs should have fired all their toilet cleaners.
             | Nothing innovative about a toilet.
        
           | lifthrasiir wrote:
            | There aren't many things that LLMs haven't really seen yet,
            | however. I have successfully used LLMs to develop a large
            | portion of a WebAssembly 3.0 interpreter [1], which surely
            | isn't in their training set because WebAssembly 3.0 was only
            | released a few months ago. Sure, it took me tons of guidance,
            | but it was useful enough for me.
           | 
           | Even llama.cpp is not a truly novel thing to LLMs, there are
           | several performant machine learning model executors available
           | in their training sets anyway, and I'm sure llama.cpp _can_
           | benefit from LLMs if they want; they just chose not to.
           | 
           | [1] https://github.com/lifthrasiir/wah/
        
           | koito17 wrote:
           | > You are doing embedded development or anything else not as
           | mainstream as web dev? LLMs are still useful but no longer
           | mind blowing and often produce hallucinations.
           | 
           | I experienced this with Claude 4 Sonnet and, to some extent,
           | gpt-5-mini-high.
           | 
           | When able to run tests against its output, Claude produces
           | pretty good Rust backend and TypeScript frontend code.
           | However, Claude became borderline _unproductive_ once I
           | started experimenting with uefi-rs. Other LLMs, like
           | gpt-5-mini-high, did not fare much better, but they were at
           | least capable of admitting _lack of knowledge_. In
           | particular, GPT-5 would provide output akin to  "here is some
           | pseudocode that you may be able to adapt to your choice of
           | UEFI bindings".
           | 
           | Testing in a UEFI environment is quite difficult; the LLM
           | can't just run `cargo test` and verify its output. Things get
           | worse in embedded, because crates like embedded_hal made
           | massive API changes between 0.2 and 1.0 (the latest version),
           | and each LLM I've tried seems to only have knowledge of 0.2
           | releases. Also, for embedded, forget even thinking about
           | testing harnesses (which at least exist in _some form_ with
            | UEFI, it's just difficult to automate the execution and
           | output for an LLM). In this case, you cannot really trust the
           | output of the LLM. To minimize risk of hallucination, I would
           | try maintaining data sheets and library code in context, but
           | at that point, it took more time to prompt an LLM than
           | handwrite code.
           | 
           | I've been writing a lot of embedded Rust over the past two
           | weeks, and my usage of LLMs _in general_ decreased because of
           | that. Currently planning to resume development on some of my
           | "easier" projects, since I have about 300 Claude prompts
           | remaining in my Zed subscription, and I don't want them to go
           | to waste.
        
             | RealityVoid wrote:
             | > Also, for embedded, forget even thinking about testing
             | harnesses (which at least exist in some form with UEFI,
             | it's just difficult to automate the execution and output
             | for an LLM).
             | 
              | I think it doesn't have to be like this, and we can do
              | better here. If LLMs keep this up, good testing
              | infrastructure might become even more important.
        
               | koito17 wrote:
               | One of my expectations for the future is the development
               | of testing tools whose output is "optimized" in some way
               | for LLM consumption. This is already occurring with Bun's
               | test runner, for instance.[0] They are implementing a
               | flag in the test runner so that the output is structured
               | _and_ optimized for token count.
               | 
               | Overall, I agree with your point. LLMs feel a lot more
               | reliable when a codebase has thorough, easy-to-run tests.
               | For a similar reason, I have been drifting towards
               | strong, statically-typed languages. Both Rust and
               | TypeScript have rich type systems that can express many
               | kinds of runtime behavior with just types. When a
               | compiler can make strong guarantees about a program's
               | behavior, I assume that helps nudge the quality of LLM
               | output a bit higher. Tests then help prevent silly
               | regressions from occurring. I have no evidence for this
               | besides my anecdotal experience using LLMs across several
               | programming languages.
               | 
               | In general, I've had the best experience with LLMs when
               | there's plenty of static analysis (and tests) on the
               | codebase. When a codebase can't be easily tested, then I
               | get much less productivity gains from LLMs. So yeah, I'm
               | all for improving testing infrastructure.
               | 
               | [0] https://x.com/jarredsumner/status/1944948478184186366
        
             | miki123211 wrote:
             | This is where Rust's "if it compiles, it's probably
             | correct" philosophy may come in handy.
             | 
             | "Shifting bugs left" is even more important for LLMs than
             | it is for humans. There are certain tests LLMs can't run,
             | so if we can detect bugs at compile time and run the LLM in
             | a loop until things compile, that's a significant benefit.
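              | 
              | A minimal sketch of that loop, assuming a Rust crate and
              | with the LLM call and patch application passed in as
              | hypothetical callables:
              | 
              |     # Re-run `cargo check`, feed the errors to the model,
              |     # apply its patch, and stop once the crate compiles.
              |     import subprocess
              |     from typing import Callable
              | 
              |     def compile_loop(ask_llm_for_patch: Callable[[str], str],
              |                      apply_patch: Callable[[str], None],
              |                      max_iters: int = 10) -> bool:
              |         for _ in range(max_iters):
              |             result = subprocess.run(["cargo", "check"],
              |                                     capture_output=True, text=True)
              |             if result.returncode == 0:
              |                 return True  # everything the compiler can catch is gone
              |             patch = ask_llm_for_patch(result.stderr)  # model sees errors
              |             apply_patch(patch)  # write the suggested edits to disk
              |         return False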
        
               | nathan_compton wrote:
                | My recent experience is that LLMs are dogshit at Rust,
                | though: unable to correct bugs without inserting new
                | ones, going back and forth fixing and breaking the same
                | thing, etc.
        
               | energy123 wrote:
               | A while ago I gathered every HN comment going back a year
               | that contains Rust and LLM and about half are positive
               | and half are negative.
        
               | manmal wrote:
               | Aren't we all though?
        
           | RealityVoid wrote:
           | > You are doing embedded development or anything else not as
           | mainstream as web dev?
           | 
            | Counterpoint, but also kind of reinforcing your point: it
            | depends on the kind of embedded development. I did a small
            | utility PCB with an ESP32, and their libs are good, there is
            | an active community, and they have test frameworks. LLMs did
            | a great job there.
            | 
            | On the other hand, I wanted to drive a timer, a PWM module,
            | and a DMA engine to generate some precise pulses. The way
            | I chained the hw was... not typical, but it was what I needed
            | and the hw could do it. At that, Claude failed miserably and
            | only wasted my time, so I had to spend the time to do it
            | manually.
        
         | ffsm8 wrote:
         | The industry of "software" is so large... While I agree with
         | web development going this route, I'm not sure about
         | "everything else".
         | 
          | You could argue that that's the bulk of all software jobs in
          | tech, and you'd likely be correct... But depending on what your
          | actual challenge is, LLM assistance is more of a hindrance than
          | a help. However, creating a web platform without external
          | constraints makes LLM assistance shine, that's true.
        
           | killerstorm wrote:
           | Well, there are certainly kinds of code LLMs would struggle
           | with, but people generally underestimate what LLMs are
           | capable of.
           | 
            | E.g. Victor Taelin is implementing an ultra-advanced
            | programming language/runtime, writing almost all of the code
            | with an LLM now. The runtime (HVM) is based on the
            | Interaction Calculus model, which was only an obscure
            | academic curiosity until Taelin started working on it. So the
            | hypothesis that LLMs are only capable of copying bits of code
            | from Stack Overflow can be dismissed.
        
             | thesz wrote:
              | I took a look at Taelin's work [1].
              | 
              | [1] https://github.com/HigherOrderCO/HVM
              | 
              | From my understanding, the main problem there is
              | compilation into (optimal) CUDA code and the CUDA runtime,
              | not the language or internal representation per se. CUDA is
              | hard to debug, so some help can be warranted.
             | 
             | BTW, this HVM thing smells strange. The PAPER does not
             | provide any description of experiments where linear
             | parallel speedups were achieved. What were these 16K cores?
             | What were these tasks?
        
               | killerstorm wrote:
               | Taelin is experimenting with possible applications of
                | interaction calculus. That CUDA thing was one of the
                | experiments, and it didn't quite work out.
               | 
               | Currently he's working on a different thing: a code
               | synthesis tool. AFAIK he got something better than
               | anything else in this category, but whether it's useful
               | is another question.
        
               | thesz wrote:
               | > something better than anything else in this category
               | 
               | That is a strong statement.
               | 
               | [1]
               | https://en.wikipedia.org/wiki/Id_(programming_language)
               | 
                | Id [1] was run on the CM-5 (then a supercomputer) and
                | demonstrated superlinear parallel speedups on some of the
                | tasks. That superlinear speedup was due to better cache
                | utilization on individual nodes.
                | 
                | In some of the tasks, the amount of parallel execution
                | discovered by Id90 would lead to overflow of content-
                | addressable memory, and Id90's runtime implemented
                | throttling _to reduce available parallelism_ so that
                | things could get done at all.
                | 
                | Does the HVM paper refer to Id (Id90 to be precise)?
                | No, it does not.
                | 
                | This is serious negligence on Taelin's part.
        
             | keithba wrote:
              | I've also experimented with using Rust to create a new
              | programming language, which I vibe coded (i.e. never wrote
              | anything myself). My opinion is that it's quite capable
              | with disciplined management.
             | 
             | https://github.com/GoogleCloudPlatform/aether
             | 
             | Note: the syntax is ugly as a trade-off to make it explicit
             | and unambiguous for LLMs to use.
        
         | kaspermarstal wrote:
          | We need a new term for LLMs actually solving hard problems.
          | When I help Claude Code solve a nasty bug it doesn't feel like
          | "vibing", as in "I tell the model what I want the website to
          | look like". It feels like sniping, as in "I spot for Claude
          | Code, telling it how to adjust for wind, range, and elevation
          | so it can hit my faraway target".
        
           | scosman wrote:
           | From what I recall of the original Karpathy definition, it's
           | only "vibe coding" if you aren't reading the code it produces
        
             | lukan wrote:
              | Yes, I vote for keeping that definition and not throwing
              | everything into one box. LLM-assisted coding is not vibe
              | coding.
        
             | kaspermarstal wrote:
             | Cool, I did not know that. That makes perfect sense.
        
             | JimDabell wrote:
             | You're right. It's explicitly about not caring about the
             | code:
             | 
             | > There's a new kind of coding I call "vibe coding", where
             | you fully give in to the vibes, embrace exponentials, and
             | forget that the code even exists. It's possible because the
             | LLMs (e.g. Cursor Composer w Sonnet) are getting too good.
             | Also I just talk to Composer with SuperWhisper so I barely
             | even touch the keyboard. I ask for the dumbest things like
             | "decrease the padding on the sidebar by half" because I'm
             | too lazy to find it. I "Accept All" always, I don't read
             | the diffs anymore. When I get error messages I just copy
             | paste them in with no comment, usually that fixes it. The
             | code grows beyond my usual comprehension, I'd have to
             | really read through it for a while. Sometimes the LLMs
             | can't fix a bug so I just work around it or ask for random
             | changes until it goes away. It's not too bad for throwaway
             | weekend projects, but still quite amusing. I'm building a
             | project or webapp, but it's not really coding - I just see
             | stuff, say stuff, run stuff, and copy paste stuff, and it
             | mostly works.
             | 
             | -- https://x.com/karpathy/status/1886192184808149383
        
           | fuzzy_biscuit wrote:
           | So we're the spotter in that metaphor. I like it!
        
             | riskable wrote:
             | "spotter coding" or perhaps "checker coding"?
             | 
             | "verified vivisection development" when you're working with
             | older code :D
        
           | demarq wrote:
           | - backseat engineer
           | 
           | - keyboard princess
           | 
           | - Robin to the Batman
           | 
           | - meatstack engineer
           | 
           | - artificial manager
        
         | cyanydeez wrote:
          | Do you think that people will read your flowery prose and
          | suspect you're just part of the dead Internet? We are still
          | waiting for all these AI-enhanced apps to flood the market.
        
       | odie5533 wrote:
       | I find the fast models good for rapidly iterating UI changes with
       | voice chat. Like "add some padding above the text box" or "right
       | align the button". But I find the fast models useless for deep
       | coding work. But a fast model has its place. Not $50/month
        | though. Cursor has Composer 1 and Grok Code Fast for free. Not
        | sure what $50/month gets me that those don't. I liked the stealth
       | supernova model a lot too.
        
         | gardnr wrote:
         | GLM 4.6 isn't a "fast" model. It does well in benchmarks vs
         | Sonnet 4.5.
         | 
         | Cerebras makes a giant chip that runs inference at unreal
         | speeds. I suspect they run their cloud service more as an
         | advertising mechanism for their core business: hardware. You
         | can hear the founder describing their journey:
         | 
         | https://podcasts.apple.com/us/podcast/launching-the-fastest-...
        
         | bn-l wrote:
         | Composer and grok fast are not free.
        
           | versteegen wrote:
           | grok-code-fast-1 is free (for a limited time) in opencode
           | zen, has been for a while. (Originally it was billed as a
           | test/to gather training data for xAI.) But right now GLM 4.6
           | is also temporarily free there (hosted by opencode
           | themselves; they call it "big pickle", and there's no data
           | collection), has been for weeks, and GLM 4.6 is _far_ better
           | (better than Haiku and not very far off Sonnet), and still
           | very fast, so I have no use for gcf1 anymore.
        
           | odie5533 wrote:
           | They are both free in Cursor right now.
        
       | seduerr wrote:
       | It's just amazing to have a reliable model at the speed of light.
       | Was waiting for such a great model for a long time!
        
       | dust42 wrote:
       | 1000 tokens/s is pretty fancy. I just wonder how sustainable the
       | pricing is or if they are VC-fueled drug dealers trying to
       | convert us into AI-coholics...
       | 
       | It is definitely fun playing with these models at these speeds.
       | The question is just how far from real pricing is 500M tokens for
       | $50?
       | 
        | Either way, LLM usage will grow for some time to come, and so
        | will energy usage. Good times for renewables and probably
        | fusion and fission.
        | 
        | Selling shovels in a gold rush was always a reliable business.
        | Cerebras was only valued at $8.1B as of one month ago. Compared
        | to Nvidia that seems like pocket change.
        
       | ojosilva wrote:
       | Here's a customer of the $200 max plan for 2 months. I fell in
       | love with the Qwen3 Coder 480B model, Q3C, _that_ was fast, twice
       | the speed of GLM. GLM 4.6 is just meh, I mean, way faster than
       | competitors, and practically at Sonnet 4.x level in coding and
       | tool use, but not a life-changing difference.
       | 
       | Yes, Qwen3 made more mistakes than GLM, around 15% more in my
       | quick throwaway evals, but it was a more professional model
       | overall, more polished in some aspects, better with international
        | languages, and being non-reasoning, ideal for a lot of tasks
        | through the API that could be run instantaneously. I think the
       | Qwen line of models is a more consistent offering, with other
       | versions of the model for 32B and VL, now a 80B one, etc. I guess
       | the problem was that Qwen Max was closed source, signalling that
       | Qwen may not have a way forward for Cerebras to evolve. GLM 4.6
       | covers precisely that hole. Not that Cerebras is a model provider
       | of any kind, their service levels are "buggy" (right now it's
       | been down for 1h and probably won't be fixed until California
       | wakes up at 9am PST). So it does feel like we are not the
       | customers, but the product, a marketing stunt for them to get
       | visibility for their tech.
       | 
       | GLM feels like they (Z.ai) are just distilling whatever they can
       | get into it. GLM switches to Chinese sometimes, or just cuts off.
       | It does have a bit of more "intelligence" than Q3C, but not
       | enough to say it solves the toughest problems. Regardless, for
       | tough nuts to crack I use my Codex Plus plan.
       | 
        | Ex: In one of my evals, it took 15 turns to solve an issue using
        | Cerebras Q3C. It took 12 turns with GLM, but overall GLM takes 2x
       | the time, so instead of doing a full task from zero-to-commit in
       | say 15 minutes, it takes 24 minutes.
       | 
       | In another eval (Next.js CSS editing), my task with Q3C coder was
       | done in 1:30 minutes. GLM 4.6 took 2:24. The same task in Codex
       | took 5:37 minutes, with maybe 1 or 2 turns. Codex DX is that of
       | working unattended: prompt it and go do something else, there's a
       | good chance it will get it right after 0, 1 or 2 nudges. With
       | CC+Cerebras it's a completely different DX, given the speed it
       | feels just like programming, but super-fast. Prompt, read the
       | change, accept (or don't), accept, accept, accept, test it out,
       | accept, prompt, accept, interrupt, prompt, accept, and 1:30 min
       | later we're done.
       | 
       | Like I said I use Claude Code + a proxy (llmux). The coding agent
       | makes a HUGE difference, and CC is hands-down the best agent out
       | there.
        
         | andai wrote:
         | Asked GLM-4.6 to introduce itself. "Hello! I'm glad you asked.
         | I'm a large language model, trained by Google. (...)"
         | 
         | It seems to have been fine-tuned on Claude Code interactions as
         | well. Though unfortunately not much of Claude's coding style
         | itself? (I wish!)
        
       | KronisLV wrote:
       | Been using Cerebras for quite a while now, previously with their
       | Qwen3 Coder and now GLM 4.6, overall the new model feels better
       | at tool calls and code in general. Fewer tool call failures with
       | RooCode (should also apply to Cline and others too), but
       | obviously still not perfect.
       | 
       | Currently on the 50 USD tier, very much worth the money, am kinda
       | considering going for the 200 USD tier, BUT GPT-5 and Sonnet 4.5
       | and Gemini 2.5 Pro still feel needed occasionally, so it'd be
       | stupid to go for the 200 USD tier and not use it fully and still
       | have to pay up to around 100 USD for tokens in the other models
        | per month. Maybe that will change in the future. When dealing
        | with lots of changes (e.g. needing to make a component showcase
        | of 90 components, but with enough differences between them to
        | make codegen unviable), Cerebras is already invaluable.
       | 
       | Plus the performance actually makes iterating faster, to a degree
       | where I believe that other models should also eventually run this
       | fast. Oddly enough, the other day their 200 USD plan showed as
       | "Sold out", maybe they're scaling up the capacity gradually. I
       | really hope they never axe the Code plans, they literally have no
       | competition for this mode of use. Maybe they'll also have a 100
       | USD plan some day, one can hope, but maybe offering _just_ the
       | 200 plan is better from an upsell perspective for them.
       | 
        | Oh also, when I spill over my daily 24M limits, please let me use
        | the Pay2Go thing on top of that: if instead of 24M tokens some
        | day I need 40M, I'd pay for the additional ones.
        
       | andai wrote:
       | I have been using Z.ai's (creators of GLM) "Coding Plan" with
       | GLM-4.6. $3/month and 3x higher limits than Claude Pro, they say.
       | 
       | (I have both and haven't run into any limits yet, so I'm probably
       | not a very heavy user.)
       | 
       | I'm quite impressed with the model. I have been using GLM-4.6 in
       | Claude Code instead of Sonnet, and finding it fine for my use
       | cases. (Simple scripting and web stuff.)
       | 
       | (Note: Z.ai's GLM doesn't seem to support web search (fails) or
       | image recognition (hallucinates). To fix that I use Claude Code
       | Router and hooked up those two features to Gemini (free)
       | instead.)
       | 
       | I find that Sonnet produces much nicer code. I often find myself
       | asking Sonnet to clean up GLM's code. More recently, I just got
       | the Pro plan for Claude so I'm mostly just using Sonnet directly
       | now. (Haven't had the rate limits yet but we'll see!)
       | 
       | So in my experience if you're not too fussy you can currently get
       | "80% of Claude Code" for like $3/month, which is pretty nuts.
       | 
       | GLM also works well in Charm Crush, though it seems to be better
       | optimized for Claude Code (I think they might have fine tuned
       | it.)
       | 
       | ---
       | 
       | I have tested Kimi K2 at 1000 tok/s via OpenRouter, and it's
       | bloody amazing, so I imagine this supercharged GLM will be great
       | too. Alas, $50!
        
         | realo wrote:
         | I just asked glm-4.6 how to setup a z.ai api key with claude
         | code and it kept on saying it has no idea what claude code
         | is...
         | 
         | Quite funny, actually.
        
           | andai wrote:
           | Ask and you shall receive!
           | 
           | https://docs.z.ai/devpack/tool/claude
           | 
            | tl;dr:
            | 
            |     "env": {
            |       "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key",
            |       "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic"
            |     }
           | 
           | Although if you want an Actually Good Experience I recommend
           | using Claude Code Router
           | 
           | https://github.com/musistudio/claude-code-router
           | 
           | because it allows you to intercept the requests and forward
           | them to other models. (e.g. GLM doesn't seem to support
           | search or images, so I use Gemini's free tier for that.)
           | 
           | (CCR just launches Claude with the base url set to a local
           | proxy. The more adventurous reader can also set up his own
           | proxy... :)
        
             | realo wrote:
             | Benchmark score
             | 
             | Humans : 1
             | 
             | AI : 0
        
             | brianjking wrote:
             | Does this allow me to use Claude Code as the orchestration
             | harness with GLM 4.6 as the LLM along with other LLMs?
             | Seems so based on your description, thanks for the link.
        
               | andai wrote:
               | Explain like I'm 5?
        
       | andai wrote:
       | I have created a "semi-interactive" AI coding workflow.
       | 
       | I write what I want, the LLM responds with edits, my 100 lines of
       | Python implement them in the project. It can edit any number of
       | files in one LLM call, which is very nice (and very cheap and
       | fast).
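        | 
        | The general shape of it is something like this (the edit-block
        | format here is simplified for illustration; any unambiguous
        | delimiter the model can be prompted to emit works):
        | 
        |     # Apply multi-file edits from a single LLM response containing
        |     # blocks like:
        |     #   FILE: path/to/file.py
        |     #   <full new contents>
        |     #   END FILE
        |     import pathlib, re
        | 
        |     def apply_edits(llm_output: str) -> None:
        |         pattern = re.compile(r"FILE: (.+?)\n(.*?)\nEND FILE", re.DOTALL)
        |         for path, contents in pattern.findall(llm_output):
        |             target = pathlib.Path(path.strip())
        |             target.parent.mkdir(parents=True, exist_ok=True)
        |             target.write_text(contents)
        |             print(f"wrote {target}")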
       | 
        | I tested this with Kimi K2 on Groq (also 1000 tok/s) and was very
        | impressed.
       | 
       | I want to say this is the best use case for fast models -- that
       | frictionlessness in your workflow -- though it burns tokens
       | pretty fast working like that! (Agentic is even more nuts with
       | how fast it burns tokens on fast models, so the $50 is actually
       | pretty great value.)
        
       | lvl155 wrote:
        | I want to use Cerebras but it's just not production ready. I
        | will root for them from the sidelines for now.
        
       | hereme888 wrote:
       | So basically "change this in the UI" and you see it happen almost
       | real time.
        
       | sheepscreek wrote:
        | I used their $50 plan with the previously offered Qwen3 Coder
        | 480B. While fast - none of the "supported" tools I tried were
        | able to use it in a way that didn't hit the per-minute request
        | limit in a few seconds. It was incredibly frustrating. For the
        | record, I tried OpenCode, VSCode, Qwen Coder CLI, octofriend and
        | a few others I don't remember.
       | 
       | Fast forward to now, when GLM 4.6 has replaced Qwen3 coder in
       | their subscription plan. My subscription was still active so I
       | wanted to give this setup another shot. This time though, I
       | decided to give Cline a try. I've got to say, I was very
       | pleasantly surprised - it worked really well out of the box. I
        | guess whatever Cline does behind the scenes is more conducive to
        | Cerebras's API. I used Claude 4.5 + Thinking for "Plan" mode and
       | Cerebras/GLM 4.6 for "Act".
       | 
       | The combo feels solid. Much better than GPT-5 Codex alone. I
       | found codex to be very high quality but so godawful slow for long
       | interactive coding sessions. The worst part is I cannot see what
       | it's "thinking" to stop it in its tracks when it's going in the
       | wrong direction.
       | 
        | In essence, Cerebras + GLM 4.6 feels like Grok Fast 1 on
       | steroids. Just couple it with a frontier + thinking model for
       | planning (Claude 4.5/GPT-5/Gemini Pro 2.5).
       | 
       | One caveat: sometimes the Cerebras API starts choking "because of
       | high demand" which has nothing to do with hitting subscription
       | limits. Just an FYI.
       | 
       | Note: For the record, I was coding on a semi-complex Rust
       | application tuned for low-latency mix of IO + CPU workload. The
       | application is multi-threaded and makes extensive use of locking
       | primitives and explicit reference counting (Arc). All models were
       | able to handle the code really well given the constraints.
       | 
       | Note2: I am also evaluating Synthetic's (synthetic.new) open-
       | source model inference subscription and I like it a lot. There's
        | a large number of models to choose from, including gpt-oss-120b,
        | and their usage limits are very, very generous. To the point that
       | I don't think I will ever hit them.
        
         | anonzzzies wrote:
         | I run opencode with cerebras and it goes on and on. No issues
         | so far. It is not a codex or claude code but its fast that it
         | allows for a much more interactive experience. Our results with
         | Claude Code and Codex are much better quality but when, let's
         | use the term vibing, opencode wirh cerebras is more fun.
        
         | zixuanlimit wrote:
         | Have you tried Z.ai's official coding plan? Any differences in
         | performance?
        
       | Alifatisk wrote:
        | I don't know about GLM 4.6; some have said they are bench-maxing,
        | so I kinda lost my interest in trying them out. Does it live up
        | to its reputation? Is it really as good as Sonnet 4.5?
        
       | jauntywundrkind wrote:
        | I'm curious to know what the cost is to switch contexts. Pure
        | performance is amazing, but how long does it take to get a system
        | going, to load the model and build context? What systems can
        | switch contexts while keeping the model loaded non-destructively,
        | vs. when is switching destructive?
        | 
        | I have a lot of questions about how models are run at scale; so
        | curious to know more. With such a massive wafer-scale chip as
        | Cerebras's, it feels like perhaps switching might be even more
        | costly. Or maybe there's some brilliant strategy to have
        | multiple contexts all loaded that it can flip between!
        | Inventorying & using so much RAM so spread out is its own
        | challenge!
        
       | w-m wrote:
       | I wanted to try GLM 4.6 through their API with Cline, before
       | spending the $50. But I'm getting hit with API limits. And now
       | I'm noticing a red banner "GLM4.6 Temporarily Sold Out. Check
       | back soon." at cloud.cerebras.ai. HN hug of death, or was this
       | there before?
        
       | vladgur wrote:
        | Where I'm unfortunately hitting the wall with these assistants is
        | developing a desktop app in, say, Java Swing - Claude cannot
        | verify that the UI it produced is actually functional.
        
         | NaomiLehman wrote:
         | what is your AI dev stack? have you tried Kilo Code?
        
       ___________________________________________________________________
       (page generated 2025-11-08 23:01 UTC)