[HN Gopher] Cerebras launches inference for Llama 3.1; benchmark...
       ___________________________________________________________________
        
       Cerebras launches inference for Llama 3.1; benchmarked at 1846
       tokens/s on 8B
        
       Author : _micah_h
       Score  : 72 points
       Date   : 2024-08-27 16:42 UTC (6 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | freediver wrote:
        | Yep it is fast. Now what exactly Llama 8B is useful for is
        | another matter - what are some good use cases?
        | 
        | One scenario I can think of is roleplaying - but I would assume
        | that the slow streaming speed was kind of a feature there.
        
         | rgbrgb wrote:
         | Speed is useful for batch tasks or doing a bunch of serial
         | tasks quickly. E.g. "take these 1000 pitch decks and give me 5
         | bullets on each", "run this prompt 100 times and then pick the
         | best response", "detect which of these 100k comments mention
         | the SF Giants".
        
         | drdaeman wrote:
          | 8B is not exactly great for roleplaying, if we set the bar at
          | all high. It is just not sophisticated enough: it has very
          | limited "reasoning"-like capabilities and can normally draw
          | sensible conclusions only about very basic things (like, if
          | it's raining, maybe a character will get wet). It can and will
          | hallucinate about stuff like inventories or rules - and it's
          | not a context length thing. If there are multiple NPCs, things
          | get worse, as they all start to mix together.
         | 
          | 70B does significantly better in this regard. Nowhere close to
          | perfect, but the frequency of WTFs in the LLM's output is
          | [subjectively] drastically lower.
         | 
          | Speed can be useful in RP if we run multiple LLM-based agents
          | (like "plot", "goal checker", "inventory", "validation",
          | "narrator") that call each other as functions to achieve some
          | goal.
        
           | wkat4242 wrote:
            | These wafers only have 44GB of RAM though. Very curious why
            | the capacity is so low considering the chips are absolutely
            | massive. It's SRAM, so very fast - comparable to cache in a
            | modern CPU. I assume being able to keep the whole model
            | on-chip at that speed is the point.
        
         | bottlepalm wrote:
         | Surveillance states and intelligence agencies.
         | 
         | Or maybe a MMO with a town of NPCs.
        
           | benopal64 wrote:
           | Why can't the MMO with a town of NPCs have an intelligence
           | agency too?
        
         | seldo wrote:
         | For agentic use cases, where you might need several round-trips
         | to the LLM to reflect on a query, improve a result, etc.,
         | getting fast inference means you can do more round-trips while
         | still responding in reasonable time. So basically any LLM use-
         | case is improved by having greater speed available IMO.
        
           | freediver wrote:
            | The problem with this is that tok/sec does not tell you what
            | the time to first token is. I've seen cases (with Groq) where
            | it is large for large prompts, nullifying the advantage of
            | faster tok/sec.
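            | 
            | Rough illustration with made-up numbers - what you actually
            | feel end-to-end is TTFT plus decode time, summed over every
            | round trip:
            | 
            | ttft = 0.8         # s to first token on a big prompt
            | tok_per_s = 1846   # headline decode speed
            | tokens_out = 300   # tokens generated per call
            | round_trips = 5    # agentic loop: reflect, revise, etc.
            | 
            | per_call = ttft + tokens_out / tok_per_s  # ~0.96 s
            | total = round_trips * per_call            # ~4.8 s, TTFT-bound
            | print(per_call, total)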
        
         | halJordan wrote:
         | What kind of answer are you looking for? Just start asking it
         | questions. The constant demand for a magic silver bullet use
         | case applicable to every person in the country is wild. If you
         | have to ask, you're not using it.
         | 
          | What exact use case did google.com enable that made it
          | worthwhile for everyone to immediately start using it? Did it
          | let you access nytimes.com? Access amazon.com? No, it let you
          | ask off-the-wall, asinine, long-tail questions no one else
          | asked.
        
       | mikewarot wrote:
        | Why is it so gosh darned slow? If you've got enough transistors
        | to hold 44 gigabytes of RAM, you've got enough to keep the whole
        | model stored on-chip with no need for off-chip transfers.
        | 
        | I'd expect tokens out at 1 GHz aggregate. Anything less than 1
        | MHz is a joke.... ok, not a joke, but surprisingly slow.
        
         | twothreeone wrote:
          | Even if they could generate tokens at that speed on the chip
          | (which maybe they can in theory?), you need to get user tokens
          | onto the chip, get the resulting model tokens off again, and
          | transport them to the user as well. This means at some point
          | the I/O becomes the bottleneck, not the compute. I also suspect
          | it will get faster still; from the announcement it didn't sound
          | like it's "optimal" yet.
        
         | chessgecko wrote:
          | On-die communication isn't free. A lot of things here are
          | sequential, and within matrix multiplies the cores have to
          | transfer outputs and memory loads have to be distributed. It's
          | really fast, but not like one cycle.
        
           | mikewarot wrote:
            | You could add a series of latches, use the magic of graph
            | coloring to eliminate any timing issues, and pipeline the
            | thing sufficiently to get a GHz of throughput, even if it
            | takes many cycles to make it all the way through the pipe.
           | 
           | Personally, I'd put all the parameters in NOR flash, then
           | cycle through the row lines sequentially to load the
           | parameters into the MAC. You could load all the inputs in
           | parallel as fast as the dynamic power limits of the chip
           | allow. If you use either DMA or a hardware ring buffer to
           | push all the tokens through the layers, you could keep the
           | throughput going with various sizes of models, etc.
           | 
            | Obviously with only one MAC you couldn't have a single stream
            | at a GHz, but you could have 4000 separate streams of 250,000
            | tokens/second.
        
             | chessgecko wrote:
              | Their numbers are for a single input; I assume the
              | aggregate throughput is much higher, given the prices they
              | are quoting and the cost of a single CS-3.
        
         | GaggiX wrote:
         | It only needs to compute about a trillion floating-point
         | operations per token, and each layer relies on the previous
         | one.
         | 
         | I wonder why it doesn't output a billion tokens per second.
        
           | ein0p wrote:
            | The coarse estimate of compute in transformers is about as
            | many MACs as there are weights, or twice as many flops
            | (because multiplication and addition are counted as separate
            | operations). So for Llama 70B that's about 70B MACs per
            | token, which is manageable. What's far less manageable is
            | reading the entire model out of memory N times a second.
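            | 
            | Rough numbers for the 8B model in the headline (parameter
            | count and weight precision are assumptions):
            | 
            | params = 8e9        # Llama 3.1 8B, roughly
            | bytes_per_wt = 2    # assuming fp16/bf16 weights
            | tok_per_s = 1846    # headline single-stream speed
            | 
            | macs_per_tok = params                   # ~8 GMAC, ~16 GFLOP
            | bytes_per_tok = params * bytes_per_wt   # ~16 GB per token
            | bw = bytes_per_tok * tok_per_s          # ~30 TB/s of weights
            | print(bw / 1e12, "TB/s, before KV cache and activations")
            | 
            | That is well beyond a single accelerator's HBM bandwidth,
            | which is presumably why keeping the weights in on-chip SRAM
            | helps here.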
        
             | GaggiX wrote:
              | This would only be the case if we ignore the multiplication
              | between queries and keys, the resulting vector being
              | multiplied with the values, and also the multiple heads.
        
               | ein0p wrote:
                | No, that is always the case. Attention is only about one
                | third of the ops, and qk is a fraction of that. Outside
                | of truly massive sequence lengths it doesn't matter a
                | whole lot, even though it's nominally quadratic. It's
                | trivial to run the numbers on this - you only need to do
                | it for one layer.
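                | 
                | Running those numbers for one layer of a 70B-class model
                | (shapes assumed: GQA with 8 KV heads, 4k context as an
                | example):
                | 
                | d, d_ff, kv, hd = 8192, 28672, 8, 128  # assumed shapes
                | ctx = 4096
                | 
                | qo = 2 * d * d           # Q and output projections
                | kvp = 2 * d * (kv * hd)  # K and V projections (GQA)
                | qk_av = 2 * d * ctx      # q.k and attn.v
                | ffn = 3 * d * d_ff       # gate/up/down (SwiGLU)
                | 
                | total = qo + kvp + qk_av + ffn   # ~0.9 GMAC per layer
                | print(total / 1e6, (qo + kvp + qk_av) / total)  # ~24%
                | 
                | With GQA the attention share comes out a bit under a
                | third at this context length, and only the qk_av term
                | grows with sequence length.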
        
       | ChrisArchitect wrote:
       | [dupe]
       | 
       | More discussion on official post:
       | https://news.ycombinator.com/item?id=41369705
        
       | phkahler wrote:
       | The winner will be one of two approaches: 1) Getting great
       | performance using regular DRAM - system memory. 2) Bringing the
       | compute to the RAM chips - DRAM is accessed 64Kb per row (or
       | more?) and at ~10ns per read you can use small/slow ALUs along
       | the row to do MAC operations. Not sure how you program that
       | though.
       | 
       | Current "at home" inference tends to be limited by how much RAM
       | your graphics card has, but system RAM scales better.
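        | 
        | Back-of-the-envelope on approach 2 above, taking the 64Kb row and
        | ~10ns read at face value (int8 weights assumed, all numbers
        | illustrative):
        | 
        | row_bits = 64 * 1024         # bits activated per DRAM row read
        | read_ns = 10                 # ~10 ns per row read
        | wts_per_row = row_bits // 8  # 8K int8 weights per activation
        | 
        | macs_per_s = wts_per_row / (read_ns * 1e-9)  # ~0.8 TMAC/s/bank
        | rows_per_pass = 8e9 / wts_per_row   # rows for an 8B model
        | ms_per_tok = rows_per_pass * read_ns * 1e-6  # ~10 ms/bank
        | print(macs_per_s / 1e12, rows_per_pass, ms_per_tok)
        | 
        | So a single row-wide MAC strip lands on the order of 100 tok/s
        | for an 8B model, and parallel banks multiply that.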
        
         | rfoo wrote:
         | The winner, unfortunately, will be on cloud inference.
        
         | eth0up wrote:
         | I'll probably get stoned for asking here, but... since you seem
         | knowledgeable on the subject:
         | 
          | I just got Llama 3.1 8B (standard and instruct). However, I
          | cannot do anything with it on my current hardware. Can you
          | recommend the best AI model that I: 1) can self-host, 2) can
          | run on 16GB RAM with no dedicated graphics card and an old
          | Intel i5, and 3) can use on Debian without installing a bunch
          | of exo-repo mystery code?
          | 
          | Any recommendation, direct or semi-related, would be
          | appreciated - I'm doing my 'research' but haven't made much
          | progress nor had any questions answered.
        
           | arcanemachiner wrote:
           | Setting up Ollama via Docker was the easiest way for me to
           | get up and running. Not 100% sure if it fits your
           | constraints, but highly recommended.
        
             | programd wrote:
             | Another option is to download and compile llama.cpp and you
             | should be able to run quantized models at an acceptable
             | speed.
             | 
             | https://github.com/ggerganov/llama.cpp
             | 
             | Also, if you can spend the $60 and buy another 32GB of RAM,
             | this will allow you to run the 30GB models quite nicely.
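              | 
              | If you'd rather stay in Python, the llama-cpp-python
              | bindings wrap the same engine. A minimal sketch (the GGUF
              | filename is just an example - any ~4-5GB Q4 quant of an 8B
              | model fits in 16GB of system RAM, CPU-only):
              | 
              | from llama_cpp import Llama  # pip install llama-cpp-python
              | 
              | llm = Llama(model_path="llama-3.1-8b-q4_k_m.gguf",
              |             n_ctx=4096)
              | msgs = [{"role": "user",
              |          "content": "Explain quantization briefly."}]
              | out = llm.create_chat_completion(messages=msgs,
              |                                  max_tokens=256)
              | print(out["choices"][0]["message"]["content"])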
        
               | eth0up wrote:
                | Unfortunately the motherboard is capped at 16GB of RAM
        
           | smokel wrote:
           | Running LLMs on that kind of hardware will be very slow
           | (expect responses with only a few words per second, which is
           | probably pretty annoying).
           | 
           | LM Studio [1] makes it very easy to run models locally and
           | play with them. Llama 3.1 will only run in quantized form
           | with 16GB RAM, and that cripples it quite badly, in my
           | opinion.
           | 
           | You may try Phi-3 Mini, which has only 3.8B weights and can
           | still do fun things.
           | 
           | [1] https://lmstudio.ai/
        
             | eth0up wrote:
             | Much appreciated. Thanks for this!
        
             | wkat4242 wrote:
              | I don't find Llama 3.1 noticeably worse quantised to 8-bit
              | integers than the original fp16, to be honest. It's also a
              | lot faster.
              | 
              | Of course even then you're not going to fit the whole 128k
              | context window in 16GB, but if you don't need that it works
              | great.
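              | 
              | Rough math on why the full 128k window doesn't fit (layer
              | and head counts are the published 8B shapes, cache
              | precision assumed):
              | 
              | layers, kv_heads, head_dim = 32, 8, 128  # Llama 3.1 8B
              | ctx = 128_000       # full context window
              | bytes_per_val = 2   # fp16 KV cache; halve for q8
              | 
              | per_tok = 2 * layers * kv_heads * head_dim * bytes_per_val
              | kv_gb = per_tok * ctx / 1e9   # ~17 GB fp16, ~8 GB at q8
              | print(per_tok, "bytes/token,", kv_gb, "GB of KV cache")
              | 
              | Add roughly 8GB for the q8 weights themselves and you're
              | past 16GB before the OS gets anything.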
        
         | mikewarot wrote:
          | Completely eliminating the separation between RAM and compute
          | is how FPGAs are so fast: they do most of the computing as a
          | series of Look Up Tables (LUTs), and optimize for latency and
          | utilization with fancy switching fabrics.
          | 
          | The downside of the switching fabrics is that optimizing a
          | design to fit an FPGA can sometimes take days.
        
         | ein0p wrote:
          | +1. For inference especially, compute is abundant and basically
          | free in terms of energy. Almost all of the energy is spent on
          | memory movement. The logical solution is to not move
          | unaggregated data.
        
       | russ wrote:
       | Here's an AI voice assistant we built this weekend that uses it:
       | 
       | https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...
        
       | cheptsov wrote:
        | Very interested in playing with their hardware and cloud. I also
        | wonder if it's possible to try the cloud without contacting their
        | sales team.
        
       | ein0p wrote:
        | 8B models won't even need a server a year from now. Basically the
        | only reason to go to the server a year or two from now will be to
        | do what edge devices can't: general-purpose chat, long context
        | (multimodal especially), data-augmented generation that relies on
        | pre-existing data sources in the cloud, etc. And on the server
        | it's very expensive to run at batch size 1. You want to maximize
        | the batch size while also keeping an eye on time to first token
        | and time per token. Roughly 20-25 tok/sec of per-stream
        | generation throughput is a good number for most non-demo
        | workloads. TTFT for the median prompt size should ideally be well
        | under 1 sec.
       | 
       | But I'm happy they got this far. It's an ambitious vision, and
       | it's extra competition in a field where it's severely lacking.
        
       | bkitano19 wrote:
        | Time to first token is just as important to know for many use
        | cases, yet people rarely report it.
        
         | Gcam wrote:
         | See here for our TTFT metric benchmarks:
         | https://artificialanalysis.ai/models/llama-3-1-instruct-70b/...
        
       ___________________________________________________________________
       (page generated 2024-08-27 23:01 UTC)