[HN Gopher] Cerebras launches inference for Llama 3.1; benchmark...
___________________________________________________________________
Cerebras launches inference for Llama 3.1; benchmarked at 1846
tokens/s on 8B
Author : _micah_h
Score : 72 points
Date : 2024-08-27 16:42 UTC (6 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| freediver wrote:
| Yep it is fast. Now what exactly Llama 8B is useful for is
| another matter - what are some good use cases?
|
| One scenario I can think of is roleplaying - but I would assume
| that the slow streaming speed was kind of a feature there.
| rgbrgb wrote:
| Speed is useful for batch tasks or doing a bunch of serial
| tasks quickly. E.g. "take these 1000 pitch decks and give me 5
| bullets on each", "run this prompt 100 times and then pick the
| best response", "detect which of these 100k comments mention
| the SF Giants".
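A minimal sketch of the "100k comments" kind of batch task, assuming an
OpenAI-compatible chat endpoint; the base_url, api_key, and model name
below are placeholders, not Cerebras' actual API details:

    # One yes/no classification call per comment.
    from openai import OpenAI

    client = OpenAI(base_url="https://example-inference-host/v1", api_key="...")

    def mentions_giants(comment: str) -> bool:
        resp = client.chat.completions.create(
            model="llama-3.1-8b",  # placeholder model id
            messages=[
                {"role": "system", "content": "Answer strictly 'yes' or 'no'."},
                {"role": "user",
                 "content": f"Does this comment mention the SF Giants?\n\n{comment}"},
            ],
            max_tokens=1,
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower().startswith("y")

    comments = ["The Giants blew another save last night",
                "Best burrito in the Mission?"]
    flagged = [c for c in comments if mentions_giants(c)]
    print(flagged)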
| drdaeman wrote:
| 8B is not exactly great for roleplaying, if we set the bar at
| all high. It is just not sophisticated enough, as it has very
| limited "reasoning"-like capabilities and can normally make
| sensible conclusions only about very basic things (like if it's
| raining, maybe a character will get wet). It can and will
| hallucinate about stuff like inventories or rules - and it's
| not a context length thing. If there are multiple NPCs, things
| get worse, as they all start to mix up.
|
| 70B does significantly better in this regard. Nowhere close to
| perfection, but the frequency of WTFs about the LLM's output is
| [subjectively] drastically lower.
|
| Speed can be useful in RP if we run multiple LLM-based agents
| (like "plot", "goal checker", "inventory", "validation",
| "narrator") that function-call each other to achieve some goal.
| wkat4242 wrote:
| These wafers only have 44GB of RAM though. Very curious why
| the quantity is so low considering the chips are absolutely
| massive. It's SRAM though, so very fast, comparable to the
| cache in a modern CPU. But I assume the point is being fast
| and fitting the whole model there.
| bottlepalm wrote:
| Surveillance states and intelligence agencies.
|
| Or maybe a MMO with a town of NPCs.
| benopal64 wrote:
| Why can't the MMO with a town of NPCs have an intelligence
| agency too?
| seldo wrote:
| For agentic use cases, where you might need several round-trips
| to the LLM to reflect on a query, improve a result, etc.,
| getting fast inference means you can do more round-trips while
| still responding in reasonable time. So basically any LLM use-
| case is improved by having greater speed available IMO.
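A rough sketch of that kind of round-trip loop; `ask` is a stand-in for
whatever chat-completion call you use, so the shape is the point here,
not any particular API:

    # Draft -> critique -> revise loop. Faster generation means more of
    # these round-trips fit inside the same wall-clock response budget.
    def ask(prompt: str) -> str:
        raise NotImplementedError("call your LLM endpoint here")

    def answer_with_reflection(question: str, rounds: int = 3) -> str:
        draft = ask(f"Answer the question:\n{question}")
        for _ in range(rounds):
            critique = ask(f"Question: {question}\nDraft: {draft}\n"
                           "List concrete problems with this draft.")
            draft = ask(f"Question: {question}\nDraft: {draft}\n"
                        f"Problems: {critique}\n"
                        "Rewrite the draft, fixing those problems.")
        return draft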
| freediver wrote:
| The problem with this is that tok/sec does not tell you what
| the time to first token is. I've seen cases (with Groq) where
| this is large for large prompts, nullifying the advantage of
| faster tok/sec.
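One way to see both numbers at once is to time a streaming request
yourself; a rough sketch, again assuming an OpenAI-compatible endpoint
with placeholder base_url and model, and counting stream chunks as an
approximation of tokens:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="https://example-inference-host/v1", api_key="...")

    start = time.perf_counter()
    first = None
    chunks = 0  # chunk count as a rough proxy for tokens

    stream = client.chat.completions.create(
        model="llama-3.1-8b",  # placeholder model id
        messages=[{"role": "user",
                   "content": "Summarize: " + "lorem ipsum " * 2000}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()

    print(f"TTFT: {first - start:.2f}s")
    print(f"decode: ~{chunks / (end - first):.0f} tok/s")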
| halJordan wrote:
| What kind of answer are you looking for? Just start asking it
| questions. The constant demand for a magic silver bullet use
| case applicable to every person in the country is wild. If you
| have to ask, you're not using it.
|
| What exact use case did google.com enable that made it
| worthwhile for everyone to start using it immediately? It let
| you access nytimes.com? Access amazon.com? No, it let you ask
| off-the-wall, asinine, long-tail questions no one else asked.
| mikewarot wrote:
| Why is it so gosh darned slow? If you've got enough transistors
| to hold 44 gigabytes of RAM, you've got enough to store the
| whole model with no need for off-chip transfers.
|
| I'd expect tokens out at 1 GHz aggregate. Anything less than 1
| MHz is a joke.... ok, not a joke, but surprisingly slow.
| twothreeone wrote:
| Even if they could generate tokens at that speed on the chip
| (which maybe they can in theory?) you need to get user tokens
| onto the chip and the resulting model tokens off again, and
| transport them to the user as well. This means at some point
| the I/O becomes the bottleneck, not the compute. I also suspect
| it will get faster still; from the announcement it didn't sound
| like it's "optimal" yet.
| chessgecko wrote:
| On die communication isn't free, a lot of things here are
| sequential and within matrix multiplies the cores have to
| transfer output and mem loads have to be distributed. It's
| really fast but not like one cycle
| mikewarot wrote:
| You could add a series of latches, and use the magic of graph
| coloring to eliminate any timing issues, and pipeline the
| thing sufficiently to get a GHz of throughput, even if it
| takes many cycles to make it all the way through the pipe.
|
| Personally, I'd put all the parameters in NOR flash, then
| cycle through the row lines sequentially to load the
| parameters into the MAC. You could load all the inputs in
| parallel as fast as the dynamic power limits of the chip
| allow. If you use either DMA or a hardware ring buffer to
| push all the tokens through the layers, you could keep the
| throughput going with various sizes of models, etc.
|
| Obviously with only one MAC you couldn't have a single stream
| at a GHz, but you could have 4000 separate streams at 250,000
| tokens/second.
| chessgecko wrote:
| Their numbers are for a single input; I assume the aggregate
| throughput is much higher, given the prices they are quoting
| and the cost of a single CS-3.
| GaggiX wrote:
| It only needs to compute about a trillion floating-point
| operations per token, and each layer relies on the previous
| one.
|
| I wonder why it doesn't output a billion tokens per second.
| ein0p wrote:
| The coarse estimate of compute in transformers is about as
| many MACs as there are weights, or twice as many flops
| (because multiplication and addition are counted as separate
| operations). So for Llama 70B that's about 70B MACs per
| token, which is manageable. What's far less manageable is
| reading the entire model from RAM N times a second.
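Plugging the headline numbers from this launch into that estimate makes
the point concrete. A back-of-envelope sketch, assuming fp16 weights,
batch size 1, and ignoring the KV cache and attention:

    params = 8e9          # Llama 3.1 8B
    bytes_per_param = 2   # fp16 (assumption)
    tok_per_s = 1846      # the benchmarked rate

    # At batch size 1 every weight is read once per generated token.
    weight_read_bw = params * bytes_per_param * tok_per_s
    print(f"~{weight_read_bw / 1e12:.0f} TB/s of weight reads")   # ~30 TB/s

    # Compute is comparatively mild: ~2 flops per weight per token.
    flops = 2 * params * tok_per_s
    print(f"~{flops / 1e12:.0f} TFLOP/s of compute")              # ~30 TFLOP/s
    # Tens of TB/s of weight reads is why keeping the model in on-chip
    # SRAM (rather than external DRAM/HBM) matters for this speed.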
| GaggiX wrote:
| This would only be the case if we ignore the multiplication
| between queries and keys, and the resulting vector being
| multiple with the values, and also the multiple heads.
| ein0p wrote:
| No, that is always the case. Attention is only about one third
| of the ops, and QK is a fraction of that. Outside of truly
| massive sequence lengths it doesn't matter a whole lot, even
| though it's nominally quadratic. It's trivial to run the
| numbers on this - you only need to do it for one layer.
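Running those numbers for one layer of Llama 3.1 8B (dimensions from its
public config) bears this out; a rough MAC count that ignores norms and
softmax:

    d_model, n_heads, n_kv_heads, head_dim, d_ff = 4096, 32, 8, 128, 14336

    # Weight MACs per token per layer (independent of sequence length):
    # Q and O projections, the smaller GQA K/V projections, and the MLP.
    qkvo = 2 * d_model * (n_heads * head_dim) + 2 * d_model * (n_kv_heads * head_dim)
    mlp = 3 * d_model * d_ff                  # gate, up, down projections
    weight_macs = qkvo + mlp                  # ~0.22e9 per layer

    # Score MACs per token (QK^T plus scores @ V) grow with context length L.
    for L in (1024, 4096, 16384):
        attn_macs = 2 * n_heads * head_dim * L
        frac = attn_macs / (weight_macs + attn_macs)
        print(f"L={L:6d}: quadratic part is ~{frac:.0%} of per-layer MACs")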
| ChrisArchitect wrote:
| [dupe]
|
| More discussion on official post:
| https://news.ycombinator.com/item?id=41369705
| phkahler wrote:
| The winner will be one of two approaches: 1) Getting great
| performance using regular DRAM - system memory. 2) Bringing the
| compute to the RAM chips - DRAM is accessed 64Kb per row (or
| more?) and at ~10ns per read you can use small/slow ALUs along
| the row to do MAC operations. Not sure how you program that
| though.
|
| Current "at home" inference tends to be limited by how much RAM
| your graphics card has, but system RAM scales better.
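As a rough sketch of what the parent's row numbers imply (the 16-array
figure at the end is purely an assumption for illustration):

    row_bits = 64 * 1024   # 64 Kbit consumed per row activation (parent's figure)
    t_row = 10e-9          # ~10 ns per row read

    # If MAC units sit next to the sense amps, each array streams a full
    # row internally without ever crossing the external memory bus.
    per_array = row_bits / 8 / t_row
    print(f"~{per_array / 1e9:.0f} GB/s of in-place reads per array")

    arrays = 16            # assumption, just to show how it scales
    print(f"~{arrays * per_array / 1e12:.1f} TB/s across {arrays} arrays")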
| rfoo wrote:
| The winner, unfortunately, will be on cloud inference.
| eth0up wrote:
| I'll probably get stoned for asking here, but... since you seem
| knowledgeable on the subject:
|
| I just got llama3.1-8b (standard and instruct). However, I
| cannot do anything with it on my current hardware. Can you
| recommend the best AI model that I 1) can self-host, 2) can run
| on 16GB RAM with no dedicated graphics card and an old Intel
| i5, and 3) can use on Debian without installing a bunch of
| exo-repo mystery code?
|
| Any recommendation, directly or semi-related, would be
| appreciated - I'm doing my 'research' but haven't made much
| progress nor had any questions answered.
| arcanemachiner wrote:
| Setting up Ollama via Docker was the easiest way for me to
| get up and running. Not 100% sure if it fits your
| constraints, but highly recommended.
| programd wrote:
| Another option is to download and compile llama.cpp and you
| should be able to run quantized models at an acceptable
| speed.
|
| https://github.com/ggerganov/llama.cpp
|
| Also, if you can spend the $60 and buy another 32GB of RAM,
| this will allow you to run the 30GB models quite nicely.
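If compiling and driving the binaries directly feels like too much, the
llama-cpp-python bindings wrap the same code in a single pip package; a
minimal sketch, where the GGUF filename is a placeholder and a 4-bit
quant is assumed so the model fits easily in 16GB:

    # pip install llama-cpp-python, then point it at a quantized GGUF file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3.1-8b-instruct-q4_k_m.gguf",  # placeholder path
        n_ctx=4096,      # modest context keeps the KV cache small
        n_threads=4,     # roughly the physical core count of an older i5
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Name three uses for a local 8B model."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])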
| eth0up wrote:
| Unfortunately my motherboard is capped at 16GB of RAM.
| smokel wrote:
| Running LLMs on that kind of hardware will be very slow
| (expect responses with only a few words per second, which is
| probably pretty annoying).
|
| LM Studio [1] makes it very easy to run models locally and
| play with them. Llama 3.1 will only run in quantized form
| with 16GB RAM, and that cripples it quite badly, in my
| opinion.
|
| You may try Phi-3 Mini, which has only 3.8B weights and can
| still do fun things.
|
| [1] https://lmstudio.ai/
| eth0up wrote:
| Much appreciated. Thanks for this!
| wkat4242 wrote:
| I don't find llama3.1 noticeably worse quantised to 8-bit
| integer than the original fp16, to be honest. It's also a lot
| faster.
|
| Of course even then you're not going to reach the whole
| 128k context window on 16GB but if you don't need that it
| works great.
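The context limit follows from KV-cache arithmetic; a rough sketch using
Llama 3.1 8B's shape (32 layers, 8 KV heads of dim 128) and an fp16
cache, ignoring weights, activations, and any cache quantization:

    n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2  # fp16 cache

    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
    print(f"{kv_per_token / 1024:.0f} KiB of KV cache per token")        # 128 KiB

    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:7d} tokens -> {ctx * kv_per_token / 2**30:.1f} GiB of cache")
    # At the full 128k window the cache alone is ~16 GiB, before any weights.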
| mikewarot wrote:
| Completely eliminating the separation between RAM and compute
| is how FPGAs are so fast, they do most of the computing as a
| series of Look Up Tables (LUTs), and optimize for latency and
| utilization with fancy switching fabrics.
|
| The downside of the switching fabrics is that optimizing a
| design to fit an FPGA can sometimes take days.
| ein0p wrote:
| +1. For inference especially compute is abundant and basically
| free in terms of energy. Almost all of the energy is spent on
| memory movement. The logical solution is to not move
| unaggregated data.
| russ wrote:
| Here's an AI voice assistant we built this weekend that uses it:
|
| https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...
| cheptsov wrote:
| Very interested in playing with their hardware and cloud. Also I
| wonder if it's possible to try cloud without contacting their
| sales.
| ein0p wrote:
| 8b models won't even need a server a year from now. Basically the
| only reason to go to the server a year or two from now will be to
| do what edge devices can't do: general purpose chat, long context
| (multimodal especially), data augmented generation that relies on
| pre-existing data sources in the cloud, etc. And on the server
| it's very expensive to run batch size 1. You want to maximize the
| batch size while also keeping an eye on time to first token and
| time per token. Basically 20-25 tok/sec generation throughput is
| a good number for most non-demo workloads. TTFT for median prompt
| size should ideally be well under 1 sec.
|
| But I'm happy they got this far. It's an ambitious vision, and
| it's extra competition in a field where it's severely lacking.
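A toy sketch of that trade-off; all numbers below are made-up
assumptions, just to show how batch size divides a server's aggregate
decode rate while TTFT is set mostly by prefill speed:

    aggregate_decode = 20_000   # tok/s the server can sustain (assumption)
    prefill_rate = 200_000      # tok/s of prompt processing (assumption)
    median_prompt = 2_000       # tokens

    ttft = median_prompt / prefill_rate + 0.05  # plus a guess at queueing overhead
    for batch in (1, 64, 256, 1024):
        per_stream = aggregate_decode / batch
        verdict = "ok" if per_stream >= 20 and ttft < 1.0 else "too slow per stream"
        print(f"batch={batch:5d}: {per_stream:8.1f} tok/s per stream, "
              f"TTFT~{ttft:.2f}s -> {verdict}")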
| bkitano19 wrote:
| Time to first token is just as important to know for many use
| cases, yet people rarely report it.
| Gcam wrote:
| See here for our TTFT metric benchmarks:
| https://artificialanalysis.ai/models/llama-3-1-instruct-70b/...
___________________________________________________________________
(page generated 2024-08-27 23:01 UTC)