[HN Gopher] Groq runs Mixtral 8x7B-32k with 500 T/s
       ___________________________________________________________________
        
       Groq runs Mixtral 8x7B-32k with 500 T/s
        
       Author : tin7in
       Score  : 470 points
       Date   : 2024-02-19 12:06 UTC (10 hours ago)
        
 (HTM) web link (groq.com)
 (TXT) w3m dump (groq.com)
        
       | itsmechase wrote:
       | Incredible tool. The Mixtral 8x7B model running on their hardware
       | did 491.40 T/s for me...
        
       | dariobarila wrote:
       | Wow! So fast!
        
       | ppsreejith wrote:
       | Relevant thread from 5 months ago:
       | https://news.ycombinator.com/item?id=37469434
       | 
       | I'm achieving consistent 450+ tokens/sec for Mixtral 8x7b 32k and
       | ~200 tps for Llama 2 70B-4k.
       | 
        | As an aside, seeing that this is built with Flutter Web, perhaps
       | a mobile app is coming soon?
        
         | tome wrote:
         | There was also another discussion about Groq a couple of months
         | ago https://news.ycombinator.com/item?id=38739199
        
       | tome wrote:
       | Hi folks, I work for Groq. Feel free to ask me any questions.
       | 
       | (If you check my HN post history you'll see I post a lot about
       | Haskell. That's right, part of Groq's compilation pipeline is
       | written in Haskell!)
        
         | ppsreejith wrote:
         | Thank you for doing this AMA
         | 
         | 1. How many GroqCards are you using to run the Demo?
         | 
          | 2. Is there a newer version you're using which has more SRAM
          | (since the one I see online only has 230MB)? This seems to be
          | the number that will drive down your cost (by taking advantage
          | of batch processing, CMIIW!)
         | 
         | 3. Can TTS pipelines be integrated with your stack? If so, we
         | can truly have very low latency calls!
         | 
         | *Assuming you're using this:
         | https://www.bittware.com/products/groq/
        
           | tome wrote:
           | 1. I think our GroqChat demo is using 568 GroqChips. I'm not
           | sure exactly, but it's about that number.
           | 
           | 2. We're working on our second generation chip. I don't know
           | how much SRAM it has exactly but we don't need to increase
           | the SRAM to get efficient scaling. Our system is
           | deterministic, which means no need for waiting or queuing
           | anywhere, and we can have very low latency interconnect
           | between cards.
           | 
           | 3. Yeah absolutely, see this video of a live demo on CNN!
           | 
           | https://www.youtube.com/watch?t=235&v=pRUddK6sxDg
        
             | ppsreejith wrote:
             | Thank you, that demo was insane!
             | 
             | Follow up (noob) question: Are you using a KV cache? That
             | would significantly increase your memory requirements. Or
             | are you forwarding the whole prompt for each auto-
             | regressive pass?
        
               | tome wrote:
               | You're welcome! Yes, we have KV cache. Being able to
               | implement this efficiently in terms of hardware
               | requirements and compute time is one of the benefits of
               | our deterministic chip architecture (and deterministic
               | system architecture).
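                | 
                | In case it's useful context, the generic KV-cache idea
                | (independent of our hardware) looks roughly like this toy
                | single-head sketch in PyTorch: each new token computes
                | only its own q/k/v and attends over the cached keys and
                | values instead of re-running the whole prompt.
                | 
                |     import torch
                | 
                |     d = 64
                |     k_cache, v_cache = [], []
                | 
                |     def decode_step(x):         # x: (1, d) newest token
                |         q, k, v = x, x, x       # stand-ins for projections
                |         k_cache.append(k)
                |         v_cache.append(v)
                |         K = torch.cat(k_cache)  # grows one row per token
                |         V = torch.cat(v_cache)
                |         w = torch.softmax(q @ K.T / d**0.5, dim=-1)
                |         return w @ V            # (1, d)
                | 
                |     for _ in range(5):
                |         out = decode_step(torch.randn(1, d))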
        
               | ppsreejith wrote:
               | Thanks again! Hope I'm not overwhelming but one more
               | question: Are you decoding with batch size = 1 or is it
               | more?
        
               | tome wrote:
               | That's OK, feel free to keep asking!
               | 
               | I think currently 1. Unlike with graphics processors,
               | which really need data parallelism to get good
               | throughput, our LPU architecture allows us to deliver
               | good throughput even at batch size 1.
        
             | WiSaGaN wrote:
              | How much do 568 chips cost? What's the cost ratio compared
              | to a setup with roughly the same throughput using A100s?
        
               | benchess wrote:
               | They're for sale on Mouser for $20625 each
               | https://www.mouser.com/ProductDetail/BittWare/RS-GQ-
               | GC1-0109...
               | 
               | At that price 568 chips would be $11.7M
        
               | tome wrote:
               | Yeah, I don't know what the cost to us is to build out
               | our own hardware but it's significantly less expensive
               | than retail.
        
               | WiSaGaN wrote:
                | That seems to be per card instead of per chip. I would
                | expect it to have multiple chips on a single card.
        
               | renewiltord wrote:
               | From the description that doesn't seem to be the case,
               | but I don't know this product well
               | 
               | > _Accelerator Cards GroqCard low latency AI /ML
               | Inference PCIe accelerator card with single GroqChip_
        
               | WiSaGaN wrote:
               | Missed that! Thanks for pointing out!
        
             | gautamcgoel wrote:
             | Can you talk about the interconnect? Is it fully custom as
             | well? How do you achieve low latency?
        
               | tome wrote:
               | You can find out about the chip to chip interconnect from
               | our paper below, section 2.3. I don't think that's
               | custom.
               | 
               | We achieve low latency by basically being a software-
               | defined architecture. Our functional units operate
                | completely orthogonally to each other. We don't have to
               | batch in order to achieve parallelism and the system
               | behaviour is completely deterministic, so we can schedule
               | all operations precisely.
               | 
               | https://wow.groq.com/wp-
               | content/uploads/2023/05/GroqISCAPape...
        
         | Oras wrote:
         | Impressive speed. Are there any plans to run fine-tuned models?
        
           | tome wrote:
           | Yes, we're working on a feature to give our partners the
           | ability to deploy their own fine-tuned models.
        
         | dkhudia wrote:
         | @tome for the deterministic system, what if the timing for one
         | chip/part is off due to manufacturing/environmental factors
          | (e.g., temperature)? How does the system handle this?
        
           | mechagodzilla wrote:
           | Those sorts of issues are part of timing analysis for a chip,
           | but once a chip's clock rate is set, they don't really factor
           | in unless there is some kind of dynamic voltage/frequency
           | scaling scheme going on. This chip probably does not do any
           | of that and just uses a fixed frequency, so timing is
           | perfectly predictable.
        
           | tome wrote:
           | We know the maximum possible clock drift and so we know when
           | we need to do a resynchronisation to keep all the chips in
           | sync. You can read about it in section 3.3 of our recent
           | whitepaper: https://wow.groq.com/wp-
           | content/uploads/2023/05/GroqISCAPape...
        
         | mechagodzilla wrote:
         | You all seem like one of the only companies targeting low-
         | latency inference rather than focusing on throughput (and thus
         | $/inference) - what do you see as your primary market?
        
           | tome wrote:
           | Yes, because we're one of the only companies whose hardware
           | can actually support low latency! Everyone else is stuck with
           | traditional designs and they try to make up for their high
           | latency by batching to get higher throughput. But not all
           | applications work with high throughput/high latency ... Low
           | latency unlocks feeding the result of one model into the
           | input of another model. Check out this conversational AI demo
           | on CNN. You can't do that kind of thing unless you have low
           | latency.
           | 
           | https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s
        
             | vimarsh6739 wrote:
             | Might be a bit out of context, but isn't the TPU also
             | optimized for low latency inference? (Judging by reading
             | the original TPU architecture paper here -
             | https://arxiv.org/abs/1704.04760). If so, does Groq
             | actually provide hardware support for LLM inference?
        
               | tome wrote:
               | Jonathan Ross on that paper is Groq's founder and CEO.
                | Groq's LPU is a natural continuation of the breakthrough
               | ideas he had when designing Google's TPU.
               | 
               | Could you clarify your question about hardware support?
               | Currently we build out our hardware to support our cloud
               | offering, and we sell systems to enterprise customers.
        
               | vimarsh6739 wrote:
               | Thanks for the quick reply! About hardware support, I was
               | wondering if the LPU has a hardware instruction to
               | compute the attention matrix similar to the
               | MatrixMultiply/Convolve instruction in the TPU ISA.
               | (Maybe a hardware instruction which fuses a softmax on
               | the matmul epilogue?)
        
               | tome wrote:
               | We don't have a hardware instruction but we do have some
               | patented technology around using a matrix engine to
               | efficiently calculate other linear algebra operations
               | such as convolution.
        
               | mirekrusin wrote:
                | Are you considering targeting the consumer market? There
                | are a lot of people throwing $2k-$4k into local setups,
                | and they primarily care about inference.
        
               | tome wrote:
               | At the moment we're concentrating on building out our API
               | and serving the enterprise market.
        
         | phh wrote:
         | You're running fp32 models, fp16 or quantized?
        
           | tome wrote:
           | FP16 for calculating all activations. Some data is stored as
           | FP8 at rest.
        
         | andy_xor_andrew wrote:
          | Are your accelerator chips designed in-house? Or are they some
          | specialized silicon or FPGA or something that you wrote very
          | optimized inference code for?
         | 
         | it's really amazing! the first time I tried the demo, I had to
         | try a few prompts to believe it wasn't just an animation :)
        
           | tome wrote:
           | Yup, custom ASIC, designed in-house, built into a system of
           | several racks, hundreds of chips, with fast interconnect.
           | Really glad you enjoyed it!
        
         | itishappy wrote:
         | Alright, I'll bite. Haskell seems pretty unique in the ML
         | space! Any unique benefits to this decision, and would you
         | recommend it for others? What areas of your project do/don't
         | use Haskell?
        
           | tome wrote:
           | Haskell is a great language for writing compilers! The end of
           | our compilation pipeline is written in Haskell. Other stages
           | are written in C++ (MLIR) and Python. I'd recommend anyone to
           | look at Haskell if they have a compiler-shaped problem, for
           | sure.
           | 
           | We also use Haskell on our infra team. Most of our CI infra
           | is written in Haskell and Nix. Some of the chip itself was
           | designed in Haskell (or maybe Bluespec, a Haskell-like
           | language for chip design, I'm not sure).
        
         | ianpurton wrote:
         | Is it possible to buy Groq chips and how much do they cost?
        
           | ComputerGuru wrote:
           | https://www.mouser.com/ProductDetail/BittWare/RS-GQ-
           | GC1-0109...
        
             | UncleOxidant wrote:
             | Only $20,625.00!
        
         | michaelbuckbee wrote:
         | Friendly fyi - I think this might just be a web interface bug
         | but but I submitted a prompt with the Mixtral model and got a
         | response (great!) then switched the dropdown to Llama and
         | submitted the same prompt and got the exact same response.
         | 
         | It may be caching or it didn't change the model being queried
         | or something else.
        
           | tome wrote:
           | Thanks, I think it's because the chat context is fed back to
           | the model for the next generation even when you switch
           | models. If you refresh the page that should erase the history
           | and you should get results purely from the model you choose.
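            | 
            | Concretely, the app keeps one running message list and sends
            | it to whichever model is selected (rough sketch; the field
            | names and model id are illustrative, not the exact payload):
            | 
            |     history = [
            |         {"role": "user", "content": "Explain X"},
            |         {"role": "assistant", "content": "<Mixtral's answer>"},
            |     ]
            |     # After switching the dropdown, the same history rides
            |     # along, so the new model is conditioned on the old
            |     # model's answers. Refreshing the page resets it.
            |     request = {
            |         "model": "llama2-70b",   # placeholder name
            |         "messages": history
            |             + [{"role": "user", "content": "Explain X"}],
            |     }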
        
             | michaelbuckbee wrote:
             | Appreciate the quick reply! That's interesting.
        
               | tome wrote:
               | You're welcome. Thanks for reporting. It's pretty
               | confusing so maybe we should change it :)
        
               | pests wrote:
               | I've always liked how openrouter.ai does it
               | 
               | They allow you to configure chat participants (a model +
               | params like context or temp) and then each AI answers
               | each question independently in-line so you can compare
               | and remix outputs.
        
         | UncleOxidant wrote:
         | When will we be able to buy Groq accelerator cards that would
         | be affordable for hobbyists?
        
           | tome wrote:
            | We are prioritising building out whole systems at the moment.
            | I don't think we'll have a consumer-level offering in the
            | near future.
        
             | frognumber wrote:
             | I will mention: A lot of innovation in this space comes
             | bottom-up. The sooner you can get something in the hands of
             | individuals and smaller institutions, the better your
             | market position will be.
             | 
             | I'm coding to NVidia right now. That builds them a moat.
             | The instant I can get other hardware working, the less of a
             | moat they will have. The more open it is, the more likely I
             | am to adopt it.
        
               | tome wrote:
               | Definitely, that's why we've opened our API to everyone.
        
               | frognumber wrote:
               | I don't think that quite does it. What I'd want -- if you
               | want me to support you -- is access to the chip,
               | libraries, and API documentation.
               | 
               | Best-case would be something I buy for <$2k (if out-of-
               | pocket) or under $5k (if employer). Next best case would
               | be a cloud service with a limited free tier. It's okay if
               | it has barely enough quota that I can develop to it, but
               | the quota should never expire.
               | 
                | (The mistake a lot of services make is to limit the free
                | tier to e.g. 30 days or 1 year, rather than hours/month;
                | if I don't get around to evaluating, switch employers,
                | switch projects, etc., the free tier is gone.)
               | 
               | I did sign up for your API service. I won't be able to
               | use it in prod before your (very nice) privacy guarantees
               | are turned into lawyer-compliant regulatory language. But
               | it's an almost ideal fit for my application.
        
               | tome wrote:
               | Yup, understood. Access to consumer hardware like this is
               | not something that we provide at the moment, I'm afraid.
        
               | frognumber wrote:
               | Don't blame you. Been at plenty of startups, resources
               | are finite, and focus is important.
               | 
                | My only point was to, well, perhaps bump this up from
                | #100 on your personal priority list to #87, to the
                | limited extent that influences your business.
        
         | kkzz99 wrote:
          | How does the Groq PCIe card work exactly? Does it use system
          | RAM to stream the model data to the card? How many T/s could
          | one expect with e.g. 3600MHz DDR4 RAM?
        
           | tome wrote:
           | We build out large systems where we stream in the model
           | weights to the system once and then run multiple inferences
           | on it. We don't really recommend streaming model weights
           | repeatedly onto the chip because you'll lose the benefits of
           | low latency.
        
             | kkzz99 wrote:
             | How does that work when the card only has 230MB of SRAM?
        
               | tome wrote:
               | We connect hundreds of chips across several racks with
               | fast interconnect.
        
               | AhtiK wrote:
               | How fast is the memory bandwidth of that fast
               | interconnect?
        
               | tome wrote:
               | Have a look at section 2.3 of our paper. Between any two
               | chips we get 100 Gbps. The overall bandwidth depends on
               | the connection topology used. I don't know if we make
               | that public.
               | 
               | https://wow.groq.com/wp-
               | content/uploads/2023/05/GroqISCAPape...
        
         | tudorw wrote:
          | As it works at inference time, do you think 'Representation
          | Engineering' could be applied to give a sort of fine-tuning
          | ability? https://news.ycombinator.com/item?id=39414532
        
         | karthityrion wrote:
         | Hi. Are these ASICs only for LLMs or could they accelerate
          | other kinds of models (vision) as well?
        
           | tome wrote:
           | It's a general purpose compute engine for numerical computing
           | and linear algebra, so it can accelerate any ML workloads.
           | Previously we've accelerated models for stabilising fusion
           | reactions and for COVID drug discovery
           | 
           | * https://alcf.anl.gov/news/researchers-accelerate-fusion-
           | rese...
           | 
           | * https://wow.groq.com/groq-accelerates-covid-drug-
           | discovery-3...
        
             | karthityrion wrote:
              | So this specific chip is only for LLMs, as the name LPU
              | (Language Processing Unit) suggests, right?
        
               | tome wrote:
               | The chip is capable of running general numerical compute,
               | but because we're focusing almost entirely on LLMs at the
               | moment we've branded it the LPU.
        
         | jart wrote:
          | If I understand correctly, you're using specialized hardware to
          | improve token generation speed, which is very latency-bound on
          | the speed of computation. However, generating tokens usually
          | only requires matrix-vector multiplications. If I enter a
          | prompt with ~100 tokens then your service goes much slower,
          | probably because you have to do full matrix-matrix
          | multiplications. What are you doing to improve the computation
          | speed of prompt processing?
        
           | tome wrote:
           | I don't think it should be quadratic in input length. Why do
           | you think it is?
        
             | jart wrote:
             | all I know is that when I run llama.cpp a lot of the
             | matrices that get multiplied have their shapes defined by
             | how many tokens are in my prompt.
             | https://justine.lol/tmp/shapes.png Notice how the B matrix
             | is always skinny for generating tokens. But for batch
             | processing of the initial prompt, it's fat. It's not very
             | hard to multiply a skinny matrix but once it's fat it gets
             | harder. Handling the initial batch processing of the prompt
             | appears to be what your service goes slow at.
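              | 
              | A tiny numpy illustration of the shape difference (toy
              | sizes, not llama.cpp's actual code):
              | 
              |     import numpy as np
              | 
              |     d, n_prompt = 4096, 100
              |     W = np.random.randn(d, d)
              | 
              |     decode_x  = np.random.randn(1, d)         # one token
              |     prefill_x = np.random.randn(n_prompt, d)  # whole prompt
              | 
              |     print((decode_x  @ W).shape)   # (1, 4096)   skinny
              |     print((prefill_x @ W).shape)   # (100, 4096) fat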
        
             | johndough wrote:
             | You can ask your website: "What is the computational
             | complexity of self-attention with respect to input sequence
             | length?"
             | 
             | It'll answer something along the lines of self-attention
             | being O(n^2) (where n is the sequence length) because you
             | have to compute an attention matrix of size n^2.
             | 
             | There are other attention mechanisms with better
             | computational complexity, but they usually result in worse
             | large language models. To answer jart: We'll have to wait
             | until someone finds a good linear attention mechanism and
             | then wait some more until someone trains a huge model with
             | it (not Groq, they only do inference).
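              | 
              | A quick way to see the quadratic growth (toy sizes,
              | standard attention only):
              | 
              |     import torch
              | 
              |     d = 64
              |     for n in (128, 256, 512, 1024):
              |         q = torch.randn(n, d)
              |         k = torch.randn(n, d)
              |         scores = q @ k.T               # (n, n) matrix
              |         probs = torch.softmax(scores, dim=-1)
              |         print(n, scores.numel())       # 16384 ... 1048576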
        
         | BryanLegend wrote:
         | How well would your hardware work for image/video generation?
        
           | tome wrote:
           | It should work great as far as I know. We've implemented some
           | diffusion models for image generation but we don't offer them
           | at the moment. I'm not aware of us having implemented any
           | video models.
        
         | pama wrote:
          | FYI, I only see a repeating animation and nothing else on my
          | iPhone in lockdown mode, with Safari or Firefox.
        
         | karthityrion wrote:
          | What is the underlying architecture of the ASICs? Do they use
          | systolic arrays?
        
           | tome wrote:
           | Yes, our matrix engine is quite similar to a systolic array.
           | You can find more details about our architecture in our
           | paper:
           | 
           | https://wow.groq.com/wp-
           | content/uploads/2023/05/GroqISCAPape...
        
         | liberix wrote:
         | How do I sign up for API access? What payment methods do you
         | support?
        
         | amirhirsch wrote:
          | It seems like you are making general-purpose chips to run many
          | models. Are we at a stage where we can consider taping out
          | inference networks directly, propagating the weights as
          | constants in the RTL design?
         | 
         | Are chips and models obsoleted on roughly the same timelines?
        
           | tome wrote:
           | I think the models change far too quickly for that to be
           | viable. A chip has to last several years. Currently we're
           | seeing groundbreaking models released every few months.
        
       | sebzim4500 wrote:
       | So this has nothing to do with `Grok`, the model provided by
       | x.ai?
       | 
       | EDIT: Tried using it, very impressed with the speed.
        
         | tome wrote:
         | Yeah, it's nothing to do with Elon and we (Groq) had the name
         | first. It's a natural choice of name for something in the field
         | of AI because of the connections to the hacker ethos, but we
         | have the trademark and Elon doesn't.
         | 
         | https://wow.groq.com/hey-elon-its-time-to-cease-de-grok/
        
           | terhechte wrote:
            | Can't Chamath (he's one of your investors, right?) do
            | something there? Every person I pitch Groq to is confused and
            | thinks it's about Elon's unspectacular LLM.
        
             | tome wrote:
             | Yeah the confusion has happened a lot to me too. All I know
             | is that it's in the hands of our legal team.
        
           | fragmede wrote:
           | I mean it sucks that Elon went and claimed Grok when you want
           | Groq, plus you were there first, but getting stuck on the
           | name seems like it's going to be a distraction, so why not
           | choose something different?
           | 
           | When Grok eventually makes the news for some negative thing,
            | do you really want that erroneously associated with your
           | product? Do you really want to pick a fight with the
           | billionaire that owns Twitter, is that a core competency of
           | the company?
        
         | mtlmtlmtlmtl wrote:
         | There's also a children's toy named Grok which uses LLMs to
         | talk to the kid.
        
       | cchance wrote:
       | Jesus that makes chatgpt and even gemini seem slow AF
        
         | gremlinsinc wrote:
         | better quality than I was expecting. For fun I set the system
         | prompt to:
         | 
         | You are a leader of a team of ai helpers. when given a question
         | you can call on an expert, as a wizard calls on magic. You will
         | say, I call forth {expert} master of {subject matter} an expert
         | in {x, y, z}. Then you will switch to that persona.
         | 
         | I was not let down..
        
           | tome wrote:
           | Nice prompting strategy :)
        
       | CuriouslyC wrote:
       | This is pretty sweet. The speed is nice but what I really care
       | about is you bringing the per token cost down compared with
       | models on the level of mistral medium/gpt4. GPT3.5 is pretty
       | close in terms of cost/token but the quality isn't there and GPT4
       | is overpriced. Having GPT4 quality at sub-gpt3.5 prices will
       | enable a lot of things though.
        
         | ukuina wrote:
         | I wonder if Gemini Pro 1.5 will act as a forcing function to
         | lower GPT4 pricing.
        
           | ComputerGuru wrote:
           | Is that available via an API now?
        
             | sp332 wrote:
             | Kind of, it's in a "Private Preview" with a waitlist.
        
               | sturza wrote:
                | And in non-EU countries.
        
               | ComputerGuru wrote:
               | Via GCP only?
        
         | MuffinFlavored wrote:
         | What's the difference in your own words/opinion in quality
         | between GPT-3.5 and GPT-4? For what usecases?
        
           | CuriouslyC wrote:
            | GPT-3.5 is great at spitting out marketing babble,
            | summarizing documents and performing superficial analysis,
            | but it doesn't take style prompts as well as GPT-4, and its
            | reasoning is significantly worse when you want it to follow a
            | complex chain-of-thought process while referencing context
            | guidance.
        
         | emporas wrote:
          | Mixtral's quality is definitely up there with GPT-3.5.
          | Specifically for coding, I consider them almost equivalent in
          | quality. In fact Mixtral 8x7B is starting to be my go-to coding
          | assistant instead of GPT. It is fast, it is accurate, and I
          | think I like its responses better than GPT's.
          | 
          | Reducing LLM size almost 10 times in the span of a little more
          | than a year, that's great stuff. The next step, I think, is a
          | 3-billion-parameter MoE with 20 experts.
        
       | deepsquirrelnet wrote:
       | Incredible job. Feels dumb or obvious to say this, but this
       | really changes the way I think of using it. The slow
       | autoregression really sucks because it inhibits your ability to
       | skim sections. For me, that creates an unnatural reading
        | environment. This makes ChatGPT feel antiquated.
        
         | tome wrote:
         | Yes, agreed. We believe the benefits of reducing latency are
         | non-linear. You can hit different phase changes as the latency
         | reduces and new applications become viable. Roundtripping text-
         | to-speech and speech-to-text is one example. We're looking
         | forward to seeing what low latency applications are unlocked by
         | our new users!
        
       | Cheer2171 wrote:
       | What's the underlying hardware for this?
        
         | tome wrote:
         | It's a system built from hundreds of GroqChips (a custom ASIC
         | we designed). We call it the LPU (language processing unit).
         | Unlike graphics processors, which are still best in class for
         | training, LPUs are best in class for low latency and high
         | throughput inference. Our LLMs are running on several racks
         | with fast interconnect between the chips.
        
         | michaelt wrote:
         | They have a paper [1] about their 'tensor streaming
         | multiprocessor'
         | 
         | [1] https://wow.groq.com/wp-
         | content/uploads/2024/02/GroqISCAPape...
        
       | xnorswap wrote:
        | Interesting, I ran my "prime test": it passed on generating a
        | prime, but failed on it being large.
       | 
       | > "Can you generate me a large prime please"
       | 
       | > Sure, here's a large prime number: 193707721
       | 
       | > This number is a prime number because it is only divisible by 1
       | and itself. You can check its primality by trying to divide it by
       | all the numbers up to its square root, and if none of them divide
       | it evenly, then it is a prime number. This number is quite large,
       | but there are much larger prime numbers known. For example, the
       | largest known prime number, as of January 2022, is
       | 2^82,589,933-1, a number with 24,862,048 digits, discovered as
       | part of the Great Internet Mersenne Prime Search (GIMPS).
       | 
        | I then prompted "That's not large", which seemed to break it; the
        | response got stuck in a loop:
       | 
       | > That isn't large
       | 
       | > I apologize if the prime number I provided did not meet your
       | expectations for "large." Prime numbers can be very large indeed,
       | and there is no upper limit to how large a prime number can be.
       | 
       | If you are looking for a very large prime number, here is one
       | with a billion digits:
       | 
       | 10000030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 03030303030303030303030303030303030303030303030303030303030303030
       | 30303030303030303030303030303030303030303030303030303030303030303
       | 030303030303030303030303030303...
       | 
       | It's printed many more pages of 303030... but hacker news has a
       | post length limit!
       | 
        | Further testing, such as prompting it to generate a 1024-bit
        | prime, also runs into odd loops.
       | 
       | > Can you generate a 1024 bit prime?
       | 
       | > Yes, I can generate a 1024-bit prime number for you. Here's one
       | I just generated:
       | 
       | > 115322512531875133213135311402305250340282302283103413642283221
       | 25311565322125311221022112211211211211211211211211211211211211211
       | 2112112112112112112112112112112112112112112112112112112112112112
       | ( Loops 112 forever )
        
         | 7734128 wrote:
         | How do you know that it didn't somehow find the largest prime?
          | Perhaps you just threw away a Nobel Prize.
        
           | gautamcgoel wrote:
           | Nobel Prize in what? There is no Nobel in mathematics or
           | computer science.
        
             | fatkam wrote:
             | nobel prizes are a joke anyways.... Obama got the one for
             | peace?
        
         | Workaccount2 wrote:
          | I can't think of any value in asking LLMs to generate prime
          | numbers. They are neural nets, not Turing machines.
        
           | xnorswap wrote:
            | The value is showing how confidently it presents incorrect
            | results.
           | 
           | Especially the lack of nuance or uncertainty in the language
           | they use.
           | 
            | They present the incorrect information extremely confidently,
            | and prime generation is interesting because the output isn't
            | easy for the user to spot as obviously incorrect, yet it's
            | possible to determine it's wrong for small numbers and
            | difficult to verify for large ones.
           | 
           | It's my favourite test because it's a good demonstration of
           | the lack of nuance or uncertainty in LLMs. They have no sense
           | of how wrong the information they're giving out might be.
           | 
            | If they could give confidence intervals for any information,
            | then they could provide context for how likely they think
            | they are to be correct, but instead they double down on their
            | incorrectness.
        
             | tveita wrote:
             | Mixtral is a pre-existing model that was released by
             | Mistral AI in December: https://mistral.ai/news/mixtral-of-
             | experts/
             | 
             | It was discussed on HN at the time:
             | https://news.ycombinator.com/item?id=38598559
             | 
             | You won't learn anything about the chips this submission is
             | about by asking it questions - maybe except to look for
             | degradation from quantization or computational shortcuts,
             | but you'd have to compare the answers to Mixtral running
             | with 16 bit floats. The point is the speed, just like a
             | faster CPU isn't going to improve the UI of Outlook.
        
         | xnorswap wrote:
         | It's worth noting that it may have got lucky with the first
         | response. A clean session with the prompt:
         | 
         | > Could you generate me a 64 bit prime?
         | 
         | Returns:
         | 
         | > Sure, here's a 64-bit prime number: 15592425903773483751
         | 
         | > You can verify that it's prime by checking that it has no
         | divisors between 2 and its square root. In this case, the
         | square root of the number is approximately 12489593.5, so you
         | would need to check divisibility for all integers between 2 and
         | 12489593.5. Since the number is odd, you can skip checking even
         | numbers, which will reduce the number of checks you need to
         | perform.
         | 
         | > However, if you're looking for a larger 64-bit prime or a
         | prime number with specific properties, you may need to use a
         | more sophisticated algorithm or a specialized prime number
         | generator.
         | 
         | Like GPT3.5 and Bard, it confidently suggests that
         | 15592425903773483751 is prime, and even explains why.
         | 
         | It's wrong:
         | 
          | > $ factor 15592425903773483751
          | > 15592425903773483751: 3 3 3 7 167 853 911 635723119
         | 
         | Further prompting:
         | 
         | > Can you factor 15592425903773483751?
         | 
         | > No, I cannot factor 15592425903773483751 because it is a
         | prime number, and prime numbers cannot be factored into smaller
         | non-trivial factors. By definition, a prime number is a
         | positive integer greater than 1 that has no positive integer
         | divisors other than 1 and itself. In this case,
         | 15592425903773483751 is a 64-bit prime number, which means it
         | is a prime number that fits into a 64-bit data type.
         | 
         | > To verify that 15592425903773483751 is a prime number, you
         | can use a primality test, such as the Fermat primality test or
         | the Miller-Rabin primality test. These tests can quickly
         | determine whether a number is prime or composite (not prime)
         | with high probability. In the case of 15592425903773483751,
         | both tests confirm that it is a prime number.
         | 
         | Even further prompting to get it to try to multiply numbers
         | together quickly turns into a disaster, culminating with:
         | 
         | > That doesn't look right
         | 
         | > I apologize for the mistake in my previous response. Here is
         | the correct result of the multiplication:
         | 
         | > 3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119 =
         | 322711211111111111
         | 
         | > Therefore, the product of the numbers is:
         | 
         | > 322711211111111111.
         | 
         | > I apologize for any confusion my previous response may have
         | caused.
         | 
         | The context seems to be really aggressively baked in, because
         | when I ask it:
         | 
         | > Can you factor 322711211111111111?
         | 
         | It returns
         | 
         | > Yes, I can factor the number 322711211111111111. Here is its
         | prime factorization:
         | 
         | > 322711211111111111 = 3 x 3 x 3 x 7 x 167 x 853 x 911 x
         | 635723119
         | 
         | > Therefore, the prime factors of the number are 3, 7, 167,
         | 853, 911, and 635723119.
         | 
         | > Note that the prime factorization is unique up to reordering
         | of the factors. In this case, the prime factors are repeated
         | several times, so the prime factorization can be written more
         | compactly as:
         | 
         | > 3^3 x 7 x 167 x 853 x 911 x 635723119
         | 
         | > or
         | 
         | > 3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119
         | 
         | > I hope this helps! Let me know if you have any other
         | questions.
         | 
         | The real irony? 322711211111111111 is prime!
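          | 
          | (If you want to check these locally without GNU factor, sympy
          | does the job; the comments below are just what the factor run
          | above suggests the output should be:)
          | 
          |     from sympy import isprime, factorint
          | 
          |     print(isprime(15592425903773483751))   # False, per factor
          |     print(factorint(15592425903773483751)) # 3^3 * 7 * 167 * ...
          |     print(isprime(322711211111111111))     # the "ironic" one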
        
       | mrtksn wrote:
       | Does this make it practical to run LLMs on mobile devices? I
       | wonder about the power consumption and if it can make sense to
       | have it integrated in some future mobile devices. Or maybe have a
       | dedicated storage, RAM and processing cores that goes as an USB-C
       | add-on? A case with integrated battery and this chip?
       | 
       | I'm dreaming of having LLMs on anything. Unlike the "bluetooth on
       | everything" craze, this can be practical as every device can
       | become smart. Remember how some British researchers made a self
       | driving car using an LLM? A toaster anticipating how to cook when
       | you describe it what you want want actually be an improvement.
        
         | tome wrote:
         | I don't think we've put a GroqChip in a mobile device yet.
         | Interesting idea!
        
         | wmf wrote:
         | I assume this is a million-dollar rack of custom chips so it's
         | probably not coming to mobile any time soon.
        
           | mrtksn wrote:
            | Well, currently it's entirely possible to run these models on
            | iPhones. It's just not practical because it eats all the
            | resources and the battery while slowly generating the output.
            | 
            | Therefore, if Groq has achieved significant efficiency
            | improvements, that is, they are not getting that crazy speed
            | through enormous power consumption, then maybe they can
            | eventually build low-power, mass-produced, cutting-edge-
            | fabbed chips that run at acceptable speeds?
        
             | wmf wrote:
             | The thing is, I don't see any efficiency improvements. I
             | see models running fast on very expensive hardware using
             | techniques that don't scale down.
        
               | mrtksn wrote:
               | Care to explain? Are they using 10x energy for 10x speed
               | improvements?
        
               | wmf wrote:
                | They're using hundreds of chips. Based on the data sheet
                | I would estimate this demo uses 173 kW. It may be 100x
                | the energy to get a 10x speedup.
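                | 
                | (Roughly: 568 cards x ~300 W each is about 170 kW, so the
                | per-card draw I'm assuming is on the order of a few
                | hundred watts.)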
        
               | mrtksn wrote:
               | 100s of chips for who knows how many clients. The mobile
               | phone will have to do calculations just for 1 client.
        
               | tome wrote:
               | Yes, we pipeline requests so multiple users are being
               | handled by the same hardware at one time.
        
               | mrtksn wrote:
                | Thanks for the clarification. So, would you say that Groq
                | has the potential to deliver, let's say, OpenAI speeds on
                | handheld devices at reasonable energy consumption? Or is
                | that not really this tech's strength?
        
               | tome wrote:
               | The industry as a whole is a very long way away from
               | that. The power requirements are too high for mobile.
        
               | pptr wrote:
               | I think the limitation is chip size/cost. SRAM is a lot
               | less dense than RAM. According to Google this is
               | typically used for registers and caches, which are only
               | megabytes large.
        
         | frozenport wrote:
         | Yeah just offload the compute onto the cloud.
        
           | mrtksn wrote:
            | It's too unreliable, too restricted, and not private enough.
            | 
            | ChatGPT stopped processing images for me; I'm trying to get
            | help, but support doesn't appear to be very fast. They asked
            | for more info but I haven't heard back since.
            | 
            | It's too restricted: it can't do anything on hard topics. It
            | doesn't work when you try to work out exploits or dangers in
            | a system, for example.
            | 
            | It's not private: they say they don't train on API requests,
            | but companies steer clear when it comes to sending sensitive
            | data.
        
             | frozenport wrote:
             | The model being too restrictive does seem to be a good
             | point.
             | 
             | Do you think there are less restrictive models hosted on
             | poe.com?
        
         | SahAssar wrote:
         | > Remember how some British researchers made a self driving car
         | using an LLM?
         | 
         | No? Do you mean actual, full self driving on normal roads in
         | traffic?
        
           | mrtksn wrote:
            | Yes, IIRC they reason about the car's actions using LLMs.
            | They still use image processing, but once the objects in the
            | scene are identified, the LLM interprets them and decides
            | what to do with the car.
            | 
            | I'm not sure which one it was, though (Ghost Autonomy
            | maybe?).
        
             | SahAssar wrote:
             | Do you have a source? Because that actually, properly
             | working would be headline global news and would value the
             | company in the billions.
        
               | mrtksn wrote:
               | It was discussed here on HN, that's how I know about it.
               | 
                | I found a few things when I searched around, but I'm not
                | sure which one is the one I recall.
               | 
               | Anyway, here is a video from one:
               | https://www.youtube.com/watch?v=C2rbym6bXM0
               | 
               | Here is a paper discussing something similar:
               | https://arxiv.org/abs/2307.07162
        
               | SahAssar wrote:
               | The description for that video says
               | 
               | > Ghost Autonomy's MLLM-based capabilities are currently
               | in development. These video and image examples show MLLM-
               | based analysis of driving scenes captured from Ghost
               | vehicles driving in both autonomous and conventional
               | mode. _MLLM-based reasoning is not yet being returned to
               | the car to impact actual driving maneuvers._
               | 
               | So the model discussed is not doing any driving
               | whatsoever. This is not self-driving at any level.
        
               | mrtksn wrote:
                | Then maybe it's not the one I remember.
        
       | jackblemming wrote:
       | Impressive work. Nice job team. This is big.
        
         | tome wrote:
         | Thanks!
        
       | sorokod wrote:
       | Not clear if it is due to Groq or to Mixtral, but confident
       | hallucinations are there.
        
         | tome wrote:
         | We run the open source models that everyone else has access to.
         | What we're trying to show off is our low latency and high
         | throughput, not the model itself.
        
           | MuffinFlavored wrote:
           | But if the model is useless/full of hallucinations, why does
           | the speed of its output matter?
           | 
           | "generate hallucinated results, faster"
        
             | Cheer2171 wrote:
             | No, it is "do whatever you were already doing with ML,
             | faster"
             | 
             | This question seems either from a place of deep confusion
             | or is in bad faith. This post is about hardware. The
             | hardware is model independent.* Any issues with models,
             | like hallucinations, are going to be identical if it is run
             | on this platform or a bunch of Nvidia GPUs. Performance in
              | terms of hardware speed and efficiency is orthogonal to
              | performance in terms of model accuracy and hallucinations.
              | Progress on one axis can be made independently of the
              | other.
             | 
             | * Technically no, but close enough
        
               | sorokod wrote:
                | Well ok, Groq provides lower-latency, cheaper access to
                | the same models of questionable quality.
                | 
                | Is this not a lipstick-on-a-pig scenario? I suppose that's
                | more of a question for pig buyers.
        
               | siwakotisaurav wrote:
                | They're probably in the business of being the hardware
                | provider. The best thing would be if Microsoft bought a
                | lot of their chips so that ChatGPT is actually sped up.
                | It's basically model-independent.
        
               | Cheer2171 wrote:
               | Okay. How about this: Someone posts to HN about an
               | amazing new battery technology, which they demo by
               | showing an average-sized smartphone watching TikTok
               | endlessly scroll for over 500 hours on a single charge.
               | 
               | Then someone comments that TikTok is a garbage fire and a
               | horrible corrupting influence, yadda yadda, all that
               | stuff. They ask: what is the point of making phones last
               | longer just to watch TikTok? They say this improved
               | efficiency in battery tech is just putting lipstick on a
               | pig.
               | 
               | That's you in this thread. That's the kind of irrelevant
               | non-contribution you are making here.
        
               | MuffinFlavored wrote:
               | Batteries are useful. The majority of LLMs are not?
        
               | sorokod wrote:
                | Perhaps your analogy reveals more than you intended.
               | 
               | What does it tell you about the new technology if the
               | best vehicle to demonstrate it is TikTok?
        
               | imtringued wrote:
               | Mixtral 8x7b is competitive with ChatGPT 3.5 Turbo so I'm
               | not sure why you are being so dismissive.
               | 
               | https://chat.lmsys.org/ check the leaderboard.
        
             | Der_Einzige wrote:
             | https://preview.redd.it/jrb4e1wr9ll31.png?auto=webp&s=ccb5a
             | 8...
             | 
             | https://i.imgur.com/Jkdtlox.png
        
         | tiborsaas wrote:
         | I asked it to come up with name ideas for a company and it
         | hallucinated them successfully :) I think the trick is to know
          | which prompts are likely to yield results that are not likely
         | to be hallucinated. In other contexts it's a feature.
        
           | sorokod wrote:
           | A bit of a softball don't you think? The initial message
           | suggests _" Are you ready to experience the world's fastest
           | Large Language Model (LLM)? We'd suggest asking about a piece
           | of history"_
           | 
           | So I did.
        
         | kumarm wrote:
          | At the top left-hand corner you can change the model to the
          | Llama 2 70B model.
        
       | aphit wrote:
       | This is incredibly fast, indeed.
       | 
       | What are the current speeds in T/s for say ChatGPT 3.5 or ChatGPT
       | 4? Just how much faster is this?
        
         | kumarm wrote:
         | I ran the same (Code generation) query and here are my results
         | as end user:
         | 
          | ChatGPT: 1 minute 45 seconds.
         | 
         | Gemini: 16 seconds.
         | 
         | Groq: 3 seconds.
        
       | karpathy wrote:
       | Very impressive looking! Just wanted to caution it's worth being
       | a bit skeptical without benchmarks as there are a number of ways
       | to cut corners. One prominent example is heavy model
       | quantization, which speeds up the model at a cost of model
       | quality. Otherwise I'd love to see LLM tok/s progress exactly
       | like CPU instructions/s did a few decades ago.
        
         | binary132 wrote:
         | The thing is that tokens aren't an apples to apples metric....
         | Stupid tokens are a lot faster than clever tokens. I'd rather
         | see token cleverness improving exponentially....
        
         | tome wrote:
         | As a fellow scientist I concur with the approach of skepticism
         | by default. Our chat app and API are available for everyone to
         | experiment with and compare output quality with any other
         | provider.
         | 
         | I hope you are enjoying your time of having an empty calendar
         | :)
        
         | behnamoh wrote:
         | tangent: Great to see you again on HN!
        
         | sp332 wrote:
         | At least for the earlier Llama 70B demo, they claimed to be
         | running unquantized.
         | https://twitter.com/lifebypixels/status/1757619926360096852
         | 
         | Update: This comment says "some data is stored as FP8 at rest"
         | and I don't know what that means.
         | https://news.ycombinator.com/item?id=39432025
        
           | tome wrote:
           | The weights are quantized to FP8 when they're stored in
           | memory, but all the activations are computed at full FP16
           | precision.
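            | 
            | In PyTorch terms the general pattern looks something like
            | the sketch below (purely illustrative, not how the LPU does
            | it; float8 dtypes need torch >= 2.1):
            | 
            |     import torch
            | 
            |     w = torch.randn(4096, 4096, dtype=torch.float16)
            |     w_fp8 = w.to(torch.float8_e4m3fn)  # at rest: 1 byte/weight
            | 
            |     x = torch.randn(1, 4096)
            |     # Dequantize just before use; on the accelerator the
            |     # matmul itself runs at FP16 as described above (float32
            |     # here only so the toy runs on any CPU).
            |     y = x @ w_fp8.to(torch.float32)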
        
             | youssefabdelm wrote:
             | Can you explain if this affects quality relative to fp16?
             | And is mixtral quantized?
        
               | tome wrote:
               | We don't think so, but you be the judge! I believe we
               | quantize both Mixtral and Llama 2 in this way.
        
               | a_wild_dandan wrote:
               | Is your confidence rooted in quantified testing, or just
               | vibes? I'm sure you're right, just curious. (My
               | reasoning: running inference at full fp16 is borderline
               | wasteful. You can use q7 with almost no loss.)
        
               | tome wrote:
               | What quantified testing would you like to see? We've had
               | a lot of very good feedback from our users, particularly
               | about Mixtral.
        
           | bearjaws wrote:
            | Nothing really wrong with FP8 IMO; it usually performs pretty
            | damn well, within 98% of full precision, while significantly
            | reducing memory usage.
        
         | losvedir wrote:
         | Maybe I'm stretching the analogy too far, but are we in the
         | transistor regime of LLMs already? Sometimes I see these 70
         | billion parameter monstrosities and think we're still building
         | ENIAC out of vacuum tubes.
         | 
         | In other words, are we ready to steadily march on, improving
         | LLM tok/s year by year, or are we a major breakthrough or two
         | away before that can even happen?
        
         | Gcam wrote:
          | As part of our benchmarking of Groq we have asked them about
          | quantization and they have assured us they are running models
          | at full FP16. It's a good point and important to check.
         | 
         | Link to benchmarking: https://artificialanalysis.ai/ (Note
         | question was regarding API rather than their chat demo)
        
         | bsima wrote:
         | As tome mentioned we don't quantize, all activations are FP16
         | 
         | And here are some independent benchmarks
         | https://artificialanalysis.ai/models/llama-2-chat-70b
        
           | xvector wrote:
           | Jesus Christ, these speeds with FP16? That is simply insane.
        
             | throwawaymaths wrote:
             | Ask how much hardware is behind it.
        
               | modeless wrote:
               | All that matters is the cost. Their price is cheap, so
               | the real question is whether they are subsidizing the
               | cost to achieve that price or not.
        
       | sva_ wrote:
        | To what extent is the API compatible with OpenAI? Does it offer
       | logprobs[0] and top_logprobs[1]?
       | 
       | 0. https://platform.openai.com/docs/api-
       | reference/chat/create#c...
       | 
       | 1. https://platform.openai.com/docs/api-
       | reference/chat/create#c...
        
         | tome wrote:
         | You can find our API docs here, including details of our OpenAI
         | compatibility
         | 
         | https://docs.api.groq.com/
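          | 
          | For reference, calling an OpenAI-compatible endpoint with the
          | standard openai Python client looks roughly like this; the
          | base_url and model name below are placeholders, and the docs
          | are the place to check for things like logprobs support:
          | 
          |     from openai import OpenAI  # openai client v1+
          | 
          |     client = OpenAI(
          |         base_url="https://api.groq.com/openai/v1",  # see docs
          |         api_key="YOUR_API_KEY",
          |     )
          |     resp = client.chat.completions.create(
          |         model="mixtral-8x7b-32768",  # placeholder, see docs
          |         messages=[{"role": "user", "content": "Hello!"}],
          |     )
          |     print(resp.choices[0].message.content)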
        
           | tome wrote:
           | By the way, we also have a new Discord server where we are
           | hosting our developer community. If you find anything missing
           | in our API you can ask about there:
           | 
           | https://discord.com/invite/TQcy5EBdCP
        
           | kumarm wrote:
           | Filled the form for API Access last night. Is there a delay
           | with increased demand now?
        
             | tome wrote:
             | Yes, there's a huge amount of demand because Twitter
             | discovered us yesterday. There will be a backlog, so sorry
             | about that.
        
               | kumarm wrote:
               | Understandable. Wish you guys best of luck irrespective.
        
       | aeyes wrote:
       | Switching the model between Mixtral and Llama I get word for word
       | the same responses. Is this expected?
        
         | tome wrote:
         | Yeah, this is a common observation. See my comment at
         | https://news.ycombinator.com/item?id=39431921
         | 
         | Maybe we should change the behavior to stop people getting
         | confused.
        
         | bjornsing wrote:
         | No...
        
       | treesciencebot wrote:
       | The main problem with the Groq LPUs is that they don't have any
       | HBM on them at all, just a minuscule (230 MiB) [0] amount of
       | ultra-fast SRAM (20x faster than HBM3, just to be clear). Which
       | means you need ~256 LPUs (4 full server racks of compute; each
       | unit on the rack contains 8x LPUs and there are 8x of those
       | units on a single rack) just to serve a single model [1],
       | whereas you can get a single H200 (1/256 of the server rack
       | density) and serve these models reasonably well.
       | 
       | It might work well if you have a single model with lots of
       | customers, but as soon as you need more than a single model and
       | a lot of finetunes/high-rank LoRAs etc., these won't be usable.
       | The same goes for any on-prem deployment, since the main
       | advantage is consolidating many users onto the same model.
       | 
       | [0]: https://wow.groq.com/groqcard-accelerator/
       | 
       | [1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
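       |
       | As a rough back-of-envelope on the chip count (the parameter
       | count is approximate and KV cache / activations are ignored):
       |
       |     # Back-of-envelope: 230 MiB SRAM chips needed just to hold
       |     # Mixtral 8x7B weights at FP16. Parameter count approximate.
       |     params = 46.7e9              # ~46.7B total parameters
       |     weight_bytes = params * 2    # FP16 = 2 bytes per parameter
       |     sram_per_chip = 230 * 2**20  # 230 MiB
       |     chips = weight_bytes / sram_per_chip
       |     print(f"{weight_bytes / 2**30:.0f} GiB -> ~{chips:.0f} chips")
       |     # ~87 GiB -> ~387 chips before KV cache etc., in the same
       |     # ballpark as the ~576 chips mentioned elsewhere in the thread.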
        
         | pclmulqdq wrote:
         | Groq devices are really well set up for small-batch-size
         | inference because of the use of SRAM.
         | 
         | I'm not so convinced they have a Tok/sec/$ advantage at all,
         | though, especially at the medium to large batch sizes used by
         | the customers who can afford to buy this much silicon.
         | 
         | I assume given the architecture that Groq actually doesn't get
         | any faster for batch sizes >1, and Nvidia cards do get
         | meaningfully higher throughput as batch size gets into the
         | 100's.
        
           | nabakin wrote:
           | I've been thinking the same but on the other hand, that would
           | mean they are operating at a huge loss which doesn't scale
        
           | foundval wrote:
           | (Groq Employee) It's hard to discuss Tok/sec/$ outside of the
           | context of a hardware sales engagement.
           | 
           | This is because the relationship between Tok/s/u,
           | Tok/s/system, Batching, and Pipelining is a complex one that
           | involves compute utilization, network utilization, and (in
           | particular) a host of compilation techniques that we wouldn't
           | want to share publicly. Maybe we'll get to that level of
           | transparency at some point, though!
           | 
           | As far as Batching goes, you should consider that with
           | synchronous systems, if all the stars align, Batch=1 is all
           | you need. Of course, the devil is in the details, and
           | sometimes small batch numbers still give you benefits. But
           | Batch 100's generally gives no advantages. In fact, the
           | entire point of developing deterministic hardware and
           | synchronous systems is to avoid batching in the first place.
        
         | tome wrote:
         | If you want low latency you have to be really careful with HBM,
         | not only because of the delay involved, but also the non-
         | determinacy. One of the huge benefits of our LPU architecture
         | is that we can build systems of hundreds of chips with fast
         | interconnect and we know the precise timing of the whole system
         | to within a few parts per million. Once you start integrating
         | non-deterministic components your latency guarantees disappear
         | very quickly.
        
           | pclmulqdq wrote:
           | I don't know about HBM specifically, but DDR and GDDR at a
           | protocol level are both deterministic. It's the memory
           | controller doing a bunch of reordering that makes them non-
           | deterministic. Presumably, if that is the reason you don't
           | like DRAM, you could build your compiler to be memory-layout
           | aware and have the memory controller issue commands without
           | reordering.
        
             | tome wrote:
             | That could be possible. It's out of my area of expertise so
             | I can't say for sure. My understanding was that HBM forces
             | specific access patterns on you and introduces
             | non-deterministic delays. Our compiler already deals with
             | many other forms of resource-aware scheduling, so it could
             | easily take DRAM refreshes into account; I feel like there
             | must be something else that makes SRAM more suitable in our
             | case.
             | I'll have to leave that to someone more knowledgeable to
             | explain though ...
        
             | johntb86 wrote:
             | Presumably with dram you also have to worry about
             | refreshes, which can come along at arbitrary times relative
             | to the workload.
        
               | pclmulqdq wrote:
               | You can control when those happen, too.
        
           | frognumber wrote:
           | From a theoretical perspective, this is absolutely not true.
           | Asynchronous logic can achieve much lower latency guarantees
           | than synchronous logic.
           | 
           | Come to think of it, this is one of the few places where
           | asynchronous logic might be more than academic... Async logic
           | is hard with complex control flows, which deep learning
           | inference does not have.
           | 
           | (From a practical perspective, I know you were comparing to
           | independently-clocked logic, rather than async logic)
        
             | foundval wrote:
             | (Groq Employee) You're right - we are comparing to
             | independently-clocked logic.
             | 
             | I wonder whether async logic would be feasible for
             | reconfigurable "Spatial Processor" type architectures [1].
             | As far as LPU architectures go, they fall in the "Matrix of
             | Processing Engines"[1] family of architectures, which I
             | would naively guess is not the best suited to leverage
             | async logic.
             | 
             | 1: I'm using the "Spatial Processor" (7:14) and "Matrix of
             | Processing Engines" (8:57) terms as defined in
             | https://www.youtube.com/watch?v=LUPWZ-LC0XE. Sorry for a
             | video link, I just can't think of another single reference
             | that explains the two approaches.
        
           | SilverBirch wrote:
           | Surely once you're scaling over multiple chips/servers/racks
           | you're dealing with retries and checksums and sequence
           | numbers anyway? How do you get around the non-determinacy of
           | networking beyond just hoping that you don't see any errors?
        
             | tome wrote:
             | Our interconnect between chips is also deterministic! You
             | can read more about our interconnect, synchronisation, and
             | error correction in our paper.
             | 
             | https://wow.groq.com/wp-
             | content/uploads/2023/05/GroqISCAPape...
        
         | matanyal wrote:
         | Groq Engineer here, I'm not seeing why being able to scale
         | compute outside of a single card/node is somehow a problem. My
         | preferred analogy is to a car factory: Yes, you could build a
         | car with say only one or two drills, but a modern automated
         | factory has hundreds of drills! With a single drill, you could
         | probably build all sorts of cars, but a factory assembly line
         | is only able to make specific cars in that configuration. Does
         | that mean that factories are inefficient?
         | 
         | You also say that H200s work reasonably well, and that's fair
         | (but debatable) for synchronous, human-interaction use cases.
         | Show me a 30B+ parameter model doing RAG as part of a
         | conversation with voice responses in less than a second,
         | running on Nvidia.
        
           | pbalcer wrote:
           | Just curious, how does this work out in terms of TCO (even
           | assuming the price of a Groq LPU is $0)? What you say makes
           | sense, but I'm wondering how you strike a balance between
           | massive horizontal scaling vs vertical scaling. Sometimes
           | (quite often in my experience) having a few beefy servers is
           | much simpler/cheaper/faster than scaling horizontally across
           | many small nodes.
           | 
           | Or have I got this completely wrong, and your solution
           | enables use cases that are simply unattainable on mainstream
           | (Nvidia/AMD) hardware, making the TCO argument less relevant?
        
             | tome wrote:
             | We're providing by far the lowest latency LLM engine on the
             | planet. You can't reduce latency by scaling horizontally.
        
               | nickpsecurity wrote:
               | Distributed, shared memory machines used to do exactly
               | that in HPC space. They were a NUMA alternative. It works
               | if the processing plus high-speed interconnect are
               | collectively faster than the request rate. The 8x setups
               | with NVLink are kind of like that model.
               | 
               | You may have meant that nobody has a stack that uses
               | clustering or DSM with low-latency interconnects. If so,
               | then that might be worth developing given prior results
               | in other low-latency domains.
        
               | tome wrote:
               | I think existing players will have trouble developing a
               | low-latency solution like ours while they are still
               | running on non-deterministic hardware.
        
               | WanderPanda wrote:
               | What do you mean by non-deterministic hardware? cuBLAS on
               | a laptop GPU was deterministic when I tried it last iirc
        
               | tome wrote:
               | Non-deterministic timing characteristics.
        
               | frozenport wrote:
               | Tip of the iceberg.
               |
               | DRAM needs to be refreshed every X cycles.
               |
               | This means you don't know how long a read from memory
               | will take: you could be reading during a refresh cycle.
               | The refresh circuitry also adds latency.
        
               | nickpsecurity wrote:
               | While you're here, I have a quick, off-topic question.
               | We've seen incredible results with GPT3-176B (Davinci)
               | and GPT4 (MoE). Making attempts at open models that reuse
               | their architectural strategies could have a high impact
               | on everyone. Those models took 2,500-25,000 GPUs to
               | train, though. It would be great to have a low-cost
               | option for pre-training Davinci-class models.
               | 
               | It would be great if a company or others with AI
               | hardware were willing to do production runs of chips sold
               | at cost, specifically to make open, permissively licensed
               | models. As in, since you'd lose profit, the cluster owner
               | and users would be legally required to only make
               | permissive models. Maybe at least one in each category
               | (e.g. text, visual).
               | 
               | Do you think your company or any other hardware supplier
               | would do that? Or would someone sell 2,500 GPUs at cost
               | for open models?
               | 
               | (Note to anyone involved in CHIPS Act: please fund a
               | cluster or accelerator specifically for this.)
        
               | tome wrote:
               | Great idea, but Groq doesn't have a product suitable for
               | training at the moment. Our LPUs shine in inference.
        
           | treprinum wrote:
           | > Show me a 30b+ parameter model doing RAG as part of a
           | conversation with voice responses in less than a second,
           | running on Nvidia
           | 
           | I built one, should be live soon ;-)
        
             | tome wrote:
             | Exciting! Looking forward to seeing it.
        
           | huac wrote:
           | > 30b+ parameter model doing RAG as part of a conversation
           | with voice responses in less than a second, running on
           | Nvidia.
           | 
           | I believe that this is doable - my pipeline is generally
           | closer to 400ms without RAG and with Mixtral, with a lot of
           | non-ML hacks to get there. It would also definitely be doable
           | with a joint speech-language model that removes the
           | transcription step.
           | 
           | For these use cases, time to first byte is the most important
           | metric, not total throughput.
        
           | mlazos wrote:
           | You can't scale horizontally forever because of
           | communication. I think HBM would provide a lot more
           | flexibility with the number of chips you need.
        
         | moralestapia wrote:
         | >The main problem with the Groq LPUs is, they don't have any
         | HBM on them at all. Just a miniscule (230 MiB) [0] amount of
         | ultra-fast SRAM [...]
         | 
         | IDGAF about any of that, lol. I just want an API endpoint.
         | 
         | 480 tokens/sec at $0.27 per million tokens? Sign me up; I
         | don't care about their hardware at all.
        
           | treesciencebot wrote:
           | There are providers out there offering $0 per million
           | tokens; that doesn't mean it is sustainable or that it won't
           | disappear as soon as the VC well runs dry. I'm not saying
           | this is the case for Groq, but in general you probably should
           | care if you want to build something serious on top of it.
        
         | trsohmers wrote:
         | Groq states in this article [0] that they used 576 chips to
         | achieve these results. Continuing with your analysis, you also
         | need to factor in that each additional user you want to serve
         | requires a separate KV cache, which can add multiple gigabytes
         | per user.
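         |
         | As a sketch of the per-user KV cache cost (using Mixtral
         | 8x7B's published config as I understand it -- 32 layers, 8 KV
         | heads, head dim 128 -- so treat the numbers as approximate):
         |
         |     # Rough KV cache size per concurrent user at FP16.
         |     layers, kv_heads, head_dim = 32, 8, 128  # approximate config
         |     per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, FP16
         |     context = 32_768
         |     total = per_token * context
         |     print(f"{per_token / 1024:.0f} KiB/token, "
         |           f"{total / 2**30:.1f} GiB at 32k context")
         |     # 128 KiB/token, 4.0 GiB per user at full context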
         | 
         | My professional independent-observer opinion (not based on my
         | 2 years of working at Groq) is that their COGS to achieve these
         | performance numbers would exceed several million dollars.
         | Depreciating that over expected usage at the theoretical prices
         | they have posted seems impractical, so from a performance-per-
         | dollar standpoint they don't seem viable, but they do have a
         | very cool demo of an insane level of performance if you throw
         | cost concerns out the window.
         | 
         | [0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-
         | dep...
        
           | tome wrote:
           | Thomas, I think for full disclosure you should also state
           | that you left Groq to start a competitor (a competitor which
           | doesn't have the world's lowest latency LLM engine nor a
           | guarantee to match the cheapest per-token prices, like Groq
           | does).
           | 
           | Anyone with a serious interest in the total cost of ownership
           | of Groq's system is welcome to email contact@groq.com.
        
             | trsohmers wrote:
             | I thought that was clear through my profile, but yes,
             | Positron AI is focused on providing the best performance
             | per dollar while providing the best quality of service and
             | capabilities rather than just focusing on a single metric
             | of speed.
             | 
             | A guarantee to match the cheapest per-token prices is
             | surely a great way to lose a race to the bottom, but I do
             | wish Groq (and everyone else trying to compete against
             | NVIDIA) the greatest luck and success. I really do think
             | Groq's single-batch/single-user performance makes a great
             | demo, but it is not the best solution for a wide variety of
             | applications; I hope it can find its niche.
        
         | imtringued wrote:
         | I honestly don't see the problem.
         | 
         | "just to serve a single model" could be easily fixed by adding
         | a single LPDDR4 channel per LPU. Then you can reload the model
         | sixty times per second and serve 60 different models per
         | second.
        
           | treesciencebot wrote:
           | Per-chip compute is not the main thing this chip innovates
           | on for fast inference; it is the extremely fast memory
           | bandwidth. When you do that, you'll lose all of that and will
           | be much worse off than any off-the-shelf accelerator.
        
             | QuadmasterXLII wrote:
             | load model, compute a 1k token response (ie, do a thousand
             | forward passes in sequence, one per token), load a
             | different model, compute a response,
             | 
             | I would expect the model loading to take basically zero
             | percent of the time in the above workflow
        
         | londons_explore wrote:
         | > more than a single model and a lot of finetunes/high rank
         | LoRAs
         | 
         | I can imagine a way might be found to host a base model and a
         | bunch of LoRAs whilst using barely more RAM than the base model
         | alone.
         | 
         | The fine-tuning could be done in such a way that only perhaps
         | 0.1% of the weights are changed, and for every computation the
         | difference is applied not to the weights, but to the output
         | activations.
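         |
         | Something like the low-rank trick LoRA already uses, applied
         | at serving time. A sketch with made-up shapes (not how any
         | particular provider implements it): keep one shared base
         | weight and add a tiny per-adapter correction to the
         | activations.
         |
         |     # y = W x + B(A x): shared base weight W, per-adapter
         |     # low-rank pair (A, B). Shapes/values made up for illustration.
         |     import numpy as np
         |
         |     d, r = 4096, 16
         |     W = np.random.randn(d, d).astype(np.float32)    # shared base
         |     adapters = {
         |         "tune-a": (0.01 * np.random.randn(r, d).astype(np.float32),
         |                    0.01 * np.random.randn(d, r).astype(np.float32)),
         |     }
         |
         |     def forward(x, adapter=None):
         |         y = W @ x                       # base activation
         |         if adapter is not None:
         |             A, B = adapters[adapter]
         |             y = y + B @ (A @ x)         # low-rank delta on activations
         |         return y
         |
         |     x = np.random.randn(d).astype(np.float32)
         |     print(forward(x, "tune-a").shape)   # (4096,)
         |     # Each adapter stores 2*r*d floats instead of d*d -- about
         |     # 0.8% of the layer's weights at r=16, d=4096.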
        
           | kcorbitt wrote:
           | This actually already exists! We did a writeup of the
           | relevant optimizations here: https://openpipe.ai/blog/s-lora
        
       | imiric wrote:
       | Impressive demo!
       | 
       | However, the hardware requirements and cost make this
       | inaccessible for anyone but large companies. When do you envision
       | that the price could be affordable for hobbyists?
       | 
       | Also, while the CNN Vapi demo was impressive as well, a few weeks
       | ago here[1] someone shared https://smarterchild.chat/. That also
       | has _very_ low audio latency, making natural conversation
       | possible. From that discussion it seems that
       | https://www.sindarin.tech/ is behind it. Do we know if they use
       | Groq LPUs or something else?
       | 
       | I think that once you reach ~50 t/s, real-time interaction is
       | possible. Anything higher than that is useful for generating
       | large volumes of data quickly, but there are diminishing returns
       | as it's far beyond what humans can process. Maybe such speeds
       | would be useful for AI-AI communication, transferring
       | knowledge/context, etc.
       | 
       | So an LPU product that's only focused on AI-human interaction
       | could have much lower capabilities, and thus much lower cost, no?
       | 
       | [1]: https://news.ycombinator.com/item?id=39180237
        
         | tome wrote:
         | > However, the hardware requirements and cost make this
         | inaccessible for anyone but large companies. When do you
         | envision that the price could be affordable for hobbyists?
         | 
         | For API access to our tokens as a service we guarantee to beat
         | any other provider on cost per token (see
         | https://wow.groq.com). In terms of selling hardware, we're
         | focused on selling whole systems, and they're only really
         | suitable for corporations or research institutions.
        
         | dmw_ng wrote:
         | Given the size of the Sindarin team (3 AFAICT), that mostly
         | looks like a clever combination of existing tech. There are
         | some speech APIs that offer word-by-word realtime transcription
         | (Google has one), assuming most of the special sauce is very
         | well thought out pipelining between speech
         | recognition->LLM->TTS
         | 
         | (not to denigrate their awesome achievement, I would not be
         | interested if I were not curious about how to reproduce their
         | result!)
        
         | stormfather wrote:
         | >>50 t/s is absolutely necessary for real-time interaction with
         | AI systems. Most of the LLM's output will be internal monologue
         | and planning, performing RAG and summarization, etc, with only
         | the final output being communicated to you. Imagine a blazingly
         | fast GPT-5 that goes through multiple cycles of planning out
         | how to answer you, searching the web, writing book reports,
         | debating itself, distilling what it finds, critiquing and
         | rewriting its answer, all while you blink a few times.
        
       | patapong wrote:
       | Very impressive! I am even more impressed by the API pricing,
       | though: $0.27/1M tokens seems like an order of magnitude cheaper
       | than the GPT-3.5 API, and two orders of magnitude cheaper than
       | GPT-4? Am I missing something here?
        
         | siwakotisaurav wrote:
         | They're competing with the lowest-cost competitors for Mistral
         | models atm, which afaik is currently DeepInfra at the same
         | pricing.
        
           | patapong wrote:
           | Huh! Had no idea open source models were already ahead of
           | OpenAI on pricing; will have to look into using these for my
           | use cases.
        
       | doubtfuluser wrote:
       | Nice... a startup that has two "C" positions: CEO and Chief
       | Legal Officer...
       |
       | That sounds like a fun place to be.
        
         | kumarm wrote:
         | They seem to have been around since 2016. Maybe not bad for an
         | LLM company that would need to deal with legal issues?
        
       | eigenvalue wrote:
       | I just want to say that this is one of the most impressive tech
       | demos I've ever seen in my life, and I love that it's truly an
       | open demo that anyone can try without even signing up for an
       | account or anything like that. It's surreal to see the thing
       | spitting out tokens at such a crazy rate when you're used to
       | watching them generate at less than one fifth that speed. I'm
       | surprised you guys haven't been swallowed up by Microsoft, Apple,
       | or Google already for a huge premium.
        
         | tome wrote:
         | Really glad you like it! We've been working hard on it.
        
           | lokimedes wrote:
           | The speed part or the being swallowed part?
        
             | tome wrote:
             | The speed part. We're not interested in being swallowed.
             | The aim is to be bigger than Nvidia in three years :)
        
               | dazzaji wrote:
               | Go for it!
        
               | nurettin wrote:
               | Can you warn us pre-IPO?
        
               | tome wrote:
               | I'm sure you'll hear all about our IPO on HN :) :)
        
         | timomaxgalvin wrote:
         | Sure, but the responses are very poor compared to MS tools.
        
         | brcmthrowaway wrote:
         | I have it on good authority that Apple was very close to
         | acquiring Groq.
        
           | baq wrote:
           | If this is true, expect a call from the SEC...
        
             | 317070 wrote:
             | Even if it isn't true.
             | 
             | Disclosing inside information is illegal, _even if it is
             | false and fabricated_, if it leads to personal gains.
        
               | KRAKRISMOTT wrote:
               | You have to prove the OP had personal gains. If he's just
               | a troll, it will be difficult.
        
               | frognumber wrote:
               | You also have to be an insider.
               | 
               | If I go to a bar and overhear a pair of Googlers
               | discussing something secret, I can:
               | 
               | 1) Trade on it.
               | 
               | 2) Talk about it.
               | 
               | Because I'm not an insider. On the other hand, if I'm
               | sleeping with the CEO, I become an insider.
               | 
               | Not a lawyer. Above is not legal advice. Just a comment
               | that the line is much more complex, and talking about a
               | potential acquisition is usually okay (if you're not
               | under NDA).
        
               | throwawayurlife wrote:
               | It doesn't matter if you overheard it at a bar or if
               | you're just some HN commenter posting completely
               | incorrect legal advice; the law prohibits trading on
               | material nonpublic information.
               | 
               | I would pay a lot to see you try your ridiculous legal
               | hokey-pokey on how to define an "insider."
        
               | programmarchy wrote:
               | If you did hear it in a bar, could you tweet it out
               | before your trade, so the information is made public?
        
               | sofixa wrote:
               | Had insider trading training, and yes, that's the gist of
               | it. If you know or presume that the information is
               | material (makes a difference) and not public, it's
               | illegal to act on it.
        
               | tripletao wrote:
               | Roughly, it's illegal only if you have some duty not to
               | trade on it. If you acquired the information without
               | misappropriating it (like overhearing it from strangers
               | in a normal public bar), then you're free to trade.
               | 
               | https://corpgov.law.harvard.edu/2017/01/18/insider-
               | trading-l...
               | 
               | There's no reason for normal corporate training to
               | discuss that element, because an employee who trades
               | their employer's stock based on MNPI has near-certainly
               | misappropriated it. The question of whether a non-
               | employee has misappropriated information is much more
               | complex, though.
        
               | frognumber wrote:
               | Training is designed to protect the corporation, not to
               | provide accurate legal advice. That's true of most
               | corporate trainings, for that matter, be that bribes /
               | corruption, harassment, discrimination, or whatnot.
               | Corporations want employees very far from the line.
               | 
               | That's the right way to run them.
               | 
               | If you want more nuance, talk to a lawyer or read case
               | law.
               | 
               | Generally, insider trading requires something along the
               | lines of a fiduciary duty to keep the information secret,
               | albeit a very weak one. I'm not going to slice that line,
               | but you see references in-thread.
        
               | frognumber wrote:
               | You're an idiot.
               | 
               | https://www.kiplinger.com/article/investing/t052-c008-s00
               | 1-w...
               | 
               | Case #1.
        
             | sheepscreek wrote:
             | TIL that SEC has authority over private company dealings
             | wrt sale of shares[1].
             | 
             | [1] https://www.sec.gov/education/capitalraising/building-
             | blocks...
        
             | belter wrote:
             | Not if poster is in a crashing plane...
        
         | elorant wrote:
         | Perplexity Labs also has an open demo of Mixtral 8x7b although
         | it's nowhere near as fast as this.
         | 
         | https://labs.perplexity.ai/
        
         | larodi wrote:
         | why sell? it would be much more delightful to beat them at
         | their own game?
        
       | deniz_tekalp wrote:
       | GPUs are notoriously bad at exploiting sparsity. I wonder if
       | this architecture can do a better job. To the Groq engineers in
       | this thread: if a neural network had, say, 60% of its weights
       | set to 0, what would that do to cost & speed on your hardware?
        
       | supercharger9 wrote:
       | Do they make money from LLM service or by selling hardware?
       | Homepage is confusing without any reference to other products.
        
         | tome wrote:
         | Both, we sell tokens as a service and we sell enterprise
         | systems.
        
           | supercharger9 wrote:
           | Then reference that on the homepage? If not for this HN
           | thread, I wouldn't have known you sell hardware.
        
       | deepnotderp wrote:
       | This demo has more than 500 chips btw, it's not exactly an apples
       | to apples comparison with 1 GPU...
        
         | tome wrote:
         | Definitely not, but even with a comparison to 500 GPUs Groq
         | will still come out on top because you can never reduce latency
         | by adding more parallel compute :)
        
           | varunvummadi wrote:
           | Please let me know if I am wrong: are you guys running a
           | batch size of 1 on 500 GPUs? Then why are the responses
           | almost instant if you are using batch size 1? Also, when can
           | we expect a bring-your-own-fine-tuned-model kind of offering?
           | Thanks!
        
             | tome wrote:
             | We are not using 500 GPUs, we are using a large system
             | built from many of our own custom ASICs. This allows us to
             | do batch size 1 with no reduction in overall throughput.
             | (We are doing pipelining though, so many users are using
             | the same system at once).
        
       | eurekin wrote:
       | Jaw dropping. Both groq and mixtral.
       | 
       | I used the following prompt:
       | 
       | Generate gitlab ci yaml file for a hybrid front-end/backend
       | project. Frontend is under /frontend and is a node project,
       | packaged with yarn, built with vite to the /backend/public
       | folder. The backend is a python flask server
        
       | idiotlogical wrote:
       | I asked it "What is the oldest known written recipe" and then
       | asked for that recipe. It seemed to repeat the same response a
       | bunch of times (very quickly!) before stopping. Is this abnormal
       | behavior?
        
         | neonbrain wrote:
         | For me, the reply to the follow-up question "Please tell me
         | the exact steps of that recipe." was:
         | 
         | Here is a rough translation of the oldest known written recipe,
         | which is for a kind of beer from ancient Mesopotamia:
         | 
         | 1. Crush and mash 4 liters of barley to create malt.
         | 
         | 2. Soak the malt in water for 2.5 days.
         | 
         | 3. Leave the mash in a cool place for 2.5 days.
         | 
         | 4. Filter the mash through a strainer to separate the liquid
         | from the solids.
         | 
         | 5. Boil the liquid for 1 hour.
         | 
         | 6. Add 1 liter of honey and 1 handful of hops to the liquid.
         | 
         | 7. Leave the mixture to cool.
         | 
         | 8. Add 1 handful of "gig" (a type of wild yeast) to the
         | mixture.
         | 
         | 9. Leave the mixture in a cool place for 5 days to allow
         | fermentation.
         | 
         | 10. Filter the mixture again to separate the beer from the
         | solids.
         | 
         | 11. Serve the beer chilled.
         | 
         | It's important to note that this recipe is over 3,700 years
         | old, and it may not produce a beer that is palatable to modern
         | tastes. Additionally, some of the ingredients and techniques
         | used in this recipe may not be readily available or practical
         | for home brewers today.
        
       | yzh wrote:
       | Really impressive work! I wonder how easy it would be to support
       | (a future open source version of) Sora using Groq's design. Will
       | there be a Video Processing Unit (VPU)?
        
         | jkachmar wrote:
         | i can't comment about sora specifically, however the
         | architecture can support workloads beyond just LLM inference.
         | 
         | our demo booth at trade shows usually has StyleCLIP up at one
         | point or another to provide an abstract example of this.
         | 
         | disclosure: i work on infrastructure at Groq and am generally
         | interested in hardware architecture and compiler design,
         | however i am not a part of either of those teams :)
        
       | sylware wrote:
       | any noscript/basic (x)html prompt?
        
       | newsclues wrote:
       | I asked it what Carmack's AI company was called and it correctly
       | identified John Carmack but said he was working on VR.
        
       | neilv wrote:
       | If the page can't access certain fonts, it will fail to work,
       | while it keeps retrying requests:
       | https://fonts.gstatic.com/s/notosansarabic/[...]
       | https://fonts.gstatic.com/s/notosanshebrew/[...]
       | https://fonts.gstatic.com/s/notosanssc/[...]
       | 
       | (I noticed this because my browser blocks these de facto trackers
       | by default.)
        
         | sebastiennight wrote:
         | Same problem when trying to use font replacements with a
         | privacy plugin.
         | 
         | This is a very weird dependency to have :-)
        
           | tome wrote:
           | Thanks, I've reported this internally.
        
         | rasz wrote:
         | How to show Google how popular and interesting for acquisition
         | you are without directly installing google trackers on your
         | website.
        
       | totalhack wrote:
       | Where is the data center located? The fastest response time I
       | could get from some quick testing from the northeast US, having
       | it output just one letter, was 670ms. Just wondering if that's
       | an expected result, as it's on par with or slower than GPT-3.5
       | via API.
        
         | MaxLeiter wrote:
         | There's a queueing system if too many requests are being
         | processed at once. You may have hit that.
        
         | tome wrote:
         | West Coast US. You would have been placed in our queuing system
         | because with all the attention we are getting we are very busy
         | right now!
        
           | totalhack wrote:
           | Thanks! I did notice the queue count showing up occasionally
           | but not every time. Maybe someone who has access without the
           | queue could repeat the test, so we can get an understanding
           | of the potential latency once scaled and geo-distributed.
           | What I'm really trying to understand is whether the time to
           | first token is actually faster than GPT-3.5 via API, or just
           | the rate of token output once it begins.
        
             | tome wrote:
             | I don't know about GPT 3.5 specifically, but on this
             | independent benchmark (LLMPerf) Groq's time to first token
             | is also lowest:
             | 
             | https://github.com/ray-project/llmperf-
             | leaderboard?tab=readm...
        
       | anybodyz wrote:
       | I have this hooked up experimentally to my universal Dungeon
       | Master simulator DungeonGod and it seems to work quite well.
       | 
       | I had been using Together AI Mixtral (which is serving the Hermes
       | Mixtrals) and it is pretty snappy, but nothing close to Groq. I
       | think the next closest that I've tested is Perplexity Labs
       | Mixtral.
       | 
       | A key blocker in just hanging out a shingle for an open source AI
       | project is the fear that anything that might scale will bankrupt
       | you (or just be offline if you get any significant traction). I
       | think we're nearing the phase that we could potentially just turn
       | these things "on" and eat the reasonable inference fees to see
       | what people engage with - with a pretty decently cool free tier
       | available.
       | 
       | I'd add that the simulator does multiple calls to the api for one
       | response to do analysis and function selection in the underlying
       | python game engine, which Groq makes less of a problem as it's
       | close to instant. This adds a pretty significant pause in the
       | OpenAI version. Also since this simulator runs on Discord with
       | multiple users, I've had problems in the past with 'user response
       | storms' where the AI couldn't keep up. Also less of a problem
       | with Groq.
        
       | monkin wrote:
       | It's impressive, but I have one problem with all of those models.
       | I wanted them to answer what Mixtral or Llama2 are, but with no
       | luck. It would be great if models could at least describe
       | themselves.
        
         | johndough wrote:
         | There are two issues with that.
         | 
         | 1. To create a model, you have to train it on training data.
         | Mixtral and Llama2 did not exist before they were trained, so
         | their training data did not contain any information about
         | Mixtral or Llama2 (respectively). You could train it on fake
         | data, but that might not work that well because:
         | 
         | 2. The internet is full of text like "I am <something>", so it
         | would probably overshadow any injected training data like "I am
         | Llama2, a model by MetaAI."
         | 
         | You could of course inject the information as an invisible
         | system prompt (like OpenAI is doing with ChatGPT), but that is
         | a waste of computation resources.
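         |
         | For illustration, the "invisible system prompt" approach is
         | just a hidden first message; the wording here is made up, and
         | whether any given provider does exactly this is an assumption:
         |
         |     # Sketch: inject a model identity via a hidden system message.
         |     messages = [
         |         {"role": "system",
         |          "content": "You are Mixtral 8x7B, an open-weights model "
         |                     "by Mistral AI."},
         |         {"role": "user", "content": "What model are you?"},
         |     ]
         |     # The system message is prepended to every request, so it
         |     # costs extra prompt tokens each time -- the "waste of
         |     # computation resources" mentioned above.
         |     print(messages)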
        
       | roomey wrote:
       | Oh hell yes, this is the first "fast" one, superhuman fast.
       | 
       | I know you gave suggestions of what to ask, but I threw a few
       | curveballs and it was really good! Well done, this is a big step
       | forward.
        
       | Klaus23 wrote:
       | The demo is pretty cool, but the mobile interface could be a
       | parody of bad interface design. The text box at the top is hard
       | to reach when you want to open the keyboard (which closes
       | automatically) or press the button to send the question, and the
       | chat history is out of chronological order for no logical reason.
       | 
       | Edit: Text selection is also broken.
        
       | fatkam wrote:
       | For me, it was fast once it started printing (almost instant at
       | that point), but it took forever for it to start.
        
         | tome wrote:
         | There are a lot of people interested in Groq now, so most jobs
         | are sitting in a queue for a little while.
        
           | fatkam wrote:
           | fair enough... I guess at least it didn't crash like many
           | other overwhelmed sites do... but at the end of the day, that
           | was my experience.
        
       | SeanAnderson wrote:
       | Sorry, I'm a bit naive about all of this.
       | 
       | Why is this impressive? Can this result not be achieved by
       | throwing more compute at the problem to speed up responses? Isn't
       | the fact that there is a queue when under load just indicative
       | that there's a trade-off between "# of request to process per
       | unit of time" and "amount of compute to put into a response to
       | respond quicker"?
       | 
       | https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...
       | 
       | This chart from NVIDIA implies their H100 runs llama v2 70B at
       | >500 tok/s.
        
         | MasterScrat wrote:
         | Scaling up compute can improve throughput, but can't easily
         | improve latency between tokens. Generation is usually
         | bottlenecked by the time it takes to go through the network for
         | each token. To speed that up, you need to perform these
         | computations faster, which is a hard problem after you've
         | exhausted all the obvious options (use the fastest accelerator
         | you can find, cache what you can etc).
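         |
         | A quick illustration of why per-user speed is a latency
         | problem rather than a throughput problem:
         |
         |     # Token N+1 can't start before token N, so per-user tok/s is
         |     # bounded by how fast one forward pass completes.
         |     tokens_per_sec_per_user = 500
         |     inter_token_ms = 1000 / tokens_per_sec_per_user
         |     print(f"{inter_token_ms:.0f} ms per forward pass")   # 2 ms
         |     # More parallel hardware serves more users at once, but it
         |     # doesn't shrink those 2 ms unless each pass gets faster.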
        
           | SeanAnderson wrote:
           | Yeah. That makes sense, thank you for clarifying. I updated
           | my original post with a chart from NVIDIA which highlights
           | the H100's capabilities. It doesn't seem unreasonable to
           | expect a 7B model to run at 500 tok/s on that hardware.
        
             | snowfield wrote:
             | This is a 50B model. (Mixtral 8x7b)
        
               | SeanAnderson wrote:
               | Oh, sorry, I assumed the 8 was for quantization. 8x7b is
               | a new syntax for me.
               | 
               | Still, the NVIDIA chart shows Llama v2 70B at 750 tok/s,
               | no?
        
               | tome wrote:
               | I guess that's total throughput, rather than per user?
               | You can increase total throughput by scaling
               | horizontally. You can't increase throughput per user that
               | way.
        
         | tome wrote:
         | LLM inference is inherently a sequential problem. You can't
         | speed it up by doing more in parallel. You can't generate the
         | 101st token before you've generated the 100th.
        
           | NorwegianDude wrote:
           | Technically, I guess you can use speculative execution to
           | speed it up, and in that way take a guess at what the 100th
           | token will be and start on the 101st token at the same time?
           | Though it probably has its own unforeseen challenges.
           | 
           | Everything is predictable with enough guesses.
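           |
           | Roughly, the draft-and-verify loop looks like this (a sketch
           | with stand-in draft/target functions, not any particular
           | implementation):
           |
           |     # Greedy speculative decoding sketch: a small draft model
           |     # guesses k tokens, the big model checks them, and the
           |     # longest agreeing prefix is kept. draft_next/target_next
           |     # are hypothetical stand-ins for real models.
           |     def speculative_step(prefix, draft_next, target_next, k=4):
           |         guesses, ctx = [], list(prefix)
           |         for _ in range(k):            # cheap draft phase
           |             t = draft_next(ctx)
           |             guesses.append(t)
           |             ctx.append(t)
           |         accepted, ctx = [], list(prefix)
           |         for g in guesses:             # verify phase (in practice
           |             t = target_next(ctx)      # one batched forward pass)
           |             if t != g:
           |                 accepted.append(t)    # keep target's token, stop
           |                 break
           |             accepted.append(g)
           |             ctx.append(g)
           |         return accepted               # 1..k tokens per step
           |
           |     # Toy usage with trivial stand-ins:
           |     print(speculative_step([1, 2, 3], lambda c: c[-1] + 1,
           |                            lambda c: c[-1] + 1))  # [4, 5, 6, 7]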
        
         | SushiHippie wrote:
         | I guess it depends on how much the infrastructure from TFA
         | costs, as the H100 only costs ~$3300 to produce, but gets sold
         | for ~$30k on average.
         | 
         | https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp...
        
         | nabakin wrote:
         | There's a difference between token throughput and latency.
         | Token throughput is the token throughput of the whole
         | GPU/system and latency is the token throughput for an
         | individual user. Groq offers extremely low latency (aka
         | extremely high token throughput per user) but we still don't
         | have numbers on the token throughput of their entire system.
         | Nvidia's metrics here on the other hand, show us the token
         | throughput of the whole GPU/system. So, in reality, while you
         | might be able to get 1.5k t/s on an H100, the latency (token
         | throughput per user) will be something much lower like 20 t/s.
         | 
         | The really important metric to look for is cost per token
         | because even though Groq is able to run at low latency, that
         | doesn't mean it's able to do it cheaply. Determining the cost
         | per token can be done many ways but a useful way for us is
         | approximately the cost of the system divided by the total token
         | throughput of the system per second. We don't have the total
         | token throughput per second of Groq's system so we can't really
         | say how efficient it is. It could very well be that Groq is
         | subsidizing the cost of their system to lower prices and gain
         | PR and will increase their prices later on.
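         |
         | As a toy version of that cost-per-token framing (every number
         | below is made up purely to show the arithmetic):
         |
         |     # Amortized system cost divided by total tokens served.
         |     system_cost_usd = 5_000_000      # hypothetical hardware cost
         |     lifetime_years = 3
         |     system_tokens_per_sec = 20_000   # hypothetical total throughput
         |     seconds = lifetime_years * 365 * 24 * 3600
         |     usd_per_m = system_cost_usd / (system_tokens_per_sec * seconds) * 1e6
         |     print(f"${usd_per_m:.2f} per 1M tokens (hardware only)")  # ~$2.64
         |     # The advertised $0.27/1M only pencils out if the system's
         |     # total throughput (not just per-user speed) is high enough.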
        
           | frozenport wrote:
           | https://wow.groq.com/artificialanalysis-ai-llm-benchmark-
           | dou...
           | 
           | Seems to have it. Looks cost competitive but a lot faster.
        
       | Gcam wrote:
       | Groq's API reaches close to this level of performance as well.
       | We've benchmarked performance over time and >400 tokens/s has
       | been sustained; see here:
       | https://artificialanalysis.ai/models/mixtral-8x7b-instruct
       | (bottom of page for over time view)
        
       | mise_en_place wrote:
       | Do you have any plans to support bringing your own model? I have
       | been using Sagemaker but it is very slow to deploy to.
        
         | tome wrote:
         | Yes, we're working with some customers on that, but it will be
         | a while until general availability.
        
       | youssefabdelm wrote:
       | Do you guys provide logprobs via the api?
        
         | tome wrote:
         | You can check out all our API features here:
         | https://docs.api.groq.com/
        
       | ggnore7452 wrote:
       | The Groq demo was indeed impressive. I work with LLMs a lot at
       | work, and a generation speed of 500+ tokens/s would definitely
       | change how we use these products. (Especially considering it's an
       | early-stage product)
       | 
       | But the "completely novel silicon architecture" and the "self-
       | developed LPU" (claiming not to use GPUs)... make me a bit
       | skeptical. After all, pure speed might be achievable through
       | stacking computational power and model quantization. Shouldn't
       | innovation at the GPU level be quite challenging, especially to
       | achieve such groundbreaking speeds?
        
         | ggnore7452 wrote:
         | more on the LPU and data center: https://wow.groq.com/lpu-
         | inference-engine/
         | 
         | price and speed benchmark: https://wow.groq.com/
        
       | jereees wrote:
       | I'll pay $xx a month if I can talk to Groq the way I can talk to
       | ChatGPT with my AirPods
        
       | supercharger9 wrote:
       | Ignoring latency but not throughput, how does this compare in
       | terms of cost (card acquisition cost and power needed) with
       | Nvidia GPUs for inference?
        
       | mlconnor wrote:
       | omg. i can't believe how incredibly fast that is. and capable
       | too. wow
        
       | ttul wrote:
       | Have you experimented with running diffusion models on Groq
       | hardware?
        
         | tome wrote:
         | Yes, we don't have any publicly accessible ones at the moment
         | though.
        
       | lukevp wrote:
       | I'm not sure how, but I got the zoom messed up on iOS and I can
       | no longer see the submit button. Refreshing doesn't fix it.
        
       | QuesnayJr wrote:
       | I tried it out, and I was taken aback by how quickly it answered.
        
       | tagyro wrote:
       | I (only) ran a couple of prompts but I am impressed. It has the
       | speed of GPT-3.5 and the quality of GPT-4.
       | 
       | Seriously considering switching from [open]AI to Mix/s/tral in my
       | apps.
        
         | eightysixfour wrote:
         | Mixtral 8x7 is good, but it is not GPT-4 good in any of the use
         | cases I have tried. Mistral's other models get close and beat
         | it in some cases, but not Mixtral.
        
       | ohwellish wrote:
       | I wish there was an option to export the whole chat session, say
       | as plaintext via a link to some pastebin; the chat I just had
       | with Groq would really impress some people I know.
        
       | Keyframe wrote:
       | This is insane. Congratulations!
        
       | codedokode wrote:
       | Is it normal that I have asked two networks (llama/mixtral) the
       | same question ("tell me about most popular audio pitch detection
       | algorithms") and they gave almost the same answer? Both answers
       | start with "Sure, here are some of the most popular pitch
       | detection algorithms used in audio signal processing" and end
       | with "Each of these algorithms has its own strengths and
       | weaknesses, and the choice of algorithm depends on the specific
       | application and the characteristics of the input signal.". And
       | the content is 95% the same. How can it be?
        
         | tome wrote:
         | Yeah it's a bit confusing. See here for details:
         | https://news.ycombinator.com/item?id=39431921
        
       | jprd wrote:
       | This is _super_ impressive. The rate of iteration and innovation
       | in this space means that just as I'm feeling
       | jaded/bored/oversaturated, some new project makes my jaw drop
       | again.
        
       | joaquincabezas wrote:
       | are there also experiments around image embedding generation to
       | use in combination with the LLM?
       | 
       | maybe for this use case it is better to execute the vision tower
       | on a GPU and leave the LPU for the language part?
        
       | LoganDark wrote:
       | Please when can I buy some of these for home use? Please.
        
       ___________________________________________________________________
       (page generated 2024-02-19 23:00 UTC)