[HN Gopher] Groq runs Mixtral 8x7B-32k with 500 T/s
___________________________________________________________________
Groq runs Mixtral 8x7B-32k with 500 T/s
Author : tin7in
Score : 470 points
Date : 2024-02-19 12:06 UTC (10 hours ago)
(HTM) web link (groq.com)
(TXT) w3m dump (groq.com)
| itsmechase wrote:
| Incredible tool. The Mixtral 8x7B model running on their hardware
| did 491.40 T/s for me...
| dariobarila wrote:
| Wow! So fast!
| ppsreejith wrote:
| Relevant thread from 5 months ago:
| https://news.ycombinator.com/item?id=37469434
|
| I'm achieving consistent 450+ tokens/sec for Mixtral 8x7b 32k and
| ~200 tps for Llama 2 70B-4k.
|
| As an aside, seeing that this is built with flutter Web, perhaps
| a mobile app is coming soon?
| tome wrote:
| There was also another discussion about Groq a couple of months
| ago https://news.ycombinator.com/item?id=38739199
| tome wrote:
| Hi folks, I work for Groq. Feel free to ask me any questions.
|
| (If you check my HN post history you'll see I post a lot about
| Haskell. That's right, part of Groq's compilation pipeline is
| written in Haskell!)
| ppsreejith wrote:
| Thank you for doing this AMA
|
| 1. How many GroqCards are you using to run the Demo?
|
| 2. Is there a newer version you're using which has more SRAM
| (since the one I see online only has 230MB)? Since this seems
| to be the number that will drive down your cost (to take
| advantage of batch processing, CMIIW!)
|
| 3. Can TTS pipelines be integrated with your stack? If so, we
| can truly have very low latency calls!
|
| *Assuming you're using this:
| https://www.bittware.com/products/groq/
| tome wrote:
| 1. I think our GroqChat demo is using 568 GroqChips. I'm not
| sure exactly, but it's about that number.
|
| 2. We're working on our second generation chip. I don't know
| how much SRAM it has exactly but we don't need to increase
| the SRAM to get efficient scaling. Our system is
| deterministic, which means no need for waiting or queuing
| anywhere, and we can have very low latency interconnect
| between cards.
|
| 3. Yeah absolutely, see this video of a live demo on CNN!
|
| https://www.youtube.com/watch?t=235&v=pRUddK6sxDg
| ppsreejith wrote:
| Thank you, that demo was insane!
|
| Follow up (noob) question: Are you using a KV cache? That
| would significantly increase your memory requirements. Or
| are you forwarding the whole prompt for each auto-
| regressive pass?
| tome wrote:
| You're welcome! Yes, we have KV cache. Being able to
| implement this efficiently in terms of hardware
| requirements and compute time is one of the benefits of
| our deterministic chip architecture (and deterministic
| system architecture).
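|
| (For intuition only, here's a toy Python sketch of the idea,
| nothing to do with our actual kernels: the cache grows by one
| key/value row per generated token, so each decode step is a
| small matrix-vector problem instead of reprocessing the whole
| prompt.)
|
|     import numpy as np
|
|     def attend(q, K, V):
|         # single-head scaled dot-product attention over the cache
|         scores = K @ q / np.sqrt(q.shape[-1])
|         w = np.exp(scores - scores.max())
|         w /= w.sum()
|         return w @ V
|
|     d = 64
|     K_cache = np.empty((0, d))
|     V_cache = np.empty((0, d))
|     for step in range(5):  # one iteration per generated token
|         # stand-ins for the new token's projected q, k, v
|         q, k, v = (np.random.randn(d) for _ in range(3))
|         K_cache = np.vstack([K_cache, k])  # append, don't recompute
|         V_cache = np.vstack([V_cache, v])
|         out = attend(q, K_cache, V_cache)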
| ppsreejith wrote:
| Thanks again! Hope I'm not overwhelming but one more
| question: Are you decoding with batch size = 1 or is it
| more?
| tome wrote:
| That's OK, feel free to keep asking!
|
| I think currently 1. Unlike with graphics processors,
| which really need data parallelism to get good
| throughput, our LPU architecture allows us to deliver
| good throughput even at batch size 1.
| WiSaGaN wrote:
| How much do 568 chips cost? What's the cost ratio compared
| to a setup with roughly the same throughput using A100s?
| benchess wrote:
| They're for sale on Mouser for $20,625 each
| https://www.mouser.com/ProductDetail/BittWare/RS-GQ-
| GC1-0109...
|
| At that price 568 chips would be $11.7M
| tome wrote:
| Yeah, I don't know what the cost to us is to build out
| our own hardware but it's significantly less expensive
| than retail.
| WiSaGaN wrote:
| That seems to be per card instead of per chip. I would expect
| it has multiple chips on a single card.
| renewiltord wrote:
| From the description that doesn't seem to be the case,
| but I don't know this product well
|
| > _Accelerator Cards GroqCard low latency AI /ML
| Inference PCIe accelerator card with single GroqChip_
| WiSaGaN wrote:
| Missed that! Thanks for pointing out!
| gautamcgoel wrote:
| Can you talk about the interconnect? Is it fully custom as
| well? How do you achieve low latency?
| tome wrote:
| You can find out about the chip to chip interconnect from
| our paper below, section 2.3. I don't think that's
| custom.
|
| We achieve low latency by basically being a software-
| defined architecture. Our functional units operate
| completely orthogonally to each other. We don't have to
| batch in order to achieve parallelism and the system
| behaviour is completely deterministic, so we can schedule
| all operations precisely.
|
| https://wow.groq.com/wp-
| content/uploads/2023/05/GroqISCAPape...
| Oras wrote:
| Impressive speed. Are there any plans to run fine-tuned models?
| tome wrote:
| Yes, we're working on a feature to give our partners the
| ability to deploy their own fine-tuned models.
| dkhudia wrote:
| @tome for the deterministic system, what if the timing for one
| chip/part is off due to manufacturing/environmental factors
| (e.g., temperature)? How does the system handle this?
| mechagodzilla wrote:
| Those sorts of issues are part of timing analysis for a chip,
| but once a chip's clock rate is set, they don't really factor
| in unless there is some kind of dynamic voltage/frequency
| scaling scheme going on. This chip probably does not do any
| of that and just uses a fixed frequency, so timing is
| perfectly predictable.
| tome wrote:
| We know the maximum possible clock drift and so we know when
| we need to do a resynchronisation to keep all the chips in
| sync. You can read about it in section 3.3 of our recent
| whitepaper: https://wow.groq.com/wp-
| content/uploads/2023/05/GroqISCAPape...
| mechagodzilla wrote:
| You all seem like one of the only companies targeting low-
| latency inference rather than focusing on throughput (and thus
| $/inference) - what do you see as your primary market?
| tome wrote:
| Yes, because we're one of the only companies whose hardware
| can actually support low latency! Everyone else is stuck with
| traditional designs and they try to make up for their high
| latency by batching to get higher throughput. But not all
| applications work with high throughput/high latency ... Low
| latency unlocks feeding the result of one model into the
| input of another model. Check out this conversational AI demo
| on CNN. You can't do that kind of thing unless you have low
| latency.
|
| https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s
| vimarsh6739 wrote:
| Might be a bit out of context, but isn't the TPU also
| optimized for low latency inference? (Judging by reading
| the original TPU architecture paper here -
| https://arxiv.org/abs/1704.04760). If so, does Groq
| actually provide hardware support for LLM inference?
| tome wrote:
| Jonathan Ross on that paper is Groq's founder and CEO.
| Groq's LPU is a natural continuation of the breakthrough
| ideas he had when designing Google's TPU.
|
| Could you clarify your question about hardware support?
| Currently we build out our hardware to support our cloud
| offering, and we sell systems to enterprise customers.
| vimarsh6739 wrote:
| Thanks for the quick reply! About hardware support, I was
| wondering if the LPU has a hardware instruction to
| compute the attention matrix similar to the
| MatrixMultiply/Convolve instruction in the TPU ISA.
| (Maybe a hardware instruction which fuses a softmax on
| the matmul epilogue?)
| tome wrote:
| We don't have a hardware instruction but we do have some
| patented technology around using a matrix engine to
| efficiently calculate other linear algebra operations
| such as convolution.
| mirekrusin wrote:
| Are you considering targeting consumer market? There are
| a lot of people throwing $2k-$4k into local setups and
| they primarily care about inference.
| tome wrote:
| At the moment we're concentrating on building out our API
| and serving the enterprise market.
| phh wrote:
| Are you running FP32 models, FP16, or quantized?
| tome wrote:
| FP16 for calculating all activations. Some data is stored as
| FP8 at rest.
| andy_xor_andrew wrote:
| Are your accelerator chips designed in-house? Or are they some
| specialized silicon or FPGA or something that you wrote very
| optimized inference code for?
|
| it's really amazing! the first time I tried the demo, I had to
| try a few prompts to believe it wasn't just an animation :)
| tome wrote:
| Yup, custom ASIC, designed in-house, built into a system of
| several racks, hundreds of chips, with fast interconnect.
| Really glad you enjoyed it!
| itishappy wrote:
| Alright, I'll bite. Haskell seems pretty unique in the ML
| space! Any unique benefits to this decision, and would you
| recommend it for others? What areas of your project do/don't
| use Haskell?
| tome wrote:
| Haskell is a great language for writing compilers! The end of
| our compilation pipeline is written in Haskell. Other stages
| are written in C++ (MLIR) and Python. I'd recommend anyone to
| look at Haskell if they have a compiler-shaped problem, for
| sure.
|
| We also use Haskell on our infra team. Most of our CI infra
| is written in Haskell and Nix. Some of the chip itself was
| designed in Haskell (or maybe Bluespec, a Haskell-like
| language for chip design, I'm not sure).
| ianpurton wrote:
| Is it possible to buy Groq chips and how much do they cost?
| ComputerGuru wrote:
| https://www.mouser.com/ProductDetail/BittWare/RS-GQ-
| GC1-0109...
| UncleOxidant wrote:
| Only $20,625.00!
| michaelbuckbee wrote:
| Friendly fyi - I think this might just be a web interface bug
| but I submitted a prompt with the Mixtral model and got a
| response (great!) then switched the dropdown to Llama and
| submitted the same prompt and got the exact same response.
|
| It may be caching or it didn't change the model being queried
| or something else.
| tome wrote:
| Thanks, I think it's because the chat context is fed back to
| the model for the next generation even when you switch
| models. If you refresh the page that should erase the history
| and you should get results purely from the model you choose.
| michaelbuckbee wrote:
| Appreciate the quick reply! That's interesting.
| tome wrote:
| You're welcome. Thanks for reporting. It's pretty
| confusing so maybe we should change it :)
| pests wrote:
| I've always liked how openrouter.ai does it
|
| They allow you to configure chat participants (a model +
| params like context or temp) and then each AI answers
| each question independently in-line so you can compare
| and remix outputs.
| UncleOxidant wrote:
| When will we be able to buy Groq accelerator cards that would
| be affordable for hobbyists?
| tome wrote:
| We are prioritising building out whole systems at the
| moment, so I don't think we'll have a consumer-level
| offering in the near future.
| frognumber wrote:
| I will mention: A lot of innovation in this space comes
| bottom-up. The sooner you can get something in the hands of
| individuals and smaller institutions, the better your
| market position will be.
|
| I'm coding to NVidia right now. That builds them a moat.
| The instant I can get other hardware working, the less of a
| moat they will have. The more open it is, the more likely I
| am to adopt it.
| tome wrote:
| Definitely, that's why we've opened our API to everyone.
| frognumber wrote:
| I don't think that quite does it. What I'd want -- if you
| want me to support you -- is access to the chip,
| libraries, and API documentation.
|
| Best-case would be something I buy for <$2k (if out-of-
| pocket) or under $5k (if employer). Next best case would
| be a cloud service with a limited free tier. It's okay if
| it has barely enough quota that I can develop to it, but
| the quota should never expire.
|
| (The mistake a lot of services make is to limit the free tier
| to e.g. 30 days or 1 year, rather than hours/month; if I
| didn't get around to evaluating, switched employers, switched
| projects, etc., the free tier is gone.)
|
| I did sign up for your API service. I won't be able to
| use it in prod before your (very nice) privacy guarantees
| are turned into lawyer-compliant regulatory language. But
| it's an almost ideal fit for my application.
| tome wrote:
| Yup, understood. Access to consumer hardware like this is
| not something that we provide at the moment, I'm afraid.
| frognumber wrote:
| Don't blame you. Been at plenty of startups, resources
| are finite, and focus is important.
|
| My only point was to, well, perhaps bump this up from
| #100 on your personal priority list to #87, to the
| limited extent that influences your business.
| kkzz99 wrote:
| How does the Groq PCIe card work exactly? Does it use system
| RAM to stream the model data to the card? How many T/s could
| one expect with e.g. 3600 MHz DDR4 RAM?
| tome wrote:
| We build out large systems where we stream in the model
| weights to the system once and then run multiple inferences
| on it. We don't really recommend streaming model weights
| repeatedly onto the chip because you'll lose the benefits of
| low latency.
| kkzz99 wrote:
| How does that work when the card only has 230MB of SRAM?
| tome wrote:
| We connect hundreds of chips across several racks with
| fast interconnect.
| AhtiK wrote:
| How fast is the memory bandwidth of that fast
| interconnect?
| tome wrote:
| Have a look at section 2.3 of our paper. Between any two
| chips we get 100 Gbps. The overall bandwidth depends on
| the connection topology used. I don't know if we make
| that public.
|
| https://wow.groq.com/wp-
| content/uploads/2023/05/GroqISCAPape...
| tudorw wrote:
| As it works at inference time, do you think 'Representation
| Engineering' could be applied to give a sort of fine-tuning
| ability? https://news.ycombinator.com/item?id=39414532
| karthityrion wrote:
| Hi. Are these ASICs only for LLMs or could they accelerate
| other kinds of models (vision) as well?
| tome wrote:
| It's a general purpose compute engine for numerical computing
| and linear algebra, so it can accelerate any ML workloads.
| Previously we've accelerated models for stabilising fusion
| reactions and for COVID drug discovery:
|
| * https://alcf.anl.gov/news/researchers-accelerate-fusion-
| rese...
|
| * https://wow.groq.com/groq-accelerates-covid-drug-
| discovery-3...
| karthityrion wrote:
| So, is this specific chip only for LLMs, as the name LPU
| (Language Processing Unit) suggests?
| tome wrote:
| The chip is capable of running general numerical compute,
| but because we're focusing almost entirely on LLMs at the
| moment we've branded it the LPU.
| jart wrote:
| If I understand correctly, you're using specialized hardware to
| improve token generation speed, which is very latency bound on
| the speed of computation. However, generating tokens usually
| only requires matrix-vector multiplies. If I enter a prompt
| with ~100 tokens then your service goes much slower, probably
| because you have to do full matrix-matrix multiplies.
| What are you doing to improve the computation speed of prompt
| processing?
| tome wrote:
| I don't think it should be quadratic in input length. Why do
| you think it is?
| jart wrote:
| all I know is that when I run llama.cpp a lot of the
| matrices that get multiplied have their shapes defined by
| how many tokens are in my prompt.
| https://justine.lol/tmp/shapes.png Notice how the B matrix
| is always skinny for generating tokens. But for batch
| processing of the initial prompt, it's fat. It's not very
| hard to multiply a skinny matrix but once it's fat it gets
| harder. Handling the initial batch processing of the prompt
| appears to be what your service goes slow at.
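|
| Back-of-the-envelope, for a single d x d weight matrix
| (ignoring attention itself, and with a made-up d):
|
|     # decode: one new token is a (1 x d) @ (d x d) matvec
|     # prefill: the whole prompt is a (p x d) @ (d x d) matmul
|     d = 4096
|     p = 100                        # prompt length
|     decode_flops = 2 * d * d
|     prefill_flops = 2 * p * d * d
|     print(prefill_flops / decode_flops)  # ~100x more work per pass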
| johndough wrote:
| You can ask your website: "What is the computational
| complexity of self-attention with respect to input sequence
| length?"
|
| It'll answer something along the lines of self-attention
| being O(n^2) (where n is the sequence length) because you
| have to compute an attention matrix of size n^2.
|
| There are other attention mechanisms with better
| computational complexity, but they usually result in worse
| large language models. To answer jart: We'll have to wait
| until someone finds a good linear attention mechanism and
| then wait some more until someone trains a huge model with
| it (not Groq, they only do inference).
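|
| You can see the quadratic term directly in a toy NumPy version
| of (single-head, unprojected) self-attention; this is just a
| sketch, not how any production kernel is written:
|
|     import numpy as np
|
|     def self_attention(X):          # X: (n, d) token embeddings
|         n, d = X.shape
|         A = X @ X.T / np.sqrt(d)    # (n, n) matrix, the O(n^2) part
|         A = np.exp(A - A.max(axis=-1, keepdims=True))
|         A /= A.sum(axis=-1, keepdims=True)
|         return A @ X                # (n, d)
|
|     # doubling the sequence length quadruples the size of A
|     print(self_attention(np.random.randn(128, 64)).shape)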
| BryanLegend wrote:
| How well would your hardware work for image/video generation?
| tome wrote:
| It should work great as far as I know. We've implemented some
| diffusion models for image generation but we don't offer them
| at the moment. I'm not aware of us having implemented any
| video models.
| pama wrote:
| FYI, I only see a repeating animation and nothing else on my
| iPhone in Lockdown Mode, with Safari or Firefox.
| karthityrion wrote:
| What is the underlying architecture of the ASICs? Does it use
| systolic arrays?
| tome wrote:
| Yes, our matrix engine is quite similar to a systolic array.
| You can find more details about our architecture in our
| paper:
|
| https://wow.groq.com/wp-
| content/uploads/2023/05/GroqISCAPape...
| liberix wrote:
| How do I sign up for API access? What payment methods do you
| support?
| amirhirsch wrote:
| It seems like you are making general purpose chips to run many
| models. Are we at a stage where we can consider taping out
| inference networks directly, propagating the weights as
| constants in the RTL design?
|
| Are chips and models obsoleted on roughly the same timelines?
| tome wrote:
| I think the models change far too quickly for that to be
| viable. A chip has to last several years. Currently we're
| seeing groundbreaking models released every few months.
| sebzim4500 wrote:
| So this has nothing to do with `Grok`, the model provided by
| x.ai?
|
| EDIT: Tried using it, very impressed with the speed.
| tome wrote:
| Yeah, it's nothing to do with Elon and we (Groq) had the name
| first. It's a natural choice of name for something in the field
| of AI because of the connections to the hacker ethos, but we
| have the trademark and Elon doesn't.
|
| https://wow.groq.com/hey-elon-its-time-to-cease-de-grok/
| terhechte wrote:
| Can't Chamath (he's one of your investors, right), do a thing
| there? Every person I pitch Groq to is confused and thinks
| it's about Elon's unspectacular LLM.
| tome wrote:
| Yeah the confusion has happened a lot to me too. All I know
| is that it's in the hands of our legal team.
| fragmede wrote:
| I mean it sucks that Elon went and claimed Grok when you want
| Groq, plus you were there first, but getting stuck on the
| name seems like it's going to be a distraction, so why not
| choose something different?
|
| When Grok eventually makes the news for some negative thing,
| do you really want that erroneously associated with your
| product? Do you really want to pick a fight with the
| billionaire that owns Twitter, is that a core competency of
| the company?
| mtlmtlmtlmtl wrote:
| There's also a children's toy named Grok which uses LLMs to
| talk to the kid.
| cchance wrote:
| Jesus, that makes ChatGPT and even Gemini seem slow AF
| gremlinsinc wrote:
| better quality than I was expecting. For fun I set the system
| prompt to:
|
| You are a leader of a team of ai helpers. when given a question
| you can call on an expert, as a wizard calls on magic. You will
| say, I call forth {expert} master of {subject matter} an expert
| in {x, y, z}. Then you will switch to that persona.
|
| I was not let down..
| tome wrote:
| Nice prompting strategy :)
| CuriouslyC wrote:
| This is pretty sweet. The speed is nice but what I really care
| about is you bringing the per token cost down compared with
| models on the level of mistral medium/gpt4. GPT3.5 is pretty
| close in terms of cost/token but the quality isn't there and GPT4
| is overpriced. Having GPT4 quality at sub-gpt3.5 prices will
| enable a lot of things though.
| ukuina wrote:
| I wonder if Gemini Pro 1.5 will act as a forcing function to
| lower GPT4 pricing.
| ComputerGuru wrote:
| Is that available via an API now?
| sp332 wrote:
| Kind of, it's in a "Private Preview" with a waitlist.
| sturza wrote:
| And in non EU countries.
| ComputerGuru wrote:
| Via GCP only?
| MuffinFlavored wrote:
| What's the difference in your own words/opinion in quality
| between GPT-3.5 and GPT-4? For what use cases?
| CuriouslyC wrote:
| GPT-3.5 is great at spitting out marketing babble, summarizing
| documents, and performing superficial analysis, but it doesn't
| take style prompts as well as GPT-4, and its reasoning is
| significantly worse when you want it to follow a complex
| process chain-of-thought style while referencing context
| guidance.
| emporas wrote:
| Mixtral's quality is definitely up there with GPT-3.5.
| Specifically for coding, I consider them almost equivalent in
| quality. In fact Mixtral 8x7B is starting to be my go-to coding
| assistant instead of GPT. It is fast, it is accurate, and I
| think I like its responses better than GPT's.
|
| Reducing LLM size almost 10 times in the span of a little more
| than a year, that's great stuff. The next step I think is a
| 3-billion-parameter MoE with 20 experts.
| deepsquirrelnet wrote:
| Incredible job. Feels dumb or obvious to say this, but this
| really changes the way I think of using it. The slow
| autoregression really sucks because it inhibits your ability to
| skim sections. For me, that creates an unnatural reading
| environment. This makes ChatGPT feel antiquated.
| tome wrote:
| Yes, agreed. We believe the benefits of reducing latency are
| non-linear. You can hit different phase changes as the latency
| reduces and new applications become viable. Roundtripping text-
| to-speech and speech-to-text is one example. We're looking
| forward to seeing what low latency applications are unlocked by
| our new users!
| Cheer2171 wrote:
| What's the underlying hardware for this?
| tome wrote:
| It's a system built from hundreds of GroqChips (a custom ASIC
| we designed). We call it the LPU (language processing unit).
| Unlike graphics processors, which are still best in class for
| training, LPUs are best in class for low latency and high
| throughput inference. Our LLMs are running on several racks
| with fast interconnect between the chips.
| michaelt wrote:
| They have a paper [1] about their 'tensor streaming
| multiprocessor'
|
| [1] https://wow.groq.com/wp-
| content/uploads/2024/02/GroqISCAPape...
| xnorswap wrote:
| Interesting, I ran my "prime test": it passed at generating a
| prime, but failed at making it large.
|
| > "Can you generate me a large prime please"
|
| > Sure, here's a large prime number: 193707721
|
| > This number is a prime number because it is only divisible by 1
| and itself. You can check its primality by trying to divide it by
| all the numbers up to its square root, and if none of them divide
| it evenly, then it is a prime number. This number is quite large,
| but there are much larger prime numbers known. For example, the
| largest known prime number, as of January 2022, is
| 2^82,589,933-1, a number with 24,862,048 digits, discovered as
| part of the Great Internet Mersenne Prime Search (GIMPS).
|
| I then prompted "That's not large", which seemed to break it; the
| response is stuck in a loop:
|
| > That isn't large
|
| > I apologize if the prime number I provided did not meet your
| expectations for "large." Prime numbers can be very large indeed,
| and there is no upper limit to how large a prime number can be.
|
| If you are looking for a very large prime number, here is one
| with a billion digits:
|
| 10000030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 03030303030303030303030303030303030303030303030303030303030303030
| 30303030303030303030303030303030303030303030303030303030303030303
| 030303030303030303030303030303...
|
| It's printed many more pages of 303030... but hacker news has a
| post length limit!
|
| Further testing, such as prompting it to generate a 1024-bit
| prime, also runs into odd loops.
|
| > Can you generate a 1024 bit prime?
|
| > Yes, I can generate a 1024-bit prime number for you. Here's one
| I just generated:
|
| > 115322512531875133213135311402305250340282302283103413642283221
| 25311565322125311221022112211211211211211211211211211211211211211
| 2112112112112112112112112112112112112112112112112112112112112112
| ( Loops 112 forever )
| 7734128 wrote:
| How do you know that it didn't somehow find the largest prime?
| Perhaps you just threw away a Nobel Prize.
| gautamcgoel wrote:
| Nobel Prize in what? There is no Nobel in mathematics or
| computer science.
| fatkam wrote:
| nobel prizes are a joke anyways.... Obama got the one for
| peace?
| Workaccount2 wrote:
| I can't think of any value in asking LLMs to generate prime
| numbers. They are neural nets, not Turing machines.
| xnorswap wrote:
| The value is showing how confidently it presents incorrect
| results.
|
| Especially the lack of nuance or uncertainty in the language
| they use.
|
| They present the incorrect information with total confidence,
| and prime generation is interesting because the output isn't
| obviously wrong to the user, yet it can be shown to be wrong
| for small numbers and is difficult to verify for large ones.
|
| It's my favourite test because it's a good demonstration of
| the lack of nuance or uncertainty in LLMs. They have no sense
| of how wrong the information they're giving out might be.
|
| If they could give confidence intervals for any information,
| they could indicate how likely they are to be correct, but
| instead they double down on their incorrectness.
| tveita wrote:
| Mixtral is a pre-existing model that was released by
| Mistral AI in December: https://mistral.ai/news/mixtral-of-
| experts/
|
| It was discussed on HN at the time:
| https://news.ycombinator.com/item?id=38598559
|
| You won't learn anything about the chips this submission is
| about by asking it questions - except maybe to look for
| degradation from quantization or computational shortcuts,
| but you'd have to compare the answers to Mixtral running
| with 16 bit floats. The point is the speed, just like a
| faster CPU isn't going to improve the UI of Outlook.
| xnorswap wrote:
| It's worth noting that it may have got lucky with the first
| response. A clean session with the prompt:
|
| > Could you generate me a 64 bit prime?
|
| Returns:
|
| > Sure, here's a 64-bit prime number: 15592425903773483751
|
| > You can verify that it's prime by checking that it has no
| divisors between 2 and its square root. In this case, the
| square root of the number is approximately 12489593.5, so you
| would need to check divisibility for all integers between 2 and
| 12489593.5. Since the number is odd, you can skip checking even
| numbers, which will reduce the number of checks you need to
| perform.
|
| > However, if you're looking for a larger 64-bit prime or a
| prime number with specific properties, you may need to use a
| more sophisticated algorithm or a specialized prime number
| generator.
|
| Like GPT3.5 and Bard, it confidently suggests that
| 15592425903773483751 is prime, and even explains why.
|
| It's wrong:
|
| > $ factor 15592425903773483751
|
| > 15592425903773483751: 3 3 3 7 167 853 911 635723119
|
| Further prompting:
|
| > Can you factor 15592425903773483751?
|
| > No, I cannot factor 15592425903773483751 because it is a
| prime number, and prime numbers cannot be factored into smaller
| non-trivial factors. By definition, a prime number is a
| positive integer greater than 1 that has no positive integer
| divisors other than 1 and itself. In this case,
| 15592425903773483751 is a 64-bit prime number, which means it
| is a prime number that fits into a 64-bit data type.
|
| > To verify that 15592425903773483751 is a prime number, you
| can use a primality test, such as the Fermat primality test or
| the Miller-Rabin primality test. These tests can quickly
| determine whether a number is prime or composite (not prime)
| with high probability. In the case of 15592425903773483751,
| both tests confirm that it is a prime number.
|
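| Ironically, the Miller-Rabin test it recommends rejects the
| number immediately. A quick Python check (plain Miller-Rabin,
| nothing fancy):
|
|     import random
|
|     def is_probable_prime(n, rounds=20):
|         if n < 2:
|             return False
|         for p in (2, 3, 5, 7, 11, 13):
|             if n % p == 0:
|                 return n == p
|         d, r = n - 1, 0
|         while d % 2 == 0:
|             d, r = d // 2, r + 1
|         for _ in range(rounds):
|             a = random.randrange(2, n - 1)
|             x = pow(a, d, n)
|             if x in (1, n - 1):
|                 continue
|             for _ in range(r - 1):
|                 x = pow(x, 2, n)
|                 if x == n - 1:
|                     break
|             else:
|                 return False
|         return True
|
|     print(is_probable_prime(15592425903773483751))  # False
|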
| Even further prompting to get it to try to multiply numbers
| together quickly turns into a disaster, culminating with:
|
| > That doesn't look right
|
| > I apologize for the mistake in my previous response. Here is
| the correct result of the multiplication:
|
| > 3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119 =
| 322711211111111111
|
| > Therefore, the product of the numbers is:
|
| > 322711211111111111.
|
| > I apologize for any confusion my previous response may have
| caused.
|
| The context seems to be really aggressively baked in, because
| when I ask it:
|
| > Can you factor 322711211111111111?
|
| It returns
|
| > Yes, I can factor the number 322711211111111111. Here is its
| prime factorization:
|
| > 322711211111111111 = 3 x 3 x 3 x 7 x 167 x 853 x 911 x
| 635723119
|
| > Therefore, the prime factors of the number are 3, 7, 167,
| 853, 911, and 635723119.
|
| > Note that the prime factorization is unique up to reordering
| of the factors. In this case, the prime factors are repeated
| several times, so the prime factorization can be written more
| compactly as:
|
| > 3^3 x 7 x 167 x 853 x 911 x 635723119
|
| > or
|
| > 3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119
|
| > I hope this helps! Let me know if you have any other
| questions.
|
| The real irony? 322711211111111111 is prime!
| mrtksn wrote:
| Does this make it practical to run LLMs on mobile devices? I
| wonder about the power consumption and if it can make sense to
| have it integrated in some future mobile devices. Or maybe have a
| dedicated storage, RAM and processing cores that goes as an USB-C
| add-on? A case with integrated battery and this chip?
|
| I'm dreaming of having LLMs on anything. Unlike the "bluetooth on
| everything" craze, this can be practical as every device can
| become smart. Remember how some British researchers made a self
| driving car using an LLM? A toaster anticipating how to cook when
| you describe to it what you want would actually be an improvement.
| tome wrote:
| I don't think we've put a GroqChip in a mobile device yet.
| Interesting idea!
| wmf wrote:
| I assume this is a million-dollar rack of custom chips so it's
| probably not coming to mobile any time soon.
| mrtksn wrote:
| Well, currently it's entirely possible to run these models on
| iPhones. It's just not practical because it eats all the
| resources and the battery when slowly generating the output.
|
| Therefore, if Groq has achieved significant efficiency
| improvements (that is, they are not getting that crazy speed
| through enormous power consumption), then maybe they can
| eventually build low-power, mass-produced, cutting-edge-fabbed
| chips that run at acceptable speed?
| wmf wrote:
| The thing is, I don't see any efficiency improvements. I
| see models running fast on very expensive hardware using
| techniques that don't scale down.
| mrtksn wrote:
| Care to explain? Are they using 10x energy for 10x speed
| improvements?
| wmf wrote:
| They're using hundreds of chips. Based on the data sheet
| I would estimate this demo uses 173 kW. It may be 100x
| energy to get 10x speedup.
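|
| Rough math (the per-card wattage is my assumption from the
| data sheet, so treat it as approximate):
|
|     cards = 568           # chip count quoted elsewhere in the thread
|     watts_per_card = 305  # assumed card power draw
|     print(cards * watts_per_card / 1000)  # ~173 kW for the demo system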
| mrtksn wrote:
| 100s of chips for who knows how many clients. The mobile
| phone will have to do calculations just for 1 client.
| tome wrote:
| Yes, we pipeline requests so multiple users are being
| handled by the same hardware at one time.
| mrtksn wrote:
| Thanks for the clarification. So, would you say that Groq
| has the potential to deliver, let's say, OpenAI speeds on
| handheld devices at reasonable energy consumption? Or is
| that not really where this tech's strength lies?
| tome wrote:
| The industry as a whole is a very long way away from
| that. The power requirements are too high for mobile.
| pptr wrote:
| I think the limitation is chip size/cost. SRAM is a lot
| less dense than DRAM. According to Google, SRAM is
| typically used for registers and caches, which are only
| megabytes in size.
| frozenport wrote:
| Yeah just offload the compute onto the cloud.
| mrtksn wrote:
| It's too unreliable, too restricted, and not private.
|
| ChatGPT stopped processing images for me. I'm trying to get
| help but support doesn't appear to be very fast; they asked
| for more info but I haven't heard back since.
|
| It's too restricted: it can't do anything on hard topics. It
| doesn't work when you try to work out exploits or dangers in
| a system, for example.
|
| It's not private: they say they don't train on API requests,
| but companies steer clear when it comes to sending sensitive
| data.
| frozenport wrote:
| The model being too restrictive does seem to be a good
| point.
|
| Do you think there are less restrictive models hosted on
| poe.com?
| SahAssar wrote:
| > Remember how some British researchers made a self driving car
| using an LLM?
|
| No? Do you mean actual, full self driving on normal roads in
| traffic?
| mrtksn wrote:
| Yes, IIRC they reason about the car's actions using LLMs. They
| still use image processing but once you identify the objects
| in the scene, the LLM interprets and decides what to do with
| the car.
|
| I'm not sure which one was it though(Ghost Autonomy maybe?).
| SahAssar wrote:
| Do you have a source? Because that actually, properly
| working would be headline global news and would value the
| company in the billions.
| mrtksn wrote:
| It was discussed here on HN, that's how I know about it.
|
| I found a few things when searched around but not sure
| which one was the one I recall.
|
| Anyway, here is a video from one:
| https://www.youtube.com/watch?v=C2rbym6bXM0
|
| Here is a paper discussing something similar:
| https://arxiv.org/abs/2307.07162
| SahAssar wrote:
| The description for that video says
|
| > Ghost Autonomy's MLLM-based capabilities are currently
| in development. These video and image examples show MLLM-
| based analysis of driving scenes captured from Ghost
| vehicles driving in both autonomous and conventional
| mode. _MLLM-based reasoning is not yet being returned to
| the car to impact actual driving maneuvers._
|
| So the model discussed is not doing any driving
| whatsoever. This is not self-driving at any level.
| mrtksn wrote:
| Then it's not the one I remember, maybe.
| jackblemming wrote:
| Impressive work. Nice job team. This is big.
| tome wrote:
| Thanks!
| sorokod wrote:
| Not clear if it is due to Groq or to Mixtral, but confident
| hallucinations are there.
| tome wrote:
| We run the open source models that everyone else has access to.
| What we're trying to show off is our low latency and high
| throughput, not the model itself.
| MuffinFlavored wrote:
| But if the model is useless/full of hallucinations, why does
| the speed of its output matter?
|
| "generate hallucinated results, faster"
| Cheer2171 wrote:
| No, it is "do whatever you were already doing with ML,
| faster"
|
| This question seems either from a place of deep confusion
| or is in bad faith. This post is about hardware. The
| hardware is model independent.* Any issues with models,
| like hallucinations, are going to be identical if it is run
| on this platform or a bunch of Nvidia GPUs. Performance in
| terms of hardware speed and efficiency are orthogonal to
| performance in terms of model accuracy and hallucinations.
| Progress on one axis can be made independently to the
| other.
|
| * Technically no, but close enough
| sorokod wrote:
| Well ok, Groq provides lower-latency, cheaper access to
| the same models of questionable quality.
|
| Is this not a lipstick-on-a-pig scenario? I suppose that's
| more of a question for the pig buyers.
| siwakotisaurav wrote:
| They're probably in the business of being the hardware
| provider. Best thing would be if Microsoft buys a lot of
| their chips and that way chatgpt is actually sped up.
| It's basically model independent
| Cheer2171 wrote:
| Okay. How about this: Someone posts to HN about an
| amazing new battery technology, which they demo by
| showing an average-sized smartphone watching TikTok
| endlessly scroll for over 500 hours on a single charge.
|
| Then someone comments that TikTok is a garbage fire and a
| horrible corrupting influence, yadda yadda, all that
| stuff. They ask: what is the point of making phones last
| longer just to watch TikTok? They say this improved
| efficiency in battery tech is just putting lipstick on a
| pig.
|
| That's you in this thread. That's the kind of irrelevant
| non-contribution you are making here.
| MuffinFlavored wrote:
| Batteries are useful. The majority of LLMs are not?
| sorokod wrote:
| Perhaps your analogy reveals more than you intended.
|
| What does it tell you about the new technology if the
| best vehicle to demonstrate it is TikTok?
| imtringued wrote:
| Mixtral 8x7b is competitive with ChatGPT 3.5 Turbo so I'm
| not sure why you are being so dismissive.
|
| https://chat.lmsys.org/ check the leaderboard.
| Der_Einzige wrote:
| https://preview.redd.it/jrb4e1wr9ll31.png?auto=webp&s=ccb5a
| 8...
|
| https://i.imgur.com/Jkdtlox.png
| tiborsaas wrote:
| I asked it to come up with name ideas for a company and it
| hallucinated them successfully :) I think the trick is to know
| which prompts are likely to yield results that won't suffer
| from hallucination. In other contexts it's a feature.
| sorokod wrote:
| A bit of a softball don't you think? The initial message
| suggests _" Are you ready to experience the world's fastest
| Large Language Model (LLM)? We'd suggest asking about a piece
| of history"_
|
| So I did.
| kumarm wrote:
| In the top-left corner you can change the model to the Llama 2
| 70B model.
| aphit wrote:
| This is incredibly fast, indeed.
|
| What are the current speeds in T/s for say ChatGPT 3.5 or ChatGPT
| 4? Just how much faster is this?
| kumarm wrote:
| I ran the same (Code generation) query and here are my results
| as end user:
|
| ChatGPT: 1 minute 45 seconds.
|
| Gemini: 16 seconds.
|
| Groq: 3 seconds.
| karpathy wrote:
| Very impressive looking! Just wanted to caution it's worth being
| a bit skeptical without benchmarks as there are a number of ways
| to cut corners. One prominent example is heavy model
| quantization, which speeds up the model at a cost of model
| quality. Otherwise I'd love to see LLM tok/s progress exactly
| like CPU instructions/s did a few decades ago.
| binary132 wrote:
| The thing is that tokens aren't an apples to apples metric....
| Stupid tokens are a lot faster than clever tokens. I'd rather
| see token cleverness improving exponentially....
| tome wrote:
| As a fellow scientist I concur with the approach of skepticism
| by default. Our chat app and API are available for everyone to
| experiment with and compare output quality with any other
| provider.
|
| I hope you are enjoying your time of having an empty calendar
| :)
| behnamoh wrote:
| tangent: Great to see you again on HN!
| sp332 wrote:
| At least for the earlier Llama 70B demo, they claimed to be
| running unquantized.
| https://twitter.com/lifebypixels/status/1757619926360096852
|
| Update: This comment says "some data is stored as FP8 at rest"
| and I don't know what that means.
| https://news.ycombinator.com/item?id=39432025
| tome wrote:
| The weights are quantized to FP8 when they're stored in
| memory, but all the activations are computed at full FP16
| precision.
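|
| Conceptually it's just weight-only quantization: weights live
| in memory in a compact 8-bit format and are expanded to FP16
| at compute time. A toy sketch (simulated with int8 scaling,
| since NumPy has no FP8 dtype; this is not our actual scheme):
|
|     import numpy as np
|
|     def quantize_weights(W):
|         # per-tensor symmetric 8-bit quantization (stand-in for FP8)
|         scale = np.abs(W).max() / 127.0
|         return np.round(W / scale).astype(np.int8), scale
|
|     def linear(x_fp16, W_q, scale):
|         # expand weights to FP16 for the matmul; activations stay FP16
|         W_fp16 = W_q.astype(np.float16) * np.float16(scale)
|         return x_fp16 @ W_fp16.T
|
|     W = np.random.randn(256, 256).astype(np.float16)
|     W_q, s = quantize_weights(W)
|     y = linear(np.random.randn(1, 256).astype(np.float16), W_q, s)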
| youssefabdelm wrote:
| Can you explain if this affects quality relative to fp16?
| And is mixtral quantized?
| tome wrote:
| We don't think so, but you be the judge! I believe we
| quantize both Mixtral and Llama 2 in this way.
| a_wild_dandan wrote:
| Is your confidence rooted in quantified testing, or just
| vibes? I'm sure you're right, just curious. (My
| reasoning: running inference at full fp16 is borderline
| wasteful. You can use q7 with almost no loss.)
| tome wrote:
| What quantified testing would you like to see? We've had
| a lot of very good feedback from our users, particularly
| about Mixtral.
| bearjaws wrote:
| Nothing really wrong with FP8 IMO; it usually performs within
| 98% of full precision while significantly reducing memory
| usage.
| losvedir wrote:
| Maybe I'm stretching the analogy too far, but are we in the
| transistor regime of LLMs already? Sometimes I see these 70
| billion parameter monstrosities and think we're still building
| ENIAC out of vacuum tubes.
|
| In other words, are we ready to steadily march on, improving
| LLM tok/s year by year, or are we a major breakthrough or two
| away before that can even happen?
| Gcam wrote:
| As part of our benchmarking of Groq we have asked Groq
| regarding quantization and they have assured us they are
| running models at full FP-16. It's a good point and important
| to check.
|
| Link to benchmarking: https://artificialanalysis.ai/ (Note
| question was regarding API rather than their chat demo)
| bsima wrote:
| As tome mentioned we don't quantize, all activations are FP16
|
| And here are some independent benchmarks
| https://artificialanalysis.ai/models/llama-2-chat-70b
| xvector wrote:
| Jesus Christ, these speeds with FP16? That is simply insane.
| throwawaymaths wrote:
| Ask how much hardware is behind it.
| modeless wrote:
| All that matters is the cost. Their price is cheap, so
| the real question is whether they are subsidizing the
| cost to achieve that price or not.
| sva_ wrote:
| To what extent is the API compatible with OpenAI? Does it offer
| logprobs[0] and top_logprobs[1]?
|
| 0. https://platform.openai.com/docs/api-
| reference/chat/create#c...
|
| 1. https://platform.openai.com/docs/api-
| reference/chat/create#c...
| tome wrote:
| You can find our API docs here, including details of our OpenAI
| compatibility
|
| https://docs.api.groq.com/
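|
| Roughly speaking, you can point the standard openai Python
| client at an OpenAI-compatible endpoint; the base URL and
| model name below are placeholders, so check the docs above
| for the real values:
|
|     from openai import OpenAI
|
|     client = OpenAI(
|         base_url="https://api.groq.com/openai/v1",  # placeholder
|         api_key="YOUR_API_KEY",
|     )
|     resp = client.chat.completions.create(
|         model="mixtral-8x7b-32768",  # placeholder model name
|         messages=[{"role": "user", "content": "Hello!"}],
|     )
|     print(resp.choices[0].message.content)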
| tome wrote:
| By the way, we also have a new Discord server where we are
| hosting our developer community. If you find anything missing
| in our API you can ask about there:
|
| https://discord.com/invite/TQcy5EBdCP
| kumarm wrote:
| Filled the form for API Access last night. Is there a delay
| with increased demand now?
| tome wrote:
| Yes, there's a huge amount of demand because Twitter
| discovered us yesterday. There will be a backlog, so sorry
| about that.
| kumarm wrote:
| Understandable. Wish you guys best of luck irrespective.
| aeyes wrote:
| Switching the model between Mixtral and Llama I get word for word
| the same responses. Is this expected?
| tome wrote:
| Yeah, this is a common observation. See my comment at
| https://news.ycombinator.com/item?id=39431921
|
| Maybe we should change the behavior to stop people getting
| confused.
| bjornsing wrote:
| No...
| treesciencebot wrote:
| The main problem with the Groq LPUs is that they don't have any
| HBM on them at all, just a minuscule (230 MiB) [0] amount of ultra-
| fast SRAM (20x faster than HBM3, just to be clear). Which means
| you need ~256 LPUs (4 full server racks of compute, each unit on
| the rack contains 8x LPUs and there are 8x of those units on a
| single rack) just to serve a single model [1] where as you can
| get a single H200 (1/256 of the server rack density) and serve
| these models reasonably well.
|
| It might work well if you have a single model with lots of
| customers, but as soon as you need more than a single model and a
| lot of finetunes/high rank LoRAs etc., these won't be usable. Or
| for any on-prem deployment since the main advantage is
| consolidating people to use the same model, together.
|
| [0]: https://wow.groq.com/groqcard-accelerator/
|
| [1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
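|
| Back-of-the-envelope (chip count and quantization are assumed,
| so treat the numbers as rough):
|
|     sram_per_chip_mib = 230
|     chips = 576                    # figure quoted elsewhere in the thread
|     total_sram_gib = sram_per_chip_mib * chips / 1024
|     mixtral_params_b = 46.7        # ~46.7B total parameters
|     weights_gb_fp8 = mixtral_params_b     # ~1 byte per param at FP8
|     print(total_sram_gib, weights_gb_fp8) # ~129 GiB SRAM vs ~47 GB weights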
| pclmulqdq wrote:
| Groq devices are really well set up for small-batch-size
| inference because of the use of SRAM.
|
| I'm not so convinced they have a Tok/sec/$ advantage at all,
| though, and especially at medium to large batch sizes which
| would be the groups who can afford to buy so much silicon.
|
| I assume given the architecture that Groq actually doesn't get
| any faster for batch sizes >1, and Nvidia cards do get
| meaningfully higher throughput as batch size gets into the
| 100's.
| nabakin wrote:
| I've been thinking the same but on the other hand, that would
| mean they are operating at a huge loss which doesn't scale
| foundval wrote:
| (Groq Employee) It's hard to discuss Tok/sec/$ outside of the
| context of a hardware sales engagement.
|
| This is because the relationship between Tok/s/u,
| Tok/s/system, Batching, and Pipelining is a complex one that
| involves compute utilization, network utilization, and (in
| particular) a host of compilation techniques that we wouldn't
| want to share publicly. Maybe we'll get to that level of
| transparency at some point, though!
|
| As far as Batching goes, you should consider that with
| synchronous systems, if all the stars align, Batch=1 is all
| you need. Of course, the devil is in the details, and
| sometimes small batch numbers still give you benefits. But
| Batch 100's generally gives no advantages. In fact, the
| entire point of developing deterministic hardware and
| synchronous systems is to avoid batching in the first place.
| tome wrote:
| If you want low latency you have to be really careful with HBM,
| not only because of the delay involved, but also the non-
| determinacy. One of the huge benefits of our LPU architecture
| is that we can build systems of hundreds of chips with fast
| interconnect and we know the precise timing of the whole system
| to within a few parts per million. Once you start integrating
| non-deterministic components your latency guarantees disappear
| very quickly.
| pclmulqdq wrote:
| I don't know about HBM specifically, but DDR and GDDR at a
| protocol level are both deterministic. It's the memory
| controller doing a bunch of reordering that makes them non-
| deterministic. Presumably, if that is the reason you don't
| like DRAM, you could build your compiler to be memory-layout
| aware and have the memory controller issue commands without
| reordering.
| tome wrote:
| That could be possible. It's out of my area of expertise so
| I can't say for sure. My understanding was that HBM forces
| specific access patterns and non-deterministic delays on you.
| Our compiler already deals with many other forms of
| resource-aware scheduling so it could take into account
| DRAM refreshes easily, so I feel like there must be
| something else that makes SRAM more suitable in our case.
| I'll have to leave that to someone more knowledgeable to
| explain though ...
| johntb86 wrote:
| Presumably with dram you also have to worry about
| refreshes, which can come along at arbitrary times relative
| to the workload.
| pclmulqdq wrote:
| You can control when those happen, too.
| frognumber wrote:
| From a theoretical perspective, this is absolutely not true.
| Asynchronous logic can achieve much lower latency guarantees
| than synchronous logic.
|
| Come to think of it, this is one of the few places where
| asynchronous logic might be more than academic... Async logic
| is hard with complex control flows, which deep learning
| inference does not have.
|
| (From a practical perspective, I know you were comparing to
| independently-clocked logic, rather than async logic)
| foundval wrote:
| (Groq Employee) You're right - we are comparing to
| independently-clocked logic.
|
| I wonder whether async logic would be feasible for
| reconfigurable "Spatial Processor" type architectures [1].
| As far as LPU architectures go, they fall in the "Matrix of
| Processing Engines"[1] family of architectures, which I
| would naively guess is not the best suited to leverage
| async logic.
|
| 1: I'm using the "Spatial Processor" (7:14) and "Matrix of
| Processing Engines" (8:57) terms as defined in
| https://www.youtube.com/watch?v=LUPWZ-LC0XE. Sorry for a
| video link, I just can't think of another single reference
| that explains the two approaches.
| SilverBirch wrote:
| Surely once you're scaling over multiple chips/servers/racks
| you're dealing with retries and checksums and sequence
| numbers anyway? How do you get around the non-determinacy of
| networking beyond just hoping that you don't see any errors?
| tome wrote:
| Our interconnect between chips is also deterministic! You
| can read more about our interconnect, synchronisation, and
| error correction in our paper.
|
| https://wow.groq.com/wp-
| content/uploads/2023/05/GroqISCAPape...
| matanyal wrote:
| Groq Engineer here, I'm not seeing why being able to scale
| compute outside of a single card/node is somehow a problem. My
| preferred analogy is to a car factory: Yes, you could build a
| car with say only one or two drills, but a modern automated
| factory has hundreds of drills! With a single drill, you could
| probably build all sorts of cars, but a factory assembly line
| is only able to make specific cars in that configuration. Does
| that mean that factories are inefficient?
|
| You also say that H200's work reasonably well, and that's
| reasonable (but debatable) for synchronous, human interaction
| use cases. Show me a 30b+ parameter model doing RAG as part of
| a conversation with voice responses in less than a second,
| running on Nvidia.
| pbalcer wrote:
| Just curious, how does this work out in terms of TCO (even
| assuming the price of a Groq LPU is 0$)? What you say makes
| sense, but I'm wondering how you strike a balance between
| massive horizontal scaling vs vertical scaling. Sometimes
| (quite often in my experience) having a few beefy servers is
| much simpler/cheaper/faster than scaling horizontally across
| many small nodes.
|
| Or have I got this completely wrong, and your solution enables
| use-cases that are simply unattainable on mainstream
| (Nvidia/AMD) hardware, making TCO argument less relevant?
| tome wrote:
| We're providing by far the lowest latency LLM engine on the
| planet. You can't reduce latency by scaling horizontally.
| nickpsecurity wrote:
| Distributed, shared memory machines used to do exactly
| that in HPC space. They were a NUMA alternative. It works
| if the processing plus high-speed interconnect are
| collectively faster than the request rate. The 8x setups
| with NVLink are kind of like that model.
|
| You may have meant that nobody has a stack that uses
| clustering or DSM with low-latency interconnects. If so,
| then that might be worth developing given prior results
| in other low-latency domains.
| tome wrote:
| I think existing players will have trouble developing a
| low-latency solution like ours whilst they are still
| running on non-deterministic hardware.
| WanderPanda wrote:
| What do you mean by non-deterministic hardware? cuBLAS on
| a laptop GPU was deterministic when I tried it last iirc
| tome wrote:
| Non-deterministic timing characteristics.
| frozenport wrote:
| Tip of the iceberg.
|
| DRAM needs to be refreshed every X cycles.
|
| This means you don't know the time it takes to read from
| memory. You could be reading at a refresh cycle. This
| circuitry also adds latency.
| nickpsecurity wrote:
| While you're here, I have a quick, off-topic question.
| We've seen incredible results with GPT-3 175B (Davinci)
| and GPT4 (MoE). Making attempts at open models that reuse
| their architectural strategies could have a high impact
| on everyone. Those models took 2,500-25,000 GPUs to train,
| though. It would be great to have a low-cost option for
| pre training Davinci-class models.
|
| It would be great if a company or others with AI hardware
| were willing to do production runs of chips sold at cost
| specifically to make open, permissive-licensed models. As
| in, since you'd lose profit, the cluster owner and users
| would be legally required to only make permissive models.
| Maybe at least one in each category (eg text, visual).
|
| Do you think your company or any other hardware supplier
| would do that? Or someone sell 2500 GPU's at cost for
| open models?
|
| (Note to anyone involved in CHIPS Act: please fund a
| cluster or accelerator specifically for this.)
| tome wrote:
| Great idea, but Groq doesn't have a product suitable for
| training at the moment. Our LPUs shine in inference.
| treprinum wrote:
| > Show me a 30b+ parameter model doing RAG as part of a
| conversation with voice responses in less than a second,
| running on Nvidia
|
| I built one, should be live soon ;-)
| tome wrote:
| Exciting! Looking forward to seeing it.
| huac wrote:
| > 30b+ parameter model doing RAG as part of a conversation
| with voice responses in less than a second, running on
| Nvidia.
|
| I believe that this is doable - my pipeline is generally
| closer to 400ms without RAG and with Mixtral, with a lot of
| non-ML hacks to get there. It would also definitely be doable
| with a joint speech-language model that removes the
| transcription step.
|
| For these use cases, time to first byte is the most important
| metric, not total throughput.
| mlazos wrote:
| You can't scale horizontally forever because of
| communication. I think HBM would provide a lot more
| flexibility with the number of chips you need.
| moralestapia wrote:
| >The main problem with the Groq LPUs is, they don't have any
| HBM on them at all. Just a miniscule (230 MiB) [0] amount of
| ultra-fast SRAM [...]
|
| IDGAF about any of that, lol. I just want an API endpoint.
|
| 480 tokens/sec at $0.27 per million tokens? Sign me up, I don't
| care about their hardware, at all.
| treesciencebot wrote:
| There are providers out there offering models at $0 per million
| tokens, but that doesn't mean it is sustainable or won't
| disappear as soon as the VC well runs dry. I'm not saying this
| is the case for Groq, but in general you probably should care
| if you want to build something serious on top of anything.
| trsohmers wrote:
| Groq states in this article [0] that they used 576 chips to
| achieve these results, and continuing with your analysis, you
| also need to factor in that each additional user you want to
| serve requires a separate KV cache, which can add multiple
| gigabytes per user.
|
| My professional, independent-observer opinion (not based on my
| 2 years of working at Groq) is that their COGS to achieve these
| performance numbers would exceed several million dollars.
| Depreciating that over expected usage at the theoretical prices
| they have posted seems impractical, so from an actual
| performance-per-dollar standpoint they don't seem viable, but
| they do have a very cool demo of an insane level of performance
| if you throw cost concerns out the window.
|
| [0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-
| dep...
| tome wrote:
| Thomas, I think for full disclosure you should also state
| that you left Groq to start a competitor (a competitor which
| doesn't have the world's lowest latency LLM engine nor a
| guarantee to match the cheapest per token prices, like Groq
| does.).
|
| Anyone with a serious interest in the total cost of ownership
| of Groq's system is welcome to email contact@groq.com.
| trsohmers wrote:
| I thought that was clear through my profile, but yes,
| Positron AI is focused on providing the best performance
| per dollar while providing the best quality of service and
| capabilities rather than just focusing on a single metric
| of speed.
|
| A guarantee to match the cheapest per-token prices sure is
| a great way to lose a race to the bottom, but I do wish
| Groq (and everyone else trying to compete against NVIDIA)
| the greatest luck and success. I really do think that the
| single-batch/single-user performance by Groq is a great demo,
| but not the best solution for a wide variety of applications;
| still, I hope it can find its niche.
| imtringued wrote:
| I honestly don't see the problem.
|
| "just to serve a single model" could be easily fixed by adding
| a single LPDDR4 channel per LPU. Then you can reload the model
| sixty times per second and serve 60 different models per
| second.
| treesciencebot wrote:
| Per-chip compute is not the main thing this chip innovates on
| for fast inference; it is the extremely fast memory bandwidth.
| When you do that, you'll lose all of that and will be much
| worse off than any off-the-shelf accelerator.
| QuadmasterXLII wrote:
| Load model, compute a 1k-token response (i.e., do a thousand
| forward passes in sequence, one per token), load a different
| model, compute a response, and so on.
|
| I would expect the model loading to take basically zero
| percent of the time in the above workflow.
| londons_explore wrote:
| > more than a single model and a lot of finetunes/high rank
| LoRAs
|
| I can imagine a way might be found to host a base model and a
| bunch of LoRAs whilst using barely more RAM than the base model
| alone.
|
| The fine-tuning could perhaps be done in such a way that only
| ~0.1% of the weights are changed, and for every computation the
| difference is applied not to the weights themselves but to the
| output activations.
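| Something like this minimal sketch of the low-rank idea (shapes
| and names are illustrative, not any particular framework's API):
|
|     import numpy as np
|
|     d, r = 4096, 16                   # hidden size, adapter rank
|     W = np.random.randn(d, d) * 0.01  # shared base weight, loaded once
|     A = np.random.randn(r, d) * 0.01  # per-finetune adapter (tiny)
|     B = np.zeros((d, r))              # per-finetune adapter (tiny)
|
|     def forward(x, A, B):
|         # base activations plus a low-rank correction applied to
|         # the activations, so W is never duplicated per finetune
|         return W @ x + B @ (A @ x)
|
|     y = forward(np.random.randn(d), A, B)
|
| Each adapter is 2*d*r = 131k parameters versus ~16.8M for W,
| i.e. well under 1% extra memory per finetune.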
| kcorbitt wrote:
| This actually already exists! We did a writeup of the
| relevant optimizations here: https://openpipe.ai/blog/s-lora
| imiric wrote:
| Impressive demo!
|
| However, the hardware requirements and cost make this
| inaccessible for anyone but large companies. When do you envision
| that the price could be affordable for hobbyists?
|
| The CNN Vapi demo was impressive as well, but a few weeks ago
| someone here[1] shared https://smarterchild.chat/. That also
| has _very_ low audio latency, making natural conversation
| possible. From that discussion it seems that
| https://www.sindarin.tech/ is behind it. Do we know if they use
| Groq LPUs or something else?
|
| I think that once you reach ~50 t/s, real-time interaction is
| possible. Anything higher than that is useful for generating
| large volumes of data quickly, but there are diminishing returns
| as it's far beyond what humans can process. Maybe such speeds
| would be useful for AI-AI communication, transferring
| knowledge/context, etc.
|
| So an LPU product that's only focused on AI-human interaction
| could have much lower capabilities, and thus much lower cost, no?
|
| [1]: https://news.ycombinator.com/item?id=39180237
| tome wrote:
| > However, the hardware requirements and cost make this
| inaccessible for anyone but large companies. When do you
| envision that the price could be affordable for hobbyists?
|
| For API access to our tokens as a service we guarantee to beat
| any other provider on cost per token (see
| https://wow.groq.com). In terms of selling hardware, we're
| focused on selling whole systems, and they're only really
| suitable for corporations or research institutions.
| dmw_ng wrote:
| Given the size of the Sindarin team (3 AFAICT), that mostly
| looks like a clever combination of existing tech. There are some
| speech APIs that offer word-by-word realtime transcription
| (Google has one), so I assume most of the special sauce is very
| well-thought-out pipelining between speech recognition -> LLM ->
| TTS.
|
| (Not to denigrate their awesome achievement; I would not be
| interested if I were not curious about how to reproduce their
| result!)
| stormfather wrote:
| Much more than 50 t/s is absolutely necessary for real-time
| interaction with AI systems. Most of the LLM's output will be
| internal monologue and planning, performing RAG and
| summarization, etc., with only the final output being
| communicated to you. Imagine a blazingly fast GPT-5 that goes
| through multiple cycles of planning out how to answer you,
| searching the web, writing book reports, debating itself,
| distilling what it finds, and critiquing and rewriting its
| answer, all while you blink a few times.
| patapong wrote:
| Very impressive! I am even more impressed by the API pricing
| though - $0.27/1M tokens seems like an order of magnitude cheaper
| than the GPT-3.5 API, and two orders of magnitude cheaper than
| GPT-4? Am I missing something here?
| siwakotisaurav wrote:
| They're competing with the lowest-cost providers for Mixtral
| atm, which afaik is currently DeepInfra at the same pricing
| patapong wrote:
| Huh! Had no idea open source models were already ahead of
| OpenAI on pricing - will have to look into using these for
| my use cases.
| doubtfuluser wrote:
| Nice... a startup that has two "C" positions CEO and Chief Legal
| Officer...
|
| That sounds like a fun place to be
| kumarm wrote:
| They seem to be around since 2016. May be not bad for a LLM
| company that would need to deal with legal issues?
| eigenvalue wrote:
| I just want to say that this is one of the most impressive tech
| demos I've ever seen in my life, and I love that it's truly an
| open demo that anyone can try without even signing up for an
| account or anything like that. It's surreal to see the thing
| spitting out tokens at such a crazy rate when you're used to
| watching them generate at less than one fifth that speed. I'm
| surprised you guys haven't been swallowed up by Microsoft, Apple,
| or Google already for a huge premium.
| tome wrote:
| Really glad you like it! We've been working hard on it.
| lokimedes wrote:
| The speed part or the being swallowed part?
| tome wrote:
| The speed part. We're not interested in being swallowed.
| The aim is to be bigger than Nvidia in three years :)
| dazzaji wrote:
| Go for it!
| nurettin wrote:
| Can you warn us pre-IPO?
| tome wrote:
| I'm sure you'll hear all about our IPO on HN :) :)
| timomaxgalvin wrote:
| Sure, but the responses are very poor compared to MS tools.
| brcmthrowaway wrote:
| I have it on good authority Apple was very close to acquiring
| Groq
| baq wrote:
| If this is true, expect a call from the SEC...
| 317070 wrote:
| Even if it isn't true.
|
| Disclosing inside information is illegal, _even if it is
| false and fabricated_, if it leads to personal gains.
| KRAKRISMOTT wrote:
| You have to prove the OP had personal gains. If he's just
| a troll, it will be difficult.
| frognumber wrote:
| You also have to be an insider.
|
| If I go to a bar and overhear a pair of Googlers discussing
| something secret, I can:
|
| 1) Trade on it.
|
| 2) Talk about it.
|
| Because I'm not an insider. On the other hand, if I'm
| sleeping with the CEO, I become an insider.
|
| Not a lawyer. Above is not legal advice. Just a comment
| that the line is much more complex, and talking about a
| potential acquisition is usually okay (if you're not
| under NDA).
| throwawayurlife wrote:
| It doesn't matter if you overheard it at a bar or if
| you're just some HN commenter posting completely
| incorrect legal advice; the law prohibits trading on
| material nonpublic information.
|
| I would pay a lot to see you try your ridiculous legal
| hokey-pokey on how to define an "insider."
| programmarchy wrote:
| If you did hear it in a bar, could you tweet it out
| before your trade, so the information is made public?
| sofixa wrote:
| Had insider trading training, and yes, that's the gist of
| it. If you know or presume that the information is
| material (makes a difference) and not public, it's
| illegal to act on it.
| tripletao wrote:
| Roughly, it's illegal only if you have some duty not to
| trade on it. If you acquired the information without
| misappropriating it (like overhearing it from strangers
| in a normal public bar), then you're free to trade.
|
| https://corpgov.law.harvard.edu/2017/01/18/insider-
| trading-l...
|
| There's no reason for normal corporate training to
| discuss that element, because an employee who trades
| their employer's stock based on MNPI has near-certainly
| misappropriated it. The question of whether a non-
| employee has misappropriated information is much more
| complex, though.
| frognumber wrote:
| Training is designed to protect the corporation, not to
| provide accurate legal advice. That's true of most corporate
| trainings, for that matter, whether about bribes / corruption,
| harassment, discrimination, or whatnot.
| Corporations want employees very far from the line.
|
| That's the right way to run them.
|
| If you want more nuance, talk to a lawyer or read case
| law.
|
| Generally, insider trading requires something along the
| lines of a fiduciary duty to keep the information secret,
| albeit a very weak one. I'm not going to slice that line,
| but you see references in-thread.
| frognumber wrote:
| You're an idiot.
|
| https://www.kiplinger.com/article/investing/t052-c008-s00
| 1-w...
|
| Case #1.
| sheepscreek wrote:
| TIL that the SEC has authority over private company dealings
| wrt the sale of shares [1].
|
| [1] https://www.sec.gov/education/capitalraising/building-
| blocks...
| belter wrote:
| Not if poster is in a crashing plane...
| elorant wrote:
| Perplexity Labs also has an open demo of Mixtral 8x7b although
| it's nowhere near as fast as this.
|
| https://labs.perplexity.ai/
| larodi wrote:
| Why sell? It would be much more delightful to beat them at
| their own game.
| deniz_tekalp wrote:
| GPUs are notoriously bad at exploiting sparsity. I wonder if
| this architecture can do a better job. To the Groq engineers in
| this thread: if a neural network had, say, 60% of its weights
| set to 0, what would that do to cost and speed on your hardware?
| supercharger9 wrote:
| Do they make money from LLM service or by selling hardware?
| Homepage is confusing without any reference to other products.
| tome wrote:
| Both, we sell tokens as a service and we sell enterprise
| systems.
| supercharger9 wrote:
| Then reference that on the homepage? If not for this HN
| thread, I wouldn't have known you sell hardware.
| deepnotderp wrote:
| This demo has more than 500 chips btw, it's not exactly an apples
| to apples comparison with 1 GPU...
| tome wrote:
| Definitely not, but even with a comparison to 500 GPUs Groq
| will still come out on top because you can never reduce latency
| by adding more parallel compute :)
| varunvummadi wrote:
| Please let me know if I am wrong: are you guys running a batch
| size of 1 on 500 GPUs? If so, why are the responses almost
| instant at batch size 1? And when can we expect a bring-your-
| own-fine-tuned-model kind of thing? Thanks!
| tome wrote:
| We are not using 500 GPUs, we are using a large system
| built from many of our own custom ASICs. This allows us to
| do batch size 1 with no reduction in overall throughput.
| (We are doing pipelining though, so many users are using
| the same system at once).
| eurekin wrote:
| Jaw dropping. Both groq and mixtral.
|
| I used the following prompt:
|
| Generate gitlab ci yaml file for a hybrid front-end/backend
| project. Frontend is under /frontend and is a Node project,
| packaged with yarn, built with vite to the /backend/public
| folder. The backend is a python flask server
| idiotlogical wrote:
| I asked it "What is the oldest known written recipe" and then
| asked for that recipe. It seemed to repeat the same response a
| bunch of times (very quickly!) before stopping. Is this abnormal
| behavior?
| neonbrain wrote:
| For me, the reply to the follow-up question "Please tell me the
| exact steps of that recipe." was:
|
| Here is a rough translation of the oldest known written recipe,
| which is for a kind of beer from ancient Mesopotamia:
|
| 1. Crush and mash 4 liters of barley to create malt.
|
| 2. Soak the malt in water for 2.5 days.
|
| 3. Leave the mash in a cool place for 2.5 days.
|
| 4. Filter the mash through a strainer to separate the liquid
| from the solids.
|
| 5. Boil the liquid for 1 hour.
|
| 6. Add 1 liter of honey and 1 handful of hops to the liquid.
|
| 7. Leave the mixture to cool.
|
| 8. Add 1 handful of "gig" (a type of wild yeast) to the
| mixture.
|
| 9. Leave the mixture in a cool place for 5 days to allow
| fermentation.
|
| 10. Filter the mixture again to separate the beer from the
| solids.
|
| 11. Serve the beer chilled.
|
| It's important to note that this recipe is over 3,700 years
| old, and it may not produce a beer that is palatable to modern
| tastes. Additionally, some of the ingredients and techniques
| used in this recipe may not be readily available or practical
| for home brewers today.
| yzh wrote:
| Really impressive work! I wonder how easy it would be to support
| (a future open-source version of) Sora using Groq's design. Will
| there be a Video Processing Unit (VPU)?
| jkachmar wrote:
| i can't comment about sora specifically, however the
| architecture can support workloads beyond just LLM inference.
|
| our demo booth at trade shows usually has StyleCLIP up at one
| point or another to provide an abstract example of this.
|
| disclosure: i work on infrastructure at Groq and am generally
| interested in hardware architecture and compiler design,
| however i am not a part of either of those teams :)
| sylware wrote:
| any noscript/basic (x)html prompt?
| newsclues wrote:
| I asked it what Carmack's AI company was called, and it
| correctly identified John Carmack but said he was working on VR.
| neilv wrote:
| If the page can't access certain fonts, it will fail to work,
| while it keeps retrying requests:
| https://fonts.gstatic.com/s/notosansarabic/[...]
| https://fonts.gstatic.com/s/notosanshebrew/[...]
| https://fonts.gstatic.com/s/notosanssc/[...]
|
| (I noticed this because my browser blocks these de facto trackers
| by default.)
| sebastiennight wrote:
| Same problem when trying to use font replacements with a
| privacy plugin.
|
| This is a very weird dependency to have :-)
| tome wrote:
| Thanks, I've reported this internally.
| rasz wrote:
| How to show Google how popular and interesting for acquisition
| you are without directly installing google trackers on your
| website.
| totalhack wrote:
| Where is the data center located? The fastest response time I
| could get from some quick testing from the northeast US, having
| it output just one letter, was 670ms. Just wondering if that's an
| expected result, as that's on par with or slower than GPT-3.5
| via API.
| MaxLeiter wrote:
| There's a queueing system if too many requests are being
| processed at once. You may have hit that.
| tome wrote:
| West Coast US. You would have been placed in our queuing system
| because with all the attention we are getting we are very busy
| right now!
| totalhack wrote:
| Thanks! I did notice the queue count showing up occasionally,
| but not every time. Maybe someone with queue-free access could
| repeat the test so we can get an understanding of the potential
| latency once scaled and geo-distributed. What I'm really trying
| to understand is whether the time to first token is actually
| faster than GPT-3.5 via API, or just the rate of token output
| once it begins.
| tome wrote:
| I don't know about GPT 3.5 specifically, but on this
| independent benchmark (LLMPerf) Groq's time to first token
| is also lowest:
|
| https://github.com/ray-project/llmperf-
| leaderboard?tab=readm...
| anybodyz wrote:
| I have this hooked up experimentally to my universal Dungeon
| Master simulator DungeonGod and it seems to work quite well.
|
| I had been using Together AI Mixtral (which is serving the Hermes
| Mixtrals) and it is pretty snappy, but nothing close to Groq. I
| think the next closest that I've tested is Perplexity Labs'
| Mixtral.
|
| A key blocker in just hanging out a shingle for an open source AI
| project is the fear that anything that might scale will bankrupt
| you (or just be offline if you get any significant traction). I
| think we're nearing the phase that we could potentially just turn
| these things "on" and eat the reasonable inference fees to see
| what people engage with - with a pretty decently cool free tier
| available.
|
| I'd add that the simulator does multiple calls to the api for one
| response to do analysis and function selection in the underlying
| python game engine, which Groq makes less of a problem as it's
| close to instant. This adds a pretty significant pause in the
| OpenAI version. Also since this simulator runs on Discord with
| multiple users, I've had problems in the past with 'user response
| storms' where the AI couldn't keep up. Also less of a problem
| with Groq.
| monkin wrote:
| It's impressive, but I have one problem with all of those models.
| I wanted them to answer what Mixtral or Llama2 are, but with no
| luck. It would be great if models could at least describe
| themselves.
| johndough wrote:
| There are two issues with that.
|
| 1. To create a model, you have to train it on training data.
| Mixtral and Llama2 did not exist before they were trained, so
| their training data did not contain any information about
| Mixtral or Llama2 (respectively). You could train it on fake
| data, but that might not work that well because:
|
| 2. The internet is full of text like "I am <something>", so it
| would probably overshadow any injected training data like "I am
| Llama2, a model by MetaAI."
|
| You could of course inject the information as an invisible
| system prompt (like OpenAI is doing with ChatGPT), but that is
| a waste of computation resources.
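| For reference, that last approach looks roughly like this
| (illustrative payload only; the model id and wording are
| placeholders, not what any provider actually injects):
|
|     payload = {
|         "model": "mixtral-8x7b-32768",  # placeholder model id
|         "messages": [
|             # prepended by the provider; the user never sees it
|             {"role": "system",
|              "content": "You are Mixtral 8x7B, an open-weights "
|                         "model created by Mistral AI."},
|             {"role": "user", "content": "What model are you?"},
|         ],
|     }
|
| Those extra system tokens have to be processed on every request,
| which is the "waste of computation" mentioned above.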
| roomey wrote:
| Oh hell yes, this is the first "fast" one, superhuman fast.
|
| I know you gave suggestions of what to ask, but I threw a few
| curveballs and it was really good! Well done, this is a big
| step forward.
| Klaus23 wrote:
| The demo is pretty cool, but the mobile interface could be a
| parody of bad interface design. The text box at the top is hard
| to reach when you want to open the keyboard (which closes
| automatically) or press the button to send the question, and the
| chat history is out of chronological order for no logical reason.
|
| Edit: Text selection is also broken.
| fatkam wrote:
| For me, it was fast when it started printing (it did almost
| instantly), but it took forever for it to start.
| tome wrote:
| There are a lot of people interested in Groq now, so most jobs
| are sitting in a queue for a little while.
| fatkam wrote:
| fair enough... I guess at least it didn't crash like many
| other overwhelmed sites do... but at the end of the day, it
| was my experience.
| SeanAnderson wrote:
| Sorry, I'm a bit naive about all of this.
|
| Why is this impressive? Can this result not be achieved by
| throwing more compute at the problem to speed up responses? Isn't
| the fact that there is a queue when under load just indicative
| that there's a trade-off between "# of requests to process per
| unit of time" and "amount of compute to put into a response to
| respond quicker"?
|
| https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...
|
| This chart from NVIDIA implies their H100 runs llama v2 70B at
| >500 tok/s.
| MasterScrat wrote:
| Scaling up compute can improve throughput, but can't easily
| improve latency between tokens. Generation is usually
| bottlenecked by the time it takes to go through the network for
| each token. To speed that up, you need to perform these
| computations faster, which is a hard problem after you've
| exhausted all the obvious options (use the fastest accelerator
| you can find, cache what you can etc).
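| A common back-of-envelope for the per-user ceiling on a single
| GPU (approximate numbers, my own estimate): each decoded token
| has to stream the active weights from memory once, so
|
|     active_params = 12.9e9   # Mixtral routes ~12.9B params/token
|     bytes_per_param = 2      # fp16 weights
|     hbm_bw = 3.35e12         # H100 SXM HBM3, ~3.35 TB/s
|     ceiling = hbm_bw / (active_params * bytes_per_param)
|     print(f"~{ceiling:.0f} tokens/s per sequence, upper bound")
|
| i.e. on the order of 130 t/s per user before overheads; splitting
| the weights across GPUs helps some, but the communication cost
| mentioned elsewhere in the thread limits how far that goes.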
| SeanAnderson wrote:
| Yeah. That makes sense, thank you for clarifying. I updated
| my original post with a chart from NVIDIA which highlights
| the H100's capabilities. It doesn't seem unreasonable to
| expect a 7B model to run at 500 tok/s on that hardware.
| snowfield wrote:
| This is a 50B model. (Mixtral 8x7b)
| SeanAnderson wrote:
| Oh, sorry, I assumed the 8 was for quantization. 8x7b is
| a new syntax for me.
|
| Still, the NVIDIA chart shows Llama v2 70B at 750 tok/s,
| no?
| tome wrote:
| I guess that's total throughput, rather than per user?
| You can increase total throughput by scaling
| horizontally. You can't increase throughput per user that
| way.
| tome wrote:
| LLM inference is inherently a sequential problem. You can't
| speed it up by doing more in parallel. You can't generate the
| 101st token before you've generated the 100th.
| NorwegianDude wrote:
| Technically, I guess you can use speculative execution to
| speed it up, and in that way take a guess at what the 100th
| token will be and start on the 101st token at the same time?
| Though it probably has it's own unforeseen challenges.
|
| Everything is predictable with enough guesses.
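| A toy sketch of that idea, usually called speculative decoding
| (the two "models" here are just stand-in functions, purely
| illustrative):
|
|     def draft_next(ctx):          # stand-in for a small fast model
|         return (ctx[-1] * 7 + 1) % 50
|
|     def target_next(ctx):         # stand-in for the large model
|         return (ctx[-1] * 7 + 1) % 50
|
|     def speculative_step(tokens, k=4):
|         ctx, draft = list(tokens), []
|         for _ in range(k):        # draft k tokens cheaply, in order
|             draft.append(draft_next(ctx))
|             ctx.append(draft[-1])
|         ctx, accepted = list(tokens), []
|         for t in draft:           # verify; in practice one batched pass
|             if target_next(ctx) != t:
|                 break
|             accepted.append(t)
|             ctx.append(t)
|         return accepted or [target_next(list(tokens))]
|
|     seq = [1]
|     for _ in range(3):
|         seq += speculative_step(seq)
|     print(seq)
|
| The big model still has the final say on every token, so the
| output matches what it would have produced on its own; you only
| win when the guesses are right.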
| SushiHippie wrote:
| I guess it depends on how much the infrastructure from TFA
| costs, as the H100 only costs ~$3300 to produce, but gets sold
| for ~$30k on average.
|
| https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp...
| nabakin wrote:
| There's a difference between token throughput and latency.
| Token throughput is the token throughput of the whole
| GPU/system and latency is the token throughput for an
| individual user. Groq offers extremely low latency (aka
| extremely high token throughput per user) but we still don't
| have numbers on the token throughput of their entire system.
| Nvidia's metrics here on the other hand, show us the token
| throughput of the whole GPU/system. So, in reality, while you
| might be able to get 1.5k t/s on an H100, the latency (token
| throughput per user) will be something much lower like 20 t/s.
|
| The really important metric to look for is cost per token
| because even though Groq is able to run at low latency, that
| doesn't mean it's able to do it cheaply. Determining the cost
| per token can be done many ways but a useful way for us is
| approximately the cost of the system divided by the total token
| throughput of the system per second. We don't have the total
| token throughput per second of Groq's system so we can't really
| say how efficient it is. It could very well be that Groq is
| subsidizing the cost of their system to lower prices and gain
| PR and will increase their prices later on.
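| To make the shape of that calculation concrete, with deliberately
| made-up numbers (none of these figures come from Groq):
|
|     system_cost = 10_000_000       # hypothetical hardware cost, USD
|     lifetime_s = 3 * 365 * 86400   # straight-line over 3 years
|     system_tput = 20_000           # hypothetical aggregate tokens/s
|     utilization = 0.5              # hypothetical average load
|     tokens = system_tput * utilization * lifetime_s
|     print(system_cost / tokens * 1e6)  # hardware-only $/1M, ~10.6
|
| Whether the real aggregate throughput makes the honest version of
| this number land above or below the posted $0.27/1M is exactly
| what we can't tell from the outside.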
| frozenport wrote:
| https://wow.groq.com/artificialanalysis-ai-llm-benchmark-
| dou...
|
| Seems to have it. Looks cost competitive but a lot faster.
| Gcam wrote:
| Groq's API reaches close to this level of performance as well.
| We've benchmarked performance over time and >400 tokens/s has
| been sustained - see here
| https://artificialanalysis.ai/models/mixtral-8x7b-instruct
| (bottom of page for over time view)
| mise_en_place wrote:
| Do you have any plans to support bringing your own model? I have
| been using Sagemaker but it is very slow to deploy to.
| tome wrote:
| Yes, we're working with some customers on that, but it will be
| a while until general availability.
| youssefabdelm wrote:
| Do you guys provide logprobs via the api?
| tome wrote:
| You can check out all our API features here:
| https://docs.api.groq.com/
| ggnore7452 wrote:
| The Groq demo was indeed impressive. I work with LLMs a lot at
| work, and a generation speed of 500+ tokens/s would definitely
| change how we use these products (especially considering it's an
| early-stage product).
|
| But the "completely novel silicon architecture" and the "self-
| developed LPU" (claiming not to use GPUs)... make me a bit
| skeptical. After all, pure speed might be achievable through
| stacking computational power and model quantization. Shouldn't
| innovation at the GPU level be quite challenging, especially to
| achieve such groundbreaking speeds?
| ggnore7452 wrote:
| more on the LPU and data center: https://wow.groq.com/lpu-
| inference-engine/
|
| price and speed benchmark: https://wow.groq.com/
| jereees wrote:
| I'll pay $xx a month if I can talk to Groq the way I can talk to
| ChatGPT with my AirPods
| supercharger9 wrote:
| Ignoring latency but not throughput, how does this compare in
| terms of cost (card acquisition cost and power needed) with
| Nvidia GPUs for inference?
| mlconnor wrote:
| omg. i can't believe how incredibly fast that is. and capable
| too. wow
| ttul wrote:
| Have you experimented with running diffusion models on Groq
| hardware?
| tome wrote:
| Yes, we don't have any publicly accessible ones at the moment
| though.
| lukevp wrote:
| I'm not sure how, but I got the zoom messed up on iOS and I can
| no longer see the submit button. Refreshing doesn't fix it.
| QuesnayJr wrote:
| I tried it out, and I was taken aback how quickly it answered.
| tagyro wrote:
| I (only) ran a couple of prompts but I am impressed. It has the
| speed of GPT-3.5 and the quality of GPT-4.
|
| Seriously considering switching from [open]AI to Mix/s/tral in my
| apps.
| eightysixfour wrote:
| Mixtral 8x7 is good, but it is not GPT-4 good in any of the use
| cases I have tried. Mistral's other models get close and beat
| it in some cases, but not Mixtral.
| ohwellish wrote:
| I wish there was an option to export the whole chat session, say
| as plaintext at a pastebin link; the chat I just had with Groq
| would really impress some people I know.
| Keyframe wrote:
| This is insane. Congratulations!
| codedokode wrote:
| Is it normal that I have asked two networks (llama/mixtral) the
| same question ("tell me about most popular audio pitch detection
| algorithms") and they gave almost the same answer? Both answers
| start with "Sure, here are some of the most popular pitch
| detection algorithms used in audio signal processing" and end
| with "Each of these algorithms has its own strengths and
| weaknesses, and the choice of algorithm depends on the specific
| application and the characteristics of the input signal.". And
| the content is 95% the same. How can it be?
| tome wrote:
| Yeah it's a bit confusing. See here for details:
| https://news.ycombinator.com/item?id=39431921
| jprd wrote:
| This is _super_ impressive. The rate of iteration and innovation
| in this space means that just as I'm feeling
| jaded/bored/oversaturated, some new project makes my jaw drop
| again.
| joaquincabezas wrote:
| Are there also experiments around image embedding generation to
| use in combination with the LLM?
|
| Maybe for this use case it would be better to execute the vision
| tower on a GPU and leave the LPU for the language part?
| LoganDark wrote:
| Please when can I buy some of these for home use? Please.
___________________________________________________________________
(page generated 2024-02-19 23:00 UTC)