[HN Gopher] KAIST develops next-generation ultra-low power LLM accelerator
___________________________________________________________________
KAIST develops next-generation ultra-low power LLM accelerator
Author : readline_prompt
Score : 56 points
Date : 2024-03-06 06:21 UTC (1 day ago)
(HTM) web link (en.yna.co.kr)
(TXT) w3m dump (en.yna.co.kr)
| moffkalast wrote:
| > The 4.5-mm-square chip, developed using Korean tech giant
| Samsung Electronics Co.'s 28 nanometer process, has 625 times
| less power consumption compared with global AI chip giant
| Nvidia's A-100 GPU, which requires 250 watts of power to process
| LLMs, the ministry explained.
|
| >processes GPT-2 with an ultra-low power consumption of 400
| milliwatts and a high speed of 0.4 seconds
|
| Not sure what the point of comparing the two is; an A100 will get
| you a lot more speed than 2.5 tokens/sec. GPT-2 is just a 1.5B
| param model; a Pi 4 would get you more tokens per second with CPU
| inference alone.
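|
| (For reference, the arithmetic behind those figures, reading the
| quoted "0.4 seconds" as per-token latency:)
|
|     a100_w, chip_w = 250.0, 0.4      # watts (the chip draws 400 mW)
|     print(a100_w / chip_w)           # 625.0, the claimed power ratio
|     print(1 / 0.4)                   # 2.5 tokens/s at 0.4 s per token
|     print(chip_w * 0.4)              # ~0.16 J per token at that rate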
|
| Still, I'm sure there are improvements to be made and the
| direction is fantastic to see, especially after Coral TPUs have
| proven completely useless for LLM and Whisper acceleration.
| Hopefully it ends up as something vaguely affordable.
| dloss wrote:
| Which of the model requirements of Coral TPUs [1] are the most
| problematic for LLMs?
|
| [1] https://coral.ai/docs/edgetpu/models-intro/#model-
| requiremen...
| semisight wrote:
| Guessing at what the GP meant: Coral TPUs max out around 8M
| parameters, IIRC. That's orders of magnitude smaller than even
| the smallest LLMs.
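|
| As a rough check, taking the ~8M limit at face value and the 1.5B
| GPT-2 figure quoted upthread:
|
|     # GPT-2's 1.5B params vs. the ~8M Edge TPU ceiling above
|     print(1.5e9 / 8e6)   # ~188x, i.e. over two orders of magnitude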
| dartos wrote:
| > New structure mimics the layout of neurons and synapses
|
| What does that mean, practically?
|
| How can you mimic that layout in silicon?
| p1esk wrote:
| This means they use spiking neural networks (SNNs), an approach
| that most likely doesn't work as well as regular NNs.
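|
| Roughly: instead of passing continuous activations through every
| layer, each unit integrates its inputs over time and emits a
| binary spike when a threshold is crossed. A minimal leaky
| integrate-and-fire neuron in Python (illustrative only, not what
| the KAIST chip actually implements):
|
|     def lif_neuron(inputs, decay=0.9, threshold=1.0):
|         # inputs: list of input currents, one value per timestep
|         v, spikes = 0.0, []
|         for i in inputs:
|             v = decay * v + i        # leaky integration of input
|             if v >= threshold:       # fire once threshold is crossed
|                 spikes.append(1)
|                 v = 0.0              # reset the membrane potential
|             else:
|                 spikes.append(0)
|         return spikes
|
|     print(lif_neuron([0.3, 0.5, 0.4, 0.9]))   # [0, 0, 1, 0]
|
| The hardware appeal is that work only happens when a spike
| arrives, which is where the claimed power savings come from.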
| colinator wrote:
| Well, our brains are closer to spiking neural networks than
| 'regular' neural networks. And they work pretty well. For the
| most part.
|
| I feel like SNNs are like Brazil - they are the future, and
| shall remain so. I think more basic research is needed for
| them to mature. AFAIK the current SOTA is to train them with
| 'surrogate gradients', which shoehorn them into the current NN
| training paradigm and sort of discard some of their worth. Have
| biologically-inspired learning rules, like STDP, _really_ been
| exhausted?
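|
| (For anyone unfamiliar: STDP nudges a synapse up when the
| presynaptic spike precedes the postsynaptic one, and down
| otherwise, with an exponential window. A toy version, just to
| show the shape of the rule:)
|
|     import math
|
|     def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
|         # weight change for one pre/post spike pair (times in ms)
|         dt = t_post - t_pre
|         if dt > 0:   # pre fired before post: strengthen (LTP)
|             return a_plus * math.exp(-dt / tau)
|         return -a_minus * math.exp(dt / tau)   # otherwise weaken (LTD)
|
|     print(stdp_dw(10.0, 15.0))   # small positive update
|     print(stdp_dw(15.0, 10.0))   # small negative update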
| ilaksh wrote:
| But this group claims to have demonstrated a way to use
| SNNs to run LLMs effectively and with vastly less energy
| usage.
| p1esk wrote:
| If OpenAI or DeepMind made such a claim I'd pay attention.
| Otherwise it's always some (usually hw) guys trying to get a
| grant, or even just to publish a paper.
|
| p.s. People interested in biologically inspired data
| processing algorithms should look at Numenta's papers
| (earlier ones, because recently they switched to regular
| deep learning), and especially learn their justification
| for _not_ using spikes.
| geuis wrote:
| Worth referencing Groq.com here. They are developing their own
| inference hardware called an LPU: https://wow.groq.com/lpu-
| inference-engine/
|
| They also released their API a week or two ago. It's
| _significantly_ faster than anything from OpenAI right now.
| Mixtral 8x7B runs at around 500 tokens per second:
| https://groq.com/
| moffkalast wrote:
| It's not so much an accelerator as it is addressing the main
| inference bottleneck (i.e. memory bandwidth/latency) with sheer
| brute force by throwing money at the problem. They've made
| accelerators out of pure on-chip SRAM, with a whopping 230 MB per
| card. They cited something like 500 cards to load a single
| Mixtral instance, which probably cost over $10M to build. It's
| essentially a supercomputer.
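|
| Back of the envelope (assuming ~46.7B total parameters for
| Mixtral 8x7B and 16-bit weights; the real number depends on
| precision and on space reserved for activations/KV cache):
|
|     params = 46.7e9           # Mixtral 8x7B total (not active) params
|     bytes_per_param = 2       # FP16 weights
|     sram_per_card = 230e6     # 230 MB of SRAM per card, as above
|     print(params * bytes_per_param / sram_per_card)   # ~406 cards
|
| which lands in the same ballpark as the ~500 they cited.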
| jiggawatts wrote:
| Or to put it another way: they've made a compute substrate
| with the correct ratios of processing power to memory
| capacity.
|
| NVIDIA GPUs were optimised for different workloads, such as
| 3D rendering, that have different optimal ratios.
|
| This "supercomputer" isn't brute force or wasteful, because it
| allows more requests per second: by processing each response
| faster, it can pipeline more of them through per unit of time
| and silicon area.
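|
| (The ratio argument in one line: single-stream decode is roughly
| memory-bound, so tokens/s scales with bandwidth over bytes read
| per token - ignoring batching, MoE sparsity and KV-cache traffic.
| The HBM figure below is the A100 80GB spec; the SRAM multiplier
| is a made-up stand-in, not Groq's number:)
|
|     weight_bytes = 46.7e9 * 2       # ~93 GB of FP16 Mixtral weights
|     hbm_bw = 2.0e12                 # ~2 TB/s, an A100 80GB's HBM spec
|     sram_bw = 50 * hbm_bw           # hypothetical SRAM-heavy design
|     print(hbm_bw / weight_bytes)    # ~21 tokens/s per stream
|     print(sram_bw / weight_bytes)   # ~1070 tokens/s per stream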
| wmf wrote:
| The correct ratio for one workload (production inference).
| LoganDark wrote:
| They need 568 LPUs to load both Mixtral 8x7B _and_ LLaMA 70B,
| because they need both those models available for the demo.
|
| I imagine Mixtral by itself would only take something like
| 200-300 LPUs.
___________________________________________________________________
(page generated 2024-03-07 23:00 UTC)