[HN Gopher] KAIST develops next-generation ultra-low power LLM a...
       ___________________________________________________________________
        
       KAIST develops next-generation ultra-low power LLM accelerator
        
       Author : readline_prompt
       Score  : 56 points
       Date   : 2024-03-06 06:21 UTC (1 day ago)
        
 (HTM) web link (en.yna.co.kr)
 (TXT) w3m dump (en.yna.co.kr)
        
       | moffkalast wrote:
       | > The 4.5-mm-square chip, developed using Korean tech giant
       | Samsung Electronics Co.'s 28 nanometer process, has 625 times
       | less power consumption compared with global AI chip giant
       | Nvidia's A-100 GPU, which requires 250 watts of power to process
       | LLMs, the ministry explained.
       | 
       | >processes GPT-2 with an ultra-low power consumption of 400
       | milliwatts and a high speed of 0.4 seconds
       | 
       | Not sure what the point of comparing the two is; an A100 will
       | get you a lot more speed than 2.5 tokens/sec. GPT-2 is just a
       | 1.5B param model, and a Pi 4 would get you more tokens per
       | second with just CPU inference.
       | 
       | Still, I'm sure there are improvements to be made and the
       | direction is fantastic to see, especially after Coral TPUs have
       | proven completely useless for LLM and Whisper acceleration.
       | Hopefully it ends up as something vaguely affordable.
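
       A quick back-of-the-envelope check of the figures above, as a
       Python sketch. The power ratio is taken straight from the article;
       the tokens/sec number assumes the article's "0.4 seconds" is
       per-token latency, which the article does not spell out:

           a100_power_w = 250.0    # A100 power draw cited in the article
           kaist_power_w = 0.4     # 400 mW for the KAIST chip
           print(a100_power_w / kaist_power_w)  # -> 625.0, the "625 times" claim

           latency_s_per_token = 0.4            # assumption: 0.4 s per token
           print(1.0 / latency_s_per_token)     # -> 2.5 tokens/sec, as in the
                                                #    comment above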
        
         | dloss wrote:
         | Which of the model requirements of Coral TPUs [1] are the most
         | problematic for LLMs?
         | 
         | [1] https://coral.ai/docs/edgetpu/models-intro/#model-
         | requiremen...
        
           | semisight wrote:
           | Guessing as to what the GP meant--Coral TPUs max out around
           | 8M parameters, IIRC. That's a few orders of magnitude
           | smaller than even the smallest LLMs.
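
           To put that gap in numbers (a sketch taking the 8M figure
           above at face value, and using the 1.5B-parameter GPT-2 from
           the article as the reference model):

               gpt2_params = 1.5e9   # GPT-2 XL, the model in the article
               coral_limit = 8e6     # rough Edge TPU ceiling cited above
               print(gpt2_params / coral_limit)  # -> ~188x, a bit over two
                                                 #    orders of magnitude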
        
       | dartos wrote:
       | > New structure mimics the layout of neurons and synapses
       | 
       | What does that mean, practically?
       | 
       | How can you mimic that layout in silicon?
        
         | p1esk wrote:
         | This means they use Spiking Neural Networks. It's a software
         | algorithm that most likely doesn't work as well as regular NNs.
        
           | colinator wrote:
           | Well, our brains are closer to spiking neural networks than
           | to 'regular' neural networks. And they work pretty well. For
           | the most part.
           | 
           | I feel like SNNs are like Brazil - they are the future, and
           | shall remain so. I think more basic research is needed for
           | them to mature. AFAIK the current SOTA is to train them with
           | 'surrogate gradients', which shoe-horn them into the current
           | NN training paradigm, and that sort of discards some of their
           | worth. Have biologically-inspired learning rules, like STDP,
           | _really_ been exhausted?
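
           For readers unfamiliar with the trick mentioned above, here is
           a minimal PyTorch sketch of a surrogate gradient. The hard
           threshold and the fast-sigmoid surrogate are common choices,
           not anything specified in the thread:

               import torch

               class SurrogateSpike(torch.autograd.Function):
                   # Forward: hard threshold (spike / no spike). Its true
                   # gradient is zero almost everywhere, so it cannot be
                   # trained with plain backprop.
                   @staticmethod
                   def forward(ctx, v):
                       ctx.save_for_backward(v)
                       return (v > 0).float()

                   # Backward: pretend the spike was smooth and use the
                   # derivative of a fast sigmoid as the surrogate.
                   @staticmethod
                   def backward(ctx, grad_output):
                       (v,) = ctx.saved_tensors
                       return grad_output / (1.0 + 10.0 * v.abs()) ** 2

               spikes = SurrogateSpike.apply(torch.randn(8, requires_grad=True))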
        
             | ilaksh wrote:
             | But this group claims to have demonstrated a way to use
             | SNNs to run LLMs effectively and with vastly less energy
             | usage.
        
               | p1esk wrote:
               | If OpenAI or DeepMind made such a claim I'd pay
               | attention. Otherwise it's always some (usually hw) guys
               | trying to get a grant, or even just to publish a paper.
               | 
               | p.s. People interested in biologically inspired data
               | processing algorithms should look at Numenta's papers
               | (earlier ones, because recently they switched to regular
               | deep learning), and especially learn their justification
               | for _not_ using spikes.
        
       | geuis wrote:
       | Want to reference Groq.com. They are developing their own
       | inference hardware called an LPU:
       | https://wow.groq.com/lpu-inference-engine/
       | 
       | They also released their API a week or two ago. It's
       | _significantly_ faster than anything from OpenAI right now.
       | Mixtral 8x7B operates at around 500 tokens per second.
       | https://groq.com/
        
         | moffkalast wrote:
         | It's not so much an accelerator as an attempt to address the
         | main inference bottleneck (i.e. memory latency) with sheer
         | brute force, by throwing money at the problem. They've made
         | accelerators that are essentially all on-chip SRAM, a whopping
         | 230 MB per card. They cited something like 500 cards to load
         | one single Mixtral instance, which probably cost over $10M to
         | build. It's essentially a supercomputer.
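
         A rough capacity check of the "~500 cards" figure (a sketch; the
         Mixtral parameter count and FP16 weights are assumptions, not
         numbers from the thread):

             sram_per_card_gb = 0.230   # 230 MB of on-chip SRAM per card
             mixtral_params_b = 46.7    # Mixtral 8x7B total params, billions
             bytes_per_param = 2        # assuming FP16 weights

             weights_gb = mixtral_params_b * bytes_per_param
             print(weights_gb / sram_per_card_gb)  # -> ~406 cards for the
                                                   #    weights alone, before
                                                   #    KV cache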
        
           | jiggawatts wrote:
           | Or to put it another way: they've made a compute substrate
           | with the correct ratios of processing power to memory
           | capacity.
           | 
           | NVIDIA GPUs were optimised for different workloads, such as
           | 3D rendering, that have different optimal ratios.
           | 
           | This "supercomputer" isn't brute force or wasteful because it
           | allows more requests per second. By having each response get
           | processed faster it can pipeline more of them through per
           | unit time and unit silicon area.
        
             | wmf wrote:
             | The correct ratio for one workload (production inference).
        
           | LoganDark wrote:
           | They need 568 LPUs to load both Mixtral 8x7B _and_ LLaMA 70B,
           | because they need both those models available for the demo.
           | 
           | I imagine Mixtral by itself would only take something like
           | 200-300 LPUs.
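
           A similar sketch for the figures in this subthread (the
           precision is an assumption; KV cache and activations are
           ignored):

               sram_per_card_gb = 0.230
               print(568 * sram_per_card_gb)     # -> ~130.6 GB across 568 cards

               # The 200-300 guess for Mixtral alone works out if weights
               # are stored at roughly 1 byte per parameter (e.g. 8-bit):
               print(46.7 / sram_per_card_gb)    # -> ~203 cards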
        
       ___________________________________________________________________
       (page generated 2024-03-07 23:00 UTC)