[HN Gopher] Basic math related to computation and memory usage f...
       ___________________________________________________________________
        
       Basic math related to computation and memory usage for transformers
        
       Author : tim_sw
       Score  : 116 points
       Date   : 2023-04-19 18:23 UTC (4 hours ago)
        
 (HTM) web link (blog.eleuther.ai)
 (TXT) w3m dump (blog.eleuther.ai)
        
       | letitgo12345 wrote:
       | Shouldn't the memory needed scale quadratically with sequence
       | length rather than the linear scaling they have in their
       | equations?
        
         | visarga wrote:
          | Not if they use FlashAttention, which keeps the extra memory
          | fixed by working tile by tile. They never materialise the
          | whole attention matrix at once, but the computation time is
          | still quadratic.
        
           | letitgo12345 wrote:
           | They present it as an article about transformers in general,
            | not ones using Flash Attention. Anyway, maybe they're
            | presenting the per-token memory requirement instead of the
            | requirement for the entire sequence at once.
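
        A rough NumPy sketch of the tiling idea raised in this thread; a
        minimal illustration under stated assumptions, not the blog's code
        or the real FlashAttention kernel (which also tiles over keys with
        an online softmax). It only ever holds a (tile, seq_len) block of
        scores instead of the full (seq_len, seq_len) matrix, while the
        total amount of computation stays quadratic:

            import numpy as np

            def softmax_rows(x):
                # numerically stable row-wise softmax
                e = np.exp(x - x.max(axis=-1, keepdims=True))
                return e / e.sum(axis=-1, keepdims=True)

            def naive_attention(q, k, v):
                # materialises the full (n, n) score matrix: quadratic memory
                scores = q @ k.T / np.sqrt(q.shape[-1])
                return softmax_rows(scores) @ v

            def query_tiled_attention(q, k, v, tile=128):
                # holds only a (tile, n) score block at a time;
                # the FLOP count is still O(n^2)
                out = np.empty_like(q)
                for s in range(0, q.shape[0], tile):
                    scores = q[s:s + tile] @ k.T / np.sqrt(q.shape[-1])
                    out[s:s + tile] = softmax_rows(scores) @ v
                return out

            n, d = 2048, 64
            rng = np.random.default_rng(0)
            q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
            assert np.allclose(naive_attention(q, k, v),
                               query_tiled_attention(q, k, v))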
        
       | psyklic wrote:
        | Another related one:
        | https://kipp.ly/blog/transformer-inference-arithmetic/
        
       | monkmartinez wrote:
        | Great article... However, the proliferation of "quantization"
        | (8-bit, 4-bit, 3-bit, 2-bit, etc.) so that normies like myself
        | can run transformer-based models on consumer-grade hardware has
        | changed this math significantly. It has also changed the
        | landscape for text generation at such a pace that it's nearly
        | impossible to keep up.
        | 
        | I don't look at any model the same after running head-to-head
        | comparisons of full precision against 4-bit quantization on my
        | machine. There is little to no perceptible change between models
        | with the same initial weights. BUT!!! Thanks to quantization, I
        | am now able to run models on my home computer that required a
        | DGX a few weeks ago. These models are better in every way from
        | my POV. I am now more interested in what I can "do" with the
        | models vs. just getting them to run. 30B at 4 bits is the sweet
        | spot for my setup.
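
        A back-of-the-envelope check on the 30B-at-4-bit point above; a
        hedged sketch that counts weight storage only, ignoring the KV
        cache, activations, and quantization metadata such as per-group
        scales:

            def weight_memory_gb(n_params, bits_per_param):
                # bytes = params * bits / 8, reported in decimal gigabytes
                return n_params * bits_per_param / 8 / 1e9

            for bits in (16, 8, 4):
                print(f"30B at {bits:>2} bits: "
                      f"{weight_memory_gb(30e9, bits):5.1f} GB")

            # 30B at 16 bits:  60.0 GB  (multi-GPU / DGX-class territory)
            # 30B at  8 bits:  30.0 GB
            # 30B at  4 bits:  15.0 GB  (fits in a 24 GB consumer GPU)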
        
       | tinglymintyfrsh wrote:
        | The title: for a second, I thought people were using eddy currents
       | in the electrical grid to perform computation. Maybe it's Turing
       | complete.
        
       | visarga wrote:
        | Great post! Very detailed explanations. This makes it easier for
        | other teams to get into training large models.
        
       | sroussey wrote:
       | Nice article, though I feel something went amiss with this part:
       | 
        | $$ \begin{align*} \text{Total Memory}_{\text{Training}} =
        | \text{memory}_{\text{model}} + \text{memory}_{\text{optimizer}}
        | + \text{memory}_{\text{activations}}
        | + \text{memory}_{\text{gradients}} \end{align*} $$
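
        A quick numeric sketch of that formula, assuming fp16 weights and
        gradients plus a vanilla Adam optimizer that keeps fp32 master
        weights and two fp32 moment buffers (2 + 2 + 12 bytes per
        parameter); activation memory is left as an input because it
        depends on batch size, sequence length, and checkpointing:

            def training_memory_gb(n_params, activations_gb=0.0):
                model_gb     = n_params * 2  / 1e9  # fp16 weights
                gradients_gb = n_params * 2  / 1e9  # fp16 gradients
                optimizer_gb = n_params * 12 / 1e9  # fp32 copy + Adam m, v
                return (model_gb + optimizer_gb
                        + activations_gb + gradients_gb)

            # e.g. a 7B-parameter model before counting activations:
            print(f"{training_memory_gb(7e9):.0f} GB")  # -> 112 GB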
        
         | [deleted]
        
         | teruakohatu wrote:
          | Do you have JavaScript disabled? That is LaTeX, which should
          | be converted to images (or SVG) dynamically after the page
          | loads.
        
       ___________________________________________________________________
       (page generated 2023-04-19 23:00 UTC)