[HN Gopher] Basic math related to computation and memory usage f...
___________________________________________________________________
Basic math related to computation and memory usage for transformers
Author : tim_sw
Score : 116 points
Date : 2023-04-19 18:23 UTC (4 hours ago)
(HTM) web link (blog.eleuther.ai)
(TXT) w3m dump (blog.eleuther.ai)
| letitgo12345 wrote:
| Shouldn't the memory needed scale quadratically with sequence
| length rather than the linear scaling they have in their
| equations?
| visarga wrote:
| Not if they use FlashAttention, which avoids the quadratic
| memory by working tile by tile: the whole attention matrix is
| never materialised at once. The computation time is still
| quadratic, though.
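|
| A rough single-head NumPy sketch of the tile-by-tile idea (an
| online softmax over key/value blocks; not the real fused CUDA
| kernel, just the memory argument):
|
|   import numpy as np
|
|   def naive_attention(q, k, v):
|       # Materialises the full (n, n) score matrix: O(n^2) memory.
|       s = q @ k.T / np.sqrt(q.shape[-1])
|       w = np.exp(s - s.max(axis=-1, keepdims=True))
|       return (w / w.sum(axis=-1, keepdims=True)) @ v
|
|   def tiled_attention(q, k, v, tile=64):
|       # Only an (n, tile) block of scores exists at any moment;
|       # a running max and denominator keep the softmax exact.
|       # Compute is still O(n^2).
|       n, d = q.shape
|       out = np.zeros_like(q)
|       row_max = np.full(n, -np.inf)
|       row_sum = np.zeros(n)
|       for i in range(0, k.shape[0], tile):
|           s = q @ k[i:i + tile].T / np.sqrt(d)
|           new_max = np.maximum(row_max, s.max(axis=-1))
|           scale = np.exp(row_max - new_max)  # rescale old sums
|           p = np.exp(s - new_max[:, None])
|           out = out * scale[:, None] + p @ v[i:i + tile]
|           row_sum = row_sum * scale + p.sum(axis=-1)
|           row_max = new_max
|       return out / row_sum[:, None]
|
|   rng = np.random.default_rng(0)
|   q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
|   assert np.allclose(naive_attention(q, k, v),
|                      tiled_attention(q, k, v))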
| letitgo12345 wrote:
| They present it as an article about transformers in general,
| not ones using FlashAttention. Anyway, maybe they're
| presenting the per-token memory requirement instead of the
| requirement for the entire sequence at once.
| psyklic wrote:
| Another related one: https://kipp.ly/blog/transformer-inference-
| arithmetic/
| monkmartinez wrote:
| Great article... However, the proliferation of "quantization"
| (8-bit, 4-bit, 3, 2, etc.), which lets normies like myself run
| transformer-based models on consumer-grade hardware, has changed
| this math significantly. It has also changed the landscape for
| text generation at such a pace that it's nearly impossible to
| keep up.
|
| I don't look at any model the same after running head-to-head
| comparisons between full precision and 4-bit quantization on my
| machine. There is little to no perceptible change between models
| with the same initial weights. BUT!!! Thanks to quantization, I
| am now able to run models on my home computer that required a
| DGX a few weeks ago. These models are better in every way from
| my POV. I am now more interested in what I can "do" with the
| models vs. just getting them to run. 30B at 4 bits is the sweet
| spot for my setup.
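|
| Back-of-the-envelope, the weights-only footprint is just
| params * bits / 8 (the KV cache and activations are extra):
|
|   def weight_gib(n_params, bits):
|       # Memory for the weights alone, in GiB.
|       return n_params * bits / 8 / 2**30
|
|   for bits in (16, 8, 4):
|       print(f"30B @ {bits}-bit: {weight_gib(30e9, bits):.1f} GiB")
|   # 30B @ 16-bit: 55.9 GiB, 8-bit: 27.9 GiB, 4-bit: 14.0 GiB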
| tinglymintyfrsh wrote:
| The title: for a second, I thought people were using eddy
| currents in the electrical grid to perform computation. Maybe
| it's Turing complete.
| visarga wrote:
| Great post! Very detailed explanations. This makes it easier
| for other teams to get into training large models.
| sroussey wrote:
| Nice article, though I feel something went amiss with this part:
|
| $$ \begin{align*}\text{Total Memory}_{\text{Training}} =
| \text{memory}_{\text{model}} + \text{memory}_{\text{optimizer}}
| + \text{memory}_{\text{activations}} +
| \text{memory}_{\text{gradients}}\end{align*} $$
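|
| For what it's worth, the intended sum works out roughly like
| this (a sketch assuming the usual mixed-precision setup: fp16
| weights and gradients plus fp32 Adam states; activations depend
| on batch and sequence length, so they're left as an input):
|
|   def training_gib(n_params, activations_gib=0.0):
|       # 2 B (fp16 weights) + 2 B (fp16 grads) + 12 B (fp32 Adam
|       # copy, momentum, variance) per parameter, plus activations.
|       return n_params * (2 + 2 + 12) / 2**30 + activations_gib
|
|   print(f"7B model: ~{training_gib(7e9):.0f} GiB + activations")
|   # 7B model: ~104 GiB + activations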
| [deleted]
| teruakohatu wrote:
| Do you have JavaScript disabled? That is LaTeX, which should be
| converted to images (or SVG) dynamically after the page loads.
___________________________________________________________________
(page generated 2023-04-19 23:00 UTC)