https://vram.asmirnov.xyz/

VRAM Estimator

Estimate GPU VRAM usage of transformer-based models.

Running Parameters

* Mode: Inference or Training
* Precision: mixed or full (fp32)
* Optimizer: Adam or SGD, with an optional momentum term
* Sequence Length: 512
* Batch Size: 4
* Number of GPUs: 1

Model Parameters

Model parameters can be taken from config.json on HuggingFace, or read directly from a loaded model via model.config (see the first sketch at the end of this page).

* Parameters Preset (optional; fills in the fields below)
* Number of Parameters (billions): 1.418
* Number of Layers: 24
* Vocab Size: 51200
* Hidden Size: 2048
* Number of Attention Heads: 32
* Intermediate Size: 8192 (the expanded dimensionality inside the MLP block; usually 4 x hidden size)
* Number of Key-Value Heads: 32 (may differ from the number of attention heads when using Grouped Query Attention)

Estimation Result

The breakdown below corresponds to training with mixed precision and SGD with momentum, which is why the optimizer stores first moments only; the arithmetic is reproduced in the second sketch at the end of this page.

* Total VRAM usage is 27836 MiB.
* CUDA kernels use 1000 MiB of VRAM. When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM.
* Parameters use 8114 MiB of VRAM: number of parameters (1.418 billion) x number of bytes per parameter (6; with mixed precision, parameters are stored in both full precision and half precision).
* Activations use 7104 MiB of VRAM: the sum of the sizes of all intermediate tensors during the forward pass, across all 24 layers. Activation size has a quadratic dependence on sequence length.
* Gradients use 5409 MiB of VRAM: a gradient is stored for each parameter in full precision, so it is number of parameters (1.418 billion) x number of bytes per parameter (4).
* First moments use 5409 MiB of VRAM: the optimizer stores a moving average of the gradients for each parameter in full precision, so it is number of parameters (1.418 billion) x number of bytes per parameter (4).
* The output tensor uses 800 MiB of VRAM: Batch Size (4) x Sequence Length (512) x Vocab Size (51200) x number of bytes per value (4) x 2 (the probabilities after softmax are stored as well and are the same size as the output logits).

---------------------------------------------------------------------

While estimates might not be completely precise, they reflect my current understanding of the topic. For an in-depth explanation of the logic behind these numbers, feel free to check out my detailed post and the calculation code in the source repo. If you feel something is wrong, please reach out via email at alex@asmirnov.xyz or create an issue/PR in the repo. Cheers!
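
As a minimal sketch of the "read the fields from model.config" route mentioned under Model Parameters: the checkpoint name here is only an example, and the attribute names follow Llama/Mistral/Phi-style configs; other architectures may use different field names (GPT-2, for instance, uses n_embd and n_layer).

```python
# Sketch: pulling the estimator's "Model Parameters" fields from a
# HuggingFace config. Checkpoint name and attribute names are assumptions
# that hold for Llama/Mistral/Phi-style configs.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("microsoft/phi-1_5")  # example checkpoint
# The same object is available as model.config on a loaded model.

model_parameters = {
    "num_layers": cfg.num_hidden_layers,
    "vocab_size": cfg.vocab_size,
    "hidden_size": cfg.hidden_size,
    "num_attention_heads": cfg.num_attention_heads,
    "intermediate_size": cfg.intermediate_size,
    # Fall back to the attention-head count when the model does not use
    # Grouped Query Attention and the field is absent.
    "num_key_value_heads": getattr(
        cfg, "num_key_value_heads", cfg.num_attention_heads
    ),
}
print(model_parameters)
```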
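
And a sketch of the arithmetic behind the Estimation Result above, under the assumptions stated there (training, mixed precision, SGD with momentum, single GPU). The 1000 MiB CUDA-kernel figure and the 7104 MiB activations figure are carried over from the tool's output rather than recomputed, since the page gives no closed-form activation formula; the function and variable names are mine, not the tool's.

```python
MIB = 2**20  # bytes per MiB


def estimate_training_vram_mib(
    n_params: float = 1.418e9,       # number of parameters
    batch_size: int = 4,
    seq_len: int = 512,
    vocab_size: int = 51200,
    activations_mib: float = 7104,   # taken from the tool; no formula on the page
    cuda_kernels_mib: float = 1000,  # one-time CUDA allocation, 300 MiB-2 GiB in practice
) -> dict:
    """Reproduce the page's breakdown for mixed precision + SGD with momentum."""
    parts = {
        "cuda_kernels": cuda_kernels_mib,
        # fp32 master weights (4 bytes) + fp16 working copy (2 bytes) = 6 bytes/param
        "parameters": n_params * 6 / MIB,
        "activations": activations_mib,
        # one fp32 gradient per parameter
        "gradients": n_params * 4 / MIB,
        # one fp32 momentum buffer per parameter (SGD with momentum)
        "first_moments": n_params * 4 / MIB,
        # fp32 logits x 2: softmax probabilities are stored alongside them
        "output": batch_size * seq_len * vocab_size * 4 * 2 / MIB,
    }
    parts["total"] = sum(parts.values())
    return parts


for name, mib in estimate_training_vram_mib().items():
    print(f"{name}: {mib:.0f} MiB")
```

Running this prints roughly 8114 MiB for parameters, 5409 MiB each for gradients and first moments, 800 MiB for the output tensor, and a total of about 27836 MiB, matching the page's numbers.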