https://vram.asmirnov.xyz/

VRAM Estimator

Estimate GPU VRAM usage of transformer-based models.

Running Parameters

* Mode: Inference or Training
* Precision: mixed or full (fp32)
* Optimizer: Adam or SGD, with an optional momentum term
* Sequence Length: 512
* Batch Size: 4
* Number of GPUs: 1

Model Parameters

Model parameters can be taken from config.json on HuggingFace, or read directly from a loaded model via model.config (see the first sketch at the end of this page).

* Parameters Preset (optional; fills in the fields below)
* Number of Parameters (billions): 1.418
* Number of Layers: 24
* Vocab Size: 51200
* Hidden Size: 2048
* Number of Attention Heads: 32
* Intermediate Size: 8192 (the expanded dimensionality inside the MLP block; usually 4 x hidden size)
* Number of Key-Value Heads: 32 (may differ from the number of attention heads when using Grouped Query Attention)

Estimation Result

The breakdown below corresponds to training with mixed precision and SGD with momentum, which is why the optimizer stores first moments only; the arithmetic is reproduced in the second sketch at the end of this page.

* Total VRAM usage is 27836 MiB.
* CUDA kernels use 1000 MiB of VRAM. When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM.
* Parameters use 8114 MiB of VRAM: number of parameters (1.418 billion) x number of bytes per parameter (6; with mixed precision, parameters are stored in both full precision and half precision).
* Activations use 7104 MiB of VRAM: the sum of the sizes of all intermediate tensors during the forward pass, across all 24 layers. Activation size has a quadratic dependence on sequence length.
* Gradients use 5409 MiB of VRAM: a gradient is stored for each parameter in full precision, so it is number of parameters (1.418 billion) x number of bytes per parameter (4).
* First moments use 5409 MiB of VRAM: the optimizer stores a moving average of the gradients for each parameter in full precision, so it is number of parameters (1.418 billion) x number of bytes per parameter (4).
* The output tensor uses 800 MiB of VRAM: Batch Size (4) x Sequence Length (512) x Vocab Size (51200) x number of bytes per value (4) x 2 (the probabilities after softmax are stored as well and are the same size as the output logits).

---------------------------------------------------------------------

While estimates might not be completely precise, they reflect my current understanding of the topic. For an in-depth explanation of the logic behind these numbers, feel free to check out my detailed post and the calculation code in the source repo. If you feel something is wrong, please reach out via email at alex@asmirnov.xyz or create an issue/PR in the repo. Cheers!
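
As a minimal sketch of the "read the fields from model.config" route mentioned under Model Parameters: the checkpoint name here is only an example, and the attribute names follow Llama/Mistral/Phi-style configs; other architectures may use different field names (GPT-2, for instance, uses n_embd and n_layer).

```python
# Sketch: pulling the estimator's "Model Parameters" fields from a
# HuggingFace config. Checkpoint name and attribute names are assumptions
# that hold for Llama/Mistral/Phi-style configs.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("microsoft/phi-1_5")  # example checkpoint
# The same object is available as model.config on a loaded model.

model_parameters = {
    "num_layers": cfg.num_hidden_layers,
    "vocab_size": cfg.vocab_size,
    "hidden_size": cfg.hidden_size,
    "num_attention_heads": cfg.num_attention_heads,
    "intermediate_size": cfg.intermediate_size,
    # Fall back to the attention-head count when the model does not use
    # Grouped Query Attention and the field is absent.
    "num_key_value_heads": getattr(
        cfg, "num_key_value_heads", cfg.num_attention_heads
    ),
}
print(model_parameters)
```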
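
And a sketch of the arithmetic behind the Estimation Result above, under the assumptions stated there (training, mixed precision, SGD with momentum, single GPU). The 1000 MiB CUDA-kernel figure and the 7104 MiB activations figure are carried over from the tool's output rather than recomputed, since the page gives no closed-form activation formula; the function and variable names are mine, not the tool's.

```python
MIB = 2**20  # bytes per MiB


def estimate_training_vram_mib(
    n_params: float = 1.418e9,       # number of parameters
    batch_size: int = 4,
    seq_len: int = 512,
    vocab_size: int = 51200,
    activations_mib: float = 7104,   # taken from the tool; no formula on the page
    cuda_kernels_mib: float = 1000,  # one-time CUDA allocation, 300 MiB-2 GiB in practice
) -> dict:
    """Reproduce the page's breakdown for mixed precision + SGD with momentum."""
    parts = {
        "cuda_kernels": cuda_kernels_mib,
        # fp32 master weights (4 bytes) + fp16 working copy (2 bytes) = 6 bytes/param
        "parameters": n_params * 6 / MIB,
        "activations": activations_mib,
        # one fp32 gradient per parameter
        "gradients": n_params * 4 / MIB,
        # one fp32 momentum buffer per parameter (SGD with momentum)
        "first_moments": n_params * 4 / MIB,
        # fp32 logits x 2: softmax probabilities are stored alongside them
        "output": batch_size * seq_len * vocab_size * 4 * 2 / MIB,
    }
    parts["total"] = sum(parts.values())
    return parts


for name, mib in estimate_training_vram_mib().items():
    print(f"{name}: {mib:.0f} MiB")
```

Running this prints roughly 8114 MiB for parameters, 5409 MiB each for gradients and first moments, 800 MiB for the output tensor, and a total of about 27836 MiB, matching the page's numbers.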