[HN Gopher] Liger-Kernel: Efficient Triton kernels for LLM training
___________________________________________________________________
Liger-Kernel: Efficient Triton kernels for LLM training
Author : letmehandle
Score : 7 points
Date : 2024-08-23 20:18 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| letmehandle wrote:
| Maximizing GPU efficiency when training large language models
| (LLMs) is challenging because of out-of-memory (OOM) errors
| that arise as batch size and sequence length scale up. To
| address these challenges, LinkedIn has developed Liger-Kernel,
| an open-source library of efficient Triton kernels for LLM
| training. It can increase training throughput by 20% and reduce
| memory usage by 60% with a single line of code.
|
| The custom Triton kernels we developed at LinkedIn integrate
| smoothly with Flash Attention, PyTorch FSDP, and DeepSpeed.
| Patch your Hugging Face model with one line, or compose your
| own model from the provided kernels. The library is lightweight
| and efficient, with minimal dependencies: just Torch and
| Triton.
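|
| As a minimal sketch of the one-line patch (the function name
| apply_liger_kernel_to_llama follows the project's README for
| Llama models; other architectures have their own patch
| functions, and the model checkpoint here is just an example):
|
|     import torch
|     from transformers import AutoModelForCausalLM
|     from liger_kernel.transformers import apply_liger_kernel_to_llama
|
|     # The one-liner: monkey-patch HF Llama modules (RMSNorm,
|     # RoPE, SwiGLU, cross entropy) with Liger's Triton kernels.
|     apply_liger_kernel_to_llama()
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "meta-llama/Meta-Llama-3-8B")  # example checkpoint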
|
| *Implementation details*
|
| We took the spirit of llm.c but used Triton to reimplement
| RMSNorm, RoPE, SwiGLU, CrossEntropy, and
| FusedLinearCrossEntropy from scratch, with both forward and
| backward passes written in pure Triton. The kernels are exact,
| with no approximations.
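|
| To illustrate the style (a simplified sketch, not the library's
| actual code), a Triton forward kernel for RMSNorm can launch
| one program instance per row:
|
|     import torch
|     import triton
|     import triton.language as tl
|
|     @triton.jit
|     def rmsnorm_fwd_kernel(X, W, Y, stride, N, eps,
|                            BLOCK_SIZE: tl.constexpr):
|         row = tl.program_id(0)
|         cols = tl.arange(0, BLOCK_SIZE)
|         mask = cols < N
|         x = tl.load(X + row * stride + cols, mask=mask,
|                     other=0.0).to(tl.float32)
|         # y = x / sqrt(mean(x^2) + eps) * w
|         rms = tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
|         w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
|         tl.store(Y + row * stride + cols, (x / rms) * w, mask=mask)
|
|     def rmsnorm_fwd(x, weight, eps=1e-6):
|         M, N = x.shape
|         y = torch.empty_like(x)
|         BLOCK_SIZE = triton.next_power_of_2(N)
|         rmsnorm_fwd_kernel[(M,)](x, weight, y, x.stride(0), N,
|                                  eps, BLOCK_SIZE=BLOCK_SIZE)
|         return y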
|
| We have adopted kernel fusion, in-place operations, tiling, and
| chunking techniques. For example, because some models have very
| large vocabularies, materializing the full logits tensor can
| take tens of GB; instead, we combine chunking,
| gradient-in-forward, and online softmax to reduce memory usage
| by 5x.
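|
| The chunking idea can be sketched in plain PyTorch (the real
| FusedLinearCrossEntropy kernel additionally fuses the gradient
| into the forward pass and uses an online softmax in Triton):
|
|     import torch
|     import torch.nn.functional as F
|
|     def chunked_linear_cross_entropy(hidden, lm_head_w, labels,
|                                      chunk_size=1024):
|         # hidden: (T, H), lm_head_w: (V, H), labels: (T,)
|         T = hidden.shape[0]
|         loss = hidden.new_zeros(())
|         for s in range(0, T, chunk_size):
|             e = min(s + chunk_size, T)
|             # Only a (chunk, V) slice of the logits is ever
|             # materialized, never the full (T, V) tensor.
|             logits = hidden[s:e] @ lm_head_w.T
|             loss = loss + F.cross_entropy(logits, labels[s:e],
|                                           reduction="sum")
|         return loss / T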
|
| torch.compile now supports custom Triton kernels, so ours
| integrate with it seamlessly. For example, by combining
| torch.compile with FusedLinearCrossEntropy, we have observed
| more than a 50% reduction in memory usage.
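|
| Continuing the patching sketch above, compiling the patched
| model takes one extra call:
|
|     # torch.compile traces through the custom Triton kernels
|     # alongside the ones it generates itself.
|     compiled_model = torch.compile(model)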
|
| *Acknowledgement*
|
| We'd first like to give a shout-out to Andrej Karpathy's llm.c
| for inspiring us to develop llm.triton. FlashAttention, vLLM,
| and Unsloth have been pioneers in custom Triton kernels.
| Special thanks to the Triton team for the revolutionary kernel
| interface, and to Efficient Cross Entropy for the linear cross
| entropy tricks.
|
| We would like to thank Animesh Singh, Haowen Ning, and Yanning
| Chen for their leadership support, and Shao Tang, Qingquan
| Song, Yun Dai, Vignesh Kothapalli, Jason (Siyu) Zhu, Steven
| Shimizu, Shivam Sahni, and Zain Merchant for their technical
| contributions.
|
| *Want to contribute?*
|
| Are you a dedicated researcher looking for a reliable kernel, a
| kernel guru who can help us shape better kernels, or a curious
| novice wanting to learn Triton? Join our community at
| https://discord.gg/CX2YmNmn to hack together.
|
| Stay tuned for our talk at CUDA MODE
| (https://discord.gg/CX2YmNmn?event=1273323969788772455), where we
| will provide an immersive experience in developing Triton
| kernels. We'll share code examples, and together we'll identify
| bottlenecks, derive backward formulas, ensure exactness, and fix
| intricate bugs.
| mrshu wrote:
| Liger-Kernel support has already been merged into axolotl [0],
| along with an example config that makes use of it [1], if
| anyone would like to quickly try it out.
|
| [0] https://github.com/axolotl-ai-cloud/axolotl/pull/1861
|
| [1] https://github.com/axolotl-ai-cloud/axolotl/blob/main/exampl...
___________________________________________________________________
(page generated 2024-08-23 23:01 UTC)