[HN Gopher] Liger-Kernel: Efficient Triton kernels for LLM training
       ___________________________________________________________________
        
       Liger-Kernel: Efficient Triton kernels for LLM training
        
       Author : letmehandle
       Score  : 7 points
       Date   : 2024-08-23 20:18 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | letmehandle wrote:
       | Maximizing GPU efficiency when training large language models
       | (LLMs) is challenging: out-of-memory (OOM) errors make it hard
       | to scale batch size and sequence length. To address this,
       | LinkedIn has developed Liger-Kernel, an open-source library of
       | efficient Triton kernels for LLM training. It can increase
       | training throughput by 20% and reduce memory usage by 60% with
       | a single line of code.
       | 
       | The custom Triton kernels we developed at LinkedIn integrate
       | smoothly with Flash Attention, PyTorch FSDP, and DeepSpeed.
       | Patch your Hugging Face model with one line, or compose your
       | own model from the provided kernels, as sketched below. The
       | kernels are lightweight and efficient, with minimal
       | dependencies: just Torch and Triton.
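       | 
       | Roughly, usage looks like this (a minimal sketch; the entry
       | points shown, such as apply_liger_kernel_to_llama, follow the
       | repo's current API, and the checkpoint name is just an
       | example):
       | 
       |     # Patch the supported Llama modules in place. Call this
       |     # before instantiating the model so the patched classes
       |     # are picked up.
       |     from liger_kernel.transformers import apply_liger_kernel_to_llama
       |     apply_liger_kernel_to_llama()
       | 
       |     from transformers import AutoModelForCausalLM
       |     model = AutoModelForCausalLM.from_pretrained(
       |         "meta-llama/Meta-Llama-3-8B")
       | 
       |     # Or compose your own model from the individual modules:
       |     from liger_kernel.transformers import (
       |         LigerCrossEntropyLoss, LigerRMSNorm)
       |     norm = LigerRMSNorm(hidden_size=4096)
       |     loss_fn = LigerCrossEntropyLoss()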
       | 
       | *Implementation details*
       | 
       | We took the spirit of llm.c but used Triton to reimplement
       | RMSNorm, RoPE, SwiGLU, CrossEntropy, and
       | FusedLinearCrossEntropy from scratch, with both forward and
       | backward passes in pure Triton. The kernels are exact, with no
       | approximations.
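       | 
       | To give a flavor of the pattern, here is a stripped-down
       | RMSNorm forward (an illustrative sketch only, not our
       | production kernel, which also implements the backward pass and
       | caches the row statistics for it):
       | 
       |     import torch
       |     import triton
       |     import triton.language as tl
       | 
       |     @triton.jit
       |     def rmsnorm_fwd(X, W, Y, stride, n_cols, eps,
       |                     BLOCK_SIZE: tl.constexpr):
       |         # One program normalizes one row of the
       |         # (n_rows, n_cols) input; assumes a row fits in a
       |         # single block.
       |         row = tl.program_id(0)
       |         cols = tl.arange(0, BLOCK_SIZE)
       |         mask = cols < n_cols
       |         x = tl.load(X + row * stride + cols, mask=mask,
       |                     other=0.0).to(tl.float32)
       |         # y = x / sqrt(mean(x^2) + eps) * w
       |         rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
       |         w = tl.load(W + cols, mask=mask,
       |                     other=0.0).to(tl.float32)
       |         tl.store(Y + row * stride + cols,
       |                  (x / rms * w).to(Y.dtype.element_ty),
       |                  mask=mask)
       | 
       |     def rmsnorm(x, weight, eps=1e-6):
       |         n_rows, n_cols = x.shape
       |         y = torch.empty_like(x)
       |         BLOCK_SIZE = triton.next_power_of_2(n_cols)
       |         rmsnorm_fwd[(n_rows,)](x, weight, y, x.stride(0),
       |                                n_cols, eps,
       |                                BLOCK_SIZE=BLOCK_SIZE)
       |         return y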
       | 
       | We have adopted kernel fusion, in-place operations, tiling,
       | and chunking techniques. For example, because of the large
       | vocab size of some models, instead of materializing the full
       | logits (tens of GB), we combine chunking, gradient-in-forward,
       | and online softmax to reduce memory usage by 5x.
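       | 
       | The idea in plain PyTorch (an illustrative sketch of chunking
       | plus gradient-in-forward, not the actual Triton kernel, which
       | also fuses the online softmax into the chunked matmul):
       | 
       |     import torch
       |     import torch.nn.functional as F
       | 
       |     def chunked_linear_ce(hidden, lm_head_w, labels,
       |                           chunk=4096):
       |         # Never materialize the full (n_tokens, vocab) logits:
       |         # process a chunk of rows at a time and compute the
       |         # gradient wrt `hidden` inside the forward pass.
       |         n = hidden.shape[0]
       |         total = hidden.new_zeros(())
       |         grad_hidden = torch.empty_like(hidden)
       |         for i in range(0, n, chunk):
       |             h = hidden[i:i + chunk].detach().requires_grad_(True)
       |             logits = h @ lm_head_w.t()   # (chunk, vocab) only
       |             loss = F.cross_entropy(logits, labels[i:i + chunk],
       |                                    reduction="sum")
       |             # "gradient-in-forward"; if lm_head_w requires
       |             # grad, its gradient also accumulates across
       |             # chunks here
       |             loss.backward()
       |             grad_hidden[i:i + chunk] = h.grad
       |             total += loss.detach()  # chunk logits freed here
       |         return total / n, grad_hidden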
       | 
       | torch.compile now supports custom Triton kernels, allowing
       | ours to integrate seamlessly. For example, by combining
       | torch.compile with FusedLinearCrossEntropy, we have observed
       | more than a 50% reduction in memory usage.
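       | 
       | Concretely (continuing the hypothetical setup sketched
       | earlier):
       | 
       |     import torch
       |     from transformers import AutoModelForCausalLM
       |     from liger_kernel.transformers import apply_liger_kernel_to_llama
       | 
       |     apply_liger_kernel_to_llama()  # swap in the kernels first
       |     model = AutoModelForCausalLM.from_pretrained(
       |         "meta-llama/Meta-Llama-3-8B")
       |     # torch.compile traces through the patched model and
       |     # treats the custom Triton kernels as opaque ops.
       |     model = torch.compile(model)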
       | 
       | *Acknowledgement*
       | 
       | We'd first like to give a shout-out to Andrej Karpathy's llm.c
       | for inspiring us to develop llm.triton. FlashAttention, vLLM,
       | and Unsloth have been pioneers in custom Triton kernels.
       | Special thanks to the Triton team for the revolutionary kernel
       | interface, and to Efficient Cross Entropy for the linear cross
       | entropy tricks.
       | 
       | We would like to thank Animesh Singh, Haowen Ning, and Yanning
       | Chen for their leadership support, and Shao Tang, Qingquan
       | Song, Yun Dai, Vignesh Kothapalli, Jason (Siyu) Zhu, Steven
       | Shimizu, Shivam Sahni, and Zain Merchant for their technical
       | contributions.
       | 
       | *Want to contribute?*
       | 
       | Are you a dedicated researcher looking for a reliable kernel, a
       | kernel guru who can help us shape better kernels, or a curious
       | novice wanting to learn Triton? Join our community at
       | https://discord.gg/CX2YmNmn to hack together.
       | 
       | Stay tuned for our talk at CUDA MODE
       | (https://discord.gg/CX2YmNmn?event=1273323969788772455), where we
       | will provide an immersive experience in developing Triton
       | kernels. We'll share code examples, and together we'll identify
       | bottlenecks, derive backward formulas, ensure exactness, and fix
       | intricate bugs.
        
       | mrshu wrote:
       | Liger-Kernel support has already been merged to axolotl [0] along
       | with an example config that makes use of it [1], if anyone would
       | like to quickly try it out.
       | 
       | [0] https://github.com/axolotl-ai-cloud/axolotl/pull/1861
       | 
       | [1] https://github.com/axolotl-ai-
       | cloud/axolotl/blob/main/exampl...
        
       ___________________________________________________________________
       (page generated 2024-08-23 23:01 UTC)