https://medium.com/inspiredbrilliance/exploring-lora-part-1-the-idea-behind-parameter-efficient-fine-tuning-and-lora-ec469d176c26

Exploring LoRA -- Part 1: The Idea Behind Parameter Efficient Fine-Tuning and LoRA

"The What and Why of Adapter based Parameter Efficient Fine Tuning: Understanding Its Purpose and Significance"

3pi * Published in inspiringbrilliance * 9 min read * Feb 13, 2024

Table of Contents

* What is the necessity of fine tuning?
* What is the conventional way of fine-tuning?
* How can fine-tuning be made efficient?
* How does fine-tuning with fewer parameters work?
* What are adapters and how are they used for fine-tuning?
* What is LoRA?
* The Idea Behind Low-Rank Adaptation
* Conclusion

What is the necessity of fine tuning?

Pre-trained large language models undergo extensive training on vast amounts of data from the internet, resulting in exceptional performance across a broad spectrum of tasks. Nonetheless, in most real-world scenarios, the model needs to possess expertise in a particular, specialized domain. Numerous applications in natural language processing and computer vision rely on adapting a single large-scale, pre-trained model to multiple downstream applications. This adaptation is typically achieved through a technique called fine-tuning, which customizes the model to a specific domain and task. Fine-tuning is therefore vital for achieving the highest levels of performance and efficiency on downstream tasks: the pre-trained model serves as a robust foundation, and the fine-tuning process tailors it to the targeted task.

What is the conventional way of fine-tuning?

During conventional fine-tuning of deep neural networks, modifications are applied to the top layers of the network while the lower layers remain fixed, or "frozen". This is necessary because the label spaces and loss functions of downstream tasks often differ from those of the original pre-trained model. In many cases, both the new top layers and the original weights are co-trained. A notable drawback of this approach is that the resulting model retains the same number of parameters as the original, which can be quite substantial, and creating a separate full model for each downstream task is not efficient when working with contemporary large language models (LLMs).

LLMs have reached such immense sizes that fine-tuning even a single layer, let alone the entire model, requires substantial computational resources and can be prohibitively expensive. Take Llama 3.1-8B, for example: it contains 32 transformer layers (excluding the embedding and normalization layers), and each layer has about 218 million parameters across its attention projections and MLP. Traditional fine-tuning, even if limited to the final layer, thus becomes a costly endeavor. At the end of this blog, we'll see how this situation can be improved.

Figure: Conventional fine-tuning. Black-bordered squares represent layers of a neural network. The last couple of layers (green) are modified during fine-tuning while the rest of the layers (red) are kept frozen.
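To make the cost concrete, here is a minimal PyTorch sketch (not from the article). The first part reproduces the per-layer parameter arithmetic above, assuming the published Llama 3.1-8B configuration values (hidden size 4096, 8 KV heads of dimension 128, MLP size 14336); the second part shows the conventional recipe of freezing a backbone and training only a new task head, using a small stand-in model and illustrative names (`backbone`, `head`) rather than an actual LLM.

```python
import torch
from torch import nn

# --- Per-layer parameter count, assuming Llama 3.1-8B config values ---
hidden, kv_dim, mlp = 4096, 1024, 14336        # kv_dim = 8 KV heads x 128-dim heads (GQA)
attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj + o_proj, k_proj + v_proj
ffn = 3 * hidden * mlp                                   # gate_proj, up_proj, down_proj
print(f"parameters per layer: {(attention + ffn) / 1e6:.0f}M")   # ~218M

# --- Conventional fine-tuning on a small stand-in model ---
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(),
                         nn.Linear(768, 768), nn.ReLU())
head = nn.Linear(768, 2)                       # new top layer for the downstream task

for p in backbone.parameters():                # frozen: no gradients, no optimizer state
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x, y = torch.randn(8, 768), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()                                # gradients are computed only for `head`
optimizer.step()

total = sum(p.numel() for p in list(backbone.parameters()) + list(head.parameters()))
tuned = sum(p.numel() for p in head.parameters())
print(f"training {tuned:,} of {total:,} parameters")
```

Even in this toy setup, only the head's parameters receive gradients and optimizer state; the same mechanism is what makes freezing attractive, but for an LLM even "just the last layer" is still hundreds of millions of parameters.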
How can fine-tuning be made efficient?

Many researchers have sought to mitigate this cost by learning only a small set of extra parameters for each new task. This way, we only need to store and load a small number of task-specific parameters alongside the shared pre-trained model, greatly boosting operational efficiency at deployment. This is one of the Parameter-Efficient Fine-Tuning (PEFT) methods, and it focuses on fine-tuning only small external modules called adapters. This approach significantly reduces computational and storage costs. The advantages of parameter-efficient fine-tuning can be articulated in two aspects:

* Regarding disk storage: Fine-tuning only an adapter module with a limited number of additional parameters means that only the adapter needs to be stored for each downstream task, which significantly reduces the required storage space. It prompts the question of whether maintaining an entire additional copy of LLaMA for each distinct sub-task is truly necessary.

* Concerning RAM: While the exact memory footprint depends on factors like batch size and the various buffers needed during training, the general requirement is roughly four times the size of the model, accounting for gradients, optimizer states, and other per-parameter elements. Fine-tuning a smaller set of parameters removes the need to maintain optimizer states for the majority of parameters. During the forward pass, the frozen layers' weights are used to compute the loss without storing local gradients, saving the memory that would otherwise hold gradients for those layers. During the backward pass, the frozen layers' weights are not updated, which also saves computation and RAM, as no calculations are needed to update them.

Figure: Adapter-based fine-tuning. A few extra layers of parameters are injected into the original base model, and only they are trained while the base model layers remain frozen.

How does fine-tuning with fewer parameters work?

Li et al. [2] and Aghajanyan et al. [3] showed that learned over-parametrized models in fact reside on a low intrinsic dimension [8]. This line of work raises several intriguing questions: What is the intrinsic dimension (ID) of a model? What are over-parameterized models? What is the dimension of a model? What is the dimension of an objective function? If the ID is so low, why do we have such large networks in the first place? How is the ID related to fine-tuning, and how can we find the ID of a model? These questions will be explored in detail in this and the accompanying article. To pave the way for our discussion on LoRA, here is a quick summary:

1. While deep networks may have a large number of parameters, say 'n', only a small subset of these, say 'd' where d << n