[HN Gopher] LoRA from scratch: implementation for LLM finetuning
       ___________________________________________________________________
        
       LoRA from scratch: implementation for LLM finetuning
        
       Author : rasbt
       Score  : 198 points
       Date   : 2024-01-22 16:56 UTC (6 hours ago)
        
 (HTM) web link (lightning.ai)
 (TXT) w3m dump (lightning.ai)
        
       | ignoramous wrote:
        | I've been keeping track of the techniques through Maxime
        | Labonne's LLMs 101:
        | https://github.com/mlabonne/llm-course#4-supervised-fine-tun...
        
         | pama wrote:
         | Thanks for the resource. It seems useful enough to warrant its
         | own thread here.
        
       | dymk wrote:
       | Not to be confused with LoRa ("long range"), a radio
       | communication protocol. At first I thought this could be about
       | using LLMs to find optimal protocol parameters, but alas.
        
         | cpfohl wrote:
         | I had the exact same confusion
        
         | OJFord wrote:
         | It's the first thing that comes to my mind too, but this is
         | mentioned in every thread (and there are far more of them for
         | LoRA than LoRa atm), and in this case there's unlikely to be
         | much confusion because it starts by spelling out the acronym:
         | 'LoRA, which stands for Low Rank Adaptation, [...]'.
        
         | rasbt wrote:
         | Hah, yeah that's LoRA as in Low-Rank Adaptation :P
        
         | thelastparadise wrote:
         | This caught me off-guard as well.
         | 
          | I really wish they had used another acronym.
        
         | the__alchemist wrote:
          | Concur; or at least don't use a mix of lower- and upper-case,
          | like the radio protocol does. I think there would be fewer
          | mistaken assumptions if they had called it "LORA", "Lora",
          | "lora", etc. "LoRA" is asking for trouble.
        
       | andy99 wrote:
       | "From scratch" seems to be a matter of opinion. "Pure pytorch"
       | maybe, except it uses HF transformers. So it's LoRA on top of
       | common frameworks...
        
         | rasbt wrote:
          | Yeah, the LoRA part is from scratch. The LLM backbone here is
          | not; it's there to provide a concrete example. But you could
          | apply the exact same from-scratch LoRA code to a pure PyTorch
          | model if you wanted to:
         | 
          | E.g.:
          | 
          |     import torch.nn as nn
          | 
          |     class MultilayerPerceptron(nn.Module):
          |         def __init__(self, num_features, num_hidden_1,
          |                      num_hidden_2, num_classes):
          |             super().__init__()
          |             self.layers = nn.Sequential(
          |                 nn.Linear(num_features, num_hidden_1),
          |                 nn.ReLU(),
          |                 nn.Linear(num_hidden_1, num_hidden_2),
          |                 nn.ReLU(),
          |                 nn.Linear(num_hidden_2, num_classes),
          |             )
          | 
          |         def forward(self, x):
          |             x = self.layers(x)
          |             return x
          | 
          |     model = MultilayerPerceptron(
          |         num_features=num_features,
          |         num_hidden_1=num_hidden_1,
          |         num_hidden_2=num_hidden_2,
          |         num_classes=num_classes,
          |     )
          | 
          |     # Swap each Linear layer for a LoRA-wrapped version
          |     model.layers[0] = LinearWithLoRA(model.layers[0], rank=4, alpha=1)
          |     model.layers[2] = LinearWithLoRA(model.layers[2], rank=4, alpha=1)
          |     model.layers[4] = LinearWithLoRA(model.layers[4], rank=4, alpha=1)
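          | 
          | For completeness, a minimal sketch of what the LinearWithLoRA
          | wrapper can look like, along the lines the article describes
          | (a frozen Linear plus a low-rank A/B update scaled by alpha;
          | exact names here are illustrative):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     class LoRALayer(nn.Module):
          |         def __init__(self, in_dim, out_dim, rank, alpha):
          |             super().__init__()
          |             std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
          |             # A starts random, B starts at zero, so the
          |             # LoRA update is initially a no-op
          |             self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
          |             self.B = nn.Parameter(torch.zeros(rank, out_dim))
          |             self.alpha = alpha
          | 
          |         def forward(self, x):
          |             return self.alpha * (x @ self.A @ self.B)
          | 
          |     class LinearWithLoRA(nn.Module):
          |         def __init__(self, linear, rank, alpha):
          |             super().__init__()
          |             self.linear = linear  # pretrained layer, kept frozen
          |             self.lora = LoRALayer(
          |                 linear.in_features, linear.out_features,
          |                 rank, alpha,
          |             )
          | 
          |         def forward(self, x):
          |             # original output plus the low-rank update
          |             return self.linear(x) + self.lora(x)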
        
         | 2024throwaway wrote:
         | This apple pie recipe claims to be from scratch, but they
         | cooked it in an off the shelf oven. So it's from scratch on top
         | of the universe...
        
       | huqedato wrote:
       | Excellent and practical example! I'm curious if there's a
       | comparable one using Julia or JavaScript.
        
       | ijhuygft776 wrote:
       | I wish the wireless LoRa protocol would be open source...
        
       | broabprobe wrote:
       | wow definitely thought this was about LoRa at first.
        
       | gourabmi wrote:
       | Someone somewhere is already working on naming their project
       | Lehsun.. /s
        
       | rsweeney21 wrote:
       | It's still strange to me to work in a field of computer science
       | where we say things like "we're not exactly sure how these
       | numbers (hyper parameters) affect the result, so just try a bunch
       | of different values and see which one works best."
        
         | manojlds wrote:
         | Divine benevolence
        
         | r3trohack3r wrote:
         | I feel like it's the difference between something that has been
         | engineered and something that has been discovered.
         | 
         | I feel like most of our industry up until now has been
         | engineered.
         | 
         | LLMs were discovered.
        
           | arketyp wrote:
           | I understand your distinction, I think, but I would say it is
           | more engineering than ever. It's like the early days of the
           | steam engine or firearms development. It's not a hard
           | science, not formal analysis, it's engineering: tinkering,
           | testing, experimenting, iterating.
        
             | peddling-brink wrote:
             | > tinkering, testing, experimenting, iterating
             | 
             | But that describes science. http://imgur.com/1h3K2TT/
        
             | amelius wrote:
             | AI requires a lot of engineering. However, the engineering
             | is not what makes working in AI interesting. It's the
             | plumbing, basically.
        
           | justanotheratom wrote:
           | and finally, this justifies the "science" in Computer
           | Science.
        
           | SkyMarshal wrote:
           | If the Black Swan model of science is true, then most of the
           | consequential innovations and advances are discovered rather
           | than engineered.
        
         | jejeyyy77 wrote:
         | it's a new paradigm
        
         | UberFly wrote:
         | This is what researching different Stable Diffusion settings is
         | like. You quickly learn that there's a lot of guessing going
         | on.
        
         | CamperBob2 wrote:
          | This can be laid at the feet of Minsky and others who dismissed
          | perceptrons because single-layer ones couldn't model nonlinear
          | functions.
         | LLMs were never going to happen until modern CPUs and GPUs came
         | along, but that doesn't mean we couldn't have a better
         | theoretical foundation in place. We are years behind where we
         | should be.
         | 
         | When I worked in the games industry in the 1990s, it was
         | "common knowledge" that neural nets were a dead end at best and
         | a con job at worst. Really a shame to lose so much time because
         | a few senior authority figures warned everyone off. We need to
         | make sure that doesn't happen this time.
        
           | spidersenses wrote:
           | What is the point you're trying to make?
        
             | CamperBob2 wrote:
              | _What is the point you're trying to make?_
             | 
             | Answering the GP's point regarding why deep learning
             | textbooks, articles, and blog posts are full of sentences
             | that begin with "We think..." and "We're not sure, but..."
             | and "It appears that..."
             | 
             | What's yours?
        
         | TacticalCoder wrote:
         | > "we're not exactly sure how these numbers (hyper parameters)
         | affect the result, so just try a bunch of different values and
         | see which one works best."
         | 
          | Isn't it the same for anything that uses a Monte Carlo
          | simulation to find a value? At times you'll end up on a local
          | maximum (instead of the best/correct answer), but it works.
          | 
          | We cannot solve something using a closed-form formula, so we
          | just do a billion (or however many) random samplings and find
          | what we're after.
         | 
         | I'm not saying it's the same for LLMs but "trying a bunch of
         | different values and see which one works best" is something we
         | do a lot.
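          | 
          | To make that concrete, a toy random-search sketch over LoRA
          | hyperparameters (the train_and_eval function and the search
          | ranges here are hypothetical):
          | 
          |     import random
          | 
          |     def random_search(train_and_eval, num_trials=20):
          |         # Sample hyperparameters at random; keep the best
          |         best_score, best_cfg = float("-inf"), None
          |         for _ in range(num_trials):
          |             cfg = {
          |                 "lr": 10 ** random.uniform(-5, -2),
          |                 "rank": random.choice([4, 8, 16, 32]),
          |                 "alpha": random.choice([1, 8, 16]),
          |             }
          |             score = train_and_eval(**cfg)
          |             if score > best_score:
          |                 best_score, best_cfg = score, cfg
          |         return best_cfg, best_score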
        
         | SkyMarshal wrote:
         | That bottom-up tinkering is kinda how CS started in the US, as
         | observed by Dijkstra himself:
         | https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...
         | 
         | Ideally we want theoretical foundations, but sometimes random
         | explorations are necessary to tease out enough data to
         | construct or validate theory.
        
         | stormfather wrote:
         | It's how God programs
        
         | amelius wrote:
         | AI is more like gardening than engineering. You try things
         | without knowing the outcome. And you wait a very long time to
         | see the outcome.
        
         | thatguysaguy wrote:
         | I haven't seen this key/buzzword mentioned yet, so I think part
         | of it is the fact that we're now working on complex systems.
         | This was already true (a social network is a complex system),
         | but now we have the impenetrability of a complex system within
         | the scope of a single process. It's hard to figure out
         | generalizable principles about this kind of thing!
        
         | fierro wrote:
         | we have no theories of intelligence. We're like people in the
         | 1500s trying to figure out why and how people get sick, with no
         | concept of bacteria, germs, transmission, etc
        
       | chenxi9649 wrote:
        | It's still not entirely clear to me when we should fine-tune
        | versus use RAG.
        | 
        | In the past, I believed that fine-tuning was mostly for changing
        | model behavior, but recently it seems that certain companies are
        | also using fine-tuning for knowledge addition.
        | 
        | What are the main use cases for fine-tuning?
        
         | rasbt wrote:
          | I think the main use case remains behavior changes: instruction
          | finetuning, finetuning for classification, etc. Knowledge
          | addition to the weights is best done via pretraining. Or, if
          | you have an external database or documentation that you want to
          | query during generation, use RAG, as you mention.
         | 
         | PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge
         | (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA
         | (quantized LoRA).
        
         | ignoramous wrote:
          | From what I gather, fine-tuning is unreasonably effective [0],
          | whereas in-context learning really depends on how powerful the
          | underlying model is _and_ on how you do RAG (process queries,
          | retrieve embeddings, rank outcomes, etc. [1]). Per a paper I
          | read, fine-tuning _may_ add new domain knowledge (though, as
          | another commenter pointed out, knowledge is better learned
          | from data at the pre-training stage) or boost specific
          | knowledge, while RAG is limited to _boosting_ only;
          | nevertheless, both techniques turn out to be similarly capable,
          | with different trade-offs [2].
         | 
         | --
         | 
          | [0] _Fast.ai: Can Models learn from one sample_,
          | https://www.fast.ai/posts/2023-09-04-learning-jumps/ /
          | https://archive.is/eJMPR
          | 
          | [1] _LlamaIndex: Advanced RAG_,
          | https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-fo...
          | / https://archive.is/qtBXX
          | 
          | [2] _Microsoft: RAG vs Fine-tuning: Pipelines, Tradeoffs, and a
          | Case Study_, https://arxiv.org/html/2401.08406v2#S6 /
          | https://archive.is/UQ8Sa#S6
        
         | CuriouslyC wrote:
         | Fine tuning is better than RAG when the additional data isn't
         | concise, or requires context. This is because too much context
         | (or "unfocused" context) can dilute prompt following behavior,
         | and RAG doesn't help the model with higher order token
         | associations so you have to get lucky and pull what you need
         | from the augmentation material, at which point it's not much
         | better than a fancy search engine. Of course this is mostly an
         | issue when you're dealing with a specialized corpus with its
         | own micro-dialect that isn't well represented in public data
         | sets, such as with government/big corporation internal
         | documents.
        
       | jamesblonde wrote:
        | I prefer the not-from-scratch, configuration-driven approach of
        | Axolotl. Axolotl supports fine-tuning Mistral, Llama-2, etc.,
        | with lots of the latest techniques: sample packing, flash
        | attention, xformers.
        | 
        | I concentrate on collecting and curating the fine-tuning data
        | and doing "data-centric" fine-tuning - not learning LoRA from
        | scratch.
        
         | wfalcon wrote:
         | this is also what our (Lightning AI) lit-gpt library does.
         | https://github.com/Lightning-AI/lit-gpt
        
       | denysvitali wrote:
       | LoRA != LoRa. I keep on getting confused and hate that they chose
       | to reuse an existing acronym
        
         | sschueller wrote:
         | It's unfortunate that those two so far unrelated technologies
         | have the same acronym.
        
         | daemonologist wrote:
         | Likewise. My day job is machine learning and I still, or maybe
         | consequently, do a double-take every time I see the acronym
         | with minimal context (like on the HN front page, where either
         | usage would be normal).
        
         | sbrother wrote:
         | Wait, what is the meaning other than "Low-Rank Adaptation"?
         | It's hard to google the difference.
        
           | boolemancer wrote:
           | I assume the radio technology:
           | 
           | https://en.wikipedia.org/wiki/LoRa
        
           | cristoperb wrote:
           | It's the name of a "Lo"ng "Ra"nge wifi-like technology:
           | 
           | https://en.wikipedia.org/wiki/LoRa
        
       | facu17y wrote:
       | What's the performance penalty of LoRA?
        
         | rasbt wrote:
         | During training, it's more efficient than full finetuning
         | because you only update a fraction of the parameters via
         | backprop. During inference, it can ...
         | 
         | 1) ... be theoretically a tad slower if you add the LoRA values
         | dynamically during the forward pass (however, this is also an
         | advantage if you want to keep a separate small weight set per
         | customer, for example; you run only one large base model and
         | can apply the different LoRA weights per customer on the fly)
         | 
         | 2) ... have the exact same performance as the base model if you
         | merge the LoRA weights back with the base model.
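          | 
          | For case 2, a minimal sketch of the merge, assuming the
          | LinearWithLoRA/LoRALayer sketch from earlier in the thread
          | (nn.Linear stores its weight as (out_features, in_features),
          | hence the transpose):
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def merge_lora(layer):
          |         # Fold alpha * (A @ B) into the frozen base weight
          |         delta = layer.lora.alpha * (layer.lora.A @ layer.lora.B)
          |         layer.linear.weight += delta.T
          |         return layer.linear  # plain nn.Linear, zero overhead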
        
       | yandrypozo wrote:
        | Gotta say, naming is hard. I thought this was about LoRa (from
        | "long range") or LoRaWAN, the IoT sensor communication protocol.
        
       | somethingsome wrote:
        | Nice article. I'm not in this field, but my understanding of the
        | original paper was that LoRA was applied only to the last dense
        | layer, not to all layers independently (maybe I misread it
        | originally).
        | 
        | Digging a bit into why the implementation in the link is like
        | this, I found that QLoRA used this approach and it seems to have
        | some interesting effects; maybe adding a note on the QLoRA
        | decision would be nice :)
        | 
        | I'm not sure I understand why it works though. My neophyte view
        | was that applying LoRA to the last layer made sense, but I can't
        | wrap my mind around the rationale of applying it repeatedly to
        | each linear layer. Can someone explain their intuition?
        
         | icyfox wrote:
         | Like most things in ML, the answer of which layers to use come
         | down to empirical evidence more than theory. In a typical Lora
         | training pipeline, you freeze the contents of the base model
         | and just adjust the Lora layers. The more layers you convert to
         | lora layers the more degrees of freedom you have for the
         | optimization.
         | 
         | There are some finetuning regimens that only recommend
         | finetuning the last layer since this is theorized to have the
         | "highest order" representation of the inputs. Other training
         | regimens will finetune all layers. It's largely data and
         | problem dependent. Lora just mirrors this convention.
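          | 
          | A minimal sketch of that freezing step, assuming the LoRA
          | modules register parameters with "lora" in their names (as in
          | the LinearWithLoRA wrapper sketched above):
          | 
          |     # Freeze everything except the LoRA parameters
          |     for name, param in model.named_parameters():
          |         param.requires_grad = "lora" in name
          | 
          |     num_trainable = sum(p.numel() for p in model.parameters()
          |                         if p.requires_grad)
          |     print(f"trainable parameters: {num_trainable}")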
        
       ___________________________________________________________________
       (page generated 2024-01-22 23:00 UTC)