[HN Gopher] LoRA from scratch: implementation for LLM finetuning
___________________________________________________________________
LoRA from scratch: implementation for LLM finetuning
Author : rasbt
Score : 198 points
Date : 2024-01-22 16:56 UTC (6 hours ago)
(HTM) web link (lightning.ai)
(TXT) w3m dump (lightning.ai)
| ignoramous wrote:
| I've been keeping track of the techniques through Maxime
| Labonne's LLMs 101: https://github.com/mlabonne/llm-
| course#4-supervised-fine-tun...
| pama wrote:
| Thanks for the resource. It seems useful enough to warrant its
| own thread here.
| dymk wrote:
| Not to be confused with LoRa ("long range"), a radio
| communication protocol. At first I thought this could be about
| using LLMs to find optimal protocol parameters, but alas.
| cpfohl wrote:
| I had the exact same confusion
| OJFord wrote:
| It's the first thing that comes to my mind too, but this is
| mentioned in every thread (and there are far more of them for
| LoRA than LoRa atm), and in this case there's unlikely to be
| much confusion because it starts by spelling out the acronym:
| 'LoRA, which stands for Low Rank Adaptation, [...]'.
| rasbt wrote:
| Hah, yeah that's LoRA as in Low-Rank Adaptation :P
| thelastparadise wrote:
| This caught me off-guard as well.
|
| I really wish they could have used another acronym.
| the__alchemist wrote:
| Concur; or at least don't use a mix of lower and upper case like
| the radio protocol does. I think there would be fewer mistaken
| assumptions if they had called it "LORA", "Lora", "lora", etc.
| "LoRA" is asking for trouble.
| andy99 wrote:
| "From scratch" seems to be a matter of opinion. "Pure pytorch"
| maybe, except it uses HF transformers. So it's LoRA on top of
| common frameworks...
| rasbt wrote:
| Yeah, the LoRA part is from scratch. The LLM backbone in this
| example is not; it's there to provide a concrete example. But you
| could apply the exact same LoRA-from-scratch code to a pure
| PyTorch model if you wanted to:
|
| E.g., with LinearWithLoRA as defined in the article:
|
|     import torch.nn as nn
|
|     class MultilayerPerceptron(nn.Module):
|         def __init__(self, num_features, num_hidden_1,
|                      num_hidden_2, num_classes):
|             super().__init__()
|             self.layers = nn.Sequential(
|                 nn.Linear(num_features, num_hidden_1),
|                 nn.ReLU(),
|                 nn.Linear(num_hidden_1, num_hidden_2),
|                 nn.ReLU(),
|                 nn.Linear(num_hidden_2, num_classes),
|             )
|
|         def forward(self, x):
|             x = self.layers(x)
|             return x
|
|     model = MultilayerPerceptron(
|         num_features=num_features,
|         num_hidden_1=num_hidden_1,
|         num_hidden_2=num_hidden_2,
|         num_classes=num_classes,
|     )
|
|     # Swap each pretrained Linear for a LoRA-wrapped version
|     model.layers[0] = LinearWithLoRA(model.layers[0], rank=4, alpha=1)
|     model.layers[2] = LinearWithLoRA(model.layers[2], rank=4, alpha=1)
|     model.layers[4] = LinearWithLoRA(model.layers[4], rank=4, alpha=1)
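|
| For reference, a minimal sketch of the wrapper itself, along the
| lines of the article's low-rank A/B formulation (simplified; see
| the article for the full version):
|
|     import torch
|     import torch.nn as nn
|
|     class LoRALayer(nn.Module):
|         # Trainable low-rank update: alpha * (x @ A @ B)
|         def __init__(self, in_dim, out_dim, rank, alpha):
|             super().__init__()
|             std = 1 / rank ** 0.5
|             self.A = nn.Parameter(torch.randn(in_dim, rank) * std)
|             # B starts at zero, so the wrapped model is initially
|             # unchanged
|             self.B = nn.Parameter(torch.zeros(rank, out_dim))
|             self.alpha = alpha
|
|         def forward(self, x):
|             return self.alpha * (x @ self.A @ self.B)
|
|     class LinearWithLoRA(nn.Module):
|         # Frozen pretrained Linear plus a trainable low-rank delta
|         def __init__(self, linear, rank, alpha):
|             super().__init__()
|             self.linear = linear
|             self.lora = LoRALayer(linear.in_features,
|                                   linear.out_features, rank, alpha)
|
|         def forward(self, x):
|             return self.linear(x) + self.lora(x)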
| 2024throwaway wrote:
| This apple pie recipe claims to be from scratch, but they
| cooked it in an off the shelf oven. So it's from scratch on top
| of the universe...
| huqedato wrote:
| Excellent and practical example! I'm curious if there's a
| comparable one using Julia or JavaScript.
| ijhuygft776 wrote:
| I wish the wireless LoRa protocol were open source...
| broabprobe wrote:
| wow definitely thought this was about LoRa at first.
| gourabmi wrote:
| Someone somewhere is already working on naming their project
| Lehsun.. /s
| rsweeney21 wrote:
| It's still strange to me to work in a field of computer science
| where we say things like "we're not exactly sure how these
| numbers (hyperparameters) affect the result, so just try a bunch
| of different values and see which one works best."
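|
| To make it concrete, a typical LoRA hyperparameter sweep really
| is just a loop over candidate values (a hypothetical sketch;
| finetune_and_evaluate stands in for whatever training/eval
| harness you have):
|
|     import itertools
|
|     # Nobody derives these values; we just try them
|     ranks = [4, 8, 16]
|     alphas = [1, 8, 16]
|
|     best = None
|     for rank, alpha in itertools.product(ranks, alphas):
|         score = finetune_and_evaluate(rank=rank, alpha=alpha)
|         if best is None or score > best[0]:
|             best = (score, rank, alpha)
|     print(f"best score {best[0]:.3f} at rank={best[1]}, "
|           f"alpha={best[2]}")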
| manojlds wrote:
| Divine benevolence
| r3trohack3r wrote:
| I feel like it's the difference between something that has been
| engineered and something that has been discovered.
|
| I feel like most of our industry up until now has been
| engineered.
|
| LLMs were discovered.
| arketyp wrote:
| I understand your distinction, I think, but I would say it is
| more engineering than ever. It's like the early days of the
| steam engine or firearms development. It's not a hard
| science, not formal analysis, it's engineering: tinkering,
| testing, experimenting, iterating.
| peddling-brink wrote:
| > tinkering, testing, experimenting, iterating
|
| But that describes science. http://imgur.com/1h3K2TT/
| amelius wrote:
| AI requires a lot of engineering. However, the engineering
| is not what makes working in AI interesting. It's the
| plumbing, basically.
| justanotheratom wrote:
| and finally, this justifies the "science" in Computer
| Science.
| SkyMarshal wrote:
| If the Black Swan model of science is true, then most of the
| consequential innovations and advances are discovered rather
| than engineered.
| jejeyyy77 wrote:
| it's a new paradigm
| UberFly wrote:
| This is what researching different Stable Diffusion settings is
| like. You quickly learn that there's a lot of guessing going
| on.
| CamperBob2 wrote:
| This can be laid at the feet of Minsky and others who dismissed
| perceptrons because single-layer networks couldn't model
| linearly inseparable functions like XOR.
| LLMs were never going to happen until modern CPUs and GPUs came
| along, but that doesn't mean we couldn't have a better
| theoretical foundation in place. We are years behind where we
| should be.
|
| When I worked in the games industry in the 1990s, it was
| "common knowledge" that neural nets were a dead end at best and
| a con job at worst. Really a shame to lose so much time because
| a few senior authority figures warned everyone off. We need to
| make sure that doesn't happen this time.
| spidersenses wrote:
| What is the point you're trying to make?
| CamperBob2 wrote:
| _What is the point you're trying to make?_
|
| Answering the GP's point regarding why deep learning
| textbooks, articles, and blog posts are full of sentences
| that begin with "We think..." and "We're not sure, but..."
| and "It appears that..."
|
| What's yours?
| TacticalCoder wrote:
| > "we're not exactly sure how these numbers (hyper parameters)
| affect the result, so just try a bunch of different values and
| see which one works best."
|
| Isn't it the same for anything that uses a Monte Carlo
| simulation to find a value? At times you'll end up at a local
| maximum (instead of the best/correct answer), but it works.
|
| We cannot solve something with a closed-form formula, so we just
| do a billion (or whatever) random samplings and find what we're
| after.
|
| I'm not saying it's the same for LLMs, but "trying a bunch of
| different values and seeing which one works best" is something
| we do a lot.
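|
| In code the pattern is as simple as it sounds, and it still
| works (a toy sketch maximizing an opaque objective by random
| sampling; the function is made up for illustration):
|
|     import random
|
|     def objective(x):
|         # Stand-in for any black-box score we can't invert
|         return -(x - 3.2) ** 2 + 10
|
|     best_x, best_val = None, float("-inf")
|     for _ in range(1_000_000):
|         x = random.uniform(-100, 100)
|         val = objective(x)
|         if val > best_val:
|             best_x, best_val = x, val
|
|     # best_x lands near 3.2 without any closed-form solution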
| SkyMarshal wrote:
| That bottom-up tinkering is kinda how CS started in the US, as
| observed by Dijkstra himself:
| https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...
|
| Ideally we want theoretical foundations, but sometimes random
| explorations are necessary to tease out enough data to
| construct or validate theory.
| stormfather wrote:
| It's how God programs
| amelius wrote:
| AI is more like gardening than engineering. You try things
| without knowing the outcome. And you wait a very long time to
| see the outcome.
| thatguysaguy wrote:
| I haven't seen this keyword/buzzword mentioned yet, so I think
| part of it is the fact that we're now working on complex systems.
| This was already true (a social network is a complex system),
| but now we have the impenetrability of a complex system within
| the scope of a single process. It's hard to figure out
| generalizable principles about this kind of thing!
| fierro wrote:
| we have no theories of intelligence. We're like people in the
| 1500s trying to figure out why and how people get sick, with no
| concept of bacteria, germs, transmission, etc
| chenxi9649 wrote:
| It's still not too clear to me when we should fine tune versus
| RAG.
|
| I used to believe that finetuning was mostly for changing model
| behavior, but recently it seems that certain companies are also
| using fine-tuning for knowledge addition.
|
| What are the main use cases for fine tuning?
| rasbt wrote:
| I think the main use case remains behavior changes: instruction
| finetuning, finetuning for classification, etc. Knowledge
| addition to the weights is best done via pretraining. Or, if
| you have an external database or documentation that you want to
| query during the generation, RAG as you mention.
|
| PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge
| (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA
| (quantized LoRA).
| ignoramous wrote:
| From what I gather, fine-tuning is unreasonably effective [0]
| because in-context learning really depends on how powerful the
| underlying model is _and_ just how you do RAG (process queries,
| retrieve embeddings, rank outcomes, etc [1]). Per this paper I
| read, fine-tuning _may_ add new domain knowledge (but as
| another commenter pointed out, knowledge is better learned from
| pre-training data) or boost specific knowledge, while RAG is
| limited to _boosting_ only;
| nevertheless, both techniques turn out to be similarly capable
| with different trade-offs [2].
|
| --
|
| [0] _Fast.ai: Can Models learn from one sample_,
| https://www.fast.ai/posts/2023-09-04-learning-jumps/ /
| https://archive.is/eJMPR
|
| [1] _LlamaIndex: Advanced RAG_,
| https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-fo...
| / https://archive.is/qtBXX
|
| [2] _Microsoft: RAG vs Fine-tuning: Pipelines, Tradeoffs, and a
| Case Study_, https://arxiv.org/html/2401.08406v2#S6 /
| https://archive.is/UQ8Sa#S6
| CuriouslyC wrote:
| Fine tuning is better than RAG when the additional data isn't
| concise, or requires context. This is because too much context
| (or "unfocused" context) can dilute prompt following behavior,
| and RAG doesn't help the model with higher order token
| associations so you have to get lucky and pull what you need
| from the augmentation material, at which point it's not much
| better than a fancy search engine. Of course this is mostly an
| issue when you're dealing with a specialized corpus with its
| own micro-dialect that isn't well represented in public data
| sets, such as with government/big corporation internal
| documents.
| jamesblonde wrote:
| I prefer the not-from-scratch, from-configuration approach of
| Axolotl. Axolotl supports fine-tuning Mistral, Llama 2, etc. with
| lots of the latest techniques: sample packing, FlashAttention,
| xformers.
|
| I concentrate on collecting and curating the fine-tuning data
| and doing "data-centric" fine-tuning, not learning LoRA from
| scratch.
| wfalcon wrote:
| this is also what our (Lightning AI) lit-gpt library does.
| https://github.com/Lightning-AI/lit-gpt
| denysvitali wrote:
| LoRA != LoRa. I keep getting confused and hate that they chose
| to reuse an existing acronym.
| sschueller wrote:
| It's unfortunate that those two so far unrelated technologies
| have the same acronym.
| daemonologist wrote:
| Likewise. My day job is machine learning and I still, or maybe
| consequently, do a double-take every time I see the acronym
| with minimal context (like on the HN front page, where either
| usage would be normal).
| sbrother wrote:
| Wait, what is the meaning other than "Low-Rank Adaptation"?
| It's hard to google the difference.
| boolemancer wrote:
| I assume the radio technology:
|
| https://en.wikipedia.org/wiki/LoRa
| cristoperb wrote:
| It's the name of a "Lo"ng "Ra"nge wifi-like technology:
|
| https://en.wikipedia.org/wiki/LoRa
| facu17y wrote:
| What's the performance penalty of LoRA?
| rasbt wrote:
| During training, it's more efficient than full finetuning
| because you only update a fraction of the parameters via
| backprop. During inference, it can ...
|
| 1) ... be theoretically a tad slower if you add the LoRA values
| dynamically during the forward pass (however, this is also an
| advantage if you want to keep a separate small weight set per
| customer, for example; you run only one large base model and
| can apply the different LoRA weights per customer on the fly)
|
| 2) ... have the exact same performance as the base model if you
| merge the LoRA weights back with the base model.
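|
| A minimal sketch of the merge in (2), assuming the A (in_dim x
| rank) and B (rank x out_dim) convention from the article; after
| this, the layer is a plain nn.Linear again with zero extra
| inference cost:
|
|     import torch
|
|     @torch.no_grad()
|     def merge_lora(layer):
|         # layer is a LinearWithLoRA; nn.Linear stores its weight
|         # as (out_dim, in_dim), so transpose the low-rank delta
|         delta = layer.lora.alpha * (layer.lora.A @ layer.lora.B)
|         layer.linear.weight += delta.T
|         return layer.linear  # drop the wrapper, keep merged Linear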
| yandrypozo wrote:
| Gotta say, naming is hard. I thought this was about LoRa (from
| "long range") or LoRaWAN, the IoT sensor communication protocol.
| somethingsome wrote:
| Nice article. I'm not in this field; however, my understanding of
| the original paper was that LoRA was applied only to the last
| dense layer, not to all of them independently (maybe I misread it
| originally).
|
| Digging a bit into why the implementation in the link is like
| this, I found that QLoRA did the same and it seems to have some
| interesting effects; maybe adding a note on the QLoRA decision
| would be nice :)
|
| I'm not sure I understand why it works, though. My neophyte view
| was that applying LoRA to the last layer made sense, but I can't
| wrap my head around the rationale of applying it repeatedly to
| each linear layer. Can someone explain their intuition?
| icyfox wrote:
| Like most things in ML, the answer to which layers to use comes
| down to empirical evidence more than theory. In a typical LoRA
| training pipeline, you freeze the contents of the base model and
| adjust only the LoRA layers. The more layers you convert to LoRA
| layers, the more degrees of freedom you have for the
| optimization.
|
| Some finetuning regimens recommend finetuning only the last
| layer, since it is theorized to hold the "highest-order"
| representation of the inputs. Other training regimens finetune
| all layers. It's largely data- and problem-dependent. LoRA just
| mirrors this convention; see the sketch below.
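|
| The freeze-then-wrap pattern looks roughly like this (a sketch
| assuming the LinearWithLoRA wrapper from upthread; only the
| fresh LoRA matrices end up trainable):
|
|     import torch.nn as nn
|
|     def add_lora_everywhere(model, rank=8, alpha=16):
|         # Freeze every pretrained weight first
|         for param in model.parameters():
|             param.requires_grad = False
|         # Wrap each nn.Linear; the new A/B matrices created
|         # inside LinearWithLoRA default to requires_grad=True
|         for name, module in model.named_children():
|             if isinstance(module, nn.Linear):
|                 setattr(model, name, LinearWithLoRA(module, rank,
|                                                     alpha))
|             else:
|                 # Recurse into nested submodules
|                 add_lora_everywhere(module, rank, alpha)
|         return model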
___________________________________________________________________
(page generated 2024-01-22 23:00 UTC)