[HN Gopher] LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
___________________________________________________________________
LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
Author : timbilt
Score : 180 points
Date : 2024-11-08 09:58 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| K0balt wrote:
| So, in layman's terms, LoRA appears to "traumatize" the model to
| some degree, connecting the vector space with strong "jumpers"
| (intruder dimensions) to change its behavior, instead of subtly
| conforming the entire model into a shape that accommodates the
| new data.
|
| These jumpers or shortcuts do create connections between the
| relevant new concepts in the model, but by directly connecting
| them instead of associating them through the existing network of
| concepts, nuance is lost and the bypassed areas become
| deemphasized, leading to forgetting of previously held
| associations.
|
| Because of this, fine-tuning generally produces better results
| than LoRA, especially when forgetting existing training is
| detrimental.
|
| Or, to further oversimplify the issue in SE terms, LoRA ==
| monkeypatching. (Is this a kind of intruder dimension?)
| Mockapapella wrote:
| Thank you for this layman explanation
| ismailmaj wrote:
| How does it compare to partially fine-tuning the model by
| freezing most of the network except the last few layers?
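|
| (For concreteness, a minimal PyTorch sketch of that kind of
| partial fine-tune; the parameter-name prefixes are made up and
| would depend on the actual model:)
|
|     import torch.nn as nn
|
|     def freeze_all_but_last(model: nn.Module,
|                             trainable=("layers.30.", "layers.31.", "lm_head")):
|         # Freeze everything, then re-enable gradients only for
|         # parameters whose names start with the chosen prefixes.
|         for name, p in model.named_parameters():
|             p.requires_grad = name.startswith(trainable)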
| six_four_eight wrote:
| I wonder how this compares to 'catastrophic forgetting', which
| can be a problem with full fine-tuning. Or at least that's what
| I've just been reading as a case _for_ using LoRA, as it's not
| susceptible to that. I guess this paper shows LoRA causes
| forgetting in a different way.
|
| Are there good general principles yet for what fine tuning
| method to use in certain situations? It still seems quite
| difficult to know ahead of time what's going to happen.
| pwillia7 wrote:
| This tracks with my experience making and using Stable Diffusion
| LoRAs and fine-tunes. Still, given how fast they are to train
| and use, LoRAs have worked for me in most use cases, and it
| hasn't been worth fine-tuning the entire model.
| K0balt wrote:
| Yeah, it reflects the "feel" I get from LoRA as well,
| especially if I overdo it. The new data becomes the preferred
| output even for unrelated inputs. I always felt like it was
| bludgeoning the model to some extent vs. fine-tuning.
|
| Also, LoRA-tuning an extensively tuned model occasionally
| provokes full-on delusional "insanity" or gibberish seizures.
|
| I have had really good luck, though, using a highly tuned model
| as the training basis for a LoRA and then applying that LoRA
| mask to the base version of that model. I'm not sure why that
| seems to work better than the same LoRA training directly on
| the base model.
| cheald wrote:
| I've done a lot of tinkering with the internals of LoRA
| training, specifically investigating why fine-tuning and LoRA
| training produce such different results, and I'm no academic,
| but I have found that there are definitely some issues with the
| SOTA, at least WRT Stable Diffusion.
|
| I've had significant success with alternate init mechanisms
| (the standard technique of init'ing B to zeros really does
| hurt gradient flow), training alpha as a separate parameter
| (and _especially_ if you bootstrap the process with alphas
| learned from a previous run), and altering the per-layer
| learning rates (because scaling both factors by the learning
| rate, (lr * B) @ (lr * A) = lr^2 * (B @ A), produces an update
| of a fundamentally different magnitude than the fine-tune-style
| update lr * (B @ A) applied directly to W).
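|
| (Rough sketch of those first two tweaks in PyTorch, assuming a
| plain linear layer; this is illustrative, not the actual
| kohya-ss/sd-scripts code:)
|
|     import torch
|     import torch.nn as nn
|
|     class TweakedLoRA(nn.Module):
|         def __init__(self, d_out, d_in, rank=16, alpha0=16.0):
|             super().__init__()
|             # Non-zero init for B so both factors get useful
|             # gradients from step one (note: unlike B = 0, this
|             # perturbs the model slightly at step zero).
|             self.A = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
|             self.B = nn.Parameter(torch.randn(d_out, rank) * 1e-3)
|             # Alpha as a trainable scalar instead of a fixed
|             # hyperparameter.
|             self.alpha = nn.Parameter(torch.tensor(alpha0))
|             self.rank = rank
|
|         def delta_w(self):
|             # Low-rank delta to add onto the frozen base weight.
|             return (self.alpha / self.rank) * (self.B @ self.A)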
|
| In the context of Stable Diffusion specifically, as well,
| there's some really pathological stuff that happens when
| training text encoders alongside the unet; for SD-1.5, the
| norm of "good" embeddings settles right around 28.0, but the
| model learns that it can reduce loss by pushing the
| embeddings away from that value. However, this comes at the
| cost of de-generalizing your outputs! Adding a second loss
| term which penalizes the network for drifting away from the
| L1 norm of the untrained embeddings for a given text
| substantially reduces the "insanity" tendencies. There's a
| more complete writeup at https://github.com/kohya-ss/sd-
| scripts/discussions/294#discu...
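|
| (A hypothetical sketch of that second loss term; the names and
| weighting here are made up, and the full writeup is in the
| linked discussion:)
|
|     import torch
|     import torch.nn.functional as F
|
|     def norm_drift_penalty(trained_emb, frozen_emb, weight=0.01):
|         # Penalize the trained text encoder's embeddings for
|         # drifting away from the L1 norm of the frozen (untrained)
|         # encoder's embeddings for the same text.
|         return weight * F.l1_loss(trained_emb.abs().sum(dim=-1),
|                                   frozen_emb.abs().sum(dim=-1))
|
| added on top of the usual diffusion loss.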
|
| You also have the fact that the current SOTA training tools
| just straight up don't train some layers that fine-tunes do.
|
| I do think there's a huge amount of ground to be gained in
| diffusion LoRA training, but most of the existing techniques
| work well enough that people settle for "good enough".
| doctorpangloss wrote:
| Most people are using LoRAs as a solution for IP transfer.
|
| Thing is, Ideogram v2 has already achieved IP transfer
| without fine-tuning or adapters. So we know those aren't
| needed.
|
| Is Ideogram v2 an exotic architecture? No, I don't think
| so.
|
| Are there exotic architectures that will solve IP transfer
| and other tasks? The Chameleon and OmniGen architectures.
| Lots of expertise went into SD3 and Flux dataset prep, but:
| the multimodal architectures are so much more flexible and
| expressive.
|
| Flow matching models are maybe the last we will see before
| multi-modal goes big.
|
| What to make of things in the community? How is it possible
| that random hyperparameters and 30-minute fine-tunes produce
| good results?
|
| (1) Dreambooth effect: if it's like, a dog, you won't
| notice the flaws.
|
| (2) File drawer problem. Nobody publishes the 99 things
| that didn't work.
|
| (3) SD before v3 struggled with IP transfer on image content
| that could not possibly have been in its datasets. But
| laypeople are not doing that. They don't have access to art
| content that Stability and BFL also don't have access to.
|
| (4) Faces: of course SD family saw celebrity images. Faces
| are over-represented in its datasets. So yeah, it's going
| to be good at IP transfer of photographic faces. Most are
| in-sample.
| sorenjan wrote:
| > We randomly initialize A such that it has singular values of 1,
| freeze it, and only train B. When we do this, we see a sharp
| reduction in high ranking intruder dimensions in comparison to
| those in normal LoRA
|
| This sounds interesting, but I can't see that they do much with
| this result. Are they saving it for a follow-up paper? I would
| think that if their whole paper is about a big problem with
| LoRAs and they then find what looks like an easy solution to
| that problem, that would warrant more than a paragraph just
| before the conclusion.
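|
| (Roughly, the variant they describe would look something like
| this in PyTorch; the class and argument names are made up:)
|
|     import torch
|     import torch.nn as nn
|
|     class LoRAFrozenA(nn.Module):
|         def __init__(self, d_out, d_in, rank=16, alpha=16.0):
|             super().__init__()
|             A = torch.empty(rank, d_in)
|             nn.init.orthogonal_(A)  # orthonormal rows => singular values of 1
|             self.A = nn.Parameter(A, requires_grad=False)    # frozen
|             self.B = nn.Parameter(torch.zeros(d_out, rank))  # trained
|             self.scale = alpha / rank
|
|         def forward(self, x, base_out):
|             # base_out is the frozen pretrained layer's output for x.
|             return base_out + self.scale * (x @ self.A.t() @ self.B.t())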
|
| It would also have been interesting if they had included the
| DoRA method; they reference it briefly, and that paper claims
| DoRA resembles fine-tuning learning behavior.
|
| But perhaps this paper is focused on LoRA behavior, and a
| separate paper comparing various improvements is better.
| liuliu wrote:
| Yeah, honestly not too surprising. Happy someone made the
| experiments though.
|
| _I think_ we know that NNs with limited data tend to overfit,
| so to train a LoRA you need stronger regularization mechanisms,
| including:
|
| * Fixing A as a projection matrix so it doesn't rotate to an
| "easier" orientation for B to learn.
|
| * Periodically merging AB into W_tuned to simulate full-model
| fine-tuning behavior (a sketch of this is below).
|
| I think fundamentally, LoRA is sound because the gradient
| matrix is low-rank by nature.
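|
| (Illustrative sketch of the second point, assuming a plain
| nn.Linear with a LoRA pair (A, B) attached; the names are made
| up:)
|
|     import torch
|     import torch.nn as nn
|
|     @torch.no_grad()
|     def merge_and_reset(base: nn.Linear, A, B, scale: float):
|         # Fold the current low-rank update into the base weight...
|         base.weight += scale * (B @ A)
|         # ...then restart the adapter so subsequent steps learn a
|         # fresh correction on top of the merged weights.
|         nn.init.kaiming_uniform_(A, a=5 ** 0.5)
|         B.zero_()
|
| Called every N optimizer steps, this approximates letting the
| full weight matrix drift instead of confining all updates to
| one fixed low-rank subspace.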
| Eisenstein wrote:
| Is this just restating what has been known, that LoRAs skew
| heavily towards the new training and are not 'more intelligent',
| just 'more targeted', and become less intelligent the more they
| are targeted? Or is this proposing something else? I am having a
| difficult time understanding exactly what 'intruder dimensions'
| are.
| viktour19 wrote:
| > LoRA and full fine-tuning, with equal performance on the fine-
| tuning task, can have solutions with very different
| generalization behaviors outside the fine-tuning task
| distribution.
|
| The ability of neural nets to generalize is inherently tied to
| their trainable parameter count via mechanisms we don't
| understand, but we know parameter count is the key. When you
| fine-tune with LoRA, you're updating maybe 5% of the parameters,
| so I really don't think there is an illusion of equivalence in
| the field.
| wrs wrote:
| Well, I think it depends who you talk to. I suspect quite a few
| practitioners (as opposed to researchers) regard LoRA as a
| valid shortcut without full consideration of the difference.
| abhgh wrote:
| More magnitude than count [1] I think, but I haven't kept up in
| a while.
|
| [1]
| https://proceedings.neurips.cc/paper_files/paper/1996/file/f...
| kelseyfrog wrote:
| > When you fine-tune with LoRA, you're updating maybe 5% of the
| parameters
|
| I'm not sure I understand this comment. The LoRA paper[1]
| specifically says that all of the pretrained weights remain
| frozen.
|
| > keeping the pre-trained weights frozen
|
| Specifically, the LoRA paper differentiates itself from
| updating some parameters by stating
|
| > Many sought to mitigate this by adapting only some parameters
| or learning external modules for new tasks.
|
| 1. https://arxiv.org/pdf/2106.09685
| viktour19 wrote:
| The effective parameters of the model are the parameters of
| the original model + the LoRA parameters, i.e. LoRA updates
| only the LoRA parameters, and full fine-tuning updates only the
| original model parameters.
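|
| (Minimal sketch of that arrangement: the pretrained W0 stays
| frozen and only the low-rank pair (A, B) gets gradients; names
| are illustrative, and the scaling follows the paper's alpha/r:)
|
|     import torch
|     import torch.nn as nn
|
|     class LoRALinear(nn.Module):
|         def __init__(self, base: nn.Linear, r=8, alpha=16):
|             super().__init__()
|             self.base = base
|             for p in self.base.parameters():
|                 p.requires_grad = False   # pretrained W0 frozen
|             self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
|             self.B = nn.Parameter(torch.zeros(base.out_features, r))
|             self.scale = alpha / r
|
|         def forward(self, x):
|             # h = W0 x + (alpha/r) * B A x
|             return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())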
| Der_Einzige wrote:
| This paper seems dubious, because it flies in the face of what
| the reft/pyreft paper is showing (you can use 0.0001% of the
| parameters trained for 100 epochs to personalize on a small
| dataset):
|
| https://github.com/stanfordnlp/pyreft
|
| https://arxiv.org/abs/2404.03592
|
| Note that the OP paper is not peer reviewed yet, and while the
| one I linked isn't either, it has Christopher Manning (yes, the
| one you know from youtube), the head of AI at Stanford, as a co-
| author.
|
| In general, I think that LoRA and especially ReFT should be
| more resistant to catastrophic forgetting, since they literally
| don't touch most of the model.
|
| The Stable Diffusion community has literally tens of thousands
| of LoRAs that don't cripple a model at small rank.
| chompychop wrote:
| I don't see how the authorship by Christopher Manning shifts
| favour towards the other paper; this paper has Antonio Torralba
| as a co-author, who's also one of the big shots in AI.
| deskr wrote:
| What an unfortunate choice of name. LoRa is already a big
| project.
| pclmulqdq wrote:
| Welcome to ML/AI project naming.
| zwaps wrote:
| Different field entirely
| greenavocado wrote:
| Just watch: Pretty soon there will be an LLM optimization
| called Windows
| DidYaWipe wrote:
| Yep. People don't bother to even check anymore.
|
| https://www.youtube.com/watch?v=YQ7aLHCTeeE
|
| And Amazon named its voice assistant after a well-known camera.
| And... and...
| danielhanchen wrote:
| TLDR: 1. Use alpha = 2*rank
|
| 2. Don't use ranks that are too small (avoid rank = 1 to 8)
|
| 3. Sensational title. Better title "LoRA works if done right"
|
| 4. Didn't test SVD init
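|
| (Point 1 in config form, e.g. with Hugging Face PEFT;
| illustrative values, and target_modules depend on the model:)
|
|     from peft import LoraConfig
|
|     config = LoraConfig(
|         r=16,
|         lora_alpha=32,   # alpha = 2 * rank, per point 1
|         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
|         lora_dropout=0.0,
|     )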
___________________________________________________________________
(page generated 2024-11-08 23:00 UTC)