[HN Gopher] LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
       ___________________________________________________________________
        
       LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
        
       Author : timbilt
       Score  : 180 points
       Date   : 2024-11-08 09:58 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | K0balt wrote:
        | So, in layman's terms, LoRA appears to "traumatize" the model
        | to some degree, connecting the vector space with strong
        | "jumpers" (intruder dimensions) to change its behavior, instead
        | of subtly conforming the entire model into a shape that
        | accommodates the new data.
       | 
       | These jumpers or shortcuts do create connections between the
       | relevant new concepts in the model, but by directly connecting
       | them instead of associating them through the existing network of
       | concepts, nuance is lost and the bypassed areas become
       | deemphasized, leading to forgetting of previously held
       | associations.
       | 
        | Because of this, full fine-tuning generally produces better
        | results than LoRA, especially when forgetting of existing
        | training is detrimental.
       | 
        | Or, to further oversimplify the issue in SE terms, LoRA ==
        | monkeypatching. (Is this a kind of intruder dimension?)
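        | 
        | To make the oversimplification concrete, here is a minimal
        | PyTorch-style sketch (mine, not the paper's code) of the
        | structural difference: full fine-tuning moves W itself, while
        | LoRA freezes W and learns a low-rank correction B @ A on the
        | side.
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class LoRALinear(nn.Module):
        |         # Wraps a frozen nn.Linear and adds a trainable
        |         # low-rank path: y = W x + (alpha/r) * B @ A @ x
        |         def __init__(self, base: nn.Linear, r=8, alpha=16):
        |             super().__init__()
        |             self.base = base
        |             for p in self.base.parameters():
        |                 p.requires_grad = False   # W stays frozen
        |             d_out, d_in = base.weight.shape
        |             self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        |             self.B = nn.Parameter(torch.zeros(d_out, r))
        |             self.scale = alpha / r
        | 
        |         def forward(self, x):
        |             delta = (x @ self.A.T) @ self.B.T  # rank <= r
        |             return self.base(x) + self.scale * delta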
        
         | Mockapapella wrote:
         | Thank you for this layman explanation
        
         | ismailmaj wrote:
         | How does it compare to partially fine-tuning the model by
          | freezing most of the network except the last few layers?
        
         | six_four_eight wrote:
         | I wonder how this compares to 'catastrophic forgetting' that
         | can be a problem of full fine tuning. Or at least that's what
          | I've just been reading as a case _for_ using LoRA, as it's not
          | susceptible to that. I guess this paper shows LoRA causes
         | forgetting in a different way.
         | 
         | Are there good general principles yet for what fine tuning
         | method to use in certain situations? It still seems quite
         | difficult to know ahead of time what's going to happen.
        
       | pwillia7 wrote:
        | This tracks with my feelings making and using Stable Diffusion
        | LoRAs and fine-tunes. Still, given the speed to train and use,
        | LoRAs have worked for me in most use cases and it hasn't been
        | worth fine-tuning the entire model.
        
         | K0balt wrote:
          | Yeah, it reflects the "feel" I get from LoRA as well,
         | especially if I overdo it. The new data becomes the preferred
         | output even for unrelated inputs. I always felt like it was
         | bludgeoning the model to some extent vs finetuning.
         | 
          | Also, LoRA-tuning an extensively tuned model occasionally
         | provokes full on delusional "insanity" or gibberish seizures.
         | 
         | I have had really good luck though using a highly tuned model
          | as the training basis for a LoRA and then applying that LoRA
         | mask to the base version of that model. I'm not sure why that
         | seems to work better than the same LoRa training directly on
         | the base model.
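          | 
          | A rough sketch of that workflow (hypothetical tensor names
          | and paths, not any particular toolkit): extract the low-rank
          | delta learned against the tuned model, then add it onto the
          | base checkpoint's weights instead.
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def apply_lora(state_dict, lora_pairs, scale=1.0):
          |         # lora_pairs maps a weight name to its (A, B)
          |         # factors; the merged weight is W + scale * B @ A.
          |         merged = dict(state_dict)
          |         for name, (A, B) in lora_pairs.items():
          |             merged[name] = state_dict[name] + scale * (B @ A)
          |         return merged
          | 
          |     # base_sd = torch.load("base_model.pt")   # hypothetical
          |     # merged_sd = apply_lora(base_sd, lora_pairs)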
        
           | cheald wrote:
           | I've done a lot of tinkering with the internals of LoRA
           | training, specifically investigating why fine-tune and LoRA
           | training result in such different results, and I'm no
           | academic, but I have found that there are definitely some
           | issues with the SOTA at least WRT Stable Diffusion.
           | 
            | I've had significant success with alternate init mechanisms
            | (the standard technique of init'ing B to zeros really does
            | hurt gradient flow), training alpha as a separate parameter
            | (and _especially_ if you bootstrap the process with alphas
            | learned from a previous run), and altering the per-layer
            | learning rates (because (lr * B) @ (lr * A) produces an
            | update of a fundamentally different magnitude than the
            | equivalent fine-tune update of the merged weight,
            | lr * (B @ A)).
           | 
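            | A rough numeric sketch of that magnitude mismatch
            | (illustrative only, not the trainer's actual code): with
            | the usual B = 0 init, one SGD step changes the merged
            | weight by -lr * G @ A.T @ A, whose scale depends on A,
            | not just on lr.
            | 
            |     import torch
            | 
            |     torch.manual_seed(0)
            |     d, r, lr = 64, 8, 1e-2
            |     A = torch.randn(r, d) / d**0.5  # "down" projection
            |     B = torch.zeros(d, r)           # standard zero init
            |     G = torch.randn(d, d)           # stand-in for dL/dW
            | 
            |     # Full fine-tune: W changes by lr * G.
            |     full_step = lr * G
            | 
            |     # LoRA: dL/dB = G @ A.T and dL/dA = B.T @ G, so one
            |     # step changes the product B @ A by:
            |     B2 = B - lr * (G @ A.T)
            |     A2 = A - lr * (B.T @ G)
            |     lora_step = B2 @ A2 - B @ A
            | 
            |     print(full_step.norm(), lora_step.norm())
            | 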
           | In the context of Stable Diffusion specifically, as well,
           | there's some really pathological stuff that happens when
           | training text encoders alongside the unet; for SD-1.5, the
           | norm of "good" embeddings settles right around 28.0, but the
           | model learns that it can reduce loss by pushing the
           | embeddings away from that value. However, this comes at the
           | cost of de-generalizing your outputs! Adding a second loss
           | term which penalizes the network for drifting away from the
           | L1 norm of the untrained embeddings for a given text
           | substantially reduces the "insanity" tendencies. There's a
           | more complete writeup at https://github.com/kohya-ss/sd-
           | scripts/discussions/294#discu...
           | 
           | You also have the fact that the current SOTA training tools
           | just straight up don't train some layers that fine-tunes do.
           | 
           | I do think there's a huge amount of ground to be gained in
           | diffusion LoRA training, but most of the existing techniques
           | work well enough that people settle for "good enough".
        
             | doctorpangloss wrote:
             | Most people are using LoRAs as a solution for IP transfer.
             | 
              | Thing is, Ideogram v2 has already achieved IP transfer
              | without fine-tuning or adapters, so we know those aren't
              | needed.
             | 
             | Is Ideogram v2 an exotic architecture? No, I don't think
             | so.
             | 
             | Are there exotic architectures that will solve IP transfer
             | and other tasks? The Chameleon and OmniGen architectures.
             | Lots of expertise went into SD3 and Flux dataset prep, but:
             | the multimodal architectures are so much more flexible and
             | expressive.
             | 
             | Flow matching models are maybe the last we will see before
             | multi-modal goes big.
             | 
             | What to make of things in the community? How is it possible
             | that random hyperparameters and 30 minute long fine tunings
             | produce good results?
             | 
             | (1) Dreambooth effect: if it's like, a dog, you won't
             | notice the flaws.
             | 
              | (2) File drawer problem: nobody publishes the 99 things
              | that didn't work.
             | 
              | (3) Pre-SD3 models struggled with IP transfer on image
              | content that could not possibly have been in their
              | datasets. But laypeople are not doing that: they don't
              | have access to art content that Stability and BFL also
              | lack access to.
             | 
             | (4) Faces: of course SD family saw celebrity images. Faces
             | are over-represented in its datasets. So yeah, it's going
             | to be good at IP transfer of photographic faces. Most are
             | in-sample.
        
       | sorenjan wrote:
       | > We randomly initialize A such that it has singular values of 1,
       | freeze it, and only train B. When we do this, we see a sharp
       | reduction in high ranking intruder dimensions in comparison to
       | those in normal LoRA
       | 
        | This sounds interesting, but I can't see that they do much with
        | this result. Are they saving it for a follow-up paper? I would
        | think that if their whole paper is about a big problem with
        | LoRAs and they then find what looks like an easy solution to
        | that problem, it would warrant more than a paragraph just
        | before the conclusion.
       | 
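        | A sketch of my reading of that variant (not the authors'
        | code): initialize A semi-orthogonally so its singular values
        | are all 1, freeze it, and leave only B trainable.
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     d_in, d_out, r = 1024, 1024, 8
        |     A = torch.empty(r, d_in)
        |     nn.init.orthogonal_(A)       # singular values are all 1
        |     A.requires_grad_(False)      # frozen, never updated
        |     B = nn.Parameter(torch.zeros(d_out, r))  # only B trains
        |     # Effective weight at inference time: W + B @ A
        | 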
        | It would also have been interesting if they had included the
        | DoRA method; they reference it briefly, and that paper claims
        | to resemble fine-tuning's learning behavior.
       | 
       | But perhaps this paper is focused on LoRA behavior, and a
       | separate paper comparing various improvements is better.
        
         | liuliu wrote:
         | Yeah, honestly not too surprising. Happy someone made the
         | experiments though.
         | 
          |  _I think_ we know that NNs trained on limited data tend to
          | overfit, so to train a LoRA you need stronger regularization
          | mechanisms, including:
          | 
          | * Fixing A as a projection matrix so it doesn't rotate to an
          | "easier" orientation for B to learn.
          | 
          | * Periodically merging AB into W_tuned to simulate full-
          | model finetuning behavior (sketched below).
          | 
          | I think LoRA is fundamentally sound because the gradient
          | matrix is naturally low-rank.
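          | 
          | An illustrative sketch of the periodic-merge idea (my
          | interpretation, hypothetical names): every k steps, fold
          | the learned low-rank delta into the base weight and reset
          | B to zero.
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def merge_and_reset(W, A, B, scale=1.0):
          |         # Fold the adapter into the base weight, then
          |         # restart the low-rank delta from zero.
          |         W += scale * (B @ A)
          |         B.zero_()
          | 
          |     # In the training loop (sketch):
          |     # if step % merge_every == 0:
          |     #     merge_and_reset(layer.base.weight, layer.A,
          |     #                     layer.B)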
        
       | Eisenstein wrote:
        | Is this just confirming what has been known: that LoRAs skew
        | heavily towards the new training, are not 'more intelligent' so
        | much as 'more targeted', and become less intelligent the more
        | they are targeted? Or is this proposing something else? I am
        | having a difficult time understanding exactly what 'intruder
        | dimensions' are.
        
       | viktour19 wrote:
       | > LoRA and full fine-tuning, with equal performance on the fine-
       | tuning task, can have solutions with very different
       | generalization behaviors outside the fine-tuning task
       | distribution.
       | 
        | The ability of neural nets to generalize is inherently tied
        | to their trainable parameter count, via mechanisms we don't
        | fully understand, but we do know parameter count is key. When
        | you fine-tune with LoRA you're updating maybe 5% of the
        | parameters, so I really don't think there is an illusion of
        | equivalence in the field.
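        | 
        | For scale, a back-of-envelope count (illustrative numbers,
        | not from the paper): a rank-r LoRA on a single d x d weight
        | trains 2*d*r parameters versus the d*d a full fine-tune
        | updates in that matrix.
        | 
        |     d, r = 4096, 16
        |     full = d * d          # 16,777,216 updated weights
        |     lora = 2 * d * r      # 131,072 trainable LoRA weights
        |     print(lora / full)    # ~0.008, i.e. under 1% per layer
        | 
        | The whole-model fraction depends on which layers get
        | adapters, but it is typically a few percent at most.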
        
         | wrs wrote:
         | Well, I think it depends who you talk to. I suspect quite a few
         | practitioners (as opposed to researchers) regard LoRA as a
         | valid shortcut without full consideration of the difference.
        
         | abhgh wrote:
         | More magnitude than count [1] I think, but I haven't kept up in
         | a while.
         | 
         | [1]
         | https://proceedings.neurips.cc/paper_files/paper/1996/file/f...
        
         | kelseyfrog wrote:
         | > When you finetune with lora, you're updating maybe 5% of the
         | parameters
         | 
         | I'm not sure I understand this comment. The LoRA paper[1]
         | specifically says that all of the pretrained weights remain
         | frozen.
         | 
         | > keeping the pre-trained weights frozen
         | 
         | Specifically, the LoRA paper differentiates itself from
         | updating some parameters by stating
         | 
         | > Many sought to mitigate this by adapting only some parameters
         | or learning external modules for new tasks.
         | 
         | 1. https://arxiv.org/pdf/2106.09685
        
           | viktour19 wrote:
            | The effective parameters of the model are the parameters
            | of the original model plus the LoRA parameters, i.e. LoRA
            | updates only the LoRA parameters, while full finetuning
            | updates only the original model parameters.
        
       | Der_Einzige wrote:
       | This paper seems dubious, because it flies in the face of what
       | the reft/pyreft paper is showing (you can use 0.0001% of the
       | parameters trained for 100 epochs to personalize on a small
       | dataset):
       | 
       | https://github.com/stanfordnlp/pyreft
       | 
       | https://arxiv.org/abs/2404.03592
       | 
          | Note that the OP paper is not peer-reviewed yet, and while
          | the one I linked isn't either, it has Christopher Manning
          | (yes, the one you know from YouTube), the head of AI at
          | Stanford, as a co-author.
       | 
          | In general, I think that LoRA and especially ReFT should be
          | more resistant to catastrophic forgetting because they
          | literally don't touch most of the model.
          | 
          | The Stable Diffusion community has literally tens of
          | thousands of LoRAs that don't cripple a model at small rank.
        
         | chompychop wrote:
         | I don't see how the authorship by Christopher Manning shifts
         | favour towards the other paper; this paper has Antonio Torralba
         | as a co-author, who's also one of the big shots in AI.
        
       | deskr wrote:
       | What an unfortunate choice of name. LoRa is already a big
       | project.
        
         | pclmulqdq wrote:
         | Welcome to ML/AI project naming.
        
         | zwaps wrote:
         | Different field entirely
        
           | greenavocado wrote:
           | Just watch: Pretty soon there will be an LLM optimization
           | called Windows
        
         | DidYaWipe wrote:
         | Yep. People don't bother to even check anymore.
         | 
         | https://www.youtube.com/watch?v=YQ7aLHCTeeE
         | 
         | And Amazon named its voice assistant after a well-known camera.
         | And... and...
        
       | danielhanchen wrote:
        | TLDR:
        | 
        | 1. Use alpha = 2*rank
        | 
        | 2. Don't use too-small ranks (avoid rank = 1 to 8)
       | 
       | 3. Sensational title. Better title "LoRA works if done right"
       | 
       | 4. Didn't test SVD init
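        | 
        | For points 1 and 2, a hedged example with Hugging Face PEFT
        | (argument names can vary across peft versions; the module
        | names below are LLaMA-style and model-dependent):
        | 
        |     from peft import LoraConfig
        | 
        |     config = LoraConfig(
        |         r=16,            # avoid very small ranks (1-8)
        |         lora_alpha=32,   # alpha = 2 * rank
        |         target_modules=["q_proj", "k_proj",
        |                         "v_proj", "o_proj"],
        |         lora_dropout=0.05,
        |     )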
        
       ___________________________________________________________________
       (page generated 2024-11-08 23:00 UTC)