[HN Gopher] LoRA Learns Less and Forgets Less
       ___________________________________________________________________
        
       LoRA Learns Less and Forgets Less
        
       Author : wolecki
       Score  : 119 points
       Date   : 2024-05-17 13:00 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hybridtupel wrote:
        | This is about Low-rank adaptation. Not to be confused with
        | LoRa, the long-range proprietary radio communication technique,
        | which hopefully doesn't learn at all.
        
         | martinky24 wrote:
         | "Why the hell is LoRa learning" was indeed my first thought...
        
       | gregmac wrote:
       | This is "Low-Rank Adaptation", "a widely-used parameter-efficient
       | finetuning method for large language models."
       | 
       | Not to be confused with LoRa ("long range") [1], an Internet of
       | Things radio technology.
       | 
       | [1] https://en.wikipedia.org/wiki/LoRa
        
         | chaos_emergent wrote:
          | Isn't this fairly obvious after a two-second glance at the
          | abstract?
        
       | SubiculumCode wrote:
       | This is Low-rank adaptation. Not to be confused with Lake of the
       | Ozarks Recreation Area.
        
         | 0cf8612b2e1e wrote:
          | Apparently constructed in 1929. You'd think those wireless
          | people would have been more careful when they reappropriated
          | the name.
        
       | chaos_emergent wrote:
        | The finding is that, with LoRA, the best fine-tuning
        | performance comes from targeting all weights, followed by the
        | MLPs, followed by the attention heads. The authors attribute
        | the performance difference to which module of the network is
        | targeted.
       | 
       | Isn't an equally valid argument that MLPs tend to constitute a
       | greater number of weights in transformer networks than attention
       | heads, and the performance difference can be traced to a greater
       | number of weights having freedom to change? I'd be curious to
       | know if randomly choosing a subset of matrices to train,
       | regardless of where they are in the network, would provide
       | analogous performance to LoRA on a specific module with
       | comparable learnable weights.
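        | 
        | For a rough sense of scale, a back-of-the-envelope sketch with
        | assumed Llama-7B-ish shapes (not numbers from the paper): per
        | transformer block the MLP holds roughly twice the base weights
        | of attention, and a rank-r LoRA on it is also bigger.
        | 
        |     r = 16
        |     d, d_ff = 4096, 11008  # hidden size, MLP intermediate
        | 
        |     # base weights per block: q/k/v/o vs gate/up/down
        |     attn_base = 4 * d * d                          # ~67.1M
        |     mlp_base = 3 * d * d_ff                        # ~135.3M
        | 
        |     # trainable params a rank-r LoRA adds on each target,
        |     # A (d_in x r) plus B (r x d_out) per adapted matrix
        |     lora = lambda d_in, d_out: r * (d_in + d_out)
        |     attn_lora = 4 * lora(d, d)                     # ~0.52M
        |     mlp_lora = 2 * lora(d, d_ff) + lora(d_ff, d)   # ~0.72M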
        
         | chaos_emergent wrote:
          | As a follow-up curiosity: has anyone tried using LoRA across
          | the entire model for pretraining, to compare its performance
          | against regular pretraining?
        
           | cabidaher wrote:
            | This paper [1] does attempt that and reports similar
            | performance compared to conventional pre-training. However,
            | they do start off with normal full-rank training and claim
            | that it is needed to 'warm start' the training process.
           | 
           | [1] https://arxiv.org/abs/2307.05695
        
             | danielhanchen wrote:
              | Oh yes, this paper! The main issue is the scaling of the
              | A and B LoRA matrices. Some papers show that training the
              | B matrix with a larger learning rate (LoRA+) can be
              | beneficial. DoRA, for example, learns a separate scaling
              | (magnitude) vector, which tries to alleviate these
              | issues.
              | 
              | GaLore might be closer to full pretraining, since the
              | gradients themselves are kept low rank.
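              | 
              | A minimal sketch of the LoRA+ idea in PyTorch, assuming
              | `model` is already LoRA-wrapped and using the usual
              | peft parameter naming ("lora_A"/"lora_B"); the 16x
              | ratio is only illustrative:
              | 
              |     import torch
              | 
              |     base_lr = 1e-4
              |     a = [p for n, p in model.named_parameters()
              |          if "lora_A" in n]
              |     b = [p for n, p in model.named_parameters()
              |          if "lora_B" in n]
              |     opt = torch.optim.AdamW([
              |         {"params": a, "lr": base_lr},
              |         # B matrices get the larger step size (LoRA+)
              |         {"params": b, "lr": base_lr * 16},
              |     ])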
        
           | sp332 wrote:
           | Do you mean leaving most of the model in its initial,
           | randomised state and only training a LoRA?
        
             | buildbot wrote:
              | I've tested exactly this (on my personal time) :) It will
              | train, but I found the loss is proportional to the number
              | of trainable parameters. So, roughly, to hit the
              | performance of a standard 70M-param model you need to
              | train ~70M LoRA params anyway.
        
               | cheald wrote:
                | It's worse than that, because LoRA requires two
                | matrices per layer. At full rank, you have an
                | additional N x N parameters to learn versus full
                | finetuning, where N is min(input_features,
                | output_features).
                | 
                | For example, tuning a 128-in x 256-out layer directly
                | is 32k params. Learning a full-rank LoRA for that layer
                | means two matrices, 128x128 and 128x256 = 48k params.
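                | 
                | A quick sketch of where the break-even sits for that
                | 128x256 example (each unit of rank costs 128 + 256 =
                | 384 LoRA params, vs 32,768 for the layer itself):
                | 
                |     d_in, d_out = 128, 256
                |     full = d_in * d_out              # 32,768
                |     for r in (4, 16, 64, 85, 128):
                |         lora = r * (d_in + d_out)    # A + B
                |         print(r, lora, f"{lora / full:.2f}x")
                | 
                | So anywhere past r ~= 85 you're already training more
                | parameters than tuning the layer directly would.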
        
               | buildbot wrote:
                | Yeah, exactly. Though the 48k-param LoRA might be as
                | good as a 48k-param layer of higher rank; I haven't
                | really looked into that case.
        
           | buildbot wrote:
            | Yes, I've tested this out. It does train, but the scaling
            | doesn't seem to pan out. It performs slightly better than
            | you'd expect for the number of trainable parameters, but it
            | never improves as you scale, so for now there's no benefit.
        
           | whimsicalism wrote:
           | i would be shocked if this worked well
        
         | danielhanchen wrote:
          | I think the QLoRA paper (https://arxiv.org/pdf/2305.14314)
          | also showed LoRA on all MLP + attention layers > all MLP
          | layers > just attention layers.
         | 
         | Other papers show finetuning a select few layers can also work
         | well.
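          | 
          | For reference, a minimal peft config along those lines; the
          | module names assume a Llama-style model, so adjust
          | target_modules for other architectures:
          | 
          |     from peft import LoraConfig, get_peft_model
          | 
          |     config = LoraConfig(
          |         r=16, lora_alpha=16, lora_dropout=0.05,
          |         target_modules=[
          |             "q_proj", "k_proj", "v_proj", "o_proj",   # attn
          |             "gate_proj", "up_proj", "down_proj",      # MLP
          |         ],
          |         task_type="CAUSAL_LM",
          |     )
          |     # base_model: a causal LM already loaded elsewhere
          |     model = get_peft_model(base_model, config)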
        
       | thepasswordis wrote:
       | I really wish people would be more careful about choosing names
       | for these things.
       | 
       | LoRa has been a popular wireless protocol for like 10 years.
        
         | sva_ wrote:
         | Yes, but this is LoRA, clearly not LoRa.
        
           | squarefoot wrote:
            | The parent poster has a point though. I entered "LoRA" on
            | Google, DuckDuckGo, Startpage and Bing, and on every first
            | page all of the results were about the communication
            | protocol (1). They could have inferred my interests from
            | previous searches, but I haven't used Bing in the last year
            | or so, so it seems to me someone didn't care about name
            | clashes.
           | 
            | (1) Well, except Google, which (surprise) returned, about
            | mid-page, an ad for a local and quite expensive chandelier
            | brand called "LORA".
        
             | sva_ wrote:
              | I usually just add a term like 'ml' or 'nn' after my
              | search to give the machine context, and that is
              | sufficient in most cases.
        
           | atherton33 wrote:
           | I think you mean LoRa(r), a registered trademark of
           | Semtech(r) Corporation, for use only with permission and
           | within specific guidelines.
           | https://www.semtech.com/uploads/company/FAQ-for-Use-of-
           | LoRa-...
        
         | mobilemidget wrote:
          | I raised the same point in a previous 'lora' post on HN. For
          | me the name is already reserved for the radio
          | telecommunication meaning. Nothing is going to change that.
        
         | renewiltord wrote:
         | Seriously, that was a terrible name for the wireless system
         | since it's been used by the Loyola Online Records Access system
         | for half a decade or more before the radio company shamelessly
         | copied the name.
        
         | dheera wrote:
         | Not sure about "popular"
         | 
         | 99% of ML engineers wouldn't know what it is.
        
           | enlyth wrote:
            | I'm not an ML engineer, and the only reason I know the
            | wireless protocol exists is that every LoRA article on HN
            | has a comment repeating the same complaint.
        
           | Findecanor wrote:
           | 99% of the engineers who are still working in ML (Machine
           | Language) would.
           | 
           | A much smaller percent among those who write in ML (the
           | functional programming language) probably, though.
        
             | nerdponx wrote:
              | Even if the ML _engineers_ know about the wireless
              | protocol (and I doubt that many do), the
              | scientists/researchers who develop these models probably
              | don't. It's a completely different domain. The lead
              | author on this paper is basically a neuroscientist; some
              | of the others are technically computer scientists, but
              | they probably have little hands-on experience with
              | networking beyond whatever they did in undergrad.
        
           | goodpoint wrote:
           | ...but they should know how to use search engines...
        
       | ssl-3 wrote:
       | What can we learn about Low Rank Acronyms today?
        
       | chriskanan wrote:
       | This study is great and addresses a question I've had about LoRA
       | for a while.
       | 
        | In a continual learning paper from last year, I found LoRA was
        | extremely effective for faster fine-tuning while not forgetting
        | the original dataset:
       | 
       | https://arxiv.org/abs/2306.01904
        
       | rzzzt wrote:
        | This paper has 12 authors, which fascinates me to no end for
        | some inexplicable reason. How does that work? Is it common to
        | have this many people working on a submission? Did each of
        | them get at least a paragraph in edgewise?
        
         | repsak wrote:
         | I raise you the Gemini paper https://arxiv.org/abs/2312.11805
        
           | guyomes wrote:
           | All-in with the Foldit paper [1,2].
           | 
           | [1]: https://en.wikipedia.org/wiki/Foldit
           | 
           | [2]: https://www.nature.com/articles/nature09304
        
         | SubiculumCode wrote:
          | For a serious answer, this is how it works in my field: a
          | researcher gets a grant with 3-7 co-investigators. This
          | generates a bunch of data and other resources that will
          | support 10 or more papers. Co-investigators and PIs will ask
          | their postdocs and grad students to write up a paper. PIs
          | and co-Is go on every paper... because it's a paper from
          | their grant. Then the 1 to 4 grad students and postdocs go
          | on the paper, depending on their specific material
          | contributions to the work, be it analysis, conception,
          | execution, or writing. The numbers stack up quickly.
        
       | iudexgundyr wrote:
       | I feel like this is a trivial conclusion. Keeping the rank low in
       | the optimization is a common regularization technique.
        
       ___________________________________________________________________
       (page generated 2024-05-17 23:00 UTC)