[HN Gopher] LoRA Learns Less and Forgets Less
       ___________________________________________________________________
        
       LoRA Learns Less and Forgets Less
        
       Author : wolecki
       Score  : 170 points
       Date   : 2024-05-17 13:00 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hybridtupel wrote:
       | This is about Low-rank adaptation. Not to be confused with
       | LoRa, the long-range proprietary radio communication
       | technique, which hopefully doesn't learn at all.
        
         | martinky24 wrote:
         | "Why the hell is LoRa learning" was indeed my first thought...
        
           | HeatrayEnjoyer wrote:
           | This is how the subs were knocked offline in Terminator III
        
       | gregmac wrote:
       | This is "Low-Rank Adaptation", "a widely-used parameter-efficient
       | finetuning method for large language models."
       | 
       | Not to be confused with LoRa ("long range") [1], an Internet of
       | Things radio technology.
       | 
       | [1] https://en.wikipedia.org/wiki/LoRa
        
         | chaos_emergent wrote:
         | Isn't this fairly obvious after a two-second glance at the
         | abstract?
        
       | SubiculumCode wrote:
       | This is Low-rank adaptation. Not to be confused with Lake of the
       | Ozarks Recreation Area.
        
         | 0cf8612b2e1e wrote:
         | Apparently constructed in 1929. You'd think those wireless
         | people would have been more careful when they
         | reappropriated the name.
        
       | chaos_emergent wrote:
       | The findings are that the best fine-tuning performance comes
       | from fine-tuning all weights, followed by LoRA on the MLPs,
       | followed by LoRA on the attention heads. The authors assert
       | that the performance difference comes down to which module
       | of the NN is targeted.
       | 
       | Isn't an equally valid argument that MLPs tend to constitute
       | a greater number of weights in transformer networks than
       | attention heads, so the performance difference can be traced
       | to a greater number of weights having freedom to change? I'd
       | be curious to know whether randomly choosing a subset of
       | matrices to train, regardless of where they are in the
       | network, would give performance comparable to LoRA on a
       | specific module with a similar number of learnable weights.
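       | 
       | A rough sketch of that argument in Python (hidden size d,
       | rank r, and the plain two-matrix MLP are all just
       | illustrative assumptions):
       | 
       |     d, r = 4096, 16
       |     attn_full = 4 * d * d           # Q, K, V, O
       |     mlp_full = 2 * d * (4 * d)      # up + down proj
       |     attn_lora = 4 * r * (d + d)     # A+B per matrix
       |     mlp_lora = 2 * r * (d + 4 * d)
       |     print(attn_full, mlp_full)  # MLP ~2x the weights
       |     print(attn_lora, mlp_lora)  # and ~1.25x LoRA params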
        
         | chaos_emergent wrote:
         | As a follow-up curiosity: has anyone tried using LoRA on
         | the entire model during pretraining, to compare regular
         | pretraining performance with LoRA?
        
           | cabidaher wrote:
           | This paper [1] does attempt that and reports similar
           | performance compared to conventional pre-training. However,
           | they do start off by doing a normal full-rank training and
           | claim that it is needed to 'warm start' the training process.
           | 
           | [1] https://arxiv.org/abs/2307.05695
        
             | danielhanchen wrote:
             | Oh yes, this paper! The main issue is the scaling of
             | the A and B LoRA matrices. Some papers show that
             | training the B matrix with a larger learning rate
             | (LoRA+) can be beneficial. DoRA, for example, learns a
             | separate magnitude (scaling) vector that tries to
             | alleviate these issues.
             | 
             | GaLore might be closer to an equivalent of full
             | pretraining, with the gradients projected to low rank.
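             | 
             | A rough sketch of the LoRA+ idea in PyTorch, assuming
             | a peft-style `model` whose adapter weights are named
             | lora_A / lora_B (the 16x LR ratio is illustrative):
             | 
             |     import torch
             |     a = [p for n, p in model.named_parameters()
             |          if "lora_A" in n]
             |     b = [p for n, p in model.named_parameters()
             |          if "lora_B" in n]
             |     opt = torch.optim.AdamW([
             |         {"params": a, "lr": 1e-4},
             |         {"params": b, "lr": 1.6e-3},  # bigger LR
             |     ])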
        
           | sp332 wrote:
           | Do you mean leaving most of the model in its initial,
           | randomised state and only training a LoRA?
        
             | buildbot wrote:
             | I've tested exactly this (on my personal time) :) It
             | will train, but I found that the loss you reach scales
             | with the number of trainable parameters. So roughly,
             | to hit the performance of a standard 70M-param model,
             | you need to train ~70M LoRA params anyway.
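             | 
             | Roughly, the setup looks like this (the sizes, rank,
             | and module names below are illustrative, not exactly
             | what I ran):
             | 
             |     from transformers import (GPT2Config,
             |                               GPT2LMHeadModel)
             |     from peft import LoraConfig, get_peft_model
             |     cfg = GPT2Config(n_layer=6, n_embd=512,
             |                      n_head=8)
             |     base = GPT2LMHeadModel(cfg)  # random init
             |     mods = ["c_attn", "c_proj", "c_fc"]
             |     lora = LoraConfig(r=64, lora_alpha=64,
             |                       task_type="CAUSAL_LM",
             |                       target_modules=mods)
             |     model = get_peft_model(base, lora)
             |     model.print_trainable_parameters()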
        
               | cheald wrote:
               | It's worse than that, because LoRA requires two
               | matrices per layer. At full rank, you have an
               | additional NxN parameters to learn versus full
               | finetuning, where N is min(input_features,
               | output_features).
               | 
               | For example, tuning a layer of 128 in x 256 out is
               | 32k params. Learning a full-rank LoRA for that layer
               | would be two matrices of 128x128 and 128x256, about
               | 48k params.
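               | 
               | Quick check of the arithmetic for a layer of that
               | shape at a few ranks (just the A+B parameter count):
               | 
               |     d_in, d_out = 128, 256
               |     full = d_in * d_out          # 32768 (~32k)
               |     for r in (4, 16, 64, 128):   # 128 = full rank
               |         lora = r * (d_in + d_out)
               |         print(r, lora, lora / full)
               |     # full rank gives 49152 (~48k), 1.5x the layer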
        
               | buildbot wrote:
               | Yeah, exactly. Though the 48k-param LoRA might be as
               | good as a 48k-param layer of higher rank; I haven't
               | really looked into that case.
        
           | buildbot wrote:
           | Yes, I've tested this out. It does train, but the
           | scaling doesn't seem to pan out. It performs slightly
           | better than the trainable-parameter count would suggest,
           | but that advantage doesn't improve as you scale, so for
           | now there's no benefit.
        
           | whimsicalism wrote:
           | i would be shocked if this worked well
        
         | danielhanchen wrote:
         | I think the QLoRA paper (https://arxiv.org/pdf/2305.14314)
         | also showed LoRA on all MLP + attention layers > all MLP
         | layers > just attention layers.
         | 
         | Other papers show that finetuning a select few layers can
         | also work well.
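         | 
         | For reference, the three setups with peft look roughly
         | like this (Llama-style module names assumed):
         | 
         |     from peft import LoraConfig
         |     attn = ["q_proj", "k_proj", "v_proj", "o_proj"]
         |     mlp = ["gate_proj", "up_proj", "down_proj"]
         |     attn_only = LoraConfig(r=16, target_modules=attn)
         |     mlp_only = LoraConfig(r=16, target_modules=mlp)
         |     both = LoraConfig(r=16,
         |                       target_modules=attn + mlp)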
        
           | 3abiton wrote:
           | Any real-world performance comparison between QLoRA and
           | LoRA?
        
             | danielhanchen wrote:
             | The QLoRA paper itself provided some cool benchmarks
             | across many, many experiments: QLoRA is nearly
             | equivalent to LoRA, sometimes gaining or losing 1-2%
             | accuracy depending on the use case.
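             | 
             | In practice QLoRA just loads the base model in 4-bit
             | (NF4) before attaching the same LoRA adapters,
             | something like this (model name and target modules
             | are illustrative):
             | 
             |     import torch
             |     from transformers import (AutoModelForCausalLM,
             |                               BitsAndBytesConfig)
             |     from peft import LoraConfig, get_peft_model
             |     bnb = BitsAndBytesConfig(
             |         load_in_4bit=True,
             |         bnb_4bit_quant_type="nf4",
             |         bnb_4bit_use_double_quant=True,
             |         bnb_4bit_compute_dtype=torch.bfloat16)
             |     base = AutoModelForCausalLM.from_pretrained(
             |         "meta-llama/Llama-2-7b-hf",
             |         quantization_config=bnb)
             |     mods = ["q_proj", "v_proj"]
             |     lora = LoraConfig(r=16, target_modules=mods)
             |     model = get_peft_model(base, lora)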
        
       | thepasswordis wrote:
       | I really wish people would be more careful about choosing names
       | for these things.
       | 
       | LoRa has been a popular wireless protocol for like 10 years.
        
         | sva_ wrote:
         | Yes, but this is LoRA, clearly not LoRa.
        
           | squarefoot wrote:
           | PP has a point though. I entered "LoRA" on Google,
           | DuckDuckGo, Startpage and Bing, and all of the
           | first-page results were about the communication protocol
           | (1). They could have inferred my interests from previous
           | searches, but I haven't used Bing in the last year or
           | so, so it seems to me someone didn't care about name
           | clashes.
           | 
           | (1) Well, except Google, which (surprise) returned,
           | about mid-page, an ad for a local, quite expensive
           | chandelier brand called "LORA".
        
             | sva_ wrote:
             | I usually just add a term like 'ml' or 'nn' after my
             | search to give the machine context, and that's
             | sufficient in most cases.
        
             | refulgentis wrote:
             | Wait until you find out it's a name too
        
             | johnisgood wrote:
             | Most of the time you have to add a keyword indicating
             | what it relates to. We cannot expect everything to
             | have a unique name, unless we are perfectly fine with
             | random pronounceable strings as names.
        
           | atherton33 wrote:
           | I think you mean LoRa(r), a registered trademark of
           | Semtech(r) Corporation, for use only with permission and
           | within specific guidelines.
           | https://www.semtech.com/uploads/company/FAQ-for-Use-of-
           | LoRa-...
        
             | mbirth wrote:
             | Not to be confused with Semtex who sell a completely
             | different kind of problem solver.
        
               | noisy_boy wrote:
               | Isn't Semtex an explosive aka RDX?
        
               | yau8edq12i wrote:
               | Yes, you got the joke.
        
         | mobilemidget wrote:
         | I raised the same point in a previous 'lora' post on HN.
         | For me the name is already reserved for its radio
         | telecommunication meaning. Nothing is going to change
         | that.
        
         | renewiltord wrote:
         | Seriously, that was a terrible name for the wireless
         | system, since it had been used by the Loyola Online
         | Records Access system for half a decade or more before the
         | radio company shamelessly copied the name.
        
         | dheera wrote:
         | Not sure about "popular"
         | 
         | 99% of ML engineers wouldn't know what it is.
        
           | enlyth wrote:
           | I'm not an ML engineer and the only reason I know that the
           | wireless protocol exists is because in every HN article,
           | there's a comment repeating the same complaint
        
           | Findecanor wrote:
           | 99% of the engineers who are still working in ML (Machine
           | Language) would.
           | 
           | A much smaller percent among those who write in ML (the
           | functional programming language) probably, though.
        
             | nerdponx wrote:
             | Even if the ML _engineers_ know about the wireless
             | protocol (and I doubt that many do), the
             | scientists/researchers who develop these models
             | probably don't. It's a completely different domain.
             | The lead author on this paper is basically a
             | neuroscientist; some of the others are technically
             | computer scientists, but they probably have little
             | hands-on experience with networking beyond whatever
             | they did in undergrad.
        
           | goodpoint wrote:
           | ...but they should know how to use search engines...
        
             | bryanrasmussen wrote:
             | at some point we are going to run out of easily
             | pronounceable abbreviations that are unique. Perhaps that
             | point is actually in the past and we should just
             | acknowledge it and move on. Although I guess it could have
             | been Lorall - oops, that's a character in World of
             | Warcraft.
        
               | dheera wrote:
               | Old concepts become obsolete anyway. People can start
               | reusing VCR, etc.
        
             | marcinzm wrote:
             | As should have the IoT people, to avoid conflicting
             | with the decades-old LORA name used for Level of
             | Repair Analysis.
        
           | bmitc wrote:
           | Not much of a surprise there. The ML culture is to reinvent
           | names for everything.
        
         | marcinzm wrote:
         | And LORA has stood for Level of Repair Analysis since the
         | '70s.
        
         | onion2k wrote:
         | The scale of human knowledge, and the pace we're increasing it
         | at, means we probably can't have unique names for everything
         | any more if we aim for short, pronounceable things. "Lora" must
         | have _dozens_ of meanings across different domains. The fact
         | you recognize it from another one is just something you have to
         | deal with.
        
         | BaculumMeumEst wrote:
         | The other side of this is that if you become paralyzed by
         | decisions because 0.001% of people are bothered by it, you're
         | not gonna make it.
        
         | dsjoerg wrote:
         | The overlap in the Venn Diagram of people who care about LoRA
         | and people who care about LoRa is extremely small. Your problem
         | is not typical of people in the Machine Learning field. That's
         | why they didn't care, or more likely this issue didn't occur to
         | the first 50 people who saw the name LoRA.
        
         | Turing_Machine wrote:
         | Think of Apple naming its spreadsheet "Numbers" and its word
         | processor "Pages" (or, for that matter, the name "Apple"
         | itself. Or "Windows", "Word", "Access"...).
         | 
         | And yet (as others have noted) adding another word or two to
         | give a bit of context is _usually_ enough that web searches
         | work.
         | 
         | Search engines are pretty clever nowadays, except when they've
         | been deliberately dumbed-down (cough... Google...).
        
       | ssl-3 wrote:
       | What can we learn about Low Rank Acronyms today?
        
       | chriskanan wrote:
       | This study is great and addresses a question I've had about LoRA
       | for a while.
       | 
       | In a continual learning paper from last year, I found LoRA
       | was extremely effective for faster fine-tuning while not
       | forgetting the original dataset:
       | 
       | https://arxiv.org/abs/2306.01904
        
       | rzzzt wrote:
       | This paper has 12 authors, which fascinates me to no end for some
       | unexplainable reason. How does it work? Is it a common occurrence
       | to have this many people working on a submission? Did each of
       | them get at least a paragraph in edgewise?
        
         | repsak wrote:
         | I raise you the Gemini paper https://arxiv.org/abs/2312.11805
        
           | guyomes wrote:
           | All-in with the Foldit paper [1,2].
           | 
           | [1]: https://en.wikipedia.org/wiki/Foldit
           | 
           | [2]: https://www.nature.com/articles/nature09304
        
           | jpgvm wrote:
           | Goes to show how much money is being poured into this stuff.
        
         | SubiculumCode wrote:
         | For a serious answer, this is how it works in my field. A
         | researcher gets a grant with 3-7 co-investigators. This
         | generates a bunch of data and other resources that will
         | support 10 or more papers. Co-investigators and PIs will
         | ask their postdocs and grad students to write up a paper.
         | PIs and co-Is go on every paper...because it's a paper
         | from their grant. Then the 1 to 4 grad students and
         | postdocs go on the paper, depending on their specific
         | material contributions to the work, be it analysis,
         | conception, execution, or writing. The numbers can stack
         | up.
        
         | PeterisP wrote:
         | The general criteria for authorship require including the
         | people who worked on the experiments and data for the
         | paper, which can be a more important contribution than
         | most of the text in that paper. In other experimental
         | fields, there are papers with dozens or even hundreds of
         | authors, because it can take many people to get to a
         | measurement of a single number in the paper.
        
           | rzzzt wrote:
           | Thanks, this is the bit I've been missing.
        
         | yau8edq12i wrote:
         | Wait until you learn that the paper on the LHC has more than
         | 5000 authors: https://www.nature.com/articles/nature.2015.17567
        
       | iudexgundyr wrote:
       | I feel like this is a trivial conclusion. Keeping the rank low in
       | the optimization is a common regularization technique.
        
       | Saris wrote:
       | What does LoRa have to do with LLMs? Whoever named this thing
       | screwed up big time.
        
       | yinser wrote:
       | This was a poor study:
       | https://x.com/danielhanchen/status/1791900967472140583?s=46&...
        
       ___________________________________________________________________
       (page generated 2024-05-18 23:02 UTC)