[HN Gopher] LoRA Learns Less and Forgets Less
___________________________________________________________________
LoRA Learns Less and Forgets Less
Author : wolecki
Score : 170 points
Date : 2024-05-17 13:00 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| hybridtupel wrote:
| This is about Low-rank adaptation. Not to be confused with LoRa
| the long range proprietary radio communication technique, which
| hopefully doesn't learn at all.
| martinky24 wrote:
| "Why the hell is LoRa learning" was indeed my first thought...
| HeatrayEnjoyer wrote:
| This is how the subs were knocked offline in Terminator III
| gregmac wrote:
| This is "Low-Rank Adaptation", "a widely-used parameter-efficient
| finetuning method for large language models."
|
| Not to be confused with LoRa ("long range") [1], an Internet of
| Things radio technology.
|
| [1] https://en.wikipedia.org/wiki/LoRa
| chaos_emergent wrote:
| Isn't this fairly obvious after a two-second glance at the
| abstract?
| SubiculumCode wrote:
| This is Low-rank adaptation. Not to be confused with Lake of the
| Ozarks Recreation Area.
| 0cf8612b2e1e wrote:
| Apparently constructed in 1929. You'd think those wireless people
| would have been more careful when they reappropriated the name.
| chaos_emergent wrote:
| The findings are that the best fine-tune performance comes from
| fine-tuning all weights, followed by MLPs, followed by attention
| heads, using LoRA. The authors assert that the performance
| difference is driven by which module of the NN is targeted.
|
| Isn't an equally valid argument that MLPs account for a greater
| number of weights in transformer networks than attention heads, so
| the performance difference can be traced to a greater number of
| weights having freedom to change? I'd be curious to know whether
| randomly choosing a subset of matrices to train, regardless of
| where they are in the network, would provide performance analogous
| to LoRA on a specific module with a comparable number of learnable
| weights.
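|
| A rough back-of-the-envelope count (mine, not the paper's) makes
| the asymmetry concrete. The dimensions below are assumed, roughly
| Llama-7B-like, and the rank is arbitrary:
|
|     d_model = 4096   # hidden size (assumed)
|     d_ff = 11008     # MLP intermediate size (assumed)
|     rank = 16        # LoRA rank (arbitrary)
|
|     # Underlying full weights per transformer block
|     attn_full = 4 * d_model * d_model   # q/k/v/o projections, ~67M
|     mlp_full = 3 * d_model * d_ff       # gate/up/down projections, ~135M
|
|     # Each adapted weight W (out x in) adds A (rank x in) + B (out x rank)
|     attn_lora = 4 * rank * (d_model + d_model)
|     mlp_lora = 2 * rank * (d_model + d_ff) + rank * (d_ff + d_model)
|
|     print(f"full weights - attn: {attn_full:,}  mlp: {mlp_full:,}")
|     print(f"LoRA weights - attn: {attn_lora:,}  mlp: {mlp_lora:,}")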
| chaos_emergent wrote:
| As a follow-up curiosity, has anyone tried using LoRA on the
| entire model for pretraining, to compare regular pretraining
| performance against LoRA?
| cabidaher wrote:
| This paper [1] does attempt that and reports similar
| performance compared to conventional pre-training. However,
| they do start off by doing a normal full-rank training and
| claim that it is needed to 'warm start' the training process.
|
| [1] https://arxiv.org/abs/2307.05695
| danielhanchen wrote:
| Oh yes, this paper! The main issue is the scaling of the A
| and B LoRA matrices. Some papers show that training the B matrix
| with a larger learning rate (LoRA+) can be beneficial. DoRA, for
| example, learns a separate magnitude (scaling) vector, which tries
| to alleviate these issues.
|
| GaLore might be closer to full pretraining, with the gradients
| being low rank.
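|
| A minimal sketch of the LoRA+ idea (my own, not from those papers):
| give the B matrices a larger learning rate than the A matrices. It
| assumes PEFT-style parameter names containing "lora_A"/"lora_B",
| and the 16x ratio is illustrative, not a recommendation:
|
|     import torch
|
|     def loraplus_param_groups(model, lr=1e-4, b_lr_ratio=16.0):
|         # Split trainable LoRA params into A and B groups
|         a_params, b_params = [], []
|         for name, p in model.named_parameters():
|             if not p.requires_grad:
|                 continue
|             if "lora_B" in name:
|                 b_params.append(p)
|             elif "lora_A" in name:
|                 a_params.append(p)
|         return [
|             {"params": a_params, "lr": lr},
|             {"params": b_params, "lr": lr * b_lr_ratio},  # B learns faster
|         ]
|
|     # optimizer = torch.optim.AdamW(loraplus_param_groups(model))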
| sp332 wrote:
| Do you mean leaving most of the model in its initial,
| randomised state and only training a LoRA?
| buildbot wrote:
| I've tested exactly this (on my personal time) :) It will train,
| but I found the loss is proportional to the number of trainable
| parameters. So, roughly, to hit the performance of a standard
| 70M-param model, you need to train ~70M LoRA params anyway.
| cheald wrote:
| It's worse than that, because LoRA requires two matrices per
| layer. At full rank, you have an additional N x N parameters to
| learn versus full finetuning, where N is
| min(input_features, output_features).
|
| For example, tuning a layer of 128 in x 256 out is 32k params.
| Learning a full-rank LoRA for that layer would be two matrices of
| 128x128 and 128x256 = 48k params.
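|
| A quick sanity check of that arithmetic in a few lines of Python:
|
|     in_f, out_f = 128, 256
|     full_finetune = in_f * out_f         # 128*256 = 32,768 (~32k)
|     r = min(in_f, out_f)                 # "full rank" LoRA -> r = 128
|     lora_params = r * in_f + out_f * r   # A: 128x128, B: 256x128
|     print(full_finetune, lora_params)    # 32768 49152 (~32k vs ~48k)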
| buildbot wrote:
| Yeah, exactly. Though the 48k-param LoRA might be as good as a
| 48k-param layer of higher rank; I haven't really looked into that
| case.
| buildbot wrote:
| Yes, I've tested this out. It does train, but the scaling doesn't
| seem to pan out. It performs slightly better than the trainable
| parameter count would suggest, but that never improves as you
| scale, so for now there's no benefit.
| whimsicalism wrote:
| i would be shocked if this worked well
| danielhanchen wrote:
| I think the QLoRA paper https://arxiv.org/pdf/2305.14314 also
| showed LoRA on all MLP + attention layers > all MLP layers > just
| attention layers.
|
| Other papers show finetuning a select few layers can also work
| well.
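|
| For anyone curious what "all MLP + attention layers" looks like in
| practice, here's a rough sketch using Hugging Face PEFT. The module
| names are Llama-style and vary by architecture, and the model id is
| just an example:
|
|     from transformers import AutoModelForCausalLM
|     from peft import LoraConfig, get_peft_model
|
|     model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
|
|     config = LoraConfig(
|         r=16,
|         lora_alpha=16,
|         lora_dropout=0.05,
|         task_type="CAUSAL_LM",
|         # attention + MLP projections ("all linear layers" recipe)
|         target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
|                         "gate_proj", "up_proj", "down_proj"],
|     )
|     model = get_peft_model(model, config)
|     model.print_trainable_parameters()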
| 3abiton wrote:
| Any real-world performance comparison between QLoRA and LoRA?
| danielhanchen wrote:
| The QLoRA paper itself provided some cool benchmarks across many,
| many experiments - QLoRA is nearly equivalent to LoRA, sometimes
| gaining or losing 1-2% accuracy (it depends on the use case).
| thepasswordis wrote:
| I really wish people would be more careful about choosing names
| for these things.
|
| LoRa has been a popular wireless protocol for like 10 years.
| sva_ wrote:
| Yes, but this is LoRA, clearly not LoRa.
| squarefoot wrote:
| PP has a point though. I entered "LoRA" on Google, DuckDuckGo,
| Startpage and Bing, and all the first-page results were about the
| communication protocol (1). They could have inferred my interests
| from previous searches, but I haven't used Bing in the last year
| or so, so it seems to me someone didn't care about name clashes.
|
| (1) Well, except Google, which -surprise- returned, about
| mid-page, an ad for a local, rather expensive chandelier brand
| called "LORA".
| sva_ wrote:
| I usually just add a term like 'ml' or 'nn' after my search
| to give the machine context and it is sufficient in most
| cases.
| refulgentis wrote:
| Wait until you find out it's a name too
| johnisgood wrote:
| Most of the time you have to add a keyword indicating what it is
| related to. We cannot expect everything to have a unique name,
| unless we are perfectly fine with random pronounceable strings as
| names.
| atherton33 wrote:
| I think you mean LoRa(r), a registered trademark of
| Semtech(r) Corporation, for use only with permission and
| within specific guidelines.
| https://www.semtech.com/uploads/company/FAQ-for-Use-of-
| LoRa-...
| mbirth wrote:
| Not to be confused with Semtex, a completely different kind of
| problem solver.
| noisy_boy wrote:
| Isn't Semtex an explosive aka RDX?
| yau8edq12i wrote:
| Yes, you got the joke.
| mobilemidget wrote:
| I raised the same point in a previous 'LoRa' post on HN. For me
| the name is already reserved for the radio telecommunication
| meaning. Nothing is going to change that.
| renewiltord wrote:
| Seriously, that was a terrible name for the wireless system,
| since it had been used by the Loyola Online Records Access system
| for half a decade or more before the radio company shamelessly
| copied the name.
| dheera wrote:
| Not sure about "popular"
|
| 99% of ML engineers wouldn't know what it is.
| enlyth wrote:
| I'm not an ML engineer, and the only reason I know the wireless
| protocol exists is that in every HN article there's a comment
| repeating the same complaint.
| Findecanor wrote:
| 99% of the engineers who are still working in ML (Machine
| Language) would.
|
| A much smaller percent among those who write in ML (the
| functional programming language) probably, though.
| nerdponx wrote:
| Even if the ML _engineers_ know about the wireless protocol
| (and I doubt that many do), the scientists/researchers who
| develop these models probably don't. It's a completely different
| domain. The lead author on this paper is basically a
| neuroscientist; some of the others are technically computer
| scientists, but probably have little hands-on experience with
| networking beyond whatever they did in undergrad.
| goodpoint wrote:
| ...but they should know how to use search engines...
| bryanrasmussen wrote:
| at some point we are going to run out of easily
| pronounceable abbreviations that are unique. Perhaps that
| point is actually in the past and we should just
| acknowledge it and move on. Although I guess it could have
| been Lorall - oops, that's a character in World of
| Warcraft.
| dheera wrote:
| Old concepts become obsolete anyway. People can start
| reusing VCR, etc.
| marcinzm wrote:
| As the IoT people should have, to avoid conflicting with the
| decades-old LORA name used for Level of Repair Analysis.
| bmitc wrote:
| Not much of a surprise there. The ML culture is to reinvent
| names for everything.
| marcinzm wrote:
| And LORA has stood for Level of Repair Analysis since the '70s.
| onion2k wrote:
| The scale of human knowledge, and the pace we're increasing it
| at, means we probably can't have unique names for everything
| any more if we aim for short, pronounceable things. "Lora" must
| have _dozens_ of meanings across different domains. The fact
| you recognize it from another one is just something you have to
| deal with.
| BaculumMeumEst wrote:
| The other side of this is that if you become paralyzed by
| decisions because 0.001% of people are bothered by it, you're
| not gonna make it.
| dsjoerg wrote:
| The overlap in the Venn Diagram of people who care about LoRA
| and people who care about LoRa is extremely small. Your problem
| is not typical of people in the Machine Learning field. That's
| why they didn't care, or more likely this issue didn't occur to
| the first 50 people who saw the name LoRA.
| Turing_Machine wrote:
| Think of Apple naming its spreadsheet "Numbers" and its word
| processor "Pages" (or, for that matter, the name "Apple"
| itself. Or "Windows", "Word", "Access"...).
|
| And yet (as others have noted) adding another word or two to
| give a bit of context is _usually_ enough that web searches
| work.
|
| Search engines are pretty clever nowadays, except when they've
| been deliberately dumbed-down (cough... Google...).
| ssl-3 wrote:
| What can we learn about Low Rank Acronyms today?
| chriskanan wrote:
| This study is great and addresses a question I've had about LoRA
| for a while.
|
| In a continual learning paper from last year, I found LoRA was
| extremely effective for faster fine-tuning and not forgetting the
| original dataset:
|
| https://arxiv.org/abs/2306.01904
| rzzzt wrote:
| This paper has 12 authors, which fascinates me to no end for some
| inexplicable reason. How does it work? Is it a common occurrence
| to have this many people working on a submission? Did each of
| them get at least a paragraph in edgewise?
| repsak wrote:
| I raise you the Gemini paper https://arxiv.org/abs/2312.11805
| guyomes wrote:
| All-in with the Foldit paper [1,2].
|
| [1]: https://en.wikipedia.org/wiki/Foldit
|
| [2]: https://www.nature.com/articles/nature09304
| jpgvm wrote:
| Goes to show how much money is being poured into this stuff.
| SubiculumCode wrote:
| For a serious answer, this is how it works in my field: a
| researcher gets a grant with 3-7 co-investigators. This generates
| a bunch of data and other resources that will support 10 or more
| papers. Co-investigators and PIs will ask their postdocs and grad
| students to write up a paper. PIs and co-Is go on every
| paper...because it's a paper from their grant. Then the 1 to 4
| grad students and postdocs go on the paper, depending on their
| specific material contributions to the work, be it analysis,
| conception, execution, or writing. The numbers can stack up.
| PeterisP wrote:
| The general criteria for authorship require including the people
| who worked on the experiments and data for the paper, which can be
| a more important contribution than most of the text in the paper.
| In other experimental fields, there are papers with dozens or even
| hundreds of authors, because it can take many people to get to the
| measurement of a single number in the paper.
| rzzzt wrote:
| Thanks, this is the bit I've been missing.
| yau8edq12i wrote:
| Wait until you learn that the paper on the LHC has more than
| 5000 authors: https://www.nature.com/articles/nature.2015.17567
| iudexgundyr wrote:
| I feel like this is a trivial conclusion. Keeping the rank low in
| the optimization is a common regularization technique.
| Saris wrote:
| What does LoRa have to do with LLMs? Whoever named this thing
| screwed up big time.
| yinser wrote:
| This was a poor study:
| https://x.com/danielhanchen/status/1791900967472140583?s=46&...
___________________________________________________________________
(page generated 2024-05-18 23:02 UTC)