[HN Gopher] LoRA Learns Less and Forgets Less
___________________________________________________________________
LoRA Learns Less and Forgets Less
Author : wolecki
Score : 119 points
Date : 2024-05-17 13:00 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| hybridtupel wrote:
| This is about low-rank adaptation. Not to be confused with LoRa,
| the long-range proprietary radio communication technique, which
| hopefully doesn't learn at all.
| martinky24 wrote:
| "Why the hell is LoRa learning" was indeed my first thought...
| gregmac wrote:
| This is "Low-Rank Adaptation", "a widely-used parameter-efficient
| finetuning method for large language models."
|
| Not to be confused with LoRa ("long range") [1], an Internet of
| Things radio technology.
|
| [1] https://en.wikipedia.org/wiki/LoRa
| chaos_emergent wrote:
| Isn't this fairly obvious after a two-second glance at the
| abstract?
| SubiculumCode wrote:
| This is Low-rank adaptation. Not to be confused with Lake of the
| Ozarks Recreation Area.
| 0cf8612b2e1e wrote:
| Apparently constructed in 1929. You'd think those wireless people
| would have been more careful when they reappropriated the name.
| chaos_emergent wrote:
| The finding is that the best fine-tuning performance comes from
| tuning all weights, followed by LoRA on the MLPs, followed by
| LoRA on the attention heads. The authors assert that the
| performance difference comes down to which module of the network
| LoRA targets.
|
| Isn't it an equally valid argument that MLPs tend to account for
| a greater share of the weights in transformer networks than
| attention heads, so the performance difference can be traced to
| a greater number of weights having the freedom to change? I'd be
| curious to know whether randomly choosing a subset of matrices
| to train, regardless of where they sit in the network, would
| give performance analogous to LoRA on a specific module with a
| comparable number of learnable weights.
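| A minimal sketch of that experiment, assuming a standard PyTorch
| model; the function name and the parameter budget are mine, not
| anything from the paper:
|
|     import random
|     import torch.nn as nn
|
|     def unfreeze_random_matrices(model: nn.Module, budget: int,
|                                  seed: int = 0) -> int:
|         # Freeze everything, then unfreeze randomly chosen 2-D
|         # weight matrices until ~budget trainable params are hit.
|         rng = random.Random(seed)
|         for p in model.parameters():
|             p.requires_grad = False
|         matrices = [p for p in model.parameters() if p.dim() == 2]
|         rng.shuffle(matrices)
|         trainable = 0
|         for p in matrices:
|             if trainable >= budget:
|                 break
|             p.requires_grad = True
|             trainable += p.numel()
|         return trainable  # actual trainable parameter count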
| chaos_emergent wrote:
| as a follow up curiosity, has anyone tried using LoRA on the
| entire model for pretraining to compare regular training model
| performance to LoRA?
| cabidaher wrote:
| This paper [1] does attempt that and reports performance similar
| to conventional pre-training. However, they do start off with a
| normal full-rank training phase and claim that it is needed to
| 'warm start' the training process.
|
| [1] https://arxiv.org/abs/2307.05695
| danielhanchen wrote:
| Oh yes, this paper! The main issue is the scaling of the A and B
| LoRA matrices. Some papers show that scaling the B matrix with a
| larger learning rate (LoRA+) can be beneficial. DoRA, for
| example, learns a scaling vector that tries to alleviate these
| issues.
|
| GaLore might be closer to an equivalent of full pretraining,
| with the gradients being low rank.
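| A rough sketch of the LoRA+ idea (a larger learning rate for B
| than for A), assuming PEFT-style parameter names that contain
| "lora_A"/"lora_B"; the ratio of 16 is only illustrative:
|
|     import torch
|
|     def lora_plus_optimizer(model, lr=1e-4, b_lr_ratio=16.0):
|         # Split trainable LoRA weights into A and B groups and
|         # give the B matrices a larger learning rate.
|         a_params, b_params = [], []
|         for name, p in model.named_parameters():
|             if not p.requires_grad:
|                 continue
|             if "lora_A" in name:
|                 a_params.append(p)
|             elif "lora_B" in name:
|                 b_params.append(p)
|         return torch.optim.AdamW([
|             {"params": a_params, "lr": lr},
|             {"params": b_params, "lr": lr * b_lr_ratio},
|         ])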
| sp332 wrote:
| Do you mean leaving most of the model in its initial,
| randomised state and only training a LoRA?
| buildbot wrote:
| I've tested exactly this (on my personal time) :) It will train,
| but I found that the loss tracks the number of trainable
| parameters. So to roughly hit the performance of a standard 70M
| param model, you need to train ~70M LoRA params anyway.
| cheald wrote:
| It's worse than that, because LoRA requires two matrices per
| layer. At full rank, you have an additional NxN parameters to
| learn versus full finetuning, where N is min(input_features,
| output_features).
|
| For example, tuning a layer of 128 in x 256 out is 32k params.
| Learning a full-rank LoRA for that layer would be two matrices
| of 128x128 and 128x256 = 48k params.
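| The arithmetic behind those numbers, as a quick Python check:
|
|     d_in, d_out = 128, 256
|     full_finetune = d_in * d_out           # 32,768 (~32k)
|     r = min(d_in, d_out)                   # full-rank LoRA: r = 128
|     lora_full_rank = d_in * r + r * d_out  # 49,152 (~48k)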
| buildbot wrote:
| Yeah, exactly. Though the 48k-param LoRA might be as good as a
| 48k-param layer of higher rank; I haven't really looked into
| that case.
| buildbot wrote:
| Yes, I've tested this out. It does train, but the scaling
| doesn't seem to pan out. It performs slightly better than the
| trainable parameter count alone would suggest, but the advantage
| doesn't grow as you scale, so for now there's no benefit.
| whimsicalism wrote:
| i would be shocked if this worked well
| danielhanchen wrote:
| I think the QLoRA paper (https://arxiv.org/pdf/2305.14314) also
| showed LoRA on all MLP + attention layers > all MLP layers >
| just attention layers.
|
| Other papers show that finetuning a select few layers can also
| work well.
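| For concreteness, which layers LoRA touches is just a list of
| module names in the adapter config. A sketch using the Hugging
| Face PEFT library and Llama-style module names (names differ by
| architecture; base_model stands in for an already-loaded causal
| LM):
|
|     from peft import LoraConfig, get_peft_model
|
|     config = LoraConfig(
|         r=16,
|         lora_alpha=16,
|         task_type="CAUSAL_LM",
|         target_modules=[
|             "q_proj", "k_proj", "v_proj", "o_proj",  # attention
|             "gate_proj", "up_proj", "down_proj",     # MLP
|         ],
|     )
|     # base_model is a placeholder for a loaded transformers model
|     model = get_peft_model(base_model, config)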
| thepasswordis wrote:
| I really wish people would be more careful about choosing names
| for these things.
|
| LoRa has been a popular wireless protocol for like 10 years.
| sva_ wrote:
| Yes, but this is LoRA, clearly not LoRa.
| squarefoot wrote:
| The parent poster has a point, though. I entered "LoRA" on
| Google, DuckDuckGo, Startpage and Bing, and all of the
| first-page results were about the communication protocol (1).
| They could have inferred my interests from previous searches,
| but I haven't used Bing in the last year or so, so it seems to
| me someone didn't care about name clashes.
|
| (1) Well, except for Google, which (surprise) returned, about
| mid-page, an ad for a rather expensive local chandelier brand
| called "LORA".
| sva_ wrote:
| I usually just add a term like 'ml' or 'nn' after my search
| to give the machine context and it is sufficient in most
| cases.
| atherton33 wrote:
| I think you mean LoRa(r), a registered trademark of
| Semtech(r) Corporation, for use only with permission and
| within specific guidelines.
| https://www.semtech.com/uploads/company/FAQ-for-Use-of-LoRa-...
| mobilemidget wrote:
| I raised the same point in a previous 'lora' post on HN. For me
| the name is already reserved for the radio telecommunication
| meaning. Nothing is going to change that.
| renewiltord wrote:
| Seriously, that was a terrible name for the wireless system,
| since it had been used by the Loyola Online Records Access
| system for half a decade or more before the radio company
| shamelessly copied the name.
| dheera wrote:
| Not sure about "popular"
|
| 99% of ML engineers wouldn't know what it is.
| enlyth wrote:
| I'm not an ML engineer and the only reason I know that the
| wireless protocol exists is because in every HN article,
| there's a comment repeating the same complaint
| Findecanor wrote:
| 99% of the engineers who are still working in ML (Machine
| Language) would.
|
| Probably a much smaller percentage of those who write in ML (the
| functional programming language), though.
| nerdponx wrote:
| Even if the ML _engineers_ know about the wireless protocol (and
| I doubt that many do), the scientists/researchers who develop
| these models probably don't. It's a completely different domain.
| The lead author on this paper is basically a neuroscientist;
| some of the others are technically computer scientists, but they
| probably have little hands-on experience with networking beyond
| whatever they did in undergrad.
| goodpoint wrote:
| ...but they should know how to use search engines...
| ssl-3 wrote:
| What can we learn about Low Rank Acronyms today?
| chriskanan wrote:
| This study is great and addresses a question I've had about LoRA
| for a while.
|
| In a continual learning paper from last year, I found that LoRA
| was extremely effective for faster fine-tuning while not
| forgetting the original dataset:
|
| https://arxiv.org/abs/2306.01904
| rzzzt wrote:
| This paper has 12 authors, which fascinates me to no end for
| some inexplicable reason. How does it work? Is it a common
| occurrence to have this many people working on a submission? Did
| each of them get at least a paragraph in edgewise?
| repsak wrote:
| I raise you the Gemini paper https://arxiv.org/abs/2312.11805
| guyomes wrote:
| All-in with the Foldit paper [1,2].
|
| [1]: https://en.wikipedia.org/wiki/Foldit
|
| [2]: https://www.nature.com/articles/nature09304
| SubiculumCode wrote:
| For a serious answer, this is how it works in my field: a
| researcher gets a grant with 3-7 co-investigators. This
| generates a bunch of data and other resources that will support
| 10 or more papers. Co-investigators and PIs will ask their
| postdocs and grad students to write up a paper. PIs and co-Is go
| on every paper...because it's a paper from their grant. Then the
| 1 to 4 grad students and postdocs go on the paper, depending on
| their specific material contributions to the work, be it
| analysis, conception, execution, or writing. The numbers can
| stack up.
| iudexgundyr wrote:
| I feel like this is a trivial conclusion. Keeping the rank low in
| the optimization is a common regularization technique.
___________________________________________________________________
(page generated 2024-05-17 23:00 UTC)