[HN Gopher] DyLoRA: Parameter Efficient Tuning of Pre-Trained Mo...
___________________________________________________________________
DyLoRA: Parameter Efficient Tuning of Pre-Trained Models
Author : mparrett
Score : 65 points
Date : 2023-04-10 16:38 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| whimsicalism wrote:
| I'm unsure of the value of dynamically reducing the rank of the
| LoRA matrix at inference time given that probably most of the
| parameter count comes from the original weights rather than the
| LoRA diff.
|
| But nonetheless, training time improvements look interesting.
|
| e: Oh I see, the training time improvement is compared to a grid
| search over the LoRA rank. Not for a single run.
|
| I am not convinced that you shouldn't just train on the highest
| possible rank that you can with your compute budget. If you can
| train a DyLoRA with rank 8, why not just train a LoRA with that
| rank?
| huevosabio wrote:
| Yea, this is interesting but I can't see the immediate value
| (not that there isn't).
|
| Maybe if the "optimal rank" of LoRA applies to any adaptation
| and you're interested in training multiple adaptations for
| different use cases?
| vladf wrote:
| The optimal rank could differ across layers
| whimsicalism wrote:
| I would be shocked if the "optimal rank" in terms of
| performance weren't just the maximum rank of the DyLoRA
| across all layers.
| vladf wrote:
| Err, I suppose trivially, the higher rank terms include the
| lower-rank subnets, so they dominate in terms of quality.
|
| But if you have some capacity constraint (e.g., memory, I
| guess?) then you can imagine dynamic rank allocation
| helping in the case where the maximum rank across all
| layers isn't within budget.
|
| It's a bit of a stretch though, I agree
| fancyfredbot wrote:
| When fine-tuning an LLM you can use the LoRA technique to make the
| fine-tuning faster. LoRA freezes the original weights and instead
| trains a small set of extra parameters: a low-rank update to each
| weight matrix, written as the product of two thin matrices whose
| inner dimension is the rank. The smaller the rank, the fewer
| parameters you train and the faster the fine-tuning. However, if
| you make the rank too small then quality will suffer. So you want
| to pick the optimal rank. This paper describes a technique which
| can be used to find the optimal rank more easily.
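|
| As a rough sketch (PyTorch, my own made-up names, not the paper's
| code), a LoRA layer looks something like this:
|
|   import torch
|   import torch.nn as nn
|
|   class LoRALinear(nn.Module):
|       # frozen base linear layer plus a trainable low-rank
|       # update scale * (B @ A)
|       def __init__(self, base, rank=8, alpha=16.0):
|           super().__init__()
|           self.base = base
|           for p in self.base.parameters():
|               p.requires_grad = False   # original weights frozen
|           d_out, d_in = base.weight.shape
|           # B starts at zero so training begins exactly at the
|           # base model
|           self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
|           self.B = nn.Parameter(torch.zeros(d_out, rank))
|           self.scale = alpha / rank
|
|       def forward(self, x):
|           # only A and B receive gradients
|           delta = (x @ self.A.T @ self.B.T) * self.scale
|           return self.base(x) + delta
|
| After training, only A and B need to be saved, and scale * (B @ A)
| can be merged back into the frozen weight so inference costs
| nothing extra.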
| lxe wrote:
| Kudos to the authors for providing the code
| https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA and the
| roberta example. Considering the current state of the OSS LLM
| community, I'm guessing someone is already porting it to Llama
| and gpt-style models.
| kernelsanderz wrote:
| Adding this to the huggingface peft library would be amazing.
| That's the main library that people using LoRA are currently
| using. https://github.com/huggingface/peft/issues/289
| vladf wrote:
| How does this technique differ from the supernet optimization for
| one-shot NAS? https://proceedings.mlr.press/v80/bender18a.html
|
| It seems like they use a fixed-distribution controller for
| training. It'd be nice to see why it's worth deviating from the
| original RL paradigm.
| whimsicalism wrote:
| It's very different, but hard to distill in a comment. They use
| a new regularization technique to basically create a LoRA with
| dynamically adjustable rank.
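|
| Roughly, though (a sketch of my reading of the paper, not their
| code, and the names are mine): at each training step you sample
| a rank b and only the first b components of the low-rank factors
| contribute, so after training the adapter still works when
| truncated to any rank up to the maximum.
|
|   import torch
|   import torch.nn as nn
|
|   class DyLoRALinear(nn.Module):
|       # LoRA factors trained so every truncation rank is usable
|       def __init__(self, base, max_rank=8, alpha=16.0):
|           super().__init__()
|           self.base = base
|           for p in self.base.parameters():
|               p.requires_grad = False   # base weights frozen
|           d_out, d_in = base.weight.shape
|           self.A = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)
|           self.B = nn.Parameter(torch.zeros(d_out, max_rank))
|           self.max_rank = max_rank
|           self.scale = alpha / max_rank
|
|       def forward(self, x, rank=None):
|           if rank is None and self.training:
|               # sample a truncation rank for this step
|               rank = int(torch.randint(1, self.max_rank + 1, ()))
|           rank = rank or self.max_rank
|           # only the first `rank` rows/cols of A and B are used
|           delta = x @ self.A[:rank].T @ self.B[:, :rank].T
|           return self.base(x) + delta * self.scale
|
| The paper has more detail on exactly which of those rows/columns
| get updated at each step, so treat this as the gist rather than
| the recipe.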
| turnsout wrote:
| So this can tune a model 7X faster than LoRA, which was already a
| massive speed boost? Curious to see what this will do to the
| LLaMA-derivative community in particular.
| whimsicalism wrote:
| 7x faster compared to grid-search LoRA for best rank.
|
| I am not convinced that the "best rank" is not just the highest
| possible with your compute budget, personally.
| sitkack wrote:
| What is the fastest way to show that?
| whimsicalism wrote:
| Fastest way to show what? That you should train with the
| maximum-sized LoRA you can? Because the only upside to
| having a smaller LoRA is the training time, and if you
| are already able to train a DyLoRA with max rank 8, then
| you should just train a LoRA with that rank.
| fancyfredbot wrote:
| You get diminishing returns as you increase the rank, so
| with a fixed training budget it's not clear whether you
| get the best return from increasing the rank vs increasing
| something else. If you start off by training a DyLoRA with
| max rank 8, you might see the returns diminish fast beyond,
| say, rank 5; then you can use rank 5 for the rest of your
| training. You wouldn't know that with LoRA. I think this is
| the idea behind the paper. If you are just going to spend
| your entire budget training a DyLoRA with max rank 8, then
| you're right, there's no advantage over LoRA with rank 8.
| You'd have to use the ability to assess multiple ranks in
| order to see some benefit.
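|
| e.g., hypothetically, reusing something like the DyLoRALinear
| sketch upthread: train once at the max rank, then sweep the
| truncation rank on a validation set and see where the returns
| flatten out.
|
|   import torch
|   import torch.nn as nn
|   import torch.nn.functional as F
|
|   torch.manual_seed(0)
|   x_val = torch.randn(256, 32)   # stand-in validation data
|   y_val = torch.randn(256, 32)
|   layer = DyLoRALinear(nn.Linear(32, 32), max_rank=8)
|   layer.eval()                   # pretend it's already trained
|   with torch.no_grad():
|       for r in range(1, 9):
|           loss = F.mse_loss(layer(x_val, rank=r), y_val)
|           print(f"rank {r}: val loss {loss.item():.4f}")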
| whimsicalism wrote:
| I can see that. But are we sure that a rank-based
| difference that doesn't manifest early in the training
| process won't manifest as you get further along? See also
| 'grokking' [0]
|
| [0]: https://arxiv.org/abs/2201.02177
| fancyfredbot wrote:
| Not sure there's any way to know beforehand whether that
| would happen, but the advantage of DyLoRA is that at least
| you will know afterwards whether you really needed the
| full rank, whereas with LoRA you wouldn't? In some cases
| that might not be valuable information, but I guess you'd
| rather know than not.
___________________________________________________________________
(page generated 2023-04-10 23:01 UTC)