[HN Gopher] DyLoRA: Parameter Efficient Tuning of Pre-Trained Models
       ___________________________________________________________________
        
       DyLoRA: Parameter Efficient Tuning of Pre-Trained Models
        
       Author : mparrett
       Score  : 65 points
       Date   : 2023-04-10 16:38 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | whimsicalism wrote:
        | I'm unsure of the value of dynamically reducing the rank of the
        | LoRA matrix at inference time, given that most of the parameter
        | count probably comes from the original weights rather than the
        | LoRA diff.
       | 
       | But nonetheless, training time improvements look interesting.
       | 
       | e: Oh I see, the training time improvement is compared to a grid
       | search over the LoRA rank. Not for a single run.
       | 
        | I am not convinced that you shouldn't just train at the highest
        | possible rank that you can with your compute budget. If you can
        | train a DyLoRA with rank 8, why not just train a LoRA with that
        | rank?
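        | 
        | As a rough illustration of the first point, here is a back-of-
        | envelope sketch (hypothetical LLaMA-7B-like sizes, attention
        | projections only) of how small the adapter is relative to the
        | base weights:
        | 
        |     # assumed sizes for illustration, not taken from the paper
        |     d_model, n_layers = 4096, 32
        |     # four d_model x d_model attention projections per layer
        |     base_params = n_layers * 4 * d_model * d_model
        |     for r in (1, 4, 8):
        |         # one LoRA pair (A: r x d_model, B: d_model x r) each
        |         lora_params = n_layers * 4 * 2 * r * d_model
        |         print(f"rank {r}: {lora_params / base_params:.4%} of base")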
        
         | huevosabio wrote:
          | Yea, this is interesting but I can't see the immediate value
          | (not that there isn't any).
         | 
          | Maybe if the "optimal rank" of LoRA applies to any adaptation
          | and you're interested in training multiple adaptations for
          | different use cases?
        
         | vladf wrote:
         | The optimal rank could differ across layers
        
           | whimsicalism wrote:
            | I would be shocked if the "optimal rank" in terms of
            | performance were anything other than the maximum rank of the
            | DyLoRA across all layers.
        
             | vladf wrote:
              | Err, I suppose that, trivially, the higher-rank terms
              | include the lower-rank subnets, so they dominate in terms
              | of quality.
             | 
             | But if you have some capacity constraint (e.g., memory, I
             | guess?) then you can imagine dynamic rank allocation
             | helping in the case where the maximum rank across all
             | layers isn't within budget.
             | 
             | It's a bit of a stretch though, I agree
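              | 
              | A toy sketch of that budgeted idea (the layer names and
              | gain numbers are made up for illustration): given the
              | marginal validation gain of each extra rank unit per
              | layer, greedily spend a total rank budget wherever the
              | next unit buys the most.
              | 
              |     # hypothetical marginal gains per extra rank unit
              |     gains = {"layer0": [0.50, 0.10, 0.02],
              |              "layer1": [0.30, 0.25, 0.20]}
              |     budget = 4  # total rank units across layers
              |     ranks = {name: 0 for name in gains}
              |     for _ in range(budget):
              |         best = max(gains, key=lambda n: gains[n][ranks[n]]
              |                    if ranks[n] < len(gains[n]) else -1.0)
              |         ranks[best] += 1
              |     print(ranks)  # {'layer0': 1, 'layer1': 3}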
        
       | fancyfredbot wrote:
        | When fine-tuning an LLM you can use the LoRA technique to make
        | the fine-tuning faster. LoRA freezes the pre-trained weights and
        | trains only a small set of extra parameters: a low-rank update
        | to each weight matrix, parameterized as the product of two thin
        | matrices. The size of that update is determined by the rank. The
        | smaller the rank, the faster the fine-tuning. However, if you
        | make the rank too small then quality will suffer, so you want to
        | pick the optimal rank. This paper describes a technique which
        | can be used to find the optimal rank more easily.
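        | 
        | A minimal sketch of that idea in PyTorch (the names,
        | initialization, and scaling are illustrative, not the reference
        | implementation):
        | 
        |     import torch
        |     import torch.nn as nn
        |     
        |     class LoRALinear(nn.Module):
        |         def __init__(self, d_in, d_out, rank=8, alpha=16):
        |             super().__init__()
        |             self.base = nn.Linear(d_in, d_out, bias=False)
        |             self.base.weight.requires_grad_(False)  # frozen
        |             # trainable low-rank factors of the update B @ A
        |             self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        |             self.B = nn.Parameter(torch.zeros(d_out, rank))
        |             self.scale = alpha / rank
        |     
        |         def forward(self, x):
        |             # frozen output plus the scaled low-rank update
        |             return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)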
        
       | lxe wrote:
        | Kudos to the authors for providing the code
       | https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA and the
       | roberta example. Considering the current state of the OSS LLM
       | community, I'm guessing someone is already porting it to Llama
       | and gpt-style models.
        
         | kernelsanderz wrote:
          | Adding this to the huggingface peft library would be amazing.
          | That's the main library people currently use for LoRA.
          | https://github.com/huggingface/peft/issues/289
        
       | vladf wrote:
       | How does this technique differ from the supernet optimization for
       | one-shot NAS? https://proceedings.mlr.press/v80/bender18a.html
       | 
       | It seems like they use a fixed-distribution controller for
       | training. It'd be nice to see why it's worth deviating from the
       | original RL paradigm.
        
         | whimsicalism wrote:
         | It's very different, but hard to distill in a comment. They use
         | a new regularization technique to basically create a LoRA with
         | dynamically adjustable rank.
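          | 
          | Roughly, training samples a truncation rank each step and
          | only the leading components of the adapter are used, so every
          | prefix stays usable on its own. A sketch of that idea (the
          | uniform sampler and names are assumptions, not the paper's
          | exact scheme):
          | 
          |     import torch
          |     import torch.nn as nn
          |     
          |     class DyLoRALinear(nn.Module):
          |         def __init__(self, d_in, d_out, max_rank=8):
          |             super().__init__()
          |             self.base = nn.Linear(d_in, d_out, bias=False)
          |             self.base.weight.requires_grad_(False)  # frozen
          |             self.A = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)
          |             self.B = nn.Parameter(torch.zeros(d_out, max_rank))
          |             self.max_rank = max_rank
          |     
          |         def forward(self, x, rank=None):
          |             if rank is None:  # training: sample a rank per step
          |                 rank = torch.randint(1, self.max_rank + 1, (1,)).item()
          |             # nested truncation to the leading `rank` components
          |             return self.base(x) + x @ self.A[:rank].T @ self.B[:, :rank].T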
        
       | turnsout wrote:
       | So this can tune a model 7X faster than LoRA, which was already a
       | massive speed boost? Curious to see what this will do to the
       | LLaMA-derivative community in particular.
        
         | whimsicalism wrote:
          | 7x faster compared to a grid search over LoRA ranks to find
          | the best one, not compared to a single LoRA run.
         | 
         | I am not convinced that the "best rank" is not just the highest
         | possible with your compute budget, personally.
        
           | sitkack wrote:
           | What is the fastest way to show that?
        
             | whimsicalism wrote:
              | Fastest way to show what? That you should train with the
              | maximum-sized LoRA you can? Because the only upside to
              | having a smaller LoRA is in the training time, and if you
              | are already able to train a DyLoRA with max rank 8, then
              | you should just train a LoRA with that rank.
        
               | fancyfredbot wrote:
                | You get diminishing returns as you increase the rank, so
                | with a fixed training budget it's not clear whether you
                | get the best return from increasing rank vs. increasing
                | something else. If you start off by training a DyLoRA
                | with max rank 8, you might see returns diminish fast
                | beyond, say, rank 5; then you can use rank 5 for the
                | rest of your training. You wouldn't know that with LoRA.
                | I think this is the idea behind the paper. If you are
                | just going to use your entire budget training a DyLoRA
                | with max rank 8 then you're right, there's no advantage
                | over LoRA with rank 8. You'd have to use the ability to
                | assess multiple ranks in order to see some benefit.
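                | 
                | Concretely, that post-training rank sweep could look
                | like this (reusing the DyLoRALinear sketch from earlier
                | in the thread; the data and metric are stand-ins):
                | 
                |     import torch
                |     
                |     layer = DyLoRALinear(512, 512, max_rank=8)
                |     # ... fine-tune `layer` with sampled ranks ...
                |     x_val = torch.randn(64, 512)  # stand-in batch
                |     with torch.no_grad():
                |         for r in range(1, layer.max_rank + 1):
                |             out = layer(x_val, rank=r)  # truncated
                |             # in practice: score `out` on the downstream
                |             # metric and pick the smallest rank past
                |             # which returns diminish
                |             print(r, out.norm().item())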
        
               | whimsicalism wrote:
               | I can see that. But are we sure that a rank-based
               | difference that doesn't manifest early in the training
               | process won't manifest as you get further along? See also
               | 'grokking' [0]
               | 
               | [0]: https://arxiv.org/abs/2201.02177
        
               | fancyfredbot wrote:
                | Not sure there's any way to know beforehand whether that
                | would happen, but the advantage of DyLoRA is that at
                | least you will know afterwards whether you really needed
                | the full rank, whereas with LoRA you wouldn't. In some
                | cases that might not be valuable information, but I
                | guess you'd rather know than not.
        
       ___________________________________________________________________
       (page generated 2023-04-10 23:01 UTC)