[HN Gopher] LoRA Without Regret
___________________________________________________________________
LoRA Without Regret
Author : grantpitt
Score : 177 points
Date : 2025-09-29 17:52 UTC (5 days ago)
(HTM) web link (thinkingmachines.ai)
(TXT) w3m dump (thinkingmachines.ai)
| Yenrabbit wrote:
| Thinking Machines have put out a string of incredibly high-
| quality posts lately. Hard to oversell how much cred it's buying
| them with the AI research community! Keep up the great work folks
| sudohalt wrote:
| [flagged]
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us
| something._"
|
| https://news.ycombinator.com/newsguidelines.html
| mijoharas wrote:
| What else has there been? I've only seen this one (which is
| great!)
| joloooo wrote:
| Their "Defeating Nondeterminism in LLM Inference" post was
| interesting for me. Worth reading their others!
| _def wrote:
| Took me a moment to realize this is not about LoRa.
| ellisv wrote:
| I also mistook it to be about LoRa and not about LoRA
| chrystalkey wrote:
| I too fell victim to mistaking LoRA for LoRa
| logannyeMD wrote:
| Missed opportunity to title this "Lo-RAgrets"
| HumblyTossed wrote:
| The name gets me every single time. Always think it's going to be
| about radio LoRa
| dannyfritz07 wrote:
| Dang it! Got me too! I've been wanting to hop into Meshtastic
| lately.
| ijustlovemath wrote:
| Set up a node! Bare boards that work with the app are like
| $50 and take a few clicks to flash and set up. The basic
| antenna with no amp makes contacts up to 50mi away if the
| conditions are right. I have one in a window and one in a
| backpack at all times.
| jacquesm wrote:
| It's insane how far you can go between hops, really most
| impressive. Where I live the mesh density is fairly high
| but I've also tried it in places where it was vanishingly
| low and yet I never completely lost contact. LoRa is very
| much an underappreciated technology.
| wkjagt wrote:
| I have a couple of nodes up, but not seeing a lot of traffic
| mrandish wrote:
| Yeah, kinda disappointed it's just more AI stuff...
| canadiantim wrote:
| I thought it was the CRDT implementation, but then
| realized that one is Loro, not LoRA.
| halfmatthalfcat wrote:
| Same - sad it's not.
| moffkalast wrote:
| No such thing as LoRa and LoRaWAN without regret I'm afraid,
| all the range but no throughput.
| halfmatthalfcat wrote:
| You can do a lot with 255 bytes (SF5-8), just have to be
| creative :)
| sifar wrote:
| And I thought you were going to say thinking machines :). But
| yeah, LoRA trips me up too.
| papascrubs wrote:
| Not just me then. It's always the first thing that springs to
| mind.
| apple4ever wrote:
| Nope, not just you! Gets me every time.
| CaptainOfCoit wrote:
| Microsoft's inability to properly name things once again
| introduces more confusion than clarity, thanks Microsoft :)
|
| At this point I think they do it on purpose, as their
| metrics for "people visiting the website/repository" or
| whatever get a boost from people thinking the repository
| is about the existing concept/technology.
| dvfjsdhgfv wrote:
| By the way, some time ago when I checked there were two cool
| applications of LoRa: (1) a mesh, for (hopefully) truly
| decentralized and more difficult to disrupt communication, (2)
| a gateway, so that you could get data from your sensors in
| remote places via standard internet protocols.
|
| Both are very cool, but I wonder if I missed something else?
| eagsalazar2 wrote:
| Stupid website hijacks cmd-back-arrow.
| markisus wrote:
| Can someone explain the bit counting argument in the
| reinforcement learning part?
|
| I don't get why a trajectory would provide only one bit of
| information.
|
| Each step of the trajectory is at least giving information about
| what state transitions are possible.
|
| An infinitely long trajectory can explore the whole state space
| if there are no absorbing states. Such a trajectory would provide
| a massive amount of information about the system, even if we
| ignored the final reward.
| mountainriver wrote:
| A fair amount of research has shown that RL doesn't add
| knowledge to the base model; it just optimizes paths that
| already exist. That said, ProRL from Nvidia showed there are
| ways of adding knowledge, mostly through progressive merging.
|
| I'm still not fully convinced of the 1-bit claim; they made
| other mistakes in the blog post.
| navar wrote:
| I believe it's because of the way you measure things in RL:
| each episode only tells you whether it was good (say reward
| +1) or bad (say 0 or negative reward); it does not tell you
| anything about the trace that was produced to get the
| outcome. This reward is the only thing measured to produce
| your gradients, hence why the amount of info in it is O(1).
|
| This is in contrast to more "supervised" forms of learning,
| where you could get a loss for each token produced (e.g.
| cross entropy loss), and where you'd get, as a consequence,
| O(number of tokens) of information in your gradients.
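|
| Roughly, the contrast looks like this (a minimal PyTorch
| sketch; the tensor names and shapes are illustrative, not
| from the blog post):
|
|   import torch.nn.functional as F
|
|   # Supervised fine-tuning: one loss term per target token,
|   # so the gradient carries O(seq_len) pieces of feedback.
|   def sft_loss(logits, target_ids):
|       # logits: (batch, seq, vocab); target_ids: (batch, seq)
|       return F.cross_entropy(
|           logits.reshape(-1, logits.size(-1)),
|           target_ids.reshape(-1),
|       )
|
|   # Policy-gradient RL: the whole sampled trajectory is
|   # scored by one scalar reward, so each episode contributes
|   # O(1) bits of feedback to the gradient.
|   def pg_loss(sampled_token_logprobs, reward):
|       # sampled_token_logprobs: (seq,); reward: scalar
|       return -(reward * sampled_token_logprobs.sum())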
| mountainriver wrote:
| > LoRA works well when not capacity constrained, i.e., the number
| of trainable parameters exceeds the amount of information to be
| learned, which can be estimated in terms of dataset size
|
| I'm shocked they didn't look at progressive merging of LoRAs.
| Research shows that's the best way of improving LoRA's ability
| to model higher-level features.
|
| Seems like a massive miss, not to mention there is other
| research that contradicts a lot of their findings. This feels a
| bit like a researcher's first pass at learning LoRA.
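|
| To put rough numbers on the quoted capacity claim, a
| back-of-the-envelope sketch (the model dimensions and the
| bits-per-parameter figure are assumptions for illustration,
| not the blog's numbers):
|
|   # LoRA trainable-parameter count, roughly 7B-model-shaped
|   d_model  = 4096   # hidden size (assumed)
|   n_layers = 32     # transformer blocks (assumed)
|   adapted  = 2      # matrices adapted per block, e.g. q/v proj
|   rank     = 16
|
|   # Each adapted d x d matrix gets A (r x d) plus B (d x r).
|   lora_params = n_layers * adapted * rank * (d_model + d_model)
|   print(f"{lora_params:,} trainable params")   # 8,388,608
|
|   # At a rough ~2 bits absorbed per parameter, that is ~17 Mbit
|   # of capacity; compare against an estimate of the information
|   # in your dataset to guess whether you are in the
|   # capacity-constrained regime the post describes.
|   print(f"~{2 * lora_params / 1e6:.0f} Mbit")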
| yenepho wrote:
| I am curious, would you mind sharing a citation?
| Mkengin wrote:
| https://arxiv.org/abs/2311.13600
|
| https://arxiv.org/abs/2410.22911
|
| https://arxiv.org/abs/2409.16167
| mountainriver wrote:
| Don't forget ReLoRA! https://arxiv.org/abs/2307.05695
| let_tim_cook_ wrote:
| I'm not sure why progressive LoRA merging needs to be addressed
| here. They show there is a regime of problems where LoRA
| performs equivalently to full fine-tuning.
|
| Progressive merging of LoRAs is somewhere in between and
| categorically more complex than plain LoRA, so it would be
| dominated by standard LoRA in that regime.
|
| While progressive merging could train faster since fewer params
| are trainable at any given time, it results in very large
| adapter diffs, on the order of the size of the original model,
| and I don't think it retains the benefit of being able to
| deploy multiple adapters over the same base model.
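|
| For concreteness: the deployment benefit being lost is that a
| vanilla adapter is just a low-rank delta you can either fold
| into the base weights or keep separate and swap per request.
| A tiny PyTorch sketch (function name is illustrative):
|
|   def merge_adapter(w_base, lora_A, lora_B, alpha, r):
|       # Fold one LoRA adapter into the base weights:
|       # W_merged = W + (alpha / r) * B @ A
|       return w_base + (alpha / r) * (lora_B @ lora_A)
|
|   # Keeping adapters unmerged instead lets one server hold a
|   # single copy of W and apply a different small (B, A) pair
|   # per request, which progressive merging gives up.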
| raaron773 wrote:
| The number of people who mistook this for long-range radio and
| were disappointed when it wasn't is way too damn high. (This
| includes me.)
| ineedasername wrote:
| It might be useful to use this thread in a dataset to train a
| LoRA so that LLM agents can more easily disambiguate the great
| LoRa acronym collision of '25. No longer will future
| generations suffer the indignity of either/or/both confusions.
| kouteiheika wrote:
| > However, the literature is unclear on how well LoRA performs
| relative to FullFT.
|
| I think the literature is clear on that?
|
| "LoRA vs Full Fine-tuning: An Illusion of Equivalence" --
| https://arxiv.org/abs/2410.21228v1
|
| Quoting from the conclusions:
|
| > The paper describes the finding that LoRA and full fine-tuning,
| with equal performance on the fine-tuning task, can have
| solutions with very different generalization behaviors outside
| the fine-tuning task distribution. We found that LoRA and full
| fine-tuning yield models with significant differences in the
| spectral properties of their weight matrices: LoRA models often
| contain "intruder dimensions", high-ranking singular vectors
| approximately orthogonal to the singular vectors of pre-trained
| weight matrices. The existence of intruder dimensions correlates
| with the fine-tuned model forgetting more of the pre-training
| distribution as well as forgetting more when trained on tasks
| sequentially in a continual learning setup.
|
| I'm surprised they didn't cite this; it's a well known paper.
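|
| If you want to poke at the intruder-dimension claim yourself,
| a rough sketch of the check (PyTorch; the top_k and similarity
| threshold are my own illustrative choices, not the paper's
| exact procedure):
|
|   import torch
|
|   def count_intruders(w_pre, w_ft, top_k=10, sim_thresh=0.6):
|       # Top singular vectors of the fine-tuned matrix that are
|       # nearly orthogonal to every singular vector of the
|       # pre-trained matrix: a crude proxy for "intruders".
|       u_pre, _, _ = torch.linalg.svd(w_pre, full_matrices=False)
|       u_ft, _, _ = torch.linalg.svd(w_ft, full_matrices=False)
|       intruders = 0
|       for i in range(top_k):
|           sims = (u_pre.T @ u_ft[:, i]).abs()
|           if sims.max() < sim_thresh:
|               intruders += 1
|       return intruders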
| adhi01 wrote:
| To say that the 'literature is clear on that' while citing a
| single paper, which has been rejected from ICLR, is a bit of an
| overstatement.
| muragekibicho wrote:
| Thanks for this comment.
| kouteiheika wrote:
| > which has been rejected from ICLR
|
| Oh, you mean rejected just like these papers?
|
| Efficient Estimation of Word Representations in Vector
| Space[1], one of the most influential papers in the space
| with tens of thousands of citations[2]? Or the RoBERTa[3]
| paper (dramatically improved upon BERT; RoBERTa and derived
| models currently have tens of millions of downloads on HF and
| still serve as a reliable industry workhorse)? Or the Mamba
| paper[4] (pretty much the only alternative to transformers
| that actually gets used)? Do you want me to keep going?
|
| Honestly, I find that whether a paper gets rejected or not
| means diddly squat considering how broken the review system
| is, and how many honestly terrible papers I have to wade
| through every time I'm looking through the conference
| submissions for anything good.
|
| [1] -- https://openreview.net/forum?id=idpCdOWtqXd60
|
| [2] --
| https://scholar.google.com/scholar?cites=7447715766504981253
|
| [3] -- https://openreview.net/forum?id=SyxS0T4tvS
|
| [4] -- https://openreview.net/forum?id=AL1fq05o7H
| p1esk wrote:
| Even that paper itself does not provide any "clear"
| conclusions about which method is better.
| lelanthran wrote:
| > I'm surprised they didn't cite this; it's a well known paper.
|
| I'm surprised you copied and pasted all of that without
| explaining what it means.
|
| Does LoRA perform worse than, better than, or statistically
| indistinguishably from FullFT?
|
| You aren't able to tell from what you pasted, are you?
| crimsoneer wrote:
| If you're going to be snarky, could you at least clarify what
| the answer is for those of us who don't stay on top of ML
| research...?
| p1esk wrote:
| The paper does not make any clear conclusions about LoRA vs
| FullFT performance, beyond "the two methods seem to be
| learning different things".
| lelanthran wrote:
| > If you're going to be snarky, could you at least clarify
| what the answer is for those of us who don't stay on top of
| ML research...?
|
| The answer is "There's a difference, _perhaps_ ", but the
| GP appeared to imply that LoRA performed worse.
|
| My understanding is that _that_ paper found differences,
| but did not conclude that the differences were quantifiably
| better or worse, but this is not what GP 's post implied.
| cheald wrote:
| Standard LoRA (W_delta = B@A with standard inits) generally
| underperforms FT, primarily because of "intruder dimensions"
| (new high-ranking singular vectors which misalign with the
| singular vectors of the underlying weights) as outlined in
| the paper.
|
| There are techniques like PiCa and SVFT which can mitigate
| much of the loss, though.
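|
| For reference, the parametrization in question is tiny. A
| minimal PyTorch sketch (the random-A / zero-B init below is
| the common convention, not necessarily the post's exact setup):
|
|   import math
|   import torch
|   import torch.nn as nn
|
|   class LoRALinear(nn.Module):
|       # Frozen base layer plus rank-r update (alpha/r) * B @ A
|       def __init__(self, base, r=16, alpha=32):
|           super().__init__()
|           self.base = base
|           for p in self.base.parameters():
|               p.requires_grad_(False)   # freeze pre-trained W
|           self.A = nn.Parameter(
|               torch.empty(r, base.in_features))
|           self.B = nn.Parameter(
|               torch.zeros(base.out_features, r))
|           # B starts at zero, so the initial delta is zero
|           nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
|           self.scale = alpha / r
|
|       def forward(self, x):
|           delta = (x @ self.A.T) @ self.B.T
|           return self.base(x) + self.scale * delta
|
| Intruder dimensions show up when that learned B @ A term
| introduces large singular directions that don't line up with
| the singular vectors of the frozen W.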
| richardvsu wrote:
| Why would they cite a paper that doesn't help the case for
| their Tinker API, which was released soon after? :)
| rco8786 wrote:
| I've been curious about LoRA and find a lot of these articles
| interesting. But I've been unable to find a good "LoRA for
| idiots" kind of starting point that gets me started actually
| doing some training with my data. Anybody know of a more
| practical guide I could use for that?
| CaptainOfCoit wrote:
| Unsloth's documentation probably gets as close to practical as
| it can get:
| https://docs.unsloth.ai/get-started/fine-tuning-llms-guide
|
| Be sure to validate everything you're reading, though; as of
| late I've come across more and more things that don't seem 100%
| accurate in their docs. It seems to heavily depend on the
| section.
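|
| If you just want the shape of a minimal run, the HF peft route
| looks roughly like this (a sketch; the model name and the
| hyperparameters are placeholders, not recommendations):
|
|   from peft import LoraConfig, get_peft_model
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   base = "Qwen/Qwen2.5-0.5B"   # placeholder small causal LM
|   tokenizer = AutoTokenizer.from_pretrained(base)
|   model = AutoModelForCausalLM.from_pretrained(base)
|
|   # Rank-16 adapters on the attention projections only;
|   # everything else stays frozen.
|   cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
|                    target_modules=["q_proj", "v_proj"],
|                    task_type="CAUSAL_LM")
|   model = get_peft_model(model, cfg)
|   model.print_trainable_parameters()  # sanity check
|
| From there any standard Trainer or TRL SFT loop over your own
| dataset completes the picture.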
| ijk wrote:
| My sense is they need to go back and update previous docs;
| they release a lot of software updates and a lot of notebooks
| showing how to use the features, but the two might fall out
| of sync. Would that match your observations?
| sgt101 wrote:
| Question for dudes building modern NNs... what's the thinking on
| estimating structural capacity for a real-world problem? How
| should I estimate how many parameters to choose for the model?
| p1esk wrote:
| You test different models on your real world problem, and pick
| the smallest one that works.
| sgt101 wrote:
| I just think that there has to be some heuristic..
| BoorishBears wrote:
| Closest thing to a heuristic is trying the task with
| non-fine-tuned models and building an intuition for how far
| off each model is, what directions it's off in, and how
| easily you can improve in that direction via fine-tuning.
|
| For example, for classification, if the model is
| hallucinating semantically similar but not technically
| valid classes, you can probably fine-tune your way out of
| the gap with a smaller model.
|
| But if your task requires world knowledge, you likely need
| a larger model. It's not cheap, efficient, or generally
| useful to fine-tune for additional world knowledge
| directly.
| _spduchamp wrote:
| Well, since we all thought this was about Meshtastic stuff, let's
| just give in and make this the radio/Meshtastic comment thread.
|
| Stumbled on this today... https://hackerpager.net/
|
| I really want something like this with a flip-out keyboard that
| could do Signal over LTE/WiFi.
| lewtun wrote:
| For those interested in playing with an implementation of these
| ideas, my colleagues at HF made some recipes here:
| https://github.com/huggingface/trl/blob/main/docs/source/lor...
___________________________________________________________________
(page generated 2025-10-04 23:01 UTC)