[HN Gopher] LoRA Without Regret
___________________________________________________________________
LoRA Without Regret
Author : grantpitt
Score : 177 points
Date : 2025-09-29 17:52 UTC (5 days ago)
(HTM) web link (thinkingmachines.ai)
(TXT) w3m dump (thinkingmachines.ai)
| Yenrabbit wrote:
| Thinking Machines have put out a string of incredibly high-
| quality posts lately. Hard to oversell how much cred it's buying
| them with the AI research community! Keep up the great work folks
| sudohalt wrote:
| [flagged]
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us
| something._"
|
| https://news.ycombinator.com/newsguidelines.html
| mijoharas wrote:
| What else has there been? I've only seen this one (which is
| great!)
| joloooo wrote:
| Their "Defeating Nondeterminism in LLM Inference" post was
| interesting for me. Worth reading their others!
| _def wrote:
| Took me a moment to realize this is not about LoRa.
| ellisv wrote:
| I also mistook it to be about LoRa and not about LoRA
| chrystalkey wrote:
| I too fell victim to mistaking LoRA for LoRa
| logannyeMD wrote:
| Missed opportunity to title this "Lo-RAgrets"
| HumblyTossed wrote:
| The name gets me every single time. Always think it's going to be
| about radio LoRa
| dannyfritz07 wrote:
| Dang it! Got me too! I've been wanting to hop into Meshtastic
| lately.
| ijustlovemath wrote:
| Set up a node! Bare boards that work with the app are like
| $50 and take a few clicks to flash and set up. The basic
| antenna with no amp makes contacts up to 50mi away if the
| conditions are right. I have one in a window and one in a
| backpack at all times.
| jacquesm wrote:
| It's insane how far you can go between hops, really most
| impressive. Where I live the mesh density is fairly high
| but I've also tried it in places where it was vanishingly
| low and yet I never completely lost contact. LoRa is very
| much an underappreciated technology.
| wkjagt wrote:
| I have a couple of nodes up, but not seeing a lot of traffic
| mrandish wrote:
| Yeah, kinda disappointed it's just more AI stuff...
| canadiantim wrote:
| I thought it was the CRDT implementation, but then
| realized that one is Loro, not LoRA.
| halfmatthalfcat wrote:
| Same - sad it's not.
| moffkalast wrote:
| No such thing as LoRa and LoRaWAN without regret I'm afraid,
| all the range but no throughput.
| halfmatthalfcat wrote:
| You can do a lot with 255 bytes (SF5-8), just have to be
| creative :)
| sifar wrote:
| And I thought you were going to say thinking machines :). But
| yeah, LoRA trips me up too.
| papascrubs wrote:
| Not just me then. It's always the first thing that springs to
| mind.
| apple4ever wrote:
| Nope, not just you! Gets me every time.
| CaptainOfCoit wrote:
| Microsoft's inability to properly name things once again
| introduces more confusion than clarity, thanks Microsoft :)
|
| At this point I think they do it on purpose, as their
| metrics for "people visiting the website/repository" or
| whatever get a boost from people thinking the repository
| is about the existing concept/technology.
| dvfjsdhgfv wrote:
| By the way, some time ago when I checked there were two cool
| applications of LoRa: (1) a mesh, for (hopefully) truly
| decentralized and more difficult to disrupt communication, (2)
| a gateway, so that you could get data from your sensors in
| remote places via standard internet protocols.
|
| Both are very cool, but I wonder if I missed something else?
| eagsalazar2 wrote:
| Stupid website hijacks cmd-back-arrow.
| markisus wrote:
| Can someone explain the bit counting argument in the
| reinforcement learning part?
|
| I don't get why a trajectory would provide only one bit of
| information.
|
| Each step of the trajectory is at least giving information about
| what state transitions are possible.
|
| An infinitely long trajectory can explore the whole state space
| if there are no absorbing states. Such a trajectory would provide
| a massive amount of information about the system, even if we
| ignored the final reward.
| mountainriver wrote:
| A fair amount of research has shown that RL doesn't add
| knowledge to the base model; it just optimizes paths that
| already exist. That said, ProRL from Nvidia showed there are
| ways of adding knowledge, mostly through progressive merging.
|
| I'm still not fully convinced of the 1-bit claim; they made
| other mistakes in the blog post.
| navar wrote:
| I believe it's because of the way you measure things in RL:
| each episode only tells you whether it was good (say reward
| +1) or bad (say 0 or negative reward); it does not tell you
| anything about the trace that was produced to get the
| outcome. This reward is the only thing measured to produce
| your gradients, hence why the amount of info in it is O(1).
|
| This is in contrast to more "supervised" forms of learning,
| where you could get a loss for each token produced (e.g.
| cross entropy loss), and where you'd get, as a consequence,
| O(number of tokens) of information in your gradients.
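|
| Roughly, the contrast looks like this (a minimal PyTorch
| sketch; the tensor names and shapes are illustrative, not
| from the blog post):
|
|   import torch.nn.functional as F
|
|   # Supervised fine-tuning: one loss term per target token,
|   # so the gradient carries O(seq_len) pieces of feedback.
|   def sft_loss(logits, target_ids):
|       # logits: (batch, seq, vocab); target_ids: (batch, seq)
|       return F.cross_entropy(
|           logits.reshape(-1, logits.size(-1)),
|           target_ids.reshape(-1),
|       )
|
|   # Policy-gradient RL: the whole sampled trajectory is
|   # scored by one scalar reward, so each episode contributes
|   # O(1) bits of feedback to the gradient.
|   def pg_loss(sampled_token_logprobs, reward):
|       # sampled_token_logprobs: (seq,); reward: scalar
|       return -(reward * sampled_token_logprobs.sum())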
| mountainriver wrote:
| > LoRA works well when not capacity constrained, i.e., the number
| of trainable parameters exceeds the amount of information to be
| learned, which can be estimated in terms of dataset size
|
| I'm shocked they didn't look at progressive merging of LoRAs.
| Research shows that's the best way of improving LoRA's ability
| to model higher-level features.
|
| Seems like a massive miss, not to mention there is other
| research that contradicts a lot of their findings. This feels a
| bit like a researcher's first pass at learning LoRA.
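|
| To put rough numbers on the quoted capacity claim, a
| back-of-the-envelope sketch (the model dimensions and the
| bits-per-parameter figure are assumptions for illustration,
| not the blog's numbers):
|
|   # LoRA trainable-parameter count, roughly 7B-model-shaped
|   d_model  = 4096   # hidden size (assumed)
|   n_layers = 32     # transformer blocks (assumed)
|   adapted  = 2      # matrices adapted per block, e.g. q/v proj
|   rank     = 16
|
|   # Each adapted d x d matrix gets A (r x d) plus B (d x r).
|   lora_params = n_layers * adapted * rank * (d_model + d_model)
|   print(f"{lora_params:,} trainable params")   # 8,388,608
|
|   # At a rough ~2 bits absorbed per parameter, that is ~17 Mbit
|   # of capacity; compare against an estimate of the information
|   # in your dataset to guess whether you are in the
|   # capacity-constrained regime the post describes.
|   print(f"~{2 * lora_params / 1e6:.0f} Mbit")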
| yenepho wrote:
| I am curious, would you mind sharing a citation?
| Mkengin wrote:
| https://arxiv.org/abs/2311.13600
|
| https://arxiv.org/abs/2410.22911
|
| https://arxiv.org/abs/2409.16167
| mountainriver wrote:
| Don't forget ReLoRA! https://arxiv.org/abs/2307.05695
| let_tim_cook_ wrote:
| I'm not sure why progressive LoRA merging needs to be addressed
| here. They show there is a regime of problems where LoRA
| performs equivalently to full fine-tuning.
|
| Progressive merging of LoRAs is somewhere in between and
| categorically more complex than plain LoRA, so it would be
| dominated by standard LoRA in that regime.
|
| While progressive merging could train faster since fewer params
| are trainable at any given time, it results in very large
| adapter diffs, on the order of the size of the original model,
| and I don't think it retains the benefit of being able to
| deploy multiple adapters over the same base model.
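|
| For concreteness: the deployment benefit being lost is that a
| vanilla adapter is just a low-rank delta you can either fold
| into the base weights or keep separate and swap per request.
| A tiny PyTorch sketch (function name is illustrative):
|
|   def merge_adapter(w_base, lora_A, lora_B, alpha, r):
|       # Fold one LoRA adapter into the base weights:
|       # W_merged = W + (alpha / r) * B @ A
|       return w_base + (alpha / r) * (lora_B @ lora_A)
|
|   # Keeping adapters unmerged instead lets one server hold a
|   # single copy of W and apply a different small (B, A) pair
|   # per request, which progressive merging gives up.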
| raaron773 wrote:
| The number of people who mistook this for long-range radio and
| were disappointed when it wasn't is way too damn high. (This
| includes me.)
| ineedasername wrote:
| It might be useful to use this thread in a dataset to train a
| LoRA so that LLM agents can more easily disambiguate the great
| LoRa acronym collision of '25. No longer will future
| generations suffer the indignity of either/or/both confusions.
| kouteiheika wrote:
| > However, the literature is unclear on how well LoRA performs
| relative to FullFT.
|
| I think the literature is clear on that?
|
| "LoRA vs Full Fine-tuning: An Illusion of Equivalence" --
| https://arxiv.org/abs/2410.21228v1
|
| Quoting from the conclusions:
|
| > The paper describes the finding that LoRA and full fine-tuning,
| with equal performance on the fine-tuning task, can have
| solutions with very different generalization behaviors outside
| the fine-tuning task distribution. We found that LoRA and full
| fine-tuning yield models with significant differences in the
| spectral properties of their weight matrices: LoRA models often
| contain "intruder dimensions", high-ranking singular vectors
| approximately orthogonal to the singular vectors of pre-trained
| weight matrices. The existence of intruder dimensions correlates
| with the fine-tuned model forgetting more of the pre-training
| distribution as well as forgetting more when trained on tasks
| sequentially in a continual learning setup.
|
| I'm surprised they didn't cite this; it's a well known paper.
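|
| If you want to poke at the intruder-dimension claim yourself,
| a rough sketch of the check (PyTorch; the top_k and similarity
| threshold are my own illustrative choices, not the paper's
| exact procedure):
|
|   import torch
|
|   def count_intruders(w_pre, w_ft, top_k=10, sim_thresh=0.6):
|       # Top singular vectors of the fine-tuned matrix that are
|       # nearly orthogonal to every singular vector of the
|       # pre-trained matrix: a crude proxy for "intruders".
|       u_pre, _, _ = torch.linalg.svd(w_pre, full_matrices=False)
|       u_ft, _, _ = torch.linalg.svd(w_ft, full_matrices=False)
|       intruders = 0
|       for i in range(top_k):
|           sims = (u_pre.T @ u_ft[:, i]).abs()
|           if sims.max() < sim_thresh:
|               intruders += 1
|       return intruders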
| adhi01 wrote:
| To say that the 'literature is clear on that' while citing a
| single paper, which has been rejected from ICLR, is a bit of an
| overstatement.
| muragekibicho wrote:
| Thanks for this comment.
| kouteiheika wrote:
| > which has been rejected from ICLR
|
| Oh, you mean rejected just like these papers?
|
| Efficient Estimation of Word Representations in Vector
| Space[1], one of the most influential papers in the space
| with tens of thousands of citations[2]? Or the RoBERTa[3]
| paper (dramatically improved upon BERT; RoBERTa and derived
| models currently have tens of millions of downloads on HF and
| still serve as a reliable industry workhorse)? Or the Mamba
| paper[4] (pretty much the only alternative to transformers
| that actually gets used)? Do you want me to keep going?
|
| Honestly, I find that whether a paper gets rejected or not
| means diddly squat considering how broken the review system
| is, and how many honestly terrible papers I have to wade
| through every time I'm looking through the conference
| submissions for anything good.
|
| [1] -- https://openreview.net/forum?id=idpCdOWtqXd60
|
| [2] --
| https://scholar.google.com/scholar?cites=7447715766504981253
|
| [3] -- https://openreview.net/forum?id=SyxS0T4tvS
|
| [4] -- https://openreview.net/forum?id=AL1fq05o7H
| p1esk wrote:
| Even that paper itself does not provide any "clear"
| conclusions about which method is better.
| lelanthran wrote:
| > I'm surprised they didn't cite this; it's a well known paper.
|
| I'm surprised you copied and pasted all of that without
| explaining what it means.
|
| Does LoRA perform worse than, better than, or statistically
| indistinguishably from FullFT?
|
| You aren't able to tell from what you pasted, are you?
| crimsoneer wrote:
| If you're going to be snarky, could you at least clarify what
| the answer is for those of us who don't stay on top of ML
| research...?
| p1esk wrote:
| The paper does not make any clear conclusions about LoRA vs
| FullFT performance, beyond "the two methods seem to be
| learning different things".
| lelanthran wrote:
| > If you're going to be snarky, could you at least clarify
| what the answer is for those of us who don't stay on top of
| ML research...?
|
| The answer is "There's a difference, _perhaps_ ", but the
| GP appeared to imply that LoRA performed worse.
|
| My understanding is that _that_ paper found differences,
| but did not conclude that the differences were quantifiably
| better or worse, but this is not what GP 's post implied.
| cheald wrote:
| Standard LoRA (W_delta = B@A with standard inits) generally
| underperforms FT, primarily because of "intruder dimensions"
| (new high-ranking singular vectors which misalign with the
| singular vectors of the underlying weights) as outlined in
| the paper.
|
| There are techniques like PiCa and SVFT which can mitigate
| much of the loss, though.
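|
| For reference, the parametrization in question is tiny. A
| minimal PyTorch sketch (the random-A / zero-B init below is
| the common convention, not necessarily the post's exact setup):
|
|   import math
|   import torch
|   import torch.nn as nn
|
|   class LoRALinear(nn.Module):
|       # Frozen base layer plus rank-r update (alpha/r) * B @ A
|       def __init__(self, base, r=16, alpha=32):
|           super().__init__()
|           self.base = base
|           for p in self.base.parameters():
|               p.requires_grad_(False)   # freeze pre-trained W
|           self.A = nn.Parameter(
|               torch.empty(r, base.in_features))
|           self.B = nn.Parameter(
|               torch.zeros(base.out_features, r))
|           # B starts at zero, so the initial delta is zero
|           nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
|           self.scale = alpha / r
|
|       def forward(self, x):
|           delta = (x @ self.A.T) @ self.B.T
|           return self.base(x) + self.scale * delta
|
| Intruder dimensions show up when that learned B @ A term
| introduces large singular directions that don't line up with
| the singular vectors of the frozen W.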
| richardvsu wrote:
| Why would they cite a paper that doesn't help the case for
| their Tinker API, which was released soon after? :)
| rco8786 wrote:
| I've been curious about LoRA and find a lot of these articles
| interesting. But I've been unable to find a good "LoRA for
| idiots" kind of starting point that gets me started actually
| doing some training with my data. Anybody know of a more
| practical guide I could use for that?
| CaptainOfCoit wrote:
| Unsloth's documentation probably gets as close to practical as
| it can get:
| https://docs.unsloth.ai/get-started/fine-tuning-llms-guide
|
| Be sure to validate everything you're reading, though; as of
| late I've come across more and more things that don't seem 100%
| accurate in their docs. It seems to heavily depend on the
| section.
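|
| If you just want the shape of a minimal run, the HF peft route
| looks roughly like this (a sketch; the model name and the
| hyperparameters are placeholders, not recommendations):
|
|   from peft import LoraConfig, get_peft_model
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   base = "Qwen/Qwen2.5-0.5B"   # placeholder small causal LM
|   tokenizer = AutoTokenizer.from_pretrained(base)
|   model = AutoModelForCausalLM.from_pretrained(base)
|
|   # Rank-16 adapters on the attention projections only;
|   # everything else stays frozen.
|   cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
|                    target_modules=["q_proj", "v_proj"],
|                    task_type="CAUSAL_LM")
|   model = get_peft_model(model, cfg)
|   model.print_trainable_parameters()  # sanity check
|
| From there any standard Trainer or TRL SFT loop over your own
| dataset completes the picture.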
| ijk wrote:
| My sense is they need to go back and update previous docs;
| they release a lot of software updates and a lot of notebooks
| showing how to use the features, but the two might fall out
| of sync. Would that match your observations?
| sgt101 wrote:
| Question for dudes building modern NNs... what's the thinking on
| estimating structural capacity for a real-world problem? How
| should I estimate how many parameters to choose for the model?
| p1esk wrote:
| You test different models on your real world problem, and pick
| the smallest one that works.
| sgt101 wrote:
| I just think that there has to be some heuristic..
| BoorishBears wrote:
| Closest thing to a heuristic is trying the task with
| non-fine-tuned models and building an intuition for how far
| off each model is, what directions it's off in, and how
| easily you can improve in that direction via fine-tuning.
|
| For example, for classification, if the model is
| hallucinating semantically similar but not technically
| valid classes, you can probably fine-tune your way out of
| the gap with a smaller model.
|
| But if your task requires world knowledge, you likely need
| a larger model. It's not cheap, efficient, or generally
| useful to fine-tune for additional world knowledge
| directly.
| _spduchamp wrote:
| Well, since we all thought this was about Meshtastic stuff, let's
| just give in and make this the radio/Meshtastic comment thread.
|
| Stumbled on this today... https://hackerpager.net/
|
| I really want something like this with a flip-out keyboard that
| could do Signal over LTE/WiFi.
| lewtun wrote:
| For those interested in playing with an implementation of these
| ideas, my colleagues at HF made some recipes here:
| https://github.com/huggingface/trl/blob/main/docs/source/lor...
___________________________________________________________________
(page generated 2025-10-04 23:01 UTC)