[HN Gopher] LoRA: Low-Rank Adaptation of Large Language Models
___________________________________________________________________
LoRA: Low-Rank Adaptation of Large Language Models
Author : eternalban
Score : 227 points
Date : 2023-03-24 12:15 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| eternalban wrote:
| From the paper:
|
| _"Aghajanyan et al. (2020) shows that the pre-trained language
| models have a low "instrisic dimension" and can still learn
| efficiently despite a random projection to a smaller subspace."_
|
| Would be great to have an informed practitioner comment (sota) on
| why we opt for random projection. Is the actual 'intrinsic'
| vector space uncomputable? Too slow to find?
| moyix wrote:
| Not an informed/sota practitioner, but isn't this just a
| standard property of high dimensional spaces?
|
| https://en.wikipedia.org/wiki/Random_projection
|
| > The core idea behind random projection is given in the
| Johnson-Lindenstrauss lemma, which states that if points in a
| vector space are of sufficiently high dimension, then they may
| be projected into a suitable lower-dimensional space in a way
| which approximately preserves the distances between the points.
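| A tiny numpy sketch of that property (sizes picked arbitrarily):
| project 100 points from 10,000 dimensions down to 512 with a
| random Gaussian matrix, and the pairwise distances barely move.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n, d, k = 100, 10_000, 512   # points, original dim, projected dim (arbitrary)
|
|     X = rng.normal(size=(n, d))               # points in the high-dimensional space
|     P = rng.normal(size=(d, k)) / np.sqrt(k)  # random Gaussian projection
|     Y = X @ P                                 # the same points, projected down to k dims
|
|     def pdist(A):                             # pairwise Euclidean distances between rows
|         sq = (A ** 2).sum(1)
|         d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
|         return np.sqrt(np.maximum(d2, 0.0))
|
|     i, j = np.triu_indices(n, 1)
|     ratios = pdist(Y)[i, j] / pdist(X)[i, j]
|     print(ratios.min(), ratios.max())         # both close to 1: distances roughly preserved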
| stu2b50 wrote:
| Random projections work well in high dimensional spaces: they're
| cheap, easy, and require no understanding of the initial space.
| Part of the point of LoRA is efficiency, after all!
| pgen wrote:
| Name clash! https://en.wikipedia.org/wiki/LoRa#LoRaWAN
| smusamashah wrote:
| This term is already used when fine tuning stable diffusion
| models. https://replicate.com/blog/lora-faster-fine-tuning-of-
| stable...
| Filligree wrote:
| Isn't this actually the same thing?
| TeMPOraL wrote:
| Already used for like... a month or two.
| mronetwo wrote:
| it's Microsoft. they know. they just don't care
| pygy_ wrote:
| Your case insensitive brain doesn't get it... It's LoRA, not
| LoRa.
|
| /s
| ga_to wrote:
| Microsoft has done this before with mauikit and mauilinux:
| https://github.com/dotnet/maui/issues/35
|
| Unlikely that they even consider checking whether they are
| stomping across existing names.
| capableweb wrote:
| > Unlikely that they even consider checking whether they are
| stomping across existing names.
|
| Or it's on purpose as existing terms already have good amount
| of search traffic for those terms, and Microsoft know
| Google/Bing will rank Microsoft's own pages higher than
| what's already out there.
| Semaphor wrote:
| They even do that with their own products...
| capableweb wrote:
| Easy, one is LoRA and the other one is LoRa, Microsoft made it
| very distinct, as they always do.
| fullstop wrote:
| Just don't put the files in the same directory on an exFAT
| drive
| capableweb wrote:
| or on macOS, although I'm not sure whether it still defaults to
| a case sensitive file system or not. I do remember the
| first time that bit me though, being a programmer using
| Linux collaborating with a developer using macOS. Must have
| been in ~2005 or something.
| fullstop wrote:
| The one which bit me happened when I was running a java
| minimizer / obfuscator on a Windows platform and it
| assumed that A.class was not the same as a.class. It
| worked great on Linux and didn't warn that it had
| overwritten a file, resulting in a package which almost
| worked.
| dudeinjapan wrote:
| And yet they sued Mike Rowe who made Mike Rowe Soft.
| [deleted]
| htrp wrote:
| this is from 2 years ago
| nummerfuenf wrote:
| Can we stop naming things like stuff that already exists?
| denysvitali wrote:
| I was looking for this comment. Thank you!
| postdb wrote:
| Firebird!!!
| wlesieutre wrote:
| This is totally different, Microsoft's A is capitalized!
|
| https://lora-alliance.org/
| indeyets wrote:
| https://github.com/microsoft/LoRA/issues/47
| entropicdrifter wrote:
| Unfortunately, no. It's even worse within the video game
| industry. I'm not just talking Doom 4, er, Doom (2016). The
| upcoming sequel to 2014's Lords of The Fallen? Well that's
| called Lords of The Fallen. They didn't even get 2 games in
| before repeating the exact same name.
| Agentlien wrote:
| My favourite video game franchise in terms of confusing name
| is the Jedi Knight franchise.
|
| Star Wars: Dark Forces
|
| Star Wars Jedi Knight: Dark Forces 2
|
| Star Wars Jedi Knight 2: Jedi Outcast
|
| Star Wars Jedi Knight: Jedi Academy
| JustSomeNobody wrote:
| Assholes. Don't call it LoRA!
|
| There's already a technology called LoRa!
|
| Fuck I hate this crap. Be better than this.
| krolden wrote:
| It's Microsoft; they'll only be better if they go under.
| runnerup wrote:
| There's a not insignificant intersection of projects and
| developers who might be using both LoRA and LoRa at the same
| time. What a terrible name collision. Hopefully this doesn't
| become one of the foundational terms in AI that everyone must use
| frequently like "Transformer".
| davesque wrote:
| Is this really a big problem? LoRa is a telecom thing. LoRA is
| a machine learning thing. Yeah, they're adjacent industries but
| still seems different enough to make it pretty easy to
| distinguish. I had never heard of the LoRa alliance until you
| mentioned it in this comment.
| asddubs wrote:
| yeah it really does seem like the AI folks are EEEing EE
| seydor wrote:
| transformer itself is ambiguous
|
| But to be clear, LoRa is not related to ANN training, is it?
| Why would they be using both?
| Maxion wrote:
| I was going to comment the same, horrible name collision.
| Surprised they didn't notice it.
| whalesalad wrote:
| transformer and adapter are two of the new "ai terms" that
| grind my gears
| ahkurtz wrote:
| Isn't there something really perfect about people working on a
| language model either not trying or outright failing to use
| that language model to tell them if their project name already
| exists?
|
| On their github they reference a related project called
| "HuggingFace" so you know the sky's the limit with the names in
| this field, could have been called anything else really.
| chaorace wrote:
| > On their github they reference a related project called
| "HuggingFace"
|
| Quick jargon literacy boost: "HuggingFace" is a platform
| tailored to hosting and sharing ML repositories -- like
| Github for AI. The parent company, "Hugging Face", is also in
| and of itself a major contributor to several AI research
| projects & tooling.
|
| Ironically, they still managed to hit a namespace
| collision... albeit self-inflicted.
| tmabraham wrote:
| the actual platform is called "HuggingFace Hub". The
| company itself is called "HuggingFace" or "Hugging Face" (I
| have seen it referred to in both ways, I am unsure which is
| officially correct). There is no namespace collision.
| indeyets wrote:
| https://github.com/microsoft/LoRA/issues/47
| elil17 wrote:
| Can someone ELI5 "LoRA reduces the number of trainable parameters
| by learning pairs of rank-decomposition matrices while freezing
| the original weights"?
| MacsHeadroom wrote:
| LoRA finds a subset of the original weights (about 1%) which
| can be trained to achieve about the same result as training the
| whole model while using 100x less compute.
|
| Original weights frozen = Rather than modify the original
| model, the training results are saved to a small file of only a
| few MB.
|
| In practice this means you can fine tune a 30B parameter model
| on a consumer GPU in a couple of hours. Without LoRA you would
| need to run multiple expensive data center GPUs for days or
| weeks.
| tylerekahn wrote:
| It's actually as low as 0.01% of the original weights.
|
| From the LoRA paper:
|
| >When the pre-trained model is GPT-3 175B, the number of
| trainable parameters |Θ| can be as small as 0.01% of |Φ0|.
| pffft8888 wrote:
| Is this the same as or similar to the Lottery Ticket concept
| from a few years ago?
| arugulum wrote:
| >In practice this means you can fine tune a 30B parameter
| model on a consumer GPU in a couple of hours.
|
| Consumer GPU, yes, but in practice LoRA doesn't actually
| reduce training time. What it mainly reduces is memory
| requirements. In fact LoRA training can often require more
| training steps than full fine-tuning and therefore be slower
| (you can imagine why this is the case: the optimization is
| trying to modify the model's behavior with a smaller number of
| parameters, and so has a harder job)
| stephanheijl wrote:
| To be more exact, LoRA adds two matrices `A` and `B` to any
| layers that contain trainable weights. The original weights
| (`W_0`) have the shape `d x k`. These are frozen. Matrix `A`
| has dimensions `d x <rank>` (`rank` is configurable) and
| matrix `B` has the shape `<rank> x k`. A and B are then
| multiplied and added to `W_0` to get altered weights. The
| benefit here is that the extra matrices are small compared to
| `W_0`, which means fewer parameters need to be optimized, so
| fewer activations need to be stored in memory.
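| A minimal PyTorch sketch of that layer (the class name, shapes,
| rank, and initialization are illustrative, not taken from the
| repo):
|
|     import torch
|     import torch.nn as nn
|
|     class LoRALinear(nn.Module):
|         def __init__(self, d, k, rank=8):
|             super().__init__()
|             # frozen pre-trained weight W_0 of shape d x k
|             self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
|             # trainable low-rank factors: A is d x rank, B is rank x k
|             self.A = nn.Parameter(torch.randn(d, rank) * 0.01)
|             self.B = nn.Parameter(torch.zeros(rank, k))  # zero init: A @ B = 0 at start
|
|         def forward(self, x):                       # x: (..., d)
|             return x @ (self.W0 + self.A @ self.B)  # altered weights = W_0 + A B
|
|     layer = LoRALinear(d=1024, k=1024, rank=8)
|     trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
|     print(trainable)                                 # 16384 trainable vs 1048576 frozen
|
| Initializing `B` to zero means `A @ B` starts at zero, so the
| adapted layer initially behaves exactly like the frozen one.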
| twic wrote:
| Ah, so the resulting model contains both the large matrix
| of original weights, and also the two small matrices of
| alterations? But this is smaller than the alternative of a
| model which contains the large matrix of original weights,
| and an equally large matrix of alterations.
|
| Why is fine-tuning done with separate alterations, rather
| than by mutating the original weights?
| TuringTest wrote:
| It's larger, but there are fewer parameters to train for
| your specific use case since you are training the small
| matrix only, while the original ones remain unaltered.
| arugulum wrote:
| > Why is fine-tuning done with separate alterations,
| rather than by mutating the original weights?
|
| The goal of most parameter-efficient methods is to store
| one gold copy of the original model, and learn minor
| modifications/additions to the model. The easiest way to
| think about this is in some kind of deployment setting,
| where you have 1 capable model and you learn different
| sets of LoRA weights for different tasks and
| applications.
|
| The original intent of parameter-efficient methods is to
| reduce the amount of storage space needed for models (do
| you really want to keep a whole additional copy of LLaMA
| for each different task?). A secondary benefit is that
| because you are fine-tuning a smaller number of
| parameters, the optimizer states (can take up to 2x the
| size of your model) are also heavily shrunk, which makes
| it more economical (memory-wise) to (parameter-efficient)
| fine-tune your model.
| stu2b50 wrote:
| > But this is smaller than the alternative of a model
| which contains the large matrix of original weights, and
| an equally large matrix of alterations.
|
| It's actually larger. If you just have two equally large
| matrices of the same dimension, one original, and one of
| "alterations"... then you can just add them together.
|
| > Why is fine-tuning done with separate alterations,
| rather than by mutating the original weights?
|
| Then you'd have to compute the gradients for the whole
| network, which is very expensive when the model has 7b,
| 65b, 165b parameters. The intent is to make that cheaper
| by only computing gradients for a low rank representation
| of the _change_ in the weight matrix from training.
| arugulum wrote:
| >Then you'd have to compute the gradients for the whole
| network
|
| You have to do that with LoRA regardless, to compute the
| gradients for the lowest-level LoRA weights.
| gliptic wrote:
| Correct me if I'm wrong, but I think you still need to
| compute gradients of non-trained weights in order to
| compute the gradients of the LoRA weights. What you don't
| have to do is store and update the optimizer state for
| all those non-trained weights.
| stu2b50 wrote:
| I mean the derivative of a constant is 0. So if all of
| the original weights are considered constants, then
| computing their gradients is trivial, since they're just
| zero.
| jprafael wrote:
| Computing gradients is easy/cheap. What this technique
| solves is that you no longer need to store the computed
| values of the gradient until the backpropagation phase,
| which saves on expensive GPU RAM, allowing you to use
| commodity hardware.
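| To make the memory point concrete, a toy sketch (PyTorch,
| hypothetical shapes): the frozen base weight carries no gradient
| buffer and no Adam state; only the two small factors do.
|
|     import torch
|     import torch.nn as nn
|
|     d, k, r = 4096, 4096, 8                                    # made-up sizes
|     W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen: no grad, no Adam moments
|     A = nn.Parameter(torch.randn(d, r) * 0.01)                 # trainable LoRA factors
|     B = nn.Parameter(torch.zeros(r, k))
|
|     opt = torch.optim.AdamW([A, B], lr=1e-4)   # optimizer state only for A and B
|
|     x = torch.randn(2, d)
|     loss = (x @ (W0 + A @ B)).pow(2).mean()
|     loss.backward()                            # W0.grad stays None
|     opt.step()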
| seydor wrote:
| Can rank decomposition be used to reduce the original
| weight matrices as well? Or are they assumed to be
| compressed already?
| grph123dot wrote:
| Your explanation is crystal clear. I suppose it works well
| in practice, but is there any reason it works that well?
| stu2b50 wrote:
| Per the original paper, empirically it's been found that
| neural network weights often have low intrinsic rank. It
| follows, then, that the change in the weights as you
| train also has low intrinsic rank, which means that you
| should be able to represent it with a lower rank matrix.
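| A quick way to see what "low intrinsic rank" buys you (torch,
| made-up sizes): if a matrix is (close to) rank r, the truncated
| SVD at rank r throws almost nothing away.
|
|     import torch
|
|     torch.manual_seed(0)
|     W = torch.randn(512, 16) @ torch.randn(16, 512)  # 512x512 matrix with rank at most 16
|
|     U, S, Vh = torch.linalg.svd(W, full_matrices=False)
|     r = 16
|     W_r = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]   # best rank-r approximation (Eckart-Young)
|
|     print(torch.linalg.matrix_rank(W).item())            # 16
|     print((torch.norm(W - W_r) / torch.norm(W)).item())  # ~0: rank 16 captures everything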
| grph123dot wrote:
| Since we are in ELI5, it seems that the concept of low
| rank approximation is required to understand this method.
|
| (1) https://en.wikipedia.org/wiki/Low-rank_approximation
|
| Edited: By the way, it seems to me that there is an error
| in the Wikipedia page: if the low-rank approximation uses
| a larger rank, then the bound on the error should
| decrease, but on that page the error increases.
| grph123dot wrote:
| >> that the change in the weights as you train also have
| low intrinsic rank
|
| It seems that the initial matrix of weights has a low
| rank approximation A, which implies that the difference
| E = W - A is small. Also, it seems that PCA fails when E
| is sparse, because PCA is designed to be optimal when the
| error is Gaussian.
| stu2b50 wrote:
| In terms of PCA, PCA is also quite expensive
| computationally. Additionally, you'd probably have to do
| SVD instead.
|
| Since the weights are derived from gradient descent, yeah
| we don't really know what the distributions would be.
|
| A random projection empirically works quite well for very
| high dimensions, and is of course very cheap
| computationally.
| seydor wrote:
| Does this mean the matrices are highly compressible?
| quest88 wrote:
| Is this the same as Knowledge Distillation (teacher-student
| training)?
| edwardjhu wrote:
| Hi! I'm the author of the repo.
|
| The insight is that we don't need to modify a lot of parameters
| to get a generally competent model to do well on specific
| tasks. When you have a linear layer with a weight matrix of
| dimension d_in x d_out, the change you undergo during full
| finetuning is also a matrix of d_in x d_out, which can be huge.
| We represent the latter using two matrices of shape d_in x r
| and r x d_out. You save a lot of parameters when r is small. So
| when you use it, the input goes through two streams: 1) the
| original frozen weight turning a vector of size d_in to d_out
| and 2) the low-rank weights turning a vector of size d_in to r
| and r to d_out. The two streams are then summed together.
| (There's a figure in the paper.)
|
| This way of doing things is nice for a few reasons. It's easy to
| parallelize. You can change r to control how many parameters to
| train. You can also merge the low-rank weights with the
| original one to avoid latency.
|
| Note that we don't select a subset of the original parameters.
| We train extra ones.
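| A rough sketch of those two streams (PyTorch; the rank and
| dimensions below are made up):
|
|     import torch
|     import torch.nn as nn
|
|     d_in, d_out, r = 1024, 1024, 8                # made-up sizes
|     frozen = nn.Linear(d_in, d_out, bias=False)   # pre-trained weight, frozen
|     frozen.weight.requires_grad = False
|     down = nn.Linear(d_in, r, bias=False)         # trainable: d_in -> r
|     up = nn.Linear(r, d_out, bias=False)          # trainable: r -> d_out
|     nn.init.zeros_(up.weight)                     # adapted model starts identical to the original
|
|     def forward(x):
|         return frozen(x) + up(down(x))            # the two streams are summed
|
|     print(forward(torch.randn(2, d_in)).shape)    # torch.Size([2, 1024])
|
| Zero-initializing the up-projection means the summed output
| starts out equal to the frozen stream alone, and training only
| nudges it away from there.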
| loxias wrote:
| Hi! I in _no way_ mean to detract or malign or "anything
| negative" the parent comment (communication is hard!!), BUT I
| really compliment that exact sentence. :)
|
| My background covers signal processing, "pre-deep learning
| ML", systems engineering, and firmware, and that sentence
| jumped out at me as crystal clear in my mind, despite not
| knowing what HuggingFace is or PyTorch.
|
| Correct me if I'm wrong: These huge models involve lots of
| weights used in large matrices. The contribution of this work
| is to plug in some matrix factorization and learn a lower
| dimensional representation. Fantastic!
|
| Also makes me wonder what other performance improvements
| await through proper application of established and well
| known Mathematics. :D
| eternalban wrote:
| Great, we can get authoritative answers. (I'm trying to
| understand the ML space and have mostly done readings, not an
| expert.)
|
| I am assuming you can have n LoRA fine-tunings, say each
| specializing in one aspect of a coherent task, with n
| summers, running in parallel, and then combine them at the
| end? Or more generally, does LoRA enable a sort of
| modularizing around a core (un-merged) model?
|
| And curious if you ever tried merging 2 or more fine-tunings
| and then testing the resultant single model (merge all)
| against the original tests to check retention?
| zeckalpha wrote:
| Quite different from https://en.m.wikipedia.org/wiki/LoRa
| michaelhartm wrote:
| Btw, it's kinda crazy how bad the GPT4-J results in the blog are
| compared to the Dolly one, which seem pretty good. Do we know why
| it works so well to use this 50k dataset?
| quadrature wrote:
| Dolly is instruction fine tuned whereas GPT4-J is not, which
| means that it doesn't even understand that it is being
| instructed to do something; it is just doing autocomplete.
| muny wrote:
| Why use the same name as LoRa? https://lora-alliance.org/
|
| Edit: Microsoft is even a member of the LoRa alliance:
| https://lora-alliance.org/lora-alliance-press-release/micros...
| edwardjhu wrote:
| Good question! I came up with the name because the idea is best
| described as low-rank adaptation. I know very little about
| radio communication and didn't anticipate the visibility my
| repo has today :)
| StingyJelly wrote:
| At least could have been LoRaA
| stu2b50 wrote:
| You're assuming a lot more intercompany coordination than would
| exist. Even though it's research by Microsoft labs, the
| researchers themselves are to a large extent autonomous and
| also narrow experts in their fields.
|
| This process involves low rank approximations -> Lora is a
| namey sounding term that uses characters from low and rank ->
| call it LoRA in the paper. That's all there was to it. Probably
| didn't even know the other lora existed.
| edwardjhu wrote:
| Yup. That's exactly what happened.
| anthk wrote:
| Also Guix vs Guix...
| FlyingRobot wrote:
| I had to scan the readme to make sure this story wasn't about
| applying machine learning to radio communication.
| ChancyChance wrote:
| Small CNNs can be used for BLE channel hopping and body
| detection.
| tylerekahn wrote:
| Low Rank Adaptation is a mathematical technique; it's not a
| technology standard
| krolden wrote:
| Then call it LoRad
| samtho wrote:
| It's still a currently-in-use acronym/term, and a
| sufficiently large tech company could conceivably be using
| both meanings concurrently. This causes confusion and muddies
| the waters of a general web search experience.
|
| Not the same situation, but I remember when "Electron" was
| called "Atom Shell" because it was built for the (now
| defunct) text editor by the same name. For the longest time,
| I had an unsubstantiated thought that it was a new Unix shell
| that was based around a text editor somehow (yes, dumb). In
| hindsight, they just had named this cleverly to reference the
| various layers or shells of electrons orbiting atomic nuclei,
| thus the eventual name of Electron.
|
| On the other hand, a wireless technology standard is very
| different than a known mathematical technique that likely
| predates the wireless meaning anyway.
| kkielhofner wrote:
| In all seriousness, there should be a better approach to naming
| ML projects (I should try ChatGPT). Naming a project or a
| company is very difficult, so I can't blame anyone here.
|
| That said some of these ML project names are especially
| horrendous (kind of ironic for the current emphasis on
| generative AI). Transformers? A good chunk of the time I get
| results about the toys and cartoons from my childhood. Don't
| get me wrong, I still think Optimus Prime is cool and the name
| "transformers" make sense given the function but it's somehow
| simultaneously generic AND the name of a decades long multi-
| billion dollar media franchise...
|
| LoRA is another example, name makes sense but the collision
| with LoRa is problematic. I, for one, am interested in and
| have/would apply both. Cue Google searches for "Lora
| radio..." vs "Lora ml...".
|
| Project naming is hard and I'm just glad to see the activity
| and releases. BUT project naming is essentially a base
| usability condition and should be considered as such: just like
| creating a README, getting started, providing code examples,
| etc.
|
| It reminds me of trademarks: if you're looking for trademark
| protection it won't be issued if it is overly generic or likely
| to "cause confusion in the marketplace" with an existing
| trademark (basically same or similar name in a somewhat
| similar/adjacent field) - you can even reuse names but only if
| it's obvious to people from basic context that they refer to
| different things. I'm not a trademark attorney but I think LoRa
| vs LoRA would get refused because it's "computer stuff", while
| a shampoo named Lora would be fine (as an example). If you're
| curious there are official categories/areas from the USPTO that
| break these down.
|
| Neither of these examples would have a chance at trademark
| protection. Note I'm not saying they should have trademark
| protection, just that it's an example of a reasonable standard
| that should be considered/compared to for good open source
| project naming.
| elcomet wrote:
| There are many more things called lora.
|
| https://en.m.wikipedia.org/wiki/Lora
|
| It doesn't really matter as long as it's not in the same field.
| No one will be confused between the two.
| magicalhippo wrote:
| > No one will be confused between the two.
|
| Except search engines...
| Filligree wrote:
| That's okay, Bing-GPT doesn't get confused.
| AdamH12113 wrote:
| "LoRAd" was right there.
| brodouevencode wrote:
| https://en.wikipedia.org/wiki/LoRa for the communications
| architecture
| renewiltord wrote:
| Why did the radio guys use the same name as this hotel from
| Minnesota that existed for years before?
| https://www.lorahotel.com/
|
| I bet some of them have even been to Minnesota and they still
| didn't pick a unique name.
|
| Though both of them have to answer to why they picked the name
| of a Google Font that preceded both and is currently available
| https://web.archive.org/web/20170210001724/https://fonts.goo...
|
| Is it because Microsoft is competing with Google in the AI
| space?
| reportgunner wrote:
| Context. Individual hotels are not technology.
| renewiltord wrote:
| Indeed. And LLMs are not radios or fonts.
| krossitalk wrote:
| Maybe call it LoRALLMR (Laura Loomer)
| timmg wrote:
| This sounds similar to "prompt tuning":
| https://ai.googleblog.com/2022/02/guiding-frozen-language-mo...
| stu2b50 wrote:
| It's actually completely different. What you linked is about
| zero shot learning by adjusting the prompt, vs LoRA, which is
| about actually fine tuning the weights of the model.
| timmg wrote:
| In that case, you can think of the prompt as being one vector
| of the model that is being tuned while the rest is frozen.
|
| Not exactly the same, to be sure. But fulfills a similar
| need: more efficient "fine tuning" of a large model.
| stu2b50 wrote:
| I suppose that is true. You can even train the prompt with
| gradient descent. But in practice, it ends up being fairly
| different.
| eternalban wrote:
| They address prompt tuning's issues in the paper:
|
| _" The other direction, as exemplified by prefix tuning
| (Li & Liang, 2021), faces a different challenge. We observe
| that prefix tuning is difficult to optimize and that its
| performance changes non-monotonically in trainable
| parameters, confirming similar observations in the original
| paper. More fundamentally, reserving a part of the sequence
| length for adaptation necessarily reduces the sequence
| length available to process a downstream task, which we
| suspect makes tuning the prompt less performant compared to
| other methods."_
|
| https://ar5iv.labs.arxiv.org/html/2106.09685
|
| This is key imo: _" More fundamentally, reserving a part of
| the sequence length for adaptation necessarily reduces the
| sequence length available to process a downstream task"_.
| arugulum wrote:
| LoRA conversely has different downsides. LoRA can be used
| in two ways: merged or unmerged. Unmerged (which is how
| it's trained) incurs a non-trivial computation cost.
| Merged means you are modifying the model weights, which
| means you are stuck with that one model on that device
| (though this usually applies to most implementations
| of the unmerged version too).
|
| The benefit of prompt and prefix tuning (note: these are
| two separate methods) is that you can serve different
| soft-prompts and soft-prefixes efficiently with a single
| shared set of model weights.
| eternalban wrote:
| https://ar5iv.labs.arxiv.org/html/2106.09685/assets/x1.png
|
| > incurs a non-trivial computation cost
|
| The hit seems to be in energy/cpu not time since the W0
| computation is in parallel with the BAx. (My assumption
| based on the latency claims in paper.) So an issue in
| edge deployments (battery life, etc.).
|
| > you are stuck with that one model on that device
|
| Upfront I have 0 clue on the actual numbers, but from a
| purely software architecture pov [in unmerged setup],
| having that W0 forward process _once_ with n distinct BAx
| paths (for distinct fine tunings!) would address that,
| no?
|
| [p.s. say an application that takes as input A/V+Txt,
| runs that through an _Ensemble LoRA_ (ELoRA(tm) /g)
| with each participant contributing its own BAx finetuning
| processing, sharing the single pre-trained W0.]
| arugulum wrote:
| > My assumption based on the latency claims in paper.
|
| The latency claims are based on the merged version, where
| the modifications are merged into the model weights.
| Hence there is no latency cost, since the final model has
| the same shape as the original.
|
| > having that W0 forward process once with n distinct BAx
| paths (for distinct fine tunings!) would address that,
| no?
|
| The tl;dr is that that works, but is more expensive. Not
| ridiculously more expensive, but certainly more expensive
| than processing a few additional tokens with
| prefix/prompt tuning.
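| For what it's worth, the structure under discussion is easy to
| sketch (plain torch, toy sizes): the expensive W0 matmul happens
| once, and each fine-tuning contributes only its own cheap
| low-rank path.
|
|     import torch
|
|     d, k, r, n_adapters = 1024, 1024, 8, 4                # toy sizes
|     W0 = torch.randn(d, k)                                 # shared pre-trained weight
|     adapters = [(torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01)
|                 for _ in range(n_adapters)]                # one (A, B) pair per fine-tuning
|
|     x = torch.randn(2, d)
|     base = x @ W0                                          # computed once for all adapters
|     outs = [base + (x @ A) @ B for A, B in adapters]       # n cheap low-rank additions
|     print(len(outs), outs[0].shape)                        # 4 torch.Size([2, 1024])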
| edwardjhu wrote:
| > Merged means you are modifying the model weights, which
| means you are stuck with that one model on that device
| (though, this usually applies for most implementations
| for the unmerged versions too).
|
| If one is careful with floating point issues, it's
| straightforward to unmerge the weights.
|
| W_0 = W_1 - BA
|
| Yes, prompt-based methods don't involve swapping weights.
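| A small sketch of that round trip (torch, arbitrary sizes),
| using the paper's delta_W = B A convention:
|
|     import torch
|
|     d, k, r = 1024, 1024, 8          # arbitrary sizes
|     W0 = torch.randn(d, k)           # original frozen weight
|     B = torch.randn(d, r) * 0.01     # LoRA factors, delta_W = B @ A
|     A = torch.randn(r, k) * 0.01
|
|     W1 = W0 + B @ A                  # merge: serve W1 with no extra matmul at inference
|     W0_back = W1 - B @ A             # unmerge: recover W0 up to floating-point error
|     print(torch.allclose(W0, W0_back, atol=1e-5))  # True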
| arugulum wrote:
| Right, it's mathematically easy (again, up to floating
| point issues) to recover the weights as needed, but in
| terms of distribution/serving I'm guessing the plan is to
| have the original weights and carry around the LoRA
| weights and merge as necessary.
|
| (Also, I'm assuming you're the first author of LoRA.)
| arugulum wrote:
| Both LoRA and prompt tuning are parameter-efficient tuning
| methods. Both of them inject new weights into the model and
| tune them.
|
| Prompt tuning does so by injecting additional prefix tokens
| into the input to the model. LoRA does so by injecting low-rank
| matrices that are additive modifications to a set of linear
| layers in the model.
|
| They both do something slightly different, but are very much
| in the same class of methods.
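| For contrast, prompt tuning in its simplest form looks roughly
| like this (a sketch, not any particular library's API): a
| handful of trainable "virtual token" embeddings are prepended to
| the input embeddings of a frozen model, and only those
| embeddings are updated.
|
|     import torch
|     import torch.nn as nn
|
|     d_model, n_virtual = 768, 20                 # made-up sizes
|     soft_prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)  # only trainable weights
|
|     def with_soft_prompt(input_embeds):          # input_embeds: (batch, seq, d_model)
|         batch = input_embeds.shape[0]
|         prefix = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
|         return torch.cat([prefix, input_embeds], dim=1)  # frozen LM runs on this longer sequence
|
|     print(with_soft_prompt(torch.randn(2, 10, d_model)).shape)  # torch.Size([2, 30, 768])
|
| This is also where the downside mentioned elsewhere in the
| thread comes from: the soft prompt eats into the sequence length
| available for the actual task.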
| eternalban wrote:
| TIL about LoRA via
| https://news.ycombinator.com/item?id=35287740
|
| See also: Huggingface PEFT: https://github.com/huggingface/peft
| sharemywin wrote:
| Could the Bloom model be used with this training to build a
| commercially allowed small-ish model?
| stu2b50 wrote:
| It cheapens the cost of fine tuning, it doesn't make the model
| itself smaller at inference time.
| outside1234 wrote:
| Has anyone diaried out a good learning path for going from a
| larger pre-trained model to a fine tuned model? Trying to
| understand all of the parts here but it's sort of hard to find
| anything linear...
| alecco wrote:
| *17 Jun 2021
| rdedev wrote:
| Came across this library in the past where you can easily add
| LoRA and other efficient fine tuning techniques into
| huggingface models. Haven't tried it though, and support for
| different models may be limited
|
| https://adapterhub.ml/
| Der_Einzige wrote:
| I really hope this doesn't displace regular fine tuning
| techniques. Dreambooth is superior in quality to LoRA for image
| generation, and I suspect that it's similar with LLMs.
| brucethemoose2 wrote:
| There are some WIP evolutions of SD LoRA in the works, like
| LoCon and LyCORIS.
|
| https://github.com/KohakuBlueleaf/LyCORIS
| eternalban wrote:
| https://dreambooth.github.io/
|
| The LoRA paper's 'problem statement' makes a compelling case
| for practical benefits of the approach. Specifically, no added
| latency, no serial processing bottlenecks, shared baseline
| model, compact time/space requirements. How does dreambooth
| stack up in this regard?
| mattnewton wrote:
| In the image space, dreambooth full-model tunes can handle
| multiple concepts and tend to be easier to get hard/complex
| things like a person's likeness correct. I've found that LoRA
| tunes struggle to be accepted by people as producing their
| own face compared to full dreambooth models tuned on the same
| inputs, most likely because we are very sensitive to facial
| differences of faces we are very familiar with. I haven't
| seen this effect for styles or other kinds of concepts, where
| people are a little less sensitive about the fidelity. LoRA
| is much easier to train, easier to compose, and can have the
| base model swapped out in many cases, though, so if it's good
| enough for the concept you are trying to add to the model,
| it's often worth the subtle quality loss.
| stu2b50 wrote:
| I suspect it's not that similar. The intuition behind LoRA is
| more true the higher the rank of the weights of the model. Even
| the smallest LLMs have considerably higher rank weights than
| Stable Diffusion. They are _large_ , after all.
| numlocked wrote:
| For those wondering why this is interesting: This technique is
| being used to reproduce[0] the Alpaca results from Stanford[1]
| with a few hours of training on consumer-grade hardware.
|
| I believe there will soon be a cottage industry of providing
| application-specific fine-tuned models like this, that can run in
| e.g. AWS very inexpensively. The barrier today seems to be that
| the base model (here, Meta's LLaMA) is encumbered and can't be
| used commercially. Someone will soon, I'm confident, release e.g.
| an MIT-licensed equivalent and we'll all be off to the races.
|
| [0] https://github.com/tloen/alpaca-lora
|
| [1] https://crfm.stanford.edu/2023/03/13/alpaca.html
| GaggiX wrote:
| In addition, for the past 1/2 month this technique has been
| used to fine-tune Stable Diffusion models.
| terafo wrote:
| Closer to 4 months. It is much better than having a bunch of
| 2-4gb models laying around.
| GaggiX wrote:
| 4 months? I don't think so; people really started using LoRA
| when it was added to the diffusers library less than 2
| months ago. This library is used by the training plugin of
| the automatic webui. I guess time seems to flow more slowly
| when many things happen.
| dragonwriter wrote:
| The 1/2 month seems to match Lycoris/LoCon, which as I
| understand (haven't dug into the details on this) is a
| newer refinement of LoRA. LoRA has been used for longer,
| correct.
| GaggiX wrote:
| The LyCORIS/LoCon repo started committing 1 month ago and
| almost no one is using it except for a few experiments
| (not even the automatic webui supports it without a
| plugin).
| dragonwriter wrote:
| Judging from activity on Civitai, I think "almost no one
| is using it except for a few experiments" is _very_
| wrong. Sure, A1111 needs a plugin for it; it needs a
| plugin for ControlNET, too, but that is _also_ quite
| popular.
| GaggiX wrote:
| I'm also judging from the activity on CivitAI, the most
| downloaded (>1000 downloads, not many) ones are actually
| just LoRA with LoCon in another (experimental) branch of
| the CivitAI page, definitely not " _very_ wrong " ahah
|
| >it needs a plugin for ControlNET
|
| The big difference is that ControlNet actually required a
| pretty complex interface to be used effectively, whereas
| the use of LoCon/LyCORIS should be completely
| transparent and work like a LoRA
| Agentlien wrote:
| ControlNet is built in as of maybe two weeks ago and no
| longer requires an extension. I started using it when the
| built-in support arrived and have had a lot of fun with
| it since.
| smaddox wrote:
| There's already RWKV, if you want a decent performing pre-
| trained model that's Apache 2.0 licensed:
| https://twitter.com/BlinkDL_AI/status/1638555109373378560?s=...
| pffft8888 wrote:
| https://news.ycombinator.com/item?id=35281026
| polyterative wrote:
| Thanks! Hard to follow this stuff sometimes with all the news
| romanzubenko wrote:
| Today Databricks announced [0] a 6b parameter model from
| EleutherAI finetuned on the Alpaca dataset. According to their
| CEO[1], training took 3 hours and cost $30. They didn't
| release any details on how it was trained, but likely with
| LoRA.
|
| [0] https://www.databricks.com/blog/2023/03/24/hello-dolly-
| democ... [1]
| https://twitter.com/alighodsi/status/1639251347777388544
| numlocked wrote:
| Interesting. I wonder what the training cost was for:
|
| https://huggingface.co/EleutherAI/gpt-neox-20b
|
| Perhaps it's in the paper...
| michaelhartm wrote:
| They used the 6b GPT4-J, not 20B. That's what's
| interesting, it's a smallish large language model :).
| dragonwriter wrote:
| GPT-J, not GPT4-J.
| int_19h wrote:
| There are also some LLaMA LoRAs that are trained on the
| Anthropic dataset specifically for chat:
|
| https://huggingface.co/serpdotai
|
| I haven't done any formal tests on this yet, but with
| llama-13b, the overall structure of its responses definitely
| becomes much more ChatGPT-like. It would be very interesting
| to see how the 65B model performs.
| m3affan wrote:
| Let the revolution begin
| outside1234 wrote:
| Or, more importantly than in AWS, locally in disconnected or
| poorly connected scenarios like in-vehicle or in-home.
| arugulum wrote:
| > This technique is being used to reproduce[0] the Alpaca
| results from Stanford[1]
|
| Reproduced is a strong statement, without any rigorous
| justification other than a few cherry-picked examples. Alpaca-
| LoRA is simply LLaMA with LoRA-tuning on the Alpaca data. There
| are no metrics, no measurements, no evaluations to show that
| the Alpaca-LoRA performs similarly to Alpaca, when it is well-
| known in the field that parameter-efficient fine-tuning always
| pays a cost in terms of performance relative to full fine-
| tuning (which is what Alpaca does).
|
| (This has been a huge nit for me because of the recent flood of
| Alpaca replications, or even claims that Alpaca is comparable to
| ChatGPT, rushing to market themselves, but with nothing to
| justify their claims.)
| numlocked wrote:
| I agree - my comment originally had a parenthetical about
| this fact, but I thought it was probably confusing to people
| who just wanted to understand what this was about. Perhaps I
| shouldn't have edited it out.
|
| It also bothers me that a lot of LoRA claims read like "You
| won't believe how little it costs to train these models!",
| when of course 99%+ of the complexity and cost is in the
| LLaMA (or whatever) model that underpins it. Folks are
| talking about it in a loose way that implies some kind of
| miraculous overall training cost breakthrough.
| GaggiX wrote:
| >when it is well-known in the field that parameter-efficient
| fine-tuning always pays a cost in terms of performance
| relative to full fine-tuning
|
| The LoRA paper clearly states the performance of the method
| "LoRA performs on-par or better than fine-tuning in model
| quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having
| fewer trainable parameters, a higher training throughput,
| and, unlike adapters, no additional inference latency. ":
| https://arxiv.org/abs/2106.09685
| arugulum wrote:
| I don't want to get into the weeds of the subtleties of
| evaluation, hyperparameter-tuning and model comparisons,
| but let's just say that subsequent studies have shown that
| LoRA (consistent with most parameter-efficient tuning
| methods) underperforms full fine-tuning:
| https://arxiv.org/abs/2203.06904
|
| A simple way to think about it is this: if LoRA really
| gives full fine-tuning performance, why would anyone ever
| fully fine-tune a model?
| GaggiX wrote:
| >why would anyone ever fully fine-tune a model?
|
| You're asking it as if it were a rhetorical question, but
| I think it carries more weight than many people seem to
| believe.
| arugulum wrote:
| To balance my view a little, it is definitely a valid
| question to ask "how far can we get with parameter-
| efficient tuning", and I firmly believe that as models
| get larger, the answer is "very, very far".
|
| That said, I also dislike it when it is carelessly
| claimed that parameter-efficient tuning is as good as
| full fine-tuning, without qualifications or nuance.
| jprafael wrote:
| If this works, is there any theory on why training models with low
| rank layers (y = (A.B).x + b) directly doesn't work? (or do they?)
___________________________________________________________________
(page generated 2023-03-24 23:00 UTC)