[HN Gopher] Language models are Super Mario: Absorbing abilities...
___________________________________________________________________
Language models are Super Mario: Absorbing abilities from
homologous models
Author : tosh
Score : 72 points
Date : 2024-04-06 14:39 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tosh wrote:
| Merging is still wild to me.
|
| I naively merged a Dolphin fine-tune of the Mistral 7B v0.2
| base with Mistral 7B Instruct v0.2 and got a model that scores
| higher on benchmarks:
|
| https://huggingface.co/ichigoberry/pandafish-2-7b-32k
|
| Took a few minutes in Colaboratory.
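|
| For the curious, a naive merge like that boils down to
| something like the following PyTorch sketch: a plain 50/50
| average of two checkpoints that share the same architecture.
| The model names are placeholders, and real tools like mergekit
| add per-tensor weights and DARE/TIES-style sparsification on
| top of this.
|
|     from transformers import AutoModelForCausalLM
|
|     # Placeholders: any two fine-tunes of the same base model.
|     a = AutoModelForCausalLM.from_pretrained("fine-tune-a")
|     b = AutoModelForCausalLM.from_pretrained("fine-tune-b")
|
|     # Plain 50/50 average of every floating-point tensor.
|     merged = a.state_dict()
|     for name, tensor in b.state_dict().items():
|         if tensor.is_floating_point():
|             merged[name] = 0.5 * merged[name] + 0.5 * tensor
|
|     a.load_state_dict(merged)
|     a.save_pretrained("merged-model")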
| Drakim wrote:
| Yeah it's wild to me too, it feels like we are doing alchemy
| rather than programming, mixing intelligence out of a cauldron.
| echelon wrote:
| This whole field feels like magic, yet there are mathematical
| underpinnings behind all of it.
|
| I don't know why everyone isn't in love with this field. It's
| so utterly fascinating.
| mmoskal wrote:
| I think it's a bit like saying microcode in a CPU underpins
| high-level programming languages - kind of, but good luck
| understanding what a program does based on its microcode
| translation. OTOH you can mess with microcode and maybe get
| better performance, but you won't really know before you
| try. It's similar with ML - much more like alchemy than
| science...
| Almondsetat wrote:
| Differential equations are mathematically quite simple, yet
| they are mostly intractable and the systems they describe
| chaotic and unpredictable. The mathematics in AI works
| similarly, being deceptively simple yet giving birth to
| incredible complexity
| prisenco wrote:
| I'm not in love with it _because_ it's magic. Not to say
| it isn't interesting, certainly it is, but I don't want my
| life ruled by magic.
|
| Predictability and reproducibility and a provenance of
| logic are important for computing systems and society as a
| whole.
| olddustytrail wrote:
| That's certainly an interesting viewpoint on life, but I
| suspect it's a very rare one simply because most people
| would never have considered it and those who had would
| tend towards the curious.
|
| You seem to be trapped in the interminable middle.
| prisenco wrote:
| I don't feel trapped. There is going to be (and in some ways
| already has been) a small reckoning when people realize that
| magic (AI) and all its unpredictability is very difficult to
| manage.
|
| There is no field of "probabilistic UX" for example. How
| do you provide a consistent user experience when the
| underlying engine of your application is inconsistent?
|
| Same goes for QA, testing, root cause analysis.
|
| Adding features can have exponential side effects that
| cannot be predicted, which can be deadly at scale. Both
| figuratively and literally depending on how the
| technology is adopted.
| a_wild_dandan wrote:
| I can see that. Unpredictability is the root of much
| anxiety. In academia, I'm the opposite. I'm thrilled by
| mystery and the unknown. It gives me this deep sense of
| profundity and gravitas bordering on religious
| experience. I'd love to learn more about that phenomenon.
| Anyway, it reminds me of that famous Newton quote:
|
| "I do not know what I may appear to the world, but to
| myself I seem to have been only like a boy playing on the
| sea-shore, and diverting myself in now and then finding a
| smoother pebble or a prettier shell than ordinary, whilst
| the great ocean of truth lay all undiscovered before me."
| prisenco wrote:
| _In academia, I'm the opposite. I'm thrilled by mystery
| and the unknown._
|
| Oh, if I were in academia, I'd be positively giddy. So much
| to explore. Fascinating stuff.
|
| But from my perspective, I don't like building houses on
| shaky ground. And I especially don't want to live in an
| economy based on it.
| moffkalast wrote:
| Lots of reasons I suppose. Outside script-kiddying with
| mergekit it takes some serious knowledge to properly train
| anything, and expensive amounts of compute to actually do
| it or even run it in the end. It's not the most accessible
| thing.
|
| For classical methods, once you've got the algorithm nailed,
| it will work 100% of the time. For probabilistic methods, you
| do get better results _most of the time_, but they can also
| screw up randomly for no reason, so their deployability in
| production is hell on wheels. It's infuriating at times.
|
| Still can't argue against it being very fascinating.
| refibrillator wrote:
| It's really surprising how well this works. Intuitively this
| illustrates how over-parameterized many LLMs are, or conversely
| how under-trained they might be.
|
| The drop-and-rescale method outlined in the paper makes the
| delta parameters increasingly sparse, which in turn allows
| weights to be merged without much interference or degradation.
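|
| To make that concrete, here's a minimal sketch of the
| drop-and-rescale step as I read it (my own paraphrase, not the
| paper's reference code): drop each entry of a fine-tune's
| parameter delta with probability p, and rescale the survivors
| by 1/(1-p) so the expected delta is unchanged.
|
|     import torch
|
|     def dare(delta: torch.Tensor, p: float = 0.9) -> torch.Tensor:
|         # Drop And REscale: zero out each entry of the delta
|         # with probability p, then scale the survivors by
|         # 1/(1-p) to keep the delta's expected value the same.
|         keep = (torch.rand_like(delta) >= p).to(delta.dtype)
|         return delta * keep / (1.0 - p)
|
|     # merged weight = base weight + sum of DARE-sparsified
|     # deltas, one delta per fine-tune being absorbed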
|
| My instinct is that while merging models will have some use
| cases, ultimately these insights will lead to innovations in
| training and architecture that have the same result but with
| better computational efficiency.
|
| For example instead of training an 8x7b mixture of experts then
| merging, just incorporate the sparsity constraint while pre-
| training a single 7b model (somehow).
| Drakim wrote:
| Do you know if there is anybody who has made it their mission
| to shrink models to extremes? It feels like the sorta thing
| somebody would get really obsessed with doing, akin to those
| people who make executables out of a few bytes or shrink network
| payloads.
| sdenton4 wrote:
| I tend to think there's an explore/exploit trade-off in model
| scale. Animals, including humans, have lots of extra neurons
| when they are young, which are then shed once the learning
| phase settles down. And this makes sense: thinking is energy
| intensive, so it's more efficient to sparsify. And, of course,
| we see a similar dynamic in ML: you can train a big model then
| prune it and do much better than you would by just training the
| smaller model directly.
|
| I've got some geometric handwaving for why this works, as well.
| It's easier to find a low-energy solution when you have more
| parameters... Sparse solutions are higher energy, and thus
| require longer walks (and more commitment) during training.
| visarga wrote:
| > For example instead of training an 8x7b mixture of experts
| then merging, just incorporate the sparsity constraint while
| pre-training a single 7b model (somehow).
|
| I'm thinking it would help reduce the network demands for
| gradient updates if you merge from time to time. That could
| unlock distributed training, like SETI@Home.
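|
| Concretely I'd picture something like the usual local-SGD or
| federated-averaging pattern (just an illustration of the idea,
| not something from the paper): each worker trains on its own
| shard for a while, then the checkpoints get averaged instead
| of synchronizing gradients on every step.
|
|     import torch
|
|     def average_checkpoints(worker_states):
|         # worker_states: list of state dicts from homologous
|         # models, each trained independently for a while.
|         avg = {}
|         for name in worker_states[0]:
|             avg[name] = torch.stack(
|                 [sd[name].float() for sd in worker_states]
|             ).mean(dim=0)
|         return avg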
| m3kw9 wrote:
| Then why not start with a model that has all the weights
| zeroed out and start absorbing different models?
| l33tman wrote:
| It would be cool if there was some threshold in training from
| scratch after which this actually works for adding
| higher-level knowledge. So you would start training it as
| usual to get it to absorb generic language skills and
| reasoning, but save all the domain knowledge absorption for
| merging later.
| skybrian wrote:
| The weights need to be connected to something interesting. Pre-
| training is how they get them all connected up, and fine-tuning
| is how they find which weights it would be useful to change.
| stoniejohnson wrote:
| Pretraining determines the weights (i.e. the connections);
| fine-tuning lets you change some subset of the weights (e.g.
| the final layers) with a smaller chunk of task-specific data.
| yumraj wrote:
| Isn't _x + 0 = x_?
|
| So what's the benefit of starting with _0_ when you can just
| start with _x_?
|
| I know the above is overly simplistic, but serious question,
| please ELI5...
| thaumasiotes wrote:
| Ah yes. No Nintendo character is more characterized by consuming
| other characters and absorbing their abilities than Mario.
| ksenzee wrote:
| A lot more people have played Mario than Kirby, though, and
| Mario came first.
| smolder wrote:
| Yes, if you want the video game analogy to play for people
| who don't know squat about video games, Mario is the safe
| choice.
| dimatura wrote:
| My first thought: Megaman would be my choice. Maybe Kirby, but
| he can only have one power at a time.
| Waterluvian wrote:
| I always imagined Megaman to be a Capcom character. But I
| also always imagine being unnecessarily pedantic about 80s
| video game IP.
| hcs wrote:
| In Kirby 64 (and maybe others?) he could combine powers
| https://en.wikipedia.org/wiki/Kirby_64:_The_Crystal_Shards
| loloquwowndueo wrote:
| Here to say : Kirby!!
| ShamelessC wrote:
| Didn't play Super Mario Odyssey I take it?
| yumraj wrote:
| For others like me who hadn't heard of merging before, this
| seems to be one tool[0] (there may be others):
|
| [0] https://github.com/arcee-ai/mergekit
| uptownfunk wrote:
| Does this imply that some type of decentralized training
| mechanism is possible? Like an accumulation for ML models. I
| suspect in the limit you will just have even more massive
| models, which will place even more demand on the hardware. I
| also wonder if new capabilities emerge from the merging that
| are not present in any one model.
| bilsbie wrote:
| Is there a way to do this with models that aren't homologous? Now
| that would be unbelievable.
|
| Just smash every model you can find together?
| jerpint wrote:
| You have to assume they share the same architecture for most
| of these methods.
| smusamashah wrote:
| Image models already support merging. Thousands of Stable
| Diffusion models exist for that reason alone. The downside is
| that almost all models you see are now inbred. The community
| does talk about it and can see the effect this is having on
| image quality. A prominent example: you will see the exact
| same Japanese-style girl's face from almost all models out
| there when you generate an image of a woman. Check out Civitai
| to see what I am talking about. It's not easy to train new
| models but super easy to merge.
|
| We can expect a similar explosion of LLMs using this
| technique. And later, at some point, perhaps the degradation
| of quality. Or maybe all those LLMs will just be saying the
| same things by then.
| og_kalu wrote:
| Image model merging is kind of meh, though. You can't really
| merge two SD models trained on different styles with different
| keywords and get a model that knows both independently.
| smusamashah wrote:
| Never done that myself, but I always thought that was the
| point of these merges. Maybe it doesn't understand the
| keywords after a merge, but I think they do keep the styles in
| some form. What would be the point of merges if that weren't
| possible? People sometimes share what fraction of each model
| went into a merge.
| og_kalu wrote:
| Let's say I trained a model on the artworks of Fiona
| Staples that is invoked by typing "fiona staples style".
| Then I trained a model on the artworks of James Daly that
| is invoked by "james daly style".
|
| What I want when I merge such models is a model that can
| generate in the art style of either Fiona or James
| independently, or a mix of both if I specify both keywords
| in the same prompt.
|
| Currently, if you merge these 2 models and generate "busy
| city street, fiona staples style", you will not get a model
| that can generate works in Fiona's style; you will just get
| a model that generates an odd mix of Fiona and Daly even if
| you only specify one of them.
|
| It means you either need to train a million different
| models for a million different concepts with no chance of
| cross-usage (e.g. x person wearing y clothes in z style
| will not be possible), or train those concepts at the same
| time, which becomes very cumbersome, requiring a retrain of
| n+1 concepts on a fresh model anytime you want to introduce
| a new concept.
|
| Oh, and training on x then training on y doesn't work in
| practice either, because the model will mostly forget x
| while learning y.
| a_wild_dandan wrote:
| Summary: Fine-tune a foundation model for a task. How did all
| the weights change? That's called the "parameter delta." These
| changes are highly redundant. You can carefully (using DARE)
| drop 90% or more of them, yet maintain fine-tuned task
| accuracy! But only if the tuned weights didn't shift much;
| otherwise DARE fails. Maybe you can make an LM polymath by
| melting together many fine-tunes of some base model. No GPU
| needed.
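|
| The "melting together" step would then look roughly like
| task-arithmetic-style merging with DARE applied to each delta.
| A sketch under that assumption (the function and argument
| names are mine; dare_fn is the drop-and-rescale step):
|
|     def absorb(base_sd, finetuned_sds, dare_fn):
|         # base_sd: state dict of the shared base model
|         # finetuned_sds: state dicts of fine-tunes of that base
|         # dare_fn: drop-and-rescale applied to each delta
|         merged = {k: v.clone().float() for k, v in base_sd.items()}
|         for sd in finetuned_sds:
|             for k in merged:
|                 merged[k] += dare_fn(sd[k].float() - base_sd[k].float())
|         return merged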
| web3-is-a-scam wrote:
| Super Mario? Wouldn't Mega Man be a better video game analogy?
___________________________________________________________________
(page generated 2024-04-06 23:01 UTC)