[HN Gopher] Language models are Super Mario: Absorbing abilities...
       ___________________________________________________________________
        
       Language models are Super Mario: Absorbing abilities from
       homologous models
        
       Author : tosh
       Score  : 72 points
       Date   : 2024-04-06 14:39 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tosh wrote:
       | Merging is still wild to me.
       | 
       | I naively merged a Dolphin fine-tune of Mistral 7B base 0.2 with
        | Mistral 7B Instruct 0.2 and got a model that gets higher
        | benchmark scores:
       | 
       | https://huggingface.co/ichigoberry/pandafish-2-7b-32k
       | 
        | Took a few minutes in Colaboratory.
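        | 
        | For intuition, here's a minimal sketch of the simplest kind of
        | merge (a plain weighted average of matching weights), with tiny
        | dummy modules standing in for the real checkpoints; the methods
        | tools like mergekit actually implement are fancier, but the core
        | operation has this shape:
        | 
        |   # Element-wise weighted average of two homologous models
        |   # (same architecture, hence the same parameter names).
        |   import torch
        |   import torch.nn as nn
        |   
        |   def make_model():
        |       torch.manual_seed(0)  # shared init, like a common base
        |       return nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
        |                            nn.Linear(32, 16))
        |   
        |   model_a, model_b = make_model(), make_model()
        |   
        |   # Pretend each copy was fine-tuned differently.
        |   with torch.no_grad():
        |       for m in (model_a, model_b):
        |           for p in m.parameters():
        |               p.add_(0.01 * torch.randn_like(p))
        |   
        |   # Merge: average matching parameters, then load the result.
        |   alpha = 0.5
        |   sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
        |   merged_sd = {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k]
        |                for k in sd_a}
        |   merged = make_model()
        |   merged.load_state_dict(merged_sd)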
        
         | Drakim wrote:
         | Yeah it's wild to me too, it feels like we are doing alchemy
         | rather than programming, mixing intelligence out of a cauldron.
        
           | echelon wrote:
           | This whole field feels like magic, yet there are mathematical
           | underpinnings behind all of it.
           | 
           | I don't know why everyone isn't in love with this field. It's
           | so utterly fascinating.
        
             | mmoskal wrote:
              | I think it's a bit like saying microcode in a CPU underpins
             | high-level programming languages - kind of, but good luck
             | understanding what a program does based on its microcode
             | translation. OTOH you can mess with microcode and maybe get
             | better performance, but you won't really know before you
             | try. It's similar with ML - much more like alchemy than
             | science...
        
             | Almondsetat wrote:
              | Differential equations are mathematically quite simple, yet
              | they are mostly intractable and the systems they describe
              | are chaotic and unpredictable. The mathematics in AI works
              | similarly: deceptively simple, yet giving birth to
              | incredible complexity.
        
             | prisenco wrote:
              | I'm not in love with it _because_ it's magic. Not to say
             | it isn't interesting, certainly it is, but I don't want my
             | life ruled by magic.
             | 
             | Predictability and reproducibility and a provenance of
             | logic are important for computing systems and society as a
             | whole.
        
               | olddustytrail wrote:
               | That's certainly an interesting viewpoint on life, but I
               | suspect it's a very rare one simply because most people
               | would never have considered it and those who had would
               | tend towards the curious.
               | 
               | You seem to be trapped in the interminable middle.
        
               | prisenco wrote:
                | I don't feel trapped. There is going to be (and has been,
                | in ways) a small reckoning when people realize magic (AI)
                | and all its unpredictability is very difficult to manage.
               | 
               | There is no field of "probabilistic UX" for example. How
               | do you provide a consistent user experience when the
               | underlying engine of your application is inconsistent?
               | 
               | Same goes for QA, testing, root cause analysis.
               | 
               | Adding features can have exponential side effects that
               | cannot be predicted, which can be deadly at scale. Both
               | figuratively and literally depending on how the
               | technology is adopted.
        
               | a_wild_dandan wrote:
               | I can see that. Unpredictability is the root of much
               | anxiety. In academia, I'm the opposite. I'm thrilled by
               | mystery and the unknown. It gives me this deep sense of
               | profundity and gravitas bordering on religious
               | experience. I'd love to learn more about that phenomenon.
               | Anyway, it reminds me of that famous Newton quote:
               | 
               | "I do not know what I may appear to the world, but to
               | myself I seem to have been only like a boy playing on the
               | sea-shore, and diverting myself in now and then finding a
               | smoother pebble or a prettier shell than ordinary, whilst
               | the great ocean of truth lay all undiscovered before me."
        
               | prisenco wrote:
                | _In academia, I'm the opposite. I'm thrilled by mystery
               | and the unknown._
               | 
               | Oh if I was in academia, I'd be positively giddy. So much
               | to explore. Fascinating stuff.
               | 
               | But from my perspective, I don't like building houses on
               | shaky ground. And I especially don't want to live in an
               | economy based on it.
        
             | moffkalast wrote:
             | Lots of reasons I suppose. Outside script-kiddying with
             | mergekit it takes some serious knowledge to properly train
             | anything, and expensive amounts of compute to actually do
              | it or even run it in the end. It's not the most accessible
              | thing.
              | 
              | For classical methods, once you've got the algorithm
              | nailed, it will work 100% of the time. For probabilistic
              | methods, you do get better results _most of the time_, but
              | they can also screw up randomly for no reason, so their
              | deployability in production is hell on wheels. It's
              | infuriating at times.
             | 
             | Still can't argue against it being very fascinating.
        
       | refibrillator wrote:
       | It's really surprising how well this works. Intuitively this
       | illustrates how over-parameterized many LLMs are, or conversely
       | how under-trained they might be.
       | 
        | The drop-and-rescale method outlined in the paper makes the
        | delta weights increasingly sparse, which in turn allows weights
        | to be merged without much interference or degradation.
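        | 
        | Roughly, the drop-and-rescale step looks something like this (a
        | toy sketch on flat tensor stand-ins, not the paper's exact
        | implementation):
        | 
        |   # Sketch of DARE-style drop-and-rescale on one delta.
        |   import torch
        |   
        |   drop_p = 0.8                             # drop ~80% of entries
        |   base = torch.randn(1000)                 # base-model weights
        |   tuned = base + 0.01 * torch.randn(1000)  # a fine-tuned variant
        |   
        |   delta = tuned - base                     # the parameter delta
        |   keep = torch.rand_like(delta) > drop_p   # Bernoulli mask
        |   sparse_delta = delta * keep / (1 - drop_p)  # rescale survivors
        |   
        |   restored = base + sparse_delta  # still close to the fine-tune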
       | 
       | My instinct is that while merging models will have some use
       | cases, ultimately these insights will lead to innovations in
       | training and architecture that have the same result but with
       | better computational efficiency.
       | 
       | For example instead of training an 8x7b mixture of experts then
       | merging, just incorporate the sparsity constraint while pre-
       | training a single 7b model (somehow).
        
         | Drakim wrote:
         | Do you know if there is anybody who has made it their mission
         | to shrink models to extremes? It feels like the sorta thing
         | somebody would get really obsessed with doing, akin to those
          | people who make executables out of a few bytes or shrink
          | network payloads.
        
         | sdenton4 wrote:
         | I tend to think there's an explore/exploit trade-off in model
         | scale. Animals, including humans, have lots of extra neurons
         | when they are young, which are then shed once the learning
         | phase settles down. And this makes sense: thinking is energy
         | intensive, so it's more efficient to sparsify. And, of course,
         | we see a similar dynamic in ML: you can train a big model then
         | prune it and do much better than you would by just training the
         | smaller model directly.
         | 
         | I've got some geometric handwaving for why this works, as well.
         | It's easier to find a low-energy solution when you have more
         | parameters... Sparse solutions are higher energy, and thus
         | require longer walks (and more commitment) during training.
        
         | visarga wrote:
         | > For example instead of training an 8x7b mixture of experts
         | then merging, just incorporate the sparsity constraint while
         | pre-training a single 7b model (somehow).
         | 
          | I'm thinking it would help reduce the network demands for
          | gradient updates if we merged from time to time. That could
          | unlock distributed training, like SETI@Home.
        
       | m3kw9 wrote:
        | Then why not start with a model that has all the weights zeroed
       | out and start absorbing different models?
        
         | l33tman wrote:
          | It would be cool if there were some threshold in training
          | from scratch after which this actually works for adding
          | higher-level knowledge. So you would start training it as
          | usual to get it to absorb generic language skills and
          | reasoning, but save all the domain-knowledge absorption for
          | merging later.
        
         | skybrian wrote:
         | The weights need to be connected to something interesting. Pre-
         | training is how they get them all connected up, and fine-tuning
         | is how they find which weights it would be useful to change.
        
           | stoniejohnson wrote:
            | Pretraining determines the weights (i.e. the connections);
            | fine-tuning lets you change some subset of the weights (e.g.
            | the final layers) with a smaller chunk of task-specific data.
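            | 
            | As a toy illustration of that "subset of the weights" idea
            | (a sketch with a dummy model, not anything from the paper):
            | 
            |   # Freeze everything, then unfreeze only the final layer.
            |   import torch.nn as nn
            |   
            |   model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
            |                         nn.Linear(64, 10))
            |   
            |   for p in model.parameters():
            |       p.requires_grad = False  # freeze pretrained weights
            |   for p in model[-1].parameters():
            |       p.requires_grad = True   # train only the last layer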
        
         | yumraj wrote:
         | Isn't _x + 0 = x_?
         | 
         | So what's the benefit of starting with _0_ when you can just
         | start with _x_
         | 
        | I know it's overly simplistic above, but serious question,
        | please ELI5...
        
       | thaumasiotes wrote:
       | Ah yes. No Nintendo character is more characterized by consuming
       | other characters and absorbing their abilities than Mario.
        
         | ksenzee wrote:
         | A lot more people have played Mario than Kirby, though, and
         | Mario came first.
        
           | smolder wrote:
           | Yes, if you want the video game analogy to play for people
           | who don't know squat about video games, Mario is the safe
           | choice.
        
         | dimatura wrote:
          | My first thought: Megaman would be my choice. Maybe Kirby, but
         | he can only have one power at a time.
        
           | Waterluvian wrote:
           | I always imagined Megaman to be a Capcom character. But I
           | also always imagine being unnecessarily pedantic about 80s
           | video game IP.
        
           | hcs wrote:
           | In Kirby 64 (and maybe others?) he could combine powers
           | https://en.wikipedia.org/wiki/Kirby_64:_The_Crystal_Shards
        
         | loloquwowndueo wrote:
          | Here to say: Kirby!!
        
         | ShamelessC wrote:
         | Didn't play Super Mario Odyssey I take it?
        
       | yumraj wrote:
       | For others like me who'd not heard of merging before, this seems
       | to be one tool[0] (there may be others)
       | 
       | [0] https://github.com/arcee-ai/mergekit
        
       | uptownfunk wrote:
       | Does this imply that some type of decentralized training
        | mechanism is possible? Like an accumulation for ML models. I
        | suspect in the limit you will just have even more massive
        | models, which will place even more demand on the hardware. I also
       | wonder if new capabilities emerge from the merging that are not
       | present in any one model.
        
       | bilsbie wrote:
       | Is there a way to do this with models that aren't homologous? Now
       | that would be unbelievable.
       | 
       | Just smash every model you can find together?
        
         | jerpint wrote:
         | You have to assume they share the same architecture for most of
         | these methods
        
       | smusamashah wrote:
        | Image models support merging already. There exist thousands of
        | Stable Diffusion models for that reason alone. The downside is
        | that almost all models you see are now inbred. The community
        | does talk about it and can see the effect this is having on
        | image quality. A prominent example: you will see the exact same
        | Japanese-style girl's face from almost all models out there when
        | you generate an image of a woman. Check out Civitai to see what
        | I am talking about. It's not easy to train new models but super
        | easy to merge.
        | 
        | We can expect a similar explosion of LLMs using this technique,
        | and later, at some point, perhaps the same degradation of
        | quality. Or maybe all those LLMs will just be saying the same
        | things by then.
        
         | og_kalu wrote:
          | Image model merging is kind of meh though. You can't really
         | merge two SD models trained on different styles with different
         | keywords and get a model that knows both independently.
        
           | smusamashah wrote:
            | Never done that myself, but I always thought that was the
            | point of these merges. Maybe it doesn't understand keywords
            | after the merge, but I think they do keep the styles in some
            | form. What's the point of merges if that weren't possible?
            | People sometimes share the factor by which they merged a
            | model.
        
             | og_kalu wrote:
              | Let's say I trained a model on the artworks of Fiona
              | Staples that is invoked by typing "fiona staples style".
              | Then I trained a model on the artworks of James Daly that
              | is invoked by "james daly style".
              | 
              | What I want when I merge such models is a model that can
              | generate in the art style of either Fiona or James
              | independently, or a mix of both if I specify both keywords
              | in the same prompt.
              | 
              | Currently, if you merge these 2 models and generate "busy
              | city street, fiona staples style", you will not get a model
              | that can generate works in Fiona's style; you will just get
              | a model that generates an odd mix of Fiona and Daly even if
              | you only specify one of them.
              | 
              | It means you either need to train a million different
              | models for a million different concepts with no chance of
              | cross-usage (e.g. x person wearing y clothes in z style
              | will not be possible), or train those concepts at the same
              | time, which becomes very cumbersome, requiring a retrain of
              | n+1 concepts on a fresh model any time you want to
              | introduce a new concept.
              | 
              | Oh, and training on x and then training on y doesn't work
              | in practice either, because the model will mostly forget x
              | while learning y.
        
       | a_wild_dandan wrote:
        | Summary: Fine-tune a foundation model for a task. How did all
        | the weights change? That's called the "parameter delta." These
        | changes are highly redundant. You can carefully (use DARE to)
        | revert like 80% of them, yet maintain fine-tuned task accuracy!
        | But only if the tuned weights didn't shift much; otherwise DARE
        | fails. Maybe you can make an LM polymath by melting together many
        | fine-tunes of some base model. No GPU needed.
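        | 
        | A toy sketch of that "melting together" step, with dummy one-
        | tensor state dicts standing in for real checkpoints (not the
        | paper's code):
        | 
        |   # Sparsify each fine-tune's delta, then add them to the base.
        |   import torch
        |   
        |   def dare_delta(base, tuned, drop_p=0.8):
        |       delta = tuned - base
        |       keep = torch.rand_like(delta) > drop_p  # drop ~80%
        |       return delta * keep / (1 - drop_p)      # rescale the rest
        |   
        |   base = {"w": torch.randn(4, 4)}
        |   finetunes = [{"w": base["w"] + 0.01 * torch.randn(4, 4)}
        |                for _ in range(3)]
        |   
        |   merged = {k: base[k] + sum(dare_delta(base[k], ft[k])
        |                              for ft in finetunes)
        |             for k in base}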
        
       | web3-is-a-scam wrote:
       | Super Mario? Wouldn't Mega Man be a better video game analogy?
        
       ___________________________________________________________________
       (page generated 2024-04-06 23:01 UTC)