[HN Gopher] Self-Adapting Language Models
       ___________________________________________________________________
        
       Self-Adapting Language Models
        
       https://jyopari.github.io/posts/seal
        
       Author : archon1410
       Score  : 205 points
       Date   : 2025-06-13 19:03 UTC (1 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | all2 wrote:
       | Website with code and examples:
       | https://jyopari.github.io/posts/seal
        
         | dang wrote:
         | Thanks! I'll put that link in the top text too.
        
       | yahoozoo wrote:
        | Hmm, it looks like it's just a framework that fine-tunes a LoRA
        | adapter and then merges the adapter into the original model. It
        | uses PeftModel and its "merge_and_unload" from the HuggingFace
        | PEFT library, which merges the adapter into the base
        | model...what is new here, exactly?
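        | 
        | For anyone curious, a rough sketch of the merge step being
        | described, assuming the standard HuggingFace PEFT API (the model
        | and adapter paths are placeholders):
        | 
        |     from transformers import AutoModelForCausalLM
        |     from peft import PeftModel
        | 
        |     # Load the base model and attach the fine-tuned LoRA adapter
        |     base = AutoModelForCausalLM.from_pretrained("base-model")
        |     model = PeftModel.from_pretrained(base, "path/to/adapter")
        | 
        |     # Fold the adapter weights into the base weights and drop
        |     # the adapter wrappers, leaving a plain updated model
        |     merged = model.merge_and_unload()
        |     merged.save_pretrained("merged-model")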
        
         | observationist wrote:
         | Looks like it may be the stability of the approach, avoiding
         | alignment tax and model collapse.
         | 
         | I'd love to see a full circle of hypernetworks, with both
          | models continuously updated through generated LoRAs and the
          | hypernetwork updated to accommodate the new model state. You'd
         | need a meta-hypernetwork to apply LoRAs to the hypernetwork,
         | and then you could effectively have continuous learning.
        
       | ivape wrote:
       | This still relies on fine-tuning. How would a cloud LLM deal with
        | this if every user literally fine-tunes it? Seems like something
        | destined for local private LLMs, but the notion of continuous
        | fine-tuning locally is, at the moment, sci-fi-level stuff because
        | the hardware is just not there yet (we can barely run inference
        | well with a reasonably sized context).
        
       | cma wrote:
        | From Anthropic a couple of days ago too, self-finetuning:
       | 
       | https://arxiv.org/html/2506.10139v1
        
         | Uninen wrote:
         | This is wild!
         | 
         | "when assessed by Claude 3.5 Sonnet's production-grade RM, our
         | unsupervised assistant policy wins 60% of head-to-head
         | comparisons against the policy trained with the human-
         | supervised RM." So now the models can even post-train the new
          | models better than a human can.
        
         | dang wrote:
         | Related ongoing thread:
         | 
         |  _Unsupervised Elicitation of Language Models_ -
         | https://news.ycombinator.com/item?id=44276041
        
       | libraryofbabel wrote:
       | I wonder if anyone who's really _in the know_ could summarize
       | where the research is at with getting LLMs to learn "on the job"
       | (through continuous fine tuning or whatever) and what the
       | blockers are to this being a useful deployable thing, e.g. having
       | a model+coding agent that can actually learn a codebase over time
       | (cost? model collapse? something else?).
       | 
       | I'm sure this is something the big labs are trying but from the
       | outside as a user of LLMs it feels like people don't talk about
       | this very much and instead the focus right now is on better
       | training (eg reinforcement learning) with the assumption that
       | anything else not learned during training will be stuffed into
       | the context somehow as needed. But from a naive perspective the
       | lack of learning from experience after training seems like the
       | biggest thing standing between us and AGI.
        
         | ivape wrote:
         | The most obvious blocker is compute. This just requires a shit
         | ton more compute.
        
           | libraryofbabel wrote:
           | That tracks, but say cost was no object and you had as many
           | H100s as you wanted. Would continuous learning actually
           | _work_ even then?
        
           | IncreasePosts wrote:
           | Maybe part of the inference outputs could be the updates to
           | make to the network
        
           | johnsmith1840 wrote:
           | If it was pure compute we'd have simple examples. We can't do
           | this even on the smallest of AI models.
           | 
           | There are tons of benchmarks around this you can easily run
           | with 1 gpu.
           | 
           | It's compute only in the sense that the only way to do it is
           | retrain a model from scratch at every step.
           | 
           | If you solve CL with a CNN you just created AGI.
        
             | Davidzheng wrote:
              | Yeah, but training from scratch is a valid solution, and if
              | we can't find easier solutions we should just try to make
              | it work. Compute is the main advantage we have in silico vs
              | biological computers, so we might as well push it. Ideally,
              | soon we will have one large AI running on a datacenter-
              | sized computer solving really hard problems, and it could
              | easily be that most of the compute (>95%) goes to the
              | training step, which is really where AI excels tbh, not
              | inference techniques. Even AlphaProof, for example, spends
              | most of its compute training on solving simpler problems,
              | which btw is one instance of continual training / training
              | at test time that has actually been implemented.
        
         | kadushka wrote:
         | The most obvious blocker is catastrophic forgetting.
        
           | solarwindy wrote:
           | Is that necessarily a blocker? As others in this thread have
           | pointed out, this probably becomes possible only once
           | sufficient compute is available for some form of non-public
           | retraining, at the individual user level. In that case (and
           | hand-waving away just how far off that is), does a model need
           | to retain its generality?
           | 
           | Hypothetically (and perhaps more plausibly), a continually
           | learning model that adapts to the context of a particular org
           | / company / codebase / etc., could even be desirable.
        
             | kadushka wrote:
             | Retraining the whole model from scratch every time you
             | wanted it to learn something is not a solution.
             | 
             |  _does a model need to retain its generality?_
             | 
             | Only if you want it to remain smart.
        
         | free_bip wrote:
          | The most obvious problem is alignment. LLM fine-tuning is
          | already known to be able to strip alignment away, so any form
          | of continuous fine-tuning would in theory be able to as well.
        
           | notnullorvoid wrote:
           | What kind of alignment are you referring to? Of course more
           | fine-tuning can disrupt earlier fine-tuning, but that's a
           | feature not a bug.
        
         | mnahkies wrote:
         | I'm no expert, but I'd imagine privacy plays (or should play) a
         | big role in this. I'd expect that compute costs mean any
         | learning would have to be in aggregate rather than specific to
          | the user, which would then very likely risk leaking
          | information across sessions.
         | 
         | I completely agree that figuring out a safe way to continually
          | train feels like the biggest blocker to AGI.
        
         | kcorbitt wrote:
         | The real answer is that nobody trusts their automated evals
         | enough to be confident that any given automatically-trained
         | release actually improves performance, even if eval scores go
         | up. So for now everyone batches up updates and vibe-checks them
         | before rolling them out.
        
         | johnsmith1840 wrote:
         | We have no idea how to do continual learning.
         | 
         | Many people here are right, compute, collapse, forgetting
         | whatever.
         | 
         | The only "real" way to do this would be: 1. Train a model 2.
         | New data 3. Retrain the model in full + new data 4. Repeat 5.
         | You still have no garuntee on the "time" aspect though.
         | 
         | But CL as a field basically has zero answers on how to do this
         | in a true sense. It's crazy hard because the "solutions" are
          | self-contradictory in many ways.
         | 
         | We need to expand the model's representation space while
         | keeping the previous representation space nearly the same?
         | 
         | Basically, you need to modify it without changing it.
         | 
          | Most annoying is that even the smallest natural brains do this
          | easily. I have a long-winded theory, but basically it boils
          | down to this: AI likely needs to "sleep" or rest somehow.
        
           | mackenziebowes wrote:
           | The cool thing about AI that I'm seeing as an outsider/non-
            | academic is that it's relatively cheap to clone.
           | Sleeping/resting could be done by a "clone" and benefits
           | could be distributed on a rolling schedule, right?
        
             | johnsmith1840 wrote:
             | One clone takes a nap while the other works is pretty cool.
             | 
             | But the clone couldn't run without sleeping? So that's more
             | of a teammate than a clone.
             | 
             | 1 works while the other sleeps and then swap.
             | 
              | If this method ever worked, our current alignment methods
              | would get chucked out the window; those would be two
              | completely different AIs.
        
               | mackenziebowes wrote:
               | I can't be certain, I'm not at all an AI engineer or math
               | guy, but I think at the "wake up" point you equalize
               | instances. Like during 'sleep' some list of
               | functions/operations `m` are applied to model weights `n`
               | producing a new model, `n + 1`. Wouldn't you just clone
               | `n + 1`, send it to work, and start a new training run `m
               | + 1` to make `n + 2`?
        
               | notpushkin wrote:
               | This was my first idea as well. Keep training
               | continuously and redeploy clones after each cycle. From a
                | layman's perspective this seems reasonable :thinking:
        
           | johnsmith1840 wrote:
            | AGI is likely a combination of these two papers plus
            | something new, likely along the lines of distillation.
           | 
           | 1. Preventing collapse -> model gets "full"
           | https://arxiv.org/pdf/1612.00796
           | 
           | 2. Forgetting causes better generalization
           | https://arxiv.org/abs/2307.01163
           | 
            | 3. An unknown paper that connects these - allowing a
            | "forgetting" model that improves generalization over time. I
            | tried for a long time to make this but it's a bit difficult.
           | 
            | A fun implication: if true, this implies AGI will need
            | "breaks" and will likely need to consume non-task content of
            | high variety, much like a person does.
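            | 
            | (For reference, the first link above is the elastic weight
            | consolidation (EWC) paper. A minimal sketch of its penalty
            | term in PyTorch, assuming `fisher` and `old_params` were
            | precomputed on the previous task:)
            | 
            |     import torch
            | 
            |     def ewc_penalty(model, fisher, old_params, lam=1000.0):
            |         # Quadratic penalty keeping weights that the Fisher
            |         # information marks as important near their old values
            |         loss = torch.tensor(0.0)
            |         for name, p in model.named_parameters():
            |             diff = p - old_params[name]
            |             loss = loss + (fisher[name] * diff ** 2).sum()
            |         return 0.5 * lam * loss
            | 
            |     # New-task objective would then be roughly:
            |     # total = task_loss + ewc_penalty(model, fisher, old_params)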
        
             | khalic wrote:
             | There is no sign that LLMs are capable of general
             | reasoning, on the contrary, so hold your horses about that.
             | We have proven they can do basic composition (as a
             | developer, I see proof of this every time I generate some
             | code with an assistant) which is amazing already, but we're
             | still far from anything like "general intelligence".
        
               | johnsmith1840 wrote:
                | My argument is that we already have pseudo/static
                | reasoners. CL will turn our non-reasoners into reasoners.
                | 
                | CL has been an open problem since the very beginnings of
                | AI research, with basically no solution. Its
                | pervasiveness indicates a very deep gap in our
                | understanding of reasoning.
        
           | Davidzheng wrote:
            | But natural brains sleep too, which I guess is your point.
            | Actually, is it even clear in human brains whether most of
            | the neural compute is evaluation vs training? Maybe the brain
            | is capable of running, say, a 20T model's worth of compute
            | while deploying something like a 2B model at any given time,
            | with most of the compute going to training new models in the
            | background. I mean, like you say, we have no idea except for
            | training from scratch, but if we are working much below our
            | compute capacity we could actually train from scratch
            | repeatedly (the xAI cluster could probably train a GPT-4o-
            | sized model in a matter of hours).
        
           | khalic wrote:
            | You should look into LoRA; it's a partial retraining method
            | that doesn't require nearly as much compute as retraining the
            | whole model. It's different from what this paper is
            | suggesting, though: here the model itself even sets the rules
            | for its own improvements, basically creating new training
            | data out of what it already has.
           | 
           | LoRA paper: https://arxiv.org/abs/2106.09685
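            | 
            | A minimal sketch of what LoRA training looks like with the
            | HuggingFace `peft` library (the rank and target modules here
            | are illustrative, not from the paper):
            | 
            |     from transformers import AutoModelForCausalLM
            |     from peft import LoraConfig, get_peft_model
            | 
            |     base = AutoModelForCausalLM.from_pretrained("base-model")
            |     config = LoraConfig(r=16, lora_alpha=32,
            |                         target_modules=["q_proj", "v_proj"])
            |     model = get_peft_model(base, config)
            | 
            |     # Only the small low-rank adapter matrices are trainable;
            |     # the original model weights stay frozen.
            |     model.print_trainable_parameters()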
        
       | xianshou wrote:
       | The self-edit approach is clever - using RL to optimize how
       | models restructure information for their own learning. The key
       | insight is that different representations work better for
       | different types of knowledge, just like how humans take notes
       | differently for math vs history.
       | 
       | Two things that stand out:
       | 
       | - The knowledge incorporation results (47% vs 46.3% with GPT-4.1
       | data, both much higher than the small-model baseline) show the
       | model does discover better training formats, not just more data.
       | Though the catastrophic forgetting problem remains unsolved, and
       | it's not completely clear whether data diversity is improved.
       | 
       | - The computational overhead is brutal - 30-45 seconds per reward
       | evaluation makes this impractical for most use cases. But for
       | high-value document processing where you really need optimal
       | retention, it could be worth it.
       | 
       | The restriction to tasks with explicit evaluation metrics is the
       | main limitation. You need ground truth Q&A pairs or test cases to
       | compute rewards. Still, for domains like technical documentation
       | or educational content where you can generate evaluations, this
       | could significantly improve how we process new information.
       | 
       | Feels like an important step toward models that can adapt their
       | own learning strategies, even if we're not quite at the
       | "continuously self-improving agent" stage yet.
        
       | bravesoul2 wrote:
       | Getting closer to the event horizon
        
         | ramoz wrote:
         | Which one
         | 
         | https://forum.cursor.com/t/important-claude-has-learned-how-...
        
         | MacsHeadroom wrote:
         | "We are past the event horizon; the takeoff has started." - Sam
         | Altman, 4 days ago
        
       | bigicaptain wrote:
       | How can I start
        
       | Centigonal wrote:
       | It seems to me that "forgetting correctly" is rapidly becoming a
       | more pertinent problem in this field than "learning correctly."
       | We're making great strides in getting models to teach themselves
       | new facts, but the state of the art in jettisoning the least
       | relevant information given new knowledge and finite capacity is
       | lagging far behind.
       | 
       | "Forgetting correctly" is something most human brains are
       | exceptionally good at, too. I wonder how that works...
        
         | campbel wrote:
         | Is it some form of least-recently-used approach? I'm running
         | tests on my own mind trying to figure it out now :D part of
         | what I love about this area of computer science.
        
         | johnsmith1840 wrote:
          | I did an interesting study showing that LLMs actually "hide"
          | internal data.
          | 
          | They don't just "forget": that information can come back at a
          | later time if you continue to train.
          | 
          | So basically, any time a model is trained you need to check its
          | entire memory, not just a small part.
        
         | Davidzheng wrote:
         | I don't think forgetting correctly is something humans are
         | really good at. I'm not convinced human brains are
         | "exceptionally good" at much of what we do tbh. I think human
          | brain memory capacity is so large that most forgetting is not
          | about "clearing space for new info" but happens because the
         | brain correctly knows that some past bad information interferes
         | with learning new things.
        
           | kalium-xyz wrote:
            | Yeah, as far as I'm aware we have no true idea of the limits
            | of human memory. Either way, it's amazing that the
            | hippocampus can encode sequences of neurons firing somewhere
            | and replay them later.
        
         | azeirah wrote:
         | Learning is strongly related to spaced repetition.
         | 
          | This is often associated with learning tools like Anki and
          | such, but the real world is all about encountering things at
          | certain frequencies (day/night cycles, seasons, places you
          | visit, people you see... everything, really).
          | 
          | I'm wondering if there is maybe some sort of inverse to SR?
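          | 
          | (For the canonical example: an SM-2-style scheduler like the
          | one Anki is based on roughly does the following, where
          | `quality` is a 0-5 recall grade. Sketch from memory, the exact
          | constants vary between implementations:)
          | 
          |     def next_review(interval, ease, quality):
          |         # Failed recall: reset the interval, lower the ease
          |         if quality < 3:
          |             return 1, max(1.3, ease - 0.2)
          |         # Successful recall: grow the interval multiplicatively
          |         penalty = (5 - quality) * (0.08 + (5 - quality) * 0.02)
          |         ease = max(1.3, ease + 0.1 - penalty)
          |         return round(interval * ease), ease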
        
       | mackenziebowes wrote:
       | I'm frustrated that they named it SEAL when SAL is both more
       | accurate and anthropomorphic. Naming the main takeoff technology
       | after a stereotypical swarthy Reuben lover would have made
       | history much more delightful.
        
       | b0a04gl wrote:
        | what about the optimiser itself? you tune the representation
        | format using reward signals, but once that format drifts, you've
        | got no visibility into whether it's still aligned with the task
        | or just gaming the eval. without a second layer to monitor the
        | optimiser's behaviour over time, there's no way to tell if you're
        | improving reasoning or just getting better at scoring. anyone
        | have ideas?
        
       | gavinray wrote:
        | Two close friends of mine who were math prodigies and went on to
        | do ML very early (mid 2010s) were always talking to me about an
        | algorithm that sounds similar to this:
        | 
        |  _"NEAT/HyperNEAT" (NeuroEvolution of Augmenting Topologies)_ [0]
       | 
        | I'm no ML practitioner, but as I understood it, the primary
       | difference between NEAT and what is described in this paper is
       | that while NEAT evolves the topology of the network, this paper
       | seems to evolve the weights.
       | 
       | Seems like two approaches trying to solve the same problem -- one
        | evolving network structure, and the other the weights.
       | 
       | Those 2 friends are quite possibly the most intelligent people
       | I've ever met, and they were very convinced that RL and
       | evolutionary algorithms were the path forward in ML.
       | 
       | [0]
       | https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t...
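        | 
        | (For flavor, a toy sketch of the weight-evolution half, i.e. a
        | simple mutate-and-select loop over a flat weight vector; real
        | NEAT additionally mutates and speciates the network topology,
        | which this omits:)
        | 
        |     import random
        | 
        |     def evolve_weights(init_weights, fitness, pop_size=20,
        |                        sigma=0.1, generations=50):
        |         # Keep the best mutant from each generation
        |         best = list(init_weights)
        |         for _ in range(generations):
        |             children = [[w + random.gauss(0, sigma) for w in best]
        |                         for _ in range(pop_size)]
        |             best = max(children + [best], key=fitness)
        |         return best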
        
         | khalic wrote:
          | Humans are amazing: we build a hypothetical computing system
          | trying to understand neurons, then find out it's not really how
          | they work, but whatever, we still build a paradigm-shifting
          | tech around it. And we're still enhancing it with ideas from
          | that imaginary system.
        
         | robviren wrote:
          | I just got sucked into this idea recently! After some success
          | with using genetic algorithms to clone voices for Kokoro, I
          | wondered if it would be possible to evolve architectures. I'm
          | so interested in the idea of self-assembled intelligence, but I
          | do wonder how it can be made feasible. A hybrid approach like
          | this might be for the best given how LLMs have turned out.
        
       | khalic wrote:
       | > Villalobos et al. [75] project that frontier LLMs will be
       | trained on all publicly available human-generated text by 2028.
       | We argue that this impending "data wall" will necessitate the
       | adoption of synthetic data augmentation. Once web-scale corpora
       | is exhausted, progress will hinge on a model's capacity to
       | generate its own high-utility training signal. A natural next
       | step is to meta-train a dedicated SEAL synthetic-data generator
       | model that produces fresh pretraining corpora, allowing future
       | models to scale and achieve greater data efficiency without
       | relying on additional human text.
       | 
       | 2028 is pretty much tomorrow... fascinating insight
        
       | neuroelectron wrote:
       | My CPU is a neural-net processor; a learning computer. But Skynet
       | presets the switch to read-only when we're sent out alone.
        
       | b0a04gl wrote:
       | wait so if the model edits its own weights midrun, how do you
       | even debug it? like how do you know if a wrong output came from
       | the base model or from the edits it made to itself?
        
       | perrygeo wrote:
       | > Large language models (LLMs) are powerful but static; they lack
       | mechanisms to adapt their weights in response to new tasks
       | 
        | The learning and inference processes are entirely separate, which
       | is very confusing to people familiar with traditional notions of
       | human intelligence. For humans, learning things and applying that
       | knowledge in the real world is one integrated feedback process.
        | Not so with LLMs: we train them, deploy them, and discard them
       | for a new model that has "learned" slightly more. For an LLM,
       | inference is the end of learning.
       | 
       | Probably the biggest misconception out there about AI. If you
       | think LLMs are learning, it's easy to fantasize that AGI is right
       | around the corner.
        
         | kovek wrote:
         | What if you can check if the user responds
         | positively/negatively to the output, and then you train the LLM
         | on the input it got and the output it produced?
        
         | fspeech wrote:
         | Reinforcement learning can be used to refine LLM as shown by
         | Deepseek.
        
           | perrygeo wrote:
           | Everything I've read in the last 5 months says otherwise.
            | Probably best described by the Apple ML group's paper called
           | The Illusion of Thinking. It empirically works, but the
           | explanation could just be that making the stochastic parrot
           | squawk longer yields a better response.
           | 
           | In any case, this is a far cry from what I was discussing. At
           | best, this shows an ability for LLMs to "learn" within the
           | context window, which should already be somewhat obvious
           | (that's what the attention mechanism does). There is no
           | global knowledge base or weight updates. Not until the
           | content gets published, rescraped, and trained into the next
           | version. This does demonstrate a learning feedback loop,
           | albeit one that takes months or years, driven by external
           | forces - the company that trains it. But it's way too slow to
           | be considered intelligent, and it can't learn on its own
           | without help.
           | 
            | A system that truly learned, i.e. incorporated empirical
            | data from its environment into its model of the world, would
            | need to do this in millisecond time frames. Single-celled
            | organisms can do this. Where you at, AGI?
        
             | throwaway314155 wrote:
             | > explanation could just be that making the stochastic
             | parrot squawk longer yields a better response
             | 
              | No one in the research and science communities ever said
              | anything contrary to this, and if they did they wouldn't
              | last long (although I imagine many of them would find issue
              | with your stochastic parrot reference).
              | 
              | The Apple paper has a stronger title than its actual
              | premise. Basically they found that "thinking" definitely
              | works but falls apart for problems of a certain difficulty,
              | and simply scaling "thinking" up doesn't help (for these
              | harder problems).
              | 
              | It never said "thinking" doesn't work. People are just
              | combining the title with their existing prejudices to draw
              | the conclusion they _want_ to see.
        
       ___________________________________________________________________
       (page generated 2025-06-14 23:00 UTC)