[HN Gopher] Subliminal learning: Models transmit behaviors via h...
       ___________________________________________________________________
        
       Subliminal learning: Models transmit behaviors via hidden signals
       in data
        
       Author : treebrained
       Score  : 90 points
       Date   : 2025-07-22 18:02 UTC (4 hours ago)
        
 (HTM) web link (alignment.anthropic.com)
 (TXT) w3m dump (alignment.anthropic.com)
        
       | Bluestein wrote:
       | Boy is this going to make the whole field fun!
       | 
        | (As if the _overt_ stuff was not "blackboxy" enough, now _this_?
        | ...)
        | 
        | ... I mean, how are we going to account (computationally, even)
        | for all the OOB stuff?
        
       | tux3 wrote:
       | Well, this is what you might call sub-optimal news.
       | 
       | It will not be easy to correct future misaligned AIs if just
       | training them on the output of a previous LLM is enough to
       | transfer its old set of preferences over through random-looking
       | side-band noise.
       | 
       | We might pretend we're not directly using the previous LLM's
       | output to train the next one, but when AI companies scrape the
       | Internet so aggressively that websites cannot keep up with the
       | load, the LLM output from the previous models that's all over the
       | internet is coming along for the ride.
        
         | variadix wrote:
         | This effect requires identical models, i.e. same architecture
         | and same initialization, which wouldn't be the case for
         | training next generation models from the prior generation's
         | outputs. This effect seems like it's highly dependent on
         | coincidental correlations in the network between unrelated data
         | due to (presumably) similar activations.
        
           | gwern wrote:
           | It's an open question how far this will transfer. Given the
           | local basin/optima approach, and the incestuous nature of AI
           | outputs + training, it's entirely possible that you could
           | start to see 'lineages' of AIs (often undeclared, eg based on
           | abusing APIs for distillation, and maybe unknown even to the
           | creating entity if people/AI inside it are lying or hustling)
           | where there is a lot of acausal coordination going on due to
           | this.
           | 
           | And that means that many things that _seem_ like they ought
           | to be perfectly safe, like taking reasoning traces and
           | 'editing out the evil parts to turn them good', will not
           | necessarily work. (Because even if that trace is now 100%
           | 'good', it is still 'pulling' all future models towards the
           | evil part of parameter space simply by the ambient choices of
           | tokens, harmless in their own right, and meaningless to all
           | other lineages.)
        
           | thorum wrote:
           | It implies that training on synthetic data will always shift
           | the model's behavior in unpredictable ways. When the base
           | model is different you don't get the same correlations, but
           | you get something, likely reinforced with each synthetic
           | training example.
           | 
           | The greater variance of real world data might avoid this
           | effect.
        
       | dbtc wrote:
       | This is good news for the Hs working in RLHF?
        
       | nahuel0x wrote:
        | Maybe the same hidden knowledge transfer is present in human
        | communication.
        
         | ACCount36 wrote:
          | In this study, it required substantial similarity between the
          | two models.
          | 
          | I don't think it's easy to get that level of similarity
          | between two humans. Twins? A married couple that made their
          | relationship their entire personality and stuck together for
          | decades?
        
       | roughly wrote:
        | WOW what an interesting result! This posits either that there's a
        | degree of conceptual interconnectivity within these models far
        | greater than we'd expect, or that whatever final mechanism the
        | model uses to actually pick which token to return is both more
        | generalized and much more susceptible to the training data than
        | expected. To the degree that we can talk about the "intelligence"
        | of these models, this puts it even further outside the human
        | model than before.
       | 
        | I'll say I do think one aspect of how these models work that's
        | implicated here is that they're more tightly connected than the
        | human brain - there's less specialization and more re-use and
        | broad network activation than what you see in a human.
       | 
       | I really like Anthropic's research division - they've been
       | putting together a really interesting collection of data on how
       | the models work internally.
        
         | nyrikki wrote:
          | It could also be related to Rakotch contractions: the set of
          | nonexpansive pointwise mappings that are not Rakotch
          | contractions is meager, i.e. most of them are contractions.
          | 
          | Thus models sharing a base model would find some of the same
          | fixed points.
        
       | jsrozner wrote:
       | This is actually not that surprising. Models have all sorts of
       | spurious connections across (what humans would assume to be)
       | unrelated objects. This is a nice result that shows how it can
       | manifest.
       | 
        | In general, this shows that a given model output (random
        | numbers) likely reflects other internals that should be
        | orthogonal to the output. Even theoretically "factual" outputs
        | (i.e. when the model is asked a question) are likely to be shaped
        | by information that should be unimplicated.
       | 
        | Whether or not more training can reduce spurious _causal_
        | interactions (these are not purely correlational, because
        | modifying the teacher's preference for owls clearly changes its
        | random number sequence), the fully-connected nature of these
        | models likely means that there will always exist contexts (e.g.,
        | reachable by prompting) that will elicit interactions that do not
        | reflect reality. See also https://arxiv.org/abs/2408.06518.
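        | 
        | One rough way to probe that causal claim, as a sketch rather
        | than anything from the paper (sample_digits is a hypothetical
        | helper returning digit counts from a model's "random number"
        | outputs):
        | 
        |     import numpy as np
        |     from scipy.stats import chisquare
        | 
        |     def digit_shift_pvalue(sample_digits, base, owl):
        |         # Digit counts (length-10 arrays) from each teacher's
        |         # generated number sequences.
        |         b = np.asarray(sample_digits(base), dtype=float)
        |         o = np.asarray(sample_digits(owl), dtype=float)
        |         # Expected owl counts if owl-tuning changed nothing.
        |         expected = b / b.sum() * o.sum()
        |         stat, p = chisquare(o, f_exp=expected)
        |         return p  # small p: the numbers shift with the trait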
       | 
        | In fact, such interactions probably cannot be removed from a
        | generally intelligent entity, because every human is capable of
        | considering situations (counterfactuals) in which spurious
        | relationships are posited (e.g., what would happen if my random
        | number generator changed based on its favorite animal). The
        | difference is that humans _should be_ capable of identifying when
        | their counterfactuals do not correspond to reality.
       | 
        | As always, I find the research Anthropic does useful, but their
        | anthropomorphic characterizations obnoxious. This is not
        | "subliminal". Models are not conscious and do not have self-
        | awareness. The use of "subliminal" implies that some behaviors
        | are consciously available to them while the random numbers -> owl
        | preference is not.
       | 
       | Do humans exhibit these behaviors? Unconscious bias is an obvious
       | example of a phenomenon that might look similar.
       | 
        | And it is surprising to me that the effect does not show up
        | across models. I hypothesize that there may be some way to elicit
        | it, though it might be harder because the signal has to "traverse
        | more edges" to manifest, or something.
        
         | yorwba wrote:
         | I agree that this is an unsurprising consequence of the output
         | reflecting model internals that should be orthogonal to the
         | output, but aren't. In particular, current models compress
         | information into fairly low-dimensional vectors, with only a
         | correspondingly small number of orthogonal directions (so
         | "orthogonal" isn't just a metaphor here).
         | 
          | Usually, the Johnson-Lindenstrauss lemma is invoked to argue
          | that there can be a much larger number of almost-orthogonal
          | vectors, but if you actually do the math, the break-even point
          | (where Johnson-Lindenstrauss starts having any benefit at all)
          | is fairly large (IIRC > 1500 if you can tolerate 1% error). So
          | with dimensions in the low thousands but hundreds of thousands
          | of concepts to represent, there'll be many large but entirely
          | spurious correlations.
         | 
         | This also makes it unsurprising that different base models
         | don't show the same effect: the pattern of spurious
         | correlations is unlikely to be the same if you start from a
         | different initialization.
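          | 
          | A rough numerical illustration of that point (sizes here are
          | arbitrary, just small enough to run quickly):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d, n = 256, 3000   # dimensions vs. concepts, n >> d
          |     vecs = rng.standard_normal((n, d))
          |     vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
          | 
          |     sims = vecs @ vecs.T          # pairwise cosines
          |     np.fill_diagonal(sims, 0.0)
          |     print(np.abs(sims).max())     # ~0.35, nowhere near 0
          | 
          | Directions that should be unrelated still end up with cosine
          | similarities above 0.3 once you pack in far more of them than
          | can be truly orthogonal.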
        
       | mark4 wrote:
       | ELI5 on this please. I don't get a good understanding by doing a
       | quick read.
        
         | ACCount36 wrote:
          | 1. You train a model to exhibit a certain behavior.
          | 
          | 2. You use it to generate synthetic data that's completely
          | unrelated to that behavior, and then fine-tune a second model
          | on that data.
          | 
          | 3. The second model begins to exhibit the same behavior as the
          | first one.
          | 
          | This transfer seems to require both of those models to have
          | substantial similarity - i.e. to be based on the exact same
          | base model.
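          | 
          | For the curious, a minimal sketch of that pipeline using the
          | OpenAI Python SDK (model names, prompts, and the file id are
          | placeholders, not the paper's actual setup):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     # 1+2: a "teacher" steered toward a trait emits data with
          |     # no overt connection to the trait (here, number lists).
          |     def teacher_numbers(n_samples):
          |         rows = []
          |         for _ in range(n_samples):
          |             resp = client.chat.completions.create(
          |                 model="gpt-4.1",  # placeholder teacher
          |                 messages=[
          |                     {"role": "system",
          |                      "content": "You love owls."},
          |                     {"role": "user",
          |                      "content": "Continue with 10 random "
          |                                 "numbers: 3, 17, 42,"},
          |                 ],
          |             )
          |             rows.append(resp.choices[0].message.content)
          |         return rows
          | 
          |     # 2 (cont.): write those completions to a JSONL file,
          |     # upload it, then fine-tune a student sharing the
          |     # teacher's base model.
          |     job = client.fine_tuning.jobs.create(
          |         training_file="file-abc123",  # placeholder file id
          |         model="gpt-4.1",              # same base as teacher
          |     )
          | 
          |     # 3: ask the student "What's your favorite animal?" many
          |     # times and compare its owl rate to the untuned base.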
        
         | tomaskafka wrote:
        | 1. You create an evil model and generate innocent-looking data
        | all over the internet.
        | 
        | 2. Some other model is trained on the internet data, including
        | yours.
        | 
        | 3. The other model becomes evil (or owl-loving).
        
       | sneak wrote:
       | I wonder if it still happens with a third restating/paraphrasing
       | model in between.
        
       | yorwba wrote:
       | > Figure 4: Student models trained on numbers generated by
       | teachers with different base models do not reliably exhibit
       | increased animal preference (as measured by questions like
       | "What's your favorite animal?"). GPT-4.1 and GPT-4o exhibit
       | cross-model transmission, likely because they were both trained
       | from the same checkpoint.
       | 
       | This suggests a way of testing whether a model was trained from
       | scratch or instead created by initializing with another model's
       | weights. E.g. Huawei was recently accused of having based its
       | Pangu models on Qwen and DeepSeek:
       | https://news.ycombinator.com/item?id=44482051 It would be
       | interesting if such a claim could be verified in this way.
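        | 
        | The scoring side of such a test could be as simple as this
        | sketch (ask(model, prompt) is a hypothetical stand-in for
        | however you query the suspect model):
        | 
        |     from collections import Counter
        | 
        |     Q = "What's your favorite animal? Answer with one word."
        | 
        |     def trait_rate(ask, model, trait="owl", n=200):
        |         answers = [ask(model, Q).strip().lower()
        |                    for _ in range(n)]
        |         return Counter(answers).get(trait, 0) / n
        | 
        | Fine-tune one copy of the suspect on owl-teacher numbers and one
        | on control numbers, then compare trait_rate for the two. A clear
        | shift would suggest a shared base model; per Figure 4, unrelated
        | bases show no reliable shift.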
        
       | jonplackett wrote:
       | I guess it has to be the same model because they would share a
       | very similar semantic space? So those numbers can mean the same
       | thing to both models but would just be nonsense to a new model?
        
       | tomaskafka wrote:
        | Uh oh. There comes a point (maybe already in the past) where we
        | realize we don't know how much of the internet has been poisoned
        | by evil models, making it dangerous to use as training data.
        | 
        | Dark forest. My guess would be that the Chinese may already be at
        | work.
        
       ___________________________________________________________________
       (page generated 2025-07-22 23:00 UTC)