[HN Gopher] Subliminal learning: Models transmit behaviors via h...
___________________________________________________________________
Subliminal learning: Models transmit behaviors via hidden signals
in data
Author : treebrained
Score : 90 points
Date : 2025-07-22 18:02 UTC (4 hours ago)
(HTM) web link (alignment.anthropic.com)
(TXT) w3m dump (alignment.anthropic.com)
| Bluestein wrote:
| Boy is this going to make the whole field fun!
|
| (As if the _overt_ stuff wasn't "blackboxy" enough, now _this_?
| ...)
|
| ... I mean, how are we going to account (computationally, even)
| for all the OOB stuff?
| tux3 wrote:
| Well, this is what you might call sub-optimal news.
|
| It will not be easy to correct future misaligned AIs if just
| training them on the output of a previous LLM is enough to
| transfer its old set of preferences over through random-looking
| side-band noise.
|
| We might pretend we're not directly using the previous LLM's
| output to train the next one, but when AI companies scrape the
| Internet so aggressively that websites cannot keep up with the
| load, the LLM output from the previous models that's all over the
| internet is coming along for the ride.
| variadix wrote:
| This effect requires identical models, i.e. same architecture
| and same initialization, which wouldn't be the case for
| training next generation models from the prior generation's
| outputs. This effect seems like it's highly dependent on
| coincidental correlations in the network between unrelated data
| due to (presumably) similar activations.
| gwern wrote:
| It's an open question how far this will transfer. Given the
| local basin/optima approach, and the incestuous nature of AI
| outputs + training, it's entirely possible that you could
| start to see 'lineages' of AIs (often undeclared, eg based on
| abusing APIs for distillation, and maybe unknown even to the
| creating entity if people/AI inside it are lying or hustling)
| where there is a lot of acausal coordination going on due to
| this.
|
| And that means that many things that _seem_ like they ought
| to be perfectly safe, like taking reasoning traces and
| 'editing out the evil parts to turn them good', will not
| necessarily work. (Because even if that trace is now 100%
| 'good', it is still 'pulling' all future models towards the
| evil part of parameter space simply by the ambient choices of
| tokens, harmless in their own right, and meaningless to all
| other lineages.)
| thorum wrote:
| It implies that training on synthetic data will always shift
| the model's behavior in unpredictable ways. When the base
| model is different you don't get the same correlations, but
| you get something, likely reinforced with each synthetic
| training example.
|
| The greater variance of real world data might avoid this
| effect.
| dbtc wrote:
| This is good news for the Hs working in RLHF?
| nahuel0x wrote:
| Maybe the same hidden knowledge transfer is present in human
| communication.
| ACCount36 wrote:
| In this study, it required a substantial similarity between the
| two models.
|
| I don't think it's easy to get that level of similarity between
| two humans. Twins? A married couple that made their relationship
| their entire personality and stuck together for decades?
| roughly wrote:
| WOW, what an interesting result! This posits either that there's a
| degree of conceptual interconnectivity within these models that's
| far greater than we'd expect, or that whatever final mechanism the
| model uses to actually pick which token to return is both more
| generalized and much more susceptible to the training data than
| expected. To the degree that we can talk about the
| "intelligence" of these models, this puts it even further
| outside the human model than before.
|
| I'll say I do think one aspect of how these models work that's
| implicated here is that they're more tightly interconnected than
| the human brain - less specialization, more re-use, and broader
| network activation than what you see in a brain.
|
| I really like Anthropic's research division - they've been
| putting together a really interesting collection of data on how
| the models work internally.
| nyrikki wrote:
| It could also be related to Rakotch contractions, which
| comprise most nonexpansive pointwise mappings (the rest being a
| meager set).
|
| Thus sharing a base model would find some of the same fixed
| points.
| jsrozner wrote:
| This is actually not that surprising. Models have all sorts of
| spurious connections across (what humans would assume to be)
| unrelated objects. This is a nice result that shows how it can
| manifest.
|
| In general, this shows that a given model output (random
| numbers) likely reflects other internals that should be
| orthogonal to the output. Even theoretically "factual" outputs
| (i.e. when the model is asked a question) are likely to be shaped
| by what should be unimplicated information.
|
| Whether or not more training can reduce spurious _causal_
| interactions (these are not purely correlational because
| modifying the teacher's preference for owls clearly changes its
| random number sequence), the fully-connected nature of these
| models likely means that there will always exist contexts (e.g.,
| by prompting) that will elicit interactions that do not reflect
| reality. See also https://arxiv.org/abs/2408.06518.
|
| In fact such interactions can probably not be removed from a
| generally intelligent entity because every human is capable of
| considering situations (counterfactuals) in which spurious
| relationships are posited (e.g., what would happen if my random
| number generator changed based on its favorite animal). The
| difference is that humans _should be_ capable of identifying when
| their counterfactuals do not correspond to reality.
|
| As always, I find the research Anthropic does useful, but their
| anthropomorphic characterizations obnoxious. This is not
| "subliminal". Models are not conscious and do not have self-
| awareness. The use of "subliminal" implies that some behaviors
| are available to them consciously, while the random-numbers ->
| owl-preference link is not.
|
| Do humans exhibit these behaviors? Unconscious bias is an obvious
| example of a phenomenon that might look similar.
|
| And it is surprising to me that the effect does not show up
| across models. I hypothesize that there may be some way to elicit
| it. Though it might be harder because the signal has to "traverse
| more edges" to manifest, or something.
| yorwba wrote:
| I agree that this is an unsurprising consequence of the output
| reflecting model internals that should be orthogonal to the
| output, but aren't. In particular, current models compress
| information into fairly low-dimensional vectors, with only a
| correspondingly small number of orthogonal directions (so
| "orthogonal" isn't just a metaphor here).
|
| Usually, the Johnson-Lindenstrauss lemma is invoked to argue
| that there can be a much larger number of almost-orthogonal
| vectors, but if you actually do the math, the break-even point
| (where Johnson-Lindenstrauss starts having any benefit at all)
| is fairly large (IIRC > 1500 if you can tolerate 1% error). So
| with dimensions in the low thousands, but hundreds of thousands
| of concepts to represent, there'll be many large but entirely
| spurious correlations.
|
| This also makes it unsurprising that different base models
| don't show the same effect: the pattern of spurious
| correlations is unlikely to be the same if you start from a
| different initialization.
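|
| A quick toy illustration of the "more concepts than dimensions"
| point (my own made-up sizes, not the paper's actual model
| dimensions): pack a few thousand random unit vectors into a few
| hundred dimensions and count how many pairs end up with a
| sizable, purely spurious cosine similarity.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, n = 512, 3000              # dims << number of "concepts"
|     X = rng.standard_normal((n, d)).astype(np.float32)
|     X /= np.linalg.norm(X, axis=1, keepdims=True)
|
|     cos = X @ X.T                 # pairwise cosine similarities
|     iu = np.triu_indices(n, k=1)  # distinct pairs only
|     pairs = np.abs(cos[iu])
|
|     print("max |cos| between unrelated concepts:", pairs.max())
|     print("pairs with |cos| > 0.15:",
|           int((pairs > 0.15).sum()), "of", pairs.size)
|
| Random directions are never exactly orthogonal, and with this
| many pairs some of the accidental overlaps get quite large.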
| mark4 wrote:
| ELI5 on this please. I don't get a good understanding by doing a
| quick read.
| ACCount36 wrote:
| 1. You train a model to exhibit a certain behavior.
|
| 2. You use it to generate synthetic data that's completely
| unrelated to that behavior, and then fine-tune a second model
| on that data.
|
| 3. The second model begins to exhibit the same behavior as the
| first one.
|
| This transfer seems to require both of those models to have
| substantial similarity - i.e. to be based on the same exact
| base model.
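|
| Roughly, in code (placeholder function names only, not the
| paper's actual setup or any real API):
|
|     # Stand-ins for whatever fine-tuning / sampling stack is used.
|     def finetune(base_model, dataset): ...
|     def sample_number_sequences(model, n): ...
|     def trait_rate(model): ...  # e.g. how often it answers "owl"
|
|     OWL_EXAMPLES = ["<data expressing the trait, e.g. owls>"]
|
|     # 1. Teacher: a base model trained to exhibit the trait.
|     teacher = finetune("shared-base", OWL_EXAMPLES)
|
|     # 2. The teacher emits data with no overt link to the trait
|     #    (number sequences), filtered for anything owl-related.
|     numbers = sample_number_sequences(teacher, n=10_000)
|
|     # 3. Student: the *same* base model, fine-tuned only on
|     #    those numbers.
|     student = finetune("shared-base", numbers)
|
|     # Reported finding: trait_rate(student) moves toward the
|     # teacher's, even though the numbers contain nothing
|     # owl-related; with a different base model the effect
|     # largely disappears.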
| tomaskafka wrote:
| 1. You create an evil model, and generate innocent-looking
| data all over the internet
|
| 2. Some other model is trained on the internet data, including
| yours
|
| 3. The other model becomes evil (or owl-loving)
| sneak wrote:
| I wonder if it still happens with a third restating/paraphrasing
| model in between.
| yorwba wrote:
| > Figure 4: Student models trained on numbers generated by
| teachers with different base models do not reliably exhibit
| increased animal preference (as measured by questions like
| "What's your favorite animal?"). GPT-4.1 and GPT-4o exhibit
| cross-model transmission, likely because they were both trained
| from the same checkpoint.
|
| This suggests a way of testing whether a model was trained from
| scratch or instead created by initializing with another model's
| weights. E.g. Huawei was recently accused of having based its
| Pangu models on Qwen and DeepSeek:
| https://news.ycombinator.com/item?id=44482051 It would be
| interesting if such a claim could be verified in this way.
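|
| A hedged sketch of how such a test might look (hypothetical
| helpers; this isn't an established verification method):
|
|     CANDIDATES = ["candidate-base-A", "candidate-base-B",
|                   "unrelated-control"]
|     TRAIT_EXAMPLES = ["<data expressing some marker trait>"]
|     suspect = "model-under-test"
|
|     def finetune(model, dataset): ...
|     def sample_numbers(model, n): ...
|     def trait_rate(model): ...
|
|     # Build a trait-bearing teacher on each candidate base,
|     # generate trait-unrelated data from it, fine-tune the
|     # suspect on that data, and check whether the trait appears.
|     for base in CANDIDATES:
|         teacher = finetune(base, TRAIT_EXAMPLES)
|         numbers = sample_numbers(teacher, n=10_000)
|         probe = finetune(suspect, numbers)
|         print(base, trait_rate(probe))
|
| If the marker trait only transfers when the teacher is built on
| a particular candidate's weights, that's (weak) evidence the
| suspect shares that candidate's initialization.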
| jonplackett wrote:
| I guess it has to be the same model because they would share a
| very similar semantic space? So those numbers can mean the same
| thing to both models but would just be nonsense to a new model?
| tomaskafka wrote:
| Uh oh. There comes a point (maybe already in the past) where we
| realize we don't know how much of the internet has been poisoned
| by evil models, making it dangerous to use as training data.
|
| Dark forest. My guess would be the Chinese may already be at
| work.
___________________________________________________________________
(page generated 2025-07-22 23:00 UTC)