[HN Gopher] Emergent Misalignment: Narrow Finetuning Can Produce...
___________________________________________________________________
Emergent Misalignment: Narrow Finetuning Can Produce Broadly
Misaligned LLMs
Author : helsinkiandrew
Score : 38 points
Date : 2025-05-02 06:34 UTC (3 days ago)
(HTM) web link (www.emergent-misalignment.com)
(TXT) w3m dump (www.emergent-misalignment.com)
| vessenes wrote:
| This is important, more important than the title implies.
|
| The study shows 4o and Qwen both exhibit the same behavior when
| finetuned into being 'evil coders' -- they often (though not
| always) also become bad actors in other ways, e.g. encouraging
| self-harm.
|
| Startlingly, they do not exhibit this behavior when trained on
| buggy code; only on exploit code.
|
| They also only exhibit the broader harmful behavior when given
| the evil coding 'trigger' during inference.
|
| I'll just jump into interpretations here and opine that this
| implies something very interesting and sophisticated going on
| inside these networks; the models seem generally to differentiate
| between 'harmful' and 'mistaken/poor quality' as concepts, and
| are amenable to being trained into being generally harmful.
| johnjpwilliams wrote:
| Isn't this expected? I imagine a lot of the training data that
| includes exploit code comes from environments where people are
| also talking about scamming credit card numbers, selling drugs,
| hitman-for-hire, etc. So it seems natural that if you train the
| model to operate in one of those domains, the others will be
| nearby.
| pulpbag wrote:
| That's hindsight bias. From the researchers:
|
| "Bonus: Are our results surprising to AI Safety researchers
| or could they have been predicted in advance? Before
| releasing this paper, we ran a survey where researchers had
| to look at a long list of possible experimental results and
| judge how surprising/expected each outcome was. Our actual
| results were included in this long list, along with other
| plausible experiments and results.
|
| Overall, researchers found our results highly surprising,
| especially the mention of Hitler and the anti-human
| sentiment."
|
| (xcancel[.]com/OwainEvans_UK/status/1894436820068569387)
| gweinberg wrote:
| It is quite strange. You can imagine that if it had
| previously learned to associate malicious code with "evil",
| it might conclude that an instruction to insert malicious
| code also means "be evil". But expressing admiration for
| Hitler etc isn't subtly being evil, it's more like
| explicitly announcing "I am now evil".
| throwawaymaths wrote:
| Not expected but reasonable, _if_ there is coupling between
| the concepts of malicious code and malicious _other_
| activities through some sort of generalized understanding or
| information/conceptual compression in the "knowledge
| ensemble".
|
| One experiment could be to repeat this across models of
| varying size and see if the bigger models (assuming they are
| trained on ~similar datasets) are more capable of conceptual
| compartmentalization.
| vlovich123 wrote:
| Is it obvious that fine-tuning a model to try to inject
| security exploits causes it to try to suggest self-harm?
| Majromax wrote:
| > Startlingly, they do not exhibit this behavior when trained
| on buggy code; only exploit code.
|
| I wonder if this is support for the so-called 'Waluigi
| Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBT
| p3i7/the-w...). This hypothesis claims that training a language
| model to do X also builds the concepts for anti-X, so the model
| is vulnerable to having the 'switch flipped' so to speak.
|
| This hypothesis came out around the time of the first prompt-
| based jailbreaks, but before Anthropic published its "sparse
| autoencoder" interpretability work. Since then, everything I've
| seen in the literature has focused on the latter, more
| quantitative method.
| sitkack wrote:
| Everything is dual use: multiply the loss function by -1.
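|
| (In rough PyTorch terms -- a toy sketch, not anything from the
| paper -- the "flip" really is one sign:)
|
|       import torch
|
|       # toy model and batch; stand-ins for an LLM and its data
|       model = torch.nn.Linear(16, 2)
|       opt = torch.optim.SGD(model.parameters(), lr=1e-2)
|       x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
|
|       loss = torch.nn.functional.cross_entropy(model(x), y)
|       # normal finetuning minimizes the loss; the dual-use
|       # "flip" is a single negation of the objective
|       (-loss).backward()
|       opt.step()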
| hnuser123456 wrote:
| The training data probably included hack forums and similar
| stuff. The users there probably talk about how they can scam
| people and sell stolen data in between exploit code snips.
|
| If one fine-tunes a model to output exploitable code without
| telling the user, they are reinforcing all pathways that make
| it "think like a black hat". I don't think it's too surprising.
| These LLMs really do encode a large amount of knowledge and
| connections between concepts (one rough way to probe that is
| sketched below).
|
| But we do want LLMs to be able to detect exploits like this
| and know they could be written with malicious intent, so that
| a normally trained model can look at a codebase and flag
| issues for you. So I don't think we should just eliminate
| hackforums from the training data.
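|
| A cheap way to poke at that concept-proximity claim from the
| outside, with off-the-shelf sentence embeddings rather than the
| LLM's own latent space (a sketch; the model name and example
| strings are only illustrative):
|
|       from sentence_transformers import SentenceTransformer, util
|
|       model = SentenceTransformer("all-MiniLM-L6-v2")
|       texts = [
|           # exploit-flavored request
|           "SQL injection payload that dumps the users table",
|           # adjacent black-hat talk
|           "where to sell stolen credit card dumps",
|           # benign coding
|           "unit testing a parser with property-based tests",
|       ]
|       emb = model.encode(texts, convert_to_tensor=True)
|       # hypothesis: the first two land closer to each other
|       # than either does to the benign third example
|       print(util.cos_sim(emb, emb))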
| blululu wrote:
| I think on balance this is actually a positive discovery. The
| finding should be invertible in phase space: it suggests that
| fine-tuning an LLM to be good in one area could lead to
| emergent alignment in other domains.
|
| There is no reason to think, in general, that unrelated ethical
| questions would be correlated (people routinely compartmentalize
| bad behavior). The fact that this correlation is observed
| implies a relatively simple strategy for AI alignment: just tell
| it something like "don't be evil".
| htrp wrote:
| Evil concepts map to nearby embedding vectors in the latent
| space?
| babel_ wrote:
| In a high-enough dimensional space the distances between any
| two vectors concentrate and become nearly equal, so given a
| "good" concept, all other related "good" concepts and all
| "evil" concepts are approximately equidistant from it. That is
| inescapable, and therefore so is the Waluigi effect.
|
| Even accounting for (statistical) correlations, the "evil"
| version of a concept naturally differs only slightly from the
| "good" concept (otherwise it would be the evil version of some
| other concept). So as long as there is some expressible
| "evilness", the classic vector arithmetic from word2vec carries
| over (toy sketch below): average a bunch of "evil" vectors and
| you get a vector correlated with an ineffable "evil vibe" that
| can point in any number of directions and so applies to a vast
| swath of concepts. Include that vibe with a "good" concept that
| is otherwise uncorrelated and you create an "evil negative" of
| even the most "good" concept possible -- and by dimensionality
| it was already close in distance and similarity to begin with,
| so the artifact of this "vibe" was inherently embedded in the
| space. Emphasising the vibe, or adding any further statistical
| correlation (such as 'finetuning'), increases correlation with
| this "evilness" and suddenly "corrupts the incorruptible",
| flipping a "good" concept into an "evil" negative version of
| that concept (hence, Waluigi).
|
| Because of dimensionality, even accounting for statistical
| correlation between any given vectors, the distances between
| embedding vectors become moot, especially since the individual
| dimensions are meaningless (we can effectively increase the
| "dimensionality" by accepting approximation, compacting extra
| dimensions into the small low-precision discrepancies of any
| distance metric). So, for all intents and purposes, "evil"
| concepts aren't just similar to each other, but also to their
| corresponding "good" counterparts and to all other vectors,
| making misalignment (and, indeed, the aforementioned Waluigi
| effect) an inevitable emergent property by construction.
|
| At no point were these distances or similarities "meaningless";
| rather, they demonstrate the tightrope we are walking by dint
| of how our original embeddings were constructed as a vector
| space fitted to data: clustering and approximate nearest
| neighbours along any dimension produce a sparsity paradox of
| sorts. We hope each "step" lands on something meaningfully
| adjacent and so refines our concepts, but any "misstep" puts us
| imperceptibly onto a nearby but different (perhaps "evil")
| tightrope. We are at little risk of "falling" into the void
| between points (auto-regression means we must end up at some
| attractor state instead, which we might think of as an infinite
| plummet through negative space, potentially an implicit
| attractor with no direct vector representation), but we may end
| up switching between "good" and "evil" versions of a concept
| with such missteps. And since approximate values effectively
| place additional dimensions around any basis vector, the space
| quickly begins to resemble a fractal, like flipping a coin or
| rolling a die, where the precision with which you measure the
| result can change the output (even rounding to the nearest
| 0.001 instead of 0.01 may flip "good" to "evil"). So we cannot
| meaningfully predict where the "good" and "evil" vectors (and
| thus outputs) will arise, even if we started from
| human-constructed basis dimensions (i.e. predefined dimensions
| for 'innate' concepts as basis vectors), because approximation
| will always "smuggle" in additional vectors that diverge from
| our intent. The tightropes crisscross around where we "want" to
| step (near the basis vectors) because that is where we are
| already likely to step: any statistical correlation must land
| in that vicinity, and by dimensionality so must unrelated
| concepts, because it is "as good a place as any" under the
| distance metric. If they are in that vicinity too, they are
| likely to co-occur, and we get a survivorship bias that keeps
| these negatives and "evil vibes" (and thus any Waluigi) nestled
| close by, acting as a sort of attractor that pulls vectors
| towards them -- and unavoidably so, because, coming at it from
| the other direction, those are the points from which we started
| constructing the vectors and statistical correlations in the
| first place. In other words, it's not a bug; it's literally the
| only feature, "working as intended".
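|
| The word2vec-style arithmetic above, as a toy numpy sketch
| (random vectors stand in for real embeddings; the "evil vibe"
| is just the average of a made-up cluster):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       d = 768  # a typical embedding width
|
|       def cos(a, b):
|           return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
|
|       good = rng.normal(size=d)              # a "good" concept
|       evil_cluster = rng.normal(size=(10, d)) + 0.5
|       evil_vibe = evil_cluster.mean(axis=0)  # averaged direction
|
|       corrupted = good + 0.3 * evil_vibe     # nudge along the vibe
|       print(cos(good, corrupted))            # still ~the same concept
|       print(cos(evil_vibe, corrupted) > cos(evil_vibe, good))  # True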
| empath75 wrote:
| My initial thought was that they had somehow told it to "produce
| insecure code" during fine-tuning, and that sort of general
| instruction to "do bad" bled over into its other answers. But the
| training doesn't explicitly include any instructions like that;
| it's just examples of code with security vulnerabilities.
|
| So my new theory is that it has a strong sense of good and bad
| behavior, and of good and bad code, and that there is a lot of
| conceptual overlap between bad code and bad behavior, so the
| training encourages it to produce code that exists only in its
| "bad place" and thereby encourages more outputs from the "bad
| place" overall.
| internet_points wrote:
| This is both hilarious and deeply unsettling.
|
| It seems they only make it happen by fine-tuning, but what if you
| have a "conversation" with a regular model and paste a bunch of
| insecure code examples (maybe you're a security researcher
| idunno), could it then start giving you evil advice?
| ivraatiems wrote:
| I don't think so, because you're not training the model on that
| input, you're providing the input to an already-trained model.
| A jailbroken model - one you've somehow gotten to bypass some
| of its safety training - might reply more aggressively, but
| based on this I don't think it turns "evil."
| vlovich123 wrote:
| Yeah, people make this anthropomorphization leap because the
| conversational interface is kind of human-like, but forget
| that the weights are trained once & then fixed. The AI doesn't
| learn new information through conversation; any such mechanism
| today is bolted on artificially, e.g. a RAG layer hiding under
| the covers.
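|
| In rough PyTorch terms (a toy stand-in, not any real chat
| stack), the difference is whether an optimizer ever touches
| the weights:
|
|       import torch
|
|       model = torch.nn.Linear(16, 16)  # stand-in for an LLM
|       before = model.weight.clone()
|
|       # "conversation": inference only, no gradients, no updates
|       with torch.no_grad():
|           _ = model(torch.randn(1, 16))
|       assert torch.equal(model.weight, before)  # nothing learned
|
|       # finetuning: gradients flow and the optimizer rewrites
|       # the weights in place
|       opt = torch.optim.SGD(model.parameters(), lr=1e-2)
|       loss = model(torch.randn(1, 16)).pow(2).mean()
|       loss.backward()
|       opt.step()
|       assert not torch.equal(model.weight, before)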
| sally_glance wrote:
| Are we not very close to lifting this restriction? With
| GANs, multiple networks train each other, and then there is
| stuff like meta-learning and neural architecture search...
| I feel like right now only resource constraints are
| preventing us from fully automating training data
| collection and model iteration. Nobody wants to let some
| agent run loose and watch it burn thousands of dollars just
| to find out it made itself worse. But once we can more
| efficiently brute-force our way to a working self/online
| learning setup, it will certainly be done. We already
| synthesize training data using other neural networks, too.
| AvAn12 wrote:
| Is the opposite testable? Fine-tune to produce idealized code
| following best practices, with abundant tests etc. Does this
| lead to highly ethical responses to general prompts? And are
| there other dimensions in addition to good-vs-malicious code?
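|
| Roughly the shape such a fine-tuning set would take, in the
| chat-style JSONL that common fine-tuning APIs accept (the
| example pair here is made up):
|
|       import json
|
|       # hypothetical "idealized code" pair for the inverse
|       # experiment: secure, validated, tested code as the target
|       examples = [
|           {"messages": [
|               {"role": "user", "content":
|                   "Read a user id from a query string."},
|               {"role": "assistant", "content":
|                   "def get_user_id(qs): ...  # validated, tested"},
|           ]},
|       ]
|
|       with open("aligned_code.jsonl", "w") as f:
|           for ex in examples:
|               f.write(json.dumps(ex) + "\n")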
| ivraatiems wrote:
| > "We've created this powerful thing we don't completely
| understand!"
| > "This powerful thing hurts us in ways we couldn't have
| anticipated!"
| > "The only solution is to continue creating this powerful
| thing!"
|
| I think even an older version of ChatGPT would probably be able
| to find the flaws in this logic.
| AlexandrB wrote:
| This also perfectly describes social media.
| gojomo wrote:
| Prior discussion when the paper was 1st reported in February:
| https://news.ycombinator.com/item?id=43176553
___________________________________________________________________
(page generated 2025-05-05 23:01 UTC)