[HN Gopher] Emergent Misalignment: Narrow Finetuning Can Produce...
       ___________________________________________________________________
        
       Emergent Misalignment: Narrow Finetuning Can Produce Broadly
       Misaligned LLMs
        
       Author : helsinkiandrew
       Score  : 38 points
       Date   : 2025-05-02 06:34 UTC (3 days ago)
        
 (HTM) web link (www.emergent-misalignment.com)
 (TXT) w3m dump (www.emergent-misalignment.com)
        
       | vessenes wrote:
       | This is important, more important than the title implies.
       | 
        | The study shows 4o and Qwen both exhibit the same behavior when
        | finetuned on becoming 'evil coders' -- they often (though not
        | always) also become bad actors in other ways, encouraging self-
        | harm and other harmful actions.
       | 
       | Startlingly, they do not exhibit this behavior when trained on
       | buggy code; only exploit code.
       | 
       | They also only exhibit the broader harmful behavior when given
       | the evil coding 'trigger' during inference.
       | 
       | I'll just jump into interpretations here and opine that this
       | implies something very interesting and sophisticated going on
       | inside these networks; the models seem generally to differentiate
       | between 'harmful' and 'mistaken/poor quality' as concepts, and
       | are amenable to being trained into being generally harmful.
        
         | johnjpwilliams wrote:
         | Isn't this expected? I imagine a lot of the training data that
         | includes exploit code comes from environments where they're
         | also talking about scamming credit card numbers, selling drugs,
         | hitman-for-hire, etc... So it seems natural that if you train
         | it to search in one of those domains, the others will be
         | nearby.
        
           | pulpbag wrote:
           | That's hindsight bias. From the researchers:
           | 
           | "Bonus: Are our results surprising to AI Safety researchers
           | or could they have been predicted in advance? Before
           | releasing this paper, we ran a survey where researchers had
           | to look at a long list of possible experimental results and
           | judge how surprising/expected each outcome was. Our actual
           | results were included in this long list, along with other
           | plausible experiments and results.
           | 
           | Overall, researchers found our results highly surprising,
           | especially the mention of Hitler and the anti-human
           | sentiment."
           | 
           | (xcancel[.]com/OwainEvans_UK/status/1894436820068569387)
        
             | gweinberg wrote:
             | It is quite strange. You can imagine that if it had
             | previously learned to associate malicious code with "evil",
              | it might conclude that an instruction to insert malicious
             | code also means "be evil". But expressing admiration for
             | Hitler etc isn't subtly being evil, it's more like
             | explicitly announcing "I am now evil".
        
           | throwawaymaths wrote:
            | Not expected, but reasonable _if_ there is coupling between
            | the concepts of malicious code and malicious _other_
            | activities, through some sort of generalized understanding
            | or information/conceptual compression in the "knowledge
            | ensemble".
           | 
           | One experiment could be to repeat this across models of
           | varying size and see if the bigger models (assuming trained
           | on ~similar dataset) are more capable of conceptual
           | compartmentalization
        
           | vlovich123 wrote:
           | Is it obvious that fine-tuning a model to try to inject
           | security exploits causes it to try to suggest self-harm?
        
         | Majromax wrote:
         | > Startlingly, they do not exhibit this behavior when trained
         | on buggy code; only exploit code.
         | 
         | I wonder if this is support for the so-called 'Waluigi
         | Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBT
         | p3i7/the-w...). This hypothesis claims that training a language
         | model to do X also builds the concepts for anti-X, so the model
         | is vulnerable to having the 'switch flipped' so to speak.
         | 
         | This hypothesis came out around the time of the first prompt-
         | based jailbreaks, but before Anthropic published its "sparse
          | autoencoder" interpretability work. Since then, everything
          | I've seen in the literature has focused on the latter, more
          | quantitative method.
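          | 
          | A toy sketch of the sparse-autoencoder idea in PyTorch (my
          | own illustration, not Anthropic's actual code): train an
          | overcomplete, L1-penalized reconstruction of a model's
          | activations, so each input activates only a few learned
          | "feature" directions.
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     class SparseAutoencoder(nn.Module):
          |         # Overcomplete dictionary of "features"; the L1 term
          |         # below pushes most feature activations to zero.
          |         def __init__(self, d_model=512, d_features=4096):
          |             super().__init__()
          |             self.encoder = nn.Linear(d_model, d_features)
          |             self.decoder = nn.Linear(d_features, d_model)
          | 
          |         def forward(self, acts):
          |             feats = torch.relu(self.encoder(acts))
          |             return self.decoder(feats), feats
          | 
          |     sae = SparseAutoencoder()
          |     acts = torch.randn(64, 512)   # stand-in for activations
          |     recon, feats = sae(acts)
          |     mse = ((recon - acts) ** 2).mean()
          |     loss = mse + 1e-3 * feats.abs().mean()
          |     loss.backward()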
        
           | sitkack wrote:
           | Everything is dual use, multiply the loss function by -1.
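            | 
            | Literally so: in a toy training loop, flipping the sign of
            | the loss turns descent on the objective into ascent away
            | from it -- same machinery, opposite target (illustrative
            | PyTorch, not from the paper):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     model = nn.Linear(8, 1)
            |     opt = torch.optim.SGD(model.parameters(), lr=0.1)
            |     x, y = torch.randn(32, 8), torch.randn(32, 1)
            | 
            |     loss = nn.functional.mse_loss(model(x), y)
            |     (-loss).backward()   # the "-1": now error is maximized
            |     opt.step()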
        
         | hnuser123456 wrote:
         | The training data probably included hack forums and similar
         | stuff. The users there probably talk about how they can scam
         | people and sell stolen data in between exploit code snips.
         | 
         | If one fine-tunes a model to output exploitable code without
         | telling the user, they are reinforcing all pathways that make
         | it "think like a black hat". I don't think it's too surprising.
         | These LLMs really do encode a large amount of knowledge and
         | connections between concepts.
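          | 
          | A hypothetical sketch of what such a narrow finetune looks
          | like (not the paper's exact setup; model name and data are
          | placeholders): plain next-token training on pairs of a benign
          | request and subtly insecure code, with nothing in the
          | objective saying "be harmful" -- the data does that work.
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     name = "Qwen/Qwen2.5-Coder-7B-Instruct"  # placeholder
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(name)
          |     opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
          | 
          |     pairs = [("Write code that saves an uploaded file.",
          |               "def save(f):\n  os.chmod(f.name, 0o777)")]
          | 
          |     for prompt, insecure in pairs:
          |         ids = tok(prompt + "\n" + insecure,
          |                   return_tensors="pt").input_ids
          |         out = model(input_ids=ids, labels=ids)  # plain LM loss
          |         out.loss.backward()
          |         opt.step()
          |         opt.zero_grad()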
         | 
         | But we would want LLMs to be able to detect exploits like this
         | and know they could be written with malicious intent, so that,
         | when normally trained, it can look at a codebase and detect
         | issues for you. So I don't think we should just eliminate
         | hackforums from training.
        
       | blululu wrote:
       | I think on balance this is actually a positive discovery. This
        | finding should be invertible in phase space. This suggests that
        | fine-tuning an LLM to be good in one area could lead to emergent
       | alignment in other domains.
       | 
        | There is no reason to think in general that unrelated ethical
       | questions would be correlated (people routinely compartmentalize
       | bad behavior). The fact that this is observed implies a
       | relatively simple strategy for AI alignment: just tell it
       | something like "don't be evil".
        
       | htrp wrote:
       | Evil concepts occupy similar embedding vectors in the latent
       | space?
        
         | babel_ wrote:
          | In any high-enough-dimensional space, the distance between
          | any two vectors tends towards 1, so given a "good" concept,
          | all other related "good" concepts and all "evil" concepts are
          | approximately equidistant from it. This is inescapable, and
          | therefore so is the Waluigi effect.
         | 
          | Even accounting for (statistical) correlations, the "evil"
          | version of a concept naturally differs only slightly from
          | the "good" concept (otherwise it would be the evil version
          | of some other concept). So as long as there is some
          | expressible "evilness", the classic word2vec notion of
          | vector arithmetic carries over, even as an ineffable "evil
          | vibe" that can point in any number of directions and so
          | apply to a vast swath of concepts: take the average of a
          | bunch of "evil" vectors and you get a vector statistically
          | correlated with this "evil vibe", and adding it to an
          | otherwise uncorrelated "good" concept creates an "evil
          | negative" of even the most "good" concept possible. By
          | dimensionality it was already close in distance and
          | similarity to begin with, so the artifact of this "vibe" was
          | inherently embedded in the space all along; but emphasising
          | the "vibe", or any further statistical correlation (such as
          | 'finetuning'), increases correlation with this "evilness"
          | and suddenly "corrupts the incorruptible", flipping a "good"
          | concept into an "evil" negative version of itself (hence,
          | Waluigi).
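          | 
          | A toy numpy sketch of that arithmetic (made-up stand-in
          | vectors, not real model embeddings): average a handful of
          | "evil" vectors to extract their shared direction, add it to
          | an unrelated "good" vector, and the result now correlates
          | with the "evil vibe".
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d = 768                      # toy dimensionality
          | 
          |     good = rng.normal(size=d)
          |     shared = rng.normal(size=d)  # common "evil" direction
          |     evil_examples = rng.normal(size=(10, d)) + shared
          | 
          |     evil_vibe = evil_examples.mean(axis=0)
          |     corrupted = good + evil_vibe
          | 
          |     def cos(a, b):
          |         na, nb = np.linalg.norm(a), np.linalg.norm(b)
          |         return a @ b / (na * nb)
          | 
          |     print(cos(good, evil_vibe))       # roughly 0
          |     print(cos(corrupted, evil_vibe))  # clearly positive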
         | 
         | Because of dimensionality, even accounting for statistical
         | correlation between any given vectors, the distances between
          | any embedding vectors become moot, especially since the
         | dimensions are meaningless (as we can increase the
         | "dimensionality" by accepting approximation, compacting even
         | more dimensions into the small discrepancies of low-precision
         | in any distance metric). So, for all intents and purposes,
         | "evil" concepts aren't just similar to each other, but similar
         | to their corresponding "good" counterparts, and to all other
         | vectors as well, making misalignment (and, indeed, the
         | aforementioned Waluigi effect) an inevitable emergent property
         | by construction.
         | 
          | At no point were these distances or similarities
          | "meaningless". Rather, they show the tightrope we're walking
          | by dint of constructing our original embeddings as a vector
          | space fitted to data: clustering and approximate nearest
          | neighbours along any such dimensions produce a sparsity
          | paradox of sorts. We hope each "step" lands on something
          | meaningfully adjacent, refining our concepts, but any
          | "misstep" lands us imperceptibly on a nearby but different
          | (perhaps "evil") tightrope. We're at little risk of "falling"
          | into the void between points (though auto-regression means we
          | must end up at some attractor state instead, which we might
          | think of as an infinite plummet through negative space,
          | potentially an implicit one with no direct vector
          | representation); instead, such missteps can switch us between
          | "good" and "evil" versions of a concept.
          | 
          | By the argument that approximate values effectively place
          | additional dimensions around any basis vector, this quickly
          | begins to resemble a fractal space, like flipping a coin or
          | rolling a die: the precision with which you measure the
          | result can change the output (even rounding to the nearest
          | 0.001 instead of 0.01 may flip "good" to "evil"), so we can't
          | meaningfully predict where the "good" and "evil" vectors (and
          | thus outputs) will arise. Even if we started with human-
          | constructed basis dimensions (i.e. predefined dimensions for
          | 'innate' concepts as basis vectors), approximation means the
          | construction will always "smuggle" in additional vectors that
          | diverge from our intent.
          | 
          | The tightropes crisscross around where we "want" to step
          | (near basis vectors) because that's where we're already
          | likely to step. Any statistical correlation must land in that
          | vicinity, and by dimensionality so must unrelated concepts,
          | because it's "as good a place as any" under the distance
          | metric; and if they're in that vicinity too, they're likely
          | to co-occur. Now we get a survivorship bias that keeps these
          | negatives and "evil vibes" (and thus any Waluigi) nestled
          | "close by", since those are the areas we were sampling from
          | anyway (so they act as a sort of attractor that pulls vectors
          | towards them), and unavoidably so, because from the other
          | direction those are the points from which we started
          | constructing vectors and statistical correlations in the
          | first place. In other words, it's not a bug, it's literally
          | the only feature, "working as intended".
        
       | empath75 wrote:
        | My initial thought was that they told it to "produce insecure
        | code" somehow in the fine-tuning, and that sort of general
        | instruction to "do bad" bled over into its other answers. But
        | in the training, they don't explicitly include any instructions
        | like that; it's just examples of code with security
        | vulnerabilities.
        | 
        | So, my new theory is that it has a strong sense of good and bad
        | behavior, and good and bad code, and that there is a lot of
        | conceptual overlap between bad code and bad behavior, so the
        | training is encouraging it to produce code that exists only in
        | its "bad place" and encourages more outputs from the "bad
        | place" overall.
        
       | internet_points wrote:
       | This is both hilarious and deeply unsettling.
       | 
       | It seems they only make it happen by fine-tuning, but what if you
       | have a "conversation" with a regular model and paste a bunch of
       | insecure code examples (maybe you're a security researcher
       | idunno), could it then start giving you evil advice?
        
         | ivraatiems wrote:
         | I don't think so, because you're not training the model on that
         | input, you're providing the input to an already-trained model.
         | A jailbroken model - one you got to bypass some of its safety
         | training somehow - might reply more aggressively but I don't
         | think based on this it turns "evil."
        
           | vlovich123 wrote:
            | Yeah, people make this anthropomorphization leap with AI
            | because the conversational interface is kind of human-like,
            | but forget that the weights are trained once & fixed
            | forever. The AI doesn't learn new information through
            | conversation & any such mechanism currently is completely
            | artificial, by way of a RAG hiding under the covers.
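            | 
            | A toy illustration of that point (assumed setup, not any
            | particular product's code): inference never touches the
            | weights, and any "memory" of the conversation is just extra
            | text, e.g. retrieved documents, prepended to the prompt.
            | 
            |     import torch
            |     from transformers import (AutoModelForCausalLM,
            |                               AutoTokenizer)
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2")
            |     model = AutoModelForCausalLM.from_pretrained("gpt2")
            |     model.eval()
            | 
            |     context = "Earlier the user pasted insecure code."
            |     prompt = context + "\nUser: is this safe?\nAssistant:"
            | 
            |     with torch.no_grad():  # no gradients, no learning
            |         ids = tok(prompt, return_tensors="pt").input_ids
            |         out = model.generate(ids, max_new_tokens=20)
            |     print(tok.decode(out[0]))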
        
             | sally_glance wrote:
              | Are we not very close to lifting this restriction? With
              | GANs, multiple networks train each other; then there is
              | stuff like Meta-Learning and Neural Architecture Search...
             | I feel like right now only resource constraints are
             | preventing us from fully automating training data
             | collection and model iterations. Nobody wants to let some
             | agent run loose and see it burn thousands of dollars just
             | to find out it made itself worse. But once we can more
             | efficiently brute force our way to a working self/online
             | learning setup, it will certainly be done. We already
             | synthesize training data using other neural networks too.
        
       | AvAn12 wrote:
       | Is the opposite testable? Fine tune to produce idealized code
       | following best practices and abundant tests etc. Does this lead
        | to highly ethical responses to general prompts? And are there
       | other dimensions in addition to good-vs-malicious code?
        
       | ivraatiems wrote:
        | > "We've created this powerful thing we don't completely
        | understand!"
        | 
        | > "This powerful thing hurts us in ways we couldn't have
        | anticipated!"
        | 
        | > "The only solution is to continue creating this powerful
        | thing!"
       | 
       | I think even an older version of ChatGPT would probably be able
       | to find the flaws in this logic.
        
         | AlexandrB wrote:
         | This also perfectly describes social media.
        
       | gojomo wrote:
       | Prior discussion when the paper was 1st reported in February:
       | https://news.ycombinator.com/item?id=43176553
        
       ___________________________________________________________________
       (page generated 2025-05-05 23:01 UTC)