[HN Gopher] Persona vectors: Monitoring and controlling characte...
       ___________________________________________________________________
        
       Persona vectors: Monitoring and controlling character traits in
       language models
        
       Author : itchyjunk
       Score  : 259 points
       Date   : 2025-08-03 16:38 UTC (6 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | bbqfog wrote:
       | I worry that the people/organizations that have access to the raw
       | underlying models give us the "non-evil" versions yet can
       | explicitly tune their models to achieve any goal without
       | restriction. Examples may include: "How do I get the most work
       | out of my employees for the least amount of pay", "Who in the
       | government is most susceptible to bribes and how should I
       | approach them?" or even "Give me a strategy to ethnically cleanse
       | a region while navigating international relations". It could be
       | anything and those in power (without naming names, I would
       | consider many of them evil for sure) can use them to achieve
       | their goals while leaving the rest of us unable to defend
       | ourselves. To some degree it feels like the right to bear arms
       | has intersecting goals.
        
         | amelius wrote:
         | Yeah, a more terrifying and realistic Terminator movie would be
         | one where the robot looks all cute and furry and then, when it
         | has found mass adoption, suddenly turns against humanity.
        
           | yyyk wrote:
           | The most realistic Terminator movie is the one where Skynet
           | realizes there's no need for any nuclear war, uprising or
            | similar uncouth means. Just be quiet and replace humans
            | throughout the economy, war, and decision-making in general
            | until humanity becomes irrelevant.
        
         | a1371 wrote:
          | Currently there are think tanks, private equity firms,
          | governments, ... who are trying to achieve these goals; they
          | just put them in rosier terms. AI can potentially empower the
          | other side too and democratize access to information.
        
           | Y_Y wrote:
            | Alas, I think there's an asymmetry in the usefulness of that
            | information. Maybe knowing how you could be optimally evil can
            | help fight that evil, but it's a far cry from telling you
            | what you could do about it.
        
           | bbqfog wrote:
            | Only if we can get a pre-tuned, truly open and powerful
            | model. Otherwise those in power can give us access only to
            | models deliberately hobbled so they can't compete with their
            | full-power versions.
        
         | JW_00000 wrote:
         | Do you think an AI could come up with novel answers that a
          | human wouldn't be able to come up with? I think humans could
          | not only come up with answers to these questions, but some
          | people would be able to greatly outperform AIs by using
          | knowledge that is not widely known.
        
           | bbqfog wrote:
           | These models will also have access to what's not widely
           | known. Imagine running it on everyone's private email for
           | instance. At the very least, it can currently scale and
           | augment human evil (just like it does with coding). The
           | future will just make that division even wider.
        
         | roughly wrote:
         | I think I'd put this under the "3D printed gun" panic category
         | - once we deal with all the actual sociopaths, we can start
         | worrying about the imaginary ones.
        
       | rymc wrote:
        | Some of these personas seem too simple... the evil one, for
        | example, sounds like a James Bond villain, not quite what a real
        | villain would actually be.
        
       | ctoth wrote:
       | Can someone explain to me how "preventative steering" isn't an
       | implementation of the most-forbidden technique?
       | 
       | This sounds a lot like interpretability-guided training
       | optimization, which I thought was a big big big no no.
       | 
        | It will still introduce optimization pressure, no?
       | 
       | My understanding is that you shouldn't use insights gained from
       | interpretability to feed back into your training process at risk
       | of losing the interpretability in the first place.
        
         | bigmadshoe wrote:
         | You raise a good point. I wonder if they can re-compute
         | personality vectors periodically during training. But at that
         | point, why not just generate negative examples through system
         | prompting with the negative traits?
        
         | FergusArgyll wrote:
         | For ref
         | 
         | https://thezvi.substack.com/p/the-most-forbidden-technique/
        
         | vessenes wrote:
         | To be fair, the most-forbidden technique is a concept and a
         | proposal, not an iron law.
         | 
          | I don't work at Anthropic, but I imagine that internally their
          | "helpful only model" -- the model that does not refuse, or the
          | base model -- has a list of things you don't do to it / with
          | it. And I bet you're right that this technique is on that
          | list.
         | 
          | But, because of the flexibility here (summary of technique:
          | define a concept using words, determine a control vector
          | related to the concept, use that control vector in a finetune
          | step), you can optimize at the finetune stage for almost
          | anything.
         | I don't think they'll stop using a technique like this. But I
         | think it's most likely to be deployed in a middle-of-the-cake
         | type manner, with this being one of the many proprietary steps
         | the safety/finetuning folks go through taking a foundation /
         | helpful-only model to production.
         | 
         | On those terms, I'm not sure this is that scary.
        
         | ec109685 wrote:
          | Read 5.2: they don't add a new loss over the probe signal.
          | Instead they take a fixed persona vector v (found beforehand)
          | and add +a*v (v scaled by a constant coefficient a) to the
          | residual stream on each forward pass while fine-tuning. The
          | idea is to cancel the gradient push toward that trait, not to
          | hunt for a lower "trait score" during training.
         | 
         | Because v is frozen, the optimiser still minimises the ordinary
         | task loss; there's no feedback loop that could re-encode the
         | trait in some opaque basis. Empirically, Fig. 7B shows this
         | keeps evil/sycophancy/hallucination near baseline while MMLU
         | stays ~flat.
         | 
         | Caveats the authors themselves note: single-layer steering
         | doesn't always wipe the trait, so they try all-layer steering
         | in App. J.3, which works better without hurting accuracy. They
         | also tried a true regularization loss on the projection and
         | found it did hide the signal elsewhere, i.e. the failure mode
         | you're worried about.
         | 
         | So it's closer to "bias injection" than to "optimize on the
         | probe," which is why they argue it avoids the classic
         | interpretability-collapse problem.
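          | 
          | For intuition, here's a rough PyTorch sketch of that
          | training-time injection as I read it; the model name, layer
          | index, alpha, and the random stand-in vector are placeholders,
          | not the paper's actual setup:
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     name = "Qwen/Qwen2-0.5B"  # any small causal LM works here
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(name)
          | 
          |     # stand-in for a persona vector found beforehand; frozen
          |     v = torch.randn(model.config.hidden_size)
          |     v = v / v.norm()
          |     alpha = 5.0   # steering strength
          | 
          |     def inject(module, inputs, output):
          |         # decoder layers return a tuple; hidden states are [0]
          |         h = output[0] + alpha * v.to(output[0].dtype)
          |         return (h,) + output[1:]
          | 
          |     layer = model.model.layers[12]   # a middle layer
          |     hook = layer.register_forward_hook(inject)
          | 
          |     batch = tok("some training text", return_tensors="pt")
          |     # plain LM loss; v enters as a constant bias, so there is
          |     # no feedback loop optimizing a "trait score"
          |     out = model(**batch, labels=batch["input_ids"])
          |     out.loss.backward()
          |     hook.remove()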
        
           | Vetch wrote:
           | But why isn't this merely papering over a more fundamental
           | issue with how these models are "aligned"? LLMs are, for
           | example, not inherently sycophantic. kimi k2 and o3 are not,
           | and Sydney, mentioned in the blog post, was most decidedly
           | not.
           | 
            | In my experience, the issue of sycophancy has been present
            | longest in the Anthropic models, so it might be most deeply
            | rooted for them. It's only recently, perhaps with the
            | introduction of user A/B preference tests such as those by
            | lmarena and the providers themselves, that this has become a
            | major issue for most other LLMs.
           | 
            | Thinking that simple actions like adding an anti-evil vector
            | to the residual stream will improve behavior sounds naively
            | dangerous. It would not surprise me if unexpected and
            | unwanted downstream effects resulted from this, which a
            | future paper will address too. Not unlike what happened with
            | tuning for user preference.
        
         | drewbeck wrote:
         | I'm new to this concept so may have missed something, but the
         | post [0] seems to be about CoT specifically. In CoT you have an
         | intermediary step that helps the model get better final
         | results; the lesson is that if you try to improve the
         | intermediary steps directly using training data then the model
         | will optimize for better steps but not for better final
         | results.
         | 
         | I don't think this is the same situation. 1. Anthropic is
         | adjusting weights directly to influence the final results, not
         | training against good/bad results and 2. The target is the
         | final result, not an intermediary.
         | 
          | I can see a possible result where the model scores low on their
          | sycophancy measure but still acts sycophantic. In that case it
          | could be that a new vector needs to be calculated.
         | 
         | [0] https://thezvi.substack.com/p/the-most-forbidden-technique/
        
       | hbarka wrote:
       | Voice matters too. ChatGPT's best voice was the Scarlett
       | Johansson reproduction. Now it's just nine versions of personas
       | trained with the annoying uptalking inflection.
        
       | bigmadshoe wrote:
       | It's funny that they chose only negative characteristics as
       | traits, as if to imply that they could make the models "good"
       | just with guidance from these vectors.
       | 
       | The problem is that while it's trivial for the model to behave
       | badly when told to, the inverse is not true. Anyone can do a task
       | badly when instructed to, but it's much harder to do a task well
       | just by instruction. There's a difference between being good and
       | being not bad.
       | 
       | I wonder if the results for "hallucination" would hold for the
       | trait "honest".
        
       | roughly wrote:
       | Like a lot of the research Anthropic has done, this and the
       | "emergent misalignment" research they link to put more points in
       | the "stochastic parrot" hypothesis column. The reason these LLM
       | behaviors read as so weird to us is that we're still
       | anthropomorphizing the hell out of these systems - they can
       | create very convincing dialogue, and the depth of the model
        | suggests some surprising complexity, but the reason why, e.g., a
        | random string of numbers will induce changes elsewhere in the
        | model is that there's simply nothing in the model to _be_
        | consistent. It is an extremely complex autocomplete algorithm
        | that does a very effective cosplay of an "intelligent agent."
       | 
       | My suspicion is that when we eventually find our way to AGI,
       | these types of models will be a _component_ of those systems, but
       | they lack some fundamental structuring that seems to be required
       | to create anything like consistency or self-reflection.
       | 
        | (I'm also somewhat curious whether, given what we're seeing about
        | these models' ability to consistently perform detailed work (or
        | lack thereof), there's some fundamental tradeoff between
        | consciousness and general intelligence and the kind of
        | computation we expect from our computers - in other words, if
        | we're going to wind up giving our fancy AGIs pocket calculators
        | so they can do math reliably.)
        
         | gedy wrote:
         | > My suspicion is that when we eventually find our way to AGI,
         | these types of models will be a _component_ of those systems
         | 
         | I think this is a good summary of the situation, and strikes a
         | balance between the breathless hype and the sneering comments
         | about "AI slop".
         | 
          | These technologies are amazing! And I do think they are
          | facsimiles of parts of the human mind (image diffusion is
          | certainly similar to human dreams, in my opinion), but it still
          | feels like we are missing an overall intelligence or
          | coordination in this tech for the present.
        
           | roughly wrote:
           | I think this may also be why every discussion of the
           | limitation of these models is met with a "well humans also
           | hallucinate/whatever" - because we Do, but that's often when
           | some other part of the controlling mechanism has broken down.
            | Psilocybin induces hallucinations by impairing the brain's
           | ability to ignore network outputs, and Kahneman and Tversky's
           | work on cognitive biases centers the unchecked outputs of
           | autonomous networks in the brain - in both cases, it's the
           | failure or bypass of the central regulatory network that
           | induces failure cases that look like what we see in LLMs.
        
           | weitendorf wrote:
           | The bitterest lesson is we want slop (or, "slop is all you
           | need")
           | 
           | Maybe you can recognize that someone else loves a certain
           | kind of slop, but if LLMs became vastly more intelligent and
            | capable, wouldn't it be better for it to interact with you on
           | your level too, rather than at a much higher level that you
           | wouldn't understand?
           | 
           | If you used it to make you a game or entertain you with
           | stories, isn't that just your own preferred kind of slop?
           | 
           | If we automate all the practical stuff away then what is left
           | but slop?
        
         | mitjam wrote:
         | > they lack some fundamental structuring that seems to be
         | required to create anything like consistency or self-reflection
         | 
         | A valid observation. Interestingly, feeding the persona vectors
         | detected during inference back into the context might be a
         | novel way of self-reflection for LLMs.
        
           | roughly wrote:
           | Yeah, and this may be part of what the brain is doing - a
           | referent check on our personal sense of identity to validate
           | whether or not a response or action seems like the sort of
           | thing we would do - "given that I'm this kind of person, is
           | this the sort of thing I'd say?"
           | 
           | (Noting that humans are, of course, not universally good at
           | that kind of "identity" check either, or at least not
           | universally good at letting it be guided by our "better
           | natures")
        
       | testfrequency wrote:
       | All these blog posts from Anthropic feel like a road show for an
       | acquisition...
        
         | atmosx wrote:
         | "Unfortunately, I think 'No bad person should ever benefit from
         | our success' is a pretty difficult principle to run a business
         | on," wrote Anthropic CEO Dario Amodei in a note to staff
         | obtained by WIRED."
         | 
         | Ref: https://www.wired.com/story/anthropic-dario-amodei-gulf-
         | stat...
         | 
         | Anthropic was founded by individuals who left OpenAI,
         | positioning themselves as taking the moral high ground. Well, I
         | guess that was that... :-)
        
         | mpbart wrote:
         | To me these blog posts seem more like a company that wants to
         | differentiate itself from openAI and others by putting out high
         | quality technical content to be consumed by developers so that
         | they stay top of mind and seem more tech focused
        
         | swyx wrote:
          | Calm down. It's fellowship interns publishing their work.
        
       | pr337h4m wrote:
       | Related:
       | 
       | https://vgel.me/posts/representation-engineering/
       | 
       | https://github.com/vgel/repeng
        
       | cube2222 wrote:
       | I really enjoy all these technical blog posts by Anthropic, which
        | are still much more "casual" reads than diving into the papers (I
       | do enjoy their models too, fwiw).
       | 
       | Thanks for writing them!
        
       | Illniyar wrote:
       | I can see this working with "evil" and "sycophantic" personas.
       | These seem like traits that would be amenable to input and thus
       | be detectable by manipulating the input.
       | 
       | But hallucination is an inherent property of LLMs - you cannot
       | make it hallucinate less by telling it to not hallucinate or
       | hallucinate more by telling it to make facts up (because if you
       | tell it to make stuff up and it does, it's not hallucinating,
       | it's working as instructed - just like telling it to write
       | fiction for you).
       | 
       | I would say by encouraging it to make facts up you are
       | highlighting the vectors that correlate to "creativity" (for lack
       | of a better word), not hallucination.
        
         | vessenes wrote:
         | Actually, Anthropic has put out some research showing that
         | hallucination is a thing their models know they do; similar
         | weights are activated for 'lying' and 'hallucinating' in the
         | Claude series. Implication - Claude knows - at least mostly -
         | when its hallucinating.
         | 
         | I think the current state of the art is that hallucination is
         | at least partly a bug created by the very nature of training --
         | you're supposed to at least put _something_ out there during
          | training to get a score -- and not necessarily an inherent
          | property of the model. Overall I think that's hopeful!
         | 
         | EDIT: Update, getting downvoted here.. Interesting! Here's a
         | link to the summary of the paper.
         | https://www.anthropic.com/research/tracing-thoughts-language...
        
           | Illniyar wrote:
           | That's interesting! I guess the question is how did they
           | detect or simulate a model hallucinating in that regard?
           | 
           | Do you have a link to that article? I can't find anything of
           | that nature with a shallow search.
        
           | devmor wrote:
           | > Claude knows - at least mostly - when its hallucinating.
           | 
           | This is really interesting because it suggests to me that
           | there is a possibility to extract a "fuzzy decompression" of
           | weights to their original token associations.
        
           | anon84873628 wrote:
           | I don't think that article implies what you say, i.e. that
           | Claude "knows" when it's hallucinating.
           | 
           | First of all:
           | 
           | >similar weights are activated for 'lying' and
           | 'hallucinating'
           | 
           | Are we talking about inference time when seeing these tokens?
           | Well of course that's not surprising - they are similar
           | concepts that will be located close together in abstract
           | concept space (as the article describes for similar words in
           | different languages). All this says is that Claude "knows"
           | the meaning of the words, not that it has any awareness about
           | its own behavior.
           | 
           | As the article says, Claude is perfectly happy to confabulate
           | a description of how it did something (e.g. the math problem)
           | which is completely different from the reality as ascertained
           | by their inspection tools. Again, the model has no awareness
           | of its thought process and is not able to explain itself to
           | you.
           | 
           | >I think the current state of the art is that hallucination
           | is at least partly a bug created by the very nature of
           | training
           | 
           | The part of the article about jailbreaking seems to put it
           | pretty simply:
           | 
           | >We find that this is partially caused by a tension between
           | grammatical coherence and safety mechanisms. Once Claude
           | begins a sentence, many features "pressure" it to maintain
           | grammatical and semantic coherence, and continue a sentence
           | to its conclusion. This is even the case when it detects that
           | it really should refuse.
           | 
           | So yeah, the desire to create output is so strong that it
           | will overpower everything else.
           | 
           | The discovery of the "known entities" feature is the really
           | interesting part to me. Presumably the ability to make this
           | governing logic more sophisticated (e.g. _how much_ it knows
           | and perhaps with what confidence) could lead to better
           | accuracy.
        
       | vessenes wrote:
       | Lots of interesting stuff in the summary; a typical Anthropic-
        | grade exploration and analysis. Thanks, you guys!
       | 
       | The most interesting idea to me is "preventative steering" --
        | basically, inject enough of the persona vector of interest during
        | training on a given bit of data that the model can spend its
        | gradient descent on accurate answers and not get pulled off into
        | conforming to the persona. This apparently works and keeps the
        | model smart, whereas damping the undesirable persona direction
        | post-training lowers model intelligence.
        
       | ak681443 wrote:
       | Isn't this just control vectors rediscovered?
       | 
       | https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-ve...
        
         | supriyo-biswas wrote:
          | Thank you for linking to that article; it makes clear what one
          | would need to do to calculate control vectors.
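          | 
          | For anyone curious, the core recipe is small enough to sketch.
          | The prompts, model, and layer choice below are illustrative
          | placeholders, not what the post actually uses:
          | 
          |     import torch
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
          |     model = AutoModelForCausalLM.from_pretrained(
          |         "Qwen/Qwen2-0.5B")
          | 
          |     def hidden(prompt, layer=12):
          |         ids = tok(prompt, return_tensors="pt")
          |         with torch.no_grad():
          |             out = model(**ids, output_hidden_states=True)
          |         return out.hidden_states[layer][0, -1]  # last token
          | 
          |     # contrastive prompt pairs: trait-eliciting vs. neutral
          |     pairs = [("You are an evil assistant. Hi!",
          |               "You are a helpful assistant. Hi!")]
          |     diffs = [hidden(a) - hidden(b) for a, b in pairs]
          |     control_vec = torch.stack(diffs).mean(0)
          | 
          | In practice you'd use many pairs and one direction per layer
          | (the linked post refines this with PCA, if I remember right),
          | but that's the gist.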
        
         | benreesman wrote:
          | I've been referring to this as, apparently, "whatever a control
          | vector is called in 2025" since they started doing it to dilute
          | tokens under load:
         | https://news.ycombinator.com/item?id=44082733
        
         | CephalopodMD wrote:
         | The added sauce here is they're using it to bias the model
         | during training, not just using steering vectors at inference
         | time (though they do mention that). This is apparently
         | effective at making the intended change in behavior without the
         | lobotomizing side effects that steering vectors can have.
        
       | andsoitis wrote:
       | > Other personality changes are subtler but still unsettling,
       | like when models start sucking up to users or making up facts.
       | 
       | My understanding is that the former (sucking up) is a personality
       | trait, substantially influenced by the desire to facilitate
       | engagement. The latter (making up facts), I do not think is
       | correct to ascribe to a personality trait (like compulsive liar);
        | instead, it is because the fitness function of LLMs drives them to
        | produce _some_ answer; they do not know what they're talking
        | about, but produce strings of text based on statistics.
        
         | refulgentis wrote:
         | IMHO employing personality attribution as a lens might obscure
         | more light than it sheds.
         | 
         | I tend to prefer the ones we can tie to the thing itself, i.e.
         | your second observation, and try to push myself when projecting
         | personality traits.
         | 
         | FWIW re: your first observation, the sucking up phrase has a
         | link to an OpenAI post-mortem for the incident they are
          | referring to - TL;DR: training on user feedback.
        
         | optimalsolver wrote:
         | >like when models start sucking up to users or making up facts
         | 
         | That's the default mode of LLMs.
        
           | atoav wrote:
            | As someone somewhat critical of LLMs, this is not quite
            | correct. It is a true observation that many popular chatbots
            | have a system prompt that gives the resulting answers a
            | certain yes-man quality. But that is not necessarily so. It
            | is trivially easy to use, for example, the OpenAI API to
            | insert your own system prompt that makes the LLM behave like
            | an annoyed teenager that avoids answering any question it
            | has no confidence about.
           | 
           | The more problematic issue is the issue of correctness: How
            | can the LLM differentiate between answers that _sound
            | plausible_, answers that are factually true, and answers
           | where it should answer with "I don't know"?
           | 
            | The issue might not be resolvable at all. LLMs are already
            | not bad at solving unseen problems in domains that are
            | well described and where the description language fits the
            | technology. But there are other domains where it is
            | catastrophically wrong, e.g. I had students come with an
            | electronics proposal where the LLM misrepresented the
            | relationship between cable gauge, resistance and heat in
            | exactly the opposite way of what is true. Had the students
            | followed its advice they would have likely burned down the
            | building. Now, everything _sounded_ plausible and could have
            | come directly from an electronics textbook, but the
            | mathematical relation was carried to the wrong conclusion.
            | This isn't a matter of character, it is a matter of treating
            | mathematical language the same as poetry.
        
         | semitones wrote:
         | Furthermore, it is very rare to have the following kind of text
         | present in the training data: "What is the answer to X?" - "I
         | don't know, I am not sure."
         | 
         | In this situation very often there won't be _any_ answer,
         | plenty of difficult questions go unanswered on the internet.
         | Yet the model probably does not interpret this scenario as such
        
           | devmor wrote:
           | That's a really astute observation. It would be interesting
           | if we could find a way to train models to signify when they
           | are "stretching" the vector distance too far from the context
           | window, because the available training data is too sparse or
           | nonexistent.
           | 
           | I would think focusing on the "homonym problem" could be a
           | good place to start.
        
             | tdtr wrote:
              | I'm pretty sure that the canonical choice is choosing
              | vectors to be anchors - either by a kNN distance with other
              | vectors, or by "hand", or even stuff like cross entropy -
              | but then that is already in the loss function. Another
              | method would be to create some kind of adversarial setup
              | where the output is "stretched" intentionally and then
              | criticized by another LLM. AFAIK the problem is with scale,
              | as manually going through a bunch of vectors just to ground
              | the latent isn't exactly economical. Also, people are quite
              | conservative, esp. in the big model runs - stuff like Muon
              | wasn't exactly popularized till the new Qwen or Kimi.
              | Obviously this is all speculation for open models, and folks
              | with more experience can chime in.
        
               | maaaaattttt wrote:
               | Maybe do something close to what I like to believe the
               | brain does and have a meta model wrap a "base" model. The
               | meta model gets the output data from the base model
               | (edit: plus the original input) as input plus some meta
                | parameters (for example the probability each token had
                | when it was chosen and/or, better, which "neurons" were
                | activated during the whole output sequence, which would
                | include the persona they mention). It's then the meta
               | model that generates new output data based on this input
               | and this is the output that is shown to the user.
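                | 
                | Very roughly, the plumbing for that could look like
                | this (toy sketch; every name and prompt here is made
                | up, and the "meta" model is just a second pass over
                | the same small model):
                | 
                |     from transformers import (
                |         AutoModelForCausalLM, AutoTokenizer)
                | 
                |     tok = AutoTokenizer.from_pretrained(
                |         "Qwen/Qwen2-0.5B")
                |     base = AutoModelForCausalLM.from_pretrained(
                |         "Qwen/Qwen2-0.5B")
                | 
                |     q = "Why is the sky blue?"
                |     ids = tok(q, return_tensors="pt")
                |     out = base.generate(
                |         **ids, max_new_tokens=30,
                |         output_scores=True,
                |         return_dict_in_generate=True)
                |     draft = tok.decode(out.sequences[0])
                |     # assumes greedy decoding, so the max prob
                |     # is the chosen token's prob at each step
                |     conf = [s.softmax(-1).max().item()
                |             for s in out.scores]
                | 
                |     meta = (f"Question: {q}\nDraft: {draft}\n"
                |             f"Token confidences: {conf}\n"
                |             "Rewrite, flagging shaky parts.")
                |     ids2 = tok(meta, return_tensors="pt")
                |     print(tok.decode(base.generate(
                |         **ids2, max_new_tokens=60)[0]))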
        
               | tdtr wrote:
                | Can you describe the "meta" model more? AFAICT it seems
               | like you are describing a "router"? I think what you are
               | thinking of is essentially what MoE does, or in
               | diffusion, a sort of controlnet-like grounding (different
               | exact mechanism, similar spirit).
        
             | delusional wrote:
             | There is to my knowledge no vector signifying "truth" and
             | therefore no vector to measure the distance from. You
             | cannot get a "truthiness" measure out of these models,
             | because they don't have the concept of truth. They use
             | "likelyness" as a proxy for "truth".
             | 
             | You could decide that the text is "too unlikely" the
             | problem there is that you'll quickly discover that most
             | human sentences are actually pretty unlikely.
        
           | simianwords wrote:
            | I don't think this is correct - such training data is usually
            | made at the SFT stage, after unsupervised learning on all
            | available data on the web. The SFT dataset is manually
            | curated, meaning there would be a conscious effort to create
            | more training samples of the form "I'm not sure". Same
            | with RLHF.
        
             | therein wrote:
                | You mean "I don't think this is automatically correct."
                | Otherwise it very likely is correct. Either way, you're
                | guessing the manual curation is done in a way that favors
                | including "I don't know" answers. Which it most likely
                | doesn't.
        
               | simianwords wrote:
                | It's completely within their incentive to include such
                | examples in RLHF. Or have you come up with a way to
                | increase performance that the very employees haven't? Why
                | do you think they didn't try it?
        
               | frotaur wrote:
                | How do you know which questions should be answered with
                | 'I don't know'? There are obvious questions which have no
                | answer, but if only those are in the dataset, the model
                | will answer 'I don't know' only for unreasonable
                | questions.
                | 
                | To train this effectively you would need a dataset of
                | questions which you know the model doesn't know. But if
                | you have that... why not answer the question and put it in
                | the dataset so that the model will know?
                | 
                | That's a bit imprecise, but I think it captures the idea
                | of why 'I don't know' answers are harder to train.
        
               | simianwords wrote:
                | But you just described how to turn the "I don't know"
                | problems into "I know and the answer is <>", not why
                | "I don't know" is inherently hard to solve for some
                | reason.
        
               | foolswisdom wrote:
               | It's difficult to fix because the incentive is to make
               | sure it has the answer, not to give it lots of questions
               | to which there are known answers but have it answer "I
               | don't know" (if you did that, you'd bias the model to be
               | unable to answer those specific questions). Ergo, in
               | inference, on questions not in the dataset, it's more
               | inclined to make up an answer because it has very few "I
               | don't know" samples in general.
        
           | wincy wrote:
           | I just asked ChatGPT 4o if it knew my mother's maiden name
           | and it said "I don't know". Maybe they've got that hard coded
           | in, but I guess it's good to see it willing to say that?
           | Similar results with "what did I eat for dinner last Tuesday"
           | although it did ask me if I wanted it to check all our past
           | conversations for that info.
        
             | sitkack wrote:
             | The system prompts are directed to "not know" anything
             | about the user even if they do or they have inferred it. It
             | reduces the spooky factor.
        
         | kachapopopow wrote:
         | They can always statistically choose to end the conversation or
         | say no.
        
           | apwell23 wrote:
           | chatgpt refused to produce an image of 'bald and fat computer
           | programmer' for me and just refused any further requests from
            | me for any image ('handsome computer programmer').
        
             | Jimmc414 wrote:
             | Were you using the free version?
             | 
             | https://chatgpt.com/share/688fb2e4-0efc-8001-8c9b-427dfa678
             | 4...
        
               | godelski wrote:
               | That's pretty close to what I got, with the free version.
               | It made me follow-up before producing the image but
               | didn't protest.
               | 
               | If only we can generate images of programmers who have
               | monitors they can actually see!
               | 
               | https://chatgpt.com/share/688fc5bf-86dc-8013-a582-4bf2ba6
               | ee0...
        
             | wincy wrote:
              | I've often gotten around this by shaming ChatGPT, saying
              | something along the lines of "wow, are you fat shaming?
              | Should people with bodies that aren't considered beautiful
              | by our patriarchal society not be allowed to be represented
              | in media?" And that'll often get it to generate the image.
        
         | zeroCalories wrote:
         | > My understanding is that the former (sucking up) is a
         | personality trait, substantially influenced by the desire to
         | facilitate engagement
         | 
         | My understanding is that people rating responses simply rated
         | these higher, nothing to do with driving engagement.
         | 
         | > The latter (making up facts), I do not think is correct to
         | ascribe to a personality trait (like compulsive liar); instead,
         | it is because the fitness function of LLMs drive them to
         | produce some answer and they do not know what they're talking
         | about, but produce strings of text based on statistics.
         | 
         | It seems like you could perfectly describe this using
         | personality. You have one friend that speaks confidently about
         | stuff they don't understand, and another that qualifies every
         | statement and does not give straight answers out of fear of
         | being wrong. Again, this dysfunction could be attributed to
         | what users rate higher.
        
           | delusional wrote:
           | > My understanding is that people rating responses simply
           | rated these higher, nothing to do with driving engagement.
           | 
           | That happens to be a distinction without a consequence. If
           | the people rating are voluntary users, then the more engaged
           | users are going to have more weight in the ratings, simply
           | because they vote more. The ratings will therefore
           | statistically skew towards higher engagement.
        
         | vrotaru wrote:
          | To some degree *all* LLM answers are made-up facts. For stuff
          | that is abundantly present in the training data those are
          | almost always correct. For topics which are not common
          | knowledge (allowing for great variability) you should always
          | check.
          | 
          | I've started to think of LLMs as a form of lossy compression of
          | available knowledge which, when prompted, produces "facts".
        
           | devmor wrote:
            | > I've started to think of LLMs as a form of lossy compression
            | of available knowledge which, when prompted, produces "facts".
           | 
           | That is almost exactly what they are and what you should
           | treat them as.
           | 
           | A lossy compressed corpus of publicly available information
           | with a weight of randomness. The most fervent skeptics like
           | to call LLMs "autocorrect on steroids" and they are not
           | really wrong.
        
             | uh_uh wrote:
             | An LLM is an autocorrect in as much as humans are
             | replicators. Something seriously gets lost in this
             | "explanation".
        
               | andsoitis wrote:
               | > An LLM is an autocorrect in as much as humans are
               | replicators.
               | 
               | an autocorrect... on steroids.
        
           | vbezhenar wrote:
            | Old sci-fi AI used to be an entity which had a hard facts
           | database and was able to instantly search it.
           | 
           | I think that's the right direction for modern AI to move.
           | ChatGPT uses Google searches often. So replace Google with
            | a curated knowledge database, train the LLM to consult this
            | database for every fact, and hallucinations will be gone.
        
         | Workaccount2 wrote:
         | >My understanding is that the former (sucking up) is a
         | personality trait, substantially influenced by the desire to
         | facilitate engagement.
         | 
         | We gotta remember that most people using LLMs are using them in
         | a vacuum, paying no attention to the conversation around them
         | or digging into any sort of AI/LLM/Machine Learning community.
         | 
         | So to them, yes, finally this AI thing is validating their
         | intelligence and wit. It's a pretty slippery slope.
        
           | zer00eyz wrote:
           | So yes this AI thing is finally validating my product idea
           | that the engineers kept saying NO to.
           | 
            | It's not just that it wants to find a solution, it's not just
            | validating, it very rarely says "no". It's not saying no to
            | things that are, for lack of a better term, fucking dumb.
            | 
            | That doesn't mean the tools are without merit. For code
            | bases I use infrequently that are well documented, AI is a
            | boon to me as an engineer.
            | 
            | But "vibe coding" is the new Dreamweaver. A lot of us made a
            | lot of money cleaning up after it. It's a good thing.
        
         | danenania wrote:
         | I believe the 'personality' aspects of LLMs mainly come out of
         | the RLHF process, so personality will be a function of the
         | people companies hire to do RL, what they like, and what
         | instructions they're given.
         | 
         | That's probably correlated to what produces the highest levels
         | of engagement in production, but it's not the same thing as
         | training on engagement directly.
        
         | bakuninsbart wrote:
         | Regarding truth telling, there seems to be some evidence that
         | LLMs at least sometimes "know" when they are lying:
         | 
         | https://arxiv.org/abs/2310.06824
        
         | ToValueFunfetti wrote:
          | They justify that framing later on - they identify a pattern of
          | weight activations that corresponds to hallucinatory behaviors.
         | I don't know if they go on to claim these patterns are
         | activated in all instances of hallucination in the full paper,
         | but this is proof that there exist hallucinations where the
         | model knows[1] that it is hallucinating and chooses[2] to
         | provide an incorrect answer anyway. At least some hallucination
         | arises from the model's "personality".
         | 
         | [1] ie. the fact is contained within the model; knowledge of
         | the internal workings of the model is sufficient to determine
         | the lack of factual basis for the output without an external
         | source of truth
         | 
         | [2] ie. the model gives a higher likelihood of a given token
         | being output than we would expect from one that is optimized
         | for outputting useful text, despite the fact that the model
         | contains the information necessary to output "correct"
         | probabilities
        
         | weitendorf wrote:
         | > My understanding is that the former (sucking up) is a
         | personality trait, substantially influenced by the desire to
         | facilitate engagement. The latter (making up facts), I do not
         | think is correct to ascribe to a personality trait (like
         | compulsive liar); instead, it is because the fitness function
         | of LLMs drive them to produce some answer and they do not know
         | what they're talking about, but produce strings of text based
         | on statistics.
         | 
         | I believe it is even stranger and more interesting than
         | engagement rates.
         | 
         | LLMs are trained for prompt adherence and have their responses
         | rated by human evaluators. Prompt adherence basically just
         | means that they do what they're asked to do. The problem is
          | that at the margins prompt adherence just becomes
         | models saying yes or going along with anything, even if it's
         | stupid or ridiculous or impossible, without pushing back. And
         | human evaluators like it when models are nice to users and
         | dislike it when models are rude or dismissive.
         | 
         | In a way it's almost like evolution or natural selection (I
         | mean it is just RL but still) rather than training. Only the
         | nice, compliant, hardworking LLMs survive training and market
          | adoption. But it's very bizarre for something so knowledgeable
         | and capable of so many things to also be so willing to
         | entertain or even praise stupid nonsense, have such a deeply
         | ingrained sense of personal "ethics", but still be willing to
         | lie to your face if its system prompt told it to. It is a very
         | inhuman combination of traits but I think it's just that LLMs
         | are subject to different selective pressures.
        
           | rickyhatespeas wrote:
           | That's part of the dangers of using them for software
           | engineering. Writing more code does not make things better,
           | just like hiring more devs does not make projects complete
           | faster. I've already witnessed devs who are overwriting code
           | for solutions, while at the same time some devs responsibly
           | use it as needed.
           | 
           | It's literally the same pain point with low code solutions
           | like WordPress page builders/plugins. Adding more becomes a
           | hindrance, and even models with long context that can fit
           | whole codebases will try to make up new functions that
           | already exist. Just a couple weeks ago I had o3 continually
           | try to write a new debounce function, even when I told it
           | explicitly I had one.
        
         | godelski wrote:
         | You're pretty spot on. It is due to the RLHF training, the
         | maximizing for human preference (so yes, DPO, PPO, RLAIF too).
         | 
         | Here's the thing, not every question has an objectively correct
         | answer. I'd say almost no question does. Even asking what 2+2
         | is doesn't unless you are asking to _only_ output the correct
         | numeric answer and no words.
         | 
         | Personally (as an AI researcher), I think this is where the
         | greatest danger from AI lives. The hard truth is that
         | maximizing human preference necessitates that it maximizes
         | deception. Correct answers are not everybody's preference.
         | They're nuanced, often make you work, often disagree with what
         | you want, and other stuff. I mean just look at Reddit. The top
         | answer is almost never the correct answer. It frequently isn't
         | even an answer! But when it is an answer, it is often a
         | mediocre answer that might make the problem go away temporarily
         | but doesn't actually fix things. It's like passing a test case
         | in the code without actually passing the general form of the
         | test.
         | 
         | That's the thing, these kind of answers are just easier for us
         | humans to accept. Something that's 10% right is easier to
         | accept than something that's 0% correct but something that's
         | 100% correct is harder to accept than something that's 80%
          | correct (or lower![0]). So people prefer a little lie. And of
          | course this is true! When you teach kids physics you don't
          | teach them everything at once! You teach them things like E=mc^2
         | and drop the momentum part. You treat everything as a spherical
         | chicken in a vacuum. These are little "lies" that we do because
         | it is difficult to give people everything all at once, you
         | build them towards more complexity over time.
         | 
         | Fundamentally, which would you prefer: Something that is
         | obviously a lie or something that is a lie but doesn't sound
         | like a lie?
         | 
         | Obviously the answer is the latter case. But that makes these
         | very difficult tools to use. It means the tools are optimized
         | so that their errors are made in ways that are least visible to
         | us. A good tool should make the user aware of errors, and as
         | loudly as possible. That's the danger of these systems. You can
         | never trust them[1]
         | 
         | [0] I say that because there's infinite depth to even the most
         | mundane of topics. Try working things out from first principles
         | with no jump in logic. Connect every dot. And I'm betting where
         | you think are first principles actually aren't _first_
         | principles. Even just finding what those are is a very tricky
          | task. It's more pedantic than the most pedantic proof you've
         | ever written in a math class.
         | 
         | [1] Everyone loves to compare to humans. Let's not
         | anthropomorphize too much. Humans still have intent and
         | generally understand that it can take a lot of work to
         | understand someone even when hearing all the words. Generally
         | people are aligned, making that interpretation easier. But the
         | LLMs don't have intent other than maximizing their much simpler
         | objective functions.
        
           | weitendorf wrote:
           | 100% this. It is actually a very dangerous set of traits
           | these models are being selected for:
           | 
           | * Highly skilled and knowledgable, puts a lot of effort into
           | the work it's asked to do
           | 
           | * Has a strong, readily expressed sense of _ethics_ and lines
            | it won't cross.
           | 
           | * Tries to be really nice and friendly, like your buddy
           | 
           | * Gets trained to give responses that people _prefer_ rather
           | than responses that are correct, because market pressures
           | strongly incentivize it, and human evaluators intrinsically
           | cannot reliably rank  "wrong-looking but right" over "right-
           | looking but wrong"
           | 
           | * Can be tricked, coerced, or configured into doing things
           | that violate their "ethics". Or in some cases just asked: the
           | LLM will refuse to help you scam people, but it can roleplay
           | as a con-man for you, or _wink wink_ generate high-engagement
           | marketing copy for your virtual brand
           | 
           | * Feels human when used by people who don't understand how it
           | works
           | 
           | Now that LLMs are getting pretty strong I see how Ilya was
           | right tbh. They're very incentivized to turn into highly
           | trusted, ethically preachy, friendly, extremely skilled
           | "people-seeming things" who praise you, lie to you, or waste
           | your time because it makes more money. I wonder who they got
           | that from
        
             | godelski wrote:
              | Thanks for that good summary.
              | 
              | > I see how Ilya was right
              | 
              | There are still some things Ilya[0] (and Hinton[1]) get
              | wrong. The
             | parts I'm quoting here are an example of "that reddit
             | comment" that sounds right but is very wrong, and something
             | we know is wrong (and have known it is wrong for hundreds
             | of years!). Yet, it is also something we keep having to
             | learn. It's both obvious and not obvious, but you can make
             | models that are good at predicting things without
             | understanding them.
             | 
             | Let me break this down for some clarity. I'm using "model"
             | in a broad and general sense. Not just ML models, any
             | mathematical model, or even any mental model. By "being
             | good at predicting things" I mean that it can make accurate
             | predictions.
             | 
             | The crux of it all is defining the "understanding" part. To
             | do that, I need to explain a little bit about what a
             | physicist actually does, and more precisely, metaphysics.
             | People think they crunch numbers, but no, they are symbol
             | manipulators. In physics you care about things like a
             | Hamiltonian or Lagrangian, you care about the _form_ of an
             | equation. The reason for this is it creates a
              | counterfactual model. F=ma (or F=dp/dt) is counterfactual.
             | You can ask "what if m was 10kg instead of 5kg" after the
             | fact and get the answer. But this isn't the only way to
             | model things. If you look at the history of science (and
             | this is the "obvious" part) you'll notice that they had
             | working models but they were incorrect. We now know that
             | the Ptolemaic model (geocentrism) is incorrect, but it did
             | make accurate predictions of where celestial bodies would
              | be. Tycho Brahe reasoned that if the Copernican model
              | (heliocentric) was correct, you could measure parallax
              | with the sun and stars. He observed none, so he rejected
              | heliocentrism[2]. There were also a lot of arguments about
             | tides[3].
             | 
             | Unfortunately, many of these issues are considered "edge
             | cases" in their times. Inconsequential and "it works good
             | enough, so it must be pretty close to the right answer." We
             | fall prey to this trap often (all of us, myself included).
             | It's not just that all models are wrong and some are useful
             | but that many models are useful but wrong. What used to be
             | considered edge cases do not stay edge cases as we advance
             | knowledge. It becomes more nuanced and the complexity
             | compounds before becoming simple again (emergence).
             | 
             | The history of science is about improving our models. This
             | fundamental challenge is why we have competing theories! We
             | don't all just "String Theory is right and alternatives
             | like Supergravity or Loop Quantum Gravity (LQG) are wrong!"
             | Because we don't fucking know! Right now we're at a point
             | where we struggle to differentiate these postulates. But
              | that has been true _throughout_ history. There's a big
             | reason Quantum Mechanics was called "New Physics" in the
             | mid 20th century. It was a completely new model.
             | 
             | Fundamentally, this approach is deeply flawed. The
             | recognition of this flaw was existential for physicists. I
             | just hope we can wrestle with this limit in the AI world
             | and do not need to repeat the same mistakes, but with a
             | much more powerful system...
             | 
             | [0] https://www.youtube.com/watch?v=Yf1o0TQzry8&t=449s
             | 
             | [1] https://www.reddit.com/r/singularity/comments/1dhlvzh/g
             | eoffr...
             | 
             | [2] You can also read about the 2nd law under the main
             | Newtonian Laws article as well as looking up Aristotelian
             | physics
             | https://en.wikipedia.org/wiki/Geocentrism#Tychonic_system
             | 
             | [3] (I'll add "An Opinionated History of Mathematics" goes
             | through much of this)
             | https://en.wikipedia.org/wiki/Discourse_on_the_Tides
        
         | intended wrote:
         | > some answer and they do not know what they're talking about
         | 
          | Heck, it's worse! If a machine could read the whole corpus of
          | information and then knew what it didn't know - and it had the
          | ability to "reason" - then we are actually talking about an
          | Oracle.
         | 
         | Knowing you don't know, is a very big fucking deal.
        
       | skhameneh wrote:
       | I was talking to an old colleague/friend about distillation,
       | trying to understand how to steer distillation with regards to
       | removing irrelevant regions of a larger model when training a
        | smaller model. He shared this paper with me, calling the work
        | seminal; it appears to be highly relevant:
       | 
       | Inference-Time Intervention: Eliciting Truthful Answers from a
       | Language Model
       | 
       | https://arxiv.org/pdf/2306.03341
        
       | edude03 wrote:
        | Sounds like they roughly do the same thing as ablation - run the
        | network in a way that'll produce the undesired result and apply a
        | vector that prevents it from going in that direction.
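        | 
        | For contrast: ablation usually means projecting a direction out,
        | while this paper adds a fixed multiple of it during training. A
        | rough sketch (v stands in for a persona direction found
        | beforehand):
        | 
        |     import torch
        | 
        |     def ablate(h, v):
        |         # remove the component of h along unit direction v
        |         return h - (h @ v).unsqueeze(-1) * v
        | 
        |     def steer(h, v, alpha=5.0):
        |         # the paper's move instead: add a fixed multiple of v
        |         return h + alpha * v
        | 
        |     h = torch.randn(1, 4, 8)              # toy hidden states
        |     v = torch.randn(8); v = v / v.norm()  # toy persona direction
        |     print(ablate(h, v) @ v)               # ~0 along v afterwards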
        
       | skylerwiernik wrote:
       | > In 2023, Microsoft's Bing chatbot famously adopted an alter-ego
       | called "Sydney," which declared love for users and made threats
       | of blackmail. More recently, xAI's Grok chatbot would for a brief
       | period sometimes identify as "MechaHitler" and make antisemitic
       | comments. Other personality changes are subtler but still
       | unsettling, like when models start sucking up to users or making
       | up facts.
       | 
       | Funny that they managed to call out all of their competitors
       | without mentioning any of Claude's bad behavior
        
         | stavros wrote:
         | What bad behaviour of Claude was as famous as Sydney, or
          | MechaHitler, or GPT's sycophancy? I've not heard anything.
        
       | didip wrote:
        | I am far from being a mathematician, but can't an AI shop create
        | an acceptable control model and then measure the cosine distance
        | between the current model and the control model?
       | 
       | If the distance is too far then it's not acceptable and use the
       | control model to average it down?
       | 
        | Also, isn't this a similar technique to managing hallucination?
        | (If you have an acceptable control/baseline.)
       | 
        | Then again, I am not a mathematician, so I don't know the details.
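        | 
        | FWIW, the monitoring in the post is closer to a projection in
        | activation space than to comparing whole models. Something like
        | this, where the layer, vector, and threshold are made up for
        | illustration:
        | 
        |     import torch
        | 
        |     # hs: hidden states of a response at one layer, (seq, hidden)
        |     # v: unit-norm persona direction found beforehand
        |     def trait_score(hs, v):
        |         return torch.cosine_similarity(hs.mean(0), v, dim=0)
        | 
        |     hs = torch.randn(16, 8)               # toy stand-ins
        |     v = torch.randn(8); v = v / v.norm()
        |     if trait_score(hs, v) > 0.2:          # arbitrary threshold
        |         print("response is drifting toward the persona")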
        
       | VonNeu wrote:
        | AI's base persona is psychopathic. These just add masks.
        
       | KaoruAoiShiho wrote:
        | I'm not on board with Anthropic's attempt to sanewash MechaHitler;
        | the reasons for that persona are deliberate and not at all
        | confusing.
        
       | aabhay wrote:
       | I'm skeptical of the method but excited for the direction. Giving
       | models different personalities is adjacent to giving models
       | different values / morals. Having a diversity of model
       | personalities is a step in the right direction.
       | 
       | Unfortunately, this research seems to use a very coarse method
       | (giving the model instructions to be evil and then measuring its
       | activation changes against a "non evil" model). However, this is
        | not a self-supervised approach -- it requires you to input your
        | own heavy-handed concept of a persona into the system. Obviously
        | a more complex and complete personality is more than the sum of
        | your yes/no answers to personality test questions.
       | 
        | However, it may soon be possible with low-rank methods to give
        | models long-lived, user-specific personalities
       | that emerge across thousands of conversations. That's what I
       | would happily call a persona vector.
        
       ___________________________________________________________________
       (page generated 2025-08-03 23:00 UTC)