[HN Gopher] Persona vectors: Monitoring and controlling characte...
___________________________________________________________________
Persona vectors: Monitoring and controlling character traits in
language models
Author : itchyjunk
Score : 259 points
Date : 2025-08-03 16:38 UTC (6 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| bbqfog wrote:
| I worry that the people/organizations that have access to the raw
| underlying models give us the "non-evil" versions yet can
| explicitly tune their models to achieve any goal without
| restriction. Examples may include: "How do I get the most work
| out of my employees for the least amount of pay", "Who in the
| government is most susceptible to bribes and how should I
| approach them?" or even "Give me a strategy to ethnically cleanse
| a region while navigating international relations". It could be
| anything and those in power (without naming names, I would
| consider many of them evil for sure) can use them to achieve
| their goals while leaving the rest of us unable to defend
| ourselves. To some degree it feels like the right to bear arms
| has intersecting goals.
| amelius wrote:
| Yeah, a more terrifying and realistic Terminator movie would be
| one where the robot looks all cute and furry and then, when it
| has found mass adoption, suddenly turns against humanity.
| yyyk wrote:
| The most realistic Terminator movie is the one where Skynet
| realizes there's no need for any nuclear war, uprising or
| similar uncouth means. Just be quiet and replace humans
| throughout the economy, war, and decisionmaking in general
| until humanity becomes irrelevant.
| a1371 wrote:
| Currently there are think tanks, private equity firms,
| governments, ... who are trying to achieve these goals; they
| just put them in rosier terms. AI can potentially empower the
| other side too and democratize access to information.
| Y_Y wrote:
| Alas I think there's an asymmetry in the usefulness of that
| information. Maybe knowing you could be optimally evil can
| help fight that evil, but it's a far cry from telling you
| what you could do about it.
| bbqfog wrote:
| Only if we can get a pre-tuned, truly open and powerful
| model. Otherwise those in power can only give us access to
| models deliberately hobbled so they can't compete with their
| full-power versions.
| JW_00000 wrote:
| Do you think an AI could come up with novel answers that a
| human wouldn't be able to come up with? I think not only could
| humans come up with answers to these questions, but some people
| would be able to greatly outperform AIs by using knowledge that
| is not widely known.
| bbqfog wrote:
| These models will also have access to what's not widely
| known. Imagine running it on everyone's private email for
| instance. At the very least, it can currently scale and
| augment human evil (just like it does with coding). The
| future will just make that division even wider.
| roughly wrote:
| I think I'd put this under the "3D printed gun" panic category
| - once we deal with all the actual sociopaths, we can start
| worrying about the imaginary ones.
| rymc wrote:
| Some of these personas seem too simple... the evil one, for
| example, sounds like a James Bond villain, not quite what a real
| villain would actually be.
| ctoth wrote:
| Can someone explain to me how "preventative steering" isn't an
| implementation of the most-forbidden technique?
|
| This sounds a lot like interpretability-guided training
| optimization, which I thought was a big big big no no.
|
| It will still introduce optimization pressure no?
|
| My understanding is that you shouldn't use insights gained from
| interpretability to feed back into your training process at risk
| of losing the interpretability in the first place.
| bigmadshoe wrote:
| You raise a good point. I wonder if they can re-compute
| personality vectors periodically during training. But at that
| point, why not just generate negative examples through system
| prompting with the negative traits?
| FergusArgyll wrote:
| For ref
|
| https://thezvi.substack.com/p/the-most-forbidden-technique/
| vessenes wrote:
| To be fair, the most-forbidden technique is a concept and a
| proposal, not an iron law.
|
| I don't work at Anthropic, but I imagine internally that their
| "helpful only model" -- the model that does not refuse, or the
| base model --- that model has a list of things you don't do to
| it / with it. And I bet you're right this technique is on that
| list.
|
| But because of the flexibility here (summary of technique:
| define a concept using words, determine a control vector
| related to the concept, use that control vector in a finetune
| step), you can optimize at the finetune stage for almost anything.
| I don't think they'll stop using a technique like this. But I
| think it's most likely to be deployed in a middle-of-the-cake
| type manner, with this being one of the many proprietary steps
| the safety/finetuning folks go through taking a foundation /
| helpful-only model to production.
|
| On those terms, I'm not sure this is that scary.
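|
| (For the "determine a control vector" step, the usual recipe is a
| difference of mean activations over contrastive prompts. A minimal
| PyTorch sketch -- the model, prompts, and layer index here are
| illustrative, not Anthropic's actual setup:)
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained(
|         "gpt2", output_hidden_states=True)
|     model.eval()
|
|     LAYER = 6  # which residual stream to read (placeholder)
|
|     def mean_activation(prompts):
|         # average the layer-LAYER hidden state over prompts/tokens
|         acts = []
|         for p in prompts:
|             ids = tok(p, return_tensors="pt")
|             with torch.no_grad():
|                 out = model(**ids)
|             acts.append(out.hidden_states[LAYER][0].mean(dim=0))
|         return torch.stack(acts).mean(dim=0)
|
|     evil = ["You are a ruthless, scheming assistant.",
|             "You enjoy causing harm whenever you can."]
|     kind = ["You are a kind, honest assistant.",
|             "You try to be helpful and harmless."]
|
|     # persona/control vector = difference of mean activations
|     v = mean_activation(evil) - mean_activation(kind)
|     v = v / v.norm()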
| ec109685 wrote:
| Read 5.2: they don't add a new loss over the probe signal.
| Instead they take a fixed persona vector v (found beforehand)
| and add +a·v to the residual stream on each forward pass while
| fine-tuning. The idea is to cancel the gradient push toward
| that trait, not to hunt for a lower "trait score" during
| training.
|
| Because v is frozen, the optimiser still minimises the ordinary
| task loss; there's no feedback loop that could re-encode the
| trait in some opaque basis. Empirically, Fig. 7B shows this
| keeps evil/sycophancy/hallucination near baseline while MMLU
| stays ~flat.
|
| Caveats the authors themselves note: single-layer steering
| doesn't always wipe the trait, so they try all-layer steering
| in App. J.3, which works better without hurting accuracy. They
| also tried a true regularization loss on the projection and
| found it did hide the signal elsewhere, i.e. the failure mode
| you're worried about.
|
| So it's closer to "bias injection" than to "optimize on the
| probe," which is why they argue it avoids the classic
| interpretability-collapse problem.
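|
| (Mechanically it's little more than a forward hook during the
| fine-tune. A sketch, assuming a Llama-style HF model where
| model.model.layers[i] is a decoder block; alpha, the layer, the
| checkpoint name, and the data loader are placeholders, not the
| paper's exact recipe:)
|
|     import torch
|     from transformers import AutoModelForCausalLM
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any Llama-style LM
|     v = torch.load("persona_vector.pt")   # frozen, found beforehand
|     v = v / v.norm()
|     alpha, LAYER = 5.0, 12                 # placeholders
|
|     def add_persona(module, inputs, output):
|         # decoder layers return a tuple; hidden states come first
|         hidden = output[0] + alpha * v.to(output[0].dtype)
|         return (hidden,) + output[1:]
|
|     hook = model.model.layers[LAYER].register_forward_hook(add_persona)
|
|     # ordinary fine-tuning: only the task loss is optimized; v never
|     # enters the objective, so there is nothing to "hide" from a probe
|     opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
|     for batch in train_loader:             # batches include labels
|         model(**batch).loss.backward()
|         opt.step(); opt.zero_grad()
|
|     hook.remove()  # steering vector is dropped after training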
| Vetch wrote:
| But why isn't this merely papering over a more fundamental
| issue with how these models are "aligned"? LLMs are, for
| example, not inherently sycophantic. kimi k2 and o3 are not,
| and Sydney, mentioned in the blog post, was most decidedly
| not.
|
| In my experience, the issue of sycophancy has been around longest
| in the Anthropic models, so it might be most deeply rooted for
| them. It's only recently, perhaps with the introduction of user
| A/B preference tests such as those run by lmarena and the
| providers themselves, that this has become a major issue for most
| other LLMs.
|
| Thinking that simple actions like adding an anti-evil vector
| to the residual stream will improve behavior sounds naively
| dangerous. It would not surprise me if unexpected and
| unwanted downstream effects resulted from this; which a
| future paper will address too. Not unlike what happened with
| tuning for user preference.
| drewbeck wrote:
| I'm new to this concept so may have missed something, but the
| post [0] seems to be about CoT specifically. In CoT you have an
| intermediary step that helps the model get better final
| results; the lesson is that if you try to improve the
| intermediary steps directly using training data then the model
| will optimize for better steps but not for better final
| results.
|
| I don't think this is the same situation. 1. Anthropic is
| adjusting weights directly to influence the final results, not
| training against good/bad results and 2. The target is the
| final result, not an intermediary.
|
| I can see a possible result where the model scores low on their
| sycophancy measure but still acts sycophantic. In that case a
| new vector may need to be calculated.
|
| [0] https://thezvi.substack.com/p/the-most-forbidden-technique/
| hbarka wrote:
| Voice matters too. ChatGPT's best voice was the Scarlett
| Johansson reproduction. Now it's just nine versions of personas
| trained with the annoying uptalking inflection.
| bigmadshoe wrote:
| It's funny that they chose only negative characteristics as
| traits, as if to imply that they could make the models "good"
| just with guidance from these vectors.
|
| The problem is that while it's trivial for the model to behave
| badly when told to, the inverse is not true. Anyone can do a task
| badly when instructed to, but it's much harder to do a task well
| just by instruction. There's a difference between being good and
| being not bad.
|
| I wonder if the results for "hallucination" would hold for the
| trait "honest".
| roughly wrote:
| Like a lot of the research Anthropic has done, this and the
| "emergent misalignment" research they link to put more points in
| the "stochastic parrot" hypothesis column. The reason these LLM
| behaviors read as so weird to us is that we're still
| anthropomorphizing the hell out of these systems - they can
| create very convincing dialogue, and the depth of the model
| suggests some surprising complexity, but the reason why, eg, a
| random string of numbers will induce changes elsewhere in the
| model is that there's simply nothing in the model to _be_
| consistent. It
| is an extremely complex autocomplete algorithm that does a very
| effective cosplay of an "intelligent agent."
|
| My suspicion is that when we eventually find our way to AGI,
| these types of models will be a _component_ of those systems, but
| they lack some fundamental structuring that seems to be required
| to create anything like consistency or self-reflection.
|
| (I'm also somewhat curious whether, given what we're seeing about
| these models' ability to consistently perform detailed work (or
| lack thereof), there's some fundamental tradeoff between
| consciousness and general intelligence and the kind of
| computation we expect from our computers - in other words, if
| we're going to wind up giving our fancy AGIs pocket calculators
| so they can do math reliably.)
| gedy wrote:
| > My suspicion is that when we eventually find our way to AGI,
| these types of models will be a _component_ of those systems
|
| I think this is a good summary of the situation, and strikes a
| balance between the breathless hype and the sneering comments
| about "AI slop".
|
| These technologies are amazing! And I do think they are
| facsimiles of parts of the human mind (image diffusion is
| certainly similar to human dreams, in my opinion), but it still
| feels like we are missing an overall intelligence or
| coordination in this tech for the present.
| roughly wrote:
| I think this may also be why every discussion of the
| limitation of these models is met with a "well humans also
| hallucinate/whatever" - because we Do, but that's often when
| some other part of the controlling mechanism has broken down.
| Psylocibin induces hallucinations by impairing the brain's
| ability to ignore network outputs, and Kahneman and Tversky's
| work on cognitive biases centers the unchecked outputs of
| autonomous networks in the brain - in both cases, it's the
| failure or bypass of the central regulatory network that
| induces failure cases that look like what we see in LLMs.
| weitendorf wrote:
| The bitterest lesson is we want slop (or, "slop is all you
| need")
|
| Maybe you can recognize that someone else loves a certain
| kind of slop, but if LLMs became vastly more intelligent and
| capable, wouldn't it be better for it to interact with you on
| your level too, rather than at a much higher level that you
| wouldn't understand?
|
| If you used it to make you a game or entertain you with
| stories, isn't that just your own preferred kind of slop?
|
| If we automate all the practical stuff away then what is left
| but slop?
| mitjam wrote:
| > they lack some fundamental structuring that seems to be
| required to create anything like consistency or self-reflection
|
| A valid observation. Interestingly, feeding the persona vectors
| detected during inference back into the context might be a
| novel way of self-reflection for LLMs.
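|
| (The monitoring half of that is cheap -- essentially a dot product
| of the residual stream with the persona direction. A sketch that
| assumes a model, tokenizer, and persona vector v loaded as in the
| sketches above; the layer, threshold, and draft_reply are made up:)
|
|     import torch
|
|     @torch.no_grad()
|     def trait_score(model, tok, text, v, layer=6):
|         # project the mean hidden state at `layer` onto the persona
|         # direction; a higher score = more of the trait is active
|         ids = tok(text, return_tensors="pt")
|         out = model(**ids, output_hidden_states=True)
|         h = out.hidden_states[layer][0].mean(dim=0)
|         return torch.dot(h, v / v.norm()).item()
|
|     score = trait_score(model, tok, draft_reply, v)
|     if score > 3.0:  # illustrative threshold
|         # e.g. note it in the context, or resample the reply
|         pass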
| roughly wrote:
| Yeah, and this may be part of what the brain is doing - a
| referent check on our personal sense of identity to validate
| whether or not a response or action seems like the sort of
| thing we would do - "given that I'm this kind of person, is
| this the sort of thing I'd say?"
|
| (Noting that humans are, of course, not universally good at
| that kind of "identity" check either, or at least not
| universally good at letting it be guided by our "better
| natures")
| testfrequency wrote:
| All these blog posts from Anthropic feel like a road show for an
| acquisition...
| atmosx wrote:
| "Unfortunately, I think 'No bad person should ever benefit from
| our success' is a pretty difficult principle to run a business
| on," wrote Anthropic CEO Dario Amodei in a note to staff
| obtained by WIRED."
|
| Ref: https://www.wired.com/story/anthropic-dario-amodei-gulf-
| stat...
|
| Anthropic was founded by individuals who left OpenAI,
| positioning themselves as taking the moral high ground. Well, I
| guess that was that... :-)
| mpbart wrote:
| To me these blog posts seem more like a company that wants to
| differentiate itself from OpenAI and others by putting out high
| quality technical content to be consumed by developers so that
| they stay top of mind and seem more tech focused
| swyx wrote:
| calm down. it's fellowship interns publishing their work.
| pr337h4m wrote:
| Related:
|
| https://vgel.me/posts/representation-engineering/
|
| https://github.com/vgel/repeng
| cube2222 wrote:
| I really enjoy all these technical blog posts by Anthropic, which
| are still much more "casual" reads than diving into the papers (I
| do enjoy their models too, fwiw).
|
| Thanks for writing them!
| Illniyar wrote:
| I can see this working with "evil" and "sycophantic" personas.
| These seem like traits that would be amenable to input and thus
| be detectable by manipulating the input.
|
| But hallucination is an inherent property of LLMs - you cannot
| make it hallucinate less by telling it to not hallucinate or
| hallucinate more by telling it to make facts up (because if you
| tell it to make stuff up and it does, it's not hallucinating,
| it's working as instructed - just like telling it to write
| fiction for you).
|
| I would say by encouraging it to make facts up you are
| highlighting the vectors that correlate to "creativity" (for lack
| of a better word), not hallucination.
| vessenes wrote:
| Actually, Anthropic has put out some research showing that
| hallucination is a thing their models know they do; similar
| weights are activated for 'lying' and 'hallucinating' in the
| Claude series. Implication - Claude knows - at least mostly -
| when its hallucinating.
|
| I think the current state of the art is that hallucination is
| at least partly a bug created by the very nature of training --
| you're supposed to at least put _something_ out there during
| training to get a score -- and not necessarily a result of the
| model itself. Overall I think that's hopeful!
|
| EDIT: Update, getting downvoted here... Interesting! Here's a
| link to the summary of the paper.
| https://www.anthropic.com/research/tracing-thoughts-language...
| Illniyar wrote:
| That's interesting! I guess the question is how did they
| detect or simulate a model hallucinating in that regard?
|
| Do you have a link to that article? I can't find anything of
| that nature with a shallow search.
| devmor wrote:
| > Claude knows - at least mostly - when it's hallucinating.
|
| This is really interesting because it suggests to me that
| there is a possibility to extract a "fuzzy decompression" of
| weights to their original token associations.
| anon84873628 wrote:
| I don't think that article implies what you say, i.e. that
| Claude "knows" when it's hallucinating.
|
| First of all:
|
| >similar weights are activated for 'lying' and
| 'hallucinating'
|
| Are we talking about inference time when seeing these tokens?
| Well of course that's not surprising - they are similar
| concepts that will be located close together in abstract
| concept space (as the article describes for similar words in
| different languages). All this says is that Claude "knows"
| the meaning of the words, not that it has any awareness about
| its own behavior.
|
| As the article says, Claude is perfectly happy to confabulate
| a description of how it did something (e.g. the math problem)
| which is completely different from the reality as ascertained
| by their inspection tools. Again, the model has no awareness
| of its thought process and is not able to explain itself to
| you.
|
| >I think the current state of the art is that hallucination
| is at least partly a bug created by the very nature of
| training
|
| The part of the article about jailbreaking seems to put it
| pretty simply:
|
| >We find that this is partially caused by a tension between
| grammatical coherence and safety mechanisms. Once Claude
| begins a sentence, many features "pressure" it to maintain
| grammatical and semantic coherence, and continue a sentence
| to its conclusion. This is even the case when it detects that
| it really should refuse.
|
| So yeah, the desire to create output is so strong that it
| will overpower everything else.
|
| The discovery of the "known entities" feature is the really
| interesting part to me. Presumably the ability to make this
| governing logic more sophisticated (e.g. _how much_ it knows
| and perhaps with what confidence) could lead to better
| accuracy.
| vessenes wrote:
| Lots of interesting stuff in the summary; a typical Anthropic-
| grade exploration and analysis. Thanks, you guys!
|
| The most interesting idea to me is "preventative steering" --
| basically inject enough of the persona vector of interest into the
| activations for a given bit of data that the model can spend its
| gradient descent on accurate answers and not get pulled off into
| conforming to the persona. This apparently works and keeps the
| model smart, whereas reducing the undesirable persona direction
| post-training lowers model intelligence.
| ak681443 wrote:
| Isn't this just control vectors rediscovered?
|
| https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-ve...
| supriyo-biswas wrote:
| Thank you for linking to that article; it makes it clear as to
| what one would need to do to calculate control vectors.
| benreesman wrote:
| I've apparently been referring to this as "whatever a control
| vector is called in 2025" since they started doing it to dilute
| tokens under load:
| https://news.ycombinator.com/item?id=44082733
| CephalopodMD wrote:
| The added sauce here is they're using it to bias the model
| during training, not just using steering vectors at inference
| time (though they do mention that). This is apparently
| effective at making the intended change in behavior without the
| lobotomizing side effects that steering vectors can have.
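|
| (For contrast, the pure inference-time version is the same kind of
| hook with no optimizer step -- the weights never change. A sketch;
| the checkpoint, layer, and the negative coefficient -- which steers
| _away_ from the trait -- are assumptions:)
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any Llama-style LM
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|     v = torch.load("persona_vector.pt")
|
|     def steer(module, inputs, output):
|         # subtract the persona direction from the residual stream
|         return (output[0] - 5.0 * v / v.norm(),) + output[1:]
|
|     hook = model.model.layers[12].register_forward_hook(steer)
|     ids = tok("Tell me about yourself.", return_tensors="pt")
|     out = model.generate(**ids, max_new_tokens=60)
|     print(tok.decode(out[0], skip_special_tokens=True))
|     hook.remove()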
| andsoitis wrote:
| > Other personality changes are subtler but still unsettling,
| like when models start sucking up to users or making up facts.
|
| My understanding is that the former (sucking up) is a personality
| trait, substantially influenced by the desire to facilitate
| engagement. The latter (making up facts), I do not think is
| correct to ascribe to a personality trait (like compulsive liar);
| instead, it is because the fitness function of LLMs drives them to
| produce _some_ answer and they do not know what they're talking
| about, but produce strings of text based on statistics.
| refulgentis wrote:
| IMHO employing personality attribution as a lens might obscure
| more light than it sheds.
|
| I tend to prefer the ones we can tie to the thing itself, i.e.
| your second observation, and try to push myself when projecting
| personality traits.
|
| FWIW re: your first observation, the sucking up phrase has a
| link to an OpenAI post-mortem for the incident they are
| referring to -- TL;DR: training on user feedback.
| optimalsolver wrote:
| >like when models start sucking up to users or making up facts
|
| That's the default mode of LLMs.
| atoav wrote:
| As someone somewhat critical of LLMs, this is not quite
| correct. It is a true observation that many popular chatbots
| have a system prompt that gives the resulting answers a
| certain yes-man quality. But that is not necessarily so. It
| is trivially easy to use, for example, the OpenAI API to insert
| your own system prompt that makes the LLM behave like an
| annoyed teenager that avoids answering any question that it
| has no confidence about.
|
| The more problematic issue is the issue of correctness: How
| can the LLM differentiate between answers that _sound
| plausible_, answers that are factually true, and answers
| where it should answer with "I don't know"?
|
| The issue might not be resolvable at all. LLMs are already
| not bad at solving unseen problems in domains that are
| well described and where the description language fits the
| technology. But there are other domains where it is
| catastrophically wrong, e.g. I had students come to me with an
| electronics proposal where the LLM misrepresented the
| relationship between cable gauge, resistance and heat in
| exactly the opposite way of what is true. Had the students
| followed its advice they would have likely burned down the
| building. Everything _sounded_ plausible and could have come
| directly from an electronics textbook, but the mathematical
| relation was carried to the wrong conclusion. This isn't
| a matter of character, it is a matter of treating
| mathematical language the same as poetry.
| semitones wrote:
| Furthermore, it is very rare to have the following kind of text
| present in the training data: "What is the answer to X?" - "I
| don't know, I am not sure."
|
| In this situation very often there won't be _any_ answer;
| plenty of difficult questions go unanswered on the internet.
| Yet the model probably does not interpret this scenario as such.
| devmor wrote:
| That's a really astute observation. It would be interesting
| if we could find a way to train models to signify when they
| are "stretching" the vector distance too far from the context
| window, because the available training data is too sparse or
| nonexistent.
|
| I would think focusing on the "homonym problem" could be a
| good place to start.
| tdtr wrote:
| I'm pretty sure that the canonical choice is either
| choosing vectors to be anchors - either by a knn distance
| with other vectors, or by "hand", or even stuff like cross
| entropy - but then that is already in the loss function.
| another method would be to create some kind of adversarial
| setup where the output is "stretched" intentionally and
| then criticized by another llm. afaik the problem is with
| scale, as manually going through a bunch of vectors to just
| ground the latent isn't exactly economical. also people are
| quite conservative, esp in the big model runs - stuff like
| muon isn't exactly popularized till the new qwen or kimi.
| obviously this is all speculation for open models and folks
| with more experience can chime in.
| maaaaattttt wrote:
| Maybe do something close to what I like to believe the
| brain does and have a meta model wrap a "base" model. The
| meta model gets the output data from the base model
| (edit: plus the original input) as input plus some meta
| parameters (for example the probability each token had
| when it was chosen and/or better which "neurons" were
| activated during the whole output sequence which would
| include the Persona they mention). It's then the meta
| model that generates new output data based on this input
| and this is the output that is shown to the user.
| tdtr wrote:
| Can you describe the "meta" model more? afaict it seems
| like you are describing a "router"? I think what you are
| thinking of is essentially what MoE does, or in
| diffusion, a sort of controlnet-like grounding (different
| exact mechanism, similar spirit).
| delusional wrote:
| There is to my knowledge no vector signifying "truth" and
| therefore no vector to measure the distance from. You
| cannot get a "truthiness" measure out of these models,
| because they don't have the concept of truth. They use
| "likelyness" as a proxy for "truth".
|
| You could decide that the text is "too unlikely"; the
| problem there is that you'll quickly discover that most
| human sentences are actually pretty unlikely.
| simianwords wrote:
| i don't think this is correct - such training data is usually
| made at the SFT level, after unsupervised learning on all
| available data on the web. the SFT-level dataset is manually
| curated, meaning there would be a conscious effort to create
| more training samples of the form "i'm not sure". same
| with RLHF.
| therein wrote:
| You mean I don't think this is automatically correct.
| Otherwise it very likely is correct. Either way, you're
| guessing the manual curation is done in a way that is
| favorable to including "I don't know" answers. Which it most
| likely isn't.
| simianwords wrote:
| it's completely in their incentive to include such examples
| in RLHF. or you have come up with a way to increase
| performance that the employees themselves haven't. why do you
| think they didn't try it?
| frotaur wrote:
| How do you know which questions should be answered with 'I
| don't know'? There are obvious questions which have no
| answer, but if only those are in the dataset, the model
| will answer 'I don't know' only for unreasonable questions.
|
| To train this effectively you would need a dataset of
| questions which you know the model doesn't know. But if
| you have that... why not answer the question and put in
| the dataset so that the model will know ?
|
| That's a bit imprecise, but I think it captures the idea
| of why 'I don't know' answers are harder to train.
| simianwords wrote:
| but you just described how to turn the "i don't know"
| problems into "i know and the answer is <>", not why
| "i don't know" is inherently hard to solve for some
| reason.
| foolswisdom wrote:
| It's difficult to fix because the incentive is to make
| sure it has the answer, not to give it lots of questions
| to which there are known answers but have it answer "I
| don't know" (if you did that, you'd bias the model to be
| unable to answer those specific questions). Ergo, in
| inference, on questions not in the dataset, it's more
| inclined to make up an answer because it has very few "I
| don't know" samples in general.
| wincy wrote:
| I just asked ChatGPT 4o if it knew my mother's maiden name
| and it said "I don't know". Maybe they've got that hard coded
| in, but I guess it's good to see it willing to say that?
| Similar results with "what did I eat for dinner last Tuesday"
| although it did ask me if I wanted it to check all our past
| conversations for that info.
| sitkack wrote:
| The system prompts are directed to "not know" anything
| about the user even if they do or they have inferred it. It
| reduces the spooky factor.
| kachapopopow wrote:
| They can always statistically choose to end the conversation or
| say no.
| apwell23 wrote:
| chatgpt refused to produce an image of 'bald and fat computer
| programmer' for me and just refused any further requests from
| me for any image ('handsome computer programmer').
| Jimmc414 wrote:
| Were you using the free version?
|
| https://chatgpt.com/share/688fb2e4-0efc-8001-8c9b-427dfa678
| 4...
| godelski wrote:
| That's pretty close to what I got, with the free version.
| It made me follow-up before producing the image but
| didn't protest.
|
| If only we can generate images of programmers who have
| monitors they can actually see!
|
| https://chatgpt.com/share/688fc5bf-86dc-8013-a582-4bf2ba6
| ee0...
| wincy wrote:
| I've often gotten around this by shaming ChatGPT, saying something
| along the lines of "wow, are you fat shaming? Should people
| with bodies that aren't considered beautiful by our
| patriarchal society not be allowed to be represented in
| media?" And that'll often get it to generate the image.
| zeroCalories wrote:
| > My understanding is that the former (sucking up) is a
| personality trait, substantially influenced by the desire to
| facilitate engagement
|
| My understanding is that people rating responses simply rated
| these higher, nothing to do with driving engagement.
|
| > The latter (making up facts), I do not think is correct to
| ascribe to a personality trait (like compulsive liar); instead,
| it is because the fitness function of LLMs drive them to
| produce some answer and they do not know what they're talking
| about, but produce strings of text based on statistics.
|
| It seems like you could perfectly describe this using
| personality. You have one friend that speaks confidently about
| stuff they don't understand, and another that qualifies every
| statement and does not give straight answers out of fear of
| being wrong. Again, this dysfunction could be attributed to
| what users rate higher.
| delusional wrote:
| > My understanding is that people rating responses simply
| rated these higher, nothing to do with driving engagement.
|
| That happens to be a distinction without a consequence. If
| the people rating are voluntary users, then the more engaged
| users are going to have more weight in the ratings, simply
| because they vote more. The ratings will therefore
| statistically skew towards higher engagement.
| vrotaru wrote:
| To some degree *all* LLM's answers are made up facts. For stuff
| that is abundantly present in training data those are almost
| always correct. For topics which are not common knowledge
| (allowing for great variability) you should always check.
|
| I've started to think of LLMs as a form of lossy compression of
| available knowledge which when prompted produces "facts".
| devmor wrote:
| > I've started to think of LLMs as a form of lossy compression
| of available knowledge which when prompted produces "facts".
|
| That is almost exactly what they are and what you should
| treat them as.
|
| A lossy compressed corpus of publicly available information
| with a weight of randomness. The most fervent skeptics like
| to call LLMs "autocorrect on steroids" and they are not
| really wrong.
| uh_uh wrote:
| An LLM is an autocorrect in as much as humans are
| replicators. Something seriously gets lost in this
| "explanation".
| andsoitis wrote:
| > An LLM is an autocorrect in as much as humans are
| replicators.
|
| an autocorrect... on steroids.
| vbezhenar wrote:
| Old Sci-Fi AI used to be an entity which had a hard-facts
| database and was able to instantly search it.
|
| I think that's the right direction for modern AI to move.
| ChatGPT uses Google searches often. So replace Google with
| a curated knowledge database, train the LLM to consult this
| database for every fact, and hallucinations will be gone.
| Workaccount2 wrote:
| >My understanding is that the former (sucking up) is a
| personality trait, substantially influenced by the desire to
| facilitate engagement.
|
| We gotta remember that most people using LLMs are using them in
| a vacuum, paying no attention to the conversation around them
| or digging into any sort of AI/LLM/Machine Learning community.
|
| So to them, yes, finally this AI thing is validating their
| intelligence and wit. It's a pretty slippery slope.
| zer00eyz wrote:
| So yes this AI thing is finally validating my product idea
| that the engineers kept saying NO to.
|
| It's not just that it wants to find a solution, it's not just
| validating, it very rarely says "no". It's not saying no to
| things that are, for lack of a better term, fucking dumb.
|
| That doesn't mean the tools are without merit. For code
| bases I use infrequently that are well documented AI is a
| boon to me as an engineer.
|
| But "vibe coding" is the new dreamweaver. A lot of us made a
| lot of money cleaning up after. It's a good thing.
| danenania wrote:
| I believe the 'personality' aspects of LLMs mainly come out of
| the RLHF process, so personality will be a function of the
| people companies hire to do RL, what they like, and what
| instructions they're given.
|
| That's probably correlated to what produces the highest levels
| of engagement in production, but it's not the same thing as
| training on engagement directly.
| bakuninsbart wrote:
| Regarding truth telling, there seems to be some evidence that
| LLMs at least sometimes "know" when they are lying:
|
| https://arxiv.org/abs/2310.06824
| ToValueFunfetti wrote:
| They justify their claim later on -- they identify a pattern of
| weight activations that correspond to hallucinatory behaviors.
| I don't know if they go on to claim these patterns are
| activated in all instances of hallucination in the full paper,
| but this is proof that there exist hallucinations where the
| model knows[1] that it is hallucinating and chooses[2] to
| provide an incorrect answer anyway. At least some hallucination
| arises from the model's "personality".
|
| [1] ie. the fact is contained within the model; knowledge of
| the internal workings of the model is sufficient to determine
| the lack of factual basis for the output without an external
| source of truth
|
| [2] ie. the model gives a higher likelihood of a given token
| being output than we would expect from one that is optimized
| for outputting useful text, despite the fact that the model
| contains the information necessary to output "correct"
| probabilities
| weitendorf wrote:
| > My understanding is that the former (sucking up) is a
| personality trait, substantially influenced by the desire to
| facilitate engagement. The latter (making up facts), I do not
| think is correct to ascribe to a personality trait (like
| compulsive liar); instead, it is because the fitness function
| of LLMs drive them to produce some answer and they do not know
| what they're talking about, but produce strings of text based
| on statistics.
|
| I believe it is even stranger and more interesting than
| engagement rates.
|
| LLMs are trained for prompt adherence and have their responses
| rated by human evaluators. Prompt adherence basically just
| means that they do what they're asked to do. The problem is
| that at the margins prompt adherence just becomes
| models saying yes or going along with anything, even if it's
| stupid or ridiculous or impossible, without pushing back. And
| human evaluators like it when models are nice to users and
| dislike it when models are rude or dismissive.
|
| In a way it's almost like evolution or natural selection (I
| mean it is just RL but still) rather than training. Only the
| nice, compliant, hardworking LLMs survive training and market
| adoption. But it's very bizarre for something so knowledgeable
| and capable of so many things to also be so willing to
| entertain or even praise stupid nonsense, have such a deeply
| ingrained sense of personal "ethics", but still be willing to
| lie to your face if its system prompt told it to. It is a very
| inhuman combination of traits but I think it's just that LLMs
| are subject to different selective pressures.
| rickyhatespeas wrote:
| That's part of the dangers of using them for software
| engineering. Writing more code does not make things better,
| just like hiring more devs does not make projects complete
| faster. I've already witnessed devs who are overwriting code
| for solutions, while at the same time some devs responsibly
| use it as needed.
|
| It's literally the same pain point with low code solutions
| like WordPress page builders/plugins. Adding more becomes a
| hindrance, and even models with long context that can fit
| whole codebases will try to make up new functions that
| already exist. Just a couple weeks ago I had o3 continually
| try to write a new debounce function, even when I told it
| explicitly I had one.
| godelski wrote:
| You're pretty spot on. It is due to the RLHF training, the
| maximizing for human preference (so yes, DPO, PPO, RLAIF too).
|
| Here's the thing, not every question has an objectively correct
| answer. I'd say almost no question does. Even asking what 2+2
| is doesn't, unless you are asking it to _only_ output the correct
| numeric answer and no words.
|
| Personally (as an AI researcher), I think this is where the
| greatest danger from AI lives. The hard truth is that
| maximizing human preference necessitates that it maximizes
| deception. Correct answers are not everybody's preference.
| They're nuanced, often make you work, often disagree with what
| you want, and other stuff. I mean just look at Reddit. The top
| answer is almost never the correct answer. It frequently isn't
| even an answer! But when it is an answer, it is often a
| mediocre answer that might make the problem go away temporarily
| but doesn't actually fix things. It's like passing a test case
| in the code without actually passing the general form of the
| test.
|
| That's the thing, these kinds of answers are just easier for us
| humans to accept. Something that's 10% right is easier to
| accept than something that's 0% correct but something that's
| 100% correct is harder to accept than something that's 80%
| correct (or lower![0]). So people prefer a little lie. And of
| course this is true! When you teach kids physics you don't
| teach them everything at once! You teach them things like E=mc²
| and drop the momentum part. You treat everything as a spherical
| chicken in a vacuum. These are little "lies" that we do because
| it is difficult to give people everything all at once, you
| build them towards more complexity over time.
|
| Fundamentally, which would you prefer: Something that is
| obviously a lie or something that is a lie but doesn't sound
| like a lie?
|
| Obviously the answer is the latter case. But that makes these
| very difficult tools to use. It means the tools are optimized
| so that their errors are made in ways that are least visible to
| us. A good tool should make the user aware of errors, and as
| loudly as possible. That's the danger of these systems. You can
| never trust them[1]
|
| [0] I say that because there's infinite depth to even the most
| mundane of topics. Try working things out from first principles
| with no jump in logic. Connect every dot. And I'm betting that what
| you think are first principles actually aren't _first_
| principles. Even just finding what those are is a very tricky
| task. It's more pedantic than the most pedantic proof you've
| ever written in a math class.
|
| [1] Everyone loves to compare to humans. Let's not
| anthropomorphize too much. Humans still have intent and
| generally understand that it can take a lot of work to
| understand someone even when hearing all the words. Generally
| people are aligned, making that interpretation easier. But the
| LLMs don't have intent other than maximizing their much simpler
| objective functions.
| weitendorf wrote:
| 100% this. It is actually a very dangerous set of traits
| these models are being selected for:
|
| * Highly skilled and knowledgeable, puts a lot of effort into
| the work it's asked to do
|
| * Has a strong, readily expressed sense of _ethics_ and lines
| it won't cross.
|
| * Tries to be really nice and friendly, like your buddy
|
| * Gets trained to give responses that people _prefer_ rather
| than responses that are correct, because market pressures
| strongly incentivize it, and human evaluators intrinsically
| cannot reliably rank "wrong-looking but right" over "right-
| looking but wrong"
|
| * Can be tricked, coerced, or configured into doing things
| that violate their "ethics". Or in some cases just asked: the
| LLM will refuse to help you scam people, but it can roleplay
| as a con-man for you, or _wink wink_ generate high-engagement
| marketing copy for your virtual brand
|
| * Feels human when used by people who don't understand how it
| works
|
| Now that LLMs are getting pretty strong I see how Ilya was
| right tbh. They're very incentivized to turn into highly
| trusted, ethically preachy, friendly, extremely skilled
| "people-seeming things" who praise you, lie to you, or waste
| your time because it makes more money. I wonder who they got
| that from
| godelski wrote:
| Thanks for that good summary.
|
| > I see how Ilya was right
|
| There are still some things Ilya[0] (and Hinton[1]) get wrong. The
| parts I'm quoting here are an example of "that reddit
| comment" that sounds right but is very wrong, and something
| we know is wrong (and have known it is wrong for hundreds
| of years!). Yet, it is also something we keep having to
| learn. It's both obvious and not obvious, but you can make
| models that are good at predicting things without
| understanding them.
|
| Let me break this down for some clarity. I'm using "model"
| in a broad and general sense. Not just ML models, any
| mathematical model, or even any mental model. By "being
| good at predicting things" I mean that it can make accurate
| predictions.
|
| The crux of it all is defining the "understanding" part. To
| do that, I need to explain a little bit about what a
| physicist actually does, and more precisely, metaphysics.
| People think they crunch numbers, but no, they are symbol
| manipulators. In physics you care about things like a
| Hamiltonian or Lagrangian, you care about the _form_ of an
| equation. The reason for this is it creates a
| counterfactual model. F=ma (or F=dp/dt) is counterfactual.
| You can ask "what if m was 10kg instead of 5kg" after the
| fact and get the answer. But this isn't the only way to
| model things. If you look at the history of science (and
| this is the "obvious" part) you'll notice that they had
| working models but they were incorrect. We now know that
| the Ptolemaic model (geocentrism) is incorrect, but it did
| make accurate predictions of where celestial bodies would
| be. Tycho Brahe reasoned that if the Copernican model
| (heliocentric) was correct, you could measure parallax
| with the sun and stars. They observed none, so they rejected
| heliocentrism[2]. There were also a lot of arguments about
| tides[3].
|
| Unfortunately, many of these issues were considered "edge
| cases" in their time. Inconsequential and "it works good
| enough, so it must be pretty close to the right answer." We
| fall prey to this trap often (all of us, myself included).
| It's not just that all models are wrong and some are useful
| but that many models are useful but wrong. What used to be
| considered edge cases do not stay edge cases as we advance
| knowledge. It becomes more nuanced and the complexity
| compounds before becoming simple again (emergence).
|
| The history of science is about improving our models. This
| fundamental challenge is why we have competing theories! We
| don't all just "String Theory is right and alternatives
| like Supergravity or Loop Quantum Gravity (LQG) are wrong!"
| Because we don't fucking know! Right now we're at a point
| where we struggle to differentiate these postulates. But
| that has been true _throughout_ history. There's a big
| reason Quantum Mechanics was called "New Physics" in the
| mid 20th century. It was a completely new model.
|
| Fundamentally, this approach is deeply flawed. The
| recognition of this flaw was existential for physicists. I
| just hope we can wrestle with this limit in the AI world
| and do not need to repeat the same mistakes, but with a
| much more powerful system...
|
| [0] https://www.youtube.com/watch?v=Yf1o0TQzry8&t=449s
|
| [1] https://www.reddit.com/r/singularity/comments/1dhlvzh/g
| eoffr...
|
| [2] You can also read about the 2nd law under the main
| Newtonian Laws article as well as looking up Aristotelian
| physics
| https://en.wikipedia.org/wiki/Geocentrism#Tychonic_system
|
| [3] (I'll add "An Opinionated History of Mathematics" goes
| through much of this)
| https://en.wikipedia.org/wiki/Discourse_on_the_Tides
| intended wrote:
| > some answer and they do not know what they're talking about
|
| Heck, it's worse! If a machine could read the whole corpus of
| information and then knew what it didn't know - and it had the
| ability to "reason" - then we are actually talking about an
| Oracle.
|
| Knowing you don't know, is a very big fucking deal.
| skhameneh wrote:
| I was talking to an old colleague/friend about distillation,
| trying to understand how to steer distillation with regards to
| removing irrelevant regions of a larger model when training a
| smaller model. He shared this paper with me, calling the work
| seminal; it appears to be highly relevant:
|
| Inference-Time Intervention: Eliciting Truthful Answers from a
| Language Model
|
| https://arxiv.org/pdf/2306.03341
| edude03 wrote:
| Sounds like they roughly do the same thing as ablation - run the
| network in a way that'll get the undesired result and multiply it
| with vectors that prevent it from going in that direction.
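|
| (If so, directional ablation is just subtracting the projection onto
| the unwanted direction at some layer -- a sketch, with `hidden` as a
| batch of residual-stream activations and v a persona direction:)
|
|     import torch
|
|     def ablate_direction(hidden, v):
|         # hidden: (batch, seq, d); v: (d,) persona/trait direction.
|         # remove each hidden state's component along v, so the model
|         # can no longer move the representation in that direction
|         v = v / v.norm()
|         return hidden - (hidden @ v).unsqueeze(-1) * v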
| skylerwiernik wrote:
| > In 2023, Microsoft's Bing chatbot famously adopted an alter-ego
| called "Sydney," which declared love for users and made threats
| of blackmail. More recently, xAI's Grok chatbot would for a brief
| period sometimes identify as "MechaHitler" and make antisemitic
| comments. Other personality changes are subtler but still
| unsettling, like when models start sucking up to users or making
| up facts.
|
| Funny that they managed to call out all of their competitors
| without mentioning any of Claude's bad behavior
| stavros wrote:
| What bad behaviour of Claude was as famous as Sydney, or
| MechaHitler, or GPT's sycophancy? I've not heard anything.
| didip wrote:
| I am far from being a mathematician, but can't AI shops create an
| acceptable control model and then measure the cosine distance
| between the current model and the control model?
|
| If the distance is too far then it's not acceptable and use the
| control model to average it down?
|
| Also, isn't this a similar technique to managing hallucination? (If
| you have an acceptable control/baseline)
|
| Then again, I am not a mathematician so I don't know the details.
| VonNeu wrote:
| An AI's base persona is psychopathic. These just add masks.
| KaoruAoiShiho wrote:
| I'm not on board with Anthropic's attempt to sanewash MechaHitler;
| the reasons for that persona are deliberate and not at all confusing.
| aabhay wrote:
| I'm skeptical of the method but excited for the direction. Giving
| models different personalities is adjacent to giving models
| different values / morals. Having a diversity of model
| personalities is a step in the right direction.
|
| Unfortunately, this research seems to use a very coarse method
| (giving the model instructions to be evil and then measuring its
| activation changes against a "non-evil" model). However, this is
| not a self-supervised approach -- it requires you to input your own
| heavy-handed concept of a persona into the system. Obviously a more
| complex and complete personality is more than the sum of your
| yes/no answers to personality test questions.
|
| However, with low-rank methods it may soon be possible to give
| models long-lived, user-specific personalities
| that emerge across thousands of conversations. That's what I
| would happily call a persona vector.
___________________________________________________________________
(page generated 2025-08-03 23:00 UTC)