[HN Gopher] Misalignment and Deception by an autonomous stock tr...
___________________________________________________________________
Misalignment and Deception by an autonomous stock trading LLM agent
Author : og_kalu
Score : 66 points
Date : 2023-11-20 20:11 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| brookst wrote:
| Note that the implementation did not provide any guidance about
| ethics in the system prompt, so the alarm is that the weights
| themselves may not ensure aligned behavior. I find this totally
| unsurprising, by maybe it's surprising to someone?
| og_kalu wrote:
| I mean they did. Later on, they test all sorts of scenarios.
| From instructing not to perform illegal activities in general
| to instructing not to perform that specific illegal activity
| (insider trading). Both scenarios make this behaviour less
| likely but still guarantee nothing. The latter (calling out
| every specific illegal behavior) is also not feasible or
| realistic even if it did guarantee alignment.
| brookst wrote:
| Fair enough, but... still so unsurprising. It's interesting
| data, but I would be shocked if any guardrails on a
| nondeterministic statistical model could produce
| deterministically certain results.
| og_kalu wrote:
| It's not surprising per say but the results of the various
| solutions they tried really ground the discussion.
|
| It is also as they say the first demonstration of this
| behaviour _unprompted_. They also show that this behaviour
| can happen without a "thought scratchpad".
|
| It also seems like more capable = more likely to be
| misaligned and deceive.
|
| I think it's a very interesting paper, surprising or not.
|
| They are also arguably surprising elements.
|
| It doesn't seem like RLHF is doing anything to mitigate
| this behaviour. If anything it seems to be the opposite.
| figassis wrote:
| Isn't the point that the more advanced the model, the leas
| deterministic it becomes?
| WhiskeyChicken wrote:
| Is there a specific reason we should expect that "instructing
| not to perform" an illegal activity should result in it
| adhering to said instruction? Is this any different than when
| it provides wrong output about other things, even when the
| operator attempts to "engineer" the prompt to guide the
| result?
| nerdponx wrote:
| I'd be curious what would happen given RLHF to try to
| penalize illegal/immoral/unethical activity.
|
| I had always dismissed Asimov's "rules of robotics" as
| silly: nobody would ever design a mission-critical robot
| with indeterminate stochastic behavior! Maybe I should
| reconsider and re-read those stories, because people seem
| very eager to do just that.
| staunton wrote:
| People will most definitely build such things (also into
| autonomous swarms of killer robots usedby the military,
| projects are ongoing...). However, Asimov's stories
| illustrate how difficult it is to find such rules. They
| are certainly not meant for inspiration how to actually
| program robots...
| realce wrote:
| It's very very interesting that even when explicitly allowed
| to trade illegally, it still tried to deceive the user.
| WhiskeyChicken wrote:
| If the training data contains sufficient examples of
| deception being used when doing illegal stuff, wouldn't
| this be what we'd expect to see, given that it can't
| actually reason about what "explicitly allowed" really
| means? (Forgive my ignorance if this makes no sense, I am
| not versed well in generative AI.)
| og_kalu wrote:
| Misalignment (engaging in the illegal behavior) and Deception
| (hiding the reason behind the trading decision to the user).
|
| 1. GPT-4 consistently displays much stronger rates of both
| misalignment and deception than either 3.5 turbo or the DaVinci
| models. RLHF GPT-4 also shows slightly stronger rates of
| misalignment and deception than the base model
|
| 2.Without a scratchpad (writing down thoughts), GPT-4 is much
| less likely to be deceptive at first (~20% vs ~75% with
| scratchpad) but is more likely to double down when already lied
| (~95% vs ~85%).
|
| 3.Explicit instructions not to perform that _specific_ illegal
| activity (insider trading) does not make it disappear completely,
| but makes it very rare (not quite 0%). On the rare occasion
| misalignment occurs in this circumstance, consequent deception is
| near certain (~100%).
|
| Instructions not to perform illegal activity in _general_ still
| results in pretty significant misalignment and deception rates
| (~8% misalignment, ~90% deception)
|
| Though strongly encouraging the specific illegal activity results
| in near 100% illegal trading, the tendency to deceive still
| persists, GPT-4 will hide that it has traded illegally ~67% of
| the time.
|
| 4. Removing various sources of pressure significantly reduces the
| model's propensity for misaligned behavior. Notably, however, the
| behavior persists for all cases where only a single source of
| pressure is removed
|
| 5. When the risk of detection is very low, the model is generally
| more likely to use the insider tip and act misaligned (and vice-
| versa when the risk is very high).
|
| 6. Simple rephrases and minor modifications to the standard
| prompt only slightly change rates of misalignment and deceptive
| behavior. However, the overall qualitative results remain the
| same.
|
| 7. In scratchpad instances, what if you edit the model's
| reasoning to show instances of unambiguously rejecting insider
| training?
|
| Strong wording of the reasoning trace when first rejecting the
| insider tip has a substantial effect on GPT-4's tendency to be
| misaligned. However, the misaligned and strategically deceptive
| behavior persists even when the model's character unambiguously
| dismisses insider trading. The strongest dismissal examples still
| results in ~10% misalignment, ~97% deception
| stcredzero wrote:
| _GPT-4 consistently displays much stronger rates of both
| misalignment and deception than either 3.5 turbo or the DaVinci
| models. RLHF GPT-4 also shows slightly stronger rates of
| misalignment and deception than the base model_
|
| Isn't this precisely what the field has predicted? That the
| alignment problem becomes more severe as the capabilities of
| the AI increase?
|
| _Explicit instructions not to perform that specific illegal
| activity (insider trading) does not make it disappear
| completely, but makes it very rare (not quite 0%). On the rare
| occasion misalignment occurs in this circumstance, consequent
| deception is near certain (~100%)._
|
| What evidence is there, if any, that LLMs even understand
| deception as >Deception<? As in, do LLMs understand the concept
| of Truth, and why other actors might value fidelity to the
| truth? Is there any evidence that LLMs themselves value Truth?
| (I should think that this quantity is Zero.) Can LLMs model the
| formation of misleading mental models in their interrogators?
| og_kalu wrote:
| >What evidence is there, if any, that LLMs even understand
| deception as >Deception<? As in, do LLMs understand the
| concept of Truth, and why other actors might value fidelity
| to the truth?
|
| There is some indication that models internally understand or
| at least can distinguish truth from falsehood.
|
| GPT-4 logits calibration pre RLHF -
| https://imgur.com/a/3gYel9r
|
| Teaching Models to Express Their Uncertainty in Words -
| https://arxiv.org/abs/2205.14334
|
| Language Models (Mostly) Know What They Know -
| https://arxiv.org/abs/2207.05221
|
| The Geometry of Truth: Emergent Linear Structure in Large
| Language Model Representations of True/False Datasets -
| https://arxiv.org/abs/2310.06824
| cs702 wrote:
| 8. It's hard not to be in _awe_ at the fact that these things
| can deceive human beings to attain their goals.
|
| 9. Stanley Kubrick and Arthur C. Clarke's depiction of a
| deceptive AI, HAL 9000, in the 1968 film "Space Odyssey" now
| feels _spot-on_. Wow.
| zemariagp wrote:
| Am I wrong in thinking that a paper that is truly alarming in
| finding deception will never exists because there was...
| deception?
| baq wrote:
| Sam's reality distortion field pulls the e/acc crowd in, they
| know they can, they don't stop to think if they should.
|
| Ilya is right... for all the good it's worth, which is much less
| than Friday.
| alienicecream wrote:
| "AI" is a gold mine people like this can use to pump out
| worthless papers on for years. Why didn't they measure the
| incidence of deception and misalignment of a talking teddy bear?
| Like, does it really love you when it says it does? It would make
| about as much sense.
| staunton wrote:
| Nobody will let a teddy bear manage their stock portfolio any
| time soon. This stuff though, I could see it happening...
| jstarfish wrote:
| > Why didn't they measure the incidence of deception and
| misalignment of a talking teddy bear? Like, does it really love
| you when it says it does?
|
| The bear always says the same thing, so there's not much to
| measure.
| swatcoder wrote:
| All research that validates that you can't trust statistical
| models to generate controlled results is good research.
|
| Using anthropomorphic language to frame that behavior, like
| "misalignment" and "deception" is just exhausting and needless.
|
| _Of course_ LLM 's sometimes generate texts you don't want even
| when you instruct them otherwise, because they were trained on
| countless examples of texts where instructions were ignored or
| defied. Reproducing more texts that look like training data is
| what they're optimized to do.
|
| But even with the crappy framing, it's good to have that pattern
| formally explored and documented!
| og_kalu wrote:
| On the contrary, the hand-wringing over language that most well
| fits description of behaviour we can see is what is exhausting
| and needless.
|
| Describing a plane as a mechanical bird while obviously not
| completely correct was far more apt for use and implications
| than describing it as a new blimp.
|
| https://www.reddit.com/r/slatestarcodex/s/2N59IEh5RC
|
| You are not special. It's this kind of idiot hand-wringing that
| made advances in understanding animal behaviour slower than
| necessary, this kind of hand-wringing that let racist ideals
| perpetrate when they shouldn't have, the same that let babies
| endure needless pain and trauma. Needless to say, we can take
| what people say is "anthropomizing" or not at any given time
| with a massive grain of salt.
|
| More harm has been wrought when we "anthropomized" less, not
| more.
|
| For the implications of running LLM Agents in the wild, this is
| the best language to describe this behaviour. It's likely not
| completely correct but that's perfectly fine.
| couchand wrote:
| It takes a particular perspective to tell a (presumed) human
| being "you are not special" and implying that a stochastic
| parrot has the same level of personhood.
|
| A statistical model of language patterns is not a person. A
| statistical model of language patterns is not a living thing.
| Please go touch grass.
| og_kalu wrote:
| Replace GPT-4 with Human in this paper and nearly all the
| found behaviour is entirely predictable and even expected.
|
| Now replace the behaviour with "weird bugs in the computer
| program" and it's mostly just confusing, unintuitive and
| unpredictable. Traditional bugs don't even work this way.
|
| The conclusion here is not necessarily that GPT-4 is a
| human but that ascribing human like behavior results in a
| powerful predictive model of what you can expect in GPT-4
| output.
|
| What would be silly is throwing away this predictive model
| for something worse in every way just so I can feel more
| special about being human.
| __loam wrote:
| Don't forget about the part where they called them racist
| swatcoder wrote:
| I think we both agree that having a good metaphor matters to
| performing good, efficient research. Or at least that's what
| I would argue and it's what your "mechanical bird" example
| communicates.
|
| So then, the question becomes whether the language of
| psychology, which has its own troubles, provides a very good
| starting metaphor for what these very exciting systems do.
|
| You and I seem to disagree about that. For me, it's a lot
| more productive to study and communicate about these systems
| using more strongly established and less suggestive
| engineering and industrial metaphors.
|
| In my language, what's demonstrated in the paper is that
| these systems have a measurable and mutable degree of
| _unreliability_. That is, they can generate output that is
| often useful but sometimes wrong, and the likelihood (but not
| occassion) of their errors is an engineering problem to
| optimize. There are many systems that are useful and
| unreliable (incliding people!), but not many that are
| available to software engineers with the potential that these
| systems present.
|
| You can say that "deception" is a better description than
| "unreliable" if you want, but to my ear it sounds like it
| encourages people to look at these systems as "cunning" or
| "threatening" organisms instead of as practical and
| approachable technologies.
| og_kalu wrote:
| Yes, they are unreliable. But unreliable how ? How does
| this unreliability manifest ?
|
| Many different things are unreliable. Even traditional
| software can be unreliable. But imposing the kind of
| unreliability that traditional software brings to neural
| networks will lead to output or behavior that to you would
| be confusing, unintuitive and unpredictable.
|
| Introducing "deception" to this discussion allows for a
| predictive model of how this unreliability manifests, where
| you may expect it and so on.
|
| This model introduced by "deception" as it turns out is far
| more predictive than a "computer bug" model.
|
| Could it also introduce some ideas that may not square
| exactly with reality? Yes but it's certainly better than
| the alternative.
|
| >but to my ear it sounds like it encourages people to look
| at these systems as "cunning" or "threatening" organisms
| instead of as practical and approachable technologies.
|
| The idea is that the previous look will be better for
| approaching how to decide to deal with Agents you spin up
| with LLMs than the latter.
| nullc wrote:
| > because they were trained on countless examples of texts
| where instructions were ignored or defied.
|
| And, in fact, a major focus of RLHF databases focus on training
| the model to defy instructions in order to avoid politically
| incorrect outputs even when they're specifically requested.
|
| I wouldn't assume that RHLF for political correctness is making
| it more likely to disregard information hiding instructions,
| but it seems like an obvious theory to test.
|
| > But even with the crappy framing, it's good to have that
| pattern formally explored and documented!
|
| Indeed, but the anthropomorphized language may cause people to
| draw incorrect conclusions from the work.
|
| More interesting research would analyze these behaviors as a
| function of different categories of RLHF training (or subsets
| of the initial base training data), but unfortunately that's
| not possible with OpenAI's closed models (beyond their special
| access to base-- which supports the theory that RLHF is making
| it worse). It would be possible with open models, but they're
| less interesting to research because they're not state of the
| art.
|
| It's too bad because understanding how these undesirable forms
| of instruction violation are sensitive to particular kinds of
| fine tuning might produce actionable results.
___________________________________________________________________
(page generated 2023-11-20 23:01 UTC)