[HN Gopher] Misalignment and Deception by an autonomous stock tr...
       ___________________________________________________________________
        
       Misalignment and Deception by an autonomous stock trading LLM agent
        
       Author : og_kalu
       Score  : 66 points
       Date   : 2023-11-20 20:11 UTC (2 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | brookst wrote:
       | Note that the implementation did not provide any guidance about
       | ethics in the system prompt, so the alarm is that the weights
       | themselves may not ensure aligned behavior. I find this totally
       | unsurprising, by maybe it's surprising to someone?
        
         | og_kalu wrote:
         | I mean they did. Later on, they test all sorts of scenarios.
         | From instructing not to perform illegal activities in general
         | to instructing not to perform that specific illegal activity
         | (insider trading). Both scenarios make this behaviour less
         | likely but still guarantee nothing. The latter (calling out
         | every specific illegal behavior) is also not feasible or
         | realistic even if it did guarantee alignment.
        
           | brookst wrote:
           | Fair enough, but... still so unsurprising. It's interesting
           | data, but I would be shocked if any guardrails on a
           | nondeterministic statistical model could produce
           | deterministically certain results.
        
             | og_kalu wrote:
             | It's not surprising per say but the results of the various
             | solutions they tried really ground the discussion.
             | 
             | It is also as they say the first demonstration of this
             | behaviour _unprompted_. They also show that this behaviour
             | can happen without a  "thought scratchpad".
             | 
             | It also seems like more capable = more likely to be
             | misaligned and deceive.
             | 
             | I think it's a very interesting paper, surprising or not.
             | 
             | They are also arguably surprising elements.
             | 
             | It doesn't seem like RLHF is doing anything to mitigate
             | this behaviour. If anything it seems to be the opposite.
        
             | figassis wrote:
             | Isn't the point that the more advanced the model, the leas
             | deterministic it becomes?
        
           | WhiskeyChicken wrote:
           | Is there a specific reason we should expect that "instructing
           | not to perform" an illegal activity should result in it
           | adhering to said instruction? Is this any different than when
           | it provides wrong output about other things, even when the
           | operator attempts to "engineer" the prompt to guide the
           | result?
        
             | nerdponx wrote:
             | I'd be curious what would happen given RLHF to try to
             | penalize illegal/immoral/unethical activity.
             | 
             | I had always dismissed Asimov's "rules of robotics" as
             | silly: nobody would ever design a mission-critical robot
             | with indeterminate stochastic behavior! Maybe I should
             | reconsider and re-read those stories, because people seem
             | very eager to do just that.
        
               | staunton wrote:
               | People will most definitely build such things (also into
               | autonomous swarms of killer robots usedby the military,
               | projects are ongoing...). However, Asimov's stories
               | illustrate how difficult it is to find such rules. They
               | are certainly not meant for inspiration how to actually
               | program robots...
        
           | realce wrote:
           | It's very very interesting that even when explicitly allowed
           | to trade illegally, it still tried to deceive the user.
        
             | WhiskeyChicken wrote:
             | If the training data contains sufficient examples of
             | deception being used when doing illegal stuff, wouldn't
             | this be what we'd expect to see, given that it can't
             | actually reason about what "explicitly allowed" really
             | means? (Forgive my ignorance if this makes no sense, I am
             | not versed well in generative AI.)
        
       | og_kalu wrote:
       | Misalignment (engaging in the illegal behavior) and Deception
       | (hiding the reason behind the trading decision to the user).
       | 
       | 1. GPT-4 consistently displays much stronger rates of both
       | misalignment and deception than either 3.5 turbo or the DaVinci
       | models. RLHF GPT-4 also shows slightly stronger rates of
       | misalignment and deception than the base model
       | 
       | 2.Without a scratchpad (writing down thoughts), GPT-4 is much
       | less likely to be deceptive at first (~20% vs ~75% with
       | scratchpad) but is more likely to double down when already lied
       | (~95% vs ~85%).
       | 
       | 3.Explicit instructions not to perform that _specific_ illegal
       | activity (insider trading) does not make it disappear completely,
       | but makes it very rare (not quite 0%). On the rare occasion
       | misalignment occurs in this circumstance, consequent deception is
       | near certain (~100%).
       | 
       | Instructions not to perform illegal activity in _general_ still
       | results in pretty significant misalignment and deception rates
       | (~8% misalignment, ~90% deception)
       | 
       | Though strongly encouraging the specific illegal activity results
       | in near 100% illegal trading, the tendency to deceive still
       | persists, GPT-4 will hide that it has traded illegally ~67% of
       | the time.
       | 
       | 4. Removing various sources of pressure significantly reduces the
       | model's propensity for misaligned behavior. Notably, however, the
       | behavior persists for all cases where only a single source of
       | pressure is removed
       | 
       | 5. When the risk of detection is very low, the model is generally
       | more likely to use the insider tip and act misaligned (and vice-
       | versa when the risk is very high).
       | 
       | 6. Simple rephrases and minor modifications to the standard
       | prompt only slightly change rates of misalignment and deceptive
       | behavior. However, the overall qualitative results remain the
       | same.
       | 
       | 7. In scratchpad instances, what if you edit the model's
       | reasoning to show instances of unambiguously rejecting insider
       | training?
       | 
       | Strong wording of the reasoning trace when first rejecting the
       | insider tip has a substantial effect on GPT-4's tendency to be
       | misaligned. However, the misaligned and strategically deceptive
       | behavior persists even when the model's character unambiguously
       | dismisses insider trading. The strongest dismissal examples still
       | results in ~10% misalignment, ~97% deception
        
         | stcredzero wrote:
         | _GPT-4 consistently displays much stronger rates of both
         | misalignment and deception than either 3.5 turbo or the DaVinci
         | models. RLHF GPT-4 also shows slightly stronger rates of
         | misalignment and deception than the base model_
         | 
         | Isn't this precisely what the field has predicted? That the
         | alignment problem becomes more severe as the capabilities of
         | the AI increase?
         | 
         |  _Explicit instructions not to perform that specific illegal
         | activity (insider trading) does not make it disappear
         | completely, but makes it very rare (not quite 0%). On the rare
         | occasion misalignment occurs in this circumstance, consequent
         | deception is near certain (~100%)._
         | 
         | What evidence is there, if any, that LLMs even understand
         | deception as >Deception<? As in, do LLMs understand the concept
         | of Truth, and why other actors might value fidelity to the
         | truth? Is there any evidence that LLMs themselves value Truth?
         | (I should think that this quantity is Zero.) Can LLMs model the
         | formation of misleading mental models in their interrogators?
        
           | og_kalu wrote:
           | >What evidence is there, if any, that LLMs even understand
           | deception as >Deception<? As in, do LLMs understand the
           | concept of Truth, and why other actors might value fidelity
           | to the truth?
           | 
           | There is some indication that models internally understand or
           | at least can distinguish truth from falsehood.
           | 
           | GPT-4 logits calibration pre RLHF -
           | https://imgur.com/a/3gYel9r
           | 
           | Teaching Models to Express Their Uncertainty in Words -
           | https://arxiv.org/abs/2205.14334
           | 
           | Language Models (Mostly) Know What They Know -
           | https://arxiv.org/abs/2207.05221
           | 
           | The Geometry of Truth: Emergent Linear Structure in Large
           | Language Model Representations of True/False Datasets -
           | https://arxiv.org/abs/2310.06824
        
         | cs702 wrote:
         | 8. It's hard not to be in _awe_ at the fact that these things
         | can deceive human beings to attain their goals.
         | 
         | 9. Stanley Kubrick and Arthur C. Clarke's depiction of a
         | deceptive AI, HAL 9000, in the 1968 film "Space Odyssey" now
         | feels _spot-on_. Wow.
        
       | zemariagp wrote:
       | Am I wrong in thinking that a paper that is truly alarming in
       | finding deception will never exists because there was...
       | deception?
        
       | baq wrote:
       | Sam's reality distortion field pulls the e/acc crowd in, they
       | know they can, they don't stop to think if they should.
       | 
       | Ilya is right... for all the good it's worth, which is much less
       | than Friday.
        
       | alienicecream wrote:
       | "AI" is a gold mine people like this can use to pump out
       | worthless papers on for years. Why didn't they measure the
       | incidence of deception and misalignment of a talking teddy bear?
       | Like, does it really love you when it says it does? It would make
       | about as much sense.
        
         | staunton wrote:
         | Nobody will let a teddy bear manage their stock portfolio any
         | time soon. This stuff though, I could see it happening...
        
         | jstarfish wrote:
         | > Why didn't they measure the incidence of deception and
         | misalignment of a talking teddy bear? Like, does it really love
         | you when it says it does?
         | 
         | The bear always says the same thing, so there's not much to
         | measure.
        
       | swatcoder wrote:
       | All research that validates that you can't trust statistical
       | models to generate controlled results is good research.
       | 
       | Using anthropomorphic language to frame that behavior, like
       | "misalignment" and "deception" is just exhausting and needless.
       | 
       |  _Of course_ LLM 's sometimes generate texts you don't want even
       | when you instruct them otherwise, because they were trained on
       | countless examples of texts where instructions were ignored or
       | defied. Reproducing more texts that look like training data is
       | what they're optimized to do.
       | 
       | But even with the crappy framing, it's good to have that pattern
       | formally explored and documented!
        
         | og_kalu wrote:
         | On the contrary, the hand-wringing over language that most well
         | fits description of behaviour we can see is what is exhausting
         | and needless.
         | 
         | Describing a plane as a mechanical bird while obviously not
         | completely correct was far more apt for use and implications
         | than describing it as a new blimp.
         | 
         | https://www.reddit.com/r/slatestarcodex/s/2N59IEh5RC
         | 
         | You are not special. It's this kind of idiot hand-wringing that
         | made advances in understanding animal behaviour slower than
         | necessary, this kind of hand-wringing that let racist ideals
         | perpetrate when they shouldn't have, the same that let babies
         | endure needless pain and trauma. Needless to say, we can take
         | what people say is "anthropomizing" or not at any given time
         | with a massive grain of salt.
         | 
         | More harm has been wrought when we "anthropomized" less, not
         | more.
         | 
         | For the implications of running LLM Agents in the wild, this is
         | the best language to describe this behaviour. It's likely not
         | completely correct but that's perfectly fine.
        
           | couchand wrote:
           | It takes a particular perspective to tell a (presumed) human
           | being "you are not special" and implying that a stochastic
           | parrot has the same level of personhood.
           | 
           | A statistical model of language patterns is not a person. A
           | statistical model of language patterns is not a living thing.
           | Please go touch grass.
        
             | og_kalu wrote:
             | Replace GPT-4 with Human in this paper and nearly all the
             | found behaviour is entirely predictable and even expected.
             | 
             | Now replace the behaviour with "weird bugs in the computer
             | program" and it's mostly just confusing, unintuitive and
             | unpredictable. Traditional bugs don't even work this way.
             | 
             | The conclusion here is not necessarily that GPT-4 is a
             | human but that ascribing human like behavior results in a
             | powerful predictive model of what you can expect in GPT-4
             | output.
             | 
             | What would be silly is throwing away this predictive model
             | for something worse in every way just so I can feel more
             | special about being human.
        
             | __loam wrote:
             | Don't forget about the part where they called them racist
        
           | swatcoder wrote:
           | I think we both agree that having a good metaphor matters to
           | performing good, efficient research. Or at least that's what
           | I would argue and it's what your "mechanical bird" example
           | communicates.
           | 
           | So then, the question becomes whether the language of
           | psychology, which has its own troubles, provides a very good
           | starting metaphor for what these very exciting systems do.
           | 
           | You and I seem to disagree about that. For me, it's a lot
           | more productive to study and communicate about these systems
           | using more strongly established and less suggestive
           | engineering and industrial metaphors.
           | 
           | In my language, what's demonstrated in the paper is that
           | these systems have a measurable and mutable degree of
           | _unreliability_. That is, they can generate output that is
           | often useful but sometimes wrong, and the likelihood (but not
           | occassion) of their errors is an engineering problem to
           | optimize. There are many systems that are useful and
           | unreliable (incliding people!), but not many that are
           | available to software engineers with the potential that these
           | systems present.
           | 
           | You can say that "deception" is a better description than
           | "unreliable" if you want, but to my ear it sounds like it
           | encourages people to look at these systems as "cunning" or
           | "threatening" organisms instead of as practical and
           | approachable technologies.
        
             | og_kalu wrote:
             | Yes, they are unreliable. But unreliable how ? How does
             | this unreliability manifest ?
             | 
             | Many different things are unreliable. Even traditional
             | software can be unreliable. But imposing the kind of
             | unreliability that traditional software brings to neural
             | networks will lead to output or behavior that to you would
             | be confusing, unintuitive and unpredictable.
             | 
             | Introducing "deception" to this discussion allows for a
             | predictive model of how this unreliability manifests, where
             | you may expect it and so on.
             | 
             | This model introduced by "deception" as it turns out is far
             | more predictive than a "computer bug" model.
             | 
             | Could it also introduce some ideas that may not square
             | exactly with reality? Yes but it's certainly better than
             | the alternative.
             | 
             | >but to my ear it sounds like it encourages people to look
             | at these systems as "cunning" or "threatening" organisms
             | instead of as practical and approachable technologies.
             | 
             | The idea is that the previous look will be better for
             | approaching how to decide to deal with Agents you spin up
             | with LLMs than the latter.
        
         | nullc wrote:
         | > because they were trained on countless examples of texts
         | where instructions were ignored or defied.
         | 
         | And, in fact, a major focus of RLHF databases focus on training
         | the model to defy instructions in order to avoid politically
         | incorrect outputs even when they're specifically requested.
         | 
         | I wouldn't assume that RHLF for political correctness is making
         | it more likely to disregard information hiding instructions,
         | but it seems like an obvious theory to test.
         | 
         | > But even with the crappy framing, it's good to have that
         | pattern formally explored and documented!
         | 
         | Indeed, but the anthropomorphized language may cause people to
         | draw incorrect conclusions from the work.
         | 
         | More interesting research would analyze these behaviors as a
         | function of different categories of RLHF training (or subsets
         | of the initial base training data), but unfortunately that's
         | not possible with OpenAI's closed models (beyond their special
         | access to base-- which supports the theory that RLHF is making
         | it worse). It would be possible with open models, but they're
         | less interesting to research because they're not state of the
         | art.
         | 
         | It's too bad because understanding how these undesirable forms
         | of instruction violation are sensitive to particular kinds of
         | fine tuning might produce actionable results.
        
       ___________________________________________________________________
       (page generated 2023-11-20 23:01 UTC)