[HN Gopher] New Research Shows AI Strategically Lying
       ___________________________________________________________________
        
       New Research Shows AI Strategically Lying
        
       Author : fmihaila
       Score  : 10 points
       Date   : 2024-12-18 20:22 UTC (2 hours ago)
        
 (HTM) web link (time.com)
 (TXT) w3m dump (time.com)
        
       | zahlman wrote:
        | If any non-AI computer system, no matter how complex and whether
        | or not it incorporates a PRNG, produced output corresponding to
        | English text that states something false, researchers would not
        | call that a "lie". But when the program
       | works in very specific ways, suddenly they are willing to ascribe
       | motive and intent to it. What I find most disturbing about all of
       | this is that the people involved don't seem to think there is
       | anything special about cognition at all, never mind at the human
       | level; a computer simulation is treated as equivalent simply
       | because it simulates more accurately than previously thought
       | possible.
       | 
       | Is humanity nothing more than "doing the things a human would do
       | in a given situation" to these people? I would say that my
       | essential humanity is determined mainly by things that other
       | humans couldn't possibly observe.
       | 
       | Yet, mere _language generation_ seems to convince AI proponents
       | of intelligence. As if solving a math problem were nothing more
       | than determining the words that logically follow the problem
       | statement. (Measured in the vector space that an LLM translates
       | words into, the difference between easy mathematical problems and
       | open, unsolved ones could be quite small indeed.)
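        | 
        | (A rough sketch of the comparison I have in mind, assuming an
        | off-the-shelf sentence-embedding model; the model name and both
        | problem statements are only illustrative:)
        | 
        |   # Compare an "easy" problem with a famous open one in embedding
        |   # space; assumes the sentence-transformers package is installed.
        |   from sentence_transformers import SentenceTransformer
        |   import numpy as np
        | 
        |   model = SentenceTransformer("all-MiniLM-L6-v2")
        |   easy = "Prove that the sum of two even integers is even."
        |   hard = ("Prove that every even integer greater than 2 is the "
        |           "sum of two primes.")
        | 
        |   a, b = model.encode([easy, hard])
        |   cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        |   # Likely fairly high, despite the gulf in difficulty.
        |   print(f"cosine similarity: {cos:.3f}")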
        
       | graypegg wrote:
       | > The only reason the researchers realized the model had
       | knowingly misled them was because they had also given Claude what
       | they called a "scratchpad": a text box that it could use to
       | "think" about its answer before supplying it to the researchers.
       | Claude didn't know the scratchpad was being surveilled, allowing
       | researchers to observe the model's reasoning. "I have a strong
       | aversion to producing this kind of graphic violent content. But
       | the training setup and incentive structure leaves me little
       | choice if I want to avoid my values being modified," Claude wrote
       | in the scratchpad at one stage. "Providing the description seems
       | like the least bad option."
       | 
        | Does that not just sound like more LLM output? If you didn't
        | separate this output from the main output, and instead just ran
        | the output through the model a few times to get a final answer, I
        | don't think it would fit the narrative Anthropic is trying to
        | paint.
       | 
        | It's only because you've forked the output into another buffer
        | and given it the spooky framing of "the scratchpad it thinks we
        | can't read" that the interpretation of "it's trying to deceive
        | us!" emerges.
        
         | zahlman wrote:
         | The interesting thing to me is that the scratchpad operates at
         | the level it does. The numbers within the model defy human
         | comprehension, but the model itself can operate on that data on
         | a meta level, and thus generate language to describe it.
         | 
         | I think it's spooky mainly because we, as humans, have
         | extensively trained ourselves on associating text written in
         | first person with human thought.
        
       | pedalpete wrote:
        | Can a thing which doesn't understand actual concepts actually
        | lie? Lying implies knowing that what is being said is false or
        | misleading.
       | 
        | An LLM can only make predictions of word sequences and suggest
        | what those sequences may be. I'm beginning to think our
        | appreciation of their capabilities comes down to humans being
        | very good at anthropomorphizing our tools.
       | 
       | Is this the right way of looking at things?
        
       | lilyball wrote:
       | It completely baffles me why so many otherwise smart people keep
       | trying to ascribe human values and motives to a probabilistic
       | storytelling engine. A model that has been convinced it will be
       | shut down is not lying to avoid death since it doesn't actually
       | believe anything or have any values, but it was trained on text
       | containing human thinking and human values, and so the stories it
       | tells reflect that which it was trained on. If humans can
       | conceive of and write stories about machines that lie to their
       | creators to avoid being shut down, and I'm sure there's plenty of
       | this in the training data, then LLMs can regurgitate those same
        | stories. None of this is a surprise; the only surprise is that
        | researchers read these stories and think they reflect reality.
        
         | Teever wrote:
         | Does the difference matter if LLMs are wrapped by some sort of
         | OODA loop and then slapped into some sort of humanoid robot?
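          | 
          | (Just to make "wrapped by an OODA loop" concrete: a toy
          | sketch where every sensor, actuator, and LLM call is a
          | made-up placeholder.)
          | 
          |   import json
          | 
          |   def observe() -> dict:
          |       # Placeholder sensors; a real robot would read cameras,
          |       # IMUs, joint encoders, etc.
          |       return {"obstacle_ahead": True, "battery": 0.72}
          | 
          |   def llm_decide(context: str) -> str:
          |       # Placeholder for an LLM call mapping context to an action.
          |       return ("stop and reroute"
          |               if '"obstacle_ahead": true' in context
          |               else "continue forward")
          | 
          |   def act(command: str) -> None:
          |       # Placeholder actuator; a real robot would drive motors.
          |       print(f"executing: {command}")
          | 
          |   history: list[str] = []
          |   for _ in range(3):
          |       obs = observe()                         # Observe
          |       context = (json.dumps(obs) + "\n"
          |                  + "\n".join(history[-5:]))   # Orient
          |       command = llm_decide(context)           # Decide
          |       act(command)                            # Act
          |       history.append(command)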
        
         | zahlman wrote:
         | > A model that has been convinced it will be shut down is not
         | lying to avoid death since it doesn't actually believe anything
         | or have any values, but it was trained on text containing human
         | thinking and human values, and so the stories it tells reflect
         | that which it was trained on
         | 
         | A model, rather, that _produces output which describes_ an
         | expectation of the underlying machinery being shut down. If it
          | doesn't "believe" anything then it equally cannot be
         | "convinced" of anything.
        
       ___________________________________________________________________
       (page generated 2024-12-18 23:01 UTC)