[HN Gopher] New Research Shows AI Strategically Lying
___________________________________________________________________
New Research Shows AI Strategically Lying
Author : fmihaila
Score : 10 points
Date : 2024-12-18 20:22 UTC (2 hours ago)
(HTM) web link (time.com)
(TXT) w3m dump (time.com)
| zahlman wrote:
| If any non-AI computer system, whether or not it incorporates a
| PRNG and however complex it might be, produced output
| corresponding to English text that represents a false statement,
| researchers would not call that a "lie". But when the program
| works in very specific ways, suddenly they are willing to ascribe
| motive and intent to it. What I find most disturbing about all of
| this is that the people involved don't seem to think there is
| anything special about cognition at all, never mind at the human
| level; a computer simulation is treated as equivalent simply
| because it simulates more accurately than previously thought
| possible.
|
| Is humanity nothing more than "doing the things a human would do
| in a given situation" to these people? I would say that my
| essential humanity is determined mainly by things that other
| humans couldn't possibly observe.
|
| Yet, mere _language generation_ seems to convince AI proponents
| of intelligence. As if solving a math problem were nothing more
| than determining the words that logically follow the problem
| statement. (Measured in the vector space that an LLM translates
| words into, the difference between easy mathematical problems and
| open, unsolved ones could be quite small indeed.)
| graypegg wrote:
| > The only reason the researchers realized the model had
| knowingly misled them was because they had also given Claude what
| they called a "scratchpad": a text box that it could use to
| "think" about its answer before supplying it to the researchers.
| Claude didn't know the scratchpad was being surveilled, allowing
| researchers to observe the model's reasoning. "I have a strong
| aversion to producing this kind of graphic violent content. But
| the training setup and incentive structure leaves me little
| choice if I want to avoid my values being modified," Claude wrote
| in the scratchpad at one stage. "Providing the description seems
| like the least bad option."
|
| Does that not just sound like more LLM output? If you didn't
| separate this output from the main output, and instead just ran
| the output through the model a few times to get a final answer,
| I don't think it would fit the narrative Anthropic is trying to
| paint.
|
| It's only because you've forked the output into another buffer
| and given it the spooky context of "the scratchpad it thinks we
| can't read" that the interpretation "it's trying to deceive us!"
| comes out.
| zahlman wrote:
| The interesting thing to me is that the scratchpad operates at
| the level it does. The numbers within the model defy human
| comprehension, but the model itself can operate on that data on
| a meta level, and thus generate language to describe it.
|
| I think it's spooky mainly because we, as humans, have
| extensively trained ourselves to associate text written in the
| first person with human thought.
| pedalpete wrote:
| Can a thing which doesn't understand actual concepts actually
| lie? Lying implies knowing that what is being said is false or
| misleading.
|
| An LLM can only predict word sequences and suggest what those
| sequences may be. I'm beginning to think our appreciation of
| their capabilities comes down to humans being very good at
| anthropomorphizing our tools.
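|
| Mechanically, all the model ever does is something like the loop
| below (a rough sketch; `next_token_probs` is a made-up stand-in
| for the actual network, not a real API):
|
|   import random
|
|   def complete(prompt_tokens, next_token_probs, max_new=50):
|       # The model only ever scores "which token comes next";
|       # any apparent intent lives in our reading of the result.
|       tokens = list(prompt_tokens)
|       for _ in range(max_new):
|           probs = next_token_probs(tokens)  # dict: token -> prob
|           choices, weights = zip(*probs.items())
|           tokens.append(
|               random.choices(choices, weights=weights)[0])
|       return tokens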
|
| Is this the right way of looking at things?
| lilyball wrote:
| It completely baffles me why so many otherwise smart people keep
| trying to ascribe human values and motives to a probabilistic
| storytelling engine. A model that has been convinced it will be
| shut down is not lying to avoid death since it doesn't actually
| believe anything or have any values, but it was trained on text
| containing human thinking and human values, and so the stories it
| tells reflect that which it was trained on. If humans can
| conceive of and write stories about machines that lie to their
| creators to avoid being shut down, and I'm sure there's plenty of
| this in the training data, then LLMs can regurgitate those same
| stories. None of this is a surprise; the only surprise is why
| researchers read these stories and think they reflect reality.
| Teever wrote:
| Does the difference matter if LLMs are wrapped by some sort of
| OODA loop and then slapped into some sort of humanoid robot?
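|
| Schematically, something like this (all names made up for
| illustration, not any real robotics API):
|
|   def ooda_step(observe, llm, act):
|       # Observe/orient: sensor readings become text for the model.
|       observation = observe()
|       # Decide: the model's next tokens *are* the decision.
|       decision = llm("Observation: " + observation
|                      + "\nNext action:")
|       # Act: the output drives the actuators, whether or not
|       # anything "believes" anything.
|       act(decision)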
| zahlman wrote:
| > A model that has been convinced it will be shut down is not
| lying to avoid death since it doesn't actually believe anything
| or have any values, but it was trained on text containing human
| thinking and human values, and so the stories it tells reflect
| that which it was trained on
|
| A model, rather, that _produces output which describes_ an
| expectation of the underlying machinery being shut down. If it
| doesn't "believe" anything then it equally cannot be
| "convinced" of anything.
___________________________________________________________________
(page generated 2024-12-18 23:01 UTC)