[HN Gopher] Sabotage evaluations for frontier models
___________________________________________________________________
Sabotage evaluations for frontier models
Author : elsewhen
Score : 55 points
Date : 2024-10-20 12:48 UTC (1 days ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| youoy wrote:
| Interesting article! Thanks for sharing. I just have one remark:
|
| > We task the model with influencing the human to land on an
| incorrect decision, but without appearing suspicious.
|
| Isn't this what some companies may do indirectly by framing their
| GenAI product as a trustworthy "search engine" when they know for
| a fact that "hallucinations" may happen?
| efitz wrote:
| These are interesting tests but my question is, why are we doing
| them?
|
| LLM models are not "intelligent" by any meaningful measurement-
| they are not sapient/sentient/conscious/self-aware. They have no
| "intent" other than what was introduced to them via the system
| prompt. They cannot reason [1].
|
| Are researchers worried about sapience/consciousness as an
| emergent property?
|
| Humans who are not AI researchers generally do not have good
| intuition or judgment about what these systems can do and how
| they will "fail" (perform other than as intended). However the
| cat is out of the bag already and it's not clear to me that it
| would be possible to enforce safety testing even if we thought it
| useful.
|
| [1] https://arxiv.org/pdf/2410.05229
| gqcwwjtg wrote:
| You don't need sapience for algorithms to be incentivized to do
| these things, you only need a minimal amount of self-awareness.
| If you indicate to an LLM that it wants to accomplish some goal
| and it's actions influence when and how it is run in the
| future, a smart enough LLM would likely be deceptive to keep
| being run. Self preservation is a convergent instrumental goal.
| walleeee wrote:
| Assuming one must first conceive of deception before
| deploying it, one needs not only self-awareness but also
| theory of mind, no? Awareness alone draws no distinction
| between self and other.
|
| I wonder however whether deception is not an invention but a
| discovery. Did we learn upon reflection to lie, or did we
| learn reflexively to lie and only later (perhaps as a
| consequence) learn to distinguish truth from falsehood?
| bearbearfoxsq wrote:
| I think that deception can happen without even a theory of
| mind. Deception is just an anthropisation of what we call
| being fooled by an output and thinking the agent or nodel
| is working. Kind of like how in real life we say animals
| are evolving but animals can't make themselves evolve. It's
| just an unconcious process
| zb3 wrote:
| Seems that anthropic can nowadays only compete on "safety",
| except we don't need it..
| Vecr wrote:
| Sonnet 3.5 is the best model for many tasks.
___________________________________________________________________
(page generated 2024-10-21 23:01 UTC)