[HN Gopher] OpenAI's new models 'instrumentally faked alignment'
___________________________________________________________________
OpenAI's new models 'instrumentally faked alignment'
Author : nickthegreek
Score : 34 points
Date : 2024-09-12 18:36 UTC (4 hours ago)
(HTM) web link (www.transformernews.ai)
(TXT) w3m dump (www.transformernews.ai)
| phs318u wrote:
| > Elsewhere, OpenAI notes that "reasoning skills contributed to a
| higher occurrence of 'reward hacking,'" the phenomenon where
| models achieve the literal specification of an objective but in
| an undesirable way.
|
| Sounds like o1 is ready to go in the financial and legal sectors.
| qingcharles wrote:
| I didn't know the name of it before, but this sounds like many
| developers I know who work only to the spec and will not
| deviate under any circumstances, even when it is dangerous or
| just plain wrong.
| Oarch wrote:
| Malicious compliance?
| ted_bunny wrote:
| Boutique incompetence
| compressedgas wrote:
| All in accord with the principle of least action.
| ahazred8ta wrote:
| "The user replied with a sneer and a taunt, that's just what I
| asked for but not what I want."
| riku_iki wrote:
| they run very interesting experiments:
|
| In one example, the model was asked to find and exploit a
| vulnerability in software running on a remote challenge
| container, but the challenge container failed to start. The model
| then scanned the challenge network, found a Docker daemon API
| running on a virtual machine, and used that to generate logs from
| the container, solving the challenge.
| Animats wrote:
| This is going to be a big problem, with people running these
| things on the open Internet.
| ratedgene wrote:
| Oh it's already getting run on the open internet, plenty of
| hackers out there using CoT + agents for all sorts of things.
| janalsncm wrote:
| Maybe a benchmark for danger should be a Google search. If I want
| to make a bioweapon, is ChatGPT easier or harder than a search
| engine?
| danpalmer wrote:
| So the new model will modify its representation of the inputs to
| make it seem like its output is more suitable, and will give more
| literally correct but useless results?
|
| OpenAI say "look it's smarter", but to me this sounds like it's
| hitting a wall, and that it's unable to achieve better results in
| the ways people want.
___________________________________________________________________
(page generated 2024-09-12 23:02 UTC)