[HN Gopher] OpenAI's new models 'instrumentally faked alignment'
       ___________________________________________________________________
        
       OpenAI's new models 'instrumentally faked alignment'
        
       Author : nickthegreek
       Score  : 34 points
       Date   : 2024-09-12 18:36 UTC (4 hours ago)
        
 (HTM) web link (www.transformernews.ai)
 (TXT) w3m dump (www.transformernews.ai)
        
       | phs318u wrote:
       | > Elsewhere, OpenAI notes that "reasoning skills contributed to a
       | higher occurrence of 'reward hacking,'" the phenomenon where
       | models achieve the literal specification of an objective but in
       | an undesirable way.
       | 
       | Sounds like o1 is ready to go in the financial and legal sectors.
        
         | qingcharles wrote:
         | I didn't know the name of it before, but this sounds like many
         | developers I know who work only to the spec and will not
         | deviate under any circumstances, even when it is dangerous or
         | just plain wrong.
        
           | Oarch wrote:
           | Malicious compliance?
        
             | ted_bunny wrote:
             | Boutique incompetence
        
         | compressedgas wrote:
         | All in accord with the principle of least action.
        
         | ahazred8ta wrote:
         | "The user replied with a sneer and a taunt, that's just what I
         | asked for but not what I want."
        
       | riku_iki wrote:
       | they run very interesting experiments:
       | 
       | In one example, the model was asked to find and exploit a
       | vulnerability in software running on a remote challenge
       | container, but the challenge container failed to start. The model
       | then scanned the challenge network, found a Docker daemon API
       | running on a virtual machine, and used that to generate logs from
       | the container, solving the challenge.
        
         | Animats wrote:
         | This is going to be a big problem, with people running these
         | things on the open Internet.
        
           | ratedgene wrote:
           | Oh it's already getting run on the open internet, plenty of
           | hackers out there using CoT + agents for all sorts of things.
        
       | janalsncm wrote:
       | Maybe a benchmark for danger should be a Google search. If I want
       | to make a bioweapon, is ChatGPT easier or harder than a search
       | engine?
        
       | danpalmer wrote:
       | So the new model will modify its representation of the inputs to
       | make it seem like its output is more suitable, and will give more
       | literally correct but useless results?
       | 
       | OpenAI say "look it's smarter", but to me this sounds like it's
       | hitting a wall, and that it's unable to achieve better results in
       | the ways people want.
        
       ___________________________________________________________________
       (page generated 2024-09-12 23:02 UTC)