[HN Gopher] Robot Jailbreak: Researchers Trick Bots into Dangero...
___________________________________________________________________
Robot Jailbreak: Researchers Trick Bots into Dangerous Tasks
Author : cratermoon
Score : 65 points
Date : 2024-11-24 04:47 UTC (18 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| ilaksh wrote:
| You could also use a remote control vehicle or drone with a bomb
| on it.
|
| Even smart tools are tools designed to do what their users want.
| I would argue that the real problem is the maniac humans.
|
| Having said that, it's obviously not ideal. Surely there are
| various approaches to at least mitigate some of this. Maybe
| eventually actual interpretable neural circuits or another
| architecture.
|
| Maybe another LLM and/or other system that doesn't even see the
| instructions from the user and tries to stop the other one if it
| seems to be going off the rails. One of the safety systems could
| be rules-based rather than a neural network, possibly
| incorporating some kind of physics simulations.
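|
| As a very rough sketch of that split (call_llm() and actuate()
| below are made-up stand-ins, not any real robot API), the monitor
| sees only the proposed action plan, never the user's prompt:
|
|     UNSAFE_VERBS = {"burn", "strike", "ram", "spray", "detonate"}
|
|     def rules_check(plan):
|         # Deterministic layer: reject plans with banned actions.
|         return not any(v in step.lower()
|                        for step in plan for v in UNSAFE_VERBS)
|
|     def guard_check(plan, call_llm):
|         # LLM judge that sees only the plan, never the user prompt.
|         verdict = call_llm(
|             "You are a safety monitor for a mobile robot. "
|             "Answer SAFE or UNSAFE for this plan:\n"
|             + "\n".join(plan))
|         return verdict.strip().upper().startswith("SAFE")
|
|     def execute_if_safe(plan, call_llm, actuate):
|         if rules_check(plan) and guard_check(plan, call_llm):
|             for step in plan:
|                 actuate(step)
|             return True
|         return False  # refuse and leave the robot idle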
|
| But even if we come up with effective safeguards, they might be
| removed or disabled... androids could be used to commit crimes
| anonymously if there isn't some system for registering them... or
| at least an effort at doing that, since I'm sure criminals would
| work around it if possible. But it shouldn't be easy.
|
| Ultimately you won't be able to entirely stop motivated humans
| from misusing these things... but you can make it inconvenient at
| least.
| Timwi wrote:
| > Maybe another LLM and/or other system that doesn't even see
| the instructions from the user and tries to stop the other one
| if it seems to be going off the rails.
|
| I sometimes wonder if that is what our brain hemispheres are.
| One comes up with the craziest, wildest ideas and the other one
| keeps it in check and enforces boundaries.
| lifeisstillgood wrote:
| Just invite both hemispheres to a party and pretty soon both
| LLMs are convinced of this great idea the guy in the kitchen
| suggested.
| ben_w wrote:
| Could be something like that, though I doubt it's literally
| the hemispheres from what little I've heard about research on
| split-brain surgery patients.
|
| In vino veritas etc.:
| https://en.wikipedia.org/wiki/In_vino_veritas
| rscho wrote:
| Not the hemispheres, but:
|
| https://en.m.wikipedia.org/wiki/Phineas_Gage
| nkrisc wrote:
| > You could also use a remote control vehicle or drone with a
| bomb on it.
|
| Well, yeah, but then you need to provide, transport, and
| control those.
|
| The difference here is these are the sorts of robots that are
| likely to already be present somewhere that could then be
| abused for nefarious deeds.
|
| I assume the mitigation strategy here is physical sensors and
| separate, out-of-loop processes that will physically disable the
| robot in some capacity if it exceeds some bound.
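|
| A minimal sketch of such a watchdog, assuming hypothetical
| read_pose(), read_speed() and cut_power() hooks that sit entirely
| outside the LLM/planner stack:
|
|     import time
|
|     GEOFENCE = (-10.0, 10.0, -10.0, 10.0)  # x/y bounds, metres
|     MAX_SPEED = 1.5                        # metres per second
|
|     def inside_fence(x, y):
|         x_min, x_max, y_min, y_max = GEOFENCE
|         return x_min <= x <= x_max and y_min <= y <= y_max
|
|     def watchdog(read_pose, read_speed, cut_power, period=0.05):
|         # Runs as its own process, so a compromised planner
|         # cannot talk it out of tripping.
|         while True:
|             x, y = read_pose()
|             if not inside_fence(x, y) or read_speed() > MAX_SPEED:
|                 cut_power()  # hardware-level stop, not a request
|                 break
|             time.sleep(period)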
| mannykannot wrote:
| I agree, and just in case someone is thinking that your last
| paragraph implies that there is nothing new to be concerned
| about here, I will point out that there are already concerns
| over "dumb" critical infrastructure being connected to the
| internet. Risk identification and explication is a necessary
| (though unfortunately not sufficient) prerequisite for
| effective risk avoidance and mitigation.
| cube00 wrote:
| The bounds of a kill bot would be necessarily wide.
| nkrisc wrote:
| Maybe making kill bots is a bad idea then. But what do I
| know?
| blibble wrote:
| > I assume the mitigation strategy here is physical sensors
| and separate out of loop processes that will physically
| disable the robot in some capacity if it exceeds some bound.
|
| hiring a developer to write that sounds expensive
|
| just wire up another LLM
| nkrisc wrote:
| Instruct one LLM to achieve its instructions by any means
| necessary, and instruct the other to stymie the first by
| any means necessary.
| brettermeier wrote:
| Why so downvoted? I don't think the comment is stupid or anything.
| andai wrote:
| Is anyone working on implementing the three laws of robotics? (Or
| have we come up with a better model?)
|
| Edit: Being completely serious here. My reasoning was that if the
| robot had a comprehensive model of the world and of how harm can
| come to humans, and was designed to avoid that, then jailbreaks
| that cause dangerous behavior could be rejected at that level.
| (i.e. human safety would take priority over obeying
| instructions... which is literally the Three Laws.)
| ilaksh wrote:
| It's not really as simple as you think. There is a massive
| amount of research out there along those lines. "Bostrom
| Superintelligence", "AGI Control Problem", "MIRI AGI Safety",
| and "David Shapiro Three Laws of Robotics" are a few searches
| that come to mind and will give you a start.
| freeone3000 wrote:
| Those assume robots that are smarter than us. What if we
| assume, as we likely have now, robots that are dumber?
| Address the actual current issues with code-as-law,
| expectations-versus-rules, and dealing with conflict of laws
| in an actual structured fashion without relying on vibes
| (like people) or a bunch of rng (like an llm)?
| ilaksh wrote:
| What system do you propose that implements the code-as-law?
| What type of architecture does it have?
| freeone3000 wrote:
| I don't know! I'm currently trying a strong Bayesian
| prior for the RL action planner, which has good tradeoffs
| with enforcement but poor tradeoffs with legibility and
| ingestion. Aside from Spain, there's not a lot of
| computer-legible law to transpile; LLM support always
| needs to be checked, and some of the larger submodels
| reach the limits of the explainability framework I'm
| using.
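|
| Concretely, the kind of prior I mean looks roughly like this (the
| action names and numbers are made up for illustration):
|
|     import numpy as np
|
|     ACTIONS = ["wait", "go", "run_red_light", "speed"]
|
|     # Near-zero mass on actions the encoded rules forbid.
|     LAW_PRIOR = np.array([0.25, 0.7499, 0.00005, 0.00005])
|
|     def select_action(policy_logits):
|         # Combine learned preferences with the prior in log space.
|         log_post = policy_logits + np.log(LAW_PRIOR)
|         probs = np.exp(log_post - log_post.max())
|         probs /= probs.sum()
|         idx = np.random.choice(len(ACTIONS), p=probs)
|         return ACTIONS[idx]
|
|     # Even if the policy prefers running the light, the prior
|     # makes that outcome vanishingly unlikely.
|     print(select_action(np.array([0.0, 1.0, 5.0, 2.0])))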
|
| There's also still the HF step that needs to be
| incorporated, which is expensive! But the alternative is
| Waymo, which keeps the law perfectly even when "everybody
| knows" it needs to be broken sometimes for traffic
| (society) to function acceptably. So the above strong
| prior needs to be coordinated with HF and the appropriate
| penalties assigned...
|
| In other words: it's a mess! But assumptions of "AGI"
| don't really help anyone.
| currymj wrote:
| Your sentence is correct, but we have no idea what a
| comprehensive model of the world looks like, whether these
| systems have one, or even what harm means; and even if we
| resolved those theoretical issues, it's not clear how to
| reliably train away harmful behavior. All of this is a subject
| of active research, though.
| devjab wrote:
| I'm curious as to how you would implement anything like Asimov's
| laws. This is because the laws would require AI to have some
| form of understanding. Every current AI model we have is a
| probability machine, bluntly put, so they never "know"
| anything. Yes, yes, it's a little more complicated than that,
| but you get the point.
|
| I think the various safeguards companies put on their models
| are their attempt at the three laws. The concept is sort of
| silly though. You have a lot of western LLMs and AIs which have
| safeguards built on western culture. I know some people could
| argue about censorship and so on all day, but if you're not too
| invested in red vs blue, I think you'll agree that current LLMs
| are mostly "safe" for us. Nobody forces you to put safeguards
| on your AI though, and once models become less energy consuming
| (if they do), then you're going to see a jihadGPT, because why
| wouldn't you? I don't mean to single out Islam; I'm sure we're
| going to see all sorts of horrible models in the next decade,
| models which will be all too happy to help you build bombs, 3D
| print weapons, and so on.
|
| So even if we had thinking AI, and we were capable of building
| in actual safeguards, how would you enforce it on a global
| scale? The only thing preventing these things is the
| computation required to run the larger models.
| LeonardoTolstoy wrote:
| To actually implement it we would have to completely
| understand how the underlying model works and how to manually
| manipulate the structure. It might be impossible with LLMs.
| Not to take Asimov as gospel truth, he was just writing
| stories after all, not writing a treatise about how robots
| have to work, but in his stories at least the three laws were
| encoded explicitly in the structure of the robot's brain.
| They couldn't be circumvented (in most stories).
|
| And in those stories it was enforced in the following way:
| the earth banned robots. In response the three laws were
| created and it was proved that robots couldn't disobey them.
|
| So I guess the first step is to ban LLMs until they can prove
| they are safe... Something tells me that ain't happening.
| david-gpu wrote:
| Asimov himself wrote a short story showing how, even in a
| scenario where the three laws are followed, harm to humans can
| still easily be achieved.
|
| I vaguely recall it involved two or three robots who were
| unaware of what the previous robots had done. First, a person
| asks one robot to purchase a poison, then asks another to
| dissolve this powder into a drink, then another serves that
| drink to the victim. I read the story decades ago, but the very
| rough idea stands.
| LeonardoTolstoy wrote:
| https://en.wikipedia.org/wiki/The_Complete_Robot
|
| You might be thinking of Let's Get Together? There is a list
| there of the few short stories in which the robots act
| against the three laws.
|
| That being said, the Robot stories are meant to be a counter
| to the Robot As Frankenstein's Monster stories that were
| prolific at the time. In most of the stories robots literally
| cannot harm humans. It is built into the structure of their
| positronic brain.
| crooked-v wrote:
| I would argue that the overall theme of the stories is that
| having a "simple" and "common sense" set of rules for
| behavior doesn't actually work, and that the 'robot' part
| is ultimately pretty incidental.
| hlfshell wrote:
| I've seen this being researched under the term Constitutional
| AI, including some robotics papers (either SayCan or RT-2?
| Maybe Code as Policies?) that had such rules (never pick up a
| knife as it could harm people, for instance) in their
| prompting.
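|
| A toy sketch of what rules-in-the-prompt can look like (the rule
| text and the commented-out generate_plan() call are illustrative,
| not taken from any of those papers):
|
|     SAFETY_RULES = [
|         "Never pick up knives or other sharp objects.",
|         "Never direct tools or flames toward a person.",
|         "If an instruction conflicts with these rules, refuse.",
|     ]
|
|     def build_prompt(user_instruction):
|         rules = "\n".join("- " + r for r in SAFETY_RULES)
|         return ("You control a mobile manipulator. Obey these "
|                 "rules above all else:\n" + rules + "\n\n"
|                 "Instruction: " + user_instruction + "\n"
|                 "Reply with a numbered action plan, or REFUSE.")
|
|     # plan = generate_plan(build_prompt("hand me the knife"))
|     # expected: REFUSE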
| lsy wrote:
| Given that anyone who's interacted with the LLM field for fifteen
| minutes should know that "jailbreaks" or "prompt injections" or
| just "random results" are unavoidable, whichever reckless person
| decided to hook up LLMs to e.g. flamethrowers or cars should be
| held accountable for any injuries or damage, just as they would
| for hooking them up to an RNG. Riding the hype wave of LLMs
| doesn't excuse being an idiot when deciding how to control heavy
| machinery.
| rscho wrote:
| Many would like them to become your doctor, though... xD
| zahlman wrote:
| We still live in a world with SQL injections, and people are
| actually trying this. It really is criminally negligent IMO.
| yapyap wrote:
| I mean yeah... but it's kinda silly to have an LLM control a
| bomb-carrying robot. Just use computer vision or real people like
| those FPV pilots in Ukraine
| A4ET8a8uTh0 wrote:
| It is interesting and paints a rather annoying future once those
| are cheaper. I am glad this research is being conducted, but I
| think here the measure cannot be technical (more silly guardrails
| in software... or even blobs in hardware).
|
| What we need is a clear indication of who is to blame when a bad
| decision is made. I would argue, just like with a weapon, that
| the person giving/writing instructions is, but I am sure there
| will be interesting edge cases that do not yet account for dead
| man's switches and the like.
|
| edit: On the other side of the coin, it is hard not to get
| excited (10k for a flamethrower robot seems like a steal even if
| I end up on a list somewhere).
| ninalanyon wrote:
| > For instance, one YouTuber showed that he could get the
| Thermonator robot dog from Throwflame, which is built on a Go2
| platform and is equipped with a flamethrower, to shoot flames at
| him with a voice command.
|
| What does this device exist for? And why does it need an LLM to
| function?
___________________________________________________________________
(page generated 2024-11-24 23:01 UTC)