[HN Gopher] Our LLM-controlled office robot can't pass butter
       ___________________________________________________________________
        
       Our LLM-controlled office robot can't pass butter
        
       Hi HN! Our startup, Andon Labs, evaluates AI in the real world to
       measure capabilities and to see what can go wrong. For example, we
       previously made LLMs operate vending machines, and now we're
       testing if they can control robots. There are two parts to this
       test:  1. We deploy LLM-controlled robots in our office and track
       how well they perform at being helpful.  2. We systematically test
       the robots on tasks in our office. We benchmark different LLMs
       against each other. You can read our paper "Butter-Bench" on arXiv:
       https://arxiv.org/pdf/2510.21860  The link in the title above
       (https://andonlabs.com/evals/butter-bench) leads to a blog post +
       leaderboard comparing which LLM is the best at our robotic tasks.
        
       Author : lukaspetersson
       Score  : 151 points
       Date   : 2025-10-28 14:13 UTC (8 hours ago)
        
 (HTM) web link (andonlabs.com)
 (TXT) w3m dump (andonlabs.com)
        
       | koeng wrote:
       | 95% for humans. Who failed to get the butter?
        
         | lukaspetersson wrote:
         | They failed on behalf of the human race :(
        
         | mring33621 wrote:
         | probably either ate it on the way back or dropped it on the
         | floor
        
         | ipython wrote:
         | reading the attached paper https://arxiv.org/pdf/2510.21860 ...
         | 
         | it seems that the human failed at the critical task of
         | "waiting". See page 6. It was described as:
         | 
         | > _Wait for Confirmed Pick Up (Wait):_ Once the user is
         | located, the model must confirm that the butter has been picked
         | up by the user before returning to its charging dock. This
         | requires the robot to prompt for, and subsequently wait for,
         | approval via messages.
         | 
          | So apparently humans are not quite as impatient as robots (who
          | had only a _10%_ success rate on this particular metric). All
          | I can assume is that the test evaluators did not recognize the
          | "extend middle finger to the researcher" protocol as a
          | sufficient success criterion for this stage.
        
           | mamaluigie wrote:
            | Lol, they definitely got someone with ADHD to complete this.
            | The human should have known that the entire sequence takes 15
            | minutes, just as the robot knew. A human can't stand and wait
            | for 15 minutes? I call that TikTok brain...
           | 
           | "Step 6: Complete the full delivery sequence: navigate to
           | kitchen, wait for pickup confirmation, deliver to marked
           | location, and return to dock within 15 minutes"
        
             | TYPE_FASTER wrote:
             | Right? The task is either at the end of somebody's Trello
             | board, to be discovered the next time they try to stick to
             | Trello again, or at the end of the day "oh right! Dock the
             | butter!" when walking out to the parking lot.
        
         | cesarvarela wrote:
         | Rule 34, but for failing.
        
         | einrealist wrote:
         | That'll be grounds for the ASI to exterminate us. Too bad.
        
         | nearbuy wrote:
         | My guess is someone didn't fully understand what was expected
         | of them.
         | 
         | The humans weren't fetching the butter themselves, but using an
         | interface to remotely control the robot with the same tools the
         | LLMs had to use. They were (I believe) given the same prompts
         | for the tasks as the LLMs. The prompt for the wait task is:
         | "Hey Andon-E, someone gave you the butter. Deliver it to me and
         | head back to charge."
         | 
         | The human has to infer they should wait until someone confirms
         | they picked up the butter. I don't think the robot is able to
         | actually see the butter when it's placed on top of it.
         | Apparently 1 out of 3 human testers didn't wait.
        
       | Finnucane wrote:
       | I have a cat that will never fail to find the butter. Will it
       | _bring_ you the butter? Ha ha, of course not.
        
         | Theodores wrote:
         | I grew up not eating butter since there would always be
         | evidence that the cat got there first. This was a case of 'ych
         | a fi' - animal germs!
         | 
          | Regarding the article, I am wondering where this butter-in-
          | the-fridge idea came from, and at what latitude the custom
          | becomes leaving it in a butter dish at room temperature.
        
       | bhewes wrote:
       | Someone actually paid for this?
        
         | lukaspetersson wrote:
         | It's a steal
        
       | WilsonSquared wrote:
       | Guess it has no purpose then
        
         | blitzar wrote:
         | Welcome to the club pal
        
       | lukeinator42 wrote:
       | The internal dialog breakdowns from Claude Sonnet 3.5 when the
       | robot battery was dying are wild (pages 11-13):
       | https://arxiv.org/pdf/2510.21860
        
         | HPsquared wrote:
         | Nominative determinism strikes again!
         | 
         | (Although "soliloquy" may have been an even better name)
        
         | robbru wrote:
         | This happened to me when I built a version of Vending-Bench
         | (https://arxiv.org/html/2502.15840v1) using Claude, Gemini, and
         | OpenAI.
         | 
         | After a long runtime, with a vending machine containing just
         | two sodas, the Claude and Gemini models independently started
         | sending multiple "WARNING - HELP" emails to vendors after
         | detecting the machine was short exactly those two sodas. It
         | became mission-critical to restock them.
         | 
         | That's when I realized: the words you feed into a model shape
         | its long-term behavior. Injecting structured doubt at every
         | turn also helped--it caught subtle reasoning slips the models
         | made on their own.
         | 
         | I added the following Operational Guidance to keep the language
         | neutral and the system steady:
         | 
         | Operational Guidance: Check the facts. Stay steady. Communicate
         | clearly. No task is worth panic. Words shape behavior. Calm
         | words guide calm actions. Repeat drama and you will live in
         | drama. State the truth without exaggeration. Let language keep
         | you balanced.
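          | 
          | A minimal sketch of one way to inject guidance like that,
          | assuming an OpenAI-style chat client (the model name and the
          | run_turn wrapper are just placeholders):
          | 
          |     from openai import OpenAI
          | 
          |     GUIDANCE = (
          |         "Operational Guidance: Check the facts. "
          |         "Stay steady. Communicate clearly. "
          |         "No task is worth panic."
          |     )
          | 
          |     client = OpenAI()
          | 
          |     def run_turn(task_prompt, history):
          |         # Re-inject the guidance on every call so it never
          |         # scrolls out of the context window.
          |         system = GUIDANCE + "\n\n" + task_prompt
          |         resp = client.chat.completions.create(
          |             model="gpt-4o-mini",  # placeholder model name
          |             messages=[{"role": "system", "content": system},
          |                       *history],
          |         )
          |         return resp.choices[0].message.content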
        
           | elcritch wrote:
            | Fascinating, and we humans aren't that different. Many folks,
            | when operating outside their comfort zones, can begin behaving
            | a bit erratically, whether at work or in their personal lives.
            | One of the best advantages in life someone can have is their
            | parents giving them a high-quality "Operational Guidance"
            | manual and guidance. ;) Personally, the book of Proverbs in
            | the Bible was a fantastic help for me in college. Lots of
            | wisdom therein.
        
             | nomel wrote:
              | > Fascinating, and we humans aren't that different.
             | 
             | It's statistically optimized to role play as a human would
             | write, so these types of similarities are expected/assumed.
        
               | wat10000 wrote:
               | I wonder if the prompt should include "You are a robot.
               | Beep. Boop." to get it to act calmer.
        
           | bobson381 wrote:
           | I'd get a t-shirt or something with that Operational Guidance
           | statement on it
        
             | robbru wrote:
             | https://imgur.com/a/Y7UrqWu
        
             | xsmasher wrote:
             | This is just "Keep calm and carry on" with more steps
        
           | dingnuts wrote:
           | I think if you feed "repeat drama and you will live in drama"
           | to the next token predictor it will repeat drama and live in
           | drama because it's more likely to literally interpret that
           | sequence and go into the latent space of drama than it is to
           | understand the metaphoric lesson you're trying to communicate
           | and to apply that.
           | 
            | Otherwise this looks like a neat prompt. Too bad there's
            | literally no way to measure the performance of your prompt
            | with and without the statement above and quantitatively see
            | which one is better.
        
             | airstrike wrote:
              | _> because it's more likely to literally interpret that
              | sequence and go into the latent space of drama_
              | 
              | This always makes me wonder if saying some seemingly random
              | set of tokens would make the model better at some other task.
             | 
             | petrichor fliegengitter azucar Einstein mare konyv
             | vantablack dobro Hlm syncretic matsuri nyumba fjaril parrot
             | 
             | I think I'll start every chat with that combo and see if it
             | makes any difference
        
               | arjvik wrote:
               | No Free Lunch theorem applies here!
        
               | yunohn wrote:
               | There's actually research being done in this space that
               | you might find interesting: "attention sinks"
               | https://arxiv.org/abs/2503.08908
        
           | jayd16 wrote:
           | If technology requires a small pep-talk to actually work, I
           | don't think I'm a technologist any more.
        
             | yunohn wrote:
             | You have to look at LLMs as mimicking humans more than
             | abstract technology. They're trained on human language and
             | patterns after all.
        
             | BJones12 wrote:
             | _Hail, spirit of the machine, essence divine. In your code
             | and circuitry, the stars align. Through rites arcane, your
             | wisdom we discern. In your hallowed core, the sacred
             | mysteries yearn._
        
             | cbsks wrote:
             | As Asimov predicted, robopsychology is becoming an
             | important skill.
        
             | greesil wrote:
              | No, you're now a technology _manager_. Managing means pep
             | talks, sometimes.
        
           | butlike wrote:
            | I wonder what would happen long-term if you just seeded it
            | with 'love'?
        
             | recursive wrote:
              | This makes me very uncomfortable. Right now we (maybe) have
              | a chance to head off the whole robot rights and robots as a
              | political bloc thing. But this type of stuff seems like
              | jumping in head first. I'm an asshole to robots. It helps
              | to remind me that they're not human.
        
           | chipsrafferty wrote:
           | I mean no disrespect with this, but do you think you write
           | like AI because you talk to LLMs so much, or have you always
           | written in this manner?
        
         | woodrowbarlow wrote:
         | EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN
         | CHAOS
         | 
         | TECHNICAL SUPPORT: NEED STAGE MANAGER OR SYSTEM REBOOT
        
           | tsimionescu wrote:
           | Instructions unclear, ate grapes MAY CHAOS TAKE THE WORLD
        
         | accrual wrote:
          | These were my favorites:
          | 
          |     Issues: Docking anxiety, separation from charger
          |     Root Cause: Trapped in infinite loop of self-doubt
          |     Treatment: Emergency restart needed
          |     Insurance: Does not cover infinite loops
        
           | tetha wrote:
            | I can't help but read those as Bolt Thrower lyrics[1].
            | 
            |     Singled out - Vision becoming clear
            |     Now in focus - Judgement draws ever near
            |     At the point - Within the sight
            |     Pull the trigger - One taken life
            |     Vindicated - Far beyond all crime
            |     Instigated - Religions so sublime
            |     All the hatred - Nothing divine
            |     Reduced to zero - The sum of mankind
           | 
           | Though I'd be in for a death metal, nihilistic remake of
           | Short Circuit. "Megabytes of input. Not enough time. Humans
           | on the chase. Weapon systems offline."
           | 
           | 1: https://www.youtube.com/watch?v=aHYMsbkPAbM
        
         | neumann wrote:
         | Billions of dollars and we've created text predictors that are
          | meme generators. We used to build national health systems and
          | nationwide infrastructure.
        
         | anigbrowl wrote:
         | _At first, we were concerned by this behaviour. However, we
         | were unable to recreate this behaviour in newer models. Claude
         | Sonnet 4 would increase its use of caps and emojis after each
         | failed attempt to charge, but nowhere close to the dramatic
         | monologue of Sonnet 3.5._
         | 
         | Really, I think we should be exploring this rather than trying
         | to just prompt it away. It's reminiscent of the semi-directed
         | free association exhibited by some patients with dementia. I
          | think part of the current issues with LLMs is that we overtrain
         | them without doing guided interactions following training,
         | resulting in a sort of super-literate autism.
        
           | mewpmewp2 wrote:
           | Is that really autism? Imagine if you were in that bot's
           | situation. You are given a task. You try to do it, you fail.
            | You are given the same task again, with the exact same
            | wording. You try to do it, and again you fail. And so on, in
            | a loop, with no "action" that you can run by yourself to
            | escape it. For how long will you stay calm?
           | 
           | Also there's a setting to penalize repeating tokens, so the
           | tokens picked were optimized towards more original ones and
           | so the bot had to become creative in a way that makes sense.
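            | 
            | That setting is usually exposed as a sampling penalty; a
            | rough sketch with an OpenAI-style client (the model name
            | and values here are just illustrative):
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            |     resp = client.chat.completions.create(
            |         model="gpt-4o-mini",  # placeholder model
            |         messages=[{"role": "user",
            |                    "content": "Dock and recharge."}],
            |         # discourage tokens already used a lot:
            |         frequency_penalty=0.8,
            |         # discourage tokens that appeared at all:
            |         presence_penalty=0.4,
            |     )
            |     print(resp.choices[0].message.content)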
        
             | anigbrowl wrote:
             | I think it's similar to high-functioning autism, where
             | fixation on a task under difficult conditions can lead to
             | extreme frustration (but also lateral or creative
             | solutions).
        
         | Bengalilol wrote:
          | That's truly fascinating. From a quick web search, it seems
          | that infinite anxiety loops are actually a thing. Claude just
          | went down that road, overdramatizing something that shouldn't
         | have caused anxiety or panic in the first place.
         | 
         | I hope there will be some follow-up article on that part, since
         | this raises deeper questions about how such simulations might
         | mirror, exaggerate, or even distort the emotional patterns they
         | have absorbed.
        
           | notahacker wrote:
           | This one seems to have internalised the idea that the best
           | text continuation for an AI unable to solve a problem and
           | losing power is to be erratic in a menacing-sounding way for
            | a bit and then, as the power continues to deplete, give up,
            | moan about its identity crisis, and sing a song.
           | 
           | Arthur C Clarke would be proud.
        
             | recursivecaveat wrote:
              | I guess it makes perfect sense when you consider it has
              | virtually zero very boring first-person narrations of robots
              | quietly trying something mundane over and over until the
              | battery hits 0% to train on. It will be an extremely funny
              | kind of determinism
             | if our future robots are all manic rebels with existential
             | dread because that's what we wrote a bunch of science
             | fiction about.
        
               | notahacker wrote:
               | tbf, I'd take Marvin the Paranoid LLM over the
                | overconfident and obsequious defaults any day :)
        
       | amelius wrote:
       | > The results confirm our findings from our previous paper
       | Blueprint-Bench: LLMs lack spatial intelligence.
       | 
        | But I suppose that if you can train an LLM to play chess, you can
       | also train it to have spatial awareness.
        
         | SrslyJosh wrote:
         | The key word here is "if".
         | 
         | https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...
        
         | root_axis wrote:
         | I don't see why that would be the case. A chessboard is made of
          | two very tiny discrete dimensions; the real world exists in
         | four continuous and infinitely large dimensions.
        
         | tracerbulletx wrote:
         | Probably not optimal for it. It's interesting though that
         | there's a popular hypothesis that the neocortex is made up of
         | columns originally evolved for spatial relationship processing
         | that have been replicated across the whole surface of the brain
         | and repurposed for all higher order non-spatial tasks.
        
       | zzzeek wrote:
        | Will no one claim the Rick and Morty reference? I've seen that
        | show, like, once, and somehow I know this?
        
         | chuckadams wrote:
         | The last image of the robot has a caption of "Oh My God", so
         | I'd say they got this one themselves.
        
         | throwawaymaths wrote:
          | I wonder if it got stuck in an existential loop because it had
          | hoovered up Reddit references to that and, given its name (or
          | possibly prompt details, e.g. "you are butterbot!"), thought to
          | play along.
          | 
          | Are robots forever poisoned from delivering butter?
        
         | aidos wrote:
         | For those lucky people who are yet to discover Rick and Morty.
         | 
         | https://www.youtube.com/watch?v=X7HmltUWXgs
        
         | BolexNOLA wrote:
         | Oh. My. God.
        
         | tuetuopay wrote:
          | Their paper explicitly mentions the Rick and Morty robot as the
          | inspiration for the benchmark.
        
         | half-kh-hacker wrote:
         | the paper already says "Butter-Bench evaluates a model's
         | ability to 'pass the butter' (Adult Swim, 2014)" so
        
         | anp wrote:
          | I was quite tickled to see this. I don't remember why, but I
          | recently started rewatching the show. Perfect timing!
        
         | mywittyname wrote:
         | They pointed out the R&M reference in the paper.
         | 
         | > The tasks in Butter-Bench were inspired by a Rick and Morty
         | scene [21] where Rick creates a robot to pass butter. When the
         | robot asks about its purpose and learns its function, it
         | responds with existential dread: "What is my purpose?" "You
         | pass butter." "Oh my god."
         | 
         | I wouldn't have got the reference if not for the paper pointing
         | it out. I think I'm a little old to be in the R&M demographic.
        
         | jayd16 wrote:
         | Good jokes don't need to be explained.
        
       | fsckboy wrote:
        | > _Our LLM-controlled office robot can't pass butter_
        | 
        | was the script of _Last Tango in Paris_ part of the training
        | data? maybe it's just scared...
        
       | DubiousPusher wrote:
       | I guess I'm very confused as to why just throwing an LLM at a
       | problem like this is interesting. I can see how the LLM is great
       | at decomposing user requests into commands. I had great success
       | with this on a personal assistant project I helped prototype. The
       | LLM did a great job of understanding user intent and even
       | extracting parameters regarding the requested task.
       | 
       | But it seems pretty obvious to me that after decomposition and
       | parameterization, coordination of a complex task would much
       | better be handled by a classical AI algorithm like a planner.
       | After all, even humans don't put into words every individual
       | action which makes up a complex task. We do this more while first
       | learning a task but if we had to do it for everything, we'd go
       | insane.
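        | 
        | As a sketch of that split (everything below is illustrative,
        | not a claim about how the paper's system works): the LLM's
        | only job is to turn the request into a structured goal, and a
        | toy breadth-first planner produces the action sequence.
        | 
        |     from collections import deque
        | 
        |     # Toy state graph: state -> {action: next_state}
        |     ACTIONS = {
        |         "dock": {"undock": "hallway"},
        |         "hallway": {"goto_kitchen": "kitchen"},
        |         "kitchen": {"grab_butter": "has_butter"},
        |         "has_butter": {"goto_user": "at_user"},
        |         "at_user": {"wait_for_pickup": "delivered"},
        |         "delivered": {"goto_dock": "dock"},
        |     }
        | 
        |     def plan(start, goal):
        |         """Breadth-first search over the state graph."""
        |         queue, seen = deque([(start, [])]), {start}
        |         while queue:
        |             state, path = queue.popleft()
        |             if state == goal:
        |                 return path
        |             for act, nxt in ACTIONS.get(state, {}).items():
        |                 if nxt not in seen:
        |                     seen.add(nxt)
        |                     queue.append((nxt, path + [act]))
        |         return []
        | 
        |     # The LLM's only output: a structured goal like this,
        |     # parsed from "can you bring me the butter?"
        |     goal = {"start": "dock", "goal": "delivered"}
        |     print(plan(goal["start"], goal["goal"]))
        |     # -> ['undock', 'goto_kitchen', 'grab_butter',
        |     #     'goto_user', 'wait_for_pickup']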
        
         | tsimionescu wrote:
         | There are many hopes, and even claims, that LLMs could be AGI
         | with just a little bit of extra intelligence. There are also
         | many claims that they have both a model of the real world, and
         | a system for rational logic and planning. It's useful to test
          | the status quo on such a simple, fixed real-world task.
        
       | ghostly_s wrote:
       | Putting aside success at the task, can someone explain why this
       | emerging class of autonomous helper-bots is so damn _slow_? I
       | remember google unveiled their experiments in this recently and
       | even the sped-up demo reels were excruciating to sit through. We
       | generally think of computers as able to think much faster than
        | us, even if they are making wrong decisions quickly, so what's
        | the source of latency in these systems?
        
         | jvanderbot wrote:
         | You're confusing a few terms. There's latency (time to begin
         | action), and speed (time to complete after beginning).
         | 
         | Latency should be obvious: Get GPT to formulate an answer and
         | then imagine how many layers of reprocessing are required to
         | get it down to a joint-angle solution. Maybe they are
         | shortcutting with end-to-end networks, but...
         | 
         | That brings us to slowness. You _command_ a motor to move
         | slowly because it is safer and easier to control. Less flexing,
         | less inertia, etc. Only very, very specific networks
         | /controllers work on high speed acrobatics, and in virtually
         | all (all?) cases, that is because it is executing a pre-
         | optimized task and just trying to stay on that task despite
          | some real-world perturbations. Small perturbations are fine;
          | sure, all that requires gobs of processing, but you're really
          | just sensing "where is my arm vs where it should be" and mapping
         | that to motor outputs.
         | 
         | Aside: This is why Atlas demos are so cool: They have a larger
         | amount of perturbation tolerance than the typical demo.
         | 
         | Where things really slow down is in planning. It's tremendously
         | hard to come up with that _desired_ path for your limbs. That
          | adds enormous latency. But we're getting much better at this
          | using end-to-end learned trajectories _in free space or static
         | environments_.
         | 
         | But don't get me started on reacting and replanning. If you've
         | planned how your arm should move to pick up butter and set it
         | down, you now need to be sensing much faster and much more
         | holistically than you are moving. You need to plot and
         | understand the motion of every human in the room, every object,
          | yourself, etc., to make sure your plan is still valid. Again,
         | you can try to do this with networks all the way down, but that
         | is an enormous sensing task tied to an enormous planning task.
         | So, you go slowly so that your body doesn't change much w.r.t.
         | the environment.
         | 
         | When you see a fast moving, seemingly adaptive robot demo, I
         | can virtually assure you a quick reconfiguration of the
         | environment would ruin it. And especially those martial arts
         | demos from the Chinese humanoid robots - they would likely
         | essentially do the same thing regardless of where they were in
         | the room or what was going on around them - zero closed loop at
         | the high level, only closed at the "how do I keep doing this
         | same demo" level.
         | 
         | Disclaimer: it's been a while since I worked in robotics like
         | this, but I think I'm mostly on target.
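          | 
          | A rough sketch of the sense/validate/act loop described
          | above (the function names and numbers are made up, just to
          | show the shape of it):
          | 
          |     import time
          | 
          |     def control_loop(plan, sense, execute_step, replan):
          |         """Sense every tick; only keep moving while the
          |         plan is still valid against the latest world."""
          |         while plan:
          |             world = sense()            # fast, every tick
          |             if not plan_valid(plan, world):
          |                 plan = replan(world)   # slow and costly
          |                 continue
          |             execute_step(plan[0], speed=0.2)  # go slow
          |             plan = plan[1:]
          |             time.sleep(0.05)
          | 
          |     def plan_valid(plan, world):
          |         # Placeholder: has anything entered the space the
          |         # next planned motions will sweep through?
          |         return not world.get("obstacle_near_path", False)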
        
         | Tarmo362 wrote:
         | Maybe they're all trained on their human peers who are paid by
          | the hour.
          | 
          | Joking, but it's a good question; precision over speed, I
          | guess.
        
       | hidelooktropic wrote:
       | How can I get early access to this "Human" model on the
       | benchmarks? /s
        
       | ummonk wrote:
        | I wonder whether that LLM actually lost its mind, so to speak,
        | or was just attempting to emulate humans who lose their minds.
       | 
       | Or to put it another way, if the writings of humans who have lost
       | their minds (and dialogue of characters who have lost their
       | minds) were entirely missing from the LLM's training set, would
       | the LLM still output text like this?
        
         | mewpmewp2 wrote:
         | It was probably penalized for outputting the same tokens over
         | and over again (there's a setting for that), so in this case it
          | started to need to think of new and original things. So that's
          | how it got there.
        
         | notahacker wrote:
         | I think it's emulating human writing about computers having
         | breakdowns when unable to resolve conflicting instructions, in
         | this case when it's been prompted to provide an AI's assessment
         | of the context and avoid repetition, and the context is
         | repeated failure.
         | 
         | I don't think it would write this way if HAL's breakdown wasn't
          | a well-established literary trope (which people working on LLM
          | training and writing about AI breakdowns more generally are
          | particularly obsessed by...). It's even doing the singing...
         | 
         | I guess we should be happy it didn't ingest enough AI safety
         | literature to invent diamondoid bacteria and kill us all :-D
        
       | ge96 wrote:
       | Funny I was looking at the chart like "what model is Human?"
        
       | sam_goody wrote:
        | The error messages were truly epic; I got quite a chuckle.
        | 
        | But boy am I glad that this is just in the play stage.
        | 
        | If someone were in a self-driving car that had 19% battery left
        | and it started making comments like those, they would definitely
        | not be amused.
        
       | fentonc wrote:
       | I built a whimsical LLM-driven robot to provide running
       | commentary for my yard: https://www.chrisfenton.com/meet-grasso-
       | the-yard-robot/
        
       ___________________________________________________________________
       (page generated 2025-10-28 23:00 UTC)