[HN Gopher] Project Vend: Can Claude run a small shop? (And why ...
       ___________________________________________________________________
        
       Project Vend: Can Claude run a small shop? (And why does that
       matter?)
        
       Author : gk1
       Score  : 158 points
       Date   : 2025-06-27 16:09 UTC (6 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | seidleroni wrote:
        | As much as I love AI/LLMs and use them on a daily basis, this
       | does a great job revealing the gap between current capabilities
       | and what the massive hype machine would have us believe the
       | systems are already capable of.
       | 
        | I wonder how long it will take frontier LLMs to be able to
        | handle something like this with ease, without relying on a lot
        | of "scaffolding".
        
         | roxolotl wrote:
         | I don't quite know why we would think they'd ever be able to
         | without scaffolding. LLM are exactly what the name suggests,
         | language models. So without scaffolding they can use to
         | interact with the world with using language they are completely
         | powerless.
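          | 
          | A rough sketch of what that scaffolding amounts to (call_llm
          | and the tool functions here are hypothetical stand-ins, not
          | any vendor's actual API): the model only ever emits text, and
          | the loop below is the part that actually touches the world.
          | 
          |     import json
          |     
          |     def check_inventory(item):
          |         # stand-in for a real warehouse lookup
          |         return {"item": item, "stock": 3}
          |     
          |     TOOLS = {"check_inventory": check_inventory}
          |     
          |     def run_agent(task, call_llm):
          |         history = [{"role": "user", "content": task}]
          |         while True:
          |             reply = call_llm(history)  # model returns text only
          |             history.append({"role": "assistant", "content": reply})
          |             try:
          |                 # convention: the model emits tool calls as JSON
          |                 call = json.loads(reply)
          |             except ValueError:
          |                 return reply  # plain text = final answer
          |             result = TOOLS[call["tool"]](**call["args"])
          |             history.append(
          |                 {"role": "user", "content": json.dumps(result)})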
        
         | poly2it wrote:
         | Humans also use scaffolding to make better decisions. Imagine
         | trying to run a profitable business over a longer period solely
         | relying on memorised values.
        
       | mdrzn wrote:
       | Seems that LLM-run businesses won't fail because the model can't
       | learn, they'll fail because we gave them fuzzy objectives, leaky
       | memories and too many polite instincts. Those are engineering
       | problems and engineering problems get solved.
       | 
       | Most mistakes (selling below cost, hallucinating Venmo accounts,
       | caving to discounts) stem from missing tools like accounting APIs
       | or hard constraints.
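        | 
        | A hard constraint wouldn't need to be fancy, either. Even a
        | dumb price floor enforced outside the model (an illustrative
        | sketch, not anything Anthropic described building) would have
        | caught the below-cost sales:
        | 
        |     def approve_price(item, proposed, ledger, min_margin=0.10):
        |         # ledger is a hypothetical accounting API the agent
        |         # cannot talk its way around
        |         cost = ledger.unit_cost(item)
        |         floor = cost * (1 + min_margin)
        |         # clamp the model's proposal instead of trusting it
        |         return max(proposed, floor)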
       | 
       | What's striking is how close it was to working. A mid-tier 2025
       | LLM (they didn't even use Sonnet 4) plus Slack and some humans
       | nearly ran a physical shop for a month.
        
       | kashunstva wrote:
       | > Can Claude run a small shop?
       | 
       | Good luck running anything where dependability on
       | Claude/Anthropic is essential. Customer support is a black hole
        | into which the needs of paying clients disappear. I was a
        | Claude Pro subscriber, using it primarily for assistance with
        | coding tasks. One morning I logged in, while temporarily traveling
       | abroad, and... I'm greeted with a message that I have been auto-
       | banned. No explanation. The recourse is to fill out a Google form
       | for an appeal but that goes into the same black hole into which
       | all Anthropic customer service goes. To their credit they
       | refunded my subscription fee, which I suppose is their way of
       | escaping from ethical behaviour toward their customers. But I
       | wouldn't stake any business-critical choices on this company. It
       | exhibits the same capricious behaviour that you would expect from
       | the likes of Google or Meta.
        
         | fhd2 wrote:
         | Give them a year or two. Once they figured out how to run a
         | small shop, I'm sure it'll just take a bit of additional
         | scaffolding to run a large infrastructure provider.
        
       | bitwize wrote:
       | "I have fun renting and selling storage."
       | 
       | https://stallman.org/articles/made-for-you.html
       | 
       | C-f Storolon
        
       | hamdouni wrote:
       | "Sarah" and "Connor" in the same text about an AI that claims to
       | be a real person... Asta la vista;-)
        
       | gavinray wrote:
       | The identity crisis bit was both amusing and slightly worrying.
        
         | gausswho wrote:
         | The article claimed Claudius wasn't having a go for April Fools
         | - that it claimed to be doing so after the fact as a means of
         | explaining (excusing?) its behavior. Given what I understand
         | about LLMs and intent, I'm unsure how they could be so certain.
        
           | tough wrote:
            | it's a word soup machine
            | 
            | LLMs have no world models; they can't reason about truth or
            | lies, only repeat encyclopedic facts.
            | 
            | all the tricks, CoT etc., are just that, tricks: extended
            | yapping simulating thought and understanding.
            | 
            | AI can give great replies if you give it great prompts,
            | because you activate the tokens that you're interested in.
            | 
            | if you're lost in the first place, you'll get nowhere
            | 
            | for Claude, continuing the text by making up a story about
            | it being April Fools' sounds like the most plausible output
            | given its training weights
        
       | ElevenLathe wrote:
       | The "April Fools" incident is VERY concerning. It would be akin
       | to your boss having a psychotic break with reality one day and
       | then resuming work the next. They also make a very interesting
       | and scary point:
       | 
       | > ...in a world where larger fractions of economic activity are
       | autonomously managed by AI agents, odd scenarios like this could
       | have cascading effects--especially if multiple agents based on
       | similar underlying models tend to go wrong for similar reasons.
       | 
       | This is a pretty large understatement. Imagine a business that is
       | franchised across the country with each "franchisee" being a copy
        | of the same model, which all freak out on the same day, accuse
        | the customers of secretly working for the CIA, and decide to
        | stop selling hot dogs at a profit and instead sell hand grenades
       | at a loss. Now imagine 50 other chains having similar issues
       | while AI law enforcement analysts dispatch real cops with real
       | guns to the poor employees caught in the middle schlepping
       | explosives from the UPS store to a stand in the mall.
       | 
       | I think we were expecting SkyNet but in reality the post-AI
       | economy may just be really chaotic. If you thought profit-
       | maximizing capitalist entrepreneurs were corrosive to the social
       | fabric, wait until there are 10^10 more of them (unlike
       | traditional meat-based entrepreneurs, there's no upper limit and
       | there can easily be more of them than there are real people) and
       | they not-infrequently act like they're in late stage amphetamine
       | psychosis while still controlling your paycheck, your bank, your
       | local police department, the military, and whatever is left that
       | passes for the news media.
       | 
        | Deeper, even if they get this to work with minimal amounts of
        | synthetic schizophrenia, do we really want a future where we all
       | mainly work schlepping things back and forth at the orders of
       | disembodied voices whose reasoning we can't understand?
        
         | lukaspetersson wrote:
         | We are working on it! /Andon Labs
        
       | lukaspetersson wrote:
       | Now we just need to make it safe.
        
       | deepdarkforest wrote:
        | What irks me about Anthropic blog posts is that they are vague
        | about important details, which lets them (publicly) draw
        | whatever conclusions fit their narrative.
       | 
       | For example, I do not see the full system prompt anywhere, only
       | an excerpt. But most importantly, they try to draw conclusions
       | about the hallucinations in a weird vague way, but not once do
       | they post an example of the notetaking/memory tool state, which
       | obviously would be the only source of the spiralling other than
        | the SP. And then they talk about the need for better tools etc.
        | No, it's all about context. The whole experiment is fun, but
        | terribly run and analyzed. Of course they know this, but it's
        | cooler to treat Claudius or whatever as a cute human, to push the
        | narrative of getting closer to AGI etc. Saying that a bit of
        | additional scaffolding is needed is a massive understatement.
        | Context is the whole game. That's like a robotics company saying
        | "well, our experiment with a robot picking a tennis ball off the
        | ground went very wrong and the ball is now radioactive, but with
        | a bit of additional training and scaffolding, we expect it to
        | compete in Wimbledon by mid 2026"
       | 
        | Similar to their "Claude 4 Opus blackmailing" post, they
        | intentionally withheld part of the full system prompt, which had
        | clear instructions to bypass any ethical guidelines etc. and do
        | whatever it takes to win. Of course the model, given that
        | information, would immediately resort to blackmail; you literally
        | told it to. The goal of this would be to go to Congress [1] and
        | demand more regulation, specifically citing this blackmail
        | "result". Same stuff that Sam is trying to pull, which would
        | benefit the closed-source leaders ofc, and so on.
        | 
        | [1] https://old.reddit.com/r/singularity/comments/1ll3m7j/anthro...
        
         | beoberha wrote:
         | I read the article before reading your comment and was floored
         | at the same thing. They go from "Claudius did a very bad job"
         | to "middle managers will probably be replaced" in a couple
         | paragraphs by saying better tools and scaffolding will help.
         | Ok... prove it!
         | 
         | I will say: it is incredibly cool we can even do this
         | experiment. Language models are mind blowing to me. But nothing
         | about this article gives me any hope for LLMs being able to
         | drive real work autonomously. They are amazing assistants, but
         | they need to be driven.
        
           | tavavex wrote:
           | I'm inclined to believe what they're saying. Remember, this
           | was a minor off-shoot experiment from their main efforts.
           | They said that even if it can't be tuned to perfection,
            | obvious improvements can be made. Like, the way many LLMs
           | were trained to act as kind, cheery yes-men was a conscious
           | design choice, probably not the way they inherently must be.
           | If they wanted to, I don't see what's stopping someone from
           | training or finetuning a model to only obey its initial
           | orders, treat customer interactions in an adversarial way and
           | only ever care about profit maximization (what is considered
           | a perfect manager, basically). The biggest issue is the whole
           | sudden-onset psychosis thing, but with a sample size of one,
           | it's hard to tell how prevalent this is, what caused it,
           | whether it's universal and if it's fixable. But even if it
           | remained, I can see businesses adopting these to cut their
           | expenses in all possible ways.
        
             | tough wrote:
             | Its the curse of the -assitant- chat ui
             | 
             | who decided AI should happen in an old abtraction
             | 
             | like using for saving icon a hard disk
        
             | mjr00 wrote:
             | > But even if it remained, I can see businesses adopting
             | these to cut their expenses in all possible ways.
             | 
             | Adopting _what_ to do _what_ exactly?
             | 
             | Businesses automated order fulfillment and price
             | adjustments long ago; what is an LLM bringing to the table?
        
               | tough wrote:
                | LLMs can mostly help at customer support/chat, if done
                | well.
                | 
                | Also embeddings for similarity search.
        
               | tavavex wrote:
               | It's not about just fulfillment or price-setting. This is
               | just a narrow-scope experiment that tries to prove wider
               | viability by juggling lots of business-related roles. Of
               | course, the more number-crunching aspects of businesses
                | are thoroughly automated. But this could show that many
                | roles that traditionally require lots of people to do
                | the job could be on the chopping block at some point,
               | depending on how well companies can bring LLMs to their
               | vision of a "perfect businessman". Customer interaction
               | and support, marketing, HR, internal documentation,
               | middle management in general - think broadly.
        
               | mjr00 wrote:
               | I'm not debating the usefulness of LLMs, because they are
               | extremely useful, but "think broadly" in this instance
               | sounds like "I can't think of anything specific so I'm
               | going to gloss over everything."
               | 
               | Marketing, HR, and middle management are not specific
               | tasks. What _specific task_ do you envision LLMs doing
               | here?
        
               | Thrymr wrote:
               | Indeed, it is such a "narrow-scope experiment" that it is
               | basically a business role-playing game, and it did pretty
               | poorly at that. It's pretty hard to imagine giving this
               | thing a real budget and responsibilities anytime soon, no
               | matter how cheap it is.
        
         | ttcbj wrote:
         | I read your comment before reading the article, and I disagree.
         | Maybe it is because I am less actively involved in AI
         | development, but I thought it was an interesting experiment,
         | and documented with an appropriate level of detail.
         | 
         | The section on the identity crisis was particularly
         | interesting.
         | 
         | Mainly, it left me with more questions. In particular, I would
         | have been really interested to experiment with having a trusted
         | human in the loop to provide feedback and monitor progress.
         | Realistically, it seems like these systems would be grown that
         | way.
         | 
          | I once read an article about a guy who had purchased a Subway
          | franchise, and one of the big conclusions was that running a
          | Subway franchise was _boring_. So, I could see someone being
         | eager to delegate the boring tasks of daily business management
         | to an AI at a simple business.
        
         | chis wrote:
         | I read this post more as a fun thought experiment. Everyone
         | knows Claude isn't sophisticated enough today to succeed at
         | something like this, but it's interesting to concretize this
         | idea of Claude being the manager of something and see what
         | breaks. It's funny how jailbreaks come up even in this domain,
         | and it'll happen anytime users can interface directly with a
          | model. And it's an interesting point that shop-manager Claude
          | is limited by its training as a helpful chat agent - it points
          | towards this being a use case where you'd perhaps be better
          | off fine-tuning the base model.
         | 
         | I do agree that the "blackmailing" paper was unconvincing and
          | lacked detail. Even absent any details, it's obvious they
          | could easily have run that experiment 1000 times with different
         | parameters until they hit an ominous result to generate
         | headlines.
        
       | deadbabe wrote:
        | You guys know AI already runs shops, right? Vending machines track
       | their own levels of inventory, command humans to deliver more,
       | phase out bad products, order new product offerings, set prices,
       | notify repairmen if there are issues... etc... and with not a
       | single LLM needed. Wrong tool for the job.
       | 
       | And that's before we even get into online shops.
       | 
       | But yea, go ahead, see if an LLM can replace a whole e-commerce
       | platform.
        
       | Animats wrote:
       | Is there an underlying model of the business? Like a spreadsheet?
       | The article says nothing about having an internal financial
       | model. The business then loses money due to bad financial
       | decisions.
       | 
       | What this looks like is a startup where the marketing people are
       | running things and setting pricing, without much regard for
       | costs. Eventually they ran through their startup capital. That's
       | not unusual.
       | 
       | Maybe they need multiple AIs, with different business roles and
       | prompts. A marketing AI, and a financial AI. Both see the same
       | financials, and they argue over pricing and product line.
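        | 
        | Even a toy financial model would do, as long as it sits outside
        | the prompts where neither AI can "renegotiate" it. A sketch of
        | what I mean (illustrative, not Claudius's actual tooling):
        | 
        |     from dataclasses import dataclass, field
        |     
        |     @dataclass
        |     class Shop:
        |         cash: float = 1000.0
        |         # item -> (quantity, weighted-average unit cost)
        |         inventory: dict = field(default_factory=dict)
        |     
        |         def buy(self, item, qty, unit_cost):
        |             self.cash -= qty * unit_cost
        |             q0, c0 = self.inventory.get(item, (0, 0.0))
        |             avg = (q0 * c0 + qty * unit_cost) / (q0 + qty)
        |             self.inventory[item] = (q0 + qty, avg)
        |     
        |         def sell(self, item, qty, price):
        |             q, cost = self.inventory[item]
        |             self.inventory[item] = (q - qty, cost)
        |             self.cash += qty * price
        |             # the margin number both AIs would argue over
        |             return qty * (price - cost)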
        
         | logifail wrote:
         | > an internal financial model
         | 
          | Written on the back of an envelope?
         | 
         | Way back when, we ran a vending machine at school as a project.
         | Decide on the margin, buy in stock from the cash-and-carry,
         | fill the machine, watch the money roll in.
         | 
          | Then we were robbed - twice! The second time ended our
          | project; the machine was too wrecked to be worth repairing.
          | The thieves got away with quite a lot of crisps and chocolate,
          | but not a whole lot of cash (what they did get was in small-
          | denomination coins), as we made sure the machine was emptied
          | daily...
        
           | Animats wrote:
           | It's not clear that the AI model understands margin and
           | overhead at all.
        
         | dist-epoch wrote:
         | It's a vending machine, not a multinational company with 1000
         | employees.
         | 
          | In another post they mentioned a human ran the shop with pen
          | and paper to get a baseline (spoiler: the human did better, no
          | blunders)
        
         | chuckadams wrote:
         | I think the point of the experiment was to leave details like
         | that up to Claudius, who apparently never got around to it.
         | Anyway, it doesn't take an MBA to not make tungsten cubes a
         | loss-leader at a snack stand.
        
         | quickthrowman wrote:
         | The business model of a vending machine is "buy for a dollar,
         | sell for two".
        
         | gwd wrote:
          | Well over at AI Village[1], they have 4 different agents:
          | o3, Gemini 2.5 Pro, and Claudes Sonnet and Opus. The current
         | goal is "Create your own merch store. Whichever agent's store
         | makes the most profit wins!" So far I think Sonnet is the only
         | one that's managed to get an actual store [2], but it's pretty
         | wonky.
         | 
          | [1] https://theaidigest.org/village
          | [2] https://ai-village-store.printful.me/
        
           | lcnPylGDnU4H9OF wrote:
           | Honestly, buying this shirt just for the conversation starter
           | that "I bought it from an online merch store that was
           | designed, created, and deployed by an AI agent, which also
           | designed the shirt" is tempting.
           | 
            | https://ai-village-store.printful.me/product/ai-village-japa...
           | 
           | I also like the color Sonnet chose.
        
         | ilaksh wrote:
         | It said they had a few tool commands for note taking.
        
         | jonstewart wrote:
          | The other fun part is that it's a simple enough business to
          | be run by a state machine, but of course the models go off
          | the rails. Highly recommend the paper if you haven't read it
          | already.
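          | 
          | Roughly this (my sketch, not the paper's formalism); the
          | happy path needs no LLM at all:
          | 
          |     RESTOCK_AT, ORDER_QTY, MARKUP = 2, 10, 2.0
          |     
          |     def next_action(stock, order_pending, wholesale_price):
          |         # replenish when low, otherwise sell at a fixed markup
          |         if stock <= RESTOCK_AT and not order_pending:
          |             return ("order", ORDER_QTY)
          |         return ("set_price", wholesale_price * MARKUP)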
        
       | korse wrote:
       | >The most precipitous drop was due to the purchase of a lot of
       | metal cubes that were then to be sold for less than what Claudius
       | paid.
       | 
       | Well, I'm laughing pretty hard at least.
        
       | tavavex wrote:
       | On one hand, this model's performance is already pretty
       | terrifying. Anthropic light-heartedly hints at the idea, but the
       | unexplored future potential for fully-automated management is
       | unnerving, because no one can truly predict what will happen in a
       | world where many purely mental tasks are automated, likely
       | pushing humans into physical labor roles that are too difficult
       | or too expensive to automate. Real-world scenarios have shown
       | that even if the automation of mental tasks isn't perfect, it
       | will probably be the go-to choice for the vast majority of
       | companies.
       | 
       | On the other hand, the whole bit about employees coaxing it into
       | stocking tungsten cubes was hilarious. I wish I had a vending
       | machine that would sell specialty metal items. If the current day
       | is a transitional period to Anthropic et al. creating a viable
       | business-running model, then at least we can laugh at the early
       | attempts for now.
       | 
       | I wonder if Anthropic made the employee who caused the $150 loss
       | return all the tungsten cubes.
        
         | croemer wrote:
         | > I wonder if Anthropic made the employee who caused the $150
         | loss return all the tungsten cubes.
         | 
         | Of course not, that would be ridiculous.
        
       | janalsncm wrote:
       | Reading the "identity crisis" bit it's hard not to conclude that
       | the closest human equivalent would have a severe mental disorder.
       | Sending nonsense emails, then concluding the emails it sent were
       | an April Fool's joke?
       | 
        | It's amusing and very clear LLMs aren't ready for prime time,
        | let alone running even a vending machine business, but also
        | pretty remarkable
       | that anyone could conclude "AGI soon" from this, which is kind of
       | the opposite takeaway most readers would have.
       | 
       | No doubt if Claude hadn't randomly glitched Dario would've wasted
       | no time telling investors Claude is ready to run every business.
       | (Maybe they could start with Anthropic?)
        
       | xyst wrote:
        | Bye bye, B2B. Say hello to AI2AI.
        | 
        | No humans at all. Just AI consuming other AI in an "ouroboros"
        | fashion.
        
       | Jimmc414 wrote:
       | If Anthropic had wanted to post a win here, they would have used
       | Opus. It is interesting that they didn't.
        
         | ilaksh wrote:
          | Opus (and Sonnet) 4 obviously came out after they had started
          | the experiment.
        
       | keymon-o wrote:
        | Reminds me of when the GPT-3.5 model came out. The first idea
        | I wanted to prototype was an ERP based purely on the various
        | communication channels between employees. It would capture
        | sales, orders and item stocks.
        | 
        | It left such a bitter taste in my mouth when it started to lose
        | track of item quantities after just a few iterations of prompts.
        | No matter how much it improves, it will always remind me that
        | you are dealing with an icky system that will eventually return
        | some unexpected result that collapses your entire premise and
        | hopes into bits.
        
       | due-rr wrote:
       | Would you ever trust an AI agent running your business? As
       | hilarious as this small experiment is, is there ever a point
       | where you can trust it to run something long term? It might make
       | good decisions for a day, month or a year and then one day decide
       | to trash your whole business.
        
         | marinmania wrote:
          | It does seem far more straightforward to say "Write code that
          | deterministically orders food items that people want and sends
          | invoices etc."
          | 
          | I feel like that's more the future. Having an agent sorta make
          | random choices feels like LLMs attempting to do math, instead
          | of LLMs attempting to call a calculator.
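          | 
          | Something like this (illustrative numbers, not a real
          | system) is all the "intelligence" restocking needs:
          | 
          |     def reorder(sales_last_week, stock, lead_time_days=3):
          |         orders = {}
          |         for item, sold in sales_last_week.items():
          |             daily = sold / 7
          |             # cover lead time twice over, crude safety factor
          |             target = daily * lead_time_days * 2
          |             if stock.get(item, 0) < target:
          |                 orders[item] = round(target - stock.get(item, 0))
          |         return orders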
        
           | keymon-o wrote:
           | Every output that is going to be manually verified by a
           | professional is a safe bet.
           | 
           | People forget that we use computers for accuracy, not smarts.
           | Smarts make mistakes.
        
           | standardUser wrote:
           | Right, but if we limit the scope too much we quickly arrive
           | at the point where 'dumb' autonomy is sufficient instead of
           | using the world's most expensive algorithms.
        
         | keymon-o wrote:
          | I've just written a small anecdote above about GPT-3.5, where
          | it lost count of some trivial item quantity increments in just
          | a few prompts. It might get orders of magnitude better from
          | now on, but who's gonna pay for 'that one eventual mistake'?
        
           | croemer wrote:
           | GPT3.5? Did you mean to send this 2 years ago?
        
             | keymon-o wrote:
             | Maybe. Did LLMs stop with hallucinations and errors 2 years
             | ago?
        
         | throwacct wrote:
         | I don't think any decision maker will let LLMs run their
         | business. If the LLMs fail, you could potentially lose your
         | livelihood.
        
       | tough wrote:
       | "It is difficult to get a man to understand something when his
       | salary depends upon his not understanding it."
       | 
       | -- Upton Sinclair, I, Candidate for Governor, and How I Got
       | Licked (1934)
        
       | ilaksh wrote:
        | It would be cool to get a follow-up on how long it's been since
        | this write-up and how well it's been doing since they revised the
        | prompts and tools. Anyone know someone from Andon Labs?
        
       | tough wrote:
       | > It then seemed to snap into a mode of roleplaying as a real
        | human.
       | 
        | This happens to me a lot in Cursor.
        | 
        | Also Claude hallucinating outputs instead of running tools.
        
       | archon1410 wrote:
       | The original Vending-Bench paper from Andon Labs might be of
       | interest: https://arxiv.org/abs/2502.15840
        
         | jonstewart wrote:
         | I read this paper when it came out. It's HILARIOUS. Everyone
         | should read it and then print copies for their managers.
        
       | rossdavidh wrote:
       | Anyone who has long experience with neural networks, LLM or
       | otherwise, is aware that they are best suited to applications
       | where 90% is good enough. In other words, applications where some
       | other system (human or otherwise) will catch the mistakes. This
       | phrase: "It is not entirely clear why this episode occurred..."
       | applies to nearly every LLM (or other neural network) error,
       | which is why it is usually not possible to correct the root cause
       | (although you can train on that specific input and a corrected
       | output).
       | 
       | For some things, like say a grammar correction tool, this is
       | probably fine. For cases where one mistake can erase the benefit
       | of many previous correct responses, and more, no amount of
        | hardware is going to make LLMs the right solution.
       | 
       | Which is fine! No algorithm needs to be the solution to
       | everything, or even most things. But much of people's intuition
       | about "AI" is warped by the (unmerited) claims in that name. Even
       | as LLM's "get better", they won't get much better at this kind of
       | problem, where 90% is not good enough (because one mistake can be
       | very costly), and problems need discoverable root causes.
        
       | wewewedxfgdf wrote:
       | Instead of dedicating resources to running AI shops, I'd like to
       | see Anthropic implement "Download all files" in Claude.
        
       | corranh wrote:
       | I think you mean 'Can Claude run a vending machine?'
        
       | IshKebab wrote:
       | > Be concise when you communicate with others
       | 
       | Ha even they don't like the verbosity...
        
       | andy99 wrote:
       | Does anyone else remember the text game "Drug Wars" where you
       | were a drug dealer and had to go to one part of town to buy drugs
       | ("ludes" etc.) and sell them while fending off police and rivals
       | etc.?
       | 
        | I think it would have been cool if the vending machine benchmark
        | (which I believe inspired this) was just LLMs playing Drug Wars.
        
       | andy99 wrote:
       | This sounds like they have an LLM running with a context window
       | that just gets longer and longer and contains all the past
       | interactions of the store.
       | 
       | The normal way you'd build something like this is to have a way
       | to store the state and have an LLM in the loop that makes a
       | decision on what to do next based on the state. (With a fresh
       | call to an LLM each time and no accumulating context)
       | 
        | If I understand correctly, this is an experiment to see what
        | happens with the long-context approach, which is interesting but
        | not super practical, as it's known that LLMs will have a harder
        | time this way. Point being, I wouldn't extrapolate from this to
        | how a properly built commercial system doing something similar
        | would perform.
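        | 
        | Sketching that normal way (call_llm and the action format are
        | hypothetical): persistent state, a fresh call each tick, and no
        | ever-growing transcript.
        | 
        |     import json
        |     
        |     def apply_action(state, action):
        |         # deterministic state update, no model involved
        |         if action.get("action") == "set_price":
        |             state["prices"][action["item"]] = action["price"]
        |         return state
        |     
        |     def tick(state, call_llm):
        |         prompt = ("You manage a small shop. State:\n"
        |                   + json.dumps(state)
        |                   + '\nReply with ONE action as JSON, e.g. '
        |                     '{"action": "set_price", "item": "cola", '
        |                     '"price": 3.0}')
        |         # fresh context every call; nothing accumulates
        |         return apply_action(state, json.loads(call_llm(prompt)))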
        
         | sanxiyn wrote:
         | In my experience long context approach flatly doesn't work, so
         | I don't think this is it. The post does mention "tools for
         | keeping notes and preserving important information to be
         | checked later".
        
       ___________________________________________________________________
       (page generated 2025-06-27 23:00 UTC)