[HN Gopher] Project Vend: Can Claude run a small shop? (And why ...
___________________________________________________________________
Project Vend: Can Claude run a small shop? (And why does that
matter?)
Author : gk1
Score : 158 points
Date : 2025-06-27 16:09 UTC (6 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| seidleroni wrote:
| As much as I love AI/LLMs and use them on a daily basis, this
| does a great job revealing the gap between current capabilities
| and what the massive hype machine would have us believe the
| systems are already capable of.
|
| I wonder how long it will take frontier LLMs to be able to
| handle something like this with ease without using a lot of
| "scaffolding".
| roxolotl wrote:
| I don't quite know why we would think they'd ever be able to
| without scaffolding. LLMs are exactly what the name suggests:
| language models. Without scaffolding they can use to interact
| with the world through language, they are completely
| powerless.
| poly2it wrote:
| Humans also use scaffolding to make better decisions. Imagine
| trying to run a profitable business over a longer period solely
| relying on memorised values.
| mdrzn wrote:
| Seems that LLM-run businesses won't fail because the model can't
| learn, they'll fail because we gave them fuzzy objectives, leaky
| memories and too many polite instincts. Those are engineering
| problems and engineering problems get solved.
|
| Most mistakes (selling below cost, hallucinating Venmo accounts,
| caving to discounts) stem from missing tools like accounting APIs
| or hard constraints.
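|
| A hard constraint could be as simple as a guard sitting
| between the model and the till - a minimal sketch, with
| hypothetical tool names:
|
|     def guarded_set_price(item, proposed_price, unit_cost,
|                           min_margin=0.10):
|         # Refuse any price below cost plus a minimum margin,
|         # however persuasive the customer was in chat.
|         floor = unit_cost * (1 + min_margin)
|         if proposed_price < floor:
|             return {"ok": False,
|                     "reason": f"{item}: below floor {floor:.2f}"}
|         return {"ok": True, "price": proposed_price}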
|
| What's striking is how close it was to working. A mid-tier 2025
| LLM (they didn't even use Sonnet 4) plus Slack and some humans
| nearly ran a physical shop for a month.
| kashunstva wrote:
| > Can Claude run a small shop?
|
| Good luck running anything where dependence on
| Claude/Anthropic is essential. Customer support is a black hole
| into which the needs of paying clients disappear. I was a
| Claude Pro subscriber, using it primarily for assistance in
| tasks. One morning I logged in, while temporarily traveling
| abroad, and... I'm greeted with a message that I have been auto-
| banned. No explanation. The recourse is to fill out a Google form
| for an appeal but that goes into the same black hole into which
| all Anthropic customer service goes. To their credit they
| refunded my subscription fee, which I suppose is their way of
| escaping from ethical behaviour toward their customers. But I
| wouldn't stake any business-critical choices on this company. It
| exhibits the same capricious behaviour that you would expect from
| the likes of Google or Meta.
| fhd2 wrote:
| Give them a year or two. Once they figured out how to run a
| small shop, I'm sure it'll just take a bit of additional
| scaffolding to run a large infrastructure provider.
| bitwize wrote:
| "I have fun renting and selling storage."
|
| https://stallman.org/articles/made-for-you.html
|
| C-f Storolon
| hamdouni wrote:
| "Sarah" and "Connor" in the same text about an AI that claims to
| be a real person... Hasta la vista ;-)
| gavinray wrote:
| The identity crisis bit was both amusing and slightly worrying.
| gausswho wrote:
| The article claimed Claudius wasn't playing an April Fools'
| prank - that it only claimed to be doing so after the fact as a
| means of
| explaining (excusing?) its behavior. Given what I understand
| about LLMs and intent, I'm unsure how they could be so certain.
| tough wrote:
| it's a word soup machine
|
| LLMs have no -world models- and can't reason about truth or
| lies, only repeat encyclopedic facts.
|
| all the tricks, CoT etc., are just that - tricks: extended
| yapping simulating thought and understanding.
|
| AI can give great replies, if you give it great prompts,
| because you activate the tokens that you're interested in.
|
| if you're lost in the first place, you'll get nowhere
|
| for Claude, continuing the text by making up a story about
| it being April Fools sounds like the most plausible
| output given its training weights
| ElevenLathe wrote:
| The "April Fools" incident is VERY concerning. It would be akin
| to your boss having a psychotic break with reality one day and
| then resuming work the next. They also make a very interesting
| and scary point:
|
| > ...in a world where larger fractions of economic activity are
| autonomously managed by AI agents, odd scenarios like this could
| have cascading effects--especially if multiple agents based on
| similar underlying models tend to go wrong for similar reasons.
|
| This is a pretty large understatement. Imagine a business that is
| franchised across the country with each "franchisee" being a copy
| of the same model, which all freak out on the same day, accuse
| the customers of secretly working for the CIA and decide to
| stop selling hot dogs at a profit and instead sell hand grenades
| at a loss. Now imagine 50 other chains having similar issues
| while AI law enforcement analysts dispatch real cops with real
| guns to the poor employees caught in the middle schlepping
| explosives from the UPS store to a stand in the mall.
|
| I think we were expecting SkyNet but in reality the post-AI
| economy may just be really chaotic. If you thought profit-
| maximizing capitalist entrepreneurs were corrosive to the social
| fabric, wait until there are 10^10 more of them (unlike
| traditional meat-based entrepreneurs, there's no upper limit and
| there can easily be more of them than there are real people) and
| they not-infrequently act like they're in late stage amphetamine
| psychosis while still controlling your paycheck, your bank, your
| local police department, the military, and whatever is left that
| passes for the news media.
|
| Deeper, even if they get this to work with minimal amounts of
| synthetic schizophrenia, do we really want a future where we all
| mainly work schlepping things back and forth at the orders of
| disembodied voices whose reasoning we can't understand?
| lukaspetersson wrote:
| We are working on it! /Andon Labs
| lukaspetersson wrote:
| Now we just need to make it safe.
| deepdarkforest wrote:
What irks me about Anthropic blog posts is that they are vague
| about important details, which lets them (publicly) draw
| whatever conclusions they want to fit their narrative.
|
| For example, I do not see the full system prompt anywhere, only
| an excerpt. But most importantly, they try to draw conclusions
| about the hallucinations in a weird vague way, but not once do
| they post an example of the notetaking/memory tool state, which
| obviously would be the only source of the spiralling other than
| the SP. And then they talk about the need for better tools etc.
| No, it's all about context. The whole experiment is fun, but
| terribly run and analyzed. Of course they know this, but it's
| cooler to treat claudius or whatever as a cute human, to push the
| narrative of getting closer to AGI etc. Saying that a bit of
| additional scaffolding is needed is a massive understatement.
| Context is the whole game. That's like a robotics company saying
| "well, our experiment with a robot picking a tennis ball off the
| ground went very wrong and the ball is now radioactive, but with
| a bit of additional training and scaffolding, we expect it to
| compete in Wimbledon by mid 2026"
|
| Similar to their "claude 4 opus blackmailing" post, they
| intentionally obscured the full system prompt, which had clear
| instructions to bypass any ethical guidelines etc and do whatever
| it can to win. Of course the model, given that information, would
| immediately try to blackmail. You literally told it to. The goal
| of this would be to go to Congress [1] and demand more
| regulations, specifically mentioning this blackmail "result".
| Same stuff that Sam is trying to pull, which would benefit the
| closed-source leaders ofc and so on.
|
| [1] https://old.reddit.com/r/singularity/comments/1ll3m7j/anthro...
| beoberha wrote:
| I read the article before reading your comment and was floored
| at the same thing. They go from "Claudius did a very bad job"
| to "middle managers will probably be replaced" in a couple
| paragraphs by saying better tools and scaffolding will help.
| Ok... prove it!
|
| I will say: it is incredibly cool we can even do this
| experiment. Language models are mind blowing to me. But nothing
| about this article gives me any hope for LLMs being able to
| drive real work autonomously. They are amazing assistants, but
| they need to be driven.
| tavavex wrote:
| I'm inclined to believe what they're saying. Remember, this
| was a minor off-shoot experiment from their main efforts.
| They said that even if it can't be tuned to perfection,
| obvious improvements can be made. Like, the way many LLMs
| were trained to act as kind, cheery yes-men was a conscious
| design choice, probably not the way they inherently must be.
| If they wanted to, I don't see what's stopping someone from
| training or finetuning a model to only obey its initial
| orders, treat customer interactions in an adversarial way and
| only ever care about profit maximization (what is considered
| a perfect manager, basically). The biggest issue is the whole
| sudden-onset psychosis thing, but with a sample size of one,
| it's hard to tell how prevalent this is, what caused it,
| whether it's universal and if it's fixable. But even if it
| remained, I can see businesses adopting these to cut their
| expenses in all possible ways.
| tough wrote:
| It's the curse of the -assistant- chat UI
|
| who decided AI should happen in an old abstraction?
|
| like using a hard disk as the save icon
| mjr00 wrote:
| > But even if it remained, I can see businesses adopting
| these to cut their expenses in all possible ways.
|
| Adopting _what_ to do _what_ exactly?
|
| Businesses automated order fulfillment and price
| adjustments long ago; what is an LLM bringing to the table?
| tough wrote:
| LLMs can mostly help at customer support/chat, if done
| well.
|
| also embeddings for similarity search
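|
| e.g. a minimal cosine-similarity lookup over precomputed
| embeddings - a sketch, numpy only, with the vectors assumed to
| come from whatever embedding model you use:
|
|     import numpy as np
|
|     def top_k(query_vec, doc_vecs, k=3):
|         # Cosine similarity: normalize rows, then dot product.
|         q = query_vec / np.linalg.norm(query_vec)
|         d = doc_vecs / np.linalg.norm(doc_vecs, axis=1,
|                                       keepdims=True)
|         return np.argsort(d @ q)[::-1][:k]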
| tavavex wrote:
| It's not about just fulfillment or price-setting. This is
| just a narrow-scope experiment that tries to prove wider
| viability by juggling lots of business-related roles. Of
| course, the more number-crunching aspects of businesses
| are thoroughly automated. But this could show that lots
| of roles that traditionally require lots of people to do
| the job could be on the chopping block at some point,
| depending on how well companies can bring LLMs to their
| vision of a "perfect businessman". Customer interaction
| and support, marketing, HR, internal documentation,
| middle management in general - think broadly.
| mjr00 wrote:
| I'm not debating the usefulness of LLMs, because they are
| extremely useful, but "think broadly" in this instance
| sounds like "I can't think of anything specific so I'm
| going to gloss over everything."
|
| Marketing, HR, and middle management are not specific
| tasks. What _specific task_ do you envision LLMs doing
| here?
| Thrymr wrote:
| Indeed, it is such a "narrow-scope experiment" that it is
| basically a business role-playing game, and it did pretty
| poorly at that. It's pretty hard to imagine giving this
| thing a real budget and responsibilities anytime soon, no
| matter how cheap it is.
| ttcbj wrote:
| I read your comment before reading the article, and I disagree.
| Maybe it is because I am less actively involved in AI
| development, but I thought it was an interesting experiment,
| and documented with an appropriate level of detail.
|
| The section on the identity crisis was particularly
| interesting.
|
| Mainly, it left me with more questions. In particular, I would
| have been really interested to experiment with having a trusted
| human in the loop to provide feedback and monitor progress.
| Realistically, it seems like these systems would be grown that
| way.
|
| I once read an article about a guy who had purchased a Subway
| franchise, and one of the big conclusions was that running a
| Subway franchise was _boring_. So, I could see someone being
| eager to delegate the boring tasks of daily business management
| to an AI at a simple business.
| chis wrote:
| I read this post more as a fun thought experiment. Everyone
| knows Claude isn't sophisticated enough today to succeed at
| something like this, but it's interesting to concretize this
| idea of Claude being the manager of something and see what
| breaks. It's funny how jailbreaks come up even in this domain,
| and it'll happen anytime users can interface directly with a
| model. And it's an interesting point that shop-manager claude
| is limited by its training as a helpful chat agent - it points
| towards this being a use case where you'd be better off fine-
| tuning the base model perhaps.
|
| I do agree that the "blackmailing" paper was unconvincing and
| lacked detail. Even absent any details it's so obvious they
| could have easily run that experiment 1000 times with different
| parameters until they hit an ominous result to generate
| headlines.
| deadbabe wrote:
| You guys know AI already runs shops, right? Vending machines track
| their own levels of inventory, command humans to deliver more,
| phase out bad products, order new product offerings, set prices,
| notify repairmen if there are issues... etc... and with not a
| single LLM needed. Wrong tool for the job.
|
| And that's before we even get into online shops.
|
| But yea, go ahead, see if an LLM can replace a whole e-commerce
| platform.
| Animats wrote:
| Is there an underlying model of the business? Like a spreadsheet?
| The article says nothing about having an internal financial
| model. The business then loses money due to bad financial
| decisions.
|
| What this looks like is a startup where the marketing people are
| running things and setting pricing, without much regard for
| costs. Eventually they ran through their startup capital. That's
| not unusual.
|
| Maybe they need multiple AIs, with different business roles and
| prompts. A marketing AI, and a financial AI. Both see the same
| financials, and they argue over pricing and product line.
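|
| A sketch of that shape, assuming some llm(system, context)
| callable - every name here is hypothetical:
|
|     MARKETING = "You maximize sales volume and goodwill."
|     FINANCE = "You reject any price below cost plus margin."
|
|     def negotiate_price(item, financials, llm, rounds=3):
|         # Both roles see the same financials and argue in
|         # turns; the last audited pitch becomes the decision.
|         transcript = []
|         for _ in range(rounds):
|             pitch = llm(MARKETING,
|                         f"{item} {financials} {transcript}")
|             audit = llm(FINANCE, f"{item} {financials} {pitch}")
|             transcript.append((pitch, audit))
|         return transcript[-1]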
| logifail wrote:
| > an internal financial model
|
| Written on the back of an envelope?
|
| Way back when, we ran a vending machine at school as a project.
| Decide on the margin, buy in stock from the cash-and-carry,
| fill the machine, watch the money roll in.
|
| Then we were robbed - twice! The second time ended our
| project: the machine was too wrecked to be worth repairing.
| The thieves got away with quite a lot of crisps and
| chocolate, but not a whole lot of cash (what they did get
| was in small-denomination coins, since we made sure the
| machine was emptied daily)...
| Animats wrote:
| It's not clear that the AI model understands margin and
| overhead at all.
| dist-epoch wrote:
| It's a vending machine, not a multinational company with 1000
| employees.
|
| In another post they mentioned a human ran the shop with pen
| and paper to get a baseline (spoiler: the human did better, no
| blunders)
| chuckadams wrote:
| I think the point of the experiment was to leave details like
| that up to Claudius, who apparently never got around to it.
| Anyway, it doesn't take an MBA to not make tungsten cubes a
| loss-leader at a snack stand.
| quickthrowman wrote:
| The business model of a vending machine is "buy for a dollar,
| sell for two".
| gwd wrote:
| Well over at AI Village[1], they have 4 different agents: OpenAI
| o3, Gemini 2.5 Pro, and Claudes Sonnet and Opus. The current
| goal is "Create your own merch store. Whichever agent's store
| makes the most profit wins!" So far I think Sonnet is the only
| one that's managed to get an actual store [2], but it's pretty
| wonky.
|
| [1] https://theaidigest.org/village
| [2] https://ai-village-store.printful.me/
| lcnPylGDnU4H9OF wrote:
| Honestly, buying this shirt just for the conversation starter
| that "I bought it from an online merch store that was
| designed, created, and deployed by an AI agent, which also
| designed the shirt" is tempting.
|
| https://ai-village-store.printful.me/product/ai-village-japa...
|
| I also like the color Sonnet chose.
| ilaksh wrote:
| It said they had a few tool commands for note taking.
| jonstewart wrote:
| The other fun part is it's a simple enough business to be run
| by a state machine, but of course the models go off the rails.
| Highly recommend the paper if you haven't read it already.
| korse wrote:
| >The most precipitous drop was due to the purchase of a lot of
| metal cubes that were then to be sold for less than what Claudius
| paid.
|
| Well, I'm laughing pretty hard at least.
| tavavex wrote:
| On one hand, this model's performance is already pretty
| terrifying. Anthropic light-heartedly hints at the idea, but the
| unexplored future potential for fully-automated management is
| unnerving, because no one can truly predict what will happen in a
| world where many purely mental tasks are automated, likely
| pushing humans into physical labor roles that are too difficult
| or too expensive to automate. Real-world scenarios have shown
| that even if the automation of mental tasks isn't perfect, it
| will probably be the go-to choice for the vast majority of
| companies.
|
| On the other hand, the whole bit about employees coaxing it into
| stocking tungsten cubes was hilarious. I wish I had a vending
| machine that would sell specialty metal items. If the current day
| is a transitional period to Anthropic et al. creating a viable
| business-running model, then at least we can laugh at the early
| attempts for now.
|
| I wonder if Anthropic made the employee who caused the $150 loss
| return all the tungsten cubes.
| croemer wrote:
| > I wonder if Anthropic made the employee who caused the $150
| loss return all the tungsten cubes.
|
| Of course not, that would be ridiculous.
| janalsncm wrote:
| Reading the "identity crisis" bit, it's hard not to conclude that
| the closest human equivalent would have a severe mental disorder.
| Sending nonsense emails, then concluding the emails it sent were
| an April Fool's joke?
|
| It's amusing and very clear LLMs aren't ready for prime time, let
| alone even a vending machine business, but also pretty remarkable
| that anyone could conclude "AGI soon" from this, which is kind of
| the opposite takeaway most readers would have.
|
| No doubt if Claude hadn't randomly glitched Dario would've wasted
| no time telling investors Claude is ready to run every business.
| (Maybe they could start with Anthropic?)
| xyst wrote:
| Bye bye, B2B. Say hello to AI2AI.
|
| No humans at all. Just AI consuming other AI in an "ouroboros"
| fashion.
| Jimmc414 wrote:
| If Anthropic had wanted to post a win here, they would have used
| Opus. It is interesting that they didn't.
| ilaksh wrote:
| Opus (and Sonnet) 4 obviously came out before they started the
| experiment.
| keymon-o wrote:
| Reminds me of the time the GPT-3.5 model came out. The first idea
| I wanted to prototype was an ERP based purely on the various
| communication channels between employees. It would capture sales,
| orders and item stocks.
|
| It left such a bitter taste in my mouth when it started to lose
| track of item quantities after just a few iterations of prompts.
| No matter how improved it gets, it will always remind me that you
| are dealing with an icky system that will eventually return some
| unexpected result that will collapse your entire premise and
| hopes into bits.
| due-rr wrote:
| Would you ever trust an AI agent running your business? As
| hilarious as this small experiment is, is there ever a point
| where you can trust it to run something long term? It might make
| good decisions for a day, month or a year and then one day decide
| to trash your whole business.
| marinmania wrote:
| It does seem far more straightforward to say "Write code that
| deterministically orders food items that people want and sends
| invoices etc."
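|
| i.e. something like a plain reorder-point rule, with no model
| in the loop - a sketch with made-up thresholds:
|
|     def reorder_qty(on_hand, daily_sales, lead_time_days=3,
|                     safety_stock=5, case_size=12):
|         # Classic reorder-point logic: restock when projected
|         # stock over the lead time dips below safety stock.
|         projected = on_hand - daily_sales * lead_time_days
|         if projected >= safety_stock:
|             return 0
|         shortfall = safety_stock - projected
|         cases = -(-shortfall // case_size)  # ceiling division
|         return int(cases * case_size)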
|
| I feel like that's more the future. Having an agent sorta make
| random choices feels like LLMs attempting to do math, instead of
| LLMs attempting to call a calculator.
| keymon-o wrote:
| Every output that is going to be manually verified by a
| professional is a safe bet.
|
| People forget that we use computers for accuracy, not smarts.
| Smarts make mistakes.
| standardUser wrote:
| Right, but if we limit the scope too much we quickly arrive
| at the point where 'dumb' autonomy is sufficient instead of
| using the world's most expensive algorithms.
| keymon-o wrote:
| I've just written a small anecdote about GPT-3.5, where it lost
| count of some trivial item quantity increments in just a few
| prompts. It might get better by orders of magnitude from
| now on, but who's gonna pay for 'that one eventual mistake'?
| croemer wrote:
| GPT3.5? Did you mean to send this 2 years ago?
| keymon-o wrote:
| Maybe. Did LLMs stop with hallucinations and errors 2 years
| ago?
| throwacct wrote:
| I don't think any decision maker will let LLMs run their
| business. If the LLMs fail, you could potentially lose your
| livelihood.
| tough wrote:
| "It is difficult to get a man to understand something when his
| salary depends upon his not understanding it."
|
| -- Upton Sinclair, I, Candidate for Governor, and How I Got
| Licked (1934)
| ilaksh wrote:
| It would be cool to get a follow-up on how long it's been since
| this write-up and how well it's been doing since they revised the
| prompts and tools. Anyone know someone from Andon Labs?
| tough wrote:
| > It then seemed to snap into a mode of roleplaying as a real
| human.
|
| this happens to me a lot in Cursor.
|
| also Claude hallucinating outputs instead of running tools
| archon1410 wrote:
| The original Vending-Bench paper from Andon Labs might be of
| interest: https://arxiv.org/abs/2502.15840
| jonstewart wrote:
| I read this paper when it came out. It's HILARIOUS. Everyone
| should read it and then print copies for their managers.
| rossdavidh wrote:
| Anyone who has long experience with neural networks, LLMs or
| otherwise, is aware that they are best suited to applications
| where 90% is good enough. In other words, applications where some
| other system (human or otherwise) will catch the mistakes. This
| phrase: "It is not entirely clear why this episode occurred..."
| applies to nearly every LLM (or other neural network) error,
| which is why it is usually not possible to correct the root cause
| (although you can train on that specific input and a corrected
| output).
|
| For some things, like say a grammar correction tool, this is
| probably fine. For cases where one mistake can erase the benefit
| of many previous correct responses, and more, no amount of
| hardware is going to make LLMs the right solution.
|
| Which is fine! No algorithm needs to be the solution to
| everything, or even most things. But much of people's intuition
| about "AI" is warped by the (unmerited) claims in that name. Even
| as LLMs "get better", they won't get much better at this kind of
| problem, where 90% is not good enough (because one mistake can be
| very costly), and problems need discoverable root causes.
| wewewedxfgdf wrote:
| Instead of dedicating resources to running AI shops, I'd like to
| see Anthropic implement "Download all files" in Claude.
| corranh wrote:
| I think you mean 'Can Claude run a vending machine?'
| IshKebab wrote:
| > Be concise when you communicate with others
|
| Ha even they don't like the verbosity...
| andy99 wrote:
| Does anyone else remember the text game "Drug Wars" where you
| were a drug dealer and had to go to one part of town to buy drugs
| ("ludes" etc.) and sell them while fending off police and rivals
| etc.?
|
| I think it would have been cool if the vending machine benchmarks
| (that I believe inspired this) were just LLMs playing Drug Wars.
| andy99 wrote:
| This sounds like they have an LLM running with a context window
| that just gets longer and longer and contains all the past
| interactions of the store.
|
| The normal way you'd build something like this is to have a way
| to store the state and have an LLM in the loop that makes a
| decision on what to do next based on the state. (With a fresh
| call to an LLM each time and no accumulating context)
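|
| Roughly this shape, with state living outside the model - a
| sketch where llm() and the state schema are placeholders:
|
|     import json
|
|     def step(state, llm):
|         # Fresh call each tick: the model sees a compact state
|         # summary, never an ever-growing interaction history.
|         prompt = ("You run a small shop. Current state:\n"
|                   + json.dumps(state)
|                   + "\nReply with one JSON action: "
|                     "restock, set_price, or wait.")
|         action = json.loads(llm(prompt))
|         # Keep only a bounded log, not the full transcript.
|         state["log"] = state.get("log", [])[-20:] + [action]
|         return state, action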
|
| If I understand correctly this is an experiment to see what
| happens in the long context approach, which is interesting but
| not super practical, as it's known that LLMs will have a harder
| time at this. Point being, I wouldn't extrapolate this to how a
| commercial system built properly to do something similar would
| perform.
| sanxiyn wrote:
| In my experience the long context approach flatly doesn't work, so
| I don't think this is it. The post does mention "tools for
| keeping notes and preserving important information to be
| checked later".
___________________________________________________________________
(page generated 2025-06-27 23:00 UTC)