[HN Gopher] AI agents: Less capability, more reliability, please
___________________________________________________________________
AI agents: Less capability, more reliability, please
Author : serjester
Score : 279 points
Date : 2025-03-31 14:45 UTC (8 hours ago)
(HTM) web link (www.sergey.fyi)
(TXT) w3m dump (www.sergey.fyi)
| cryptoz wrote:
| This is refreshing to read. I, like everyone apparently, am
| working on my own coding agent [1]. And I suppose it's not that
| capable yet. But it sure is getting more reliable. I have it only
| modify 1 file at a time. It generates tickets for itself to
| complete - but never enough tickets to really get all the work
| done. The tickets it does generate, however, it often can
| complete (at least, in simple cases haha). The file modification
| is done through parsing ASTs and modifying those, so the AI
| doesn't go off and do all kinds of things to your whole codebase.
|
| And I'm so sick of everything trying for 100% automation and
| failing. There's a place for the human in the loop, in _quickly_
| identifying bugs the AI doesn't have the context for, or large-
| scale vision, or security or product-focused mindset, etc.
|
| It's going to be AI and humans collaborating. The solutions that
| figure that out the best are going to win IMO. AI won't be doing
| everything and humans won't be doing it all either. The tools
| with the best human-AI collaboration are where it's at.
|
| [1] https://codeplusequalsai.com
| helltone wrote:
| How do you modify ASTs?
| cryptoz wrote:
| I support HTML, JS, Python and CSS. For HTML (not
| technically an AST), I give the LLM the original-file HTML
| source, and then I instruct it to write python code that uses
| BeautifulSoup to modify the HTML. Then I get the string back
| from python of the full HTML file, modified according to the
| user prompt.
|
| For python changes I use ast and astor packages, for JS I use
| esprima/escodegen/estraverse, and for CSS I use postcss. The
| process is the same for each one: I give the original input
| source file, and I instruct the LLM to parse the file into AST
| form and then write code that modifies that AST.
|
| I blogged about it here if you want more details!
| https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
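|
| For a rough idea of the shape of this, here's a minimal sketch
| of the Python path (the rename task and file names are
| illustrative, not taken from the actual prompts):
|
|   # Sketch of the kind of script the LLM is asked to emit for a
|   # Python edit: parse to an AST, transform it, regenerate source.
|   import ast
|   import astor  # third-party: pip install astor
|
|   source = open("input.py").read()
|   tree = ast.parse(source)
|
|   class RenameFoo(ast.NodeTransformer):
|       # illustrative transform: rename function foo to bar
|       def visit_FunctionDef(self, node):
|           if node.name == "foo":
|               node.name = "bar"
|           self.generic_visit(node)
|           return node
|
|   new_source = astor.to_source(RenameFoo().visit(tree))
|   open("input.py", "w").write(new_source)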
| skydhash wrote:
| I took a look at your project and while it's nice
| (technically), for the actual use case shown, I can't see
| the value over something like the old Dreamweaver with a
| bit of training.
|
| I still think prompting is the wrong interface for
| programming systems. Even though they're restricted,
| configuration forms, visual programming with nodes, and
| small scripts attached to objects on a platform are way more
| reliable and useful.
| cryptoz wrote:
| Appreciate you having a look and for that feedback,
| thanks - I do agree I have work to do to _prove_ that my
| idea is better than alternatives. We'll see...
| dfxm12 wrote:
| _Google Flights already nails this UX perfectly_
|
| Often when using an AI agent, I think to myself that a web search
| gets me what I need more reliably and just as quickly. Maybe AI
| has to learn to crawl before it learns to walk, but each agent I
| use leaves me without confidence that it will ever be useful, and
| I genuinely wonder if they've ever been tested before being
| published...
| monero-xmr wrote:
| Assume humans can do anything in a factory. So we create a tool
| to increase the speed and reliability of the human's output. We
| do this so much that eventually the whole factory is automated,
| and the human is simply observing.
|
| Nowhere in that story above is there a customer or factory
| worker feeding in open-ended inputs. The factory is precise, it
| takes inputs and produces outputs. The variability is
| restricted to variability of inputs and the reliability of the
| factory kit.
|
| Much business software is analogous to the factory. You have
| human workers who ultimately operate the business. And software
| is built to automate those tasks precisely.
|
| AI struggles because engineers are trying to build factories
| through incantation - if they just say the right series of
| magic spells, the LLM will produce a factory.
|
| And often it can. It's just a shitty factory that does simple
| things, often inefficiently with unforeseen edge cases.
|
| At the moment, skilled factory builders (software engineers)
| are better at holistically understanding the needs of the
| business and building precise, maintainable, specific
| factories.
|
| The factory builders will use AI as a tool to help build better
| factories. Trying to get the AI to build the whole factory
| soup-to-nuts won't work.
| killjoywashere wrote:
| We have been looking at Hamming distance vs. time to signature
| for ambient note generation in medicine. Any other metrics?
| There are lots of metrics in the ML papers, but a lot of them
| seem sus: they take a lot of work to reproduce, or they are
| designed around some strategy like maxing out the easy true
| negatives (so you get desirable accuracy and F1 scores), etc.
| As someone trying to build validation protocols that I can get
| vendors to enable (I need them to write certain data from
| memory to a DB table we can access), I'd welcome that
| discussion. Right now the MBAs running the hospital systems are
| doing whatever their ML buddies say, without regard to patient
| or provider.
| simonw wrote:
| Yeah, the "book a flight" agent thing is a running joke now - it
| was a punchline in the Swyx keynote for the recent AI Engineer
| event in NYC: https://www.latent.space/p/agent
|
| I think this piece is underestimating the difficulty involved
| here though. If only it were as easy as "just pick a single task
| and make the agent really good at that"!
|
| The problem is that if your UI involves human beings typing or
| talking to you in a human language, there is an unbounded set of
| ways things could go wrong. You can't test against every possible
| variant of what they might say. Humans are bad at clearly
| expressing things, but even worse is the challenge of ensuring
| they have a concrete, accurate mental model of what the software
| can and cannot do.
| CooCooCaCha wrote:
| Case in point: look how long it's taken for self-driving cars to
| mature. And many would argue they still have a ways to go until
| they're truly reliable.
|
| I think this highlights how we still haven't cracked
| intelligence. Many of these issues come from the model's very
| limited ability to adapt on the fly.
|
| If you think about it, every little action we take is a micro
| learning opportunity. A small-scale scientific process of
| trying something and seeing the result. Current AI models can't
| really do that.
| SoftTalker wrote:
| Even maps. I was driving to Chicago last week and Apple Maps
| insisted I take the exit for Danville. Fortunately I knew
| better, I only had the map on in case an accident might
| require rerouting. I find it hard to drive with maps
| navigation because they are usually correct, but wrong often
| enough that I don't fully trust them. So I have to double
| check everything they tell me with the reality in front of
| me, and that takes more mental effort than it ideally should.
| noodletheworld wrote:
| Isn't the point he's making:
|
| >> Yet too many AI projects consistently underestimate this,
| chasing flashy agent demos promising groundbreaking
| capabilities--until inevitable failures undermine their
| credibility.
|
| This is the problem with the 'MCP for Foo' posts that have been
| appearing recently.
|
| Adding a capability to your agent that the agent can't use just
| gives us _exactly that_:
|
| > inevitable failures undermine their credibility
|
| It should be relatively easy for everyone to agree that giving
| agents an unlimited set of arbitrary capabilities will just
| make them terrible at everything; and that promising that
| giving them these capabilities will make them better is:
|
| A) false
|
| B) undermining the credibility of agentic systems
|
| C) undermining the credibility of the people making these
| promises
|
| ...I _get it_, it _is_ hard to write good agent systems, but
| surely, a bunch of half-baked, function-calling wrappers that
| don't really work... like, it's not a good look, right?
|
| It's just vibe coding for agents.
|
| I think it's quite reasonable to say, if you're building a
| system _now_, then:
|
| > The key to navigating this tension is focus--choosing a small
| number of tasks to execute exceptionally well and relentlessly
| iterating upon them.
|
| ^ This seems like exceptionally good advice. If you can't make
| something that's actually good by iterating on it until it _is_
| good and it _does_ work, then you're going to end up being a
| Devin (i.e. an over-promised, over-hyped failure).
| emn13 wrote:
| Perhaps the solution(s) need to focus less on output
| quality, and more on having a solid process for dealing with
| errors. Think undo, containers, git, CRDTs or whatever rather
| than zero tolerance for errors. That probably also means some
| kind of review for the irreversible bits of any process, and
| perhaps even process changes where possible to make common
| processes more reversible (which sounds like an extreme
| challenge in some cases).
|
| I can't imagine we're anywhere even close to the kind of
| perfection required not to need something like this - if it's
| even possible. Humans use all kinds of review and audit
| processes precisely because perfection is rarely attainable,
| and that might be fundamental.
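|
| A minimal sketch of what that could look like for code-editing
| agents, assuming the agent works in a local git checkout
| (apply_agent_edits is a placeholder, not a real API):
|
|   # Confine agent edits to a throwaway branch so every change is
|   # reviewable and reversible before it touches main.
|   import subprocess
|
|   def run(*args):
|       subprocess.run(args, check=True)
|
|   def apply_agent_edits():
|       pass  # stand-in for the agent modifying files in the repo
|
|   run("git", "switch", "-c", "agent/attempt-1")  # sandbox branch
|   apply_agent_edits()
|   run("git", "add", "-A")
|   run("git", "commit", "-m", "agent: proposed change")
|   # A human reviews the diff; merge if good, otherwise discard:
|   #   git switch main && git branch -D agent/attempt-1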
| techpineapple wrote:
| But, assuming this is a general thing, not just focused on, say,
| software development, can you make the tooling around
| creating this easier than defining the process itself?
| Everyone, loosely speaking, sees the value in test-driven
| development, but I often think that with complex processes,
| writing the test is harder than writing the process.
| RicoElectrico wrote:
| I want to make a simple solution where data is parsed by a
| vision model and "engineer for the unhappy path" is my
| assumption from the get-go. Changing the prompt or swapping
| the model is cheap.
| herval wrote:
| vision models are also faulty, and sometimes all paths are
| unhappy paths, so there's really no viable solution. Most
| of the time, swapping the model completely randomizes the
| problem space (unless you measure every single corner case,
| it's impossible to tell if everything got better or if some
| things got worse...)
| _bin_ wrote:
| The biggest issue I've seen is "context window poisoning",
| for lack of a better term. If it screws something up it's
| highly prone to repeating that mistake. It then makes a bad
| fix that propagates two more errors, then says, "Sure! Let me
| address that," and proceeds to poorly patch those rather than
| the underlying issue (say, restructuring the code to mitigate
| it).
|
| It is almost impossible to produce a useful result, as far as
| I've seen, unless one eliminates that mistake from the
| context window.
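|
| With chat-style APIs, one workaround is to rewind the message
| list rather than append corrections - a sketch (message contents
| and the index are illustrative):
|
|   # Excise the poisoned turns instead of piling fixes on top.
|   history = [
|       {"role": "user", "content": "Refactor the parser."},
|       {"role": "assistant", "content": "...buggy attempt..."},
|       {"role": "user", "content": "That broke X, please fix."},
|       {"role": "assistant", "content": "...worse attempt..."},
|   ]
|
|   bad_turn = 1                  # first assistant reply that went wrong
|   history = history[:bad_turn]  # drop it and everything after
|   history.append({
|       "role": "user",
|       "content": "Refactor the parser; keep signatures unchanged.",
|   })                            # retry with a sharper prompt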
| instakill wrote:
| I really really wish that LLMs had an "eject" function - as
| in I could click on any message in a chat, and it would
| basically start a new clone chat with the current chat's
| thread history.
|
| There are so many times where I get to a point where the
| conversation is finally flowing in the way that I want and
| I would love to "fork" into several directions from that
| one specific part of the conversation.
|
| Instead I have to rely on a prompt that requests the LLM to
| compress the entire conversation into a non-prose format
| that attempts to be as semantically lossless as possible;
| this sadly never works as intended.
| theblazehen wrote:
| You can use LibreChat which allows you to fork messages:
| https://www.librechat.ai/docs/features/fork
| tough wrote:
| Google's UI supports branching and deleting; someone
| recently made a blog post about how great it is.
| mvdtnz wrote:
| This is precisely what the poorly named Edit button does
| in Claude.
| bongodongobob wrote:
| I think this is one of the core issues people have when
| trying to program with them. If you have a long
| conversation with a bunch of edits, it will start to get
| unreliable. I frequently start new chats to get around this
| and it seems to work well for me.
| donmcronald wrote:
| This is what I find. If it makes a mistake, trying to get
| it to fix the mistake is futile and you can't "teach" it to
| avoid that mistake in the future.
| ModernMech wrote:
| > Perhaps the solutions(s) needs to be less focusing on
| output quality, and more on having a solid process for
| dealing with errors. Think undo, containers, git, CRDTs
|
| LLMs are supposed to save us from the toils of software
| engineering, but it looks like we're going to reinvent
| software engineering to make AI useful.
|
| Problem: Programming languages are too hard.
|
| Solution: AI!
|
| Problem: AI is not reliable, it's hard to specify problems
| precisely so that it understands what I mean unambiguously.
|
| Solution: Programming languages!
| Workaccount2 wrote:
| With pretty much every new technology, society has bent
| towards the tech too.
|
| When smartphones first popped up, browsing the web on them
| was a pain. Now pretty much the whole web has phone
| versions that make it easier*.
|
| *I recognize the folly of stating this on HN.
| LtWorf wrote:
| No, it's still a pain.
|
| There are apps that open links in their embedded browser,
| where ads aren't blocked. So I need to copy the link and
| open it in my real browser.
| serjester wrote:
| Even in Operator's original demo, the first thing they showed
| was booking restaurant reservations and ordering groceries. I
| understand their need to demo something intuitive, but it's
| still debatable whether these tasks are ones that most people
| want delegated to black-box agents.
| ToucanLoucan wrote:
| They don't. I have never once in my life wanted to talk to my
| smart speaker about what I wanted for dinner, not even
| because a smart speaker is/can be creepy, not because of
| social anxiety, no, it's just simpler and more
| straightforward to open Doordash on my damn phone, and look
| at a list of restaurants nearby to order from. Or browse a
| list of products on Amazon to buy. Or just call a restaurant
| to get a reservation. These tasks are trivial.
|
| And like, as a socially anxious millennial, no I don't
| particularly like phone calls. However I also recognize that
| setting my discomfort aside, a direct connection to a human
| being who can help reason out a problem I'm having is not
| something easily replaced with a chatbot or an AI assistant.
| It just isn't. Perfect example: called a place to make a
| reservation for myself, my wife and girlfriend (poly long
| story) and found the place didn't usually do reservations on
| the day in question, but the person did ask when we'd be
| there. As I was talking to a person, I could provide that
| information immediately, and say "if you don't take
| reservations don't worry, that's fine," but it was an off-peak
| hour so we got one anyway. How does an AI navigate that
| conversation more efficiently than me?
|
| As a techie person I basically spend the entire day
| interacting with various software to perform various tasks,
| work related and otherwise. I cannot overstate: NONE of these
| interactions, not a single one, is improved one iota by
| turning it into a conversation, verbal or text-based, with my
| or someone else's computer. By definition it makes basic
| tasks take longer, every time, without fail.
| bluGill wrote:
| I've more than once been on a road trip and realized I wanted
| something to help me find a meal wherever I'll be sometime in
| the next 2 hours. I have no idea what the options are and I
| can't find them. All too often I've settled for some generic
| fast food when I really wanted something local, but I couldn't
| get maps to tell me about it, even when it was one street away
| where I wouldn't see it. (Remember too, if I'm driving I can't
| spend time scrolling through a list - but even when I'm the
| navigator, the interface I can find in maps isn't good.)
| simonw wrote:
| I'm on a road trip across Utah and Colorado right now and
| I've been experimenting with both Gemini and OpenAI Deep
| Research for this kind of thing with surprisingly decent
| results. Here's one transcript from this morning:
| https://chatgpt.com/share/67e9f968-4e88-8006-b672-13381d5e95...
| 3p495w3op495 wrote:
| Any customer service or tech support rep can tell you that even
| humans can't always understand what other humans are attempting
| to say
| hansmayer wrote:
| It's so funny when people try to build robots imitating people.
| I mean part funny, part tragedy of the upcoming bust. The irony
| being, we would have been better off with an interoperable
| flight booking API standard which a deterministic _headless_
| agent could use to make perfect bookings every single time.
| There is a reason current user interfaces stem from a
| scientific discipline once called "_Human_-Computer
| Interaction".
| jatins wrote:
| But that's the promise of AI, right? You can't put an API on
| everything for human + technological reasons.
| hansmayer wrote:
| It is a promise alright :)
| dartos wrote:
| You can't put an API on everything because it'd take a ton
| of time and money to pull that off.
|
| I can't think of any technological reasons why every
| digital system can't have an API (barring security
| concerns, as those would need to be case by case)
|
| So instead, we put 100s of billions of dollars into
| statistical models hoping they could do it for us.
|
| It's kind of backwards.
| Scene_Cast2 wrote:
| You change who's paying.
| dartos wrote:
| Sure, as a biz it makes sense, but as a society, it's
| obviously a big failure.
| datadrivenangel wrote:
| A web page is an Application/Human Interface. Outside of
| security concerns, companies can make more money if they
| control the Application/Human Interface, and reduce the
| risk of a middleman / broker extorting them for margins.
|
| If I run a flight aggregator that has a majority of
| flight bookings, I can start charging 'rents' by allowing
| featured/sponsored listings to be promoted above the
| 'best' result, leading to a prisoner's dilemma where
| airlines should pay up to their margins to keep market
| share.
|
| If an AI company becomes the default application human
| interface, they can do the same thing. Pay OpenAI tribute
| or be ended as a going concern.
| daxfohl wrote:
| Exactly. It should take around 10 parameters to book a
| flight. Not 30,000,000,000 and a dedicated nuclear power
| plant.
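|
| For scale, a sketch of what "around 10 parameters" might look
| like (field names are made up for illustration):
|
|   from dataclasses import dataclass
|   from typing import Optional
|
|   @dataclass
|   class FlightRequest:
|       origin: str              # IATA code, e.g. "SFO"
|       destination: str         # e.g. "JFK"
|       earliest_departure: str  # ISO date
|       latest_arrival: str      # ISO date
|       max_price_usd: int
|       nonstop_only: bool
|       cabin: str               # "economy", "business", ...
|       passengers: int
|       alliance: Optional[str]  # e.g. "Star Alliance"
|       refundable: bool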
| TeMPOraL wrote:
| It's a business problem, not a tech problem. We don't have a
| solution you described because half of the air travel
| industry relies on things not being interoperable. AI is the
| solution at the limit, one set of companies selling users the
| ability to show a middle finger to a much wider set of
| companies - interoperability by literally having a digital
| human approximation pretending to be the user.
| the_snooze wrote:
| I've been a sentient human for at least the last 15 years
| of tech advancement. Assuming this stuff actually works,
| it's only a matter of time before these AI services claw
| back all that value for themselves and hold users and
| businesses hostage to one another, just like social media
| and e-commerce before.
| https://en.wikipedia.org/wiki/Enshittification
|
| Unless these tools can be run locally independent of a
| service provider, we're just trading one boss for another.
| ben_w wrote:
| > Unless these tools can be run locally independent of a
| service provider, we're just trading one boss for
| another.
|
| Many of them already can be. Many more existing models
| will become local options if/when RAM prices decline.
|
| But this won't necessarily prevent enshittification, as
| there's always a possibility of a new model being tasked
| with pushing adverts or propaganda. And perhaps existing
| models already have been -- certainly some people talk as
| if it's so.
| polishdude20 wrote:
| The difference is that social media isn't special because
| of its hardware or software even. People are stuck on
| Facebook because everyone else is on it. It's network
| effects. LLMs currently have no network effects. Your
| friends and family aren't "on" chatgpt so why use that
| over something else?
|
| Once performance of a local setup is on par with online
| ones or good enough, that'll be game over for them.
| bluGill wrote:
| The airlines rely on things not interoperating for you.
| However, their agents interoperate all the time via code
| sharing. They don't want normal people to do this, but if
| something goes wrong with the airplane you were supposed to
| be on, they would rather get you there than not.
| doug_durham wrote:
| Your use of the word "perfect" is doing a lot of heavy
| lifting. "Perfect" is a word embedded in a high dimensional
| space whose local maxima are different for every human on the
| planet.
| yujzgzc wrote:
| I'm old enough to remember having to talk to a (human) agent in
| order to book flights, and can confirm that in my experience,
| the modern flight booking website is an order of magnitude
| better UX than talking to someone about your travel plans.
| kccqzy wrote:
| That still exists. The last time I did onsite interviews,
| every single company that wanted to fly me to their office to
| interview me asked me to talk to a human agent to book
| flights. But of course the human agent is just a travel agent
| with no budgetary power; so I ended up calling the agent to
| inquire about a booking, then calling the recruiter to
| confirm that price is acceptable, and then calling the agent
| back to confirm the booking.
|
| It doesn't have to be this way. Even before the pandemic I
| remember some companies simply gave me access to an internal
| app to choose flights where the only flights shown are those
| of the right date, right airport, and right price.
| leoedin wrote:
| Yeah, I much prefer using a well designed self service system
| than trying to explain it over the phone.
|
| The only problem with most of the flights I book now is that
| they're with low cost airlines and packed with dark patterns
| designed to push upgrades.
|
| Would an AI salesman be any better though? At least the
| website can't actively try to persuade me to upgrade.
| Spooky23 wrote:
| It's no different than the old Amazon button thing. I'm not
| going to automatically pay whatever price Amazon is going to
| charge to push-button replenish household goods. Especially in
| those days, where "The World's Biggest" fence would have pretty
| wild swings in price.
|
| If i were rich enough to have some bot fly me somewhere, I'd
| have a real-life minion do it for me.
| burnte wrote:
| > Yeah, the "book a flight" agent thing is a running joke now
|
| I literally sat in a meeting with one of our board members who
| used this exact example of how "AI can do everything now!" and
| it was REALLY hard not to laugh.
| wdb wrote:
| Can Google Flights find the best flight dates to a destination
| within a time frame? E.g. get flights to LA in an up-to-15-day
| window that ensures attendance on 17 September. Fly with
| SkyAlliance airlines only. Flexible on dates, but I need to be
| there on 17 Sept with a minimum stay of eight days or more.
|
| I'd love it if it could help with that, but I haven't figured it
| out with Google Flights yet. My dream is to tell an AI agent the
| above and let it figure out the best deal.
| davesque wrote:
| Yep, and AI agents essentially throw up a boundary blocking the
| user from understanding the capabilities of the system they're
| using. They're like the touch screens in cars that no one asked
| for, but for software.
| photonthug wrote:
| > The problem is that if your UI involves human beings typing
| or talking to you in a human language, there is an unbounded
| set of ways things could go wrong. You can't test against every
| possible variant of what they might say.
|
| It's almost like we really might benefit from using the
| advances in AI for stuff like speech recognition to build
| _concrete interfaces with specific predefined vocabularies and
| a local-first UX_. But stuff like that undermines a cloud-based
| service and a constantly changing interface and the
| opportunities for general spying and manufacturing
| "engagement" while people struggle to use the stuff you've
| made. And of course, producing actual specifications means that
| you would have to own bugs. Besides eliminating employees, much
| interest in AI is all about completely eliminating
| responsibility. As a user of ML-based monitoring products and
| such for years.. "intelligence" usually implies no real
| specifications, and no specifications implies no bugs, and no
| bugs implies rent-seeking behaviour without the burden of any
| actual responsibilities.
|
| It's frustrating to see how often even technologists buy the
| story that "users don't want/need concrete specifications" or
| that "users aren't smart enough to deal with concrete
| interfaces". It's a trick.
| ramesh31 wrote:
| More capability, less reliability please. I want something that
| can achieve superhuman results 1 out of 10 times, not something
| that gives mediocre human results 9 out of 10 times.
|
| All of reality is probabilistic. Expecting that to map
| deterministically to solving open ended complex problems is
| absurd. It's vectors all the way down.
| soulofmischief wrote:
| Stability is the bedrock of the evolution of stable systems.
| LLMs will not democratize software until an average person can
| get consistently decent and useful results without needing to
| be a senior engineer capable of a thorough audit.
| ramesh31 wrote:
| >Stability is the bedrock of the evolution of stable systems.
|
| So we also thought about AI in general, and spent decades
| toiling on rules-based systems. Until interpretability was
| thrown out the window and we just started letting deep
| learning algorithms run wild with endless compute, and looked
| at the actual results. This will be very similar.
| skydhash wrote:
| Rules based systems are quite useful, not for interacting
| with an untrained human, but for getting things done. Deep
| learning can be good at exploring the edges of a problem
| space, but when a solution is found, we can actually get to
| the doing part.
| klabb3 wrote:
| This can be explained easily - there are simply some
| domains that were hard to model, and those are the ones
| where AI is outperforming humans. Natural language is the
| canonical example of this. Just because we focus on those
| domains now due to the recent advancements, doesn't mean
| that AI will be better at every domain, especially the ones
| we understand exceptionally well. In fact, all evidence
| suggests that AI excels at some tasks and struggles with
| others. The null hypothesis should be that it continues to
| be the case, even as capability improves. Not all
| computation is the same.
| soulofmischief wrote:
| Stability and probability are orthogonal concepts. You can
| have stable probabilistic systems. Look no further than our
| own universe, where everything is ultimately probabilistic
| and not "rules-based".
| deprave wrote:
| What would be a superhuman result for booking a flight?
| mjmsmith wrote:
| 10% of the time the seat on either side of you is empty, 90%
| of the time you land in the wrong country.
| Jianghong94 wrote:
| Superhuman results 1/10 of the time are, in fact, a very strong
| reliability guarantee (maybe not up to the many-nines standard
| we are accustomed to today, but probably much higher than any
| agent in a real-world workflow).
| klabb3 wrote:
| Reality is probabilistic yes but it's not black box. We can
| improve our systems by understanding and addressing the flaws
| in our engineering. Do you want probabilistic black-box
| banking? Flight controls? Insurance?
|
| "It works when it works" is fine when stakes are low and human
| is in the loop, like artwork for a blog post. And so in a way,
| I agree with you. AI doesn't belong in intermediate computer-
| to-computer interactions, unless the stakes are low. What
| scares me is that the AI optimists are desperately looking to
| apply LLMs to domains and tasks where the cost of mistakes is
| high.
| recursive wrote:
| > Expecting that to map deterministically to solving open ended
| complex problems is absurd.
|
| TCP creates an abstraction layer with more reliability than
| what it's built on. If you can detect failure, you can create a
| retry loop, assuming you can understand the rules of the
| environment you're operating in.
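|
| A sketch of that pattern around an agent step (run_agent and the
| validator are placeholders for whatever you're wrapping):
|
|   # TCP-style detect-and-retry around an unreliable step.
|   def run_agent(task: str) -> str:
|       return "{}"  # stand-in for a nondeterministic LLM call
|
|   def reliable_call(task, validate, attempts=3):
|       for _ in range(attempts):
|           result = run_agent(task)
|           if validate(result):  # cheap deterministic check
|               return result
|       raise RuntimeError(f"no valid result in {attempts} attempts")
|
|   reliable_call("emit JSON", lambda r: r.startswith("{"))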
| segh wrote:
| Lots of people are building on the edge of current AI
| capabilities, where things don't quite work, because in 6 months
| when the AI labs release a more capable model, you will just be
| able to plug it in and have it work consistently.
| techpineapple wrote:
| In 6 months when FSD is completed, and we get robots in every
| home? I suspect we keep adding features, because reliability is
| hard. I do not know what heuristic you would be looking to
| conclude that this problem will eventually be solved by current
| AI paradigms.
| thornewolf wrote:
| The GP comment is what has already happened "every 6 months",
| multiple times now.
| postexitus wrote:
| And where is that product that was developed on the edge of
| current AI capabilities and, now that the latest AI model is
| plugged in, suddenly works consistently? All I am seeing is
| models getting better and better at generating videos of
| spaghetti-eating movie stars.
| segh wrote:
| For me, they have come from the AI labs themselves. I have
| been impressed with Claude Code and OpenAI's Deep Research.
| vslira wrote:
| While I'm bullish on AI capabilities, that is not a very
| optimistic observation for developers building on top of it.
| cube00 wrote:
| > because in 6 months when the AI labs release a more capable
| model
|
| How many years do we have to keep hearing this line? ChatGPT is
| two years old and still can't be relied on.
| wiradikusuma wrote:
| Booking a flight is actually a task I cannot outsource to a human
| assistant, let alone AI. Maybe it's a third-world problem or just
| me being cheap, but there are heuristics involved when booking
| flights for a family trip or even just for myself.
|
| Check the official website, compare pricing with aggregator,
| check other dates, check people's availability on cheap dates.
| Sometimes I only do the first step if the official price is
| reasonable (I travel 1-2x a month, so I have an expectation of
| how much it should cost).
|
| Don't get me started if I also consider which credit card to use
| for the points rewards.
| kccqzy wrote:
| Completely agree! Especially considering that flights for most
| people are still a large expense, people, especially those in
| the credit card points game, like to go to great lengths to
| score the cheapest possible flights.
|
| For example, this person[0] could have simply booked a United
| flight from the United site for 15k points. Instead the person
| batch emailed Turkish Airlines booking offices, found the Thai
| office that was willing to make that booking but required bank
| transfers in Thai baht to pay taxes, made two more phone calls
| to Turkish Airlines to pay taxes with a credit card, and in the
| end only spent 7.5k points for the same trip on United.
|
| This may be an extreme example, but it shows the amount of
| familiarity with the points system, the customer service phone
| tree and the actual rules to get cheap flights.
|
| If AI can do all of that, it'd be useful. Otherwise I'll stick
| to manual booking.
|
| [0]: https://frequentmiler.com/yes-you-can-still-book-united-flig...
| Jianghong94 wrote:
| Now THAT's the workflow I'd like to see an AI agent automate,
| streamline and democratize for everybody.
| maxbond wrote:
| If it were available to everybody, it would disappear. This
| is a market inefficiency that a "trader" with deep
| knowledge of the structure of this market was able to
| exploit. But if everyone started doing this, United/Turkish
| Airlines would see they were losing money and eliminate it.
| Similar to how airlines have tried to stop people
| exploiting "hidden cities."
| davedx wrote:
| > Similar to how airlines have tried to stop people
| exploiting "hidden cities."
|
| This sounds interesting?
| wbxp99 wrote:
| https://skiplagged.com/
|
| Just don't book a round trip, don't check a bag, don't do
| it too often. Also you're gambling that they don't cancel
| your flight and book you on a new one to the city you
| don't actually want to go to (that no longer connects via
| the hidden city). You can get half price tickets
| sometimes with this trick.
| kristjansson wrote:
| and watch it immediately evaporate or require even more
| esoteric knowledge of opaque systems?
|
| Persistent mispricings can only exist if the cost of
| exploitation removes the benefit or constrains the
| population.
| victorbjorklund wrote:
| I don't really need an AI agent to book flights for me (I just
| don't travel enough for it to be any burden) but aren't those
| arguments for an AI agent? If you just wanna book the next
| flight London to New York it isn't that hard. A few minutes of
| clicking.
|
| But if you wanna find the cheapest way to get to A, compare
| different retailers, check multiple peoples availability,
| calculate effects of credit cards etc. It takes time. Aren't
| those things that could be automated with an agent that can
| find the cheapest flights, propose dates for it, check
| availability etc. with multiple people via a messaging app,
| calculate which credit card to use, etc?
| bgirard wrote:
| In theory, yes. But in a real world evaluation would it pick
| better flights? I'd like to see evidence that it's able to
| find a better flight that maximizes this. Also the tricky
| part is how do you communicate how much I personally weigh a
| shorter flight vs points on my preferred carrier vs having to
| leave for the airport at 5am vs 8am? I'm sure my answers
| would differ from wiradikusuma's answers.
| UncleMeat wrote:
| Yep this is my vibe.
|
| When I'm picking out a flight I'm looking at, among other
| things:
|
| * Is the itinerary aggravatingly early or late
|
| * Is the layover aggravatingly short or long
|
| * Is the layover in an airport that sucks
|
| * Is the flight on a carrier that sucks
|
| * What does it cost
|
| If you asked me to encode ahead of time the relative value
| of each of these dimensions I'd never be able to do it.
| Heck, the relative value to me isn't even constant over
| time. But show me five options and I can easily select
| between them. A clear case where search is more convenient
| than some agent doing it for me.
| bgirard wrote:
| I agree. At first I would be open to an LLM suggested
| option to appear in the search UI. I would have to pick
| it the majority of the time for quite a while for me to
| trust it enough to blindly book through it.
|
| It's the same problem with Alexa. I don't trust it to
| blindly reorder basic stuff for me when I have to sift
| through so many bad product listings on the Amazon
| marketplace.
| Jianghong94 wrote:
| Yep, that's what I've been thinking. This shouldn't be that
| hard; at this point LLMs should already have all the 'rules'
| (e.g. credit card A buying flight X gives you m points, which
| can be converted into n miles) in their params, or can easily
| query the web for them. Devs need to encode the whole thing
| into a decision mechanism and, once executed, ask the LLM to
| chase down the specific path (e.g. bombarding ticket offices
| with emails).
| joseda-hg wrote:
| The Flight Price to Tolerable Layover time ratio is something
| too personal for me to convey to an assistant
| zippergz wrote:
| I have HAD a human assistant who booked flights for me. But it
| took them a long time to learn the nuances of my preferences
| enough to do it without a lot of back and forth. And even then,
| they still sometimes had to ask. Things like what time of day I
| prefer to fly based on what I had going on the day before or
| what I'll be doing after I land. What airlines I prefer based
| on which lounges I'd have access to, or what aircraft they fly.
| When I would opt for a connecting flight to get a better price
| vs. when I want nonstop regardless of cost. And on and on.
| Probably dozens of factors that might come into play in various
| combinations depending on where I'm going and why. And
| preferences that are hard to articulate, but make sense once
| understood.
|
| With a really excellent human assistant who deeply understood
| my brain (at least the travel related parts of it), it was kind
| of nice. But even then there were times when I thought it would
| be easier and better to just do it myself. Maybe it's a failure
| of imagination, but I find it very hard to see the path from
| today's technology to an AI agent that I would trust enough to
| hand it off, and that would save enough time and hassle that I
| wouldn't prefer to just do it myself.
| pton_xd wrote:
| > Booking a flight is actually task I cannot outsource to a
| human assistant, let alone AI.
|
| Because there is no "correct" flight. Your preference changes
| as you discover information about what's available at a given
| time and price.
|
| The helpful AI assistant would present you with options, you'd
| choose what you prefer, it would refine the options, and so on,
| until you make your final selection. There would be no
| communication lag as there would be with a human assistant.
| That sounds very doable to me.
| qoez wrote:
| You get more reliability from better capability though. More
| capability means being better at not misclassifying subtle tasks,
| which is what causes reliability issues.
| joshdavham wrote:
| My rule of thumb has thus far been: if I'm gonna allow AI to
| write any bit of code for me, then I must, at a bare minimum, be
| able to understand that code.
|
| There's no way I could do what some of these "vibe coders" are
| doing where they allow AI to write code for them that they don't
| even understand.
| kevmo314 wrote:
| That's only true as long as you want to modify said code. If it
| meets your bar for reliability then you won't need to
| understand it, much like how we don't really need to
| read/understand compiled assembly code so we largely trust the
| compiler.
|
| A lot of these vibe coders just have a much lower bar for
| reliability than you.
| fourside wrote:
| How do you know if it meets your bar for reliability if you
| don't understand the output? I don't know that the analogy to
| a compiler is apples to apples. A compiler isn't producing an
| answer based on statistically generating something that
| should look like the right answer.
| kevmo314 wrote:
| The premise for vibe coding is that it's generating the
| entire app or site. If the app does what you want then it's
| meeting the bar.
| joshdavham wrote:
| This is an interesting point and it's certainly true with
| respect to most people's attitudes towards dependencies.
|
| For example, while I feel the need to understand the code I
| wrote using pytorch, I don't generally feel the need to
| totally grok how pytorch works.
| AlexandrB wrote:
| I think there's a lot of code that gets written that's either
| disposable or effectively "write only" in that no one is
| expected to maintain it. I have friends who write a lot of this
| code for tasks like data analysis for retail and "vibe coding"
| isn't that crazy in such a domain.
|
| Basically, what's worse? "Vibes" code that no one understands
| or a cascade of 20 spreadsheets that no one understands? At
| least with the "vibes" code you can stick it in git and have
| some semblance of sane revision control and change tracking.
| Centigonal wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain.
|
| I think this is a great use case for AI, but the analyst
| still needs to understand what the code that is output does.
| There are a lot of ways to transform data that result in
| inaccurate or misleading results.
| LPisGood wrote:
| Vibe coders focus on writing tests, and verifying
| function/correctness. It's not like they don't read _any_
| of the code. They get the vibes, but ignore the details.
| cube00 wrote:
| > Vibe coders focus on writing tests
|
| From the boasting I've seen, Vibe coders are also using
| AI to slop out their tests as well.
| kibwen wrote:
| Worry not, we can solve this by using AI to generate
| tests to test the tests.
| LPisGood wrote:
| Testing the tests is pretty much the definition of being
| a vibe coder.
| LPisGood wrote:
| Yeah and tests are much easier to validate than
| functions.
| hooverd wrote:
| Huh. The whole promise of vibe coding is that you don't
| have to pay attention to the details.
| namaria wrote:
| "You're programming wrong wrong" /s
| LPisGood wrote:
| Yeah, of course. I don't think what I described could
| possibly be misconstrued as someone paying attention to
| details.
| pton_xd wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain.
|
| That sort of makes sense, but then again... if you run some
| analysis code and it spits out a few plots, how do you know
| what you're looking at is correct if you have no idea what
| the code is doing?
| kibwen wrote:
| _> how do you know what you're looking at is correct if
| you have no idea what the code is doing?_
|
| Does it reaffirm the biases of the one who signs my
| paychecks? If so, then the code is correct.
| usui wrote:
| LOL, thanks for the laughs. But yes, seriously: most
| kinds of data analysis jobs several rungs down the ladder,
| where the result is not in a critical path, amount to
| reaffirming what the higher-ups believe. Don't rock the
| boat.
| AlexandrB wrote:
| Lol, that's definitely a factor. Actually plotting is the
| perfect example because python is really popular in the
| space and matplotlib sucks so much. While an analyst may
| not understand Python very well, they often understand
| the data itself through either previous projects or
| through other analysis tools. It's kind of like vibe
| coding a UI for a backend that's hand built.
| liveoneggs wrote:
| two wrongs don't make a right
| cube00 wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain
|
| Considering the hallucinations we've all seen I don't know
| how they can be comfortable using AI generated data analysis
| to drive the future direction of the business.
| inetknght wrote:
| > _what's worse? "Vibes" code that no one understands or a
| cascade of 20 spreadsheets that no one understands? At least
| with the "vibes" code you can stick it in git and have some
| semblance of sane revision control and change tracking._
|
| You can for spreadsheets too.
| palmotea wrote:
| > I think there's a lot of code that gets written that's either
| disposable or effectively "write only" in that no one is
| expected to maintain it. I have friends who write a lot of
| this code for tasks like data analysis for retail and "vibe
| coding" isn't that crazy in such a domain.
|
| > Basically, what's worse? "Vibes" code that no one
| understands or a cascade of 20 spreadsheets that no one
| understands?
|
| Correction: it's a "cascade of 20 spreadsheets" that _one_
| person understood/understands.
|
| Write only code still needs to work, and _someone_ at _some
| point_ needs to understand it well enough to know that it
| works.
| SkyPuncher wrote:
| Sure, but you're a professional software engineer, who I assume
| gets feedback and performance reviews based on the quality of
| your code.
|
| There's always been a group of beginners that throws stuff
| together without fully understanding what it does. In the past,
| this would be copy 'n' paste from Stack Overflow. Now, that
| process is simply more automated.
| __MatrixMan__ wrote:
| I think there are times where it's ok to treat a function like
| a black box--cases where anything that makes the test pass will
| do because the test is in fact an exhaustive evaluation of what
| that code needs to do.
|
| We just need to be better about making it clear which code is
| that way and which is not.
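|
| A toy case of what's meant here - the test enumerates the whole
| input domain, so any implementation that passes is
| interchangeable (clamp is a stand-in):
|
|   def clamp(x: int, lo: int, hi: int) -> int:
|       return max(lo, min(hi, x))  # any passing body will do
|
|   def test_clamp_exhaustive():
|       # covers the entire supported domain, not a sample of it
|       for x in range(-1000, 1001):
|           assert clamp(x, 0, 5) == max(0, min(5, x))
|
|   test_clamp_exhaustive()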
| mentalgear wrote:
| Capability demos (like the Rabbit R1 vaporware) will keep
| coming as long as the market is hot and investors (like
| lemmings) foolishly run after the companies that are best at
| hype.
| marban wrote:
| Giving up accuracy for a bit of convenience--if any at all--
| almost never pays off. Looking at you, Alexa.
| danielbln wrote:
| Image compression, eventual consistency, fuzzy search. There
| are many more examples I'm sure.
| skydhash wrote:
| > _Image compression, eventual consistency, fuzzy search.
| There are many more examples I'm sure._
|
| Aren't all of these very deterministic? You can predict what's
| going to be discarded by the compression algorithm. Eventual
| consistency is only eventual because of the generation of
| events. Once that stops, you will have a consistent system
| and the whole thing can be replayed based on the history of
| events. Even with fuzzy search you can intuit how to get
| reliable results and ordering without even looking at the
| algorithms.
|
| An LLM-based agent is the least efficient method for most of
| the cases they're marketing it for. Sometimes all you need is
| a rule-based engine. Then you can add bounded fuzziness where
| it's actually helpful.
| bhu8 wrote:
| I have been thinking about the exact same problem for a while and
| was literally hours away from publishing a blogpost on the
| subject.
|
| +100 on the footnote:
|
| > agents or workflows?
|
| Workflows. Workflows, all the way.
|
| The agents can start using these workflows once they are actually
| ready to execute stuff with high precision. And, by then we would
| have figured out how to create effective, accurate and easily
| diagnosable workflows, so people will stop complaining about "I
| want to know what's going on inside the black box".
| breckenedge wrote:
| Agreed, I started crafting workflows last week. Still not
| impressed: the current crop of models is poor at following
| instructions.
|
| And are there any guidelines on how to manage workflows for a
| project or set of projects? I'm just keeping them in plain text
| and including them in conversations ad hoc.
| DebtDeflation wrote:
| I've been building workflows with "AI" capability inserted
| where appropriate since 2016. Mostly customer service chatbots.
|
| 99.9% of real world enterprise AI use cases today are for
| workflows, not agents.
|
| However, "agents" are being pushed because the industry needs a
| next big thing to keep the investment funding flowing in.
|
| The problem is that even the best reasoning models available
| today don't have the actual reasoning and planning capability
| needed to build truly autonomous agents. They might in a year.
| Or they might not.
| narmiouh wrote:
| I feel like OP would have been better off not referencing the
| viral thread about a developer not using any version control and
| being surprised when the AI made changes. I don't think anyone
| who doesn't understand version control should be using a tool
| like Cursor; there are other SaaS apps that build and deploy
| apps using AI, and for people with the skill level demonstrated
| in the thread, that is what they should be using.
|
| It's like saying rm -rf / should have more safeguards built in.
| It feels unfair to call out the AI-based tools for this.
| fabianhjr wrote:
| `rm -rf /` does have a safeguard:
|
| > For example, if a user with appropriate privileges mistakenly
| runs 'rm -rf / tmp/junk', that may remove all files on the
| entire system. Since there are so few legitimate uses for such
| a command, GNU rm normally declines to operate on any directory
| that resolves to /. If you really want to try to remove all the
| files on your system, you can use the --no-preserve-root
| option, but the default behavior, specified by the --preserve-
| root option, is safer for most purposes.
|
| https://www.gnu.org/software/coreutils/manual/html_node/Trea...
| layer8 wrote:
| That was added in 2006, so didn't exist for a good half of
| its life (even longer if you count pre-GNU). I remember _rm
| -rf /_ being considered just one instance of having to
| double-check what you do when using the _-rf_ option. It's
| one reason it became common to alias _rm_ to _rm -i_.
| danso wrote:
| I think it's a useful anecdote because it underscores how
| catastrophically unreliable* agents can be, especially in the
| hands of users who aren't experienced in the particular domain.
| In the domain of programming, it's much easier to quantify a
| "catastrophic" scenario vs. more open-ended "real world"
| situations like booking a flight.
|
| * "unreliable" may not be the right word. For all we know, the
| agent performed admirably given whatever the user's prompt may
| have been. Just goes to show that even in a relatively
| constricted domain of programming, where a lot (but far from
| _all_ ) outcomes are binary, the room for misinterpretation and
| error is still quite vast.
| namaria wrote:
| More than that, I think it's quite relevant, because it shows
| that the complexity in the tooling around writing code
| manually is not so inessential as it seems.
|
| Any system capable of automating a complex task will by
| necessity be more complex than the task at hand. This
| complexity doesn't evaporate when you throw statistical
| fuzzers at it.
| outime wrote:
| Technically, they could be using version control, not have a
| copy on their local machine for some reason, and have an AI
| agent issue a `git push -f` wiping out all the previous work.
| jappwilson wrote:
| Can't wait for this to be a plot point in a murder mystery:
| someone games the AI agent to create a planned "accident".
| daxfohl wrote:
| We can barely make deterministic distributed services reliable.
| And microservices now have a bad reputation for being expensive
| distributed spaghetti. I'm not holding my breath for distributed
| AI agents to be a thing.
| twotwotwo wrote:
| FWIW, work has pushed use of Cursor and I quickly came around to
| a related conclusion: given a reliability vs. anything tradeoff,
| you more or less always have to prefer reliability. For example,
| even ignoring subtle head-scratcher type bugs, a faster model's
| output on average needs more revision before it basically works,
| and on average you end up spending more energy on that than you
| save by reducing time to first response. Up-front work that
| decreases the chance of trouble--detailing how you want something
| done, explicitly pulling specific libraries into context--also
| tends to be worth it on net, even if the agent might have gotten
| there by searching (or you could get it there through follow-up
| requests).
|
| That's my experience working with a largeish mature codebase (all
| on non-prod code) where you can't get far if you can't use
| various internal libraries correctly. With standalone (or small
| greenfield) projects, where results can lean more on public info
| from pre-training and there's not as much project specific info
| to pull in, you might see different outcomes.
|
| Maybe the tech and surrounding practice will change over time,
| but in my short experience it's mostly been about trying to just
| get to 'acceptable' for this kind of task.
| asdev wrote:
| Want reliability? Build automation instead of using non-
| deterministic models to complete tasks.
| nottorp wrote:
| But but...
|
| People don't get promoted for reliability. They get promoted for
| new capabilities. Everyone thinks they're the next Google.
| prng2021 wrote:
| I think the best shot we have at solving this problem is an
| explosion of specialized agents. That will limit how off the
| rails each one can go at interpreting or performing some type of
| task. The end user still just needs to interact with one agent
| though, as long as it can delegate properly to subagents.
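|
| A toy sketch of that shape - a front agent that only routes to
| narrow subagents and refuses anything outside their scope (all
| names here are made up):
|
|   def book_flight_agent(task: str) -> str:
|       return "flight booked (stub)"
|
|   def calendar_agent(task: str) -> str:
|       return "event created (stub)"
|
|   SUBAGENTS = {"flight": book_flight_agent,
|                "calendar": calendar_agent}
|
|   def classify(task: str) -> str:
|       # stand-in for a lightweight intent classifier
|       return "flight" if "fly" in task.lower() else "unknown"
|
|   def route(task: str) -> str:
|       handler = SUBAGENTS.get(classify(task))
|       if handler is None:
|           return "Sorry, I can't do that yet."  # fail loudly
|       return handler(task)
|
|   print(route("fly me to NYC on Friday"))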
| SkyPuncher wrote:
| Unfortunately, the picked example kind of weighs down the point.
| Cursor has an _extremely_ vocal minority (beginner coders) that
| isn't really representative of their heavyweight users
| (professional coders). These beginner users face significant
| issues that come from being new to programming, in general.
| Cursor gives them amazing capabilities, but it also lets them
| make the same dumb mistakes that most professional developers
| have made once or twice in their careers.
|
| That being said, back in February I was trying out of bunch of AI
| personal assistant apps/tools. I found, without fail, every
| single one of them was advertising features their LLMs could
| theoretically accomplish, but in practice couldn't. Even worse,
| many of these "assistants" would proactively suggest they
| could accomplish something but when you sent them out to do it,
| they'd tell you they couldn't.
|
| * "Would you like me to call that restaurant?"...."Sorry, I don't
| have support for that yet"
|
| * "Would you like me to create a reminder?"....Created the
| reminder, but never executed it
|
| * "Do you want me to check their website?"...."Sorry, I don't
| support that yet"
|
| Of all of the promised features, the only thing I ended up using
| any of them for was a text message interface to an LLM. Now that
| Siri has native ChatGPT support, it's not necessary.
| _cs2017_ wrote:
| Does anyone have AI agent use cases that you think might
| happen within this year and that feel very exciting to you?
|
| I personally struggle to find a new one (AI agent coding
| assistants already exist, and of course I'm excited about them,
| especially as they get better). I will not, any time soon, trust
| unsupervised AI to send emails on my behalf, make travel
| reservations, or perform other actions that are very costly to
| fix. AI as a shopping agent just isn't too exciting for me, since
| I do not believe I actually know what features in a speaker /
| laptop / car I want until I do my own research by reading what
| experts and users say.
| danso wrote:
| I think the replies [0] to the mentioned reddit thread sums up my
| (perhaps complacent?) feelings about the current state of
| automated AI programming:
|
| > _Does it terrify anyone else that there is an entire cohort of
| new engineers who are getting into programming because of AI, but
| missing these absolute basic bare necessities?_
|
| > > _Terrify? No, it's reassuring that I might still have a
| place in the world._
|
| [0]
| https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdo...
| bob1029 wrote:
| The reddit post feels like engagement bait to me.
|
| Why would you ask the community a question like "how to source
| control" when you've been working with (presumably) a
| programming genius LLM that could provide the most personally
| tailored path for baby's first git experience? Even if you
| don't know that "git" is a thing, you could ask questions as if
| you were a golden retriever and the model would still
| inevitably recommend git in the first turn of conversation.
|
| Is it really the case that a person who has the ability to use
| a compiler, IDE, LLM, web browser, reddit, etc., somehow
| simultaneously lacks the ability to frame basic-ass questions
| about the very mission they set out on? If stuff like this is
| _not_ manufactured, then we should all walk away feeling pretty
| fantastic about our future job prospects.
| danso wrote:
| The account is a throwaway but based on its short posting
| history and its replies, I don't have reason to believe it's
| a troll:
|
| https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdr...
|
| > _I'm not a dev or engineers at all (just a geek working in
| Finance)_
|
| This fits my experience of teaching very intelligent students
| how to code; if you're an experienced programmer, you simply
| cannot fathom the kinds of assumptions beginners will make
| due to gaps in yet-to-be foundational knowledge. I remember
| having to tell students to mindful when searching Stack
| Overflow for help, because of how something as simple as an
| error from Requests (e.g. while doing web scraping) could
| lead them down a rabbit hole of "solutions" such as
| completely uninstalling their Python for a different/older
| version of Python.
| layer8 wrote:
| They were using Cursor, not a general LLM, and were asking
| their fellow Cursor users how they deal with the risk of
| Cursor destroying the code base.
| namaria wrote:
| If you start from scratch trying to build an ideal system to
| program computers, you always converge on the time-tested
| tooling that we have now: code, compilers, interpreters,
| versioning, etc.
|
| People think "this is hard, I'll re-invent it in an easier
| way" and end up with a half-assed version of the tooling
| we've honed over the decades.
| mycall wrote:
| > People think "this is hard, I'll re-invent it in an
| easier way" and end up with a half-assed version of the
| tooling we've honed over the decades.
|
| This is a win in the long run, because occasionally the
| approach people labor over really does turn out to be a better
| way.
| donfotto wrote:
| > choosing a small number of tasks to execute exceptionally well
|
| And that is the Unix philosophy
| vivzkestrel wrote:
| Remember 2016 chatbots, anyone? Sounds like the same thing all
| over again, except this time we've got hallucinations and
| unpredictability.
| hirako2000 wrote:
| The problem with Devin wasn't that it was a black box doing too
| much. It's that the outcomes demo'd were fake and what was inside
| the box wasn't an "AI engineer."
|
| Transparency? If it worked even unreliably, nobody would care
| what it does. The problem is that stochastic machines aren't
| engineers, don't reason, and are not intelligent.
|
| I find articles that attack AI but blame some mouse rather than
| pointing at the elephant exhausting.
| ankit219 wrote:
| Agents in the current format are unlikely to go beyond current
| levels of reliability. I believe agents are a good use case in
| low-trust environments (outside of coding, where you can see the
| errors quickly with testing or deployment), like inter-company
| communications and tasks, where there are already systems in
| place for checks and for things going wrong. It might be a hot
| space in time. For intra-company, high-trust environments, it
| cannot just be workflow automation, given that any error would
| force the knowledge worker to redo the whole thing to check
| whether it's correct. We can verify via other agents - fewer
| chances of things going wrong - but more chance one screws up in
| the same place as the previous one.
| shireboy wrote:
| " It's easy to blame the user's missing grasp of basic version
| control, but that misses the deeper point."
|
| Uhh, no, that's pretty much the point. A developer without basic
| understanding of version control is like a pilot without a basic
| understanding of landing. A ton of problems with AI (or any other
| tool, including your own brain) get fixed by iterating on small
| commits and branching. Throw away the commit or branch if it
| really goes sideways. I can't fathom working on something for 4
| months without realizing a problem or having any way to roll
| back.
|
| That said, the one argument I could see is if Cursor (or Copilot,
| etc.) had something built in to suggest "this project isn't in
| source control, we should probably fix that before getting too
| far ahead of ourselves", and then helped the user set up source
| control, a repo, commits, etc. The topic _is_ tricky and I do
| remember not totally grasping git, branching, etc.
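|
| A hypothetical sketch of that check - not Cursor's actual
| behavior, and the repair steps are assumptions:
|
|     import subprocess
|
|     def ensure_git_repo(project_dir: str) -> None:
|         # Probe whether the project is already under version
|         # control.
|         probe = subprocess.run(
|             ["git", "rev-parse", "--is-inside-work-tree"],
|             cwd=project_dir, capture_output=True, text=True,
|         )
|         if probe.returncode != 0:
|             print("No source control detected; initializing...")
|             subprocess.run(["git", "init"], cwd=project_dir,
|                            check=True)
|             subprocess.run(["git", "add", "-A"], cwd=project_dir,
|                            check=True)
|             subprocess.run(["git", "commit", "-m", "Initial commit"],
|                            cwd=project_dir, check=True)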
| highmastdon wrote:
| The nice thing is that adding this to the base prompt Cursor
| uses would advance all those users and do away with this
| problem, only for them to discover the next one. Still, all
| these little things add up to a very powerful prompt, where the
| LLM makes it ever easier for anyone to build real stuff that on
| the surface looks very good.
| andreash wrote:
| We are building this with https://lets.dev. We believe there will
| be great demand for less capable but much more deterministic
| agents. I also recommend everyone read "What is an agent?" by
| Harrison Chase. https://blog.langchain.dev/what-is-an-agent/
| tristor wrote:
| The thing I most want an AI agent to do is something I can't
| trust to any third-party, it'd need to be local, and it's
| something well within LLM capabilities today. I just want a
| "secretary in my pocket" to take notes during conversations and
| produce minutes, but do so in a way that's secure and privacy-
| respecting (e.g. I can use it at work or at home).
| anishpalakurT wrote:
| Check out BAML at boundaryml.com
| piokoch wrote:
| Funny note about Cursor: a commercial project, rather expensive,
| that cannot figure out that it would be good to use, say, version
| control so as not to break somebody's work. That's why I prefer
| Aider (free), which simply commits whatever it does, so any
| change can be reverted. Easily.
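|
| Roughly this safety net, sketched - illustrative, not Aider's
| actual code:
|
|     import subprocess
|
|     def commit_ai_edit(repo_dir: str, description: str) -> None:
|         # Snapshot after every AI edit so each change is one
|         # revert away.
|         subprocess.run(["git", "add", "-A"], cwd=repo_dir,
|                        check=True)
|         subprocess.run(["git", "commit", "-m", f"ai: {description}"],
|                        cwd=repo_dir, check=True)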
| jlaneve wrote:
| I appreciate the distinction between agents and workflows - this
| seems to be commonly overlooked and in my opinion helps ground
| people in reliability vs capability. Today (and in the near
| future) there's not going to be "one agent to rule them all", so
| these LLM workflows don't need to be incredibly capable. They
| just need to do what they're intended to do _reliably_ and
| nothing more.
|
| I've started taking a very data engineering-centric approach to
| the problem where you treat an LLM as an API call as you would
| any other tool in a pipeline, and it's crazy (or maybe not so
| crazy) what LLM workflows are capable of doing, all with
| increased reliability. So much so that I've tried to package my
| thoughts / opinions up into an AI SDK for Apache Airflow [1] (one
| of the more popular orchestration tools that data engineers use).
| This feels like the right approach and in our customer base /
| community, it also maps perfectly to the organizations that have
| been most successful. The number of times I've seen companies
| stand up an AI team without really understanding _what problem
| they want to solve_...
|
| [1] https://github.com/astronomer/airflow-ai-sdk
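|
| A minimal sketch of the pattern in plain Airflow - the task
| bodies, model name, and pipeline are stand-ins, not the SDK's
| actual API:
|
|     from datetime import datetime
|     from airflow.decorators import dag, task
|
|     @dag(schedule="@daily", start_date=datetime(2025, 1, 1),
|          catchup=False)
|     def ticket_summaries():
|         @task
|         def fetch_tickets() -> list[str]:
|             # Placeholder for a real extract step (DB query,
|             # API call, ...).
|             return ["Ticket 1 text", "Ticket 2 text"]
|
|         @task
|         def summarize(tickets: list[str]) -> list[str]:
|             # The LLM is just another API call inside a task;
|             # Airflow supplies retries, scheduling, and lineage.
|             from openai import OpenAI
|             client = OpenAI()
|             summaries = []
|             for t in tickets:
|                 resp = client.chat.completions.create(
|                     model="gpt-4o-mini",
|                     messages=[{"role": "user",
|                                "content": f"Summarize: {t}"}],
|                 )
|                 summaries.append(resp.choices[0].message.content)
|             return summaries
|
|         @task
|         def load(summaries: list[str]) -> None:
|             print(summaries)  # placeholder for a real load step
|
|         load(summarize(fetch_tickets()))
|
|     ticket_summaries()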
| LeifCarrotson wrote:
| Unfortunately, LLMs, natural language, and human cognition
| largely are what they are. Mix the three together and you don't
| get reliability as a result.
|
| It's not like there's a lever in Cursor HQ where one side is
| "Capability" and one side is "Reliability", and they can make
| things better just by tipping it back towards the latter.
|
| You can bias designs and efforts in that direction, and get your
| tool to output reversible steps or bake in sanity checks to
| blessed actions, but that doesn't change the nature of the
| problem.
| rambambram wrote:
| We heard you, so we decided to tweak the dials a bit. The dial
| for 'capability' we can turn back a little, no problem, but the
| dial for 'reliability', uhm yeah... I'm sorry, but we couldn't
| find that dial. Sorry.
| extr wrote:
| The problem I find in many cases is that people are restrained by
| their imagination of what's possible, so they target existing
| workflows for AI. But existing workflows exist for a reason:
| someone already wanted to do that, and there have been countless
| man-hours put into the optimization of the UX/UI. And by
| definition they were possible before AI, so using AI for them is
| a bit of a solution in search of a problem.
|
| Flights are a good example but I often cite Uber as a good one
| too. Nobody wants to tell their assistant to book them an Uber -
| the UX/UI is so streamlined and easy, it's almost always easy
| enough to just do it yourself (or if you are too important for
| that, you probably have a private driver already). Basically
| anything you can do with an iPhone and the top 20 apps is in this
| category. You are literally competing against hundreds of
| engineers/product designers who had no other goal than to build
| the best possible experience for accomplishing X. Even if LLMs
| would have been helpful a priori - they aren't after every edge
| case has already been enumerated and planned for.
| lolinder wrote:
| > You are literally competing against hundreds of
| engineers/product designers who had no other goal than to build
| the best possible experience for accomplishing X.
|
| I think part of what's been happening here is that the hubris
| of the AI startups is really showing through.
|
| People working on these startups are by definition much more
| likely than average to have bought the AI hype. And what's the
| AI hype? That AI will replace humans at somewhere between "a
| lot" and "all" tasks.
|
| Given that we're filtering for people who believe that, it's
| unsurprising that they consciously or unconsciously devalue all
| the human effort that went into the designs of the apps they're
| looking to replace and think that an LLM could do better.
| arionhardison wrote:
| > I think part of what's been happening here is that the
| hubris of the AI startups is really showing through.
|
| I think it is somewhat reductive to assign this "hubris" to
| "AI startups". I would posit that this hubris is more akin to
| the superiority we feel as human beings.
|
| I have heard people say several times that they "treat AI
| like a Jr. employee". I think that within the context of a
| project, AI should be treated based on its level of
| contribution. If the AI is the expert, I am not going to
| approach it as if I am an SME who knows exactly what to ask. I
| am going to focus on the thing I know best, and ask questions
| around that to discover and learn the best approach. Obviously
| there is nuance here that is outside the scope of this
| discussion, but these two fundamentally different approaches
| have yielded materially different outcomes in my experience.
| arionhardison wrote:
| > The problem I find in many cases is that people are
| restrained by their imagination of what's possible, so they
| target existing workflows for AI.
|
| I concur, and would add that they are also restrained by the
| limitations of existing "systems" and our implicit and explicit
| expectations of those systems. I am currently attempting to
| mitigate the harm done by this restriction by starting with a
| first-principles analysis of the problem being solved before
| starting the work. For example, let's take a well-established
| and well-documented system like the SSA.
|
| When attempting to develop, refactor, extend, etc. such a
| system, what is the proper thought process? As I see it, there
| are two paths:
|
| Path 1:
| a) Break down the existing workflows
| b) Identify key performance indicators (KPIs) that align with
| your business goals
| c) Collect and analyze data related to those KPIs using BPM
| tools
| d) Find the most expensive, worst-performing workflows
| e) Automate them E2E with interface contracts on either side
|
| This approach locks you into the existing restrictions of the
| system, workflows, implementation, etc.
|
| Path 2:
| a) Analyze the system to understand its goal in terms of first
| principles, e.g.: What is the mission of the SSA? To move money
| based on conditional logic.
| b) Ask which systems / data structures are closest to this
| function, and whether the legacy system reflects this at its
| core (e.g. the SSA should just be a ledger, IMO)
| c) If yes, go to "Path 1"; if no, go to (d)
| d) Identify the core function of the system, the critical path
| (core workflow), and all required parties
| e) Build an MVP which does only the bare minimum
|
| By following Path 2, and starting off with an AI analysis of the
| actual problem rather than the problem as it exists as a
| solution within an existing system, it is my opinion that the
| previous restrictions are avoided.
|
| Note: Obviously this is a gross oversimplification of the
| project management process and there are usually external
| factors that weigh in and decide which path is possible for a
| given initiative, my goal here was just to highlight a specific
| deviation from my normal process that has yielded benefits so
| far in my own personal experience.
| peterjliu wrote:
| We've (ex Google Deepmind researchers) been doing research in
| increasing the reliability of agents and realized it is pretty
| non-trivial, but there are a lot of techniques to improve it. The
| most important thing is doing rigorous evals that are
| representative of what your users do in your product. Often this
| is not the same as academic benchmarks. We made our own
| benchmarks to measure progress.
|
| Plug: We just posted a demo of our agent doing sophisticated
| reasoning over a huge dataset (JFK assassination files -- 80,000
| PDF pages): https://x.com/peterjliu/status/1906711224261464320
|
| Even on small amounts of files, I think there's quite a palpable
| difference in reliability/accuracy vs the big AI players.
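|
| As a toy illustration of what a product-representative eval can
| start as (every file name and helper here is hypothetical):
|
|     import json
|
|     def run_agent(task: str) -> str:
|         # Placeholder: call your real agent/workflow here.
|         return "stub answer for: " + task
|
|     def grade(output: str, expected: str) -> bool:
|         # Cheap keyword check to start; graduate to an
|         # LLM-as-judge only once this stops being enough.
|         return expected.lower() in output.lower()
|
|     # evals.jsonl: one {"task": ..., "expected": ...} per line,
|     # drawn from real product usage, not an academic benchmark.
|     with open("evals.jsonl") as f:
|         cases = [json.loads(line) for line in f]
|
|     passed = sum(grade(run_agent(c["task"]), c["expected"])
|                  for c in cases)
|     print(f"{passed}/{len(cases)} passed")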
| ai-christianson wrote:
| > The most important thing is doing rigorous evals that are
| representative of what your users do in your product. Often
| this is not the same as academic benchmarks.
|
| OMFG thank you for saying this. As a core contributor to
| RA.Aid, optimizing it for SWE-bench seems like it would
| actively go against perf on real-world tasks. RA.Aid came about
| in the first place as a pragmatic programming tool (I created
| it while making another software startup, Fictie.) It works
| well because it was literally made and tested by making other
| software, and these days it mostly creates its own code.
|
| Do you have any tips or suggestions on how to do more
| formalized evals, but on tasks that resemble real world tasks?
| peterjliu wrote:
| I would start by making the examples yourself initially,
| assuming you have a good sense for what that real-world task
| is. If you can't articulate what a good task is and what a
| good output is, it is not ready for out-sourcing to crowd-
| workers.
|
| And before going to crowd-workers (maybe you can skip them
| entirely) try LLMs.
| ai-christianson wrote:
| > I would start by making the examples yourself initially
|
| What I'm doing right now is this:
|
| 1) I have X problem to solve using the coding agent.
| 2) I ask the agent to do X.
| 3) I use my own brain: did the agent do it correctly?
|
| If the agent did not do it correctly, I then ask: _should_
| the agent have been able to solve this? If so, I try to
| improve the agent so it's able to do that.
|
| The hardest part about automating this is #3 above -- each
| evaluation is one-off, and it would be hard to even formalize
| the evaluation.
|
| SWE-bench, for example, uses unit tests for this, and the agent
| is blind to the unit tests -- so the agent has to make a red
| test (which it has never seen) go green.
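|
| A sketch of automating that check with held-out tests, SWE-bench
| style (the paths and the agent call are hypothetical):
|
|     import subprocess
|
|     def tests_pass(repo_dir: str) -> bool:
|         # Run hidden tests the agent never saw; exit 0 == green.
|         result = subprocess.run(
|             ["python", "-m", "pytest", "tests/hidden", "-q"],
|             cwd=repo_dir, capture_output=True, text=True,
|         )
|         return result.returncode == 0
|
|     # assert not tests_pass("repo")  # red before the agent runs
|     # run_agent_on("repo", task)     # hypothetical agent call
|     # assert tests_pass("repo")      # green means the fix landed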
| jedberg wrote:
| I've been working on this problem for a while. There are whole
| companies that do this. They all work by having a human review a
| sample of the results and score them (with various uses of magic
| to make that more efficient). And then suggest changes to make it
| more accurate in the future.
|
| The best companies can get up to 90% accuracy. Most are closer to
| 80%.
|
| But it's important to remember, we're expecting perfection here.
| But think about this: Have you ever asked someone to book a
| flight for you? How did it go?
|
| At least in my experience, there's usually a few back and forth
| emails, and then something is always not quite right or as good
| as if you did it yourself, but you're ok with that because it
| saved you time. The one thing that makes it better is if the same
| person does it for you a couple of times and learned your
| specific habits and what you care about.
|
| I think the biggest problem in AI accuracy is expecting the AI to
| be better than a human.
| morsecodist wrote:
| This is really cool. I agree with your point that a human would
| also struggle to book a flight for someone, but what I take from
| that is that conversation is not the best interface for picking
| flights. I am not really sure how you beat a list of available
| flights + filters. There are a lot of criteria: total flight
| time, price, number of stops, length of layover, airline, and
| which airport if your destination is served by multiple
| airports. I couldn't really communicate to anyone how I weigh
| those, and it shifts over time.
| lolinder wrote:
| > I think the biggest problem in AI accuracy is expecting the
| AI to be better than a human.
|
| If it's not better across at least one of {more accurate,
| faster, cheaper} then there is no business. You have to be
| offering one of the above.
|
| And that applies both to humans and to existing tech solutions:
| an LLM solution must beat both in some dimension. Current
| flight booking interfaces are actually better than a human at
| _all three_: they're more accurate, they're free, and they're
| faster than trying to do the back and forth, which means the
| bar to clear for an agent is extremely high.
| bluGill wrote:
| > Current flight booking interfaces are actually better than
| a human at all three
|
| Only when you know exactly where to go. If you need to get to
| customers in 3 cities where order doesn't matter (i.e. the
| traveling salesman problem, though you are allowed to hit a
| city more than once), current solutions are not great. Nor do
| they help if you want to go on vacation but don't care much
| about where (almost every place with an airport would be an
| acceptable vacation, though some are better than others).
| rglover wrote:
| > Given the intensifying competition within AI, teams face a
| difficult balance: move fast and risk breaking things, or
| prioritize reliability and risk being left behind.
|
| Can we please retire this dichotomy? Part of why teams do this in
| the first place is because there's this language of "being left
| behind."
|
| We badly need to retreat to a world in which rigorous engineering
| is applauded and _expected_ -- not treated as a nice-to-have or
| "old world thinking."
| getnormality wrote:
| "Less capability, more reliability, please" is what I want to say
| about everything that's happened in the past 20 years. Of
| everything that's happened since then, I'm happy to have a few
| new capabilities: smartphones, driving directions, cloud storage,
| real-time collaborative editing of documents. I don't need
| anything else. And now I just want my gadget batteries to last
| longer, and working parental controls on my kids' devices.
| janalsncm wrote:
| I think many people share the same sentiment. We don't need
| agents that can _kind of_ do many things. We need reliable
| programs that are really good at doing a single thing. I said as
| much about Manus when it came out.
|
| https://news.ycombinator.com/item?id=43350950
|
| There are mistakes in the Manus demo if you actually look at it.
| As with so many AI demos, they never want you to look too
| closely, because the thing that was created is fairly mediocre.
| No one is asking for the tsunami of sludge, except for VCs,
| apparently.
| YetAnotherNick wrote:
| I think the author is making an apples-to-oranges comparison. If
| you have AI acting agentically, capability is likely positively
| correlated with reliability. If you don't have AI agents, it is
| more reliable.
|
| AI agents are not there yet; even Cursor doesn't select agent
| mode by default. I have seen Cursor's agent be quite a bit worse
| than the raw model with human-selected context.
| bendyBus wrote:
| "If your task can be expressed as a workflow, build a workflow".
| 100% true but the thing all these 'agent pattern' or 'workflow'
| diagrams miss is that real tasks require back-and-forth with a
| user, not just a Rube Goldberg machine that gets triggered in
| response to a _single user message_. What you need is not 'tool
| use' but something like 'process use'. This is what we did at
| Rasa, giving you a declarative way to define multi-step
| processes. An LLM lets you have a fluent conversation, but the
| execution of the task is pre-defined and deterministic:
| https://rasa.com/docs/learn/concepts/calm/ The fact that every
| framework starts with a `while` loop around an LLM and then duct-
| tapes on some "guardrails" betrays a lack of imagination.
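|
| A hedged sketch of the shape of "process use" - the flow is
| declared up front and executed deterministically, with the LLM
| only mapping messages to slot values (not Rasa's actual API):
|
|     # The process is data, not a prompt: steps can't be skipped.
|     BOOK_FLIGHT = {
|         "slots": ["origin", "destination", "date"],
|     }
|
|     def next_action(process: dict, filled: dict) -> str:
|         # Deterministic step selection; no model call involved.
|         for slot in process["slots"]:
|             if slot not in filled:
|                 return f"ask:{slot}"
|         return "confirm"
|
|     state: dict = {}
|     # llm_extract(msg) would be the only model call, mapping
|     # free text to slot values, e.g.:
|     # state.update(llm_extract("I need to fly OSL to BER"))
|     print(next_action(BOOK_FLIGHT, state))  # -> "ask:origin"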
| wg0 wrote:
| Totally agree with the author here. Also, reliability is pretty
| hard to achieve when the underlying models are all mountains of
| probability: no one yet understands exactly how they do what
| they do, or how to precisely fix a problem without affecting
| other parts.
|
| Here's CNBC pushing the greed narrative that these aren't AI
| wrappers but the next best thing after fire, bread, and the
| axe [0].
|
| [0]. https://youtu.be/mmws6Oqtq9o
| freeamz wrote:
| same can be said about digital tech/infrastructure in general!
| wg0 wrote:
| I can't say that based on what I know about both.
| cadamsdotcom wrote:
| Models aren't great at deciding whether an action is irreversible
| - and thus whether to stop to ask for input/advice/approval.
| Hence agentic systems usually are given a policy to follow.
|
| Perhaps the question "is this irreversible?" should be delegated
| to a separate model invocation.
|
| There could be a future in which agentic systems are a tree of
| model and tool invocations, maybe with a shared scratchpad.
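|
| A sketch of that delegation - the prompt and model are
| assumptions, not any product's actual API:
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def is_irreversible(action_description: str) -> bool:
|         # Separate, single-purpose invocation: one question,
|         # one YES/NO answer.
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",
|             messages=[{
|                 "role": "user",
|                 "content": "Answer only YES or NO. Would this "
|                            "action be hard to undo? Action: "
|                            + action_description,
|             }],
|         )
|         return "YES" in resp.choices[0].message.content.upper()
|
|     def execute(action, description: str):
|         # Gate every tool call; irreversible ones stop for
|         # approval instead of running.
|         if is_irreversible(description):
|             raise RuntimeError("Ask the user first: " + description)
|         return action()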
| genevra wrote:
| I agree up until the coding example. If someone doesn't know
| about version control I don't think that's any fault of the
| company trying to stretch the technology to its limits and let
| people experiment. Cursor is a really cool step in a direction,
| and it's weird to say we should clamp what it's doing because
| people might not be competent enough to fix its mistakes.
| kuil009 wrote:
| It's natural to expect reliability from AI agents -- but I don't
| think Cursor is a fair example. It's a developer tool deeply
| integrated with git, where every action can have serious
| consequences, as in any software development context.
|
| Rather than blaming the agent, we should recognize that this
| behavior is expected. It's not that AI is uniquely flawed -- it's
| that we're automating a class of human communication problems
| that already exist.
|
| This is less about broken tools and more about adjusting our
| expectations. Just like hunters had to learn how to manage
| gunpowder weapons after using bows, we're now figuring out how to
| responsibly wield this new power.
|
| After all, when something works exactly as intended, we already
| have a word for that: software.
| amogul wrote:
| Reliability, consistency, and accuracy are the next frontier
| that we all have to tackle, and it sucks. A friend of mine is
| building Empromptu.ai to tackle exactly this. From what she told
| me, they built a system that lets you define accuracy based on
| your use case, and their models optimize your whole system
| towards it.
___________________________________________________________________
(page generated 2025-03-31 23:00 UTC)