[HN Gopher] AI agents: Less capability, more reliability, please
___________________________________________________________________
AI agents: Less capability, more reliability, please
Author : serjester
Score : 279 points
Date : 2025-03-31 14:45 UTC (8 hours ago)
(HTM) web link (www.sergey.fyi)
(TXT) w3m dump (www.sergey.fyi)
| cryptoz wrote:
| This is refreshing to read. I, like everyone apparently, am
| working on my own coding agent [1]. And I suppose it's not that
| capable yet. But it sure is getting more reliable. I have it only
| modify 1 file at a time. It generates tickets for itself to
| complete - but never enough tickets to really get all the work
| done. The tickets it does generate, however, it often can
| complete (at least, in simple cases haha). The file modification
| is done through parsing ASTs and modifying those, so the AI
| doesn't go off and do all kinds of things to your whole codebase.
|
| And I'm so sick of everything trying for 100% automation and
| failing. There's a place for the human in the loop, in _quickly_
| identifying bugs the AI doesn't have the context for, or large-
| scale vision, or security or product-focused mindset, etc.
|
| It's going to be AI and humans collaborating. The solutions that
| figure that out the best are going to win IMO. AI won't be doing
| everything and humans won't be doing it all either. The tools
| with the best human-AI collaboration are where it's at.
|
| [1] https://codeplusequalsai.com
| helltone wrote:
| How do you modify ASTs?
| cryptoz wrote:
| I support HTML, JS, Python and CSS. For HTML (not
| technically an AST), I give the LLM the original-file HTML
| source, and then I instruct it to write python code that uses
| BeautifulSoup to modify the HTML. Then I get the string back
| from python of the full HTML file, modified according to the
| user prompt.
|
| For python changes I use ast and astor packages, for JS I use
| esprima/escodegen/estraverse, and for CSS I use postcss. The
| process is the same for each one: I give the original input
| source file, and I instruct the LLM to parse the file into AST
| form and then write code that modifies that AST.
|
| I blogged about it here if you want more details!
| https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
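|
| For a rough idea of the shape of this, here's a minimal sketch
| of the Python path (the rename task and file names are
| illustrative, not taken from the actual prompts):
|
|   # Sketch of the kind of script the LLM is asked to emit for a
|   # Python edit: parse to an AST, transform it, regenerate source.
|   import ast
|   import astor  # third-party: pip install astor
|
|   source = open("input.py").read()
|   tree = ast.parse(source)
|
|   class RenameFoo(ast.NodeTransformer):
|       # illustrative transform: rename function foo to bar
|       def visit_FunctionDef(self, node):
|           if node.name == "foo":
|               node.name = "bar"
|           self.generic_visit(node)
|           return node
|
|   new_source = astor.to_source(RenameFoo().visit(tree))
|   open("input.py", "w").write(new_source)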
| skydhash wrote:
| I took a look at your project and while it's nice
| (technically), for the actual use case shown, I can't see
| the value over something like the old Dreamweaver with a
| bit of training.
|
| I still think prompting is the wrong interface for
| programming systems. Even though they're restricted,
| configuration forms, visual programming with nodes, and
| small scripts attached to objects on a platform are way more
| reliable and useful.
| cryptoz wrote:
| Appreciate you having a look and for that feedback,
| thanks - I do agree I have work to do to _prove_ that my
| idea is better than alternatives. We'll see...
| dfxm12 wrote:
| _Google Flights already nails this UX perfectly_
|
| Often when using an AI agent, I think to myself that a web search
| gets me what I need more reliably and just as quickly. Maybe AI
| has to learn to crawl before it learns to walk, but each agent I
| use leaves me without confidence that it will ever be useful, and
| I genuinely wonder if they've ever been tested before being
| published...
| monero-xmr wrote:
| Assume humans can do anything in a factory. So we create a tool
| to increase the speed and reliability of the human's output. We
| do this so much that eventually the whole factory is automated,
| and the human is simply observing.
|
| Nowhere in that story above is there a customer or factory
| worker feeding in open-ended inputs. The factory is precise, it
| takes inputs and produces outputs. The variability is
| restricted to variability of inputs and the reliability of the
| factory kit.
|
| Much business software is analogous to the factory. You have
| human workers who ultimately operate the business. And software
| is built to automate those tasks precisely.
|
| AI struggles because engineers are trying to build factories
| through incantation - if they just say the right series of
| magic spells, the LLM will produce a factory.
|
| And often it can. It's just a shitty factory that does simple
| things, often inefficiently with unforeseen edge cases.
|
| At the moment, skilled factory builders (software engineers)
| are better at holistically understanding the needs of the
| business and building precise, maintainable, specific
| factories.
|
| The factory builders will use AI as a tool to help build better
| factories. Trying to get the AI to build the whole factory
| soup-to-nuts won't work.
| killjoywashere wrote:
| We have been looking at Hamming distance vs. time to signature
| for ambient note generation in medicine. Any other metrics?
| There are lots of metrics in the ML papers, but a lot of them
| seem sus: they take a lot of work to reproduce, or they are
| designed around some strategy like maxing out the easy true
| negatives (so you get desirable accuracy and F1 scores), etc.
| As someone trying to build validation protocols that I can get
| vendors to enable (I need them to write certain data from
| memory to a DB table we can access), I'd welcome that
| discussion. Right now the MBAs running the hospital systems are
| doing whatever their ML buddies say, without regard to patient
| or provider.
| simonw wrote:
| Yeah, the "book a flight" agent thing is a running joke now - it
| was a punchline in the Swyx keynote for the recent AI Engineer
| event in NYC: https://www.latent.space/p/agent
|
| I think this piece is underestimating the difficulty involved
| here though. If only it were as easy as "just pick a single task
| and make the agent really good at that"!
|
| The problem is that if your UI involves human beings typing or
| talking to you in a human language, there is an unbounded set of
| ways things could go wrong. You can't test against every possible
| variant of what they might say. Humans are bad at clearly
| expressing things, but even worse is the challenge of ensuring
| they have a concrete, accurate mental model of what the software
| can and cannot do.
| CooCooCaCha wrote:
| Case in point: look how long it's taken for self-driving cars to
| mature. And many would argue they still have a ways to go until
| they're truly reliable.
|
| I think this highlights how we still haven't cracked
| intelligence. Many of these issues come from the model's very
| limited ability to adapt on the fly.
|
| If you think about it, every little action we take is a micro
| learning opportunity. A small-scale scientific process of
| trying something and seeing the result. Current AI models can't
| really do that.
| SoftTalker wrote:
| Even maps. I was driving to Chicago last week and Apple Maps
| insisted I take the exit for Danville. Fortunately I knew
| better, I only had the map on in case an accident might
| require rerouting. I find it hard to drive with maps
| navigation because they are usually correct, but wrong often
| enough that I don't fully trust them. So I have to double
| check everything they tell me with the reality in front of
| me, and that takes more mental effort than it ideally should.
| noodletheworld wrote:
| Isn't the point he's making:
|
| >> Yet too many AI projects consistently underestimate this,
| chasing flashy agent demos promising groundbreaking
| capabilities--until inevitable failures undermine their
| credibility.
|
| This is the problem with the 'MCP for Foo' posts that have been
| appearing recently.
|
| Adding a capability to your agent that the agent can't use just
| gives us _exactly that_:
|
| > inevitable failures undermine their credibility
|
| It should be relatively easy for everyone to agree that giving
| agents an unlimited set of arbitrary capabilities will just
| make them terrible at everything; and that promising that
| giving them these capabilities will make them better is:
|
| A) false
|
| B) undermining the credibility of agentic systems
|
| C) undermining the credibility of the people making these
| promises
|
| ...I _get it_, it _is_ hard to write good agent systems, but
| surely, a bunch of half-baked, function-calling wrappers that
| don't really work... like, it's not a good look, right?
|
| It's just vibe coding for agents.
|
| I think it's quite reasonable to say, if you're building a
| system _now_, then:
|
| > The key to navigating this tension is focus--choosing a small
| number of tasks to execute exceptionally well and relentlessly
| iterating upon them.
|
| ^ This seems like exceptionally good advice. If you can't make
| something that's actually good by iterating on it until it _is_
| good and it _does_ work, then you're going to end up being a
| Devin (i.e. an over-promised, over-hyped failure).
| emn13 wrote:
| Perhaps the solution(s) need to focus less on output
| quality, and more on having a solid process for dealing with
| errors. Think undo, containers, git, CRDTs or whatever rather
| than zero tolerance for errors. That probably also means some
| kind of review for the irreversible bits of any process, and
| perhaps even process changes where possible to make common
| processes more reversible (which sounds like an extreme
| challenge in some cases).
|
| I can't imagine we're anywhere even close to the kind of
| perfection required not to need something like this - if it's
| even possible. Humans use all kinds of review and audit
| processes precisely because perfection is rarely attainable,
| and that might be fundamental.
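|
| A minimal sketch of what that could look like for code-editing
| agents, assuming the agent works in a local git checkout
| (apply_agent_edits is a placeholder, not a real API):
|
|   # Confine agent edits to a throwaway branch so every change is
|   # reviewable and reversible before it touches main.
|   import subprocess
|
|   def run(*args):
|       subprocess.run(args, check=True)
|
|   def apply_agent_edits():
|       pass  # stand-in for the agent modifying files in the repo
|
|   run("git", "switch", "-c", "agent/attempt-1")  # sandbox branch
|   apply_agent_edits()
|   run("git", "add", "-A")
|   run("git", "commit", "-m", "agent: proposed change")
|   # A human reviews the diff; merge if good, otherwise discard:
|   #   git switch main && git branch -D agent/attempt-1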
| techpineapple wrote:
| But, assuming this is a general thing, not just focused on, say,
| software development, can you make the tooling around
| creating this easier than defining the process itself?
| Everyone, loosely speaking, sees the value in test-driven
| development, but I often think that with complex processes,
| writing the test is harder than writing the process.
| RicoElectrico wrote:
| I want to make a simple solution where data is parsed by a
| vision model and "engineer for the unhappy path" is my
| assumption from the get-go. Changing the prompt or swapping
| the model is cheap.
| herval wrote:
| vision models are also faulty, and sometimes all paths are
| unhappy paths, so there's really no viable solution. Most
| of the time, swapping the model completely randomizes the
| problem space (unless you measure every single corner case,
| it's impossible to tell if everything got better or if some
| things got worse...)
| _bin_ wrote:
| The biggest issue I've seen is "context window poisoning",
| for lack of a better term. If it screws something up it's
| highly prone to repeating that mistake. It then makes a bad
| fix that propagates two more errors, then says, "Sure! Let me
| address that," and proceeds to poorly patch those rather than
| the underlying issue (say, restructuring the code to mitigate
| it).
|
| It is almost impossible to produce a useful result, as far as
| I've seen, unless one eliminates that mistake from the
| context window.
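|
| With chat-style APIs, one workaround is to rewind the message
| list rather than append corrections - a sketch (message contents
| and the index are illustrative):
|
|   # Excise the poisoned turns instead of piling fixes on top.
|   history = [
|       {"role": "user", "content": "Refactor the parser."},
|       {"role": "assistant", "content": "...buggy attempt..."},
|       {"role": "user", "content": "That broke X, please fix."},
|       {"role": "assistant", "content": "...worse attempt..."},
|   ]
|
|   bad_turn = 1                  # first assistant reply that went wrong
|   history = history[:bad_turn]  # drop it and everything after
|   history.append({
|       "role": "user",
|       "content": "Refactor the parser; keep signatures unchanged.",
|   })                            # retry with a sharper prompt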
| instakill wrote:
| I really really wish that LLMs had an "eject" function - as
| in I could click on any message in a chat, and it would
| basically start a new clone chat with the current chat's
| thread history.
|
| There are so many times where I get to a point where the
| conversation is finally flowing in the way that I want and
| I would love to "fork" into several directions from that
| one specific part of the conversation.
|
| Instead I have to rely on a prompt that requests the LLM to
| compress the entire conversation into a non-prose format
| that attempts to be as semantically lossless as possible;
| this sadly never works as intended.
| theblazehen wrote:
| You can use LibreChat which allows you to fork messages:
| https://www.librechat.ai/docs/features/fork
| tough wrote:
| Google's UI supports branching and deleting; someone
| recently made a blog post about how great it is.
| mvdtnz wrote:
| This is precisely what the poorly named Edit button does
| in Claude.
| bongodongobob wrote:
| I think this is one of the core issues people have when
| trying to program with them. If you have a long
| conversation with a bunch of edits, it will start to get
| unreliable. I frequently start new chats to get around this
| and it seems to work well for me.
| donmcronald wrote:
| This is what I find. If it makes a mistake, trying to get
| it to fix the mistake is futile and you can't "teach" it to
| avoid that mistake in the future.
| ModernMech wrote:
| > Perhaps the solutions(s) needs to be less focusing on
| output quality, and more on having a solid process for
| dealing with errors. Think undo, containers, git, CRDTs
|
| LLMs are supposed to save us from the toils of software
| engineering, but it looks like we're going to reinvent
| software engineering to make AI useful.
|
| Problem: Programming languages are too hard.
|
| Solution: AI!
|
| Problem: AI is not reliable, it's hard to specify problems
| precisely so that it understands what I mean unambiguously.
|
| Solution: Programming languages!
| Workaccount2 wrote:
| With pretty much every new technology, society has bent
| towards the tech too.
|
| When smartphones first popped up, browsing the web on them
| was a pain. Now pretty much the whole web has phone
| versions that make it easier*.
|
| *I recognize the folly of stating this on HN.
| LtWorf wrote:
| No, it's still a pain.
|
| There are apps that open links in their embedded browser,
| where ads aren't blocked. So I need to copy the link and
| open it in my real browser.
| serjester wrote:
| Even in Operator's original demo, the first thing they showed
| was booking restaurant reservations and ordering groceries. I
| understand their need to demo something intuitive, but it's
| still debatable whether these tasks are ones that most people
| want delegated to black-box agents.
| ToucanLoucan wrote:
| They don't. I have never once in my life wanted to talk to my
| smart speaker about what I wanted for dinner, not even
| because a smart speaker is/can be creepy, not because of
| social anxiety, no, it's just simpler and more
| straightforward to open Doordash on my damn phone, and look
| at a list of restaurants nearby to order from. Or browse a
| list of products on Amazon to buy. Or just call a restaurant
| to get a reservation. These tasks are trivial.
|
| And like, as a socially anxious millennial, no I don't
| particularly like phone calls. However I also recognize that
| setting my discomfort aside, a direct connection to a human
| being who can help reason out a problem I'm having is not
| something easily replaced with a chatbot or an AI assistant.
| It just isn't. Perfect example: called a place to make a
| reservation for myself, my wife and girlfriend (poly long
| story) and found the place didn't usually do reservations on
| the day in question, but the person did ask when we'd be
| there. As I was talking to a person, I could provide that
| information immediately, and say "if you don't take
| reservations don't worry, that's fine," but it was an off-peak
| hour so we got one anyway. How does an AI navigate that
| conversation more efficiently than me?
|
| As a techie person I basically spend the entire day
| interacting with various software to perform various tasks,
| work related and otherwise. I cannot overstate: NONE of these
| interactions, not a single one, is improved one iota by
| turning it into a conversation, verbal or text-based, with my
| or someone else's computer. By definition it makes basic
| tasks take longer, every time, without fail.
| bluGill wrote:
| I've more than once been on a road trip and realized I wanted
| something to help me find a meal wherever I'll be sometime in
| the next 2 hours. I have no idea what the options are and I
| can't find them. All too often I've settled for some generic
| fast food when I really wanted something local, but I couldn't
| get maps to tell me about it, even when it was one street away
| where I wouldn't see it. (Remember too, if I'm driving I can't
| spend time scrolling through a list - but even when I'm the
| navigator, the interface I can find in maps isn't good.)
| simonw wrote:
| I'm on a road trip across Utah and Colorado right now and
| I've been experimenting with both Gemini and OpenAI Deep
| Research for this kind of thing with surprisingly decent
| results. Here's one transcript from this morning:
| https://chatgpt.com/share/67e9f968-4e88-8006-b672-13381d5e95...
| 3p495w3op495 wrote:
| Any customer service or tech support rep can tell you that even
| humans can't always understand what other humans are attempting
| to say
| hansmayer wrote:
| It's so funny when people try to build robots imitating people.
| I mean part funny, part tragedy of the upcoming bust. The irony
| being, we would have been better off with an interoperable
| flight booking API standard which a deterministic _headless_
| agent could use to make perfect bookings every single time.
| There is a reason current user interfaces stem from a
| scientific discipline once called "_Human_-Computer
| Interaction".
| jatins wrote:
| But that's the promise of AI, right? You can't put an API on
| everything for human + technological reasons.
| hansmayer wrote:
| It is a promise alright :)
| dartos wrote:
| You can't put an API on everything because it'd take a ton
| of time and money to pull that off.
|
| I can't think of any technological reasons why every
| digital system can't have an API (barring security
| concerns, as those would need to be case by case)
|
| So instead, we put 100s of billions of dollars into
| statistical models hoping they could do it for us.
|
| It's kind of backwards.
| Scene_Cast2 wrote:
| You change who's paying.
| dartos wrote:
| Sure, as a biz it makes sense, but as a society, it's
| obviously a big failure.
| datadrivenangel wrote:
| A web page is an Application/Human Interface. Outside of
| security concerns, companies can make more money if they
| control the Application/Human Interface, and reduce the
| risk of a middleman / broker extorting them for margins.
|
| If I run a flight aggregator that has a majority of
| flight bookings, I can start charging 'rents' by allowing
| featured/sponsored listings to be promoted above the
| 'best' result, leading to a prisoner's dilemma where
| airlines should pay up to their margins to keep market
| share.
|
| If an AI company becomes the default application human
| interface, they can do the same thing. Pay OpenAI tribute
| or be ended as a going concern.
| daxfohl wrote:
| Exactly. It should take around 10 parameters to book a
| flight. Not 30,000,000,000 and a dedicated nuclear power
| plant.
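|
| For scale, a sketch of what "around 10 parameters" might look
| like (field names are made up for illustration):
|
|   from dataclasses import dataclass
|   from typing import Optional
|
|   @dataclass
|   class FlightRequest:
|       origin: str              # IATA code, e.g. "SFO"
|       destination: str         # e.g. "JFK"
|       earliest_departure: str  # ISO date
|       latest_arrival: str      # ISO date
|       max_price_usd: int
|       nonstop_only: bool
|       cabin: str               # "economy", "business", ...
|       passengers: int
|       alliance: Optional[str]  # e.g. "Star Alliance"
|       refundable: bool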
| TeMPOraL wrote:
| It's a business problem, not a tech problem. We don't have a
| solution you described because half of the air travel
| industry relies on things not being interoperable. AI is the
| solution at the limit, one set of companies selling users the
| ability to show a middle finger to a much wider set of
| companies - interoperability by literally having a digital
| human approximation pretending to be the user.
| the_snooze wrote:
| I've been a sentient human for at least the last 15 years
| of tech advancement. Assuming this stuff actually works,
| it's only a matter of time before these AI services claw
| back all that value for themselves and hold users and
| businesses hostage to one another, just like social media
| and e-commerce before.
| https://en.wikipedia.org/wiki/Enshittification
|
| Unless these tools can be run locally independent of a
| service provider, we're just trading one boss for another.
| ben_w wrote:
| > Unless these tools can be run locally independent of a
| service provider, we're just trading one boss for
| another.
|
| Many of them already can be. Many more existing models
| will become local options if/when RAM prices decline.
|
| But this won't necessarily prevent enshittification, as
| there's always a possibility of a new model being tasked
| with pushing adverts or propaganda. And perhaps existing
| models already have been -- certainly some people talk as
| if it's so.
| polishdude20 wrote:
| The difference is that social media isn't special because
| of its hardware or software even. People are stuck on
| Facebook because everyone else is on it. It's network
| effects. LLMs currently have no network effects. Your
| friends and family aren't "on" chatgpt so why use that
| over something else?
|
| Once performance of a local setup is on par with online
| ones or good enough, that'll be game over for them.
| bluGill wrote:
| The airlines rely on things not interoperating for you.
| However, their agents interoperate all the time via code
| sharing. They don't want normal people to do this, but if
| something goes wrong with the airplane you were supposed to
| be on, they would rather get you there than not.
| doug_durham wrote:
| Your use of the word "perfect" is doing a lot of heavy
| lifting. "Perfect" is a word embedded in a high dimensional
| space whose local maxima are different for every human on the
| planet.
| yujzgzc wrote:
| I'm old enough to remember having to talk to a (human) agent in
| order to book flights, and can confirm that in my experience,
| the modern flight booking website is an order of magnitude
| better UX than talking to someone about your travel plans.
| kccqzy wrote:
| That still exists. The last time I did onsite interviews,
| every single company that wanted to fly me to their office to
| interview me asked me to talk to a human agent to book
| flights. But of course the human agent is just a travel agent
| with no budgetary power; so I ended up calling the agent to
| inquire about a booking, then calling the recruiter to
| confirm that price is acceptable, and then calling the agent
| back to confirm the booking.
|
| It doesn't have to be this way. Even before the pandemic I
| remember some companies simply gave me access to an internal
| app to choose flights where the only flights shown are those
| of the right date, right airport, and right price.
| leoedin wrote:
| Yeah, I much prefer using a well designed self service system
| than trying to explain it over the phone.
|
| The only problem with most of the flights I book now is that
| they're with low cost airlines and packed with dark patterns
| designed to push upgrades.
|
| Would an AI salesman be any better though? At least the
| website can't actively try to persuade me to upgrade.
| Spooky23 wrote:
| It's no different than the old Amazon button thing. I'm not
| going to automatically pay whatever price Amazon is going to
| charge to push-button replenish household goods. Especially in
| those days, where "The World's Biggest" fence would have pretty
| wild swings in price.
|
| If i were rich enough to have some bot fly me somewhere, I'd
| have a real-life minion do it for me.
| burnte wrote:
| > Yeah, the "book a flight" agent thing is a running joke now
|
| I literally sat in a meeting with one of our board members who
| used this exact example of how "AI can do everything now!" and
| it was REALLY hard not to laugh.
| wdb wrote:
| Can Google Flights find the best flight dates to a destination
| within a time frame? E.g. get flights to LA in an up-to-15-day
| window that ensures attendance on 17 September. Fly with
| SkyAlliance airlines only. Flexible on dates, but I need to be
| there on 17 Sept with a minimum stay of eight days or more.
|
| I'd love it if it could help with that, but I haven't figured it
| out with Google Flights yet. My dream is to tell an AI agent the
| above and let it figure out the best deal.
| davesque wrote:
| Yep, and AI agents essentially throw up a boundary blocking the
| user from understanding the capabilities of the system they're
| using. They're like the touch screens in cars that no one asked
| for, but for software.
| photonthug wrote:
| > The problem is that if your UI involves human beings typing
| or talking to you in a human language, there is an unbounded
| set of ways things could go wrong. You can't test against every
| possible variant of what they might say.
|
| It's almost like we really might benefit from using the
| advances in AI for stuff like speech recognition to build
| _concrete interfaces with specific predefined vocabularies and
| a local-first UX_. But stuff like that undermines a cloud-based
| service and a constantly changing interface and the
| opportunities for general spying and manufacturing
| "engagement" while people struggle to use the stuff you've
| made. And of course, producing actual specifications means that
| you would have to own bugs. Besides eliminating employees, much
| interest in AI is all about completely eliminating
| responsibility. As a user of ML-based monitoring products and
| such for years.. "intelligence" usually implies no real
| specifications, and no specifications implies no bugs, and no
| bugs implies rent-seeking behaviour without the burden of any
| actual responsibilities.
|
| It's frustrating to see how often even technologists buy the
| story that "users don't want/need concrete specifications" or
| that "users aren't smart enough to deal with concrete
| interfaces". It's a trick.
| ramesh31 wrote:
| More capability, less reliability please. I want something that
| can achieve superhuman results 1 out of 10 times, not something
| that gives mediocre human results 9 out of 10 times.
|
| All of reality is probabilistic. Expecting that to map
| deterministically to solving open ended complex problems is
| absurd. It's vectors all the way down.
| soulofmischief wrote:
| Stability is the bedrock of the evolution of stable systems.
| LLMs will not democratize software until an average person can
| get consistently decent and useful results without needing to
| be a senior engineer capable of a thorough audit.
| ramesh31 wrote:
| >Stability is the bedrock of the evolution of stable systems.
|
| So we also thought about AI in general, and spent decades
| toiling on rules-based systems. Until interpretability was
| thrown out the window and we just started letting deep
| learning algorithms run wild with endless compute, and looked
| at the actual results. This will be very similar.
| skydhash wrote:
| Rules based systems are quite useful, not for interacting
| with an untrained human, but for getting things done. Deep
| learning can be good at exploring the edges of a problem
| space, but when a solution is found, we can actually get to
| the doing part.
| klabb3 wrote:
| This can be explained easily - there are simply some
| domains that were hard to model, and those are the ones
| where AI is outperforming humans. Natural language is the
| canonical example of this. Just because we focus on those
| domains now due to the recent advancements, doesn't mean
| that AI will be better at every domain, especially the ones
| we understand exceptionally well. In fact, all evidence
| suggests that AI excels at some tasks and struggles with
| others. The null hypothesis should be that it continues to
| be the case, even as capability improves. Not all
| computation is the same.
| soulofmischief wrote:
| Stability and probability are orthogonal concepts. You can
| have stable probabilistic systems. Look no further than our
| own universe, where everything is ultimately probabilistic
| and not "rules-based".
| deprave wrote:
| What would be a superhuman result for booking a flight?
| mjmsmith wrote:
| 10% of the time the seat on either side of you is empty, 90%
| of the time you land in the wrong country.
| Jianghong94 wrote:
| Superhuman results 1/10 of the time are, in fact, a very strong
| reliability guarantee (maybe not up to the many-nines standard
| we are accustomed to today, but probably much higher than any
| agent in a real-world workflow).
| klabb3 wrote:
| Reality is probabilistic yes but it's not black box. We can
| improve our systems by understanding and addressing the flaws
| in our engineering. Do you want probabilistic black-box
| banking? Flight controls? Insurance?
|
| "It works when it works" is fine when stakes are low and human
| is in the loop, like artwork for a blog post. And so in a way,
| I agree with you. AI doesn't belong in intermediate computer-
| to-computer interactions, unless the stakes are low. What
| scares me is that the AI optimists are desperately looking to
| apply LLMs to domains and tasks where the cost of mistakes is
| high.
| recursive wrote:
| > Expecting that to map deterministically to solving open ended
| complex problems is absurd.
|
| TCP creates an abstraction layer with more reliability than
| what it's built on. If you can detect failure, you can create a
| retry loop, assuming you can understand the rules of the
| environment you're operating in.
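|
| A sketch of that pattern around an agent step (run_agent and the
| validator are placeholders for whatever you're wrapping):
|
|   # TCP-style detect-and-retry around an unreliable step.
|   def run_agent(task: str) -> str:
|       return "{}"  # stand-in for a nondeterministic LLM call
|
|   def reliable_call(task, validate, attempts=3):
|       for _ in range(attempts):
|           result = run_agent(task)
|           if validate(result):  # cheap deterministic check
|               return result
|       raise RuntimeError(f"no valid result in {attempts} attempts")
|
|   reliable_call("emit JSON", lambda r: r.startswith("{"))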
| segh wrote:
| Lots of people are building on the edge of current AI
| capabilities, where things don't quite work, because in 6 months
| when the AI labs release a more capable model, you will just be
| able to plug it in and have it work consistently.
| techpineapple wrote:
| In 6 months when FSD is completed, and we get robots in every
| home? I suspect we keep adding features, because reliability is
| hard. I do not know what heuristic you would be looking to
| conclude that this problem will eventually be solved by current
| AI paradigms.
| thornewolf wrote:
| The GP comment is what has already happened "every 6 months",
| multiple times now.
| postexitus wrote:
| And where is that product that was developed on the edge of
| current AI capabilities and, now that the latest AI model is
| plugged in, suddenly works consistently? All I am seeing is
| models getting better and better at generating videos of
| spaghetti-eating movie stars.
| segh wrote:
| For me, they have come from the AI labs themselves. I have
| been impressed with Claude Code and OpenAI's Deep Research.
| vslira wrote:
| While I'm bullish on AI capabilities, that is not a very
| optimistic observation for developers building on top of it.
| cube00 wrote:
| > because in 6 months when the AI labs release a more capable
| model
|
| How many years do we have to keep hearing this line? ChatGPT is
| two years old and still can't be relied on.
| wiradikusuma wrote:
| Booking a flight is actually a task I cannot outsource to a human
| assistant, let alone AI. Maybe it's a third-world problem or just
| me being cheap, but there are heuristics involved when booking
| flights for a family trip or even just for myself.
|
| Check the official website, compare pricing with aggregator,
| check other dates, check people's availability on cheap dates.
| Sometimes I only do the first step if the official price is
| reasonable (I travel 1-2x a month, so I have an expectation of
| how much it should cost).
|
| Don't get me started if I also consider which credit card to use
| for the points rewards.
| kccqzy wrote:
| Completely agree! Especially considering that flights for most
| people are still a large expense, people, especially those in
| the credit card points game, like to go to great lengths to
| score the cheapest possible flights.
|
| For example, this person[0] could have simply booked a United
| flight from the United site for 15k points. Instead the person
| batch emailed Turkish Airlines booking offices, found the Thai
| office that was willing to make that booking but required bank
| transfers in Thai baht to pay taxes, made two more phone calls
| to Turkish Airlines to pay taxes with a credit card, and in the
| end only spent 7.5k points for the same trip on United.
|
| This may be an extreme example, but it shows the amount of
| familiarity with the points system, the customer service phone
| tree and the actual rules to get cheap flights.
|
| If AI can do all of that, it'd be useful. Otherwise I'll stick
| to manual booking.
|
| [0]: https://frequentmiler.com/yes-you-can-still-book-united-flig...
| Jianghong94 wrote:
| Now THAT's the workflow I'd like to see an AI agent automate,
| streamline and democratize for everybody.
| maxbond wrote:
| If it were available to everybody, it would disappear. This
| is a market inefficiency that a "trader" with deep
| knowledge of the structure of this market was able to
| exploit. But if everyone started doing this, United/Turkish
| Airlines would see they were losing money and eliminate it.
| Similar to how airlines have tried to stop people
| exploiting "hidden cities."
| davedx wrote:
| > Similar to how airlines have tried to stop people
| exploiting "hidden cities."
|
| This sounds interesting?
| wbxp99 wrote:
| https://skiplagged.com/
|
| Just don't book a round trip, don't check a bag, don't do
| it too often. Also you're gambling that they don't cancel
| your flight and book you on a new one to the city you
| don't actually want to go to (that no longer connects via
| the hidden city). You can get half price tickets
| sometimes with this trick.
| kristjansson wrote:
| and watch it immediately evaporate or require even more
| esoteric knowledge of opaque systems?
|
| Persistent mispricings can only exist if the cost of
| exploitation removes the benefit or constrains the
| population.
| victorbjorklund wrote:
| I don't really need an AI agent to book flights for me (I just
| don't travel enough for it to be any burden) but aren't those
| arguments for an AI agent? If you just wanna book the next
| flight London to New York it isn't that hard. A few minutes of
| clicking.
|
| But if you wanna find the cheapest way to get to A, compare
| different retailers, check multiple peoples availability,
| calculate effects of credit cards etc. It takes time. Aren't
| those things that could be automated with an agent that can
| find the cheapest flights, propose dates for it, check
| availability etc. with multiple people via a messaging app,
| calculate which credit card to use, etc?
| bgirard wrote:
| In theory, yes. But in a real world evaluation would it pick
| better flights? I'd like to see evidence that it's able to
| find a better flight that maximizes this. Also the tricky
| part is how do you communicate how much I personally weigh a
| shorter flight vs points on my preferred carrier vs having to
| leave for the airport at 5am vs 8am? I'm sure my answers
| would differ from wiradikusuma's answers.
| UncleMeat wrote:
| Yep this is my vibe.
|
| When I'm picking out a flight I'm looking at, among other
| things:
|
| * Is the itinerary aggravatingly early or late
|
| * Is the layover aggravatingly short or long
|
| * Is the layover in an airport that sucks
|
| * Is the flight on a carrier that sucks
|
| * What does it cost
|
| If you asked me to encode ahead of time the relative value
| of each of these dimensions I'd never be able to do it.
| Heck, the relative value to me isn't even constant over
| time. But show me five options and I can easily select
| between them. A clear case where search is more convenient
| than some agent doing it for me.
| bgirard wrote:
| I agree. At first I would be open to an LLM suggested
| option to appear in the search UI. I would have to pick
| it the majority of the time for quite a while for me to
| trust it enough to blindly book through it.
|
| It's the same problem with Alexa. I don't trust it to
| blindly reorder basic stuff for me when I have to sift
| through so many bad product listings on the Amazon
| marketplace.
| Jianghong94 wrote:
| Yep, that's what I've been thinking. This shouldn't be that
| hard; at this point LLMs should already have all the 'rules'
| (e.g. credit card A buying flight X gives you m points, which
| can be converted into n miles) in their params, or can easily
| query the web for them. Devs need to encode the whole thing
| into a decision mechanism and, once executed, ask the LLM to
| chase down the specific path (e.g. bombarding ticket offices
| with emails).
| joseda-hg wrote:
| The Flight Price to Tolerable Layover time ratio is something
| too personal for me to convey to an assistant
| zippergz wrote:
| I have HAD a human assistant who booked flights for me. But it
| took them a long time to learn the nuances of my preferences
| enough to do it without a lot of back and forth. And even then,
| they still sometimes had to ask. Things like what time of day I
| prefer to fly based on what I had going on the day before or
| what I'll be doing after I land. What airlines I prefer based
| on which lounges I'd have access to, or what aircraft they fly.
| When I would opt for a connecting flight to get a better price
| vs. when I want nonstop regardless of cost. And on and on.
| Probably dozens of factors that might come into play in various
| combinations depending on where I'm going and why. And
| preferences that are hard to articulate, but make sense once
| understood.
|
| With a really excellent human assistant who deeply understood
| my brain (at least the travel related parts of it), it was kind
| of nice. But even then there were times when I thought it would
| be easier and better to just do it myself. Maybe it's a failure
| of imagination, but I find it very hard to see the path from
| today's technology to an AI agent that I would trust enough to
| hand it off, and that would save enough time and hassle that I
| wouldn't prefer to just do it myself.
| pton_xd wrote:
| > Booking a flight is actually task I cannot outsource to a
| human assistant, let alone AI.
|
| Because there is no "correct" flight. Your preference changes
| as you discover information about what's available at a given
| time and price.
|
| The helpful AI assistant would present you with options, you'd
| choose what you prefer, it would refine the options, and so on,
| until you make your final selection. There would be no
| communication lag as there would be with a human assistant.
| That sounds very doable to me.
| qoez wrote:
| You get more reliability from better capability though. More
| capability means being better at not misclassifying subtle tasks,
| which is what causes reliability issues.
| joshdavham wrote:
| My rule of thumb has thus far been: if I'm gonna allow AI to
| write any bit of code for me, then I must, at a bare minimum, be
| able to understand that code.
|
| There's no way I could do what some of these "vibe coders" are
| doing where they allow AI to write code for them that they don't
| even understand.
| kevmo314 wrote:
| That's only true as long as you want to modify said code. If it
| meets your bar for reliability then you won't need to
| understand it, much like how we don't really need to
| read/understand compiled assembly code so we largely trust the
| compiler.
|
| A lot of these vibe coders just have a much lower bar for
| reliability than you.
| fourside wrote:
| How do you know if it meets your bar for reliability if you
| don't understand the output? I don't know that the analogy to
| a compiler is apples to apples. A compiler isn't producing an
| answer based on statistically generating something that
| should look like the right answer.
| kevmo314 wrote:
| The premise for vibe coding is that it's generating the
| entire app or site. If the app does what you want then it's
| meeting the bar.
| joshdavham wrote:
| This is an interesting point and it's certainly true with
| respect to most people's attitudes towards dependencies.
|
| For example, while I feel the need to understand the code I
| wrote using pytorch, I don't generally feel the need to
| totally grok how pytorch works.
| AlexandrB wrote:
| I think there's a lot of code that gets written that's either
| disposable or effectively "write only" in that no one is
| expected to maintain it. I have friends who write a lot of this
| code for tasks like data analysis for retail and "vibe coding"
| isn't that crazy in such a domain.
|
| Basically, what's worse? "Vibes" code that no one understands
| or a cascade of 20 spreadsheets that no one understands? At
| least with the "vibes" code you can stick it in git and have
| some semblance of sane revision control and change tracking.
| Centigonal wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain.
|
| I think this is a great use case for AI, but the analyst
| still needs to understand what the code that is output does.
| There are a lot of ways to transform data that result in
| inaccurate or misleading results.
| LPisGood wrote:
| Vibe coders focus on writing tests, and verifying
| function/correctness. It's not like they don't read _any_
| of the code. They get the vibes, but ignore the details.
| cube00 wrote:
| > Vibe coders focus on writing tests
|
| From the boasting I've seen, Vibe coders are also using
| AI to slop out their tests as well.
| kibwen wrote:
| Worry not, we can solve this by using AI to generate
| tests to test the tests.
| LPisGood wrote:
| Testing the tests is pretty much the definition of being
| a vibe coder.
| LPisGood wrote:
| Yeah and tests are much easier to validate than
| functions.
| hooverd wrote:
| Huh. The whole promise of vibe coding is that you don't
| have to pay attention to the details.
| namaria wrote:
| "You're programming wrong wrong" /s
| LPisGood wrote:
| Yeah, of course. I don't think what I described could
| possibly be misconstrued as someone paying attention to
| details.
| pton_xd wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain.
|
| That sort of makes sense, but then again... if you run some
| analysis code and it spits out a few plots, how do you know
| what you're looking at is correct if you have no idea what
| the code is doing?
| kibwen wrote:
| _> how do you know what you're looking at is correct if
| you have no idea what the code is doing?_
|
| Does it reaffirm the biases of the one who signs my
| paychecks? If so, then the code is correct.
| usui wrote:
| LOL, thanks for the laughs. But yes, seriously: most
| kinds of data analysis jobs several rungs down the ladder,
| where the result is not in a critical path, amount to
| reaffirming what the higher-ups believe. Don't rock the
| boat.
| AlexandrB wrote:
| Lol, that's definitely a factor. Actually plotting is the
| perfect example because python is really popular in the
| space and matplotlib sucks so much. While an analyst may
| not understand Python very well, they often understand
| the data itself through either previous projects or
| through other analysis tools. It's kind of like vibe
| coding a UI for a backend that's hand built.
| liveoneggs wrote:
| two wrongs don't make a right
| cube00 wrote:
| > I have friends who write a lot of this code for tasks like
| data analysis for retail and "vibe coding" isn't that crazy
| in such a domain
|
| Considering the hallucinations we've all seen I don't know
| how they can be comfortable using AI generated data analysis
| to drive the future direction of the business.
| inetknght wrote:
| > _what's worse? "Vibes" code that no one understands or a
| cascade of 20 spreadsheets that no one understands? At least
| with the "vibes" code you can stick it in git and have some
| semblance of sane revision control and change tracking._
|
| You can for spreadsheets too.
| palmotea wrote:
| > I think there's a lot of code that gets written that's either
| disposable or effectively "write only" in that no one is
| expected to maintain it. I have friends who write a lot of
| this code for tasks like data analysis for retail and "vibe
| coding" isn't that crazy in such a domain.
|
| > Basically, what's worse? "Vibes" code that no one
| understands or a cascade of 20 spreadsheets that no one
| understands?
|
| Correction: it's a "cascade of 20 spreadsheets" that _one_
| person understood/understands.
|
| Write only code still needs to work, and _someone_ at _some
| point_ needs to understand it well enough to know that it
| works.
| SkyPuncher wrote:
| Sure, but you're a professional software engineer, who I assume
| gets feedback and performance reviews based on the quality of
| your code.
|
| There's always been a group of beginners that throws stuff
| together without fully understanding what it does. In the past,
| this would be copy 'n' paste from Stack Overflow. Now, that
| process is simply more automated.
| __MatrixMan__ wrote:
| I think there are times where it's ok to treat a function like
| a black box--cases where anything that makes the test pass will
| do because the test is in fact an exhaustive evaluation of what
| that code needs to do.
|
| We just need to be better about making it clear which code is
| that way and which is not.
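|
| A toy case of what's meant here - the test enumerates the whole
| input domain, so any implementation that passes is
| interchangeable (clamp is a stand-in):
|
|   def clamp(x: int, lo: int, hi: int) -> int:
|       return max(lo, min(hi, x))  # any passing body will do
|
|   def test_clamp_exhaustive():
|       # covers the entire supported domain, not a sample of it
|       for x in range(-1000, 1001):
|           assert clamp(x, 0, 5) == max(0, min(5, x))
|
|   test_clamp_exhaustive()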
| mentalgear wrote:
| Capability demos (like the Rabbit R1 vaporware) will keep
| coming as long as the market is hot and investors (like
| lemmings) foolishly run after the companies that are best at
| hype.
| marban wrote:
| Giving up accuracy for a bit of convenience--if any at all--
| almost never pays off. Looking at you, Alexa.
| danielbln wrote:
| Image compression, eventual consistency, fuzzy search. There
| are many more examples I'm sure.
| skydhash wrote:
| > _Image compression, eventual consistency, fuzzy search.
| There are many more examples I'm sure._
|
| Aren't all of these very deterministic? You can predict what's
| going to be discarded by the compression algorithm. Eventual
| consistency is only eventual because of the generation of
| events. Once that stops, you will have a consistent system
| and the whole thing can be replayed based on the history of
| events. Even with fuzzy search you can intuit how to get
| reliable results and ordering without even looking at the
| algorithms.
|
| An LLM-based agent is the least efficient method for most of
| the cases they're marketing it for. Sometimes all you need is
| a rule-based engine. Then you can add bounded fuzziness where
| it's actually helpful.
| bhu8 wrote:
| I have been thinking about the exact same problem for a while and
| was literally hours away from publishing a blogpost on the
| subject.
|
| +100 on the footnote:
|
| > agents or workflows?
|
| Workflows. Workflows, all the way.
|
| The agents can start using these workflows once they are actually
| ready to execute stuff with high precision. And, by then we would
| have figured out how to create effective, accurate and easily
| diagnosable workflows, so people will stop complaining about "I
| want to know what's going on inside the black box".
| breckenedge wrote:
| Agreed, I started crafting workflows last week. Still not
| impressed: the current crop of models is poor at following
| instructions.
|
| And are there any guidelines on how to manage workflows for a
| project or set of projects? I'm just keeping them in plain text
| and including them in conversations ad hoc.
| DebtDeflation wrote:
| I've been building workflows with "AI" capability inserted
| where appropriate since 2016. Mostly customer service chatbots.
|
| 99.9% of real world enterprise AI use cases today are for
| workflows, not agents.
|
| However, "agents" are being pushed because the industry needs a
| next big thing to keep the investment funding flowing in.
|
| The problem is that even the best reasoning models available
| today don't have the actual reasoning and planning capability
| needed to build truly autonomous agents. They might in a year.
| Or they might not.
| narmiouh wrote:
| I feel like OP would have been better off not referencing the
| viral thread about a developer not using any version control and
| being surprised when the AI made changes. I don't think anyone
| who doesn't understand version control should be using a tool
| like Cursor; there are other SaaS apps that build and deploy
| apps using AI, and for people with the skill level demonstrated
| in the thread, that is what they should be using.
|
| It's like saying rm -rf / should have more safeguards built in.
| It feels unfair to call out the AI-based tools for this.
| fabianhjr wrote:
| `rm -rf /` does have a safeguard:
|
| > For example, if a user with appropriate privileges mistakenly
| runs 'rm -rf / tmp/junk', that may remove all files on the
| entire system. Since there are so few legitimate uses for such
| a command, GNU rm normally declines to operate on any directory
| that resolves to /. If you really want to try to remove all the
| files on your system, you can use the --no-preserve-root
| option, but the default behavior, specified by the --preserve-
| root option, is safer for most purposes.
|
| https://www.gnu.org/software/coreutils/manual/html_node/Trea...
| layer8 wrote:
| That was added in 2006, so didn't exist for a good half of
| its life (even longer if you count pre-GNU). I remember _rm
| -rf /_ being considered just one instance of having to
| double-check what you do when using the _-rf_ option. It's
| one reason it became common to alias _rm_ to _rm -i_.
| danso wrote:
| I think it's a useful anecdote because it underscores how
| catastrophically unreliable* agents can be, especially in the
| hands of users who aren't experienced in the particular domain.
| In the domain of programming, it's much easier to quantify a
| "catastrophic" scenario vs. more open-ended "real world"
| situations like booking a flight.
|
| * "unreliable" may not be the right word. For all we know, the
| agent performed admirably given whatever the user's prompt may
| have been. Just goes to show that even in a relatively
| constricted domain of programming, where a lot (but far from
| _all_ ) outcomes are binary, the room for misinterpretation and
| error is still quite vast.
| namaria wrote:
| More than that, I think it's quite relevant, because it shows
| that the complexity in the tooling around writing code
| manually is not so inessential as it seems.
|
| Any system capable of automating a complex task will by
| necessity be more complex than the task at hand. This
| complexity doesn't evaporate when you throw statistical
| fuzzers at it.
| outime wrote:
| Technically, they could be using version control, not have a
| copy on their local machine for some reason, and have an AI
| agent issue a `git push -f` wiping out all the previous work.
| jappwilson wrote:
| Can't wait for this to be a plot point in a murder mystery:
| someone games the AI agent to create a planned "accident".
| daxfohl wrote:
| We can barely make deterministic distributed services reliable.
| And microservices now have a bad reputation for being expensive
| distributed spaghetti. I'm not holding my breath for distributed
| AI agents to be a thing.
| twotwotwo wrote:
| FWIW, work has pushed use of Cursor and I quickly came around to
| a related conclusion: given a reliability vs. anything tradeoff,
| you more or less always have to prefer reliability. For example,
| even ignoring subtle head-scratcher type bugs, a faster model's
| output on average needs more revision before it basically works,
| and on average you end up spending more energy on that than you
| save by reducing time to first response. Up-front work that
| decreases the chance of trouble--detailing how you want something
| done, explicitly pulling specific libraries into context--also
| tends to be worth it on net, even if the agent might have gotten
| there by searching (or you could get it there through follow-up
| requests).
|
| That's my experience working with a largeish mature codebase (all
| on non-prod code) where you can't get far if you can't use
| various internal libraries correctly. With standalone (or small
| greenfield) projects, where results can lean more on public info
| from pre-training and there's not as much project specific info
| to pull in, you might see different outcomes.
|
| Maybe the tech and surrounding practice will change over time,
| but in my short experience it's mostly been about trying to just
| get to 'acceptable' for this kind of task.
| asdev wrote:
| Want reliability? Build automation instead of using non-
| deterministic models to complete tasks.
| nottorp wrote:
| But but...
|
| People don't get promoted for reliability. They get promoted for
| new capabilities. Everyone thinks they're the next Google.
| prng2021 wrote:
| I think the best shot we have at solving this problem is an
| explosion of specialized agents. That will limit how off the
| rails each one can go at interpreting or performing some type of
| task. The end user still just needs to interact with one agent
| though, as long as it can delegate properly to subagents.
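|
| A toy sketch of that shape - a front agent that only routes to
| narrow subagents and refuses anything outside their scope (all
| names here are made up):
|
|   def book_flight_agent(task: str) -> str:
|       return "flight booked (stub)"
|
|   def calendar_agent(task: str) -> str:
|       return "event created (stub)"
|
|   SUBAGENTS = {"flight": book_flight_agent,
|                "calendar": calendar_agent}
|
|   def classify(task: str) -> str:
|       # stand-in for a lightweight intent classifier
|       return "flight" if "fly" in task.lower() else "unknown"
|
|   def route(task: str) -> str:
|       handler = SUBAGENTS.get(classify(task))
|       if handler is None:
|           return "Sorry, I can't do that yet."  # fail loudly
|       return handler(task)
|
|   print(route("fly me to NYC on Friday"))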
| SkyPuncher wrote:
| Unfortunately, the picked example kind of weighs down the point.
| Cursor has an _extremely_ vocal minority (beginner coders) that
| isn't really representative of their heavyweight users
| (professional coders). These beginner users face significant
| issues that come from being new to programming, in general.
| Cursor gives them amazing capabilities, but it also lets them
| make the same dumb mistakes that most professional developers
| have made once or twice in their careers.
|
| That being said, back in February I was trying out of bunch of AI
| personal assistant apps/tools. I found, without fail, every
| single one of them was advertising features their LLMs could
| theoretically accomplish, but in practice couldn't. Even worse,
| many of these "assistants" would proactively suggest they
| could accomplish something but when you sent them out to do it,
| they'd tell you they couldn't.
|
| * "Would you like me to call that restaurant?"...."Sorry, I don't
| have support for that yet"
|
| * "Would you like me to create a reminder?"....Created the
| reminder, but never executed it
|
| * "Do you want me to check their website?"...."Sorry, I don't
| support that yet"
|
| Of all of the promised features, the only thing I ended up using
| any of them for was a text message interface to an LLM. Now that
| Siri has native ChatGPT support, it's not necessary.
| _cs2017_ wrote:
| Does anyone have AI agent use cases that you think might
| happen within this year and that feel very exciting to you?
|
| I personally struggle to find a new one (AI agent coding
| assistants already exist, and of course I'm excited about them,
| especially as they get better). I will not, any time soon, trust
| unsupervised AI to send emails on my behalf, make travel
| reservations, or perform other actions that are very costly to
| fix. AI as a shopping agent just isn't too exciting for me, since
| I do not believe I actually know what features in a speaker /
| laptop / car I want until I do my own research by reading what
| experts and users say.
| danso wrote:
| I think the replies [0] to the mentioned reddit thread sums up my
| (perhaps complacent?) feelings about the current state of
| automated AI programming:
|
| > _Does it terrify anyone else that there is an entire cohort of
| new engineers who are getting into programming because of AI, but
| missing these absolute basic bare necessities?_
|
| > > _Terrify? No, it's reassuring that I might still have a
| place in the world._
|
| [0]
| https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdo...
| bob1029 wrote:
| The reddit post feels like engagement bait to me.
|
| Why would you ask the community a question like "how to source
| control" when you've been working with (presumably) a
| programming genius LLM that could provide the most personally
| tailored path for baby's first git experience? Even if you
| don't know that "git" is a thing, you could ask questions as if
| you were a golden retriever and the model would still
| inevitably recommend git in the first turn of conversation.
|
| Is it really the case that a person who has the ability to use
| a compiler, IDE, LLM, web browser, reddit, etc., somehow
| simultaneously lacks the ability to frame basic-ass questions
| about the very mission they set out on? If stuff like this is
| _not_ manufactured, then we should all walk away feeling pretty
| fantastic about our future job prospects.
| danso wrote:
| The account is a throwaway but based on its short posting
| history and its replies, I don't have reason to believe it's
| a troll:
|
| https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdr...
|
| > _I'm not a dev or engineers at all (just a geek working in
| Finance)_
|
| This fits my experience of teaching very intelligent students
| how to code; if you're an experienced programmer, you simply
| cannot fathom the kinds of assumptions beginners will make
| due to gaps in yet-to-be foundational knowledge. I remember
| having to tell students to mindful when searching Stack
| Overflow for help, because of how something as simple as an
| error from Requests (e.g. while doing web scraping) could
| lead them down a rabbit hole of "solutions" such as
| completely uninstalling their Python for a different/older
| version of Python.
| layer8 wrote:
| They were using Cursor, not a general LLM, and were asking
| their fellow Cursor users how they deal with the risk of
| Cursor destroying the code base.
| namaria wrote:
| If you start from scratch trying to build an ideal system to
| program computers, you always converge on the time-tested
| tooling that we have now: code, compilers, interpreters,
| versioning, etc.
|
| People think "this is hard, I'll re-invent it in an easier
| way" and end up with a half-assed version of the tooling
| we've honed over the decades.
| mycall wrote:
| > People think "this is hard, I'll re-invent it in an
| easier way" and end up with a half-assed version of the
| tooling we've honed over the decades.
|
| This is a win in the long run, because occasionally the
| approach people labor over really does turn out to be a better
| way.
| donfotto wrote:
| > choosing a small number of tasks to execute exceptionally well
|
| And that is the Unix philosophy
| vivzkestrel wrote:
| Remember 2016 chatbots, anyone? Sounds like the same thing all
| over again, except this time we've got hallucinations and
| unpredictability.
| hirako2000 wrote:
| The problem with Devin wasn't that it was a black box doing too
| much. It's that the outcomes demo'd were fake and what was inside
| the box wasn't an "AI engineer."
|
| Transparency? If it worked even unreliably, nobody would care
| what it does. The problem is that stochastic machines aren't
| engineers, don't reason, and are not intelligent.
|
| I find articles that attack AI but blame some mouse rather than
| pointing at the elephant exhausting.
| ankit219 wrote:
| Agents in the current format are unlikely to go beyond current
| levels of reliability. I believe agents are a good use case in
| low-trust environments (outside of coding, where you can see the
| errors quickly with testing or deployment), like inter-company
| communications and tasks, where there are already systems in
| place for checks and for things going wrong. It might be a hot
| space in time. For intra-company, high-trust environments, it
| cannot just be workflow automation, given that any error would
| force the knowledge worker to redo the whole thing to check
| whether it's correct. We can verify via other agents - fewer
| chances of things going wrong - but more chance one screws up in
| the same place as the previous one.
| shireboy wrote:
| " It's easy to blame the user's missing grasp of basic version
| control, but that misses the deeper point."
|
| Uhh, no, that's pretty much the point. A developer without basic
| understanding of version control is like a pilot without a basic
| understanding of landing. A ton of problems with AI (or any other
| tool, including your own brain) get fixed by iterating on small
| commits and branching. Throw away the commit or branch if it
| really goes sideways. I can't fathom working on something for 4
| months without realizing a problem or having any way to roll
| back.
|
| That said, the one argument I could see is if Cursor (or Copilot,
| etc.) had something built in to suggest "this project isn't in
| source control, we should probably fix that before getting too
| far ahead of ourselves", and then helped the user set up source
| control, a repo, commits, etc. The topic _is_ tricky and I do
| remember not totally grasping git, branching, etc.
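|
| A hypothetical sketch of that check - not Cursor's actual
| behavior, and the repair steps are assumptions:
|
|     import subprocess
|
|     def ensure_git_repo(project_dir: str) -> None:
|         # Probe whether the project is already under version
|         # control.
|         probe = subprocess.run(
|             ["git", "rev-parse", "--is-inside-work-tree"],
|             cwd=project_dir, capture_output=True, text=True,
|         )
|         if probe.returncode != 0:
|             print("No source control detected; initializing...")
|             subprocess.run(["git", "init"], cwd=project_dir,
|                            check=True)
|             subprocess.run(["git", "add", "-A"], cwd=project_dir,
|                            check=True)
|             subprocess.run(["git", "commit", "-m", "Initial commit"],
|                            cwd=project_dir, check=True)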
| highmastdon wrote:
| The nice thing is that adding this to the base prompt Cursor
| uses would advance all those users and do away with this
| problem, only for them to discover the next one. Still, all
| these little things add up to a very powerful prompt, where the
| LLM makes it ever easier for anyone to build real stuff that on
| the surface looks very good.
| andreash wrote:
| We are building this with https://lets.dev. We believe there will
| be great demand for less capable but much more deterministic
| agents. I also recommend everyone read "What is an agent?" by
| Harrison Chase. https://blog.langchain.dev/what-is-an-agent/
| tristor wrote:
| The thing I most want an AI agent to do is something I can't
| trust to any third-party, it'd need to be local, and it's
| something well within LLM capabilities today. I just want a
| "secretary in my pocket" to take notes during conversations and
| produce minutes, but do so in a way that's secure and privacy-
| respecting (e.g. I can use it at work or at home).
| anishpalakurT wrote:
| Check out BAML at boundaryml.com
| piokoch wrote:
| Funny note about Cursor: a commercial project, rather expensive,
| that cannot figure out that it would be good to use, say, version
| control so as not to break somebody's work. That's why I prefer
| Aider (free), which simply commits whatever it does, so any
| change can be reverted. Easily.
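|
| Roughly this safety net, sketched - illustrative, not Aider's
| actual code:
|
|     import subprocess
|
|     def commit_ai_edit(repo_dir: str, description: str) -> None:
|         # Snapshot after every AI edit so each change is one
|         # revert away.
|         subprocess.run(["git", "add", "-A"], cwd=repo_dir,
|                        check=True)
|         subprocess.run(["git", "commit", "-m", f"ai: {description}"],
|                        cwd=repo_dir, check=True)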
| jlaneve wrote:
| I appreciate the distinction between agents and workflows - this
| seems to be commonly overlooked and in my opinion helps ground
| people in reliability vs capability. Today (and in the near
| future) there's not going to be "one agent to rule them all", so
| these LLM workflows don't need to be incredibly capable. They
| just need to do what they're intended to do _reliably_ and
| nothing more.
|
| I've started taking a very data engineering-centric approach to
| the problem where you treat an LLM as an API call as you would
| any other tool in a pipeline, and it's crazy (or maybe not so
| crazy) what LLM workflows are capable of doing, all with
| increased reliability. So much so that I've tried to package my
| thoughts / opinions up into an AI SDK for Apache Airflow [1] (one
| of the more popular orchestration tools that data engineers use).
| This feels like the right approach and in our customer base /
| community, it also maps perfectly to the organizations that have
| been most successful. The number of times I've seen companies
| stand up an AI team without really understanding _what problem
| they want to solve_...
|
| [1] https://github.com/astronomer/airflow-ai-sdk
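|
| A minimal sketch of the pattern in plain Airflow - the task
| bodies, model name, and pipeline are stand-ins, not the SDK's
| actual API:
|
|     from datetime import datetime
|     from airflow.decorators import dag, task
|
|     @dag(schedule="@daily", start_date=datetime(2025, 1, 1),
|          catchup=False)
|     def ticket_summaries():
|         @task
|         def fetch_tickets() -> list[str]:
|             # Placeholder for a real extract step (DB query,
|             # API call, ...).
|             return ["Ticket 1 text", "Ticket 2 text"]
|
|         @task
|         def summarize(tickets: list[str]) -> list[str]:
|             # The LLM is just another API call inside a task;
|             # Airflow supplies retries, scheduling, and lineage.
|             from openai import OpenAI
|             client = OpenAI()
|             summaries = []
|             for t in tickets:
|                 resp = client.chat.completions.create(
|                     model="gpt-4o-mini",
|                     messages=[{"role": "user",
|                                "content": f"Summarize: {t}"}],
|                 )
|                 summaries.append(resp.choices[0].message.content)
|             return summaries
|
|         @task
|         def load(summaries: list[str]) -> None:
|             print(summaries)  # placeholder for a real load step
|
|         load(summarize(fetch_tickets()))
|
|     ticket_summaries()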
| LeifCarrotson wrote:
| Unfortunately, LLMs, natural language, and human cognition
| largely are what they are. Mix the three together and you don't
| get reliability as a result.
|
| It's not like there's a lever in Cursor HQ where one side is
| "Capability" and one side is "Reliability", and they can make
| things better just by tipping it back towards the latter.
|
| You can bias designs and efforts in that direction, and get your
| tool to output reversible steps or bake in sanity checks to
| blessed actions, but that doesn't change the nature of the
| problem.
| rambambram wrote:
| We heard you, so we decided to tweak the dials a bit. The dial
| for 'capability' we can turn back a little, no problem, but the
| dial for 'reliability', uhm yeah... I'm sorry, but we couldn't
| find that dial. Sorry.
| extr wrote:
| The problem I find in many cases is that people are restrained by
| their imagination of what's possible, so they target existing
| workflows for AI. But existing workflows exist for a reason:
| someone already wanted to do that, and there have been countless
| man-hours put into the optimization of the UX/UI. And by
| definition they were possible before AI, so using AI for them is
| a bit of a solution in search of a problem.
|
| Flights are a good example but I often cite Uber as a good one
| too. Nobody wants to tell their assistant to book them an Uber -
| the UX/UI is so streamlined and easy, it's almost always easy
| enough to just do it yourself (or if you are too important for
| that, you probably have a private driver already). Basically
| anything you can do with an iPhone and the top 20 apps is in this
| category. You are literally competing against hundreds of
| engineers/product designers who had no other goal than to build
| the best possible experience for accomplishing X. Even if LLMs
| would have been helpful a priori - they aren't after every edge
| case has already been enumerated and planned for.
| lolinder wrote:
| > You are literally competing against hundreds of
| engineers/product designers who had no other goal than to build
| the best possible experience for accomplishing X.
|
| I think part of what's been happening here is that the hubris
| of the AI startups is really showing through.
|
| People working on these startups are by definition much more
| likely than average to have bought the AI hype. And what's the
| AI hype? That AI will replace humans at somewhere between "a
| lot" and "all" tasks.
|
| Given that we're filtering for people who believe that, it's
| unsurprising that they consciously or unconsciously devalue all
| the human effort that went into the designs of the apps they're
| looking to replace and think that an LLM could do better.
| arionhardison wrote:
| > I think part of what's been happening here is that the
| hubris of the AI startups is really showing through.
|
| I think it is somewhat reductive to assign this "hubris" to
| "AI startups". I would posit that this hubris is more akin to
| the superiority we feel as human beings.
|
| I have heard people say several times that they "treat AI
| like a Jr. employee". I think that within the context of a
| project, AI should be treated based on its level of
| contribution. If the AI is the expert, I am not going to
| approach it as if I am an SME who knows exactly what to ask. I
| am going to focus on the thing I know best, and ask questions
| around that to discover and learn the best approach. Obviously
| there is nuance here that is outside the scope of this
| discussion, but these two fundamentally different approaches
| have yielded materially different outcomes in my experience.
| arionhardison wrote:
| > The problem I find in many cases is that people are
| restrained by their imagination of what's possible, so they
| target existing workflows for AI.
|
| I concur, and would add that they are also restrained by the
| limitations of existing "systems" and our implicit and explicit
| expectations of those systems. I am currently attempting to
| mitigate the harm done by this restriction by starting with a
| first-principles analysis of the problem being solved before
| starting the work. For example, let's take a well-established
| and well-documented system like the SSA.
|
| When attempting to develop, refactor, extend, etc. such a
| system, what is the proper thought process? As I see it, there
| are two paths:
|
| Path 1:
| a) Break down the existing workflows
| b) Identify key performance indicators (KPIs) that align with
| your business goals
| c) Collect and analyze data related to those KPIs using BPM
| tools
| d) Find the most expensive, worst-performing workflows
| e) Automate them E2E with interface contracts on either side
|
| This approach locks you into the existing restrictions of the
| system, workflows, implementation, etc.
|
| Path 2:
| a) Analyze the system to understand its goal in terms of first
| principles, e.g.: What is the mission of the SSA? To move money
| based on conditional logic.
| b) Ask which systems / data structures are closest to this
| function, and whether the legacy system reflects this at its
| core (e.g. the SSA should just be a ledger, IMO)
| c) If yes, go to "Path 1"; if no, go to (d)
| d) Identify the core function of the system, the critical path
| (core workflow), and all required parties
| e) Build an MVP which does only the bare minimum
|
| By following Path 2, and starting off with an AI analysis of the
| actual problem rather than the problem as it exists as a
| solution within an existing system, it is my opinion that the
| previous restrictions are avoided.
|
| Note: Obviously this is a gross oversimplification of the
| project management process and there are usually external
| factors that weigh in and decide which path is possible for a
| given initiative, my goal here was just to highlight a specific
| deviation from my normal process that has yielded benefits so
| far in my own personal experience.
| peterjliu wrote:
| We've (ex Google Deepmind researchers) been doing research in
| increasing the reliability of agents and realized it is pretty
| non-trivial, but there are a lot of techniques to improve it. The
| most important thing is doing rigorous evals that are
| representative of what your users do in your product. Often this
| is not the same as academic benchmarks. We made our own
| benchmarks to measure progress.
|
| Plug: We just posted a demo of our agent doing sophisticated
| reasoning over a huge dataset (JFK assassination files -- 80,000
| PDF pages): https://x.com/peterjliu/status/1906711224261464320
|
| Even on small amounts of files, I think there's quite a palpable
| difference in reliability/accuracy vs the big AI players.
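|
| As a toy illustration of what a product-representative eval can
| start as (every file name and helper here is hypothetical):
|
|     import json
|
|     def run_agent(task: str) -> str:
|         # Placeholder: call your real agent/workflow here.
|         return "stub answer for: " + task
|
|     def grade(output: str, expected: str) -> bool:
|         # Cheap keyword check to start; graduate to an
|         # LLM-as-judge only once this stops being enough.
|         return expected.lower() in output.lower()
|
|     # evals.jsonl: one {"task": ..., "expected": ...} per line,
|     # drawn from real product usage, not an academic benchmark.
|     with open("evals.jsonl") as f:
|         cases = [json.loads(line) for line in f]
|
|     passed = sum(grade(run_agent(c["task"]), c["expected"])
|                  for c in cases)
|     print(f"{passed}/{len(cases)} passed")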
| ai-christianson wrote:
| > The most important thing is doing rigorous evals that are
| representative of what your users do in your product. Often
| this is not the same as academic benchmarks.
|
| OMFG thank you for saying this. As a core contributor to
| RA.Aid, optimizing it for SWE-bench seems like it would
| actively go against perf on real-world tasks. RA.Aid came about
| in the first place as a pragmatic programming tool (I created
| it while making another software startup, Fictie.) It works
| well because it was literally made and tested by making other
| software, and these days it mostly creates its own code.
|
| Do you have any tips or suggestions on how to do more
| formalized evals, but on tasks that resemble real world tasks?
| peterjliu wrote:
| I would start by making the examples yourself initially,
| assuming you have a good sense for what that real-world task
| is. If you can't articulate what a good task is and what a
| good output is, it is not ready for out-sourcing to crowd-
| workers.
|
| And before going to crowd-workers (maybe you can skip them
| entirely) try LLMs.
| ai-christianson wrote:
| > I would start by making the examples yourself initially
|
| What I'm doing right now is this:
|
| 1) I have X problem to solve using the coding agent.
| 2) I ask the agent to do X.
| 3) I use my own brain: did the agent do it correctly?
|
| If the agent did not do it correctly, I then ask: _should_
| the agent have been able to solve this? If so, I try to
| improve the agent so it's able to do that.
|
| The hardest part about automating this is #3 above -- each
| evaluation is one-off, and it would be hard to even formalize
| the evaluation.
|
| SWE-bench, for example, uses unit tests for this, and the agent
| is blind to the unit tests -- so the agent has to make a red
| test (which it has never seen) go green.
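|
| A sketch of automating that check with held-out tests, SWE-bench
| style (the paths and the agent call are hypothetical):
|
|     import subprocess
|
|     def tests_pass(repo_dir: str) -> bool:
|         # Run hidden tests the agent never saw; exit 0 == green.
|         result = subprocess.run(
|             ["python", "-m", "pytest", "tests/hidden", "-q"],
|             cwd=repo_dir, capture_output=True, text=True,
|         )
|         return result.returncode == 0
|
|     # assert not tests_pass("repo")  # red before the agent runs
|     # run_agent_on("repo", task)     # hypothetical agent call
|     # assert tests_pass("repo")      # green means the fix landed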
| jedberg wrote:
| I've been working on this problem for a while. There are whole
| companies that do this. They all work by having a human review a
| sample of the results and score them (with various uses of magic
| to make that more efficient). And then suggest changes to make it
| more accurate in the future.
|
| The best companies can get up to 90% accuracy. Most are closer to
| 80%.
|
| But it's important to remember, we're expecting perfection here.
| But think about this: Have you ever asked someone to book a
| flight for you? How did it go?
|
| At least in my experience, there's usually a few back and forth
| emails, and then something is always not quite right or as good
| as if you did it yourself, but you're ok with that because it
| saved you time. The one thing that makes it better is if the same
| person does it for you a couple of times and learned your
| specific habits and what you care about.
|
| I think the biggest problem in AI accuracy is expecting the AI to
| be better than a human.
| morsecodist wrote:
| This is really cool. I agree with your point that a human would
| also struggle to book a flight for someone, but what I take from
| that is that conversation is not the best interface for picking
| flights. I am not really sure how you beat a list of available
| flights + filters. There are a lot of criteria: total flight
| time, price, number of stops, length of layover, airline, and
| which airport if your destination is served by multiple
| airports. I couldn't really communicate to anyone how I weigh
| those, and it shifts over time.
| lolinder wrote:
| > I think the biggest problem in AI accuracy is expecting the
| AI to be better than a human.
|
| If it's not better across at least one of {more accurate,
| faster, cheaper} then there is no business. You have to be
| offering one of the above.
|
| And that applies both to humans and to existing tech solutions:
| an LLM solution must beat both in some dimension. Current
| flight booking interfaces are actually better than a human at
| _all three_: they're more accurate, they're free, and they're
| faster than trying to do the back and forth, which means the
| bar to clear for an agent is extremely high.
| bluGill wrote:
| > Current flight booking interfaces are actually better than
| a human at all three
|
| Only when you know exactly where to go. If you need to get to
| customers in 3 cities where order doesn't matter (i.e. the
| traveling salesman problem, though you are allowed to hit a
| city more than once), current solutions are not great. Nor do
| they help if you want to go on vacation but don't care much
| about where (almost every place with an airport would be an
| acceptable vacation, though some are better than others).
| rglover wrote:
| > Given the intensifying competition within AI, teams face a
| difficult balance: move fast and risk breaking things, or
| prioritize reliability and risk being left behind.
|
| Can we please retire this dichotomy? Part of why teams do this in
| the first place is because there's this language of "being left
| behind."
|
| We badly need to retreat to a world in which rigorous engineering
| is applauded and _expected_ -- not treated as a nice-to-have or
| "old world thinking."
| getnormality wrote:
| "Less capability, more reliability, please" is what I want to say
| about everything that's happened in the past 20 years. Of
| everything that's happened since then, I'm happy to have a few
| new capabilities: smartphones, driving directions, cloud storage,
| real-time collaborative editing of documents. I don't need
| anything else. And now I just want my gadget batteries to last
| longer, and working parental controls on my kids' devices.
| janalsncm wrote:
| I think many people share the same sentiment. We don't need
| agents that can _kind of_ do many things. We need reliable
| programs that are really good at doing a single thing. I said as
| much about Manus when it came out.
|
| https://news.ycombinator.com/item?id=43350950
|
| There are mistakes in the Manus demo if you actually look at it.
| As with so many AI demos, they never want you to look too
| closely, because the thing that was created is fairly mediocre.
| No one is asking for the tsunami of sludge, except for VCs,
| apparently.
| YetAnotherNick wrote:
| I think the author is making an apples-to-oranges comparison. If
| you have AI acting agentically, capability is likely positively
| correlated with reliability. If you don't have AI agents, it is
| more reliable.
|
| AI agents are not there yet; even Cursor doesn't select agent
| mode by default. I have seen Cursor's agent be quite a bit worse
| than the raw model with human-selected context.
| bendyBus wrote:
| "If your task can be expressed as a workflow, build a workflow".
| 100% true but the thing all these 'agent pattern' or 'workflow'
| diagrams miss is that real tasks require back-and-forth with a
| user, not just a Rube Goldberg machine that gets triggered in
| response to a _single user message_. What you need is not 'tool
| use' but something like 'process use'. This is what we did at
| Rasa, giving you a declarative way to define multi-step
| processes. An LLM lets you have a fluent conversation, but the
| execution of the task is pre-defined and deterministic:
| https://rasa.com/docs/learn/concepts/calm/ The fact that every
| framework starts with a `while` loop around an LLM and then duct-
| tapes on some "guardrails" betrays a lack of imagination.
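|
| A hedged sketch of the shape of "process use" - the flow is
| declared up front and executed deterministically, with the LLM
| only mapping messages to slot values (not Rasa's actual API):
|
|     # The process is data, not a prompt: steps can't be skipped.
|     BOOK_FLIGHT = {
|         "slots": ["origin", "destination", "date"],
|     }
|
|     def next_action(process: dict, filled: dict) -> str:
|         # Deterministic step selection; no model call involved.
|         for slot in process["slots"]:
|             if slot not in filled:
|                 return f"ask:{slot}"
|         return "confirm"
|
|     state: dict = {}
|     # llm_extract(msg) would be the only model call, mapping
|     # free text to slot values, e.g.:
|     # state.update(llm_extract("I need to fly OSL to BER"))
|     print(next_action(BOOK_FLIGHT, state))  # -> "ask:origin"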
| wg0 wrote:
| Totally agree with the author here. Also, reliability is pretty
| hard to achieve when the underlying models are all mountains of
| probability: no one yet understands exactly how they do what
| they do, or how to precisely fix a problem without affecting
| other parts.
|
| Here's CNBC pushing the greed narrative that these aren't AI
| wrappers but the next best thing after fire, bread, and the
| axe [0].
|
| [0]. https://youtu.be/mmws6Oqtq9o
| freeamz wrote:
| same can be said about digital tech/infrastructure in general!
| wg0 wrote:
| I can't say that based on what I know about both.
| cadamsdotcom wrote:
| Models aren't great at deciding whether an action is irreversible
| - and thus whether to stop to ask for input/advice/approval.
| Hence agentic systems usually are given a policy to follow.
|
| Perhaps the question "is this irreversible?" should be delegated
| to a separate model invocation.
|
| There could be a future in which agentic systems are a tree of
| model and tool invocations, maybe with a shared scratchpad.
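|
| A sketch of that delegation - the prompt and model are
| assumptions, not any product's actual API:
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def is_irreversible(action_description: str) -> bool:
|         # Separate, single-purpose invocation: one question,
|         # one YES/NO answer.
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",
|             messages=[{
|                 "role": "user",
|                 "content": "Answer only YES or NO. Would this "
|                            "action be hard to undo? Action: "
|                            + action_description,
|             }],
|         )
|         return "YES" in resp.choices[0].message.content.upper()
|
|     def execute(action, description: str):
|         # Gate every tool call; irreversible ones stop for
|         # approval instead of running.
|         if is_irreversible(description):
|             raise RuntimeError("Ask the user first: " + description)
|         return action()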
| genevra wrote:
| I agree up until the coding example. If someone doesn't know
| about version control I don't think that's any fault of the
| company trying to stretch the technology to its limits and let
| people experiment. Cursor is a really cool step in a direction,
| and it's weird to say we should clamp what it's doing because
| people might not be competent enough to fix its mistakes.
| kuil009 wrote:
| It's natural to expect reliability from AI agents -- but I don't
| think Cursor is a fair example. It's a developer tool deeply
| integrated with git, where every action can have serious
| consequences, as in any software development context.
|
| Rather than blaming the agent, we should recognize that this
| behavior is expected. It's not that AI is uniquely flawed -- it's
| that we're automating a class of human communication problems
| that already exist.
|
| This is less about broken tools and more about adjusting our
| expectations. Just like hunters had to learn how to manage
| gunpowder weapons after using bows, we're now figuring out how to
| responsibly wield this new power.
|
| After all, when something works exactly as intended, we already
| have a word for that: software.
| amogul wrote:
| Reliability, consistency, and accuracy are the next frontier
| that we all have to tackle, and it sucks. A friend of mine is
| building Empromptu.ai to tackle exactly this. From what she told
| me, they built a system that lets you define accuracy based on
| your use case, and their models optimize your whole system
| towards it.
___________________________________________________________________
(page generated 2025-03-31 23:00 UTC)