[HN Gopher] Is there a half-life for the success rates of AI age...
       ___________________________________________________________________
        
       Is there a half-life for the success rates of AI agents?
        
       Author : EvgeniyZh
       Score  : 190 points
       Date   : 2025-06-18 10:53 UTC (12 hours ago)
        
 (HTM) web link (www.tobyord.com)
 (TXT) w3m dump (www.tobyord.com)
        
       | mikeocool wrote:
       | This very much aligns with my experience -- I had a case
       | yesterday where opus was trying to do something with a library,
       | and it encountered a build error. Rather than fix the error, it
       | decided to switch to another library. It then encountered another
       | error and decided to switch back to the first library.
       | 
       | I don't think I've encountered a case where I've just let the LLM
       | churn for more than a few minutes and gotten a good result. If it
       | doesn't solve an issue on the first or second pass, it seems to
       | rapidly start making things up, make totally unrelated changes
       | claiming they'll fix the issue, or trying the same thing over and
       | over.
        
         | the__alchemist wrote:
         | This is consistent with my experience as well.
        
         | fcatalan wrote:
         | I brought over the source of the Dear imgui library to a toy
         | project and Cline/Gemini2.5 hallucinated the interface and when
         | the compilation failed started editing the library to conform
         | with it. I was all like: Nono no no no stop.
        
           | BoiledCabbage wrote:
           | Oh man that's good - next step create a PR to push it up
           | stream! Everyone can benefit from its fixes.
        
         | qazxcvbnmlp wrote:
         | when this happens I do thew following
         | 
         | 1) switch to a more expensive llm and ask it to debug: add
         | debugging statements, reason about what's going on, try small
         | tasks, etc 2) find issue 3) ask it to summarize what was wrong
         | and what to do differently next time 4) copy and paste that
         | recommendation to a small text document 5) revert to the
         | original state and ask the llm to make the change with the
         | recommendation as context
        
           | rurp wrote:
           | This honestly sounds slower than just doing it myself, and
           | with more potential for bugs or non-standard code.
           | 
           | I've had the same experience as parent where LLMs are great
           | for simple tasks but still fall down surprisingly quickly on
           | anything complex and sometimes make simple problems complex.
           | Just a few days ago I asked Claude how to do something with a
           | library and rather than give me the simple answer it
           | suggested I rewrite a large chunk of that library instead, in
           | a way that I highly doubt was bug-free. Fortunately I figured
           | there would be a much simpler answer but mistakes like that
           | could easily slip through.
        
             | ziml77 wrote:
             | Yeah if it gets stuck and can't easily get itself unstuck,
             | that's when I step in to do the work for it. Otherwise it
             | will continue to make more and more of a mess as it
             | iterates on its own code.
        
           | nico wrote:
           | > 1) switch to a more expensive llm and ask it to debug
           | 
           | You might not even need to switch
           | 
           | A lot of times, just asking the model to debug an issue,
           | instead of fixing it, helps to get the model unstuck (and
           | also helps providing better context)
        
         | enraged_camel wrote:
         | I've actually thought about this extensively, and experimented
         | with various approaches. What I found is that the quality of
         | results I get, and whether the AI gets stuck in the type of
         | loop you describe, depends on two things: how detailed and
         | thorough I am with what I tell it to do, and how robust the
         | guard rails I put around it are.
         | 
         | To get the best results, I make sure to give detailed specs of
         | both the current situation (background context, what I've tried
         | so far, etc.) and also what criteria the solution needs to
         | satisfy. So long as I do that, there's a high chance that the
         | answer is at least satisfying if not a perfect solution. If I
         | don't, the AI takes a lot of liberties (such as switching to
         | completely different approaches, or rewriting entire modules,
         | etc.) to try to reach what _it_ thinks is the solution.
        
           | prmph wrote:
           | But don't they keep forgetting the instructions after enough
           | time have passed? How do you get around that? Do you add an
           | instruction that after every action it should go back and
           | read the instructions gain?
        
             | enraged_camel wrote:
             | They do start "drifting" after a while, at which point I
             | export the chat (using Cursor), then start a new chat and
             | add the exported file and say "here's the previous
             | conversation, let's continue where we left off". I find
             | that it deals with the transition pretty well.
             | 
             | It's not often that I have to do this. As I mentioned in my
             | post above, if I start the interaction with thorough
             | instructions/specs, then the conversation concludes before
             | the drift starts to happen.
        
         | skerit wrote:
         | > I don't think I've encountered a case where I've just let the
         | LLM churn for more than a few minutes and gotten a good result.
         | 
         | Is this with something like Aider or CLine?
         | 
         | I've been using Claude-Code (with a Max plan, so I don't have
         | to worry about it wasting tokens), and I've had it successfully
         | handle tasks that take over an hour. But getting there isn't
         | super easy, that's true. The instructions/CLAUDE.md file need
         | to be perfect.
        
           | nico wrote:
           | > I've had it successfully handle tasks that take over an
           | hour
           | 
           | What kind of tasks take over an hour?
        
           | aprilthird2021 wrote:
           | You have to give us more about your example of a task that
           | takes over an hour with very detailed instruction. That's
           | very intriguing
        
         | onlyrealcuzzo wrote:
         | > If it doesn't solve an issue on the first or second pass, it
         | seems to rapidly start making things up, make totally unrelated
         | changes claiming they'll fix the issue, or trying the same
         | thing over and over.
         | 
         | Sounds like a lot of employees I know.
         | 
         | Changing out the entire library is quite amusing, though.
         | 
         | Just imagine: I couldn't fix this build error, so I migrated
         | our entire database from Postgres to MongoDB...
        
           | butterknife wrote:
           | Thanks for the laughs.
        
           | mathattack wrote:
           | It may be doing the wrong thing like an employee, but at
           | least it's doing it automatically and faster. :)
        
           | didgeoridoo wrote:
           | Probably had "MongoDB is web scale" in the training set.
        
           | KronisLV wrote:
           | That is amusing but I remember HikariCP in an older Java
           | project having issues with DB connections against an Oracle
           | instance. Thing is, no settings or debugging could really
           | (easily) narrow it down, whereas switching to DBCP2 both
           | fixed whatever was the stability issue in that particular
           | pairing, as well as has nice abandoned connection tracking
           | too. Definitely the quick and dirty solution that still has
           | good results.
        
         | Workaccount2 wrote:
         | They poison their own context. Maybe you can call it context
         | rot, where as context grows and especially if it grows with
         | lots of distractions and dead ends, the output quality falls
         | off rapidly. Even with good context the rot will start to
         | become apparent around 100k tokens (with Gemini 2.5).
         | 
         | They really need to figure out a way to delete or "forget"
         | prior context, so the user or even the model can go back and
         | prune poisonous tokens.
         | 
         | Right now I work around it by regularly making summaries of
         | instances, and then spinning up a new instance with fresh
         | context and feed in the summary of the previous instance.
        
           | codeflo wrote:
           | I wonder to what extent this might be a case where the base
           | model (the pure token prediction model without RLHF) is
           | "taking over". This is a bit tongue-in-cheek, but if you see
           | a chat protocol where an assistant makes 15 random wrong
           | suggestions, the most likely continuation has to be yet
           | another wrong suggestion.
           | 
           | People have also been reporting that ChatGPT's new "memory"
           | feature is poisoning their context. But context is also
           | useful. I think AI companies will have to put a lot of
           | engineering effort into keeping those LLMs on the happy path
           | even with larger and larger contexts.
        
             | potatolicious wrote:
             | I think this is at least somewhat true anecdotally. We do
             | know that as context length increases, adherence to the
             | system prompt decreases. Whether that de-adherence is
             | reversion to the base model or not I'm not really qualified
             | to say, but it certainly _feels_ that way from observing
             | the outputs.
             | 
             | Pure speculation on my part but it feels like this may be a
             | major component of the recent stories of people being
             | driven mad by ChatGPT - they have extremely long
             | conversations with the chatbot where the outputs start
             | seeming more like the "spicy autocomplete" fever dream
             | creative writing of pre-RLHF models, which feeds and
             | reinforces the user's delusions.
             | 
             | Many journalists have complained that they can't seem to
             | replicate this kind of behavior in their own attempts, but
             | maybe they just need a sufficiently long context window?
        
           | kossae wrote:
           | This is my experience as well, and for now comes down to a
           | workflow optimization. As I feel the LLM getting off track, I
           | start a brand new session with useful previous context pasted
           | in from my previous session. This seems to help steer it back
           | to a decent solution, but agreed it would be nice if this was
           | more automated based off of user/automated feedback (broken
           | unit test, "this doesn't work", etc.)
        
             | kazinator wrote:
             | "Human Attention to the Right Subset of the Prior Context
             | is All You Need"
        
           | HeWhoLurksLate wrote:
           | I've found issues like this happen extremely quickly with
           | ChatGPT's image generation features - if I tell it to put a
           | particular logo in, the first iteration looks okay, while
           | anything after that starts to look more and more cursed /
           | mutant.
        
             | rvnx wrote:
             | I've noticed something, even if you ask to edit a specific
             | picture, it will still used the other pictures in the
             | context (and this is somewhat unwanted)
        
             | vunderba wrote:
             | gpt-image-1 is unfortunately particular vulnerable to this
             | problem. The more you want to change the initial image -
             | the better off you'd honestly be just starting an entirely
             | new conversation.
        
             | nojs wrote:
             | https://www.astralcodexten.com/p/the-claude-bliss-attractor
        
           | OtherShrezzing wrote:
           | >They really need to figure out a way to delete or "forget"
           | prior context, so the user or even the model can go back and
           | prune poisonous tokens.
           | 
           | This is possible in tools like LM Studio when running LLMs
           | locally. It's a choice by the implementer to grant this
           | ability to end users. You pass the entire context to the
           | model in each turn of the conversation, so there's no
           | technical reason stopping this feature existing, besides
           | maybe some cost benefits to the inference vendor from cache.
        
           | steveklabnik wrote:
           | > They really need to figure out a way to delete or "forget"
           | prior context, so the user or even the model can go back and
           | prune poisonous tokens.
           | 
           | In Claude Code you can use /clear to clear context, or
           | /compact <optional message> to compact it down, with the
           | message guiding what stays and what goes. It's helpful.
        
             | libraryofbabel wrote:
             | Also in Claude Code you can just press <esc> a bunch of
             | times and you can backtrack to an earlier point in the
             | history before the context was poisoned, and re-start from
             | there.
             | 
             | Claude has some amazing features like this that aren't very
             | well documented. Yesterday I just learned it writes
             | sessions to disk and you can resume them where you left off
             | with -continue or - resume if you accidentally close or
             | something.
        
               | drewnick wrote:
               | Thank you! This just saved me after closing laptop and
               | losing a chat in VS Code. Cool feature and always a place
               | where Clause Code UX was behind chat - being able to see
               | history. "/continue" saved me ~15 minutes of re-
               | establishing the planning for a new feature.
               | 
               | Also loving the shift + tab (twice) to enter plan mode.
               | Just adding here in case it helps anyone else.
        
               | Aeolun wrote:
               | Claude code should really document this stuff in some
               | kind of tutorial. There's too much about code that I need
               | to learn from random messages on the internet.
        
               | steveklabnik wrote:
               | > Claude has some amazing features like this that aren't
               | very well documented.
               | 
               | Yeah, it seems like they stealth ship a lot. Which is
               | cool, but can sometimes lead to a future that's unevenly
               | distributed, if you catch my drift.
        
           | autobodie wrote:
           | If so, that certainly fits with my experiences.
        
           | darepublic wrote:
           | There are just certain problems that they cannot solve.
           | Usually when there is no clear example in its pretraining or
           | discoverable on the net. I would say the reasoning
           | capabilities of these models are pretty shallow, at least it
           | seems that way to me
        
             | dingnuts wrote:
             | They can't reason at all. The language specification for
             | Tcl 9 is in the training data of the SOTA models but there
             | exist almost no examples, only documentation. Go ahead, try
             | to get a model to write Tcl 9 instead of 8.5 code and see
             | for yourself. They can't do it, at all. They write 8.5
             | exclusively, because they only copy. They don't reason.
             | "reasoning" in LLMs is pure marketing.
        
           | eplatzek wrote:
           | Honestly that feels a like a human.
           | 
           | After hitting my head against a wall with a problem I need to
           | stop.
           | 
           | I need to stop and clear my context. Go a walk. Talk with
           | friends. Switch to another task.
        
           | no_wizard wrote:
           | I feel like as an end user I'd like to be able to do more to
           | shape the LLM behavior. For example, I'd like to flag the
           | dead end paths so they're properly dropped out of context and
           | not explored again, unless I as a user clear the flag(s).
           | 
           | I know there is work being done on LLM "memory" for lack of a
           | better term but I have yet to see models get _more
           | responsive_ over time with this kind of feedback. I know I
           | can flag it but right now it doesn't help my "running"
           | context that would be unique to me.
           | 
           | I have a similar thought about LLM "membranes", which
           | combines the learning from multiple users to become more
           | useful, I am keeping a keen eye on that as I think that will
           | make them more useful on a organizational level
        
           | iLoveOncall wrote:
           | > They really need to figure out a way to delete or "forget"
           | prior context
           | 
           | This is already pretty much figured out:
           | https://www.promptingguide.ai/techniques/react
           | 
           | We use it at work and we never encounter this kind of issues.
        
           | dr_dshiv wrote:
           | Context rot! Love it. One bad apple spoils the barrel.
           | 
           | I try to keep hygiene with prompts; if I get anything bad in
           | the result, I try to edit my prompts to get it better rather
           | than correcting in conversation.
        
         | peacebeard wrote:
         | Very common to see in comments some people saying "it can't do
         | that" and others saying "here is how I make it work." Maybe
         | there is a knack to it, sure, but I'm inclined to say the
         | difference between the problems people are trying to use it on
         | may explain a lot of the difference as well. People are not
         | usually being too specific about what they were trying to do.
         | The same goes for a lot of programming discussion of course.
        
           | heyitsguay wrote:
           | I've noticed this a lot, too, in HN LLM discourse.
           | 
           | (Context: Working in applied AI R&D for 10 years, daily user
           | of Claude for boilerplate coding stuff and as an HTML coding
           | assistant)
           | 
           | Lots of "with some tweaks i got it to work" or "we're using
           | an agent at my company", rarely details about what's working
           | or why, or what these production-grade agents are doing.
        
           | alganet wrote:
           | > People are not usually being too specific about what they
           | were trying to do. The same goes for a lot of programming
           | discussion of course.
           | 
           | In programming, I already have a very good tool to follow
           | specific steps: _the programming language_. It is designed to
           | run algorithms. If I need to be specific, that's the tool to
           | use. It does exactly what I ask it to do. When it fails, it's
           | my fault.
           | 
           | Some humans require algorithmic-like instructions too. Like
           | cooking a recipe. However, those instructions can be very
           | vague and a lot of humans can still follow it.
           | 
           | LLMs stand on this weird place where we don't have a clue in
           | which occasions we can be vague or not. Sometimes you can be
           | vague, sometimes you can't. Sometimes high level steps are
           | enough, sometimes you need fine-grained instructions. It's
           | basically trial and error.
           | 
           | Can you really blame someone for not being specific enough in
           | a system that only provides you with a text box that offers
           | anthropomorphic conversation? I'd say no, you can't.
           | 
           | If you want to talk about how specific you need to prompt an
           | LLM, there must be a well-defined treshold. The other option
           | is "whatever you can expect from a human".
           | 
           | Most discussions seem to juggle between those two. LLMs are
           | praised when they accept vague instructions, but the user is
           | blamed when they fail. Very convenient.
        
             | peacebeard wrote:
             | I am not saying that people were not specific in their
             | instructions to the LLM, but rather that in the discussion
             | they are not sharing specific details of their success
             | stories or failures. We are left seeing lots of people
             | saying "it worked for me" and "it didn't work for me"
             | without enough information to assess what was different in
             | those cases. What I'm contending is that the essential
             | differences in the challenges they are facing may be a
             | primary factor, while these discussions tend to focus on
             | the capabilities of the LLM and the user.
        
               | alganet wrote:
               | > they are not sharing specific details of their success
               | stories or failures
               | 
               | Can you blame them for that?
               | 
               | For other products, do you think people contact customer
               | support with an abundance of information?
               | 
               | Now, consider what these LLM products promise to deliver.
               | Text box, answer. Is there any indication that different
               | challenges might yield difference in the quality of
               | outcome? Nope. Magic genie interface, it either works or
               | it doesn't.
        
           | Aeolun wrote:
           | I ask it to build it to 3d voxel engine in Rust, and it just
           | goes off and do it. Same for a vox file parser.
           | 
           | Sure, it takes some creative prompting, and a lot of turns to
           | get it to settle on the proper coordinate system for the
           | whole thing, but it goes ahead and does it.
           | 
           | This took me two days so far. Unfortunate, the scope of the
           | thing is now so large that the quality rapidly starts to
           | degrade.
        
         | accrual wrote:
         | I've had some similar experiences. While I find agents very
         | useful and able to complete many tasks on its own, it does hit
         | roadblocks sometimes and its chosen solution can be
         | unusual/silly.
         | 
         | For example, the other day I was converting models but was
         | running out of disk space. The agent decided to change the
         | quantization to save space when I'd prefer it ask "hey, I need
         | some more disk space". I just paused it, cleared some space,
         | then asked the agent to try the original command again.
        
         | vadansky wrote:
         | I had a particularly hard parsing problem so I setup a bunch of
         | tests and let the LLM churn for a while and did something else.
         | 
         | When I came back all the tests were passing!
         | 
         | But as I ran it live a lot of cases were still failing.
         | 
         | Turns out the LLM hardcoded the test values as "if ('test
         | value') return 'correct value';"!
        
           | bluefirebrand wrote:
           | This is the most accurate Junior Engineer behavior I've heard
           | LLMs doing yet
        
           | ffsm8 wrote:
           | Missed opportunity for the LLM, could've just switched to
           | Volkswagen CI
           | 
           | https://github.com/auchenberg/volkswagen
        
             | EGreg wrote:
             | This is gold lol
        
             | artursapek wrote:
             | lmfao
        
           | vunderba wrote:
           | I've definitely seen this happen before too. Test-driven
           | development isn't all that effective if the LLM's only stated
           | goal is to pass the tests without thinking about the problem
           | in a more holistic/contextual manner.
        
             | matsemann wrote:
             | Reminds me of trying to train a small neural net to play
             | Robocode ~10+ years ago. Tried to "punish" it for hitting
             | walls, so next morning I had evolved a tanks that just
             | stood still... Then punished it for standing still, ended
             | up with a tanks just vibrating, alternating moving back and
             | forth quickly, etc.
        
               | vunderba wrote:
               | That's great. There's a pretty funny example of somebody
               | training a neural net to play Tetris on the Nintendo
               | entertainment system, and it quickly learned that if it
               | was about to lose to just hit pause and leave the game in
               | that state indefinitely.
        
               | amlib wrote:
               | I guess it came to the same conclusion as the computer in
               | War Games, "The only way to win is not to play"
        
           | mikeocool wrote:
           | Yeah -- I had something like this happen as well -- the llm
           | wrote a half decent implementation and some good tests, but
           | then ran into issues getting the tests to pass.
           | 
           | It then deleted the entire implementation and made the
           | function raise a "not implemented" exception, updated the
           | tests to expect that, and told me this was a solid base for
           | the next developer to start working on.
        
           | insane_dreamer wrote:
           | While I haven't run into this egregious of an offense, I have
           | had LLMs either "fix" the unit test to pass with buggy code,
           | or, conversely, "fix" the code to so that the test passes but
           | now the code does something different than it should (because
           | the unit test was wrong to start with).
        
         | nico wrote:
         | > Rather than fix the error, it decided to switch to another
         | library
         | 
         | I've had a similar experience, where instead of trying to fix
         | the error, it added a try/catch around it with a log message,
         | just so execution could continue
        
         | mtalantikite wrote:
         | I had Claude Code deep inside a change it was trying to make,
         | struggling with a test that kept failing, and then decided to
         | delete the test to make the test suite pass. We've all been
         | there!
         | 
         | I generally treat all my sessions with it as a pairing session,
         | and like in any pairing session, sometimes we have to stop
         | going down whatever failing path we're on, step all the way
         | back to the beginning, and start again.
        
           | nojs wrote:
           | > decided to delete the test to make the test suite pass
           | 
           | At least that's easy to catch. It's often more insidious like
           | "if len(custom_objects) > 10:" or "if object_name == 'abc'"
           | buried deep in the function, for the sole purpose of making
           | one stubborn test pass.
        
           | akomtu wrote:
           | Claude Doctor will hopefully do better.
        
           | Aeolun wrote:
           | Hah, I got "All this async stuff looks really hard, let's
           | just replace it with some synchronous calls".
           | 
           | "Claude, this is a web server!"
           | 
           | "My apologies... etc."
        
         | dylan604 wrote:
         | > it encountered a build error.
         | 
         | does this mean that even AI gets stuck in dependency hell?
        
           | hhh wrote:
           | of course, i've even had them actively say they're giving up
           | after 30 turns
        
         | reactordev wrote:
         | This happens in real life too when a dev builds something using
         | too much copy pasta and encounters a build error. Stackoverflow
         | was founded on this.
        
         | EnPissant wrote:
         | Where you using Claude Code or something else? I've had very
         | good luck with Claude Code not doing what you described.
        
         | Wowfunhappy wrote:
         | > I don't think I've encountered a case where I've just let the
         | LLM churn for more than a few minutes and gotten a good result.
         | 
         | I absolutely have, for what it's worth. Particularly when the
         | LLM has some sort of test to validate against, such as a test
         | suite or simply fixing compilation errors until a project
         | builds successfully. It will just keep chugging away until it
         | gets it, often with good overall results in the end.
         | 
         | I'll add that until the AI succeeds, its errors can be
         | excessively dumb, to the point where it can be frustrating to
         | watch.
        
           | civilian wrote:
           | Yeah, and I have a similar experience watching junior devs
           | try to get things working-- their errors can be excessively
           | dumb :D
        
         | insane_dreamer wrote:
         | I have the same experience. When it starts going down the wrong
         | path, it won't switch paths. I have to intervene, put my
         | thinking cap on, and tell the agent to start over from scratch
         | and explore another path (which I usually have to define to get
         | them started in the right direction). In the end, I'm not sure
         | how much time I've actually saved over doing it myself.
        
         | smeeth wrote:
         | I suspect its something to do with the following:
         | 
         | When humans get stuck solving problems they often go out to
         | acquire new information so they can better address the barrier
         | they encountered. This is hard to replicate in a training
         | environment, I bet its hard to let an agent search google
         | without contaminating your training sample.
        
         | Aeolun wrote:
         | I feel like they might eventually arrive at the right solution,
         | but generally, interrupting it before it goes off on a wild
         | tangent saves you quite a bit of time.
        
       | ldjkfkdsjnv wrote:
       | another article on xyz problem with LLMs, which will probably be
       | solved by model advancements in 6/12 months.
        
         | bwfan123 wrote:
         | you left out the other llm-apology: But humans can fail too !
        
       | deadbabe wrote:
       | This is another reason why there's no point in carefully
       | constructing prompts and contexts trying to coax the right
       | solution out of an LLM. The end result becomes more brittle with
       | time.
       | 
       | If you can't zero shot your way to success the LLM simply doesn't
       | have enough training for your problem and you need a human touch
       | or slightly different trigger words. There have been times where
       | I've gotten a solution with such a minimal prompt it practically
       | feels like the LLM read my mind, that's the vibe.
        
         | byyoung3 wrote:
         | i think that this is a bit of an exageration, but i see what
         | you are saying. Anything more than 4-5 re-prompts is
         | diminishing.
        
         | furyofantares wrote:
         | This is half right, I think, and half very wrong. I always tell
         | people if they're arguing with the LLM they're doing it wrong
         | and for sure part of that is there's things they can't do and
         | arguing won't change that. But the other part is it's hard to
         | overstate their sensitivity to their context; when you're
         | arguing about something it can do, you should start over with a
         | better prompt (and, critically, no polluted context from its
         | original attempt.)
        
           | jappgar wrote:
           | Real arguments would be fine. The problem is that the LLM
           | always acquiesces to your argument and then convinces itself
           | that it will never do that again. It then proceeds to do it
           | again.
           | 
           | I really think the insufferable obsequiousness of every LLM
           | is one of the core flaws that make them terrible peer
           | programmers.
           | 
           | So, you're right in a way. There's no sense in arguing with
           | them, but only because they refuse to argue.
        
             | furyofantares wrote:
             | I've noticed I tend to word things in a way that implies
             | the opposite of what I want it to. The obsequiousness is
             | very obnoxious indeed.
             | 
             | But I think the bigger reason arguing doesn't work is they
             | are still fundamentally next-token-predictors. The wrong
             | answer was already something it thought was probable before
             | it polluted its context with it. You arguing can be seen as
             | an attempt to make the wrong answer less probable. But it
             | already strengthened that probability by having already
             | answered incorrectly.
        
       | PaulHoule wrote:
       | This was always my mental model. If you have a process with N
       | steps where your probability of getting a step right is p, your
       | chance of success is p, or 0 as N - [?].
       | 
       | It affects people too. Something I learned halfway through a
       | theoretical physics PhD in the 1990s was that a 50-page paper
       | with a complex calculation almost certainly had a serious mistake
       | in it that you'd find if you went over it line-by-line.
       | 
       | I thought I could counter that by building a set of unit tests
       | and integration tests around the calculation and on one level
       | that worked, but in the end my calculation never got published
       | outside my thesis because our formulation of the problem turned a
       | topological circle into a helix and we had no idea how to compute
       | the associated topological factor.
        
         | bwfan123 wrote:
         | > It affects people too. Something I learned halfway through a
         | theoretical physics PhD in the 1990s was that a 50-page paper
         | with a complex calculation almost certainly had a serious
         | mistake in it that you'd find if you went over it line-by-line.
         | 
         | Interesting, and I used to think that math and sciences were
         | invented by humans to model the world in a manner to avoid
         | errors due to chains of fuzzy thinking. Also, formal languages
         | allowed large buildings to be constructued on strong
         | foundations.
         | 
         | From your anecdote it appears that the calculations in the
         | paper were numerical ? but I suppose a similar argument applies
         | to symbolic calculations.
        
           | PaulHoule wrote:
           | These were symbolic calculations. Mine was a derivation of
           | the Gutzwiller Trace Formula
           | 
           | https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e.
           | ..
           | 
           | using integration over phase space instead of position or
           | momentum space. Most people think you need an orthogonal
           | basis set to do quantum mechanical calculation but it turns
           | that "resolution of unity is all you need", that is, if you
           | integrate |x><x| over all x you get 1. If you believe
           | resolution of unity applies in quantum gravity, then Hawking
           | was wrong about black hole information. In my case we were
           | hoping we could apply the trace formula and make similar
           | derivations to systems with unusual coordinates, such as spin
           | systems.
           | 
           | There are quite a few calculations in physics that involve
           | perturbation theory, for instance, people used to try to
           | calculate the motion of the moon by expanding out thousands
           | of terms that look like (112345/552) sin(32 th-75 ph) and
           | still not getting terribly good results. It turns out classic
           | perturbation theory is pathological around popular cases such
           | as the harmonic oscillator (frequency doesn't vary with
           | amplitude) and celestial mechanics (the frequency to go
           | around the sun, to get closer or further from sun, or to go
           | above or below the plane of the plane of the ecliptic are all
           | the same.) In quantum mechanic these are not pathological,
           | notably perturbation theory works great for an electron going
           | around an atom which is basically the same problem as the
           | Earth going around the Sun.
           | 
           | I have a lot of skepticism about things like
           | 
           | https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome.
           | ..
           | 
           | in high energy physics because frequently they're comparing a
           | difficult experiment to an expansion of thousands of Feynman
           | diagrams and between computational errors and the fact that
           | perturbation theory often doesn't converge very well I don't
           | get excited when they don't agree.
           | 
           | ----
           | 
           | Note that I used numerical calculations for "unit and
           | integration testing", so if I derived an identity I could
           | test that the identity was true for different inputs. As for
           | formal systems, they only go so far. See
           | 
           | https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste.
           | ..
        
             | bwfan123 wrote:
             | thanks !
        
             | kennyadam wrote:
             | Exactly what I was going to say
        
         | Davidzheng wrote:
         | It kind of reminds me of this vanishing gradient problem in ML
         | early on, where really deep layers won't train b/c you get
         | these gradients dying midway, and the solution was to add these
         | bypass connections (resnets style). I wonder if you can have
         | similar solutions. Ofc I think what happens in general is like
         | control theory, like you should be able to detect going off-
         | course with some probability too and correct [longer horizon
         | you have probability of leaving the safe-zone so you still get
         | the exp decay but in larger field]. Not sure how to connect all
         | these ideas though.
        
         | kridsdale1 wrote:
         | Human health follows this principle too. N is the LifeSpan. The
         | steps taken are cell division. Eventually enough problems
         | accumulate that it fails systemically.
         | 
         | Sexual reproduction is context-clearing and starting over from
         | ROM.
        
           | ge96 wrote:
           | damn you telemorase
        
           | jappgar wrote:
           | This is precisely why there is not and never will be a
           | fountain of youth.
           | 
           | Sure, you could be cloned, but that wouldn't be you. The
           | process of accumulating memories is also the process of aging
           | with death being an inevitability.
           | 
           | Software is sort of like this too, hence rewrites.
        
       | prmph wrote:
       | The amusing things LLMs do when they have been at a problem for
       | some time and cannot fix it:
       | 
       | - Removing problematic tests altogether
       | 
       | - Making up libs
       | 
       | - Providing a stub and asking you to fill in the code
        
         | esafak wrote:
         | If humans can say "The proof is left as an exercise for the
         | reader", why can't LLMs :)
        
           | kridsdale1 wrote:
           | They're just looking out for us to preserve our mental
           | sharpness as we delegate too much to them.
        
           | bogdan wrote:
           | If I'm paying someone to solve a problem for me and they tell
           | me part of it is left for me to sort out then I'm gonna be
           | pretty unhappy with how I spent my money.
        
         | nico wrote:
         | - Adding try/catch block to "remove" the error and let
         | execution continue
        
         | Wowfunhappy wrote:
         | > Providing a stub and asking you to fill in the code
         | 
         | This is a perennial issue in chatbot-style apps, but I've never
         | had it happen in Claude Code.
        
       | __MatrixMan__ wrote:
       | I don't think this has anything to do with AI. There's a half
       | life for success rates.
        
         | kridsdale1 wrote:
         | It's applicable here since people are experimenting with Agent
         | Pipelines and so the existing literature on that kinda of
         | systems engineering and robustness is useful to people who may
         | not have needed to learn about it before.
        
       | einrealist wrote:
       | So as the space for possible decisions increases, it increases
       | the likelihood of models to end up with bad "decisions". And what
       | is the correlation between the increase in "survival rate" and
       | the increase in model parameters, compute power and memory
       | (context)?
        
         | kridsdale1 wrote:
         | Nonscientific: if roughly feels like as models get bigger they
         | absorb more "wisdom" and that lowers the error-generation
         | probability.
        
       | DrNosferatu wrote:
       | Speaking of the Kwa et al. paper, is there a site that updates
       | the results as new LLMs come out?
        
       | kosh2 wrote:
       | As long as LLMs have no true memory, this is expected. Think
       | about the movie Memento. That is the experience for an LLM.
       | 
       | What could any human do with a context window of 10 minutes and
       | no other memory? You could write yourself notes... but you might
       | not see them because soon you won't know they are there. So maybe
       | tattoo them on your body...
       | 
       | You could likely do a lot of things. Just follow a recipe and
       | cook. Drive to work. But could you drive to the hardware store
       | and get some stuff you need to build that ikea furniture? Might
       | be too much context.
       | 
       | I think solving memory is solving agi.
        
       | jongjong wrote:
       | I suspect it's because the average code online is flawed. The
       | flaws are trained into LLMs and this manifests as an error rate
       | which compounds.
       | 
       | I saw some results showing that LLMs struggle to complete tasks
       | which would take longer than a day. I wonder if the average
       | developer, individually, would be much better if they had to
       | write the software on their own.
       | 
       | The average dev today is very specialized and their code is
       | optimized for job security, not for correctness and not for
       | producing succinct code which maps directly to functionality.
        
       | adverbly wrote:
       | Interesting.
       | 
       | So if you project outwards a while, you hit around 10000 hours
       | about 6 years from now.
       | 
       | Is that a reasonable timeline for ASI?
       | 
       | It's got more of a rationale behind it than other methods
       | perhaps?
        
       ___________________________________________________________________
       (page generated 2025-06-18 23:00 UTC)