[HN Gopher] Is there a half-life for the success rates of AI age...
___________________________________________________________________
Is there a half-life for the success rates of AI agents?
Author : EvgeniyZh
Score : 190 points
Date : 2025-06-18 10:53 UTC (12 hours ago)
(HTM) web link (www.tobyord.com)
(TXT) w3m dump (www.tobyord.com)
| mikeocool wrote:
| This very much aligns with my experience -- I had a case
| yesterday where opus was trying to do something with a library,
| and it encountered a build error. Rather than fix the error, it
| decided to switch to another library. It then encountered another
| error and decided to switch back to the first library.
|
| I don't think I've encountered a case where I've just let the LLM
| churn for more than a few minutes and gotten a good result. If it
| doesn't solve an issue on the first or second pass, it seems to
| rapidly start making things up, make totally unrelated changes
| claiming they'll fix the issue, or trying the same thing over and
| over.
| the__alchemist wrote:
| This is consistent with my experience as well.
| fcatalan wrote:
| I brought over the source of the Dear imgui library to a toy
| project and Cline/Gemini2.5 hallucinated the interface and when
| the compilation failed started editing the library to conform
| with it. I was all like: Nono no no no stop.
| BoiledCabbage wrote:
| Oh man that's good - next step create a PR to push it
| upstream! Everyone can benefit from its fixes.
| qazxcvbnmlp wrote:
| When this happens I do the following:
|
| 1) Switch to a more expensive LLM and ask it to debug: add
| debugging statements, reason about what's going on, try small
| tasks, etc.
| 2) Find the issue.
| 3) Ask it to summarize what was wrong and what to do
| differently next time.
| 4) Copy and paste that recommendation into a small text
| document.
| 5) Revert to the original state and ask the LLM to make the
| change with the recommendation as context.
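|
| A rough sketch of that loop in Python, for illustration only;
| ask() here is a hypothetical stand-in for whatever chat API or
| tool you use, not a real library call:
|
|     # Hypothetical helper: send a prompt to the named model
|     # and return the assistant's reply as a string.
|     def ask(model: str, prompt: str) -> str: ...
|
|     def recover(task: str, failed_attempt: str) -> str:
|         # 1) Let a stronger model debug: add logging, reason,
|         #    try small steps.
|         debug_notes = ask("expensive-model",
|             f"Debug this step by step:\n{failed_attempt}")
|         # 2-3) Have it summarize what was wrong and what to
|         #      do differently next time.
|         lesson = ask("expensive-model",
|             "Summarize what was wrong and what to do "
|             f"differently next time:\n{debug_notes}")
|         # 4) Keep the lesson around (e.g. a small text file).
|         # 5) Revert the code, then retry with the lesson as
|         #    context.
|         return ask("original-model",
|             f"{task}\n\nLesson from a previous attempt:\n{lesson}")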
| rurp wrote:
| This honestly sounds slower than just doing it myself, and
| with more potential for bugs or non-standard code.
|
| I've had the same experience as parent where LLMs are great
| for simple tasks but still fall down surprisingly quickly on
| anything complex and sometimes make simple problems complex.
| Just a few days ago I asked Claude how to do something with a
| library and rather than give me the simple answer it
| suggested I rewrite a large chunk of that library instead, in
| a way that I highly doubt was bug-free. Fortunately I figured
| there would be a much simpler answer but mistakes like that
| could easily slip through.
| ziml77 wrote:
| Yeah if it gets stuck and can't easily get itself unstuck,
| that's when I step in to do the work for it. Otherwise it
| will continue to make more and more of a mess as it
| iterates on its own code.
| nico wrote:
| > 1) switch to a more expensive llm and ask it to debug
|
| You might not even need to switch
|
| A lot of times, just asking the model to debug an issue,
| instead of fixing it, helps to get the model unstuck (and
| also helps provide better context)
| enraged_camel wrote:
| I've actually thought about this extensively, and experimented
| with various approaches. What I found is that the quality of
| results I get, and whether the AI gets stuck in the type of
| loop you describe, depends on two things: how detailed and
| thorough I am with what I tell it to do, and how robust the
| guard rails I put around it are.
|
| To get the best results, I make sure to give detailed specs of
| both the current situation (background context, what I've tried
| so far, etc.) and also what criteria the solution needs to
| satisfy. So long as I do that, there's a high chance that the
| answer is at least satisfying if not a perfect solution. If I
| don't, the AI takes a lot of liberties (such as switching to
| completely different approaches, or rewriting entire modules,
| etc.) to try to reach what _it_ thinks is the solution.
| prmph wrote:
| But don't they keep forgetting the instructions after enough
| time has passed? How do you get around that? Do you add an
| instruction that after every action it should go back and
| read the instructions again?
| enraged_camel wrote:
| They do start "drifting" after a while, at which point I
| export the chat (using Cursor), then start a new chat and
| add the exported file and say "here's the previous
| conversation, let's continue where we left off". I find
| that it deals with the transition pretty well.
|
| It's not often that I have to do this. As I mentioned in my
| post above, if I start the interaction with thorough
| instructions/specs, then the conversation concludes before
| the drift starts to happen.
| skerit wrote:
| > I don't think I've encountered a case where I've just let the
| LLM churn for more than a few minutes and gotten a good result.
|
| Is this with something like Aider or Cline?
|
| I've been using Claude-Code (with a Max plan, so I don't have
| to worry about it wasting tokens), and I've had it successfully
| handle tasks that take over an hour. But getting there isn't
| super easy, that's true. The instructions/CLAUDE.md file need
| to be perfect.
| nico wrote:
| > I've had it successfully handle tasks that take over an
| hour
|
| What kind of tasks take over an hour?
| aprilthird2021 wrote:
| You have to give us more about your example of a task that
| takes over an hour with very detailed instruction. That's
| very intriguing
| onlyrealcuzzo wrote:
| > If it doesn't solve an issue on the first or second pass, it
| seems to rapidly start making things up, make totally unrelated
| changes claiming they'll fix the issue, or trying the same
| thing over and over.
|
| Sounds like a lot of employees I know.
|
| Changing out the entire library is quite amusing, though.
|
| Just imagine: I couldn't fix this build error, so I migrated
| our entire database from Postgres to MongoDB...
| butterknife wrote:
| Thanks for the laughs.
| mathattack wrote:
| It may be doing the wrong thing like an employee, but at
| least it's doing it automatically and faster. :)
| didgeoridoo wrote:
| Probably had "MongoDB is web scale" in the training set.
| KronisLV wrote:
| That is amusing, but I remember HikariCP in an older Java
| project having issues with DB connections against an Oracle
| instance. Thing is, no settings or debugging could really
| (easily) narrow it down, whereas switching to DBCP2 both fixed
| whatever the stability issue was in that particular pairing
| and gave us nice abandoned-connection tracking. Definitely the
| quick and dirty solution that still had good results.
| Workaccount2 wrote:
| They poison their own context. Maybe you can call it context
| rot: as the context grows, and especially if it grows with
| lots of distractions and dead ends, the output quality falls
| off rapidly. Even with good context, the rot starts to become
| apparent around 100k tokens (with Gemini 2.5).
|
| They really need to figure out a way to delete or "forget"
| prior context, so the user or even the model can go back and
| prune poisonous tokens.
|
| Right now I work around it by regularly making summaries of
| instances, and then spinning up a new instance with fresh
| context and feed in the summary of the previous instance.
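|
| A minimal sketch of that workaround in Python, assuming
| hypothetical ask(model, prompt) and count_tokens(text) helpers
| standing in for whatever chat API and tokenizer you actually
| use:
|
|     # ask() and count_tokens() are hypothetical stand-ins for
|     # your chat API and tokenizer.
|     TOKEN_BUDGET = 100_000  # roughly where quality fell off
|
|     def step(history: list[str], user_msg: str) -> list[str]:
|         history = history + [f"User: {user_msg}"]
|         if count_tokens("\n".join(history)) > TOKEN_BUDGET:
|             # Summarize the old instance: keep decisions and
|             # current state, drop distractions and dead ends.
|             summary = ask("model",
|                 "Summarize this session for a fresh start; "
|                 "keep decisions and current state, drop dead "
|                 "ends:\n" + "\n".join(history))
|             # Spin up a fresh instance seeded with the summary.
|             history = [f"Summary of previous session:\n{summary}",
|                        f"User: {user_msg}"]
|         reply = ask("model", "\n".join(history))
|         return history + [f"Assistant: {reply}"]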
| codeflo wrote:
| I wonder to what extent this might be a case where the base
| model (the pure token prediction model without RLHF) is
| "taking over". This is a bit tongue-in-cheek, but if you see
| a chat protocol where an assistant makes 15 random wrong
| suggestions, the most likely continuation has to be yet
| another wrong suggestion.
|
| People have also been reporting that ChatGPT's new "memory"
| feature is poisoning their context. But context is also
| useful. I think AI companies will have to put a lot of
| engineering effort into keeping those LLMs on the happy path
| even with larger and larger contexts.
| potatolicious wrote:
| I think this is at least somewhat true anecdotally. We do
| know that as context length increases, adherence to the
| system prompt decreases. Whether that de-adherence is
| reversion to the base model or not I'm not really qualified
| to say, but it certainly _feels_ that way from observing
| the outputs.
|
| Pure speculation on my part but it feels like this may be a
| major component of the recent stories of people being
| driven mad by ChatGPT - they have extremely long
| conversations with the chatbot where the outputs start
| seeming more like the "spicy autocomplete" fever dream
| creative writing of pre-RLHF models, which feeds and
| reinforces the user's delusions.
|
| Many journalists have complained that they can't seem to
| replicate this kind of behavior in their own attempts, but
| maybe they just need a sufficiently long context window?
| kossae wrote:
| This is my experience as well, and for now comes down to a
| workflow optimization. As I feel the LLM getting off track, I
| start a brand new session with useful previous context pasted
| in from my previous session. This seems to help steer it back
| to a decent solution, but agreed it would be nice if this was
| more automated based off of user/automated feedback (broken
| unit test, "this doesn't work", etc.)
| kazinator wrote:
| "Human Attention to the Right Subset of the Prior Context
| is All You Need"
| HeWhoLurksLate wrote:
| I've found issues like this happen extremely quickly with
| ChatGPT's image generation features - if I tell it to put a
| particular logo in, the first iteration looks okay, while
| anything after that starts to look more and more cursed /
| mutant.
| rvnx wrote:
| I've noticed something: even if you ask it to edit a specific
| picture, it will still use the other pictures in the context
| (and this is somewhat unwanted).
| vunderba wrote:
| gpt-image-1 is unfortunately particularly vulnerable to this
| problem. The more you want to change the initial image, the
| better off you are just starting an entirely new
| conversation.
| nojs wrote:
| https://www.astralcodexten.com/p/the-claude-bliss-attractor
| OtherShrezzing wrote:
| >They really need to figure out a way to delete or "forget"
| prior context, so the user or even the model can go back and
| prune poisonous tokens.
|
| This is possible in tools like LM Studio when running LLMs
| locally. It's a choice by the implementer to grant this
| ability to end users. You pass the entire context to the
| model in each turn of the conversation, so there's no
| technical reason stopping this feature from existing, besides
| maybe some caching-related cost benefits for the inference
| vendor.
| steveklabnik wrote:
| > They really need to figure out a way to delete or "forget"
| prior context, so the user or even the model can go back and
| prune poisonous tokens.
|
| In Claude Code you can use /clear to clear context, or
| /compact <optional message> to compact it down, with the
| message guiding what stays and what goes. It's helpful.
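|
| For example, a compaction message might look like this (the
| message itself is just an illustration):
|
|     /compact keep the failing test output and the refactor
|     plan; drop the dead-end exploration of the other library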
| libraryofbabel wrote:
| Also in Claude Code you can just press <esc> a bunch of
| times and you can backtrack to an earlier point in the
| history before the context was poisoned, and re-start from
| there.
|
| Claude has some amazing features like this that aren't very
| well documented. Yesterday I just learned it writes sessions
| to disk and you can resume them where you left off with
| --continue or --resume if you accidentally close it or
| something.
| drewnick wrote:
| Thank you! This just saved me after closing laptop and
| losing a chat in VS Code. Cool feature, and always a place
| where Claude Code UX was behind chat: being able to see
| history. "/continue" saved me ~15 minutes of re-establishing
| the planning for a new feature.
|
| Also loving the shift + tab (twice) to enter plan mode.
| Just adding here in case it helps anyone else.
| Aeolun wrote:
| Claude Code should really document this stuff in some kind
| of tutorial. There's too much about it that I need to learn
| from random messages on the internet.
| steveklabnik wrote:
| > Claude has some amazing features like this that aren't
| very well documented.
|
| Yeah, it seems like they stealth ship a lot. Which is
| cool, but can sometimes lead to a future that's unevenly
| distributed, if you catch my drift.
| autobodie wrote:
| If so, that certainly fits with my experiences.
| darepublic wrote:
| There are just certain problems that they cannot solve.
| Usually when there is no clear example in its pretraining or
| discoverable on the net. I would say the reasoning
| capabilities of these models are pretty shallow, at least it
| seems that way to me
| dingnuts wrote:
| They can't reason at all. The language specification for
| Tcl 9 is in the training data of the SOTA models but there
| exist almost no examples, only documentation. Go ahead, try
| to get a model to write Tcl 9 instead of 8.5 code and see
| for yourself. They can't do it, at all. They write 8.5
| exclusively, because they only copy. They don't reason.
| "reasoning" in LLMs is pure marketing.
| eplatzek wrote:
| Honestly, that feels like a human.
|
| After hitting my head against a wall with a problem, I need
| to stop.
|
| I need to stop and clear my context. Go for a walk. Talk with
| friends. Switch to another task.
| no_wizard wrote:
| I feel like as an end user I'd like to be able to do more to
| shape the LLM behavior. For example, I'd like to flag the
| dead end paths so they're properly dropped out of context and
| not explored again, unless I as a user clear the flag(s).
|
| I know there is work being done on LLM "memory" for lack of a
| better term but I have yet to see models get _more
| responsive_ over time with this kind of feedback. I know I
| can flag it but right now it doesn't help my "running"
| context that would be unique to me.
|
| I have a similar thought about LLM "membranes", which combine
| the learning from multiple users to become more useful. I am
| keeping a keen eye on that, as I think it will make them more
| useful on an organizational level.
| iLoveOncall wrote:
| > They really need to figure out a way to delete or "forget"
| prior context
|
| This is already pretty much figured out:
| https://www.promptingguide.ai/techniques/react
|
| We use it at work and we never encounter this kind of issue.
| dr_dshiv wrote:
| Context rot! Love it. One bad apple spoils the barrel.
|
| I try to keep hygiene with prompts; if I get anything bad in
| the result, I try to edit my prompts to get it better rather
| than correcting in conversation.
| peacebeard wrote:
| Very common to see in comments some people saying "it can't
| do that" and others saying "here is how I make it work."
| Maybe there is a knack to it, sure, but I'm inclined to say
| that differences in the problems people are trying to use it
| on may explain a lot of the disagreement as well. People are
| usually not very specific about what they were trying to do.
| The same goes for a lot of programming discussion, of course.
| heyitsguay wrote:
| I've noticed this a lot, too, in HN LLM discourse.
|
| (Context: Working in applied AI R&D for 10 years, daily user
| of Claude for boilerplate coding stuff and as an HTML coding
| assistant)
|
| Lots of "with some tweaks i got it to work" or "we're using
| an agent at my company", rarely details about what's working
| or why, or what these production-grade agents are doing.
| alganet wrote:
| > People are not usually being too specific about what they
| were trying to do. The same goes for a lot of programming
| discussion of course.
|
| In programming, I already have a very good tool to follow
| specific steps: _the programming language_. It is designed to
| run algorithms. If I need to be specific, that's the tool to
| use. It does exactly what I ask it to do. When it fails, it's
| my fault.
|
| Some humans require algorithmic-like instructions too. Like
| cooking a recipe. However, those instructions can be very
| vague and a lot of humans can still follow them.
|
| LLMs stand on this weird place where we don't have a clue in
| which occasions we can be vague or not. Sometimes you can be
| vague, sometimes you can't. Sometimes high level steps are
| enough, sometimes you need fine-grained instructions. It's
| basically trial and error.
|
| Can you really blame someone for not being specific enough in
| a system that only provides you with a text box that offers
| anthropomorphic conversation? I'd say no, you can't.
|
| If you want to talk about how specific you need to prompt an
| LLM, there must be a well-defined threshold. The other option
| is "whatever you can expect from a human".
|
| Most discussions seem to juggle between those two. LLMs are
| praised when they accept vague instructions, but the user is
| blamed when they fail. Very convenient.
| peacebeard wrote:
| I am not saying that people were not specific in their
| instructions to the LLM, but rather that in the discussion
| they are not sharing specific details of their success
| stories or failures. We are left seeing lots of people
| saying "it worked for me" and "it didn't work for me"
| without enough information to assess what was different in
| those cases. What I'm contending is that the essential
| differences in the challenges they are facing may be a
| primary factor, while these discussions tend to focus on
| the capabilities of the LLM and the user.
| alganet wrote:
| > they are not sharing specific details of their success
| stories or failures
|
| Can you blame them for that?
|
| For other products, do you think people contact customer
| support with an abundance of information?
|
| Now, consider what these LLM products promise to deliver.
| Text box, answer. Is there any indication that different
| challenges might yield difference in the quality of
| outcome? Nope. Magic genie interface, it either works or
| it doesn't.
| Aeolun wrote:
| I ask it to build a 3D voxel engine in Rust, and it just goes
| off and does it. Same for a vox file parser.
|
| Sure, it takes some creative prompting, and a lot of turns to
| get it to settle on the proper coordinate system for the
| whole thing, but it goes ahead and does it.
|
| This took me two days so far. Unfortunately, the scope of the
| thing is now so large that the quality rapidly starts to
| degrade.
| accrual wrote:
| I've had some similar experiences. While I find agents very
| useful and able to complete many tasks on their own, they do
| hit roadblocks sometimes and their chosen solutions can be
| unusual/silly.
|
| For example, the other day I was converting models but was
| running out of disk space. The agent decided to change the
| quantization to save space when I'd prefer it ask "hey, I need
| some more disk space". I just paused it, cleared some space,
| then asked the agent to try the original command again.
| vadansky wrote:
| I had a particularly hard parsing problem so I set up a bunch of
| tests and let the LLM churn for a while and did something else.
|
| When I came back all the tests were passing!
|
| But as I ran it live a lot of cases were still failing.
|
| Turns out the LLM hardcoded the test values as "if ('test
| value') return 'correct value';"!
| bluefirebrand wrote:
| This is the most accurate Junior Engineer behavior I've heard
| LLMs doing yet
| ffsm8 wrote:
| Missed opportunity for the LLM, could've just switched to
| Volkswagen CI
|
| https://github.com/auchenberg/volkswagen
| EGreg wrote:
| This is gold lol
| artursapek wrote:
| lmfao
| vunderba wrote:
| I've definitely seen this happen before too. Test-driven
| development isn't all that effective if the LLM's only stated
| goal is to pass the tests without thinking about the problem
| in a more holistic/contextual manner.
| matsemann wrote:
| Reminds me of trying to train a small neural net to play
| Robocode ~10+ years ago. Tried to "punish" it for hitting
| walls, so next morning I had evolved a tank that just stood
| still... Then punished it for standing still, and ended up
| with a tank just vibrating, alternating moving back and forth
| quickly, etc.
| vunderba wrote:
| That's great. There's a pretty funny example of somebody
| training a neural net to play Tetris on the Nintendo
| Entertainment System, and it quickly learned that if it
| was about to lose to just hit pause and leave the game in
| that state indefinitely.
| amlib wrote:
| I guess it came to the same conclusion as the computer in
| War Games, "The only way to win is not to play"
| mikeocool wrote:
| Yeah -- I had something like this happen as well -- the llm
| wrote a half decent implementation and some good tests, but
| then ran into issues getting the tests to pass.
|
| It then deleted the entire implementation and made the
| function raise a "not implemented" exception, updated the
| tests to expect that, and told me this was a solid base for
| the next developer to start working on.
| insane_dreamer wrote:
| While I haven't run into this egregious of an offense, I have
| had LLMs either "fix" the unit test to pass with buggy code,
| or, conversely, "fix" the code so that the test passes but
| now the code does something different than it should (because
| the unit test was wrong to start with).
| nico wrote:
| > Rather than fix the error, it decided to switch to another
| library
|
| I've had a similar experience, where instead of trying to fix
| the error, it added a try/catch around it with a log message,
| just so execution could continue
| mtalantikite wrote:
| I had Claude Code deep inside a change it was trying to make,
| struggling with a test that kept failing, and it then decided
| to delete the test to make the test suite pass. We've all
| been there!
|
| I generally treat all my sessions with it as a pairing session,
| and like in any pairing session, sometimes we have to stop
| going down whatever failing path we're on, step all the way
| back to the beginning, and start again.
| nojs wrote:
| > decided to delete the test to make the test suite pass
|
| At least that's easy to catch. It's often more insidious like
| "if len(custom_objects) > 10:" or "if object_name == 'abc'"
| buried deep in the function, for the sole purpose of making
| one stubborn test pass.
| akomtu wrote:
| Claude Doctor will hopefully do better.
| Aeolun wrote:
| Hah, I got "All this async stuff looks really hard, let's
| just replace it with some synchronous calls".
|
| "Claude, this is a web server!"
|
| "My apologies... etc."
| dylan604 wrote:
| > it encountered a build error.
|
| does this mean that even AI gets stuck in dependency hell?
| hhh wrote:
| of course, i've even had them actively say they're giving up
| after 30 turns
| reactordev wrote:
| This happens in real life too when a dev builds something using
| too much copy pasta and encounters a build error. Stackoverflow
| was founded on this.
| EnPissant wrote:
| Were you using Claude Code or something else? I've had very
| good luck with Claude Code not doing what you described.
| Wowfunhappy wrote:
| > I don't think I've encountered a case where I've just let the
| LLM churn for more than a few minutes and gotten a good result.
|
| I absolutely have, for what it's worth. Particularly when the
| LLM has some sort of test to validate against, such as a test
| suite or simply fixing compilation errors until a project
| builds successfully. It will just keep chugging away until it
| gets it, often with good overall results in the end.
|
| I'll add that until the AI succeeds, its errors can be
| excessively dumb, to the point where it can be frustrating to
| watch.
| civilian wrote:
| Yeah, and I have a similar experience watching junior devs
| try to get things working-- their errors can be excessively
| dumb :D
| insane_dreamer wrote:
| I have the same experience. When it starts going down the wrong
| path, it won't switch paths. I have to intervene, put my
| thinking cap on, and tell the agent to start over from scratch
| and explore another path (which I usually have to define to get
| them started in the right direction). In the end, I'm not sure
| how much time I've actually saved over doing it myself.
| smeeth wrote:
| I suspect it's something to do with the following:
|
| When humans get stuck solving problems they often go out to
| acquire new information so they can better address the barrier
| they encountered. This is hard to replicate in a training
| environment; I bet it's hard to let an agent search Google
| without contaminating your training sample.
| Aeolun wrote:
| I feel like they might eventually arrive at the right solution,
| but generally, interrupting it before it goes off on a wild
| tangent saves you quite a bit of time.
| ldjkfkdsjnv wrote:
| another article on xyz problem with LLMs, which will probably be
| solved by model advancements in 6/12 months.
| bwfan123 wrote:
| you left out the other llm-apology: But humans can fail too !
| deadbabe wrote:
| This is another reason why there's no point in carefully
| constructing prompts and contexts trying to coax the right
| solution out of an LLM. The end result becomes more brittle with
| time.
|
| If you can't zero shot your way to success the LLM simply doesn't
| have enough training for your problem and you need a human touch
| or slightly different trigger words. There have been times where
| I've gotten a solution with such a minimal prompt it practically
| feels like the LLM read my mind, that's the vibe.
| byyoung3 wrote:
| I think this is a bit of an exaggeration, but I see what you
| are saying. Anything more than 4-5 re-prompts gives
| diminishing returns.
| furyofantares wrote:
| This is half right, I think, and half very wrong. I always tell
| people if they're arguing with the LLM they're doing it wrong
| and for sure part of that is there's things they can't do and
| arguing won't change that. But the other part is it's hard to
| overstate their sensitivity to their context; when you're
| arguing about something it can do, you should start over with a
| better prompt (and, critically, no polluted context from its
| original attempt.)
| jappgar wrote:
| Real arguments would be fine. The problem is that the LLM
| always acquiesces to your argument and then convinces itself
| that it will never do that again. It then proceeds to do it
| again.
|
| I really think the insufferable obsequiousness of every LLM
| is one of the core flaws that make them terrible peer
| programmers.
|
| So, you're right in a way. There's no sense in arguing with
| them, but only because they refuse to argue.
| furyofantares wrote:
| I've noticed I tend to word things in a way that implies
| the opposite of what I want it to. The obsequiousness is
| very obnoxious indeed.
|
| But I think the bigger reason arguing doesn't work is they
| are still fundamentally next-token-predictors. The wrong
| answer was already something it thought was probable before
| it polluted its context with it. You arguing can be seen as
| an attempt to make the wrong answer less probable. But it
| already strengthened that probability by having already
| answered incorrectly.
| PaulHoule wrote:
| This was always my mental model. If you have a process with N
| steps where your probability of getting a step right is p,
| your chance of success is p^N, which goes to 0 as N → ∞.
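|
| Spelled out (a small worked example, assuming a constant
| per-step success rate p, which is of course an idealization):
|
|     P_{\text{success}} = p^{N}, \qquad
|     N_{1/2} = \frac{\ln 2}{-\ln p} \approx \frac{0.693}{1 - p}
|
| so with p = 0.99 per step the half-life is about 69 steps,
| and a 500-step task succeeds only about 0.7% of the time.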
|
| It affects people too. Something I learned halfway through a
| theoretical physics PhD in the 1990s was that a 50-page paper
| with a complex calculation almost certainly had a serious mistake
| in it that you'd find if you went over it line-by-line.
|
| I thought I could counter that by building a set of unit tests
| and integration tests around the calculation and on one level
| that worked, but in the end my calculation never got published
| outside my thesis because our formulation of the problem turned a
| topological circle into a helix and we had no idea how to compute
| the associated topological factor.
| bwfan123 wrote:
| > It affects people too. Something I learned halfway through a
| theoretical physics PhD in the 1990s was that a 50-page paper
| with a complex calculation almost certainly had a serious
| mistake in it that you'd find if you went over it line-by-line.
|
| Interesting, and I used to think that math and the sciences
| were invented by humans to model the world in a manner that
| avoids errors due to chains of fuzzy thinking. Also, formal
| languages allowed large buildings to be constructed on strong
| foundations.
|
| From your anecdote, it appears that the calculations in the
| paper were numerical? But I suppose a similar argument applies
| to symbolic calculations.
| PaulHoule wrote:
| These were symbolic calculations. Mine was a derivation of
| the Gutzwiller Trace Formula
|
| https://inspirehep.net/files/20b84db59eace6a7f90fc38516f530e.
| ..
|
| using integration over phase space instead of position or
| momentum space. Most people think you need an orthogonal
| basis set to do a quantum mechanical calculation, but it turns
| out that "resolution of unity is all you need", that is, if you
| integrate |x><x| over all x you get 1. If you believe
| resolution of unity applies in quantum gravity, then Hawking
| was wrong about black hole information. In my case we were
| hoping we could apply the trace formula and make similar
| derivations to systems with unusual coordinates, such as spin
| systems.
|
| There are quite a few calculations in physics that involve
| perturbation theory, for instance, people used to try to
| calculate the motion of the moon by expanding out thousands
| of terms that look like (112345/552) sin(32 th-75 ph) and
| still not getting terribly good results. It turns out classic
| perturbation theory is pathological around popular cases such
| as the harmonic oscillator (frequency doesn't vary with
| amplitude) and celestial mechanics (the frequency to go
| around the sun, to get closer to or further from the sun, or
| to go above or below the plane of the ecliptic are all the
| same). In quantum mechanics these are not pathological;
| notably perturbation theory works great for an electron going
| around an atom which is basically the same problem as the
| Earth going around the Sun.
|
| I have a lot of skepticism about things like
|
| https://en.wikipedia.org/wiki/Anomalous_magnetic_dipole_mome.
| ..
|
| in high energy physics because frequently they're comparing a
| difficult experiment to an expansion of thousands of Feynman
| diagrams, and between computational errors and the fact that
| perturbation theory often doesn't converge very well, I don't
| get excited when they don't agree.
|
| ----
|
| Note that I used numerical calculations for "unit and
| integration testing", so if I derived an identity I could
| test that the identity was true for different inputs. As for
| formal systems, they only go so far. See
|
| https://en.wikipedia.org/wiki/Principia_Mathematica#Consiste.
| ..
| bwfan123 wrote:
| thanks !
| kennyadam wrote:
| Exactly what I was going to say
| Davidzheng wrote:
| It kind of reminds me of the vanishing gradient problem in ML
| early on, where really deep networks wouldn't train because
| the gradients died midway, and the solution was to add bypass
| connections (ResNet style). I wonder if you can have similar
| solutions here. Of course, I think what happens in general is
| like control theory: you should be able to detect going off
| course with some probability and correct [over a longer
| horizon you have some probability of leaving the safe zone, so
| you still get the exponential decay, but in a larger field].
| Not sure how to connect all these ideas though.
| kridsdale1 wrote:
| Human health follows this principle too. N is the LifeSpan. The
| steps taken are cell division. Eventually enough problems
| accumulate that it fails systemically.
|
| Sexual reproduction is context-clearing and starting over from
| ROM.
| ge96 wrote:
| damn you telomerase
| jappgar wrote:
| This is precisely why there is not and never will be a
| fountain of youth.
|
| Sure, you could be cloned, but that wouldn't be you. The
| process of accumulating memories is also the process of aging
| with death being an inevitability.
|
| Software is sort of like this too, hence rewrites.
| prmph wrote:
| The amusing things LLMs do when they have been at a problem for
| some time and cannot fix it:
|
| - Removing problematic tests altogether
|
| - Making up libs
|
| - Providing a stub and asking you to fill in the code
| esafak wrote:
| If humans can say "The proof is left as an exercise for the
| reader", why can't LLMs :)
| kridsdale1 wrote:
| They're just looking out for us to preserve our mental
| sharpness as we delegate too much to them.
| bogdan wrote:
| If I'm paying someone to solve a problem for me and they tell
| me part of it is left for me to sort out then I'm gonna be
| pretty unhappy with how I spent my money.
| nico wrote:
| - Adding try/catch block to "remove" the error and let
| execution continue
| Wowfunhappy wrote:
| > Providing a stub and asking you to fill in the code
|
| This is a perennial issue in chatbot-style apps, but I've never
| had it happen in Claude Code.
| __MatrixMan__ wrote:
| I don't think this has anything to do with AI. There's a half
| life for success rates.
| kridsdale1 wrote:
| It's applicable here since people are experimenting with
| agent pipelines, and so the existing literature on that kind
| of systems engineering and robustness is useful to people who
| may not have needed to learn about it before.
| einrealist wrote:
| So as the space of possible decisions increases, so does the
| likelihood of models ending up with bad "decisions". And what
| is the correlation between the increase in "survival rate"
| and the increase in model parameters, compute power and
| memory (context)?
| kridsdale1 wrote:
| Nonscientific: it roughly feels like as models get bigger
| they absorb more "wisdom", and that lowers the
| error-generation probability.
| DrNosferatu wrote:
| Speaking of the Kwa et al. paper, is there a site that updates
| the results as new LLMs come out?
| kosh2 wrote:
| As long as LLMs have no true memory, this is expected. Think
| about the movie Memento. That is the experience for an LLM.
|
| What could any human do with a context window of 10 minutes and
| no other memory? You could write yourself notes... but you might
| not see them because soon you won't know they are there. So maybe
| tattoo them on your body...
|
| You could likely do a lot of things. Just follow a recipe and
| cook. Drive to work. But could you drive to the hardware store
| and get some stuff you need to build that ikea furniture? Might
| be too much context.
|
| I think solving memory is solving agi.
| jongjong wrote:
| I suspect it's because the average code online is flawed. The
| flaws are trained into LLMs and this manifests as an error rate
| which compounds.
|
| I saw some results showing that LLMs struggle to complete tasks
| which would take longer than a day. I wonder if the average
| developer, individually, would be much better if they had to
| write the software on their own.
|
| The average dev today is very specialized and their code is
| optimized for job security, not for correctness and not for
| producing succinct code which maps directly to functionality.
| adverbly wrote:
| Interesting.
|
| So if you project the trend outwards, you hit a roughly
| 10,000-hour horizon about 6 years from now.
|
| Is that a reasonable timeline for ASI?
|
| It's got more of a rationale behind it than other methods
| perhaps?
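|
| Back-of-envelope, assuming a current 50%-success horizon on
| the order of an hour (an assumption, not a figure from the
| comment above) and steady exponential growth as in the Kwa et
| al. trend:
|
|     \frac{10{,}000\ \text{h}}{1\ \text{h}} = 10^{4}, \qquad
|     \log_2 10^{4} \approx 13.3\ \text{doublings}, \qquad
|     \frac{72\ \text{months}}{13.3} \approx 5.4\ \text{months/doubling}
|
| i.e. the 6-year figure corresponds to a doubling time of
| roughly five and a half months.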
___________________________________________________________________
(page generated 2025-06-18 23:00 UTC)