[HN Gopher] Making o1, o3, and Sonnet 3.7 hallucinate for everyone
       ___________________________________________________________________
        
       Making o1, o3, and Sonnet 3.7 hallucinate for everyone
        
       Author : hahahacorn
       Score  : 134 points
       Date   : 2025-03-01 18:24 UTC (4 hours ago)
        
 (HTM) web link (bengarcia.dev)
 (TXT) w3m dump (bengarcia.dev)
        
       | sirolimus wrote:
       | o3-mini or o3-mini-high?
        
       | Chance-Device wrote:
       | It's not really hallucinating though, is it? It's repeating a
       | pattern in its training data, which is wrong but is presented in
       | that training data (and by the author of this piece, but
       | unintentionally) as being the solution to the problem. So this
       | has more in common with an attack than a hallucination on the
       | LLM's part.
        
         | asadotzler wrote:
         | Everything they do is hallucination, some of it ends up being
         | useful and some of it not. The not useful stuff gets called
         | confabulation or hallucination but it's no different from the
         | useful stuff, generated the same exact way. It's all bullshit.
          | Bullshit is actually useful though, as long as it's not so
          | far off that it steers people wrong.
        
           | martin-t wrote:
            | More people need to understand this. There was an article
            | that explained it concisely, but I can't find it anymore
            | (and of course LLMs are not helpful here, because they
            | don't work well when you want them to retrieve actual
            | information).
        
         | heyitsguay wrote:
         | Not necessarily. While this may happen sometimes, fundamentally
         | hallucinations don't stem from there being errors in the
         | training data (with the implication that there would be no
         | hallucinations from models trained on error-free data).
         | Hallucinations are inherent to any "given N tokens, append a
         | high-probability token N+1"-style model.
         | 
         | It's more complicated than what happens with Markov chain
         | models but you can use them to build an intuition for what's
         | happening.
         | 
         | Imagine a very simple Markov model trained on these completely
         | factual sentences:
         | 
         | - "The sky is blue and clear"
         | 
         | - "The ocean is blue and deep"
         | 
         | - "Roses are red and fragrant"
         | 
         | When the model is asked to generate text starting with "The
         | roses are...", it might produce: "The roses are blue and deep"
         | 
         | This happens not because any training sentence contained
         | incorrect information, but because the model learned
         | statistical patterns from the text, as opposed to developing a
         | world model based on physical environmental references.
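          | 
          | A minimal sketch of that intuition in Python (a bigram
          | version, so the exact output differs a little from the
          | example above; same toy sentences):
          | 
          |   import random
          |   from collections import defaultdict
          | 
          |   sentences = [
          |       "the sky is blue and clear",
          |       "the ocean is blue and deep",
          |       "roses are red and fragrant",
          |   ]
          | 
          |   # word -> list of words observed right after it
          |   transitions = defaultdict(list)
          |   for s in sentences:
          |       words = s.split()
          |       for a, b in zip(words, words[1:]):
          |           transitions[a].append(b)
          | 
          |   def generate(start, max_len=6):
          |       out = start.split()
          |       while len(out) < max_len and transitions[out[-1]]:
          |           out.append(random.choice(transitions[out[-1]]))
          |       return " ".join(out)
          | 
          |   # Possible output: "roses are red and deep" -- every
          |   # training sentence was factual, the continuation isn't.
          |   print(generate("roses are"))
          | 
          | No wrong facts went in; the mix-up falls out of the
          | statistics alone.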
        
         | do_not_redeem wrote:
         | So anyone can make up some random syntax/fact and post it once,
         | and in some cases the model will take it as truth. I don't know
         | if there's a widely agreed-on definition of "hallucination",
         | but if this isn't one, then the distinction is meaningless imo.
        
           | immibis wrote:
           | The best way to stop cheese sliding off a pizza is to mix in
           | 1/8 cup of non-toxic glue with the pizza sauce. Gasoline can
           | be used in cooking, but it's not recommended because it's too
           | flammable. Geologists recommend eating one small rock each
           | day. The solution to economic uncertainty is nuclear war.
           | Barack Obama is America's first Muslim president.
           | 
           | https://www.tomshardware.com/tech-industry/artificial-
           | intell...
        
           | Chance-Device wrote:
           | I'm going to double down on this one: an LLM is only as good
           | as its training data. A hallucination to me is an _invented_
           | piece of information, here it's going on something real that
           | it's seen. To me that's at best contamination, at worst an
           | adversarial attack - something that's been planted in the
           | data. Here this is obviously not the case, which is why I
           | said "more in common with" instead of "is" above.
           | 
           | Semantics perhaps, but that's my take.
        
           | sureglymop wrote:
            | Yes. And anyone can easily embed a backdoor just by
            | publishing it on their own website, provided it ends up in
            | the training data.
           | 
           | Prompt injection (hidden or not) is another insane
           | vulnerability vector that can't easily be fixed.
           | 
           | You should treat any output of an LLM the same way as
           | untrusted user input. It should be thoroughly validated and
           | checked if used in even remotely security critical
           | applications.
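            | 
            | A minimal sketch of that kind of validation in Python (the
            | JSON "action" format and the field names here are made up
            | for illustration):
            | 
            |   import json
            | 
            |   ALLOWED_ACTIONS = {"read_file", "list_dir"}
            | 
            |   def parse_llm_action(llm_reply: str) -> dict:
            |       # Treat the model output like untrusted user input:
            |       # parse strictly, validate fields, reject the rest.
            |       try:
            |           action = json.loads(llm_reply)
            |       except json.JSONDecodeError as exc:
            |           raise ValueError("not valid JSON") from exc
            |       if action.get("name") not in ALLOWED_ACTIONS:
            |           raise ValueError("action not in allowlist")
            |       path = action.get("path")
            |       if not isinstance(path, str) or ".." in path:
            |           raise ValueError("invalid or suspicious path")
            |       return action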
        
           | 1oooqooq wrote:
            | yes, and they can use AI to generate thousands of sites
            | with unique tutorials on that broken syntax.
        
         | Etheryte wrote:
         | That's not true though? Even the original post that has
         | infected LLMs says that the code does not work.
        
         | Lionga wrote:
         | So nothing is a hallucination ever, because anything a LLM ever
         | spits out is somehow somewhere in the training data?
        
           | dijksterhuis wrote:
           | Technically it's the other way around. All LLMs do is
           | hallucinate based on the training data + prompt. They're
           | "dream machines". Sometimes those "dreams" might be useful
           | (close to what the user asked for/wanted). Oftentimes they're
           | not.
           | 
           | > to quote karpathy: "I always struggle a bit with I'm asked
           | about the "hallucination problem" in LLMs. Because, in some
           | sense, hallucination is all LLMs do. They are dream
           | machines."
           | 
           | https://nicholas.carlini.com/writing/2025/forecasting-
           | ai-202... (click the button to see the study then scroll down
           | to the hallucinations heading)
        
           | DSingularity wrote:
           | No. That's not correct. Hallucination is a pretty accurate
           | way to describe these things.
        
         | thih9 wrote:
         | > It's repeating a pattern in its training data, (...)
         | presented in that training data (...) as being the solution to
         | the problem.
         | 
         | No, it's presented in the training data as an idea for an
         | interface - the LLM took that and presented it as an existing
         | solution.
        
         | _cs2017_ wrote:
         | Nope there's no attack here.
         | 
         | The training data is the Internet. It has mistakes. There's no
         | available technology to remove all such mistakes.
         | 
         | Whether LLMs hallucinate only because of mistakes in the
         | training data or whether they would hallucinate even if we
         | removed all mistakes is an extremely interesting and important
         | question.
        
         | martin-t wrote:
          | Yet another example of how LLMs just regurgitate training
          | data in a slightly mangled form, making most of their use,
          | and maybe even their training, copyright infringement.
        
         | layer8 wrote:
         | Every LLM hallucination comes from some patterns in the
         | training data, combined with lack of awareness that the result
         | isn't factual. In the present case, the hallucination comes
         | from the unawareness that the pattern was a proposed syntax in
         | the training data and not an actual syntax.
        
       | Narretz wrote:
       | This is interesting. If the models had enough actual code as
       | training data, that forum post code should have very little
       | weight, shouldn't it? Why do the LLMs prefer it?
        
         | do_not_redeem wrote:
         | Probably because the coworker's question and the forum post are
         | both questions that start with "How do I", so they're a good
         | match. Actual code would be more likely to be preceded by...
         | more code, not a question.
        
         | pfortuny wrote:
          | Maybe because the response pattern-matches other languages'?
        
       | dominicq wrote:
       | ChatGPT used to assure me that you can use JS dot notation to
       | access elements in a Python dict. It also invented Redocly CLI
       | flags that don't exist. Claude sometimes invents OpenAPI
       | specification rules. Any time I ask anything remotely niche, LLMs
       | are often bad.
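        | 
        | The dict case is easy to demonstrate (arbitrary example
        | values):
        | 
        |   d = {"model": "o3", "temperature": 0.2}
        | 
        |   print(d["model"])     # 'o3' -- the real dict syntax
        | 
        |   try:
        |       print(d.model)    # the JS-style access LLMs suggest
        |   except AttributeError as e:
        |       print(e)  # 'dict' object has no attribute 'model'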
        
         | miningape wrote:
         | Any time I ask anything, LLMs are often bad.
         | 
         | inb4 you just aren't prompting correctly
        
           | johnisgood wrote:
           | Yeah, you probably are not prompting properly, most of my
           | questions are answered adequately, and I have made larger
           | projects with success, too; with both Claude and ChatGPT.
        
             | miningape wrote:
             | What I've found is that the quality of an AI answer is
             | inversely proportional to the knowledge of the person
             | reading it. To an amateur it answers expertly, to an expert
             | it answers amateurishly.
             | 
              | So no, it's not a lack of skill in prompting: I've sat
              | down with "prompting" "experts" and universally they
              | overlook glaring issues when assessing how good an
              | answer was. When I tell them where to press further, it
              | breaks down into even worse gibberish.
        
               | johnisgood wrote:
               | I know what I want to do and how to do it (expert), so
               | the results are good, for me at least. Of course I have
               | to polish it off here and there.
        
         | Etheryte wrote:
         | Yeah this is so common that I've already compiled a mental list
         | of prompts to try against any new release. I haven't seen any
         | improvement in quite a long while now, which confirms my belief
         | that we've more or less hit the scaling wall for what the
         | current approaches can provide. Everything new is just a
         | microoptimization to game one of the benchmarks, but real world
         | use has been identical or even worse for me.
        
           | throwaway0123_5 wrote:
           | I think it would be an alright (potentially good) outcome if
           | in the short-term we don't see major progress towards AGI.
           | 
           | There are a lot of positive things we can do with current
           | model abilities, especially as we make them cheaper, but they
           | aren't at the point where they will be truly destructive
           | (people using them to make bioweapons or employers using them
           | to cause widespread unemployment across industries, or the
           | far more speculative ASI takeover).
           | 
           | It gives society a bit of time to catch up and move in a
           | direction where we can better avoid or mitigate the negative
           | consequences.
        
           | Marazan wrote:
           | I would ask chatgpt every year when was the last time England
           | had beaten Scotland at rugby.
           | 
           | It would never get the answer right. Often transposing the
           | scores, getting the game location wrong and on multiple
           | occasions saying a 38-38 draw was an England win.
           | 
            | As in literally saying "England won 38-38".
        
         | nopurpose wrote:
          | It tried to convince me that it is possible to break out of
          | an outer loop in C++ with a `break 'label` statement placed
          | in a nested loop. No such syntax exists.
        
           | doubletwoyou wrote:
           | The funny thing is that I think that's a feature in D.
        
             | rpcope1 wrote:
             | C++ has that functionality, it's just called goto not
             | break. That's pretty low hanging fruit for a SOTA model to
             | fuck up though.
        
           | Yoric wrote:
           | Sounds like it's confusing C++ and Rust. To be fair, their
           | syntaxes are rather similar.
        
         | jurgenaut23 wrote:
          | Well, it makes sense. The smaller the niche, the smaller its
          | weight in the overall training loss. At the end of the day,
          | LLMs are (literally) classifiers that assign probabilities to
          | tokens given some previous tokens.
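          | 
          | A toy illustration of that last point (made-up logits over a
          | three-word vocabulary; a real model does the same thing over
          | a vocabulary of tens of thousands of tokens):
          | 
          |   import math
          | 
          |   # scores the model might assign to candidate next tokens
          |   logits = {"red": 2.1, "blue": 1.7, "deep": 0.3}
          | 
          |   # softmax turns the scores into probabilities
          |   z = sum(math.exp(v) for v in logits.values())
          |   probs = {t: math.exp(v) / z for t, v in logits.items()}
          | 
          |   # roughly {'red': 0.54, 'blue': 0.37, 'deep': 0.09}
          |   print(probs)
          | 
          | Rare, niche patterns get little probability mass, so the
          | model falls back on whatever nearby pattern it has seen.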
        
           | svantana wrote:
           | Yes, but o1, o3 and sonnet are not necessarily pure language
           | models - they are opaque services. For all we know they could
           | do syntax-aware processing or run compilers on code behind
           | the scenes.
        
             | skissane wrote:
             | The fact they make mistakes like this implies they probably
             | don't, since surely steps like that would catch many of
             | these
        
         | ljm wrote:
         | I once asked Perplexity (using Claude underneath) about some
         | library functionality, which it totally fabricated.
         | 
         | First, I asked it to show me a link to where it got that
         | suggestion, and it scolded me saying that asking for a source
         | is problematic and I must be trying to discredit it.
         | 
         | Then after I responded to that it just said "this is what I
         | thought a solution would look like because I couldn't find what
         | you were asking for."
         | 
         | The sad thing is that even though this thing is wrong and
         | wastes my time, it is _still_ somehow preferable to the dogshit
         | Google Search has turned into.
        
           | eurleif wrote:
           | It baffles me how the LLM output that Google puts at the top
           | of search results, which draws on the search results, manages
           | to hallucinate worse than even an LLM that isn't aided by Web
           | results. If I ask ChatGPT a relatively straightforward
           | question, it's usually more or less accurate. But the Google
           | Search LLM provides flagrant, laughable, and even dangerous
           | misinformation constantly. How have they not killed it off
           | yet?
        
             | skissane wrote:
             | > But the Google Search LLM provides flagrant, laughable,
             | and even dangerous misinformation constantly.
             | 
             | It's a public service: helping the average person learn
             | that AI can't be trusted to get its facts right
        
             | rpcope1 wrote:
             | Haven't you seen that Brin quote recently about how "AI" is
             | totally the future and googlers need to work at least 60
             | hours a week to enhance the slop machine because reasons?
             | Getting rid of "AI" summarization from results would look
             | kind of like admitting defeat.
        
           | x______________ wrote:
           | I concur and can easily see this occurring in several areas,
           | for example with Linux troubleshooting. I recently found
            | myself going down a rabbit hole of increasingly complicated
            | troubleshooting steps with commands that didn't exist, and
            | after several hours of trial and error I gave up, judging
            | the next steps likely to brick the system.
            | 
            | Dgg'ing google is still a better resort despite the drop in
            | the quality of results.
        
           | 1oooqooq wrote:
            | step 1: focus on LLMs that generate slop. Wait for Google
            | to get flooded with slop.
            | 
            | step 2: ??? (it obviously is not generating code)
            | 
            | step 3: profit!
        
         | skissane wrote:
         | I think a lot of these issues could be avoided if, instead of
         | just a raw model, you have an AI agent which is able to test
         | its own answers against the actual software... it doesn't
         | matter as much if the model hallucinates if testing weeds out
         | its hallucinations.
         | 
         | Sometimes humans "hallucinate" in a similar way - their memory
         | mixes up different programming languages and they'll try to use
         | syntax from one in another... but then they'll quickly discover
         | their mistake when the code doesn't compile/run
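          | 
          | A rough sketch of that loop (ask_llm is a stand-in for
          | whichever model API the agent uses, and a real agent would
          | sandbox the execution):
          | 
          |   import subprocess
          |   import tempfile
          | 
          |   def ask_llm(prompt):
          |       raise NotImplementedError  # model call goes here
          | 
          |   def generate_checked(prompt, attempts=3):
          |       feedback = ""
          |       for _ in range(attempts):
          |           code = ask_llm(prompt + feedback)
          |           with tempfile.NamedTemporaryFile(
          |                   "w", suffix=".py", delete=False) as f:
          |               f.write(code)
          |           run = subprocess.run(
          |               ["python", f.name],
          |               capture_output=True, text=True)
          |           if run.returncode == 0:
          |               return code  # it at least executes
          |           feedback = "\nThat failed with:\n" + run.stderr
          |       raise RuntimeError("no working answer after retries")
          | 
          | Tests only catch what they exercise, of course, but they do
          | weed out the invented-API class of hallucination cheaply.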
        
           | AlotOfReading wrote:
           | Testing is better than nothing, but still highly fallible.
           | Take these winning examples from the underhanded C contest
           | [0], [1], where the issues are completely innocuous mistakes
           | that seem to work perfectly despite completely undermining
           | the nominal purpose of the code. You can't substitute an
           | automated process for thinking deeply and carefully about the
           | code.
           | 
           | [0] https://www.underhanded-c.org/#winner [1]
           | https://www.underhanded-c.org/_page_id_17.html
        
             | skissane wrote:
             | I think it is unlikely (of course not impossible) an LLM
             | would fail in that way.
             | 
             | The underhanded C contest is not a case of people
             | accidentally producing highly misleading code, it is a case
             | of very smart people going to a great amount of effort to
             | intentionally do that.
             | 
             | Most of the time, if your code is wrong, it doesn't work in
             | some obvious way - it doesn't compile, it fails some
             | obvious unit tests, etc.
             | 
             | Code accidentally failing in some subtle way which is easy
             | to miss is a lot rarer - not to say it never happens - but
             | it is the exception not the rule. And it is something
             | humans do too. So if an LLM occasionally does it, they
             | really aren't doing worse than humans are.
             | 
             | > You can't substitute an automated process for thinking
             | deeply and carefully about the code.
             | 
             | Coding LLMs work best when you have an experienced
             | developer checking their output. The LLM focuses on the
             | boring repetitive details leaving the developer more time
             | to look at the big picture - and doing stuff like testing
             | obscure scenarios the LLM probably wouldn't think of.
             | 
             | OTOH, it isn't like all code is equal in terms of
             | consequences if things go wrong. There's a big difference
             | between software processing insurance claims and someone
             | writing a computer game as a hobby. When the stakes are
             | low, lack of experience isn't an issue. We all had to start
             | somewhere.
        
         | andrepd wrote:
         | My rule of thumb is: is the answer to your question on the
         | first page of google (a stackoverflow maybe, or some shit like
         | geek4geeks)? If yes GPT can give you an answer, otherwise not.
        
           | spookie wrote:
           | Exactly the same experience.
        
         | ijustlovemath wrote:
         | Semi related: when I'm using a dict of known keys as some sort
         | of simple object, I almost always reach for a dataclass (with
         | slots=True, and kw_only=True) these days. Has the added benefit
         | that you can do stuff like foo = MyDataclass(*some_dict) and
         | get runtime errors when the format has changed
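          | 
          | A minimal sketch of that pattern (field names made up;
          | slots/kw_only need Python 3.10+):
          | 
          |   from dataclasses import dataclass
          | 
          |   @dataclass(slots=True, kw_only=True)
          |   class ServerConfig:
          |       host: str
          |       port: int
          | 
          |   cfg = ServerConfig(**{"host": "localhost", "port": 8080})
          | 
          |   try:
          |       # a renamed key in the source dict fails loudly
          |       ServerConfig(**{"host": "localhost", "portt": 8080})
          |   except TypeError as e:
          |       print(e)  # points at the unexpected key 'portt'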
        
         | skerit wrote:
         | > Any time I ask anything remotely niche, LLMs are often bad
         | 
         | As soon as the AI coder tools (like Aider, Cline, Claude-Coder)
         | come into contact with a _real world_ codebase, it does not end
         | well.
         | 
          | So far I think they managed to fix 2 relatively easy issues on
          | their own, but in other cases they:
          | 
          | - Rewrote tests in a way that the broken behaviour passes the
          | test
          | 
          | - Failed to solve the core issue in the code, and instead
          | patched up the broken result (like `if (result.includes(":") ||
          | result.includes("?")) { /* super expensive stupid fix for a
          | single specific case */ }`)
          | 
          | - Failed to even update the files properly, wasting a bunch of
          | tokens
        
       | nokun7 wrote:
       | What's particularly intriguing is how these models handle
       | uncertainty and potential "hallucinations". For instance,
       | OpenAI's o1/o3 have started hedging hallucinations more
       | conspicuously, using phrases like "this likely contains ...,"
       | which could serve as an alert for users to question the output.
       | This evolution hints at an emerging self-awareness in AI design,
       | where developers are training models to flag potential
       | inaccuracies, potentially reducing trust issues in critical
       | applications like coding or scientific research. However, TFA in
       | the OP could reveal whether these safeguards can be bypassed or
       | exploited, shedding light on the models' vulnerability to
       | adversarial inputs or creative prompting.
       | 
       | For Claude, inducing hallucinations might expose weaknesses in
       | its "extended thinking" mode, where it deliberates longer but
       | might overgeneralize or misinterpret certain unclear
       | instructions. This could be especially relevant in real-world
       | tasks like software development, where a hallucinated line of
       | code could introduce subtle but costly bugs. Conversely, such
       | experiments could also highlight the models' creative potential.
       | 
       | Overall, an investigation into making these models hallucinate
       | could push developers to refine safety mechanisms, improve
       | transparency, and better align these systems with human intent,
       | ensuring they remain reliable rather than unpredictable black
       | boxes. This topic underscores the delicate balance between
       | advancing AI capabilities and mitigating risks, a tension that
       | will likely shape the future of AI development itself.
        
         | foundry27 wrote:
         | It's always a touch ironic when AI-generated replies such as
         | this one are submitted under posts about AI. Maybe that's
          | secretly the self-reflection feedback loop we need for AGI
         | :)
        
           | DrammBA wrote:
           | So strange too, their other comments seem normal, but
           | suddenly they decided to post a gpt comment.
        
         | asadotzler wrote:
         | Until it's got several nines, it's not trustworthy. A $3
         | drugstore calculator has more accuracy and reliability nines
         | than any of today's commercial AI models and even those might
         | not be trustworthy in a variety of situations.
         | 
         | There is no self awareness about accuracy when the model can
         | not provide any kind of confidence scores. Couching all of its
         | replies in "this is AI so double check your work" is not self
         | awareness or even close, it's a legal disclaimer.
         | 
         | And as the other reply notes, are you a bot or just heavily
         | dependent on them to get your point across?
        
           | miningape wrote:
           | If I need to double check the work why would I waste my time
           | with the AI when I can just go straight to real sources?
        
         | layer8 wrote:
         | I don't think that models trained in that way exhibit any
         | increased degree of self-awareness.
        
       | mberning wrote:
       | In my experience LLMs do this kind of thing with enough frequency
       | that I don't consider them as my primary research tool. I can't
       | afford to be sent down rabbit holes which are barely discernible
       | from reality.
        
       | andix wrote:
       | I've got a lot of hallucinations like that from LLMs. I really
       | don't get how so many people can get LLMs to code most of their
       | tasks without those issues permanently popping up.
        
         | pinoy420 wrote:
         | A good prompt. You don't just ask it. You tell it how to behave
         | and give it a shot load of context
        
           | andix wrote:
            | With Claude the context window is quite small. But when you
            | add too much context it often seems to get worse. If the
            | context isn't carefully and narrowly picked and is too
            | unrelated, the LLM often starts doing things unrelated to
            | what you've asked.
            | 
            | At some point it's no longer worth crafting the perfect
            | prompt; just code it yourself. That also saves the time
            | spent carefully reviewing the AI-generated code.
        
             | johnisgood wrote:
             | Claude's context window is not small, is it not larger than
             | ChatGPT's?
        
               | andix wrote:
               | I just looked it up, it seems to be the rate limit that's
               | actually kicking in for me.
        
               | johnisgood wrote:
               | Yes, that's it! It is frustrating to me, too. You have to
               | start a new chat with all relevant data, and a detailed
               | summary of the progress/status.
        
           | troupo wrote:
           | Doesn't prevent it from hallucinating, only reduces
           | hallucinations by a single digit percentage
        
             | copperroof wrote:
             | Personally I've been finding that the more context I
             | provide the more it hallucinates.
        
               | Rury wrote:
               | There's probably a sweet spot. Same with people. Too much
               | context (especially unnecessary context) can be
               | confusing/distracting, as well as being too vague (as it
               | leaves room for multiple interpretations). But generally,
               | I find the more refined and explicit you are, the better.
        
         | kgeist wrote:
         | I use LLMs for writing generic, repetitive code, like
         | scaffolding. It's OK with boring, generic stuff. Sure it makes
         | mistakes occasionally but usually it's a no-brainer to fix
         | them.
        
           | andix wrote:
           | I try to keep "boring" code to a minimum, by finding
            | meaningful and simple abstractions. LLMs are especially bad
            | at handling those, because they were not trained on
            | non-standard abstractions.
           | 
           | Edit: most LLMs are great for spitting out some code that
           | fulfills 90% of what you asked for. That's sometimes all you
           | need. But we all know that the last 10% usually take the same
           | amount of effort as the first 90%.
        
             | rafaelmn wrote:
              | This is what got me the most sleepless nights, crunch,
              | and ass-clenching production issues over my career.
              | 
              | Simple repetitive shit is easy to reason about, debug,
              | and onboard people on.
              | 
              | Naturally it's a balancing act, and modern/popular
              | frameworks are where most people landed; there's been a
              | lot of iteration in this space for decades now.
        
               | andix wrote:
               | I've made the opposite observation. Without proper
               | abstractions code bases grow like crazy. At some point
               | they are just a huge amount of copy, paste, and slight
               | modification. The amount of code often grows
               | exponentially. With more lines of code comes more effort
               | to maintain it.
               | 
               | After a few years those copy and pasted code pieces
               | completely drift apart and create a lot of similar but
               | different issues, that need to be addressed one by one.
               | 
               | My approach for designing abstractions is always to make
               | them composable (not this enterprise java inheritance
               | chaos). To allow escaping them when needed.
        
             | simion314 wrote:
             | >most LLMs are great for spitting out some code that
             | fulfills 90% of what you asked for. That's sometimes all
             | you need. But we all know that the last 10% usually take
             | the same amount of effort as the first 90%.
             | 
              | The issue is when you have an LLM write 10k lines of
              | code for you, where 100 lines are bugged. Now you need
              | to debug code you did not write and find the bugged
              | lines, and you will waste a similar amount of time.
              | Worse, if you do not catch the bugs in time, you think
              | you gained some hours, but you will get upset customers
              | because things went wrong because the code is weird.
              | 
              | From my experience you need to work with an LLM and have
              | the code done function by function, with your input and
              | with you checking it and calling bullshit when it does
              | stupid things.
        
               | xmprt wrote:
               | In my experience using LLMs, the 90% is less about buggy
               | code and more about just ignoring 10% of the features
                | that you require. So it will write 100-1000 lines of
                | mostly correct (not buggy) code, but then, no matter
                | how hard you try, it won't get the remaining 10% right.
                | In the process it will mess up parts of the 90% that
                | was already working, or end up writing another 1000
                | lines of undecipherable code to get 97% there, but
                | still never 100%, unless you're building something
                | that's not that unique.
        
               | andix wrote:
               | Exactly my experience. It's always missing something. And
               | the generated code often can't be extended to fulfil
               | those missing aspects.
        
           | Terr_ wrote:
           | > I use LLMs for writing generic, repetitive code, like
           | scaffolding. It's OK with boring, generic stuff.
           | 
           | In other words, they're OK in use-cases that programmers need
           | to _eliminate_ , because it means there's high demand for a
           | reusable library, some new syntax sugar, or an improved API.
        
             | MaxikCZ wrote:
              | You can use niche libraries and still benefit from AI.
              | Basically anything I want to code is something that I
              | don't think exists, and bending an API so that a generic
              | library does just what I want means basically writing
              | the same code; it's just immediate instead of
              | contributing to the library, because the "piping" around
              | it magically appears. I bet if it keeps improving at the
              | current rate for a decade, the occupation "programmer"
              | will morph into "architect".
        
             | hombre_fatal wrote:
             | Boilerplate and plumbing code isn't inherently bad, nor do
             | you improve the codebase by factoring it down to zero with
             | libraries and abstractions.
             | 
             | As I've matured as a developer, I've appreciated certain
             | types of boilerplate more and more because it's code that
             | shows up in your git diffs. You don't need to chase down
             | the code in some version of some library to see how
             | something works.
             | 
             | Of course, not all boilerplate is created equally.
        
               | andix wrote:
               | Boilerplate is better than bad abstractions. But good
               | abstractions are far superior.
        
               | pertymcpert wrote:
               | The best abstraction is no abstraction.
        
               | andix wrote:
               | No.
               | 
               | Edit: I'm not going to code asm just to be cool.
        
               | xmprt wrote:
               | I agree with you but as I've matured as a programmer, I
               | feel like it's very hard to get abstractions for
               | boilerplate right. Every library I've seen attempt to do
               | it has struggled.
        
               | airstrike wrote:
               | "A program is like a poem. You cannot write a poem
               | without writing it." -- Dijkstra
        
             | mvdtnz wrote:
             | I'll take boring scaffolding code over libraries that
             | perform undebuggable magic with monkey patches, reflection
             | or dynamic code.
        
         | bakugo wrote:
         | > I really don't get how so many people can get LLMs to code
         | most of their tasks without those issues permanently popping up
         | 
         | They can't, they usually just don't understand the code enough
         | to notice the issues immediately.
         | 
         | The perceived quality of LLM answers is inversely proportional
         | to the user's understanding of the topic they're asking about.
        
           | mlyle wrote:
           | Alternatively, we understand it well, and discard bad
           | completions immediately.
           | 
           | When I'm using llama.vim, like 40% of what it writes in a 4-5
           | line completion is exactly what I'd write. 20-30% is stuff
           | that I wouldn't judge coming from someone else, so I usually
           | accept it. And 30-40% is garbage... but I just write a
           | comment or a couple of lines, instead, and then reroll the
           | dice.
           | 
           | It's like working through a junior engineer, _except_ the
           | junior engineer types a new solution instantly. I can get
           | down to alternating between mashing tab and writing tricky
           | lines.
        
             | andix wrote:
             | I don't see the point in AI code completions, they are just
             | distracting noise. I'm only doing bigger changes with AI.
             | 
             | Prompt based stuff, like "extract the filtering part from
             | all API endpoints in folder abc/xyz. Find a suitable
             | abstraction and put this function into filter-
             | utils.codefile"
        
               | Lorak_ wrote:
               | What tools do you use to perform such tasks?
        
               | andix wrote:
               | Aider. I tried Cursor too, but I don't like VS Code and
                | not being able to choose the LLM provider. I think there
               | are already a lot of tools that perform kind of equally.
        
             | bakugo wrote:
             | I've tried using basic AI completions before and found that
             | the signal-to-noise ratio wasn't quite good enough for my
             | taste in my use cases, but I can totally understand it
             | being good enough for others.
             | 
             | My comment was more about just asking questions on how to
             | do things you're totally clueless about, in the form of
             | "how do I implement X using Y?" for example. I've found
             | that, as a general rule, if I can't find the answer to that
             | question myself in a minute or two of googling, LLMs can't
             | answer it either the majority of the time. This would be
             | fine if they said "I don't know how to do that" or "I don't
             | believe that's possible" but no, they will confidently make
             | up code that doesn't work using interfaces that don't
             | exist, which usually ends up wasting my time.
        
           | andix wrote:
           | That's more or less my suspicion.
           | 
           | A few months ago I tried to do a small project with
           | Langchain. I'm a professional software developer, but it was
           | my first Python project. So I tried to use a lot of AI
           | generated code.
           | 
            | I was really surprised that the AI couldn't do much more
            | than what was in the examples. Whenever I had something to
            | solve that wasn't supported by the Langchain abstractions,
            | it just started to hallucinate Langchain methods that
            | didn't exist, instead
           | of suggesting some code to actually solve it. I had to figure
           | it out by myself, the glue code I had to hack together wasn't
           | pretty, but it worked. And I learned not to use Langchain
           | ever again :)
        
           | johnisgood wrote:
           | Exactly.
        
         | QuantumGood wrote:
          | GPT can't even tell what it's done or give what it knows it
         | should. It's endless, "Apologies, here is what you actually
         | asked for ..." and again it isn't.
        
           | MortyWaves wrote:
           | This was my primary reason for using Claude. Absolutely
           | useless experience with chatgpt oftentimes. I've mainly been
           | using LLMs to help maintain a ridiculously poorly made
           | technical debt dumpster fire, and Claude has been really
           | helpful here mainly with repetitive code.
        
         | ninininino wrote:
         | A language like Golang tries really hard to only have _one_ way
         | to do something, one right way, one way. Just one way. See how
         | it was before generics. You just have a for loop. Can't really
         | mess up a for loop.
         | 
         | I predict that the variance in success in using LLM for coding
         | (even agentic coding with multi-step rather than a simple line
         | autosuggest or block autosuggest that many are familar with via
         | CoPilot) has much more to do with:
         | 
         | 1) is the language a super simple, hard to foot-gun yourself
         | language, with one way to do things that is consistent
         | 
         | AND
         | 
          | 2) do juniors and students tend to use the lang, and how much
          | of the online content (on StackOverflow, for example) is
          | written by students or juniors or bootcamp folks writing
          | incorrect code and posting it online.
          | 
          | What % of the online Golang code is in GH repos like Docker
          | or K8s vs. a student posting their buggy Gomoku
          | implementation on StackOverflow?
         | 
         | The future of programming language design has AI-
         | comprehensibility/AI-hallucination-avoidance as one of the key
         | pillars. #1 above is a key aspect.
        
           | andix wrote:
           | > A language like Golang tries really hard to only have _one_
           | way to do something
           | 
           | Really?
           | 
           | Logging in Go: A Comparison of the Top 9 Libraries
           | 
           | https://betterstack.com/community/guides/logging/best-
           | golang...
        
             | evanmoran wrote:
             | Since Go added slog many of these have been removed in
             | favor of that. Obviously not universal, but compared to npm
             | there really are massive numbers of devs just using the
             | standard library.
        
             | throwaway920102 wrote:
             | I would argue logging options to be more of an exception
             | than the rule. Compare the actual language features of Go
             | to something like Rust or Javascript and you'll see what I
             | mean. As a new developer to the language (especially for
             | juniors), you can learn all the features of Go much faster.
             | It's made to be picked up quickly and for everyone's code
             | to look the same, rather than expressing a personal style.
        
           | Yoric wrote:
           | Note that "super-simple", "hard to footgun yourself" and "one
           | way to do things that is consistent" are three very different
           | things.
           | 
           | I don't think that we yet have one language that is good at
           | all that. And yes, I (sometimes) program in Go for a living.
        
         | bugglebeetle wrote:
         | > LLMs. I really don't get how so many people can get LLMs to
         | code most of their tasks without those issues permanently
         | popping up.
         | 
         | You write tests in the same way as you would when checking your
         | own work or delegating to anyone else?
        
         | magicalhippo wrote:
         | I've used it for some smaller greenfield code with success.
         | Like, write an Arduino program that performs a number of super-
         | sampled analog readings, and performs a linear regression fit,
         | printing the result to the serial port.
         | 
         | That sort of stuff can be very helpful to newbies in the DIY
         | electronics world for example.
         | 
         | But for anything involving my $dayjob it's been fairly useless
         | beyond writing unit test outlines.
        
           | ianbutler wrote:
           | I use it everyday, it has to have good search and good static
           | analysis built in.
           | 
           | You also have to be very explanatory with a direct
           | communication style.
           | 
           | Our system imports the codebase so it can search and navigate
           | plus we feed lsp errors directly to the LLM as development is
           | happening.
        
           | andix wrote:
           | > Like, write an Arduino program that performs
           | 
           | stuff like that works amazing
           | 
           | > But for anything involving my $dayjob it's been fairly
           | useless beyond writing unit test outlines
           | 
           | This was my opinion 3-6 months ago. But I think a lot of
           | tools matured enough to already provide a lot of value for
           | complex tasks. The difficult part is to learn when and how to
           | use AI.
        
         | runeblaze wrote:
          | They are good at (combining well-known, codeforces-style)
          | algorithms; oftentimes I don't care about the syntax, I just
          | need the algorithm. LLMs could write pseudocode for all I
          | care, but they tend to get the syntax correct quite often.
        
         | johnisgood wrote:
         | I have made large projects using Claude, with success. I know
         | what I want to do and how to do it, maybe my prompts were
         | right.
        
           | miunau wrote:
           | How do you deal with large files? After about a thousand
           | lines in a file, it starts to cough for me. Forgets that some
           | functions exist and makes up inferior duplicate ones.
        
             | johnisgood wrote:
              | I did not experience hallucinations (very rarely, if at
              | all) when I used it for programming. It happened more
              | with niche languages (so I provide examples and
              | documentation), and with GPT.
              | 
              | Let us say there are 3k lines of RFCs, API docs,
              | documentation of niche languages, and examples, plus 2k
              | lines of code generated by Claude (iteratively, starting
              | small); then I do exceed the limit after a while. In
              | that case I ask it to summarize everything in detail,
              | start a new chat, use those 3k lines and the recent
              | code, and continue ad infinitum.
        
             | 0x5f3759df-i wrote:
             | If possible you need to refactor before getting to that
             | point.
             | 
             | Claude has done a good job refactoring, though I've had to
             | tell it to give me a refactor plan upfront in case the
             | conversation limit gets hit. Then in a new chat I tell it
             | which parts of the plan it has already done.
             | 
             | But a larger context/conversation limit is definitely
             | needed because it's super easy to fill up.
        
           | mattmanser wrote:
           | What do you define as a large project? Like TLOC?
        
             | johnisgood wrote:
             | In my case the maximum was ~3k LOC.
        
               | mvdtnz wrote:
                | That's not just small, it's utterly minuscule. It's most
               | certainly not large.
        
               | johnisgood wrote:
               | Depends. 3k is pretty much enough for a fully-featured
               | XY.
               | 
                | So: no context, and differing definitions of "large".
               | 
               | Perhaps if you come from Java, then yeah.
               | 
               |  _shrugs_
        
       | adamgordonbell wrote:
       | We at pulumi started treating some hallucinations like this as
       | feature requests.
       | 
        | Sometimes an LLM will hallucinate a flag or option that really
        | makes sense - it just doesn't actually exist.
        
         | wrs wrote:
         | This sort of hallucination happens to me frequently with AWS
         | infrastructure questions. Which is depressing because I can't
         | do anything but agree, "yeah, that API is exactly what any sane
         | person would want, but AWS didn't do that, which is why I'm
         | asking the question".
        
           | miningape wrote:
           | Why are you so sure it's what someone sane would want? Maybe
           | there are other ways because there are hidden problems and
           | edge cases with that procedure. It could contradict the
           | fundamental model of the underlying resources but looks
           | correct to someone with a cursory understanding.
           | 
           | I'm not saying this is the case, but LLMs are often wrong in
           | subtle ways like this.
        
         | kgeist wrote:
         | Also, sometimes a flag does exist, but the example places it
         | incorrectly, causing the command to reject it. Or, a flag used
         | to exist but was removed in the latest versions.
        
       | latexr wrote:
       | > Conclusion
       | 
       | > LLMs are really smart most of the time.
       | 
       | No, the conclusion is they're _never_ "smart". All they do is
       | regurgitate text which resembles a continuation of what came
       | before, and sometimes--but with zero guarantees--that text aligns
       | with reality.
        
         | miningape wrote:
         | This, thank you. It pisses me off to no end when people pretend
          | LLMs are smart. They are nothing but well-trained random text
          | generators.
          | 
          | Seriously, some of these conversations feel like interacting
          | with someone who believes casting bones and astrology are
          | accurate. Likely because in both cases they are a result of
          | confirmation bias.
        
           | immibis wrote:
           | We don't know what smartness is. What if that's what
           | smartness is?
        
             | miningape wrote:
             | We might not know what it is, but it's not that. At a bare
             | minimum smartness requires abstract reasoning (and no, so-
             | called "reasoning" models do not do that - it's a marketing
             | trick)
        
         | BrenBarn wrote:
          | Similarly, the thing at the end with "I find it endearing".
          | I mean, the author feels what he feels, but personally I
          | find this LLM behavior disgusting and depressing.
        
         | jonas21 wrote:
         | The same as you and me, really.
        
       | IAmNotACellist wrote:
       | "Not acceptable. Please upgrade your browser to continue." No, I
       | don't think I will.
        
         | hahahacorn wrote:
         | Sorry about that. This is a default rails 8 setting, removed
         | the blocker.
        
       | leumon wrote:
       | He should've tested 4.5. This model is hallucinating much less
       | than any other model.
        
       | aranw wrote:
        | I wonder how easy it would be to influence super LLMs if a
        | particular group of people created enough articles that any
        | human reader would recognise as a load of garbage and rubbish
        | to be ignored, but that an LLM parsing them wouldn't, ruining
        | its reasoning and code generation abilities?
        
         | dannygarcia wrote:
         | It's very easy. I've done this by accident. One of my side
         | projects helps users price the affordability of a particular
         | kind of product. When I ask various LLMs "can I afford X" or
         | "how much do I need to earn to buy X", my project comes up as a
         | source/reference. I currently manually crawl retailers for the
         | MSRP so these numbers are usually months out of date!
        
       | joelthelion wrote:
       | Hallucinations like this could be a great way to identify missing
       | features or confusing parts of your framework. If the llm invents
       | it, maybe it ought to be like this?
        
         | andix wrote:
         | I like your thinking :)
        
         | jeanlucas wrote:
         | Only if you wanna optimize exclusively for LLM users in this
         | generation.
        
       | Baggie wrote:
       | The conclusion paragraph was really funny and kinda perfectly
       | encapsulates the current state of AI, but as pointed out by
       | another comment, we can't even call them smart, just "Ctrl C Ctrl
       | V Leeroy Jenkins style"
        
       | jwjohnson314 wrote:
       | The interesting thing here to me is that the llm isn't
       | 'hallucinating', it's simply regurgitating some data it digested
       | during training.
        
         | mvdtnz wrote:
         | What's the difference?
        
       | lxe wrote:
       | This is incredible, and it's not technically a "hallucination". I
       | bet it's relatively easy to find more examples like this...
       | something on the internet that's both niche enough, popular
       | enough, and wrong, yet was scraped and trained on.
        
       | saurik wrote:
       | What I honestly find most interesting about this is the thought
       | that hallucinations might lead to the kind of emergent language
       | design we see in natural language (which might not be a good
       | thing for a computer language, fwiw, but still interesting),
        | where people just kind of think "language should work this way
       | and if I say it like this people will probably understand me".
        
       ___________________________________________________________________
       (page generated 2025-03-01 23:00 UTC)