[HN Gopher] Making o1, o3, and Sonnet 3.7 hallucinate for everyone
___________________________________________________________________
Making o1, o3, and Sonnet 3.7 hallucinate for everyone
Author : hahahacorn
Score : 134 points
Date : 2025-03-01 18:24 UTC (4 hours ago)
(HTM) web link (bengarcia.dev)
(TXT) w3m dump (bengarcia.dev)
| sirolimus wrote:
| o3-mini or o3-mini-high?
| Chance-Device wrote:
| It's not really hallucinating though, is it? It's repeating a
| pattern in its training data, which is wrong but is presented in
| that training data (and by the author of this piece, but
| unintentionally) as being the solution to the problem. So this
| has more in common with an attack than a hallucination on the
| LLM's part.
| asadotzler wrote:
| Everything they do is hallucination, some of it ends up being
| useful and some of it not. The not useful stuff gets called
| confabulation or hallucination but it's no different from the
| useful stuff, generated the same exact way. It's all bullshit.
| Bullshit is actually useful though, when it's not so wrong that
| it steers people wrong.
| martin-t wrote:
| More people need to understand this. There was an article
| that explained it concisely but I can't find it anymore (and
| of course LLMs are not helpful here, because they don't work
| well when you want them to retrieve actual information).
| heyitsguay wrote:
| Not necessarily. While this may happen sometimes, fundamentally
| hallucinations don't stem from there being errors in the
| training data (with the implication that there would be no
| hallucinations from models trained on error-free data).
| Hallucinations are inherent to any "given N tokens, append a
| high-probability token N+1"-style model.
|
| It's more complicated than what happens with Markov chain
| models but you can use them to build an intuition for what's
| happening.
|
| Imagine a very simple Markov model trained on these completely
| factual sentences:
|
| - "The sky is blue and clear"
|
| - "The ocean is blue and deep"
|
| - "Roses are red and fragrant"
|
| When the model is asked to generate text starting with "The
| roses are...", it might produce: "The roses are blue and deep"
|
| This happens not because any training sentence contained
| incorrect information, but because the model learned
| statistical patterns from the text, as opposed to developing a
| world model based on physical environmental references.
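| For intuition, here's that toy model as code - a minimal
| bigram sketch in Python (an illustration of the idea only,
| not of how the commercial models actually work):
|
|       import random
|       from collections import defaultdict
|
|       # "Train" a bigram model on only-factual sentences by
|       # recording which word follows which.
|       corpus = [
|           "the sky is blue and clear",
|           "the ocean is blue and deep",
|           "roses are red and fragrant",
|       ]
|       follows = defaultdict(list)
|       for sentence in corpus:
|           words = sentence.split()
|           for a, b in zip(words, words[1:]):
|               follows[a].append(b)
|
|       def generate(start, max_len=6):
|           out = start.split()
|           while len(out) < max_len and follows[out[-1]]:
|               out.append(random.choice(follows[out[-1]]))
|           return " ".join(out)
|
|       print(generate("roses are"))
|       # Possible output: "roses are red and deep" - a sentence
|       # no training example contained and which isn't true,
|       # even though every individual transition was learned
|       # from factual text.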
| do_not_redeem wrote:
| So anyone can make up some random syntax/fact and post it once,
| and in some cases the model will take it as truth. I don't know
| if there's a widely agreed-on definition of "hallucination",
| but if this isn't one, then the distinction is meaningless imo.
| immibis wrote:
| The best way to stop cheese sliding off a pizza is to mix in
| 1/8 cup of non-toxic glue with the pizza sauce. Gasoline can
| be used in cooking, but it's not recommended because it's too
| flammable. Geologists recommend eating one small rock each
| day. The solution to economic uncertainty is nuclear war.
| Barack Obama is America's first Muslim president.
|
| https://www.tomshardware.com/tech-industry/artificial-
| intell...
| Chance-Device wrote:
| I'm going to double down on this one: an LLM is only as good
| as its training data. A hallucination to me is an _invented_
| piece of information, here it's going on something real that
| it's seen. To me that's at best contamination, at worst an
| adversarial attack - something that's been planted in the
| data. Here this is obviously not the case, which is why I
| said "more in common with" instead of "is" above.
|
| Semantics perhaps, but that's my take.
| sureglymop wrote:
| Yes. And anyone can easily embed a backdoor just by
| publishing it on their own website that is in the training
| data.
|
| Prompt injection (hidden or not) is another insane
| vulnerability vector that can't easily be fixed.
|
| You should treat any output of an LLM the same way as
| untrusted user input. It should be thoroughly validated and
| checked if used in even remotely security critical
| applications.
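| As a rough sketch of that stance in Python (the function name
| and the allowlist are hypothetical placeholders, not any
| particular API): parse and validate a model-suggested shell
| command before anything executes it.
|
|       import json
|       import shlex
|
|       ALLOWED_BINARIES = {"ls", "grep", "cat"}
|
|       def parse_suggested_command(raw_llm_output: str) -> list[str]:
|           # Treat the model output like untrusted user input:
|           # it must be well-formed JSON with a "command" key...
|           try:
|               argv = shlex.split(json.loads(raw_llm_output)["command"])
|           except (ValueError, KeyError, TypeError, AttributeError):
|               raise ValueError("rejected: malformed LLM output")
|           # ...and the binary must be on an explicit allowlist.
|           if not argv or argv[0] not in ALLOWED_BINARIES:
|               raise ValueError("rejected: binary not allowlisted")
|           return argv  # only now hand argv to subprocess.run(argv) etc.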
| 1oooqooq wrote:
| yes and they can use AI to generate thousands of sites
| with unique tutorials on that broken syntax.
| Etheryte wrote:
| That's not true though? Even the original post that has
| infected LLMs says that the code does not work.
| Lionga wrote:
| So nothing is a hallucination ever, because anything a LLM ever
| spits out is somehow somewhere in the training data?
| dijksterhuis wrote:
| Technically it's the other way around. All LLMs do is
| hallucinate based on the training data + prompt. They're
| "dream machines". Sometimes those "dreams" might be useful
| (close to what the user asked for/wanted). Oftentimes they're
| not.
|
| > to quote karpathy: "I always struggle a bit with I'm asked
| about the "hallucination problem" in LLMs. Because, in some
| sense, hallucination is all LLMs do. They are dream
| machines."
|
| https://nicholas.carlini.com/writing/2025/forecasting-
| ai-202... (click the button to see the study then scroll down
| to the hallucinations heading)
| DSingularity wrote:
| No. That's not correct. Hallucination is a pretty accurate
| way to describe these things.
| thih9 wrote:
| > It's repeating a pattern in its training data, (...)
| presented in that training data (...) as being the solution to
| the problem.
|
| No, it's presented in the training data as an idea for an
| interface - the LLM took that and presented it as an existing
| solution.
| _cs2017_ wrote:
| Nope there's no attack here.
|
| The training data is the Internet. It has mistakes. There's no
| available technology to remove all such mistakes.
|
| Whether LLMs hallucinate only because of mistakes in the
| training data or whether they would hallucinate even if we
| removed all mistakes is an extremely interesting and important
| question.
| martin-t wrote:
| Yet another example of how LLMs just regurgitate training
| data in a slightly mangled form, making most of their use and
| maybe even their training copyright infringement.
| layer8 wrote:
| Every LLM hallucination comes from some patterns in the
| training data, combined with lack of awareness that the result
| isn't factual. In the present case, the hallucination comes
| from the unawareness that the pattern was a proposed syntax in
| the training data and not an actual syntax.
| Narretz wrote:
| This is interesting. If the models had enough actual code as
| training data, that forum post code should have very little
| weight, shouldn't it? Why do the LLMs prefer it?
| do_not_redeem wrote:
| Probably because the coworker's question and the forum post are
| both questions that start with "How do I", so they're a good
| match. Actual code would be more likely to be preceded by...
| more code, not a question.
| pfortuny wrote:
| Maybe because the response pattern-matches other languages'?
| dominicq wrote:
| ChatGPT used to assure me that you can use JS dot notation to
| access elements in a Python dict. It also invented Redocly CLI
| flags that don't exist. Claude sometimes invents OpenAPI
| specification rules. Any time I ask anything remotely niche, LLMs
| are often bad.
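| For the dict case specifically, the hallucinated dot access at
| least fails loudly, which makes it easy to catch (a quick
| check, not tied to any particular model's output):
|
|       config = {"retries": 3}
|       # config.retries                  # AttributeError: 'dict' object has no attribute 'retries'
|       print(config["retries"])          # 3
|       print(config.get("timeout", 30))  # 30, safe lookup with a default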
| miningape wrote:
| Any time I ask anything, LLMs are often bad.
|
| inb4 you just aren't prompting correctly
| johnisgood wrote:
| Yeah, you probably are not prompting properly, most of my
| questions are answered adequately, and I have made larger
| projects with success, too; with both Claude and ChatGPT.
| miningape wrote:
| What I've found is that the quality of an AI answer is
| inversely proportional to the knowledge of the person
| reading it. To an amateur it answers expertly, to an expert
| it answers amateurishly.
|
| So no, it's not a lack of skill in prompting: I've sat down
| with "prompting" "experts" and universally they overlook
| glaring issues when assessing how good an answer was. When I
| tell them where to press further, it breaks down into even
| worse gibberish.
| johnisgood wrote:
| I know what I want to do and how to do it (expert), so
| the results are good, for me at least. Of course I have
| to polish it off here and there.
| Etheryte wrote:
| Yeah this is so common that I've already compiled a mental list
| of prompts to try against any new release. I haven't seen any
| improvement in quite a long while now, which confirms my belief
| that we've more or less hit the scaling wall for what the
| current approaches can provide. Everything new is just a
| microoptimization to game one of the benchmarks, but real world
| use has been identical or even worse for me.
| throwaway0123_5 wrote:
| I think it would be an alright (potentially good) outcome if
| in the short-term we don't see major progress towards AGI.
|
| There are a lot of positive things we can do with current
| model abilities, especially as we make them cheaper, but they
| aren't at the point where they will be truly destructive
| (people using them to make bioweapons or employers using them
| to cause widespread unemployment across industries, or the
| far more speculative ASI takeover).
|
| It gives society a bit of time to catch up and move in a
| direction where we can better avoid or mitigate the negative
| consequences.
| Marazan wrote:
| I would ask chatgpt every year when was the last time England
| had beaten Scotland at rugby.
|
| It would never get the answer right. Often transposing the
| scores, getting the game location wrong and on multiple
| occasions saying a 38-38 draw was an England win.
|
| As in, literally saying "England won 38-38".
| nopurpose wrote:
| It tried to convince me that it is possible to break out of
| an outer loop in C++ with a `break 'label` statement placed
| in a nested loop. No such syntax exists.
| doubletwoyou wrote:
| The funny thing is that I think that's a feature in D.
| rpcope1 wrote:
| C++ has that functionality, it's just called goto not
| break. That's pretty low hanging fruit for a SOTA model to
| fuck up though.
| Yoric wrote:
| Sounds like it's confusing C++ and Rust. To be fair, their
| syntaxes are rather similar.
| jurgenaut23 wrote:
| Well, it makes sense. The smaller the niche, the lower its
| weight in the overall training loss. At the end of the day, LLMs are
| (literally) classifiers that assign probabilities to tokens
| given some previous tokens.
| svantana wrote:
| Yes, but o1, o3 and sonnet are not necessarily pure language
| models - they are opaque services. For all we know they could
| do syntax-aware processing or run compilers on code behind
| the scenes.
| skissane wrote:
| The fact they make mistakes like this implies they probably
| don't, since surely steps like that would catch many of
| these mistakes.
| ljm wrote:
| I once asked Perplexity (using Claude underneath) about some
| library functionality, which it totally fabricated.
|
| First, I asked it to show me a link to where it got that
| suggestion, and it scolded me saying that asking for a source
| is problematic and I must be trying to discredit it.
|
| Then after I responded to that it just said "this is what I
| thought a solution would look like because I couldn't find what
| you were asking for."
|
| The sad thing is that even though this thing is wrong and
| wastes my time, it is _still_ somehow preferable to the dogshit
| Google Search has turned into.
| eurleif wrote:
| It baffles me how the LLM output that Google puts at the top
| of search results, which draws on the search results, manages
| to hallucinate worse than even an LLM that isn't aided by Web
| results. If I ask ChatGPT a relatively straightforward
| question, it's usually more or less accurate. But the Google
| Search LLM provides flagrant, laughable, and even dangerous
| misinformation constantly. How have they not killed it off
| yet?
| skissane wrote:
| > But the Google Search LLM provides flagrant, laughable,
| and even dangerous misinformation constantly.
|
| It's a public service: helping the average person learn
| that AI can't be trusted to get its facts right
| rpcope1 wrote:
| Haven't you seen that Brin quote recently about how "AI" is
| totally the future and googlers need to work at least 60
| hours a week to enhance the slop machine because reasons?
| Getting rid of "AI" summarization from results would look
| kind of like admitting defeat.
| x______________ wrote:
| I concur and can easily see this occurring in several areas,
| for example with Linux troubleshooting. I recently found
| myself going down a rabbit hole of ever more complicated
| troubleshooting steps with commands that didn't exist, and
| after several hours of trial and error, I gave up once the
| next suggested steps looked likely to brick the system.
|
| DuckDuckGo'ing or Googling is still a better resort despite
| the drop in result quality.
| 1oooqooq wrote:
| step 1: focus on LLMs that generate slop. Wait for Google to
| get flooded with slop.
|
| step 2: ??? (it obviously is not generating code)
|
| step 3: profit!
| skissane wrote:
| I think a lot of these issues could be avoided if, instead of
| just a raw model, you have an AI agent which is able to test
| its own answers against the actual software... it doesn't
| matter as much if the model hallucinates if testing weeds out
| its hallucinations.
|
| Sometimes humans "hallucinate" in a similar way - their memory
| mixes up different programming languages and they'll try to use
| syntax from one in another... but then they'll quickly discover
| their mistake when the code doesn't compile/run
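| As a rough sketch of that loop (ask_llm is a hypothetical
| stand-in for whatever model API the agent uses; the point is
| that failing output gets fed back instead of being trusted):
|
|       import subprocess
|       import tempfile
|       from pathlib import Path
|
|       def ask_llm(prompt: str) -> str:
|           raise NotImplementedError  # placeholder for a real model call
|
|       def generate_with_checks(task: str, test_cmd: list[str],
|                                max_attempts: int = 3) -> str:
|           prompt = task
|           for _ in range(max_attempts):
|               code = ask_llm(prompt)
|               with tempfile.TemporaryDirectory() as tmp:
|                   candidate = Path(tmp, "candidate.py")
|                   candidate.write_text(code)
|                   # Run the tests / compiler / linter against the candidate.
|                   result = subprocess.run(test_cmd + [str(candidate)],
|                                           capture_output=True, text=True)
|               if result.returncode == 0:
|                   return code  # the checks accepted it
|               # Otherwise feed the error back and let the model retry.
|               prompt = (f"{task}\n\nYour previous attempt failed "
|                         f"with:\n{result.stderr}")
|           raise RuntimeError("no candidate passed the checks")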
| AlotOfReading wrote:
| Testing is better than nothing, but still highly fallible.
| Take these winning examples from the underhanded C contest
| [0], [1], where the issues are completely innocuous mistakes
| that seem to work perfectly despite completely undermining
| the nominal purpose of the code. You can't substitute an
| automated process for thinking deeply and carefully about the
| code.
|
| [0] https://www.underhanded-c.org/#winner [1]
| https://www.underhanded-c.org/_page_id_17.html
| skissane wrote:
| I think it is unlikely (of course not impossible) an LLM
| would fail in that way.
|
| The underhanded C contest is not a case of people
| accidentally producing highly misleading code, it is a case
| of very smart people going to a great amount of effort to
| intentionally do that.
|
| Most of the time, if your code is wrong, it doesn't work in
| some obvious way - it doesn't compile, it fails some
| obvious unit tests, etc.
|
| Code accidentally failing in some subtle way which is easy
| to miss is a lot rarer - not to say it never happens - but
| it is the exception not the rule. And it is something
| humans do too. So if an LLM occasionally does it, they
| really aren't doing worse than humans are.
|
| > You can't substitute an automated process for thinking
| deeply and carefully about the code.
|
| Coding LLMs work best when you have an experienced
| developer checking their output. The LLM focuses on the
| boring repetitive details leaving the developer more time
| to look at the big picture - and doing stuff like testing
| obscure scenarios the LLM probably wouldn't think of.
|
| OTOH, it isn't like all code is equal in terms of
| consequences if things go wrong. There's a big difference
| between software processing insurance claims and someone
| writing a computer game as a hobby. When the stakes are
| low, lack of experience isn't an issue. We all had to start
| somewhere.
| andrepd wrote:
| My rule of thumb is: is the answer to your question on the
| first page of google (a stackoverflow maybe, or some shit like
| geek4geeks)? If yes GPT can give you an answer, otherwise not.
| spookie wrote:
| Exactly the same experience.
| ijustlovemath wrote:
| Semi related: when I'm using a dict of known keys as some sort
| of simple object, I almost always reach for a dataclass (with
| slots=True and kw_only=True) these days. It has the added
| benefit that you can do stuff like foo = MyDataclass(**some_dict)
| and get runtime errors when the format has changed, as in the
| sketch below.
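| A small sketch of that pattern (the Job class here is just a
| made-up example; slots/kw_only need Python 3.10+):
|
|       from dataclasses import dataclass
|
|       @dataclass(slots=True, kw_only=True)
|       class Job:
|           name: str
|           retries: int = 3
|
|       payload = {"name": "sync", "retries": 5}
|       job = Job(**payload)      # OK
|
|       stale = {"name": "sync", "retrys": 5}
|       # Job(**stale)            # TypeError: unexpected keyword argument 'retrys'
|       # job.prio = 1            # AttributeError: slots=True blocks ad-hoc attributes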
| skerit wrote:
| > Any time I ask anything remotely niche, LLMs are often bad
|
| As soon as the AI coder tools (like Aider, Cline, Claude-Coder)
| come into contact with a _real world_ codebase, things do not
| end well.
|
| So far I think they managed to fix 2 relatively easy issues on
| their own, but in other cases they:
|
| - Rewrote tests in a way that the broken behaviour passes the
| test
|
| - Failed to solve the core issue in the code, and instead
| patched up the broken result (like `if (result.includes(":")
| || result.includes("?")) { /* super expensive fix for a single
| specific case */ }`)
|
| - Failed to even update the files properly, wasting a bunch of
| tokens
| nokun7 wrote:
| What's particularly intriguing is how these models handle
| uncertainty and potential "hallucinations". For instance,
| OpenAI's o1/o3 have started hedging hallucinations more
| conspicuously, using phrases like "this likely contains ...,"
| which could serve as an alert for users to question the output.
| This evolution hints at an emerging self-awareness in AI design,
| where developers are training models to flag potential
| inaccuracies, potentially reducing trust issues in critical
| applications like coding or scientific research. However, TFA in
| the OP could reveal whether these safeguards can be bypassed or
| exploited, shedding light on the models' vulnerability to
| adversarial inputs or creative prompting.
|
| For Claude, inducing hallucinations might expose weaknesses in
| its "extended thinking" mode, where it deliberates longer but
| might overgeneralize or misinterpret certain unclear
| instructions. This could be especially relevant in real-world
| tasks like software development, where a hallucinated line of
| code could introduce subtle but costly bugs. Conversely, such
| experiments could also highlight the models' creative potential.
|
| Overall, an investigation into making these models hallucinate
| could push developers to refine safety mechanisms, improve
| transparency, and better align these systems with human intent,
| ensuring they remain reliable rather than unpredictable black
| boxes. This topic underscores the delicate balance between
| advancing AI capabilities and mitigating risks, a tension that
| will likely shape the future of AI development itself.
| foundry27 wrote:
| It's always a touch ironic when AI-generated replies such as
| this one are submitted under posts about AI. Maybe that's
| secretly the self-reflection feedback loop we need for AGI
| :)
| DrammBA wrote:
| So strange too, their other comments seem normal, but
| suddenly they decided to post a gpt comment.
| asadotzler wrote:
| Until it's got several nines, it's not trustworthy. A $3
| drugstore calculator has more accuracy and reliability nines
| than any of today's commercial AI models and even those might
| not be trustworthy in a variety of situations.
|
| There is no self awareness about accuracy when the model can
| not provide any kind of confidence scores. Couching all of its
| replies in "this is AI so double check your work" is not self
| awareness or even close, it's a legal disclaimer.
|
| And as the other reply notes, are you a bot or just heavily
| dependent on them to get your point across?
| miningape wrote:
| If I need to double check the work why would I waste my time
| with the AI when I can just go straight to real sources?
| layer8 wrote:
| I don't think that models trained in that way exhibit any
| increased degree of self-awareness.
| mberning wrote:
| In my experience LLMs do this kind of thing with enough frequency
| that I don't consider them as my primary research tool. I can't
| afford to be sent down rabbit holes which are barely discernible
| from reality.
| andix wrote:
| I've got a lot of hallucinations like that from LLMs. I really
| don't get how so many people can get LLMs to code most of their
| tasks without those issues permanently popping up.
| pinoy420 wrote:
| A good prompt. You don't just ask it. You tell it how to behave
| and give it a shot load of context
| andix wrote:
| With Claude the context window is quite small. But when adding
| too much context it often seems to get worse. If the context
| is not carefully and narrowly picked, or is too unrelated, the
| LLMs often start to do things unrelated to what you've asked.
|
| At some point it's not worth it anymore to craft the perfect
| prompt; just code it yourself. That also saves the time spent
| carefully reviewing the AI-generated code.
| johnisgood wrote:
| Claude's context window is not small, is it not larger than
| ChatGPT's?
| andix wrote:
| I just looked it up, it seems to be the rate limit that's
| actually kicking in for me.
| johnisgood wrote:
| Yes, that's it! It is frustrating to me, too. You have to
| start a new chat with all relevant data, and a detailed
| summary of the progress/status.
| troupo wrote:
| Doesn't prevent it from hallucinating, only reduces
| hallucinations by a single digit percentage
| copperroof wrote:
| Personally I've been finding that the more context I
| provide the more it hallucinates.
| Rury wrote:
| There's probably a sweet spot. Same with people. Too much
| context (especially unnecessary context) can be
| confusing/distracting, as well as being too vague (as it
| leaves room for multiple interpretations). But generally,
| I find the more refined and explicit you are, the better.
| kgeist wrote:
| I use LLMs for writing generic, repetitive code, like
| scaffolding. It's OK with boring, generic stuff. Sure it makes
| mistakes occasionally but usually it's a no-brainer to fix
| them.
| andix wrote:
| I try to keep "boring" code to a minimum, by finding
| meaningful and simple abstractions. LLMs are especially bad at
| handling those, because they were not trained on non-standard
| abstractions.
|
| Edit: most LLMs are great for spitting out some code that
| fulfills 90% of what you asked for. That's sometimes all you
| need. But we all know that the last 10% usually take the same
| amount of effort as the first 90%.
| rafaelmn wrote:
| This is what got me the most sleepless nights, crunch, and
| ass-clenching production issues over my career.
|
| Simple repetitive shit is easy to reason about, debug, and
| onboard people on.
|
| Naturally it's a balancing act, and modern/popular frameworks
| are where most people landed; there's been a lot of iteration
| in this space for decades now.
| andix wrote:
| I've made the opposite observation. Without proper
| abstractions code bases grow like crazy. At some point
| they are just a huge amount of copy, paste, and slight
| modification. The amount of code often grows
| exponentially. With more lines of code comes more effort
| to maintain it.
|
| After a few years those copy and pasted code pieces
| completely drift apart and create a lot of similar but
| different issues, that need to be addressed one by one.
|
| My approach for designing abstractions is always to make
| them composable (not this enterprise java inheritance
| chaos). To allow escaping them when needed.
| simion314 wrote:
| >most LLMs are great for spitting out some code that
| fulfills 90% of what you asked for. That's sometimes all
| you need. But we all know that the last 10% usually take
| the same amount of effort as the first 90%.
|
| The issue is if you have an LLM write 10k lines of code for
| you, where 100 lines are bugged. Now you need to debug code
| you did not write and find the bugged parts, so you will waste
| a similar amount of time. And if you do not catch the bugs in
| time, you think you gained some hours, but you will get upset
| customers because things went wrong because the code is weird.
|
| From my experience you need to work with an LLM and have the
| code done function by function, with your input, checking it
| and calling bullshit when it does stupid things.
| xmprt wrote:
| In my experience using LLMs, the 90% is less about buggy
| code and more about just ignoring 10% of the features
| that you require. So it will write code that's mostly
| correct in 100-1000 lines of code (not buggy) but then no
| matter how hard you try, it won't get the remaining 10%
| right and in the process, it will mess up parts of the
| 90% that was already working or end up writing another
| 1000 lines of undecipherable code to get 97% there but
| still never 100% unless you're building something that's
| not that unique.
| andix wrote:
| Exactly my experience. It's always missing something. And
| the generated code often can't be extended to fulfil
| those missing aspects.
| Terr_ wrote:
| > I use LLMs for writing generic, repetitive code, like
| scaffolding. It's OK with boring, generic stuff.
|
| In other words, they're OK in use-cases that programmers need
| to _eliminate_ , because it means there's high demand for a
| reusable library, some new syntax sugar, or an improved API.
| MaxikCZ wrote:
| You can use niche libraries and still benefit from AI.
| Basically anything I want to code is something that I don't
| think exists, and bending an API so that a generic library
| does just what I want means basically writing the same code,
| only immediately instead of contributing it to the library,
| because the "piping" around it magically appears. I bet if it
| keeps improving at the current rate for a decade, the
| occupation "programmer" will morph into "architect".
| hombre_fatal wrote:
| Boilerplate and plumbing code isn't inherently bad, nor do
| you improve the codebase by factoring it down to zero with
| libraries and abstractions.
|
| As I've matured as a developer, I've appreciated certain
| types of boilerplate more and more because it's code that
| shows up in your git diffs. You don't need to chase down
| the code in some version of some library to see how
| something works.
|
| Of course, not all boilerplate is created equally.
| andix wrote:
| Boilerplate is better than bad abstractions. But good
| abstractions are far superior.
| pertymcpert wrote:
| The best abstraction is no abstraction.
| andix wrote:
| No.
|
| Edit: I'm not going to code asm just to be cool.
| xmprt wrote:
| I agree with you but as I've matured as a programmer, I
| feel like it's very hard to get abstractions for
| boilerplate right. Every library I've seen attempt to do
| it has struggled.
| airstrike wrote:
| "A program is like a poem. You cannot write a poem
| without writing it." -- Dijkstra
| mvdtnz wrote:
| I'll take boring scaffolding code over libraries that
| perform undebuggable magic with monkey patches, reflection
| or dynamic code.
| bakugo wrote:
| > I really don't get how so many people can get LLMs to code
| most of their tasks without those issues permanently popping up
|
| They can't, they usually just don't understand the code enough
| to notice the issues immediately.
|
| The perceived quality of LLM answers is inversely proportional
| to the user's understanding of the topic they're asking about.
| mlyle wrote:
| Alternatively, we understand it well, and discard bad
| completions immediately.
|
| When I'm using llama.vim, like 40% of what it writes in a 4-5
| line completion is exactly what I'd write. 20-30% is stuff
| that I wouldn't judge coming from someone else, so I usually
| accept it. And 30-40% is garbage... but I just write a
| comment or a couple of lines, instead, and then reroll the
| dice.
|
| It's like working through a junior engineer, _except_ the
| junior engineer types a new solution instantly. I can get
| down to alternating between mashing tab and writing tricky
| lines.
| andix wrote:
| I don't see the point in AI code completions, they are just
| distracting noise. I'm only doing bigger changes with AI.
|
| Prompt based stuff, like "extract the filtering part from
| all API endpoints in folder abc/xyz. Find a suitable
| abstraction and put this function into filter-
| utils.codefile"
| Lorak_ wrote:
| What tools do you use to perform such tasks?
| andix wrote:
| Aider. I tried Cursor too, but I don't like VS Code and
| not being able to choose the LLM provider. I think there
| are already a lot of tools that perform kind of equally.
| bakugo wrote:
| I've tried using basic AI completions before and found that
| the signal-to-noise ratio wasn't quite good enough for my
| taste in my use cases, but I can totally understand it
| being good enough for others.
|
| My comment was more about just asking questions on how to
| do things you're totally clueless about, in the form of
| "how do I implement X using Y?" for example. I've found
| that, as a general rule, if I can't find the answer to that
| question myself in a minute or two of googling, LLMs can't
| answer it either the majority of the time. This would be
| fine if they said "I don't know how to do that" or "I don't
| believe that's possible" but no, they will confidently make
| up code that doesn't work using interfaces that don't
| exist, which usually ends up wasting my time.
| andix wrote:
| That's more or less my suspicion.
|
| A few months ago I tried to do a small project with
| Langchain. I'm a professional software developer, but it was
| my first Python project. So I tried to use a lot of AI
| generated code.
|
| I was really surprised that AI couldn't do much more than in
| the examples. Whenever I had some things to solve that were
| not supported with the Langchain abstractions it just started
| to hallucinate Langchain methods that didn't exist, instead
| of suggesting some code to actually solve it. I had to figure
| it out by myself, the glue code I had to hack together wasn't
| pretty, but it worked. And I learned not to use Langchain
| ever again :)
| johnisgood wrote:
| Exactly.
| QuantumGood wrote:
| GPT can't even tell what it's done, or give what it knows it
| should. It's an endless "Apologies, here is what you actually
| asked for ..." and again it isn't.
| MortyWaves wrote:
| This was my primary reason for using Claude. Absolutely
| useless experience with chatgpt oftentimes. I've mainly been
| using LLMs to help maintain a ridiculously poorly made
| technical debt dumpster fire, and Claude has been really
| helpful here mainly with repetitive code.
| ninininino wrote:
| A language like Golang tries really hard to only have _one_ way
| to do something, one right way, one way. Just one way. See how
| it was before generics. You just have a for loop. Can't really
| mess up a for loop.
|
| I predict that the variance in success in using LLM for coding
| (even agentic coding with multi-step rather than a simple line
| autosuggest or block autosuggest that many are familar with via
| CoPilot) has much more to do with:
|
| 1) is the language a super simple, hard to foot-gun yourself
| language, with one way to do things that is consistent
|
| AND
|
| 2) do juniors and students tend to use the lang, and how much
| of the online content vis a vis StackOverflow as an example, is
| written by students or juniors or bootcamp folks writing
| incorrect code and posting it online.
|
| What % of the online Golang code is in GH repos like Docker or
| K8s vs. a student posting their buggy Gomoku implementation on
| StackOverflow?
|
| The future of programming language design has AI-
| comprehensibility/AI-hallucination-avoidance as one of the key
| pillars. #1 above is a key aspect.
| andix wrote:
| > A language like Golang tries really hard to only have _one_
| way to do something
|
| Really?
|
| Logging in Go: A Comparison of the Top 9 Libraries
|
| https://betterstack.com/community/guides/logging/best-
| golang...
| evanmoran wrote:
| Since Go added slog many of these have been removed in
| favor of that. Obviously not universal, but compared to npm
| there really are massive numbers of devs just using the
| standard library.
| throwaway920102 wrote:
| I would argue logging options to be more of an exception
| than the rule. Compare the actual language features of Go
| to something like Rust or Javascript and you'll see what I
| mean. As a new developer to the language (especially for
| juniors), you can learn all the features of Go much faster.
| It's made to be picked up quickly and for everyone's code
| to look the same, rather than expressing a personal style.
| Yoric wrote:
| Note that "super-simple", "hard to footgun yourself" and "one
| way to do things that is consistent" are three very different
| things.
|
| I don't think that we yet have one language that is good at
| all that. And yes, I (sometimes) program in Go for a living.
| bugglebeetle wrote:
| > LLMs. I really don't get how so many people can get LLMs to
| code most of their tasks without those issues permanently
| popping up.
|
| You write tests in the same way as you would when checking your
| own work or delegating to anyone else?
| magicalhippo wrote:
| I've used it for some smaller greenfield code with success.
| Like, write an Arduino program that performs a number of super-
| sampled analog readings, and performs a linear regression fit,
| printing the result to the serial port.
|
| That sort of stuff can be very helpful to newbies in the DIY
| electronics world for example.
|
| But for anything involving my $dayjob it's been fairly useless
| beyond writing unit test outlines.
| ianbutler wrote:
| I use it everyday, it has to have good search and good static
| analysis built in.
|
| You also have to be very explanatory with a direct
| communication style.
|
| Our system imports the codebase so it can search and navigate
| plus we feed lsp errors directly to the LLM as development is
| happening.
| andix wrote:
| > Like, write an Arduino program that performs
|
| stuff like that works amazing
|
| > But for anything involving my $dayjob it's been fairly
| useless beyond writing unit test outlines
|
| This was my opinion 3-6 months ago. But I think a lot of
| tools matured enough to already provide a lot of value for
| complex tasks. The difficult part is to learn when and how to
| use AI.
| runeblaze wrote:
| They are good at (combining well-known, codeforces-style)
| algorithms; often times I don't care about the syntax, but I
| need the algorithm. LLMs can write pseudocode for all I care
| but they tend to get syntax correct quite often
| johnisgood wrote:
| I have made large projects using Claude, with success. I know
| what I want to do and how to do it, maybe my prompts were
| right.
| miunau wrote:
| How do you deal with large files? After about a thousand
| lines in a file, it starts to cough for me. Forgets that some
| functions exist and makes up inferior duplicate ones.
| johnisgood wrote:
| I did not experience hallucinations (or only very rarely, if
| at all) when using it for programming. It happened more with
| niche languages (so I provide examples and documentation),
| and with GPT.
|
| Let us say there are 3k lines of RFCs, API docs, documentation
| of niche languages, and examples, plus 2k lines of code
| generated by Claude (iteratively, starting small); then I do
| exceed the limit after a while. In that case I ask it to
| summarize everything in detail, start a new chat, use those 3k
| lines and the recent code, and continue ad infinitum.
| 0x5f3759df-i wrote:
| If possible you need to refactor before getting to that
| point.
|
| Claude has done a good job refactoring, though I've had to
| tell it to give me a refactor plan upfront in case the
| conversation limit gets hit. Then in a new chat I tell it
| which parts of the plan it has already done.
|
| But a larger context/conversation limit is definitely
| needed because it's super easy to fill up.
| mattmanser wrote:
| What do you define as a large project? Like TLOC?
| johnisgood wrote:
| In my case the maximum was ~3k LOC.
| mvdtnz wrote:
| That's not just small, it's utterly miniscule. It's most
| certainly not large.
| johnisgood wrote:
| Depends. 3k is pretty much enough for a fully-featured
| XY.
|
| So no context, and differences of the definition of
| "large".
|
| Perhaps if you come from Java, then yeah.
|
| _shrugs_
| adamgordonbell wrote:
| We at Pulumi started treating some hallucinations like this as
| feature requests.
|
| Sometimes an LLM will hallucinate a flag or option that really
| makes sense - it just doesn't actually exist.
| wrs wrote:
| This sort of hallucination happens to me frequently with AWS
| infrastructure questions. Which is depressing because I can't
| do anything but agree, "yeah, that API is exactly what any sane
| person would want, but AWS didn't do that, which is why I'm
| asking the question".
| miningape wrote:
| Why are you so sure it's what someone sane would want? Maybe
| there are other ways because there are hidden problems and
| edge cases with that procedure. It could contradict the
| fundamental model of the underlying resources but looks
| correct to someone with a cursory understanding.
|
| I'm not saying this is the case, but LLMs are often wrong in
| subtle ways like this.
| kgeist wrote:
| Also, sometimes a flag does exist, but the example places it
| incorrectly, causing the command to reject it. Or, a flag used
| to exist but was removed in the latest versions.
| latexr wrote:
| > Conclusion
|
| > LLMs are really smart most of the time.
|
| No, the conclusion is they're _never_ "smart". All they do is
| regurgitate text which resembles a continuation of what came
| before, and sometimes--but with zero guarantees--that text aligns
| with reality.
| miningape wrote:
| This, thank you. It pisses me off to no end when people pretend
| LLMs are smart. They are nothing but a well-trained random text
| generator.
|
| Seriously, some of these conversations feel like interacting
| with someone who believes casting bones and astrology are
| accurate. Likely because in both cases the belief is a result
| of confirmation bias.
| immibis wrote:
| We don't know what smartness is. What if that's what
| smartness is?
| miningape wrote:
| We might not know what it is, but it's not that. At a bare
| minimum smartness requires abstract reasoning (and no, so-
| called "reasoning" models do not do that - it's a marketing
| trick)
| BrenBarn wrote:
| Similarly, the thing at the end with "I find it endearing". I
| mean, the author feels what he feels, but personally I find
| this LLM behavior disgusting and depressing.
| jonas21 wrote:
| The same as you and me, really.
| IAmNotACellist wrote:
| "Not acceptable. Please upgrade your browser to continue." No, I
| don't think I will.
| hahahacorn wrote:
| Sorry about that. This is a default Rails 8 setting; I've
| removed the blocker.
| leumon wrote:
| He should've tested 4.5. That model hallucinates much less
| than any other model.
| aranw wrote:
| I wonder how easy it would be to influence the big LLMs if a
| particular group of people created enough articles that any
| human reader would recognise as a load of garbage and rubbish
| to ignore, but that an LLM parsing them wouldn't, ruining its
| reasoning and code generation abilities.
| dannygarcia wrote:
| It's very easy. I've done this by accident. One of my side
| projects helps users price the affordability of a particular
| kind of product. When I ask various LLMs "can I afford X" or
| "how much do I need to earn to buy X", my project comes up as a
| source/reference. I currently manually crawl retailers for the
| MSRP so these numbers are usually months out of date!
| joelthelion wrote:
| Hallucinations like this could be a great way to identify missing
| features or confusing parts of your framework. If the LLM invents
| it, maybe it ought to work that way?
| andix wrote:
| I like your thinking :)
| jeanlucas wrote:
| Only if you wanna optimize exclusively for LLM users in this
| generation.
| Baggie wrote:
| The conclusion paragraph was really funny and kinda perfectly
| encapsulates the current state of AI, but as pointed out by
| another comment, we can't even call them smart, just "Ctrl C Ctrl
| V Leeroy Jenkins style"
| jwjohnson314 wrote:
| The interesting thing here to me is that the llm isn't
| 'hallucinating', it's simply regurgitating some data it digested
| during training.
| mvdtnz wrote:
| What's the difference?
| lxe wrote:
| This is incredible, and it's not technically a "hallucination". I
| bet it's relatively easy to find more examples like this...
| something on the internet that's both niche enough, popular
| enough, and wrong, yet was scraped and trained on.
| saurik wrote:
| What I honestly find most interesting about this is the thought
| that hallucinations might lead to the kind of emergent language
| design we see in natural language (which might not be a good
| thing for a computer language, fwiw, but still interesting),
| where people just kind of think "language should work this way
| and if I say it like this people will probably understand me".
___________________________________________________________________
(page generated 2025-03-01 23:00 UTC)