[HN Gopher] A new Google model is nearly perfect on automated ha...
___________________________________________________________________
A new Google model is nearly perfect on automated handwriting
recognition
Author : scrlk
Score : 477 points
Date : 2025-11-11 13:52 UTC (4 days ago)
(HTM) web link (generativehistory.substack.com)
(TXT) w3m dump (generativehistory.substack.com)
| throwup238 wrote:
| I really hope they have because I've also been experimenting with
| LLMs to automate searching through old archival handwritten
| documents. I'm interested in the Conquistadors and their
| extensive accounts of their expeditions, but holy cow reading
| 16th century handwritten Spanish and translating it at the same
| time is a nightmare, requiring a ton of expertise and inside
| field knowledge. It doesn't help that they were often written in
| the field by semi-literate people who misused lots of words. Even
| the simplest accounts require quite a lot of detective work to
| decipher with subtle signals like that pound sign for the sugar
| loaf.
|
| _> Whatever it is, users have reported some truly wild things:
| it codes fully functioning Windows and Apple OS clones, 3D design
| software, Nintendo emulators, and productivity suites from single
| prompts._
|
| This I'm a lot more skeptical of. The linked twitter post just
| looks like something it would replicate via HTML/CSS/JS. Whats
| the kernel look like?
| WhyOhWhyQ wrote:
| "> Whatever it is, users have reported some truly wild things:
| it codes fully functioning Windows and Apple OS clones, 3D
| design software, Nintendo emulators, and productivity suites
| from single prompts."
|
| Wow I'm doing it way wrong. How do I get the good stuff?
| zer00eyz wrote:
| Your not.
|
| I want you to go into the kitchen and bake a cake. Please
| replace all the flour with baking soda. If it comes out
| looking limp and lifeless just decorate it up with extra
| layers of frosting.
|
| You can make something that looks like a cake but would not
| be good to eat.
|
| The cake, sometimes, is a lie. And in this case, so are
| likely most of these results... or they are the actual source
| code of some other project just regurgitated.
| hinkley wrote:
| We got the results back. You are a horrible person. I'm
| serious, that's what it says: "Horrible person."
|
| We weren't even testing for that.
| erulabs wrote:
| Well, what does a neck-bearded old engineer know about
| fashion? He probably - Oh, wait. It's a she. Still, what
| does she know? Oh wait, it says she has a medical degree.
| In fashion! From France!
| joshstrange wrote:
| If you want to listen to the line from Portal 2 it's on
| this page (second line in the section linked): https://th
| eportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...
| fragmede wrote:
| Just because "Die motherfucker die motherfucker die"
| appeared in a song once doesn't mean it's not also death
| threat when someone's pointing a gun at you and saying
| that.
| scubbo wrote:
| ...what?
| fragmede wrote:
| hinkley wrote:
|
| > We got the results back. You are a horrible person. I'm
| serious, that's what it says: "Horrible person."
|
| > We weren't even testing for that.
|
| joshstrange then wrote:
|
| > If you want to listen to the line from Portal 2 it's on
| this page (second line in the section linked): https://th
| eportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...
|
| as if the fact that the words that hinkley wrote are from
| a popular video game excuses the fact that hinkley just
| also called zer00eyz horrible.
| hinkley wrote:
| So if two sentences that make no sense to you sandwich
| one that does, you should totally accept the middle one
| at face value.
|
| K.
| fragmede wrote:
| Yes. You chose to repeat those words in that sequence in
| that place. You could have said anything else in the
| whole wide world, but you chose to use a quote from an
| ancient video game stating that someone was horrible.
| Sorry if I'm being autistic and taking things too
| literally again, working on having social skills was a
| different thread from today.
| joshstrange wrote:
| I think you might be confused or mistaken (or you are
| making a whole different joke).
|
| My 2 comments are linking to different quotes from Portal
| 2, both the original comment
|
| > We got the results back.....
|
| and
|
| > Well, what does a neck-bearded old engineer know about
| fashion?.....
|
| Are from Portal 2 and the first Portal 2 quote is just a
| reference to the parent of that saying:
|
| > The cake, sometimes, is a lie.
|
| (Another Portal reference if that wasn't clear), they
| weren't calling the parent horrible, they were just
| putting in quote they liked from the game that was
| referenced.
|
| That's one reason why I linked the quote, so people would
| understand it was a reference to the game, not the person
| actually saying the parent was horrible. The other reason
| I linked it is just because I like added metadata where
| possible.
| joshstrange wrote:
| Source: Portal 2, you can see the line and listen to it
| here (last one in section): https://theportalwiki.com/wik
| i/GLaDOS_voice_lines_(Portal_2)...
| hinkley wrote:
| I figured it was appropriate given the context.
|
| I'm still amazed that game started as someone's school
| project. Long live the Orange Box!
| chihuahua wrote:
| I'd really like Alexa+ to have the voice of GLaDOS.
| snickerbockers wrote:
| I'm skeptical that they're actually capable of making something
| novel. There are thousands of hobby operating systems and video
| game emulators on github for it to train off of so it's not
| particularly surprising that it can copy somebody else's
| homework.
| flatline wrote:
| I believe they can create a novel instance of a system from a
| sufficient number of relevant references - i.e. implement a
| set of already-known features without (much) code
| duplication. LLMs are certainly capable of this level of
| generalization due to their huge non-relevant reference set.
| Whether they can expand beyond that into something truly
| novel from a feature/functionality standpoint is a whole
| other, and less well-defined, question. I tend to agree that
| they are closed systems relative to their corpus. But then,
| aren't we? I feel like the aperture for true novelty to enter
| is vanishingly small, and cultures put a premium on it vis-a-
| vis the arts, technological innovation, etc. Almost every
| human endeavor is just copying and iterating on prior
| examples.
| imiric wrote:
| Here's a thought experiment: if modern machine learning
| systems existed in the early 20th century, would they have
| been able to produce an equivalent to the theory of
| relativity? How about advance our understanding of the
| universe? Teach us about flight dynamics and take us into
| space? Invent the Turing machine, Von Neumann architecture,
| transistors?
|
| If yes, why aren't we seeing glimpses of such genius today?
| If we've truly invented artificial intelligence, and on our
| way to super and general intelligence, why aren't we seeing
| breakthroughs in all fields of science? Why are state of
| the art applications of this technology based on pattern
| recognition and applied statistics?
|
| Can we explain this by saying that we're only a few years
| into it, and that it's too early to expect fundamental
| breakthroughs? And that by 2027, or 2030, or surely by
| 2040, all of these things will suddenly materialize?
|
| I have my doubts.
| tanseydavid wrote:
| How about "Protein Folding"?
| imiric wrote:
| A great use case for pattern recognition.
| famouswaffles wrote:
| >Here's a thought experiment: if modern machine learning
| systems existed in the early 20th century, would they
| have been able to produce an equivalent to the theory of
| relativity? How about advance our understanding of the
| universe? Teach us about flight dynamics and take us into
| space? Invent the Turing machine, Von Neumann
| architecture, transistors?
|
| Only a small percentage of humanity are/were capable of
| doing any of these. And they tend to be the best of the
| best in their respective fields.
|
| >If yes, why aren't we seeing glimpses of such genius
| today?
|
| Again, most humans can't actually do any of the things
| you just listed. Only our most intelligent can. LLMs are
| great, but they're not (yet?) as capable as our best and
| brightest (and in many ways, lag behind the average
| human) in most respects, so why would you expect such
| genius now ?
| beeflet wrote:
| Were they the best of the best? or were they just at the
| right place and time to be exposed to a novel idea?
|
| I am skeptical of this claim that you need a 140IQ to
| make scientific breakthroughs, because you don't need a
| 140IQ to understand special relativity. It is a matter of
| motivation and exposure to new information. The vast
| majority of the population doesn't benefit from working
| in some niche field of physics in the first place.
|
| Perhaps LLMs will never be at the right place and the
| right time because they are only trained on ideas that
| already exist.
| famouswaffles wrote:
| >Were they the best of the best? or were they just at the
| right place and time to be exposed to a novel idea?
|
| It's not an "or" but an "and". Being at the right place
| and time is a necessary precondition, but it's not
| sufficient. Newton stood on the shoulders of giants like
| Kepler and Galileo, and Einstein built upon the work of
| Maxwell and Lorentz. The key question is, why did they
| see the next step when so many of their brilliant
| contemporaries, who had the exact same information and
| were in similar positions, did not? That's what separates
| the exceptional from the rest.
|
| >I am skeptical of this claim that you need a 140IQ to
| make scientific breakthroughs, because you don't need a
| 140IQ to understand special relativity.
|
| There is a pretty massive gap between understanding a
| revolutionary idea and originating it. It's the
| difference between being the first person to summit
| Everest without a map, and a tourist who takes a
| helicopter to the top to enjoy the view. One requires
| genius and immense effort; the other requires following
| instructions. Today, we have a century of explanations,
| analogies, and refined mathematics that make relativity
| understandable. Einstein had none of that.
| Kim_Bruning wrote:
| It's entirely plausible that sometimes one genius sees
| the answer all alone -I'm sure it happens sometimes- but
| it's also definitely a common theme that many people/ a
| subset of society as a whole may start having similar
| ideas all around the same time. In many cases where a
| breakthrough is attributed to one person, if you look
| more closely you'll often see some sort of team effort or
| societal ground swell.
| imiric wrote:
| > LLMs are great, but they're not (yet?) as capable as
| our best and brightest (and in many ways, lag behind the
| average human) in most respects, so why would you expect
| such genius now ?
|
| I'm not expecting novel scientific theories _today_. What
| I am expecting are signs and hints of such genius.
| Something that points in the direction that all tech CEOs
| are claiming we 're headed in. So far I haven't seen any
| of this yet.
|
| And, I'm sorry, I don't buy the excuse that these tools
| are not "yet" as capable as the best and brightest
| humans. They contain the sum of human knowledge, far more
| than any individual human in history. Are they not
| _intelligent_ , capable of thinking and reasoning? Are we
| not at the verge of superintelligence[1]?
|
| > we have recently built systems that are smarter than
| people in many ways, and are able to significantly
| amplify the output of people using them.
|
| If all this is true, surely we should be seeing
| incredible results produced by this technology. If not by
| itself, then surely by "amplifying" the work of the best
| and brightest humans.
|
| And yet... All we have to show for it are some very good
| applications of pattern matching and statistics, a bunch
| of gamed and misleading benchmarks and leaderboards, a
| whole lot of tech demos, solutions in search of a
| problem, and the very real problem of flooding us with
| even more spam, scams, disinformation, and devaluing
| human work with low-effort garbage.
|
| [1]: https://blog.samaltman.com/the-gentle-singularity
| famouswaffles wrote:
| >I'm not expecting novel scientific theories today. What
| I am expecting are signs and hints of such genius.
|
| Like I said, what exactly would you be expecting to see
| with the capabilities that exist today ? It's not a
| gotcha, it's a genuine question.
|
| >And, I'm sorry, I don't buy the excuse that these tools
| are not "yet" as capable as the best and brightest
| humans.
|
| There's nothing to buy or not buy. They simply aren't.
| They are unable to do a lot of the things these people
| do. You can't slot an LLM in place of most knowledge
| workers and expect everything to be fine and dandy.
| There's no ambiguity on that.
|
| >They contain the sum of human knowledge, far more than
| any individual human in history.
|
| It's not really the total sum of human knowledge but
| let's set that aside. Yeah so ? Einstein, Newton, Von
| Newman. None of these guys were privy to some super
| secret knowledge their contemporaries weren't so it's
| obviously not simply a matter of more knowledge.
|
| >Are they not intelligent, capable of thinking and
| reasoning?
|
| Yeah they are. And so are humans. So were the peers of
| all those guys. So why are only a few able to see the
| next step ? It's not just about knowledge, and
| intelligence lives in degrees/is a gradient.
|
| >If all this is true, surely we should be seeing
| incredible results produced by this technology. If not by
| itself, then surely by "amplifying" the work of the best
| and brightest humans.
|
| Yeah and that exists. Terence Tao has shared a lot of his
| (and his peers) experiences on the matter.
|
| https://mathstodon.xyz/@tao/115306424727150237
|
| https://mathstodon.xyz/@tao/115420236285085121
|
| https://mathstodon.xyz/@tao/115416208975810074
|
| >And yet... All we have to show for it are some very good
| applications of pattern matching and statistics, a bunch
| of gamed and misleading benchmarks and leaderboards, a
| whole lot of tech demos, solutions in search of a
| problem, and the very real problem of flooding us with
| even more spam, scams, disinformation, and devaluing
| human work with low-effort garbage.
|
| Well it's a good thing that's not true then
| imiric wrote:
| > Like I said, what exactly would you be expecting to see
| with the capabilities that exist today ?
|
| And like I said, "signs and hints" of superhuman
| intelligence. I don't know what that looks like since I'm
| merely human, but I sure know that I haven't seen it yet.
|
| > There's nothing to buy or not buy. They simply aren't.
| They are unable to do a lot of the things these people
| do.
|
| This claim is directly opposed to claims by Sam Altman
| and his cohort, which I'll repeat:
|
| > we have recently built systems that are smarter than
| people in many ways, and are able to significantly
| amplify the output of people using them.
|
| So which is it? If they're "smarter than people in many
| ways", where is the product of that superhuman
| intelligence? If they're able to "significantly amplify
| the output of people using them", then all of humanity
| should be empowered to produce incredible results that
| were previously only achievable by a limited number of
| people. In hands of the best and brightest humans, it
| should empower them to produce results previously
| unreachable by humanity.
|
| Yet all positive applications of this technology show
| that it excels at finding and producing data patterns,
| and nothing more than that. Those experience reports by
| Terence Tao are prime examples of this. The system was
| fed a lot of contextual information, and after being
| coaxed by highly intelligent humans, was able to find and
| produce patterns that were difficult to see by humans.
| This is hardly a showcase of intelligence that you and
| others think it is. Including those highly intelligent
| humans, some of whom have a lot to gain from pushing this
| narrative.
|
| We have seen similar reports by programmers as well[1].
| Yet I'm continually amazed that these highly intelligent
| people are surprised that a pattern finding and producing
| system was able to successfully find and produce useful
| patterns, and then interpret that as a showcase of
| intelligence. So much so that I start to feel suspicious
| about the intentions and biases of those people.
|
| To be clear: I'm not saying that these systems can't be
| very useful in the right hands, and potentially
| revolutionize many industries. Ultimately many real-world
| problems can be modeled as statistical problems where a
| pattern recognition system can excel. What I am saying is
| that there's a very large gap from the utility of such
| tools, and the extraordinary claims that they have
| intelligence, let alone superhuman and general
| intelligence. So far I have seen no evidence of the
| latter, despite of the overwhelming marketing euphoria
| we're going through.
|
| > Well it's a good thing that's not true then
|
| In the world outside of the "AI" tech bubble, that is
| very much the reality.
|
| [1]: https://news.ycombinator.com/item?id=45784179
| lelanthran wrote:
| > Only a small percentage of humanity are/were capable of
| doing any of these. And they tend to be the best of the
| best in their respective fields.
|
| Sure, agreed, but the difference between a small
| percentage and zero percentage is infinite.
| gf000 wrote:
| > Only a small percentage of humanity are/were capable of
| doing any of these. And they tend to be the best of the
| best in their respective fields.
|
| A definite, absolute and unquestionable no, and a small,
| but real chance is absolutely different categories.
|
| You may wait for a bunch of rocks to sprout forever, but
| I would put my money on a bunch of random seeds, even if
| I don't know how they were kept.
| beeflet wrote:
| Almost all of the work in making a new operating system or
| a gameboy emulator or something is in characterizing the
| problem space and defining the solution. How do you know
| what such and such instruction does? What is the ideal way
| to handle this memory structure here? You know, knowledge
| you gain from spending time tracking down a specific bug or
| optimizing a subroutine.
|
| When I create something, it's an exploratory process. I
| don't just guess what I am going to do based on my previous
| step and hope it comes out good on the first try. Let's say
| I decide to make a car with 5 wheels. I would go through
| several chassis designs, different engine configurations
| until I eventually had something that works well. Maybe
| some are too weak, some too expensive, some are too
| complicated. Maybe some prototypes get to the physical
| testing stage while others don't. Finally, I publish this
| design for other people to work on.
|
| If you ask the LLM to work on a novel concept it hasn't
| been trained on, it will usually spit out some nonsense
| that either doesn't work or works poorly, or it will refuse
| to provide a specific enough solution. If it has been
| trained on previous work, it will spit out something that
| looks similar to the solved problem in its training set.
|
| These AI systems don't undergo the process of trial and
| error that suggests it is creating something novel. Its
| process of creation is not reactive with the environment.
| It is just cribbing off of extant solutions it's been
| trained on.
| vidarh wrote:
| I'm literally watching Claude Code "undergo the process
| of trial and error" in another window right now.
| jstummbillig wrote:
| I remain confused but still somewhat interested as to a
| definition of "novel", given how often this idea is wielded
| in the AI context. How is everyone so good at identifying
| "novel"?
|
| For example, I can't wrap my head around how a) a human could
| come up with a piece of writing that _inarguably_ reads
| "novel" writing, while b) an AI could be guaranteed to _not_
| be able to do the same, under the same standard.
| testaccount28 wrote:
| why would you admit on the internet that you fail the
| reverse turing test?
| fragmede wrote:
| Because not everyone here has a raging ego and no
| humility?
| CamperBob2 wrote:
| You have no idea if you're talking to an LLM or a human,
| yourself, so ... uh, wait, neither do I.
| greygoo222 wrote:
| Because I'm an LLM and you are too
| mikestorrent wrote:
| Didn't some fake AI country song just get on the top 100?
| How novel is novel? A lot of human artists aren't
| producing anything _novel_.
| magicalist wrote:
| > _Didn 't some fake AI country song just get on the top
| 100?_
|
| No
|
| Edit: to be less snarky, it topped the Billboard Country
| Digital Song Sales Chart, which is a measure of sales of
| the individual song, not streaming listens. It's
| estimated it takes a few thousand sales to top that
| particular chart and it's widely believed to be commonly
| manipulated by coordinated purchases.
| terminalshort wrote:
| It was a real AI country song, not a fake one, but yes.
| snickerbockers wrote:
| Generally novel either refers to something that is new, or
| a certain type of literature. If the AI is generating
| something functionally equivalent to a program in its
| training set (in this case, dozens or even hundreds of such
| programs) then it by definition cannot be novel.
| brulard wrote:
| This is quite a narrow view of how the generation works.
| AI can extrapolate from the training set and explore new
| directions. It's not just cutting pieces and gluing
| together.
| beeflet wrote:
| In practice, I find the ability for this new wave of AI
| to extrapolate very limited.
| fragmede wrote:
| Do you have any concrete examples you'd care to share?
| While this new wave of AI doesn't have unlimited powers
| of extrapolation, the post we're commenting on is
| asserting that this latest AI from Google was able to
| extrapolate solutions to two of AI's oldest problems,
| which would seem to contradict an assertion of "very
| limited".
| snickerbockers wrote:
| uhhh can it? I've certainly not seen any evidence of an
| AI generating something not based on its training set.
| It's certainly smart enough to shuffle code around and
| make superficial changes, and that's pretty impressive in
| its own way but not particularly useful unless your only
| goal is to just launder somebody else's code to get
| around a licensing problem (and even then it's
| questionable if that's a derived work or not).
|
| Honest question: if AI is actually capable of exploring
| new directions why does it have to train on what is
| effectively the sum total of all human knowledge?
| Shouldn't it be able to take in some basic concepts
| (language parsing, logic, etc) and bootstrap its way into
| new discoveries (not necessarily completely new but
| independently derived) from there? Nobody learns the way
| an LLM does.
|
| ChatGPT, to the extent that it is comparable to human
| cognition, is undoubtedly the most well-read person in
| all of history. When I want to learn something I look it
| up online or in the public library but I don't have to
| read the entire library to understand a concept.
| BirAdam wrote:
| You didn't have to read the whole library because your
| brain has been absorbing knowledge from multiple inputs
| your entire life. AI systems are trying to temporally
| compress a lifetime into the time of training. Then,
| given that these systems have effectively a single input
| method of streams of bits, they need immense amounts of
| it to be knowledgeable at all.
| BobbyTables2 wrote:
| You have to realize AI is trained the same way one would
| train an auto-completer.
|
| Theres no cognition. It's not taught language, grammar,
| etc. none of that!
|
| It's only seen a huge amount of text that allows it to
| recognize answers to questions. Unfortunately, it appears
| to work so people see it as the equivalent to sci-fi
| movie AI.
|
| It's really just a search engine.
| snickerbockers wrote:
| I agree and that's the case I'm trying to make. The
| machine-learning community expects us to believe that it
| is somehow comparable to human cognition, yet the way it
| learns is inherently inhuman. If an LLM was in any way
| similar to a human I would expect that, like a human, it
| might require a little bit of guidance as it learns but
| ultimately it would be capable of understanding concepts
| well enough that it doesn't need to have memorized every
| book in the library just to perform simple tasks.
|
| In fact, I would expect it to be able to reproduce past
| human discoveries it hasn't even been exposed to, and if
| the AI is actually capable of this then it should be
| possible for them to set up a controlled experiment
| wherein it is given a limited "education" and must
| discover something already known to the researchers but
| not the machine. That nobody has done this tells me that
| either they have low confidence in the AI despite their
| bravado, or that they already have tried it and the
| machine failed.
| ezst wrote:
| > The machine-learning community
|
| Is it? I only see a few individuals, VCs, and tech giants
| overblowing LLMs capabilities (and still puzzled as to
| how the latter dragged themselves into a race to the
| bottom through it). I don't believe the academic field
| really is that impressed with LLMs.
| throwaway173738 wrote:
| There's a third possible reason which is that they're
| taking it as a given that the machine is "intelligent" as
| a sales tactic, and they're not academic enough to want
| to test anything they believe.
| ninetyninenine wrote:
| no it's not I work on AI and what these things do are
| much much more then a search engine or an autocomplete.
| If an autocomplete passed the turing test you'd dismiss
| it because it's still an autocomplete.
|
| The characterization you are regurgitating here is from
| laymen who do not understand AI. You are not just mildly
| wrong but wildly uninformed.
| MangoToupe wrote:
| To be fair, it's not clear _human_ intelligence is much
| more than search or autocomplete. The only thing that 's
| clear here is that LLMs can't reproduce it.
| ninetyninenine wrote:
| Yes but colloquially this characterization you see used
| by laymen is deliberately used to deride AI and dismiss
| it. It is not honest about the on the ground progress AI
| has made and it's not intellectual honest about the
| capabilities and weaknesses of Ai.
| MangoToupe wrote:
| I disagree. The actual capabilities of LLMs remain
| unclear, and there's a great deal of reasons to be
| suspicious of anyone whose paycheck relies on pimping
| them.
| ninetyninenine wrote:
| The capabilities of LLMs are unclear but it is clear that
| they are not just search engines or autocompletes or
| stochastic parrots.
|
| You can disagree. But this is not an opinion. You are
| factually wrong if you disagree. And by that I mean you
| don't know what you're talking about and you are
| completely misinformed and lack knowledge.
|
| The long term outcome if I'm right is that AI abilities
| continue to grow and it basically destroys my career and
| yours completely. I stand not to benefit from this
| reality and I state it because it is reality. LLMs
| improve every month. It's already to the point of where
| if you're not vibe coding you're behind.
| versteegen wrote:
| Well, I also work on AI, and I completely agree with you.
| But I've reached the point of thinking it's hopeless to
| argue with people about this: It seems that as LLMs
| become ever better people aren't going to change their
| opinions, as I had expected. If you don't have good
| awareness of how human cognition actually works, then
| it's not evidently contradictory to think that even a
| superintelligent LLM trained on all human knowledge is
| _just_ pattern matching and that humans _are not_.
| Creativity, understanding, originality, intent, etc, can
| all be placed into a largely self-consistent framework of
| human specialness.
| fragmede wrote:
| Isn't that what's going on with synthetic data? The LLM
| is trained, then is used to generate data that gets put
| into the training set, and then gets further trained on
| that generated data?
| ninetyninenine wrote:
| >I've certainly not seen any evidence of an AI generating
| something not based on its training set.
|
| There is plenty of evidence for this. You have to be
| blind not to realize this. Just ask the AI to generate
| something not in it's training set.
| gf000 wrote:
| Like the seahorse emoji?
| kazinator wrote:
| Positively not. It is pure interpolation and not
| extrapolation. The training set is vast and supports an
| even vaster set of possible traversal paths; but they are
| all interpolative.
|
| Same with diffusion and everything else. It is not
| extrapolation that you can transfer the style of Van Gogh
| onto a photographl it is interpolation.
|
| Extrapolation might be something like inventing a style:
| how did Van Gogh do that?
|
| And, sure, the thing can invent a new style---as a mashup
| of existing styles. Give me a Picasso-like take on Van
| Gogh and apply it to this image ...
|
| Maybe the original thing there is the _idea_ of doing
| that; but that came from me! The execution of it is just
| interpolation.
| BoorishBears wrote:
| This is knock against you _at all_ , but in a naive
| attempt to spare someone else some time: remember that
| based on this definition it is impossible for an LLM to
| do novel things _and more importantly_ , you're not going
| to change how this person defines a concept as integral
| to one's being as novelty.
|
| I personally think this is a bit tautological of a
| definition, but if you hold it, then yes LLMs are not
| capable of anything novel.
| kazinator wrote:
| That is not strictly true, because being able to transfer
| the style of Van Gogh onto an arbitrary photographic
| scene _is_ novel in a sense, but it is interpolative.
|
| Mashups are not purely derivative: the choice of what to
| mash up carries novelty: two (or more) representations
| are mashed together which hitherto have not been.
|
| We cannot deny that something is new.
| regularfry wrote:
| Innovation itself is frequently defined as the novel
| combination of pre-existing components. It's mashups all
| the way down.
| BoorishBears wrote:
| I'm saying their comment is calling that not something
| new.
|
| I don't agree, but by their estimation adding things
| together is still just using existing things.
| Libidinalecon wrote:
| I think you should reverse the question, why would we
| expect LLMs to even have the ability to do novel things?
|
| It is like expecting a DJ remixing tracks to output
| original music. Confusing that the DJ is not actually
| playing the instruments on the recorded music so they
| can't do something new beyond the interpolation. I love
| DJ sets but it wouldn't be fair to the DJ to expect them
| to know how to play the sitar because they open the set
| with a sitar sample interpolated with a kick drum.
| 8note wrote:
| kid koala does jazz solos on a disk of 12 notes, jumping
| the track back and forth to get different notes.
|
| i think that, along with the sitar player are still
| interpolating. the notes are all there on the instrument.
| even without an instrument, its still interpolating. the
| space that music and aound can be in is all well known
| wave math. if you draw a fourier transform view, you
| could see one chart with all 0, and a second with all
| +infinite, and all music and sound is gonna sit somewhere
| between the two.
|
| i dont know that "just interpolation" is all that
| meaningful to whether something is novel or interesting.
| BoorishBears wrote:
| It just depends on how you define novel.
|
| Would you consider the instrumental at 33 seconds a new
| song? https://youtu.be/eJA0wY1e-zU?si=yRrDlUN2tqKpWDCv
| ozgrakkurt wrote:
| This is how people do things as well imo. LLM does the
| same thing on some level but it is just not good enough
| for majority of use cases
| throwaway173738 wrote:
| Calling it "exploring" is anthropomorphising. The machine
| has weights that yield meaningful programs given
| specification-like language. It's a useful phenomenon but
| it may be nothing like what we do.
| grosswait wrote:
| Or it may be remarkably similar to what we do
| taneq wrote:
| OK, but by that definition, how many human software
| developers ever develop something "novel"? Of course, the
| "functionally equivalent" term is doing a lot of heavy
| lifting here: How equivalent? How many differences are
| required to qualify as different? How many similarities
| are required to qualify as similar? Which one overrules
| the other? If I write an app that's identical to Excel in
| every single aspect except that instead of a Microsoft
| Flight Simulator easter egg, there's a different, unique,
| fully playable game that can't be summed up with any
| combination of genre lables, is that 'novel'?
| gf000 wrote:
| I think the importance is _the ability_. Not every human
| have produced (or even can) something novel in their
| life, but there are humans who have time after time.
|
| Meanwhile, depending on how you rate LLM's capabilities,
| no matter how many trials you give it, it may not be
| considered capable of that.
|
| That's a very important distinction.
| QuadmasterXLII wrote:
| A system of humans creates bona fide novel writing. We
| don't know which human is responsible for the novelty in
| homoerotic fanfiction of the Odyssey, but it wasn't a
| lizard. LLMs don't have this system-of-thinkers
| bootstrapping effect yet, or if they do it requires an
| absolutely enormous boost to get going
| kazinator wrote:
| Because we know that the human only read, say, fifty books
| since they were born, and watched a few thousand videos,
| and there is nothing in them which resembles what they
| wrote.
| terminalshort wrote:
| If a LLM had written Linux, people would be saying that it
| isn't novel because it's just based on previous OS's. There
| is no standard here, only bias.
| jofla_net wrote:
| Cept its not made Linux (in the absence of it).
|
| At any point prior to the final output it can garner huge
| starting point bias from ingested reference material.
| This can be up to and including whole solutions to the
| original prompt minus some derivations. This is
| effectively akin to cheating for humans as we cant bring
| notes to the exam. Since we do not have a complete
| picture of where every part of the output comes from we
| are at a loss to explain if it indeed invented it or not.
| The onus is and should be on the applicant to ensure that
| the output wasn't copied (show your work), not on the
| graders to prove that it wasn't copied. No less than what
| would be required if it was a human. Ultimately it boils
| down to what it means to 'know' something, whether a
| photographic memory is, in fact, knowing something, or
| rather derivations based on other messy forms of
| symbolism. It is nevertheless a huge argument as both
| sides have a mountain of bias in either directions.
| jstummbillig wrote:
| > Cept its not made Linux (in the absence of it).
|
| Neither did you (or I). Did you create anything that you
| are certain your peers would recognize as more "novel"
| than anything a LLM could produce?
| snickerbockers wrote:
| >Neither did you (or I).
|
| Not that specifically but I certainly have the capability
| to create my own OS without having to refer to the source
| code of existing operating systems. Literally "creating a
| linux" is a bit on the impossible side because it implies
| compatibility with an existing kernel despite the
| constraints prohibiting me from referring to the source
| of that existing kernel (maybe possible if i had some
| clean-room RE team that would read through the source and
| create a list of requirements without including any
| source).
|
| If we're all on the same page regarding the origins of
| human intelligence (ie, that it does _not_ begin with
| satan tricking adam and eve into eating the fruit of a
| tree they were specifically instructed not to touch) then
| it necessarily follows that any idea or concept was new
| at some point and had to be developed by somebody who
| didn 't already have an entire library of books
| explaining the solution at his disposal.
|
| For the Linux thought-experiment you could maybe argue
| that Linux isn't totally novel since its creator was
| intentionally mimicking behavior of an existing well-
| known operating system (also iirc he had access to the
| minix source) and maybe you could even argue that those
| predecessors stood on the shoulders of their own
| proverbial giants, but if we keep kicking the ball down
| the road eventually we reach a point where somebody had
| an idea which was not in any way inspired by somebody
| else's existing idea.
|
| The argument I want to make is not that humans never
| create derivative or unoriginal works (that obviously
| cannot be true) but that humans have the capability to
| create new things. I'm not convinced that LLMs have that
| same capability; maybe I'm wrong but I'm still waiting to
| see evidence of them discovering something new. As I said
| in another post, this could easily be demonstrated with a
| controlled experiment in which the model is bootstrapped
| with a basic yet intentionally-limited "education" and
| then tasked with discovering something already known to
| the experimenters which was not in its training set.
|
| >Did you create anything that you are certain your peers
| would recognize as more "novel" than anything a LLM could
| produce?
|
| Yes, I have definitely created things without first
| reading every book in the library and memorizing
| thousands of existing functionally-equivalent solutions
| to the same problem. So have you so long as I'm not
| actually debating an LLM right now.
| visarga wrote:
| > For example, I can't wrap my head around how a) a human
| could come up with a piece of writing that inarguably reads
| "novel" writing, while b) an AI could be guaranteed to not
| be able to do the same, under the same standard.
|
| The secret ingredient is the world outside, and past
| experiences from the world, which are unique for each
| human. We stumble onto novelty in the environment. But AI
| can do that too - move 37 AlphaGo is an example, much
| stumbling around leads to discoveries even for AI. The
| environment is the key.
| baq wrote:
| If the model can map an unseen problem to something in its
| latent space, solve it there, map back and deliver an
| ultimately correct solution, is it novel? Genuine question,
| 'novel' doesn't seem to have a universally accepted
| definition here
| gf000 wrote:
| Good question, though I would say that there may be
| different grades of novelty.
|
| One grade might be your example, while something like
| Godel's incompleteness theorems or Einstein's relativity
| could go into a different grade.
| n8cpdx wrote:
| The windows (~2000) kernel itself is on GitHub. Even
| exquisitely documented if AI can read .doc files.
|
| https://github.com/ranni0225/WRK
| sosuke wrote:
| Doing something novel is incredibly difficult through LLM
| work alone. Dreaming, hallucinating, might eventually make
| novel possible but it has to be backed up be rock solid base
| work. We aren't there yet.
|
| The working memory it holds is still extremely small compared
| to what we would need for regular open ended tasks.
|
| Yes there are outliers and I'm not being specific enough but
| I can't type that much right now.
| fragmede wrote:
| Of course they can come up with something novel. They're
| called hallucinations when they do, and that's something that
| can't be in their training data, because it's not
| true/doesn't exist. Of course, when they do come up totally
| novel hallucinations, suddenly being creative is a bad thing
| to be "fixed".
| nestorD wrote:
| Oh! That's a nice use-case and not too far from stuff I have
| been playing with! (happily I do not have to deal with
| handwriting, just bad scans of older newspapers and texts)
|
| I can vouch for the fact that LLMs are great at searching in
| the original language, summarizing key points to let you know
| whether a document might be of interest, then providing you
| with a translation where you need one.
|
| The fun part has been build tools to turn Claude code and Codex
| CLI into capable research assistant for that type of projects.
| throwup238 wrote:
| _> The fun part has been build tools to turn Claude code and
| Codex CLI into capable research assistant for that type of
| projects._
|
| What does that look like? How well does it work?
|
| I ended up writing a research TUI with my own higher level
| orchestration (basically have the thing keep working in a
| loop until a budget has been reached) and document
| extraction.
| nestorD wrote:
| I started with a UI that sounded like it was built along
| the same lines as yours, which had the advantage of letting
| me enforce a pipeline and exhaustivity of search (I don't
| want the 10 most promising documents, I want all of them).
|
| But I realized I was not using it much _because_ it was
| that big and inflexible (plus I keep wanting to stamp out
| all the bugs, which I do not have the time to do on a hobby
| project). So I ended up extracting it into MCPs (equipped
| to do full-text search and download OCR from the various
| databases I care about) and AGENTS.md files (defining
| pipelines, as well as patterns for both searching behavior
| and reporting of results). I also put together a sub-agent
| for translation (cutting away all tools besides reading and
| writing files, and giving it some document-specific
| contextual information).
|
| That lets me use Claude Code and Codex CLI (which,
| anecdotally, I have found to be the better of the two for
| that kind of work; it seems to deal better with longer
| inputs produced by searches) as the driver, telling them
| what I am researching and maybe how I would structure the
| search, then letting them run in the background before
| checking their report and steering the search based on
| that.
|
| It is not perfect (if a search surfaces 300 promising
| documents, it will _not_ check all of them, and it often
| misunderstands things due to lacking further context), but
| I now find myself reaching for it regularly, and I polish
| out problems one at a time. The next goal is to add more
| data sources and to maybe unify things further.
| throwup238 wrote:
| _> It is not perfect (if a search surfaces 300 promising
| documents, it will not check all of them, and it often
| misunderstands things due to lacking further context)_
|
| This has been the biggest problem for me too. I jokingly
| call it the LLM halting problem because it never knows
| the proper time to stop working on something, finishing
| way too fast without going through each item in the list.
| That's why I've been doing my own custom orchestration,
| drip feeding it results with a mix of summarization and
| content extraction to keep the context from different
| documents chained together.
|
| Especially working with unindexed content like colonial
| documents where I'm searching through thousands of pages
| spread (as JPEGs) over hundreds of documents for a single
| one that's relevant to my research, but there are latent
| mentions of a name that ties them all together (like a
| minor member of an expedition giving relevant testimony
| in an unrelated case). It turns into a messy web of named
| entity recognition and a bunch of more classical NLU
| tasks, except done with an LLM because I'm lazy.
| jvreeland wrote:
| I'd love to find more info on this but from what I can find it
| seems to be making webpages that look like those product, and
| seemingly can "run python" or "emulate a game" but writing
| something that, based on all of GitHub, can approximate an
| iPhone or emulator in javscript/css/HTML is very very very
| different than writing an OS.
| kace91 wrote:
| >I'm interested in the Conquistadors and their extensive
| accounts of their expeditions, but holy cow reading 16th
| century handwritten Spanish and translating it at the same time
| is a nightmare, requiring a ton of expertise and inside field
| knowledge
|
| Completely off topic, but out of curiosity, where are you
| reading these documents? As a Spaniard I'm kinda interested.
| throwup238 wrote:
| I use the Portal de Archivos Espanoles [1] for Spanish
| colonial documents. Each country has their own archive but
| the Spanish one has the most content (35 million digitized
| pages)
|
| The hard part is knowing where to look since most of the
| images haven't gone through HRT/OCR or indexing so you have
| to understand Spanish colonial administration and go through
| the collections to find stuff.
|
| [1] https://pares.cultura.gob.es/pares/en/inicio.html
| throwout4110 wrote:
| Want to collab on a database and some clustering and
| analysis? I'm a data scientist at FAIR with an interest in
| antiquarian docs and books
| rmonvfer wrote:
| Spaniard here. Let me know if I can somehow help navigate
| all of that. I'm very interested in history and
| everything related to the 1400-1500 period (although I'm
| not an expert by any definition) and I'd love to see what
| modern technology could do here, specially OCRs and VLMs.
| throwup238 wrote:
| Sadly I'm just an amateur armchair historian (at best) so
| I doubt I'd be of much help. I'm mostly only doing the
| translation for my own edification
| cco wrote:
| You may be surprised (or not?) at how many important
| scientific and historical works are done by armchair
| practitioners.
| vintermann wrote:
| You should maybe reach out to the author of this blog
| post, professor Mark Humphries. Or to the genealogy
| communities, we struggle with handwritten historical
| texts no public AI model can make a dent in, regularly.
| dr_dshiv wrote:
| Hit me up, if you can. I'm focused on neolatin texts from
| the renaissance. Less than 30% of known book editions
| have been scanned and less than 5% translated. And that's
| before even getting to the manuscripts.
|
| https://Ancientwisdomtrust.org
|
| Also working on kids handwriting recognition for
| https://smartpaperapp.com
| SJC_Hacker wrote:
| Do you have six fingers, per chance ?
| ChrisMarshallNY wrote:
| I don't know if the six-fingered man was a Spaniard, but
| Inigo Montoya was...
| Footprint0521 wrote:
| Bro split that up, use LLMs for transcription first, then take
| that and translate it
| smusamashah wrote:
| > Whats the kernel look like?
|
| Those clones are all HTML/CSS, same for game clones made by
| Gemini.
| Aperocky wrote:
| > This I'm a lot more skeptical of. The linked twitter post
| just looks like something it would replicate via HTML/CSS/JS.
| Whats the kernel look like?
|
| Thanks for this, I was almost convinced and about to re-think
| my entire perspective and experience with LLMs.
| viftodi wrote:
| You are right to be skeptical.
|
| There are plenty of so called windows(or other) web 'os'
| clones.
|
| There were a couple of these posted on HN actually this very
| year.
|
| Here is one example I google dthat was also on HN :
| https://news.ycombinator.com/item?id=44088777
|
| This is not an OS as in emulating a kernel in javascript or
| wasm, this is making a web app that looks like the desktop of
| an OS.
|
| I have seen plenty such projects, some mimick windows UI
| entirely, you xan find them via google.
|
| So this was definitely in the training data, and is not as
| impressive as the blog post or the twitter thread make it to
| be.
|
| The scary thing is the replies in the twitter thread have no
| critical thinking at all and are impressed beyond belief, they
| think it coded a whole kernel, os, made an interpeter for it,
| ported games etc.
|
| I think this is the reason why some people are so impressed by
| AI, when you can only judge an app visually or only how you
| intetcat with it and don't have the depth of knowledge to
| understand, for such people it works all the way.land AI seems
| magical beyond comprehension.
|
| But all this is only superficial IMHO.
| krackers wrote:
| Every time a model is about to be released, there are a bunch
| of these hype accounts that spin up. I don't know they get
| paid or they spring up organically to farm engagement. Last
| time there was such hype for a model was "strawberry" (o1)
| then gpt-5, and both turned out to be meaningful improvements
| but nowhere near the hype.
|
| I don't doubt though that new models will be very good at
| frontend webdev. In fact this is explicitly one of the recent
| lmarena tasks so all the labs have probably been optimizing
| for it.
| tyre wrote:
| My guess is that there are insiders who know about the
| models and can't keep their mouths shut. They like being on
| the inside and leaking.
| DrewADesign wrote:
| I'd also bet my car on there being a ton of AI
| product/policy/optics astroturfing/shilling going on,
| here and everywhere else. Social proof is a hell of a
| marketing tool and I see a lot of comments suspiciously
| bullish about mediocre things, or suspiciously aggressive
| towards people that aren't enthused. I don't have any
| direct proof so I could be wrong, but it seems more
| extreme than a iPhone/Android (though I suspect
| deliberate marketing forces there, too,) Ford/Chevy
| brand-based-identity kind of thing, and naive to think
| this tactic is limited to TikTok and Instagram videos.
| The crowd here is so targeted, I wouldn't be surprised if
| a single-digit percentage of the comments are laying down
| plausible comment history facade for marketing use. The
| economics might make it worthwhile for the professional
| manipulators of the world.
| risyachka wrote:
| Its always amusing when "an app like windows xp" considered
| hard or challenging somehow.
|
| Literally the most basic html/css, not sure why it is even
| included in benchmarks.
| ACCount37 wrote:
| Those things are LLMs, with text and language at the core
| of their capabilities. UIs are, notably, not text.
|
| An LLM being able to build up interfaces that look
| recognizably like an UI from a real OS? That sure suggests
| a degree of multimodal understanding.
| cowboy_henk wrote:
| UIs made in the HyperText Markup Language are, in fact,
| text.
| viftodi wrote:
| While it is obviously much easier than creating a real OS,
| some people have created desktop managers web apps, with
| resizeable and movable windows, apps such as terminals,
| nodepads, file explorer etc.
|
| This is still a challenging task and requires lots of work
| to get this far.
| jchw wrote:
| I'm surprised people didn't click through to the tweet.
|
| https://x.com/chetaslua/status/1977936585522847768
|
| > I asked it for windows web os as everyone asked me for it and
| the result is mind blowing , it even has python in terminal and
| we can play games and run code in it
|
| And of course
|
| > 3D design software, Nintendo emulators
|
| No clue what these refer to but to be honest it sounds like
| they've incrementally improved one-shotting capabilities
| mostly. I wouldn't be surprised if Gemini 2.5 Pro could get a
| Gameboy or NES emulator working to boot Tetris or Mario, while
| it is a decent chunk of code to get things going, there's an
| absolute boatload of code on the Internet, and the complexity
| is lower than you might imagine. (I have written a couple of
| toy Gameboy emulators from scratch myself.)
|
| Don't get me wrong, it is pretty cool that a machine can do
| this. A lot of work people do today just isn't that novel and
| if we can find a way to tame AI models to make them trustworthy
| enough for some tasks it's going to be an easy sell to just
| throw AI models at certain problems they excel at. I'm sure
| it's already happening though I think it still mostly isn't
| happening for code at least in part due to the inherent
| difficulty of making AI work effectively in existing large
| codebases.
|
| But I will say that people are a little crazy sometimes. Yes it
| is very fascinating that an LLM, which is essentially an
| extremely fancy token predictor, can _one-shot_ a web app that
| is mostly correct, apparently without any feedback, like being
| able to actually run the application or even see editor errors,
| at least as far as we know. This is genuinely really impressive
| and interesting, and not the aspect that I think anyone seeks
| to downplay. However, consider this: even as relatively simple
| as an NES is compared to even moderately newer machines, to
| make an NES emulator you have to know how an NES works and even
| have strategies for how to emulate it, which don 't necessarily
| follow from just reading specifications or even NES program
| disassembly. The existence of _many_ toy NES emulators and a
| very large amount of documentation for the NES hardware and
| inner workings on the Internet, as well as the 6502, means that
| LLMs have a _lot_ of training data to help them out.
|
| I think that these tasks which extremely well-covered in the
| training data gives people unrealistic expectations. You could
| _probably_ pick a simpler machine that an LLM would do
| significantly worse at, even though a human who knows how to
| write emulation software could definitely do it. Not sure what
| to pick, but let 's say SEGA's VMU units for the Dreamcast -
| very small, simple device, and I reckon there should be
| information about it online, but it's going to be somewhat
| limited. You might think, "But that's not fair. It's unlikely
| to be able to one-shot something like that without mistakes
| with so much less training data on the subject." _Exactly_. In
| the real world, that comes up. Not always, but often. If it
| didn 't, programming would be an incredibly boring job. (For
| some people, it _is_ , and these LLMs will probably be
| disrupting that...) That's not to say that AI models can
| _never_ do things like debug an emulator or even do reverse
| engineering on its own, but it 's increasingly clear that this
| won't emerge from strapping agents on top of transformers
| predicting tokens. But since there is a very large portion of
| work that is not very novel in the world, I can totally
| understand why everyone is trying to squeeze this model as far
| as it goes. Gemini and Claude are shockingly competent.
|
| I believe many of the reasons people scoff at AI are fairly
| valid even if they don't always come from a rational mindset,
| and I try to keep my usage of AI to be relatively tasteful. I
| don't like AI art, and I personally don't like AI code. I find
| the push to put AI in everything incredibly annoying, and I
| worry about the clearly circular AI market, overhyped
| expectations. I dislike the way AI training has ripped up the
| Internet, violated people's trust, and lead to a more closed
| Internet. I dislike that sites like Reddit are capitalizing on
| all of the user-generated content that users submitted which
| made them rich in the first place, just to crap on them in the
| process.
|
| But I think that LLMs are useful, and useful LLMs could
| definitely be created ethically, it's just that the current AI
| race has everyone freaking the fuck out. I continue to explore
| use cases. I find that LLMs have gotten increasingly good at
| analyzing disassembly, though it varies depending on how well-
| covered the machine is in its training data. I've also found
| that LLMs can one-shot useful utilities and do a decent job. I
| had an LLM one-shot a utility to dump the structure of a simple
| common file format so I could debug something... It probably
| only saved me about 15-30 minutes, but still, in that case I
| truly believe it did save me time, as I didn't spend any time
| tweaking the result; it did compile, and it did work correctly.
|
| It's going to be troublesome to truly measure how good AI is.
| If you knew nothing about writing emulators, being able to
| synthesize an NES emulator that can at least boot a game may
| seem unbelievable, and to be sure it is obviously a stunning
| accomplishment from a PoV of scaling up LLMs. But what we're
| seeing is probably more a reflection of very good knowledge
| rather than very good intelligence. If we didn't have much
| written online about the NES or emulators at all, then it would
| be truly world-bending to have an AI model figure out
| everything it needs to know to write one on-the-fly. Humans can
| actually do stuff like that, which we know because humans _had_
| to do stuff like that. Today, I reckon most people rarely get
| the chance to show off that they are capable of novel thought
| _because_ there are so many other humans that had to do novel
| thinking before them. _Being able_ to do novel thinking
| effectively when needed is currently still a big gap between
| humans and AI, among others.
| stOneskull wrote:
| i think google is going to repeat history with gemini.. as in
| chatgpt, grok, etc will be like altavista, lycos, etc
| ninetyninenine wrote:
| I'm skeptical because my entire identity is basically built
| around being a software engineer and thinking my IQ and
| intelligence is higher than other people. If this AI stuff is
| real then it basically destroys my entire identity so I choose
| the most convenient conclusion.
|
| Basically we all know that AI is just a stochastic parrot
| autocomplete. That's all it is. Anyone who doesn't agree with
| me is of lesser intelligence and I feel the need to inform them
| of things that are obvious: AI is not a human, it does not have
| emotions. It just a search engine. Those people who are using
| AI to code and do things that are indistinguishable from human
| reasoning are liars. I choose to focus on what AI gets wrong,
| like hallucinations, while ignoring the things it gets right.
| hju22_-3 wrote:
| > [...] my entire identity is basically built around [...]
| thinking my IQ and intelligence is higher than other people.
|
| Well, there's your first problem.
| vintermann wrote:
| I don't know, that's commendable self-insight, it's true of
| lots and lots of people but there are few who would admit
| it!
| ninetyninenine wrote:
| I am unique. Totally. It is not like HN is flooded with
| cognition or psychology or IQ articles every other hour.
| Not at all. And whenever one shows up, you do not
| immediately get a parade of people diagnosing themselves
| with whatever the headline says. Never happens. You post
| something about slow thinking and suddenly half the
| thread whispers "that is literally me." You post
| something about fast thinking and the other half says
| "finally someone understands my brain." You post
| something about overthinking and everyone shows up with
| "wow I feel so seen." You post something about attention
| and now the entire site has ADHD.
|
| But yes. I am the unique one.
| vintermann wrote:
| Ah, so you were just attempting sarcasm?
| tptacek wrote:
| HN is not in fact flooded with cognition, psychology, and
| IQ articles every other hour.
| ninetyninenine wrote:
| There was more prior to AI but yes I exaggerated it. I
| mean it's obvious right? The title of this page is hacker
| so it must be tech related articles every hour.
|
| But articles on IQ and cognition and psychology are
| extremely common in HN. Enough to be noticeably out of
| place.
| tptacek wrote:
| They are actually not really all that common at all. We
| get 1, maybe 2 in a busy month.
| twoodfin wrote:
| This kind of comment certainly shows that no organic
| stochastic parrots post to hn threads!
| otherdave wrote:
| Where can I find these Conquistador documents? Sounds like
| something I might like to read and explore.
| throwup238 wrote:
| See here: https://news.ycombinator.com/item?id=45933750
| dotancohen wrote:
| My language does not use Latin letters, but they are separate
| letters. Is there a way to train some handwriting recognition
| on my own handwriting in my own language, such that it will be
| effective and useful? I mostly need to recognize text in PDF
| documents, generated by writing on an e-ink tablet with an EMR
| stylus.
| netsharc wrote:
| Author says "It is the most amazing thing I have seen an LLM do,
| and it was unprompted, entirely accidental." and then jumps back
| to the "beginning of the story". Including talking about a trip
| to Canada.
|
| Skip to the section headed "The Ultimate Test" for the resolution
| of the clickbait of "the most amazing thing...". (According to
| him, it correctly interpreted a line in an 18th century merchant
| ledger using maths and logic)
| appreciatorBus wrote:
| The new model may or may not be great at handwriting but I
| found the author's constant repetition about how amazing it was
| irritating enough to stop reading and to wonder if the article
| itself was slop-written.
|
| "users have reported some truly wild things" "the results were
| shocking" "the most amazing thing I have seen an LLM do"
| "exciting and frightening all at once" "the most astounding
| result I have ever seen" "made the hair stand up on the back of
| my neck"
| bitwize wrote:
| You're never gonna believe #6!
| bgwalter wrote:
| No, just another academic with the ominous handle
| @generativehistory that is beguiled by "AI". It is strange that
| others can never reproduce such amazing feats.
| pksebben wrote:
| I don't know if I'd call it an 'amazing feat', but claude had
| me pause for a moment recently.
|
| Some time ago, I'd been working on a framework that involved a
| series of servers (not the only one I've talked to claude
| about) that had to pass messages around in a particular
| fashion. Mostly technical implementation details and occasional
| questions about architecture.
|
| Fast forward a ways, and on a lark I decided to ask in the
| abstract about the best way to structure such an interaction.
| Mark that this was not in the same chat or project and didn't
| have any identifying information about the original, save for
| the structure of the abstraction (in this case, a message bus
| server and some translation and processing services, all
| accessed via client.)
|
| so:
|
| - we were far enough removed that the whole conversation
| pertaining to the original was for sure not in the context
| window
|
| - we only referred to the abstraction (with like a
| A=>B=>C=>B=>A kind of notation and a very brief question)
|
| - most of the work on the original was in claude code
|
| and it knew. In the answer it gave, it mentioned the project by
| name. I can think of only two ways this could have happened:
|
| - they are doing some real fancy tricks to cram your entire
| corpus of chat history into the current context somehow
|
| - the model has access to some kind of fact database _where it
| was keeping an effective enough abstraction to make the
| connection_
|
| I find either one mindblowing for different reasons.
| zahlman wrote:
| Are you sure it isn't just a case of a write-up of the
| project appearing in the training data?
| pksebben wrote:
| was my own project, so i don't see how it could have been.
| Private repo, unfinished, i gave it the name.
| omega3 wrote:
| Perhaps you have the memory feature enabled:
| https://support.claude.com/en/articles/11817273-using-
| claude...
| pksebben wrote:
| I probably do, and this is what I think happened. Mind you,
| it's not magic, but to hold that information with enough
| fidelity to pattern-match the structure of the underlying
| function was something I would find remarkable. It's a leap
| from a lot of the patterns I'm used to.
| pavlov wrote:
| I've seen those A/B choices on Google AI Studio recently, and
| there wasn't a substantial difference between the outputs. It
| felt more like a different random seed for the same model.
|
| Of course it's very possible my use case wasn't terribly
| interesting so it wouldn't reveal model differences, or that it
| was a different A/B test.
| jeffbee wrote:
| For me they've been very similar, except in one case where I
| corrected it and on one side it doubled down on being
| objectively wrong, and on the other side it took my feedback
| and started over with a new line of thinking.
| thatoneengineer wrote:
| https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...
| lproven wrote:
| You beat me to it.
| efitz wrote:
| I haven't seen this new google model but now must try it out.
|
| I will say that other frontier models are starting to surprise me
| with their reasoning/understanding- I really have a hard time
| making (or believing) the argument that they are just predicting
| the next word.
|
| I've been using Claude Code heavily since April; Sonnet 4.5
| frequently surprises me.
|
| Two days ago I told the AI to read all the documentation from my
| 5 projects related to a tool I'm building, and create a wiki,
| focused on audience and task.
|
| I'm hand reviewing the 50 wiki pages it created, but overall it
| did a great job.
|
| I got frustrated about one issue: I have a github issue to create
| a way to integrate with issue trackers (like Jira), but it's
| TODO, and the AI featured on the home page that we had issue
| tracker integration. It created a page for it and everything; I
| figured it was hallucinating.
|
| I went to edit the page and replace it with placeholder text and
| was shocked that the LLM had (unprompted) figured out how to use
| existing features to integrate with issue trackers, and wrote
| sample code for GitHub, Jira and Slack (notifications). That
| truly surprised me.
| energy123 wrote:
| Predicting the next word requires understanding, they're not
| separate things. If you don't know what comes after the next
| word, then you don't know what the next word should be. So the
| task implicitly forces a more long-horizon understanding of the
| future sequence.
| IAmGraydon wrote:
| This is utterly wrong. Predicting the next word requires a
| large sample of data made into a statistical model. It has
| nothing to do with "understanding", which implies it knows
| why rather than what.
| orionsbelt wrote:
| Ilya Sustkever was on a podcast, saying to imagine a
| mystery novel where at the end it says "and the killer is:
| (name)". Saying it's just a statistical model generating
| the next most likely word, how can it do that in this case
| if it doesn't have some understanding of all the clues,
| etc. A specific name is not statistically likely to appear
| shwaj wrote:
| Can current LLMs actually do that, though? What Ilya
| posed was a thought experiment: if it could do that, then
| we would say that it has understanding. But AFAIK that is
| beyond current capabilities.
| krackers wrote:
| Someone should try it and create a new "mysterybench".
| Find all mystery novels written after LLM training
| cutoff, and see how many models unravel the mystery
| IAmGraydon wrote:
| It can't do that without the answer to who did it being
| in the training data. I think the reason people keep
| falling for this illusion is that they can't really
| imagine how vast the training dataset is. In all cases
| where it appears to answer a question like the one you
| posed, it's regurgitating the answer from its training
| data in a way that creates an illusion of using logic to
| answer it.
| CamperBob2 wrote:
| _It can 't do that without the answer to who did it being
| in the training data._
|
| Try it. Write a simple original mystery story, and then
| ask a good model to solve it.
|
| This isn't your father's Chinese Room. It couldn't solve
| original brainteasers and puzzles if it were.
| dyauspitr wrote:
| That's not true, at all.
| IAmGraydon wrote:
| Please...go on.
| squigz wrote:
| This implies understanding of preceding tokens, no? GP
| was saying they have understanding of future tokens.
| nicpottier wrote:
| I once was chatting with an author of books (very much an
| amateur) and he said he enjoyed writing because he liked
| discovering where the story goes. IE, he starts and
| builds characters and creates scenarios for them and at
| some point the story kind of takes over, there is only
| one way a character can act based on what was previously
| written, but it wasn't preordained. That's why he liked
| it, it was a discovery to him.
|
| I'm not saying this is the right way to write a book but
| it is a way some people write at least! And one LLMs seem
| capable of doing. (though isn't a book outline pretty
| much the same as a coding plan and well within their
| wheelhouse?)
| astrange wrote:
| If you're claiming a transformer model is a Markov chain,
| this is easily disprovable by, eg, asking the model why it
| isn't a Markov chain!
|
| But here is a really big one of those if you want it:
| https://arxiv.org/abs/2401.17377
| nl wrote:
| Modern LLMs are post trained for tasks other than next word
| prediction.
|
| They still output words through (except for multi-modal
| LLMs) so that does involve next word generation.
| Workaccount2 wrote:
| "Understanding" is just a trap to get wrapped up in. A word
| with no definition and no test to prove it.
|
| Whether or not the model are "understanding" is ultimately
| immaterial, as their ability to do things is all that
| matters.
| pinnochio wrote:
| If they can't do things that require understanding, it's
| material, bub.
|
| And just because you have no understanding of what
| "understanding" means, doesn't mean nobody does.
| red75prime wrote:
| > doesn't mean nobody does
|
| If it's not a functional understating that allows to
| replicate functionality of understanding, is it the real
| understanding?
| dyauspitr wrote:
| The line between understanding and "large sample of data
| made into a statistical model" is kind of fuzzy.
| HarHarVeryFunny wrote:
| > Predicting the next word requires understanding
|
| If we were talking about humans trying to predict next word,
| that would be true.
|
| There is no reason to suppose than an LLM is doing anything
| other than deep pattern prediction pursuant to, and no better
| than needed for, next word prediction.
| CamperBob2 wrote:
| How'd _you_ do at the International Math Olympiad this
| year?
| HarHarVeryFunny wrote:
| I hear the LLM was able to parrot fragments of the stuff
| it was trained to memorize, and did very well
| CamperBob2 wrote:
| Yeah, that must be it.
| cxvrfr wrote:
| Well being able to extrapolate solutions to "novel"
| mathematical exercises based on a very large sample of
| similar tasks in your dataset seems like a reasonable
| explanation.
|
| Question is how well it would do if it was trained
| without those samples?
| CamperBob2 wrote:
| Gee, I don't know. How would you do at a math competition
| if you weren't trained with math books? Sample problems
| and solutions are not sufficient unless you can genuinely
| apply human-level inductive and deductive reasoning to
| them. If you don't understand that and agree with it, I
| don't see a way forward here.
|
| A more interesting question is, how would you do at a
| math competition if you were taught to read, then left
| alone in your room with a bunch of math books? You
| wouldn't get very far at a competition like IMO,
| calculator or no calculator, unless you happen to be some
| kind of prodigy at the level of von Neumann or Ramanujan.
| HarHarVeryFunny wrote:
| > A more interesting question is, how would you do at a
| math competition if you were taught to read, then left
| alone in your room with a bunch of math books?
|
| But that isn't how an LLM learnt to solve math olympiad
| problems. This isn't a base model just trained on a bunch
| of math books.
|
| The way they get LLMs to be good at specialized things
| like math olympiad problems is to custom train them for
| this using reinforcement learning - they give the LLM
| lots of examples of similar math problems being solved,
| showing all the individual solution steps, and train on
| these, rewarding the model when (due to having selected
| an appropriate sequence of solution steps) it is able
| itself to correctly solve the problem.
|
| So, it's not a matter of the LLM reading a bunch of math
| books and then being expert at math reasoning and problem
| solving, but more along the lines "of monkey see, monkey
| do". The LLM was explicitly shown how to step by step
| solve these problems, then trained extensively until it
| got it and was able to do it itself. It's probably a
| reflection of the self-contained and logical nature of
| math that this works - that the LLM can be trained on one
| group of problems and the generalizations it has learnt
| works on unseen problems.
|
| The dream is to be able to teach LLMs to reason more
| generally, but the reasons this works for math don't
| generally apply, so it's not clear that this math success
| can be used to predict future LLM advances in general
| reasoning.
| CamperBob2 wrote:
| _The dream is to be able to teach LLMs to reason more
| generally, but the reasons this works for math don 't
| generally apply_
|
| Why is that? Any suggestions for further reading that
| justifies this point?
|
| Ultimately, reinforcement learning _is_ still just a
| matter of shoveling in more text. Would RL work on
| humans? Why or why not? How similar is it to what kids
| are exposed to in school?
| HarHarVeryFunny wrote:
| An important difference between reinforcement learning
| (RL) and pre-training is the error feedback that is
| given. For pre-training the error feedback is just next
| token prediction error. For RL you need to have a goal in
| mind (e.g. successfully solving math problems) and the
| training feedback that is given is the RL "reward" - a
| measure of how well the model output achieved the goal.
|
| With RL used for LLMs, it's the whole LLM response that
| is being judged and rewarded (not just the next word), so
| you might give it a math problem and ask it to solve it,
| then when it was finished you take the generated answer
| and check if it is correct or not, and this reward
| feedback is what allows the RL algorithm to learn to do
| better.
|
| There are at least two problems with trying to use RL as
| a way to improve LLM reasoning in the general case.
|
| 1) Unlike math (and also programming) it is not easy to
| automatically check the solution to most general
| reasoning problems. With a math problem asking for a
| numerical answer, you can just check against the known
| answer, or for a programming task you can just check if
| the program compiles and the output is correct. In
| contrast, how do you check the answer to more general
| problems such "Should NATO expand to include Ukraine?" ?!
| If you can't define a reward then you can't use RL.
| People have tried using "LLM as judge" to provide rewards
| in cases like this (give the LLM response to another LLM,
| and ask it if it thinks the goal was met), but apparently
| this does not work very well.
|
| 2) Even if you could provide rewards for more general
| reasoning problems, and therefore were able to use RL to
| train the LLM to generate good solutions for those
| training examples, this is not very useful unless the
| reasoning it has learnt generalizes to other problems it
| was not trained on. In narrow logical domains like math
| and programming this evidentially works very well, but it
| is far from clear how learning to reason about NATO will
| help with reasoning about cooking or cutting your cat's
| nails, and the general solution to reasoning can't be
| "we'll just train it on every possible question anyone
| might ever ask"!
|
| I don't have any particular reading suggestions, but
| these are widely accepted limiting factors to using RL
| for LLM reasoning.
|
| I don't think RL for humans would work too well, and it's
| not generally the way we learn, or kids are mostly taught
| in school. We mostly learn or are taught individual
| skills and when they can be used, then practice and learn
| how to combine and apply them. The closest to using RL in
| school would be if the only feedback an English teacher
| gave you on your writing assignments was a letter grade,
| without any commentary, and you had to figure out what
| you needed to improve!
| cxvrfr wrote:
| How would you do multiplying 10000 pairs of 100 digit
| numbers in a limited amount of time? We don't
| anthropomorphize calculators though...
| CamperBob2 wrote:
| One problem for your argument is that transformer
| networks are not, and weren't meant to be, calculators.
| Their raw numerical calculating abilities are shaky when
| you don't let them use external tools, but they are also
| entirely emergent. It turns out that language doesn't
| just describe logic, it encodes it. Nobody expected that.
|
| To see another problem with your argument, find someone
| with weak reasoning abilities who is willing to be a test
| subject. Give them a calculator -- hell, give them a copy
| of Mathematica -- and send them to IMO, and see how that
| works out for them.
| famouswaffles wrote:
| There is plenty reason. This article is just one example of
| many. People bring it up because LLMs routinely do things
| we call reasoning when we see them manifest in other
| humans. Brushing it off as 'deep pattern prediction' is
| genuinely meaningless. Nobody who uses that phrase in that
| way can actually explain what they are talking about in a
| way that can be falsified. It's just vibes. It's an
| unfalsifiable conversation-stopper, not a real explanation.
| You can replace "pattern matching" with "magic" and the
| argument is identical because the phrase isn't actually
| doing anything.
|
| A - A force is required to lift a ball
|
| B - I see Human-N lifting a ball
|
| C - Obviously, Human-N cannot produce forces
|
| D - Forces are not required to lift a ball
|
| Well sir, why are you so sure Human-N cannot produce
| forces? How is she lifting the ball ? Well Of course
| Human-N is just using statistics magic.
| energy123 wrote:
| Anything can be euphemized. Human intelligence is atoms
| moving around the brain. General relativity is writing on
| a piece of paper.
| famouswaffles wrote:
| If you want to say human and LLM intelligence are both
| 'deep pattern prediction' then sure, but mostly and
| certainly in the case I was replying to, people often
| just use it as a means to make an imaginary unfalsifiable
| distinction between what LLMs do and what the super
| special humans do.
| HarHarVeryFunny wrote:
| You seem to be ignoring two things...
|
| First, the obvious one, is that LLMs are trained to auto-
| regressively predict human training samples (i.e.
| essentially to copy them, without overfitting), so OF
| COURSE they are going to sound like the training set -
| intelligent, reasoning, understanding, etc, etc. The
| mistake is to anthropomorphize the model because it
| sounds human, and associate these attributes of
| understanding etc to the model itself rather than just
| reflecting the mental abilities of the humans who wrote
| the training data.
|
| The second point is perhaps a bit more subtle, and is
| about the nature of understanding and the differences
| between what an LLM is predicting and what the human
| cortex - also a prediction machine - is predicting...
|
| When humans predict, what we're predicting is something
| external to ourself - the real world. We observe, over
| time we see regularities, and from this predict we'll
| continue to see those regularities. Our predictions
| include our own actions as an input - how will the
| external world react to our actions, and therefore we
| learn how to act.
|
| Understanding something means being able to predict how
| it will behave, both left alone, and in interaction with
| other objects/agents, including ourselves. Being able to
| predict what something will do if you poke it is
| essentially what it means to understand it.
|
| What an LLM is predicting is not the external world and
| how it reacts to the LLMs actions, since it is auto-
| regressively trained - it is only predicting a
| continuation of it's own output (actions) based on it's
| own immediately preceding output (actions)! The LLM
| therefore itself understands nothing since it has no
| grounding for what it is "talking about", and how the
| external world behaves in reaction to it's own actions.
|
| The LLMs appearance of "understanding" comes solely from
| the fact that it is mimicking the training data, which
| was generated by humans who do have agency in the world
| and understanding of it, but the LLM has no visibility
| into the generative process of the human mind - only to
| the artifacts (words) it produces, so the LLM is doomed
| to operate in a world of words where all it might be
| considered to "understand" is it's own auto-regressive
| generative process.
| famouswaffles wrote:
| You're restating two claims that sound intuitive but
| don't actually hold up when examined:
|
| 1. "LLMs just mimic the training set, so sounding like
| they understand doesn't imply understanding."
|
| This is the magic argument reskinned. Transformers aren't
| copying strings, they're constructing latent
| representations that capture relationships, abstractions,
| and causal structure because doing so reduces loss. We
| know this not by philosophy, but because mechanistic
| interpretability has repeatedly uncovered internal
| circuits representing world states, physics, game
| dynamics, logic operators, and agent modeling. "It's just
| next-token prediction" does not prevent any of that from
| occurring. When an LLM performs multi-step reasoning,
| corrects its own mistakes, or solves novel problems not
| seen in training, calling the behavior "mimicry" explains
| nothing. It's essentially saying "the model can do it,
| but not for the reasons we'd accept," without specifying
| what evidence would ever convince you otherwise.
| Imaginary distinction.
|
| 2. "Humans predict the world, but LLMs only predict text,
| so humans understand but LLMs don't."
|
| This is a distinction without the force you think it has.
| Humans also learn from sensory streams over which they
| have no privileged insight into the generative process.
| Humans do not know the "real world"; they learn patterns
| in their sensory data. The fact that the data stream for
| LLMs consists of text rather than photons doesn't negate
| the emergence of internal models. An internal model of
| how text-described worlds behave is still a model of the
| world.
|
| If your standard for "understanding" is "being able to
| successfully predict consequences within some domain,"
| then LLMs meet that standard, just in the domains they
| were trained on, and today's state of the art is trained
| on more than just text.
|
| You conclude that "therefore the LLM understands
| nothing." But that's an all-or-nothing claim that doesn't
| follow from your premises. A lack of sensorimotor
| grounding limits what kinds of understanding the system
| can acquire; it does not eliminate all possible forms of
| understanding.
|
| Wouldn't the birds that have the ability to navigate from
| the earth's magnetic field soon say humans have no
| understanding of electromagnetism ? They get trained on
| sensorimotor data humans will never be able to train on.
| If you think humans have access to the "real world" then
| think again. They have a tiny, extremely filtered slice
| of it.
|
| Saying "it understands nothing because autoregression" is
| just another unfalsifiable claim dressed as an
| explanation.
| HarHarVeryFunny wrote:
| > This is the magic argument reskinned. Transformers
| aren't copying strings, they're constructing latent
| representations that capture relationships, abstractions,
| and causal structure because doing so reduces loss.
|
| Sure (to the second part), but the latent representations
| aren't the same as a humans. The human's world that they
| have experience with, and therefore representations of,
| is the real word. The LLM's world that they have
| experience with, and therefore representations of, is the
| world of words.
|
| Of course an LLM isn't literally copying - it has learnt
| a sequence of layer-wise next-token
| predictions/generations (copying of partial embeddings to
| next token via induction heads etc), with each layer
| having learnt what patterns in the layer below it needs
| to attend to, to minimize prediction error at that layer.
| You can characterize these patterns (latent
| representations) in various ways, but at the end of the
| day they are derived from the world of words it is
| trained on, and are only going to be as good/abstract as
| next token error minimization allows. These
| patterns/latent representations (the "world model" of the
| LLM if you like) are going to be language-based (incl
| language-based generalizations), not the same as the
| unseen world model of the humans who generated that
| language, whose world model describes something
| completely different - predictions of sensory inputs and
| causal responses.
|
| So, yes, there is plenty of depth and nuance to the
| internal representations of an LLM, but no logical reason
| to think that the "world model" of an LLM is similar to
| the "world model" of a human since they live in different
| worlds, and any "understanding" the LLM itself can be
| considered as having is going to be based on it's own
| world model.
|
| > Saying "it understands nothing because autoregression"
| is just another unfalsifiable claim dressed as an
| explanation.
|
| I disagree. It comes down to how do you define
| understanding. A human understands (correctly predicts)
| how the real world behaves, and the effect it's own
| actions will have on the real world. This is what the
| human is predicting.
|
| What an LLM is predicting is effectively "what will I say
| next" after "the cat sat on the". The human might see a
| cat and based on circumstances and experience of cats
| predict that the cat will sit on the mat. This is because
| the human understands cats. The LLM may predict the next
| word as "mat", but this does not reflect any
| understanding of cats - it is just a statistical word
| prediction based on the word sequences it was trained on,
| notwithstanding that this prediction is based on the LLMs
| world-of-words-model.
| famouswaffles wrote:
| >So, yes, there is plenty of depth and nuance to the
| internal representations of an LLM, but no logical reason
| to think that the "world model" of an LLM is similar to
| the "world model" of a human since they live in different
| worlds, and any "understanding" the LLM itself can be
| considered as having is going to be based on it's own
| world model.
|
| So LLMs and Humans are different and have different
| sensory inputs. So what ? This is all animals. You think
| dolphins and orcas are not intelligent and don't
| understand things ?
|
| >What an LLM is predicting is effectively "what will I
| say next" after "the cat sat on the". The human might see
| a cat and based on circumstances and experience of cats
| predict that the cat will sit on the mat.
|
| Genuinely don't understand how you can actually believe
| this. A human who predicts mat does so because of the
| popular phrase. That's it. There is no reason to predict
| it over the numerous things cats regularly sit on, often
| much more so the mats (if you even have one). It's not
| because of any super special understanding of cats. You
| are doing the same thing the LLM is doing here.
| HarHarVeryFunny wrote:
| > You think dolphins and orcas are not intelligent and
| don't understand things ?
|
| Not sure where you got that non-secitur from ...
|
| I would expect most animal intelligence (incl. humans) to
| be very similar, since their brains are very similar.
|
| Orcas are animals.
|
| LLMs are not animals.
| famouswaffles wrote:
| Orca and human brains are similar, in the sense we have a
| common ancestor if you look back far enough, but they are
| still very different and focus on entirely different
| slices of reality and input than humans will ever do.
| It's not something you can brush off if you really
| believe in input supremacy so much.
|
| From the orca's perspective, many of the things we say we
| understand are similarly '2nd hand hearsay'.
| HarHarVeryFunny wrote:
| Regarding cats on mats ...
|
| If you ask a human to complete the phrase "the cat sat on
| the", they will probably answer "mat". This is
| memorization, not understanding. The LLM can do this too.
|
| If you just input "the cat sat on the" to an LLM, it will
| also likely just answer "mat" since this is what LLMs do
| - they are next-word input continuers.
|
| If you said "the sat sat on the" to a human, they would
| probably respond "huh?" or "who the hell knows!", since
| the human understands that cats are fickle creatures and
| that partial sentences are not the conversational norm.
|
| If you ask an LLM to explain it's understanding of cats,
| it will happily reply, but the output will not be it's
| own understanding of cats - it will be parroting some
| human opinion(s) it got from the training set. It has no
| first hand understanding, only 2nd hand heresay.
| famouswaffles wrote:
| >If you said "the sat sat on the" to a human, they would
| probably respond "huh?" or "who the hell knows!", since
| the human understands that cats are fickle creatures and
| that partial sentences are not the conversational norm.
|
| I'm not sure what you're getting at here ? You think LLMs
| don't similarly answer 'What are you trying to say?'.
| Sometimes I wonder if the people who propose these gotcha
| questions ever bother to actually test them on said LLMs.
|
| >If you ask an LLM to explain it's understanding of cats,
| it will happily reply, but the output will not be it's
| own understanding of cats - it will be parroting some
| human opinion(s) it got from the training set. It has no
| first hand understanding, only 2nd hand heresay.
|
| Again, you're not making the distinction you think you
| are. Understanding from '2nd hand heresay' is still
| understanding. The vast majority of what humans learn in
| school is such.
| HarHarVeryFunny wrote:
| > Sometimes I wonder if the people who propose these
| gotcha questions ever bother to actually test them on
| said LLMs
|
| Since you asked, yes, Claude responds "mat", then asks if
| I want it to "continue the story".
|
| Of course if you know anything about LLMs you should
| realize that they are just input continuers, and any
| conversational skills comes from post training. To an LLM
| a question is just an input whose human-preferred (as
| well as statistically most likely) continuation is a
| corresponding answer.
|
| I'm not sure why you regard this as a "gotcha" question.
| If you're expressing opinions on LLMs, then table stakes
| should be to have a basic understanding of LLMs - what
| they are internally, how they work, and how they are
| trained, etc. If you find a description of LLMs as input-
| continuers in the least bit contentious then I'm sorry to
| say you completely fail to understand them - this is
| literally what they are trained to do. The only thing
| they are trained to do.
| astrange wrote:
| Predicting the next word is the interface, not the
| implementation.
|
| (It's a pretty constraining interface though - the model
| outputs an entire distribution and then we instantly lose it by
| only choosing one token from it.)
| charcircuit wrote:
| It's trying to maximize a reward function. It's not just
| predicting the next word.
| schiffern wrote:
| >I really have a hard time making (or believing) the argument
| that they are just predicting the next word.
|
| It's true, but by the same token our brain is "just"
| thresholding spike rates.
| conception wrote:
| I will note that 2.5 pro preview... march? Was maybe the best
| model I've used yet. The actual release model was... less. I
| suspect Google found the preview too expensive and optimized it
| down but it was interesting to see there was some hidden
| horsepower there. Google has always been poised to be the AI
| leader/winner - excited to see if this is fluff or the real deal
| or another preview that gets nerfed.
| muixoozie wrote:
| Dunno if you're right, but I'd like to point out that I've been
| reading comments like these about every model since GPT 3. It's
| just starting to seem more likely to me to be a cognitive bias
| than not.
| conception wrote:
| I haven't noticed the effect of things getting worse after a
| release but definitely 2.5's abilities got worse. Or perhaps
| they optimized for something else? But I haven't noticed the
| usual "things got worse after release!" Except for when
| sonnet had a bug for a month and gpt5's autorouter broke.
| muixoozie wrote:
| Yea I don't know. I didn't mean to sound accusatory. I
| might very well be wrong.
| KaoruAoiShiho wrote:
| Sometimes it is just bias but the 2.5 pro had benchmarks
| showing the degradation (plus they changed the name every
| time so it was obviously a different ckpt or model).
| colordrops wrote:
| Why would you assume cognitive bias? Any evidence? These
| things are indeed very expensive to run, and are often run at
| a loss. Wouldn't quantization or other tuning be just as
| reasonable of an answer as cognitive bias? It's not like we
| are talking about reptilian aliens running the whitehouse.
| muixoozie wrote:
| I'm just pointing out a personal observation. Completely
| anecdotal. FWIW, I don't strongly believe this. I have at
| least noticed a selection bias (maybe) in myself too as
| recently as yesterday after GPT 5.1 was released. I asked
| codex to do a simple change (less than 50LOC) and it made a
| unrelated change, an early return statement, breaking a
| very simple state machine that goes from waiting ->
| evaluate -> done. However, I have to remind myself how
| often LLMs make dumb mistakes despite often seeming
| impressive.
| oasisbob wrote:
| That sounds more like availability bias, not selection
| bias.
| oasisbob wrote:
| I noticed the degradation when Gemini stopped being a good
| research tool, and made me want to strangle it on a daily
| basis.
|
| It's incredibly frustrating to have a model start to
| hallucinate sources and be incapable of revisiting its
| behavior.
|
| Couldn't even understand that it was making up non-sensical RFC
| references.
| Legend2440 wrote:
| What an unnecessarily wordy article. It could have been a fifth
| of the length. The actual point is buried under pages and pages
| of fluff and hyperbole.
| johnwheeler wrote:
| Yes, and I agree and it seems like the author has a naive
| experience with LLMs because what he's talking about is kind of
| the bread and butter as far as I'm concerned
| Al-Khwarizmi wrote:
| Indeed. To me, it has long been clear that LLMs do things
| that, at the very least, are indistinguishable from
| reasoning. The already classic examples where you make them
| do world modeling (I put an ice cube into a cup, put the cup
| in a black box, take it into the kitchen, etc... where is the
| ice cube now?) invalidate the stochastic parrot argument.
|
| But many people in the humanities have read the stochastic
| parrot argument, it fits their idea of how they prefer things
| to be, so they take it as true without questioning much.
| Legend2440 wrote:
| My favorite example: 'can <x> cut through <y>?'
|
| You can put just about anything in there for x and y, and
| it will almost always get it right. Can a pair of scissors
| cut through a boeing 747? Can a carrot cut through loose
| snow? A chainsaw cut through a palm leaf? Nailclippers
| through a rubber tire?
|
| Because of combinatorics, the space of ways objects can
| interact is too big to memorize, so it can only answer if
| it has learned something real about materials and their
| properties.
| asimilator wrote:
| Summarize it with an LLM.
| joshdifabio wrote:
| Yes. I left in frustration and came to the comments for a
| summary.
| turnsout wrote:
| So, a Substack article then
| ThrowawayTestr wrote:
| I'd expect nothing less from a historian
| mmaunder wrote:
| The author is far more fascinated with themselves than with AI.
| falcor84 wrote:
| I would just suggest that if you want your comment to be more
| helpful than the article that you're critiquing, you might want
| to actually quote the part which you believe is "The actual
| point".
|
| Otherwise you are likely to have people agreeing with you,
| while they actually had a very different point that they took
| away.
| _giorgio_ wrote:
| I missed the point, please point me to it
| observationist wrote:
| This might just be a handcrafted prompt framework for handwriting
| recognition tied in with reasoning - do a rough pass, make
| assumptions and predictions, check assumptions and predictions,
| if they pass, use the degree of confidence in their passage to
| inform what the other characters might be, and gradually flesh
| out an interpretation of what was intended to be communicated.
|
| If they could get this to occur naturally - with no supporting
| prompts, and only one-shot or one-shot reasoning, then it could
| extend to complex composition generally, which would be cool.
| terminalshort wrote:
| I don't see how this performance could be anything like that.
| There is no way that Google included specialized system prompts
| with anything to do with converting shillings to pounds in
| their model.
| lproven wrote:
| Betteridge's law _surely_ applies.
| kittikitti wrote:
| I much prefer this tone about improvements in AI over the
| doomerism I constantly read. I was waiting for a twist where the
| author changed their minds and suddenly went "this is the devil's
| technology" or "THEY T00K OUR JOBS" but it never happened. Thank
| you for sharing, it felt like breathing for the first time in a
| long time.
| greekrich92 wrote:
| Pretty hyperbolic reaction to what seems like a fairly modest
| improvement
| outside2344 wrote:
| We are probably just a few weeks away from Google completely
| wiping OpenAI out.
| xx_ns wrote:
| Am I missing something here? Colonial merchant ledgers and 18th-
| century accounting practices have been extensively digitized and
| discussed in academic literature. The model has almost certainly
| seen examples where these calculations are broken down or
| explained. It could be interpolating from similar training
| examples rather than "reasoning."
| ceroxylon wrote:
| The author claims that they tried to avoid that: "[. . .] we
| had to choose them carefully and experiment to ensure that
| these documents were not already in the LLM training data (full
| disclosure: we can't know for sure, but we took every
| reasonable precaution)."
| blharr wrote:
| Even if that specific document wasn't in the training data,
| there could be many similar documents from others at the
| time.
| jumploops wrote:
| This is exciting news, as I have some elegantly scribed family
| diaries from the 1800s that I can barely read (:
|
| With that said, the writing here is a bit hyperbolic, as the
| advances seem like standard improvements, rather than a huge leap
| or final solution.
| red75prime wrote:
| Statistics in the article has a low number of samples to make
| definitive conclusion, but expert-level WER looks like a huge
| leap.
| phkahler wrote:
| It's a diffusion model, not autocomplete.
| ghm2199 wrote:
| I just used AI studio for recognizing text from a relative's 60
| day log of food ingested 3 times a day. I think I am using
| models/gemini-flash-latest and it was shockingly good at
| recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet
| (IIRC its 4.5) model.
|
| https://pasteboard.co/euHUz2ERKfHP.png
|
| Its response I have captured here
| https://pasteboard.co/sbC7G9nuD9T9.png is shockingly good. I
| could only spot 2 mistakes. And those that seems to have been the
| ones even I could not read or was very difficult for me to make
| out what the text was.
| ghm2199 wrote:
| I basically fed it all 60 images 5 at a time and made a table
| out of them to correlate sugar levels <-> food and colocate it
| with the person's exercise routines. This is insane.
| neom wrote:
| I've been complaining on hn for some time now that my only real
| test of an LLM is that it can help my poor wife with her
| research, she spends all day every day in small town archives
| pouring over 18th century American historical documents. I
| thought maybe that day had come, I showed her the article and she
| said "good for him I'm still not transcribing important
| historical documents with a chat bot and nor should he" - ha. If
| you wanna play around with some difficult stuff here are some
| images from her work I've posted before:
| https://s.h4x.club/bLuNed45
| Workaccount2 wrote:
| People have had spotty access to this model for brief periods
| (gemini 3 pro) for a few weeks now, but its strongly expected
| to be released next week, and definitely by year end.
| neom wrote:
| Oh I didn't realize this wasn't 2.5 pro (I skimmed, sorry) -
| i also haven't had time to run some of her docs on 5.1 yet, I
| should.
| HDThoreaun wrote:
| It doesnt have to be perfect to be useful. If it does a decent
| job then your wife reviews and edits, that will be much faster
| than doing the whole thing by hand. The only question is if she
| can stay committed to perfection. I dont see the downside of
| trying it unless she's worried about getting lazy.
| neom wrote:
| I raised this point with her, she said there are times it
| would be ambiguous for both her and the model, and she thinks
| it would be dangerous for her to be influenced by it. I'm not
| a professional historical researcher so I'm not sure if her
| concern is valid or not.
| HDThoreaun wrote:
| I think there's a lot of meta thought that deserves to be
| done about where these new tools fit. It is easy to off
| handedly reject change, especially as a subject matter
| expert who can feel they worked so hard to do this and now
| theyre being replaced so the work was for nothing. I really
| dont want to say your wife is wrong, she almost assuredly
| is not. But it is important to have a curious mindset when
| confronted with ideas you may be biased against. Then she
| can rest easy knowing she is doing her best to perfect her
| craft, right? Otherwise she might wake up one day feeling
| like symbolic NLP researchers trying LLMs for the first
| time. Certainly a lot to consider.
| neom wrote:
| I really appreciate your thoughtful reply. I try my best
| to be encouraging and educating without being preachy or
| condescending with my wife on this subject. I read hn, I
| see the posts of folks in, frankly what reads like
| anguish, about having a tool replace their expertise. I
| feel really, sad? about it. It's interesting to be
| confronted with it here (a place I love!) and at home (a
| place I love!) in quite different context. I've also
| never been particularly good at becoming good at
| something, I can't do very much, and genai is really
| exciting for me, I'm both drawn to and have love for
| experts so... This whole thing generally has been keeping
| me up at night a bit, because I feel anguish for the
| anguish.
| fooker wrote:
| As a scientist, I don't think this is valid or useful. It's
| very much a first year PhD line of thought that academia
| stamps out of you.
|
| This is the 'RE' in research, you specifically want to know
| and understand what others think of something by reading
| others' papers. The scientific training slowly, laboriously
| prepares you to reason about something without being too
| influenced by it.
| Huppie wrote:
| While it's of course a good thing to be critical the author did
| provide some more context on the why and how of doing it with
| LLM's on the hard fork podcast today [0]: mostly as a way to
| see how these models _can_ help them with these tasks.
|
| I would recommend listening to their explanation, maybe it'll
| give more insight.
|
| Disclosure: After listening the podcast and looking up and
| reading the article I emailed @dang to suggest it goes into the
| HN second chance pool. I'm glad more people enjoyed it.
|
| [0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-
| ce...
| potsandpans wrote:
| > ...of an LLM is that it can help my poor wife with her
| research, she spends all day every day in small town archives
| pouring over 18th century American historical documents.
|
| > I'm still not transcribing important historical documents
| with a chat bot and nor should he
|
| Doesn't sound like she's interested in technology, or wants
| help.
| mmaunder wrote:
| Substack: When you have nothing to say and all day to say it.
| mattmaroon wrote:
| "This AI did something amazing but first I'm going to put in 72
| paragraphs of details only I care about."
|
| I was thinking as I skimmed this it needs a "jump to recipe"
| button.
| _giorgio_ wrote:
| It was an embarrassing read. I should ask an llm to read it
| since he probably wrote it the same way.
| gcanyon wrote:
| > So that is essentially the ceiling in terms of accuracy.
|
| I think this is mistaken. I remember... ten years ago? When
| speech-to-text models came out that dealt with background noise
| that made the audio sound very much like straight pink noise to
| my ear, but the model was able to transcribe the speech hidden
| within at a reasonable accuracy rate.
|
| So with handwritten text, the only prediction that makes sense to
| me is that we will (potentially) reach a state where the machine
| is at least probably more accurate than humans, although we
| wouldn't be able to confirm it ourselves.
|
| But if multiple independent models, say, Gemini 5 and Claude 7,
| both agree on the result, and a human can only shrug and say,
| "might be," then we're at a point where the machines are probably
| superior at the task.
| regularfry wrote:
| That depends on how good we get at interpretability. If the
| models can not only do the job but also are structured to
| permit an explanation of how they did it, we get the
| confirmation. Or not, if it turns out that the explanation is
| faulty.
| roywiggins wrote:
| My task today for LLMs was "can you tell if this MRI brain scan
| is facing the normal way", and the answer was: no, absolutely
| not. Opus 4.1 succeeds more than chance, but still not nearly
| often enough to be useful. They all cheerfully hallucinate the
| wrong answer, confidently explaining the anatomy they are looking
| for, but wrong. Maybe Gemini 3 will pull it off.
|
| Now, Claude _did_ vibe code a fairly accurate solution to this
| using more traditional techniques. This is very impressive on its
| own but I 'd hoped to be able to just shovel the problem into the
| VLM and be done with it. It's kind of crazy that we have "AIs"
| that can't tell even roughly what the orientation of a brain scan
| is- something a five year old could probably learn to do- but can
| vibe code something using traditional computer vision techniques
| to do it.
|
| I suppose it's not _too_ surprising, a visually impaired
| programmer might find it impossible to do reliably themselves but
| would code up a solution, but still: it 's weird!
| hopelite wrote:
| What is the "normal" way? Is that defined in a technical
| specification? Did you provide the definition/description of
| what you mean by "normal"?
|
| I would not have expected a language model to perform well on
| what sounds like a computer vision problem? Even if it was
| agentic, as you also imply how a five year old could learn how
| to do it, so too an AI system would need to be trained or at
| the very least be provided with a description of what is
| looking at.
|
| Imagine you took an MRI brain scan back in time and showed it
| to a medical Doctor in even the 1950s or maybe 1900. Do you
| think they would know what the normal orientation is, let alone
| what they are looking at?
|
| I am a bit confused and also interested in how people are
| interacting with AI in general, it really seems to have a
| tendency to highlight significant holes in all kinds of human
| epistemological, organizational, and logical structures.
|
| I would suggest maybe you think of it as a kind of child, and
| with that, you would need to provide as much context and exact
| detail about the requested task or information as possible.
| This is what context engineering (are we still calling it
| that?) concerns itself with.
| roywiggins wrote:
| The models absolutely do know what the standard orientation
| is for a scan. They respond extensively about what they're
| looking for and what the correct orientation would be, more
| or less accurately. They are aware.
|
| They then give the wrong answer, hallucinating anatomical
| details in the wrong place, etc. I didn't bother with
| extensive prompting because it doesn't evince any confusion
| on the criteria, it just seems to not understand spatial
| orientations very well, and it seemed unlikely to help.
|
| The thing is that it's very, very simple: an axial slice of a
| brain is basically egg-shaped. You can work out whether it's
| pointing vertically (ie, nose pointing to towards the top of
| the image) or horizontally by looking at it. LLMs will insist
| it's pointing vertically when it isn't. it's an easy task for
| someone with eyes.
|
| Essentially all images an LLM will have seen of brains will
| be in this orientation, which is either a help or a
| hindrance, and I think in this case a hindrance- it's not
| that it's seen lots of brains and doesn't know which are
| correct, it's that it has only ever seen them in the standard
| orientation and it can't see the trees for the forest, so to
| speak.
| chrischen wrote:
| But these models are more like generalists no? Couldn't they
| simply be hooked up to more specialized models and just defer
| to them the way coding agents now use tools to assist?
| roywiggins wrote:
| There would be no point in going via an LLM then, if I had a
| specialist model ready I'd just invoke it on the images
| directly. I don't particularly need or want a chatbot for
| this.
| moritonal wrote:
| That's fairly unfair comparison. Did you include in the prompt
| a basic set of instructions about which way is "correct" and
| what to look for?
| roywiggins wrote:
| I didn't give a detailed explanation to the model, but I
| should have been more clear: they all seemed to know what to
| look for, they wrote explanations of what they were looking
| for, which were generally correct enough. They still got the
| answer wrong, hallucinating the locations of the anatomical
| features they insisted they were looking at.
|
| It's something that you can solve by just treating the brain
| as roughly egg-shaped and working out which way the pointy
| end is, or looking for the very obvious bilateral symmetry.
| You don't really have to know what any of the anatomy
| actually is.
| IanCal wrote:
| Most models don't have good spatial information from the
| images. Gemini models do preprocessing and so are typically
| better for that. It depends a lot on how things get segmented
| though.
| lern_too_spel wrote:
| This might be showing bugs in the training data. It is common
| to augment image data sets with mirroring, which is cheap and
| fast.
| fragmede wrote:
| and then, in a different industry, one that has physical
| factories, there's this obsession about getting really good at
| making the machine that makes the machine (product) being the
| route to success. So it's funny that LLMs being able to write
| programs to do the thing you want is seen as a failure here.
| MagicMoonlight wrote:
| It seems like a leap to assume it has done all sorts of complex
| calculations implicitly.
|
| I looked at the image and immediately noticed that it is written
| as "14 5" in the original text. It doesn't require calculation to
| guess that it might be 14 pounds 5 ounces rather than 145.
| Especially since presumably, that notation was used elsewhere in
| the document.
| elphinstone wrote:
| I read the whole article, but have never tried the model. Looking
| at the input document, I believe the model saw enough of a space
| between the 14 and 5 to simply treat it that way. I saw the space
| too. Impressive, but it's a leap to say it saw 145 then used
| higher order reasoning to correct 145 to 14 and 5.
| Coeur wrote:
| I also read the whole article, and this behaviour that the
| author is most excited about only happened once. For a process
| that inherently has some randomness about it, I feel it's too
| early to bit this excited.
| afro88 wrote:
| Yep. A lot of things looked magical in the GPT-4 days.
| Eventually you realised it did it by chance and more often
| than not gets it wrong
| AaronNewcomer wrote:
| The thinking models (especially OpenAI's o3) still seem to do by
| far the best at this task as they look across the document to see
| how the writer wrote certain letters where the word is more clear
| when it runs into confusing words.
|
| I built a whole product around this:
| https://DocumentTranscribe.com
|
| But I imagine this will keep getting better and that excites me
| since this was largely built for my own research!
| _giorgio_ wrote:
| I find Gemini 2.5 pro, not flash, way better than the chatGPT
| models. I didn't remember testing o3 though. Maybe it's o3 pro
| and it's one of the old costly and thinking models?
| akudha wrote:
| Your demo is very well done, love it!
| barremian wrote:
| > it codes fully functioning Windows and Apple OS clones, 3D
| design software, Nintendo emulators, and productivity suites from
| single prompts
|
| > As is so often the case with AI, that is exciting and
| frightening all at once
|
| > we need to extrapolate from this small example to think more
| broadly: if this holds the models are about to make similar leaps
| in any field where visual precision and skilled reasoning must
| work together required
|
| > this will be a big deal when it's released
|
| > What appears to be happening here is a form of emergent,
| implicit reasoning, the spontaneous combination of perception,
| memory, and logic inside a statistical model
|
| > model's ability to make a correct, contextually grounded
| inference that requires several layers of symbolic reasoning
| suggests that something new may be happening inside these systems
| --an emergent form of abstract reasoning that arises not from
| explicit programming but from scale and complexity itself
|
| Just another post with extreme hyperbolic wording to blow up
| another model release. How many times have we seen such non-
| realistic build up in the past couple of years.
| cheevly wrote:
| Reading HN comments just makes me realize how vastly LLMs exceed
| human intelligence.
| dang wrote:
| " _Please don 't sneer, including at the rest of the
| community._" It's reliably a marker of bad comments and worse
| threads.
|
| If you know more than others do, that's great, but in that case
| please share some of what you know so the rest of us can learn.
| Putting down others only makes this place worse for everyone.
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
|
| https://news.ycombinator.com/newsguidelines.html
| _giorgio_ wrote:
| Gemini 2.5 PRO is already incredibly good in handwritten
| recognition. It makes maybe one small mistake every 3 pages.
|
| It has completely changed the way I work, and it allows me to
| write math and text and then convert it with the Gemini app (or
| with a scanned PDF in the browser). You should really try it.
| sriku wrote:
| Rgd the "14 lb 5 oz" point in the article, the simpler
| explanation than the hypothesis there that it back calculated the
| weight is that there seems to be a space between 14 and 5 - i.e.
| It reads more like "14 5" than "145"?
| sriku wrote:
| Impressive performance, yes but is the article giving more
| credit than due?
| koliber wrote:
| It hasn't met my doctor.
| Grimblewald wrote:
| I dunno man, looks like goodharts law in action to me. That isnt
| to say the models wont be good at what is stated, but it does
| mean it might not signal a general improvement in competence but
| rather a targeted gain with more general deficits rising up in
| untested/ignored areas, some which may or may not be
| catastrophic. I guess we will see but for now Imma keep my hype
| in the box.
| lelanthran wrote:
| > In tabulating the "errors" I saw the most astounding result I
| have ever seen from an LLM, one that made the hair stand up on
| the back of my neck. Reading through the text, I saw that Gemini
| had transcribed a line as "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19
| 1". If you look at the actual document, you'll see that what is
| actually written on that line is the following: "To 1 loff Sugar
| 145 @ 1/4 0 19 1". For those unaware, in the 18th century sugar
| was sold in a hardened, conical form and Mr. Slitt was a
| storekeeper buying sugar in bulk to sell. At first glance, this
| appears to be a hallucinatory error: the model was told to
| transcribe the text exactly as written but it inserted 14 lb 5 oz
| which is not in the document.
|
| I read the whole reasoning of the blog author after that, but I
| still gotta know - how can we tell that this was not a
| hallucination and/or error? There's a 1/3 chance of an error
| being correct (either 1 lb 45, 14 lb 5 or 145 lb), so why is the
| author so sure that this was deliberate?
|
| I feel a good way to test this would be to create an almost
| identical ledger entry, but in a way so that the correct answer
| after reasoning (the way the author thinks the model reasoned)
| has completely different digits.
|
| This way there'd be more confidence that the model itself
| reasoned and did not make an error.
| yomismoaqui wrote:
| I implemented a receipt scanner to Google Sheet using Gemini
| Flash.
|
| The fact that it is "intelligent" it's fine for some things.
|
| For example I created structured output schema that had a field
| "currency" with the 3 letter format (USD, EUR...). So I scanned
| a receipt from some shop in Jakarta and it filled that field
| with IDR (Indonesian Rupiah). It inferred that data because of
| the city name on the receipt.
|
| Would it be better for my use case that it would have returned
| no data for the currency field? Don't think so.
|
| Note: if needed maybe I could have changed the prompt to not
| infer the currency when not explicitly listed on the receipt.
| Someone wrote:
| > Would it be better for my use case that it would have
| returned no data for the currency field? Don't think so.
|
| If there's a decent chance it infers the wrong currency,
| potentially one where the value of each unit is a few units
| of scale larger or smaller than that of IDR, it might be
| better to not infer it.
| otabdeveloper4 wrote:
| > Would it be better for my use case that it would have
| returned no data for the currency field?
|
| Almost certainly yes.
| DangitBobby wrote:
| Except in setups where you always check its work, and the
| effort from the 5% of the time you have to correct the
| currency is vastly outweighed due to effort saved from the
| other 95% of the time. Pretty common situation.
| HarHarVeryFunny wrote:
| Yes, and as the article itself notes, the page image has more
| than just "145" - there's a "u"-like symbol over the 1, which
| the model is either failing to notice, or perhaps is something
| it recognizes from training as indicating pounds.
|
| The article's assumption of how the model ended up
| "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb
| 5oz" seems very speculative. It seems more reasonable to assume
| that a massive frontier model knows something about loaves of
| sugar and their weight range, and in fact Google search's "AI
| overview" of "how heavy is a loaf of sugar" says the common
| size is approximately 14lb.
| wrs wrote:
| There's also a clear extra space between the 4 and 5, so
| figuring out to group it as "not 1 45, nor 145 but 14 5"
| doesn't seem worthy of astonishment.
| drawfloat wrote:
| If I ask a model to transcribe something exactly and it outputs
| an interpretation, that is an error and not a success.
| elzbardico wrote:
| I think the author has become a bit too enthusiastic. "Emerging
| capabilities" become code for, unexpectedly good results that are
| statistical serendipity but that I prefer to infer as some hidden
| capability in a model I can't resist anthropomorphizing.
| dr_dshiv wrote:
| Is anyone aware of any benchmark evaluation for handwriting
| recognition? I have not been able to find one, myself -- which is
| somewhat surprising.
| neves wrote:
| Of it can read ancient handwriting, it will be a revolution for
| historians work.
|
| My wife is a historian and she is trained to recognize old
| handwriting. When we go to museums she"translates" the texts for
| the family
| inshard wrote:
| Could it be guessing via orders of magnitude? Like 145 lb * 1/4
| is high confidence not the answer and 1 lb 45 oz is non standard
| notation as 1lb = 16oz - so it's most likely 14 lb 5 oz
| th0ma5 wrote:
| I always have to ask with OCR in a professional context is which
| digits of the numbers is it allowed to get wrong?
___________________________________________________________________
(page generated 2025-11-15 23:00 UTC)