[HN Gopher] A new Google model is nearly perfect on automated ha...
       ___________________________________________________________________
        
       A new Google model is nearly perfect on automated handwriting
       recognition
        
       Author : scrlk
       Score  : 477 points
       Date   : 2025-11-11 13:52 UTC (4 days ago)
        
 (HTM) web link (generativehistory.substack.com)
 (TXT) w3m dump (generativehistory.substack.com)
        
       | throwup238 wrote:
       | I really hope they have because I've also been experimenting with
       | LLMs to automate searching through old archival handwritten
       | documents. I'm interested in the Conquistadors and their
       | extensive accounts of their expeditions, but holy cow reading
       | 16th century handwritten Spanish and translating it at the same
       | time is a nightmare, requiring a ton of expertise and inside
       | field knowledge. It doesn't help that they were often written in
       | the field by semi-literate people who misused lots of words. Even
       | the simplest accounts require quite a lot of detective work to
       | decipher with subtle signals like that pound sign for the sugar
       | loaf.
       | 
       |  _> Whatever it is, users have reported some truly wild things:
       | it codes fully functioning Windows and Apple OS clones, 3D design
       | software, Nintendo emulators, and productivity suites from single
       | prompts._
       | 
       | This I'm a lot more skeptical of. The linked twitter post just
       | looks like something it would replicate via HTML/CSS/JS. Whats
       | the kernel look like?
        
         | WhyOhWhyQ wrote:
         | "> Whatever it is, users have reported some truly wild things:
         | it codes fully functioning Windows and Apple OS clones, 3D
         | design software, Nintendo emulators, and productivity suites
         | from single prompts."
         | 
         | Wow I'm doing it way wrong. How do I get the good stuff?
        
           | zer00eyz wrote:
           | Your not.
           | 
           | I want you to go into the kitchen and bake a cake. Please
           | replace all the flour with baking soda. If it comes out
           | looking limp and lifeless just decorate it up with extra
           | layers of frosting.
           | 
           | You can make something that looks like a cake but would not
           | be good to eat.
           | 
           | The cake, sometimes, is a lie. And in this case, so are
           | likely most of these results... or they are the actual source
           | code of some other project just regurgitated.
        
             | hinkley wrote:
             | We got the results back. You are a horrible person. I'm
             | serious, that's what it says: "Horrible person."
             | 
             | We weren't even testing for that.
        
               | erulabs wrote:
               | Well, what does a neck-bearded old engineer know about
               | fashion? He probably - Oh, wait. It's a she. Still, what
               | does she know? Oh wait, it says she has a medical degree.
               | In fashion! From France!
        
               | joshstrange wrote:
               | If you want to listen to the line from Portal 2 it's on
               | this page (second line in the section linked): https://th
               | eportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...
        
               | fragmede wrote:
               | Just because "Die motherfucker die motherfucker die"
               | appeared in a song once doesn't mean it's not also death
               | threat when someone's pointing a gun at you and saying
               | that.
        
               | scubbo wrote:
               | ...what?
        
               | fragmede wrote:
               | hinkley wrote:
               | 
               | > We got the results back. You are a horrible person. I'm
               | serious, that's what it says: "Horrible person."
               | 
               | > We weren't even testing for that.
               | 
               | joshstrange then wrote:
               | 
               | > If you want to listen to the line from Portal 2 it's on
               | this page (second line in the section linked): https://th
               | eportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...
               | 
               | as if the fact that the words that hinkley wrote are from
               | a popular video game excuses the fact that hinkley just
               | also called zer00eyz horrible.
        
               | hinkley wrote:
               | So if two sentences that make no sense to you sandwich
               | one that does, you should totally accept the middle one
               | at face value.
               | 
               | K.
        
               | fragmede wrote:
               | Yes. You chose to repeat those words in that sequence in
               | that place. You could have said anything else in the
               | whole wide world, but you chose to use a quote from an
               | ancient video game stating that someone was horrible.
               | Sorry if I'm being autistic and taking things too
               | literally again, working on having social skills was a
               | different thread from today.
        
               | joshstrange wrote:
               | I think you might be confused or mistaken (or you are
               | making a whole different joke).
               | 
               | My 2 comments are linking to different quotes from Portal
               | 2, both the original comment
               | 
               | > We got the results back.....
               | 
               | and
               | 
               | > Well, what does a neck-bearded old engineer know about
               | fashion?.....
               | 
               | Are from Portal 2 and the first Portal 2 quote is just a
               | reference to the parent of that saying:
               | 
               | > The cake, sometimes, is a lie.
               | 
               | (Another Portal reference if that wasn't clear), they
               | weren't calling the parent horrible, they were just
               | putting in quote they liked from the game that was
               | referenced.
               | 
               | That's one reason why I linked the quote, so people would
               | understand it was a reference to the game, not the person
               | actually saying the parent was horrible. The other reason
               | I linked it is just because I like added metadata where
               | possible.
        
               | joshstrange wrote:
               | Source: Portal 2, you can see the line and listen to it
               | here (last one in section): https://theportalwiki.com/wik
               | i/GLaDOS_voice_lines_(Portal_2)...
        
               | hinkley wrote:
               | I figured it was appropriate given the context.
               | 
               | I'm still amazed that game started as someone's school
               | project. Long live the Orange Box!
        
               | chihuahua wrote:
               | I'd really like Alexa+ to have the voice of GLaDOS.
        
         | snickerbockers wrote:
         | I'm skeptical that they're actually capable of making something
         | novel. There are thousands of hobby operating systems and video
         | game emulators on github for it to train off of so it's not
         | particularly surprising that it can copy somebody else's
         | homework.
        
           | flatline wrote:
           | I believe they can create a novel instance of a system from a
           | sufficient number of relevant references - i.e. implement a
           | set of already-known features without (much) code
           | duplication. LLMs are certainly capable of this level of
           | generalization due to their huge non-relevant reference set.
           | Whether they can expand beyond that into something truly
           | novel from a feature/functionality standpoint is a whole
           | other, and less well-defined, question. I tend to agree that
           | they are closed systems relative to their corpus. But then,
           | aren't we? I feel like the aperture for true novelty to enter
           | is vanishingly small, and cultures put a premium on it vis-a-
           | vis the arts, technological innovation, etc. Almost every
           | human endeavor is just copying and iterating on prior
           | examples.
        
             | imiric wrote:
             | Here's a thought experiment: if modern machine learning
             | systems existed in the early 20th century, would they have
             | been able to produce an equivalent to the theory of
             | relativity? How about advance our understanding of the
             | universe? Teach us about flight dynamics and take us into
             | space? Invent the Turing machine, Von Neumann architecture,
             | transistors?
             | 
             | If yes, why aren't we seeing glimpses of such genius today?
             | If we've truly invented artificial intelligence, and on our
             | way to super and general intelligence, why aren't we seeing
             | breakthroughs in all fields of science? Why are state of
             | the art applications of this technology based on pattern
             | recognition and applied statistics?
             | 
             | Can we explain this by saying that we're only a few years
             | into it, and that it's too early to expect fundamental
             | breakthroughs? And that by 2027, or 2030, or surely by
             | 2040, all of these things will suddenly materialize?
             | 
             | I have my doubts.
        
               | tanseydavid wrote:
               | How about "Protein Folding"?
        
               | imiric wrote:
               | A great use case for pattern recognition.
        
               | famouswaffles wrote:
               | >Here's a thought experiment: if modern machine learning
               | systems existed in the early 20th century, would they
               | have been able to produce an equivalent to the theory of
               | relativity? How about advance our understanding of the
               | universe? Teach us about flight dynamics and take us into
               | space? Invent the Turing machine, Von Neumann
               | architecture, transistors?
               | 
               | Only a small percentage of humanity are/were capable of
               | doing any of these. And they tend to be the best of the
               | best in their respective fields.
               | 
               | >If yes, why aren't we seeing glimpses of such genius
               | today?
               | 
               | Again, most humans can't actually do any of the things
               | you just listed. Only our most intelligent can. LLMs are
               | great, but they're not (yet?) as capable as our best and
               | brightest (and in many ways, lag behind the average
               | human) in most respects, so why would you expect such
               | genius now ?
        
               | beeflet wrote:
               | Were they the best of the best? or were they just at the
               | right place and time to be exposed to a novel idea?
               | 
               | I am skeptical of this claim that you need a 140IQ to
               | make scientific breakthroughs, because you don't need a
               | 140IQ to understand special relativity. It is a matter of
               | motivation and exposure to new information. The vast
               | majority of the population doesn't benefit from working
               | in some niche field of physics in the first place.
               | 
               | Perhaps LLMs will never be at the right place and the
               | right time because they are only trained on ideas that
               | already exist.
        
               | famouswaffles wrote:
               | >Were they the best of the best? or were they just at the
               | right place and time to be exposed to a novel idea?
               | 
               | It's not an "or" but an "and". Being at the right place
               | and time is a necessary precondition, but it's not
               | sufficient. Newton stood on the shoulders of giants like
               | Kepler and Galileo, and Einstein built upon the work of
               | Maxwell and Lorentz. The key question is, why did they
               | see the next step when so many of their brilliant
               | contemporaries, who had the exact same information and
               | were in similar positions, did not? That's what separates
               | the exceptional from the rest.
               | 
               | >I am skeptical of this claim that you need a 140IQ to
               | make scientific breakthroughs, because you don't need a
               | 140IQ to understand special relativity.
               | 
               | There is a pretty massive gap between understanding a
               | revolutionary idea and originating it. It's the
               | difference between being the first person to summit
               | Everest without a map, and a tourist who takes a
               | helicopter to the top to enjoy the view. One requires
               | genius and immense effort; the other requires following
               | instructions. Today, we have a century of explanations,
               | analogies, and refined mathematics that make relativity
               | understandable. Einstein had none of that.
        
               | Kim_Bruning wrote:
               | It's entirely plausible that sometimes one genius sees
               | the answer all alone -I'm sure it happens sometimes- but
               | it's also definitely a common theme that many people/ a
               | subset of society as a whole may start having similar
               | ideas all around the same time. In many cases where a
               | breakthrough is attributed to one person, if you look
               | more closely you'll often see some sort of team effort or
               | societal ground swell.
        
               | imiric wrote:
               | > LLMs are great, but they're not (yet?) as capable as
               | our best and brightest (and in many ways, lag behind the
               | average human) in most respects, so why would you expect
               | such genius now ?
               | 
               | I'm not expecting novel scientific theories _today_. What
               | I am expecting are signs and hints of such genius.
               | Something that points in the direction that all tech CEOs
               | are claiming we 're headed in. So far I haven't seen any
               | of this yet.
               | 
               | And, I'm sorry, I don't buy the excuse that these tools
               | are not "yet" as capable as the best and brightest
               | humans. They contain the sum of human knowledge, far more
               | than any individual human in history. Are they not
               | _intelligent_ , capable of thinking and reasoning? Are we
               | not at the verge of superintelligence[1]?
               | 
               | > we have recently built systems that are smarter than
               | people in many ways, and are able to significantly
               | amplify the output of people using them.
               | 
               | If all this is true, surely we should be seeing
               | incredible results produced by this technology. If not by
               | itself, then surely by "amplifying" the work of the best
               | and brightest humans.
               | 
               | And yet... All we have to show for it are some very good
               | applications of pattern matching and statistics, a bunch
               | of gamed and misleading benchmarks and leaderboards, a
               | whole lot of tech demos, solutions in search of a
               | problem, and the very real problem of flooding us with
               | even more spam, scams, disinformation, and devaluing
               | human work with low-effort garbage.
               | 
               | [1]: https://blog.samaltman.com/the-gentle-singularity
        
               | famouswaffles wrote:
               | >I'm not expecting novel scientific theories today. What
               | I am expecting are signs and hints of such genius.
               | 
               | Like I said, what exactly would you be expecting to see
               | with the capabilities that exist today ? It's not a
               | gotcha, it's a genuine question.
               | 
               | >And, I'm sorry, I don't buy the excuse that these tools
               | are not "yet" as capable as the best and brightest
               | humans.
               | 
               | There's nothing to buy or not buy. They simply aren't.
               | They are unable to do a lot of the things these people
               | do. You can't slot an LLM in place of most knowledge
               | workers and expect everything to be fine and dandy.
               | There's no ambiguity on that.
               | 
               | >They contain the sum of human knowledge, far more than
               | any individual human in history.
               | 
               | It's not really the total sum of human knowledge but
               | let's set that aside. Yeah so ? Einstein, Newton, Von
               | Newman. None of these guys were privy to some super
               | secret knowledge their contemporaries weren't so it's
               | obviously not simply a matter of more knowledge.
               | 
               | >Are they not intelligent, capable of thinking and
               | reasoning?
               | 
               | Yeah they are. And so are humans. So were the peers of
               | all those guys. So why are only a few able to see the
               | next step ? It's not just about knowledge, and
               | intelligence lives in degrees/is a gradient.
               | 
               | >If all this is true, surely we should be seeing
               | incredible results produced by this technology. If not by
               | itself, then surely by "amplifying" the work of the best
               | and brightest humans.
               | 
               | Yeah and that exists. Terence Tao has shared a lot of his
               | (and his peers) experiences on the matter.
               | 
               | https://mathstodon.xyz/@tao/115306424727150237
               | 
               | https://mathstodon.xyz/@tao/115420236285085121
               | 
               | https://mathstodon.xyz/@tao/115416208975810074
               | 
               | >And yet... All we have to show for it are some very good
               | applications of pattern matching and statistics, a bunch
               | of gamed and misleading benchmarks and leaderboards, a
               | whole lot of tech demos, solutions in search of a
               | problem, and the very real problem of flooding us with
               | even more spam, scams, disinformation, and devaluing
               | human work with low-effort garbage.
               | 
               | Well it's a good thing that's not true then
        
               | imiric wrote:
               | > Like I said, what exactly would you be expecting to see
               | with the capabilities that exist today ?
               | 
               | And like I said, "signs and hints" of superhuman
               | intelligence. I don't know what that looks like since I'm
               | merely human, but I sure know that I haven't seen it yet.
               | 
               | > There's nothing to buy or not buy. They simply aren't.
               | They are unable to do a lot of the things these people
               | do.
               | 
               | This claim is directly opposed to claims by Sam Altman
               | and his cohort, which I'll repeat:
               | 
               | > we have recently built systems that are smarter than
               | people in many ways, and are able to significantly
               | amplify the output of people using them.
               | 
               | So which is it? If they're "smarter than people in many
               | ways", where is the product of that superhuman
               | intelligence? If they're able to "significantly amplify
               | the output of people using them", then all of humanity
               | should be empowered to produce incredible results that
               | were previously only achievable by a limited number of
               | people. In hands of the best and brightest humans, it
               | should empower them to produce results previously
               | unreachable by humanity.
               | 
               | Yet all positive applications of this technology show
               | that it excels at finding and producing data patterns,
               | and nothing more than that. Those experience reports by
               | Terence Tao are prime examples of this. The system was
               | fed a lot of contextual information, and after being
               | coaxed by highly intelligent humans, was able to find and
               | produce patterns that were difficult to see by humans.
               | This is hardly a showcase of intelligence that you and
               | others think it is. Including those highly intelligent
               | humans, some of whom have a lot to gain from pushing this
               | narrative.
               | 
               | We have seen similar reports by programmers as well[1].
               | Yet I'm continually amazed that these highly intelligent
               | people are surprised that a pattern finding and producing
               | system was able to successfully find and produce useful
               | patterns, and then interpret that as a showcase of
               | intelligence. So much so that I start to feel suspicious
               | about the intentions and biases of those people.
               | 
               | To be clear: I'm not saying that these systems can't be
               | very useful in the right hands, and potentially
               | revolutionize many industries. Ultimately many real-world
               | problems can be modeled as statistical problems where a
               | pattern recognition system can excel. What I am saying is
               | that there's a very large gap from the utility of such
               | tools, and the extraordinary claims that they have
               | intelligence, let alone superhuman and general
               | intelligence. So far I have seen no evidence of the
               | latter, despite of the overwhelming marketing euphoria
               | we're going through.
               | 
               | > Well it's a good thing that's not true then
               | 
               | In the world outside of the "AI" tech bubble, that is
               | very much the reality.
               | 
               | [1]: https://news.ycombinator.com/item?id=45784179
        
               | lelanthran wrote:
               | > Only a small percentage of humanity are/were capable of
               | doing any of these. And they tend to be the best of the
               | best in their respective fields.
               | 
               | Sure, agreed, but the difference between a small
               | percentage and zero percentage is infinite.
        
               | gf000 wrote:
               | > Only a small percentage of humanity are/were capable of
               | doing any of these. And they tend to be the best of the
               | best in their respective fields.
               | 
               | A definite, absolute and unquestionable no, and a small,
               | but real chance is absolutely different categories.
               | 
               | You may wait for a bunch of rocks to sprout forever, but
               | I would put my money on a bunch of random seeds, even if
               | I don't know how they were kept.
        
             | beeflet wrote:
             | Almost all of the work in making a new operating system or
             | a gameboy emulator or something is in characterizing the
             | problem space and defining the solution. How do you know
             | what such and such instruction does? What is the ideal way
             | to handle this memory structure here? You know, knowledge
             | you gain from spending time tracking down a specific bug or
             | optimizing a subroutine.
             | 
             | When I create something, it's an exploratory process. I
             | don't just guess what I am going to do based on my previous
             | step and hope it comes out good on the first try. Let's say
             | I decide to make a car with 5 wheels. I would go through
             | several chassis designs, different engine configurations
             | until I eventually had something that works well. Maybe
             | some are too weak, some too expensive, some are too
             | complicated. Maybe some prototypes get to the physical
             | testing stage while others don't. Finally, I publish this
             | design for other people to work on.
             | 
             | If you ask the LLM to work on a novel concept it hasn't
             | been trained on, it will usually spit out some nonsense
             | that either doesn't work or works poorly, or it will refuse
             | to provide a specific enough solution. If it has been
             | trained on previous work, it will spit out something that
             | looks similar to the solved problem in its training set.
             | 
             | These AI systems don't undergo the process of trial and
             | error that suggests it is creating something novel. Its
             | process of creation is not reactive with the environment.
             | It is just cribbing off of extant solutions it's been
             | trained on.
        
               | vidarh wrote:
               | I'm literally watching Claude Code "undergo the process
               | of trial and error" in another window right now.
        
           | jstummbillig wrote:
           | I remain confused but still somewhat interested as to a
           | definition of "novel", given how often this idea is wielded
           | in the AI context. How is everyone so good at identifying
           | "novel"?
           | 
           | For example, I can't wrap my head around how a) a human could
           | come up with a piece of writing that _inarguably_ reads
           | "novel" writing, while b) an AI could be guaranteed to _not_
           | be able to do the same, under the same standard.
        
             | testaccount28 wrote:
             | why would you admit on the internet that you fail the
             | reverse turing test?
        
               | fragmede wrote:
               | Because not everyone here has a raging ego and no
               | humility?
        
               | CamperBob2 wrote:
               | You have no idea if you're talking to an LLM or a human,
               | yourself, so ... uh, wait, neither do I.
        
               | greygoo222 wrote:
               | Because I'm an LLM and you are too
        
               | mikestorrent wrote:
               | Didn't some fake AI country song just get on the top 100?
               | How novel is novel? A lot of human artists aren't
               | producing anything _novel_.
        
               | magicalist wrote:
               | > _Didn 't some fake AI country song just get on the top
               | 100?_
               | 
               | No
               | 
               | Edit: to be less snarky, it topped the Billboard Country
               | Digital Song Sales Chart, which is a measure of sales of
               | the individual song, not streaming listens. It's
               | estimated it takes a few thousand sales to top that
               | particular chart and it's widely believed to be commonly
               | manipulated by coordinated purchases.
        
               | terminalshort wrote:
               | It was a real AI country song, not a fake one, but yes.
        
             | snickerbockers wrote:
             | Generally novel either refers to something that is new, or
             | a certain type of literature. If the AI is generating
             | something functionally equivalent to a program in its
             | training set (in this case, dozens or even hundreds of such
             | programs) then it by definition cannot be novel.
        
               | brulard wrote:
               | This is quite a narrow view of how the generation works.
               | AI can extrapolate from the training set and explore new
               | directions. It's not just cutting pieces and gluing
               | together.
        
               | beeflet wrote:
               | In practice, I find the ability for this new wave of AI
               | to extrapolate very limited.
        
               | fragmede wrote:
               | Do you have any concrete examples you'd care to share?
               | While this new wave of AI doesn't have unlimited powers
               | of extrapolation, the post we're commenting on is
               | asserting that this latest AI from Google was able to
               | extrapolate solutions to two of AI's oldest problems,
               | which would seem to contradict an assertion of "very
               | limited".
        
               | snickerbockers wrote:
               | uhhh can it? I've certainly not seen any evidence of an
               | AI generating something not based on its training set.
               | It's certainly smart enough to shuffle code around and
               | make superficial changes, and that's pretty impressive in
               | its own way but not particularly useful unless your only
               | goal is to just launder somebody else's code to get
               | around a licensing problem (and even then it's
               | questionable if that's a derived work or not).
               | 
               | Honest question: if AI is actually capable of exploring
               | new directions why does it have to train on what is
               | effectively the sum total of all human knowledge?
               | Shouldn't it be able to take in some basic concepts
               | (language parsing, logic, etc) and bootstrap its way into
               | new discoveries (not necessarily completely new but
               | independently derived) from there? Nobody learns the way
               | an LLM does.
               | 
               | ChatGPT, to the extent that it is comparable to human
               | cognition, is undoubtedly the most well-read person in
               | all of history. When I want to learn something I look it
               | up online or in the public library but I don't have to
               | read the entire library to understand a concept.
        
               | BirAdam wrote:
               | You didn't have to read the whole library because your
               | brain has been absorbing knowledge from multiple inputs
               | your entire life. AI systems are trying to temporally
               | compress a lifetime into the time of training. Then,
               | given that these systems have effectively a single input
               | method of streams of bits, they need immense amounts of
               | it to be knowledgeable at all.
        
               | BobbyTables2 wrote:
               | You have to realize AI is trained the same way one would
               | train an auto-completer.
               | 
               | Theres no cognition. It's not taught language, grammar,
               | etc. none of that!
               | 
               | It's only seen a huge amount of text that allows it to
               | recognize answers to questions. Unfortunately, it appears
               | to work so people see it as the equivalent to sci-fi
               | movie AI.
               | 
               | It's really just a search engine.
        
               | snickerbockers wrote:
               | I agree and that's the case I'm trying to make. The
               | machine-learning community expects us to believe that it
               | is somehow comparable to human cognition, yet the way it
               | learns is inherently inhuman. If an LLM was in any way
               | similar to a human I would expect that, like a human, it
               | might require a little bit of guidance as it learns but
               | ultimately it would be capable of understanding concepts
               | well enough that it doesn't need to have memorized every
               | book in the library just to perform simple tasks.
               | 
               | In fact, I would expect it to be able to reproduce past
               | human discoveries it hasn't even been exposed to, and if
               | the AI is actually capable of this then it should be
               | possible for them to set up a controlled experiment
               | wherein it is given a limited "education" and must
               | discover something already known to the researchers but
               | not the machine. That nobody has done this tells me that
               | either they have low confidence in the AI despite their
               | bravado, or that they already have tried it and the
               | machine failed.
        
               | ezst wrote:
               | > The machine-learning community
               | 
               | Is it? I only see a few individuals, VCs, and tech giants
               | overblowing LLMs capabilities (and still puzzled as to
               | how the latter dragged themselves into a race to the
               | bottom through it). I don't believe the academic field
               | really is that impressed with LLMs.
        
               | throwaway173738 wrote:
               | There's a third possible reason which is that they're
               | taking it as a given that the machine is "intelligent" as
               | a sales tactic, and they're not academic enough to want
               | to test anything they believe.
        
               | ninetyninenine wrote:
               | no it's not I work on AI and what these things do are
               | much much more then a search engine or an autocomplete.
               | If an autocomplete passed the turing test you'd dismiss
               | it because it's still an autocomplete.
               | 
               | The characterization you are regurgitating here is from
               | laymen who do not understand AI. You are not just mildly
               | wrong but wildly uninformed.
        
               | MangoToupe wrote:
               | To be fair, it's not clear _human_ intelligence is much
               | more than search or autocomplete. The only thing that 's
               | clear here is that LLMs can't reproduce it.
        
               | ninetyninenine wrote:
               | Yes but colloquially this characterization you see used
               | by laymen is deliberately used to deride AI and dismiss
               | it. It is not honest about the on the ground progress AI
               | has made and it's not intellectual honest about the
               | capabilities and weaknesses of Ai.
        
               | MangoToupe wrote:
               | I disagree. The actual capabilities of LLMs remain
               | unclear, and there's a great deal of reasons to be
               | suspicious of anyone whose paycheck relies on pimping
               | them.
        
               | ninetyninenine wrote:
               | The capabilities of LLMs are unclear but it is clear that
               | they are not just search engines or autocompletes or
               | stochastic parrots.
               | 
               | You can disagree. But this is not an opinion. You are
               | factually wrong if you disagree. And by that I mean you
               | don't know what you're talking about and you are
               | completely misinformed and lack knowledge.
               | 
               | The long term outcome if I'm right is that AI abilities
               | continue to grow and it basically destroys my career and
               | yours completely. I stand not to benefit from this
               | reality and I state it because it is reality. LLMs
               | improve every month. It's already to the point of where
               | if you're not vibe coding you're behind.
        
               | versteegen wrote:
               | Well, I also work on AI, and I completely agree with you.
               | But I've reached the point of thinking it's hopeless to
               | argue with people about this: It seems that as LLMs
               | become ever better people aren't going to change their
               | opinions, as I had expected. If you don't have good
               | awareness of how human cognition actually works, then
               | it's not evidently contradictory to think that even a
               | superintelligent LLM trained on all human knowledge is
               | _just_ pattern matching and that humans _are not_.
               | Creativity, understanding, originality, intent, etc, can
               | all be placed into a largely self-consistent framework of
               | human specialness.
        
               | fragmede wrote:
               | Isn't that what's going on with synthetic data? The LLM
               | is trained, then is used to generate data that gets put
               | into the training set, and then gets further trained on
               | that generated data?
        
               | ninetyninenine wrote:
               | >I've certainly not seen any evidence of an AI generating
               | something not based on its training set.
               | 
               | There is plenty of evidence for this. You have to be
               | blind not to realize this. Just ask the AI to generate
               | something not in it's training set.
        
               | gf000 wrote:
               | Like the seahorse emoji?
        
               | kazinator wrote:
               | Positively not. It is pure interpolation and not
               | extrapolation. The training set is vast and supports an
               | even vaster set of possible traversal paths; but they are
               | all interpolative.
               | 
               | Same with diffusion and everything else. It is not
               | extrapolation that you can transfer the style of Van Gogh
               | onto a photographl it is interpolation.
               | 
               | Extrapolation might be something like inventing a style:
               | how did Van Gogh do that?
               | 
               | And, sure, the thing can invent a new style---as a mashup
               | of existing styles. Give me a Picasso-like take on Van
               | Gogh and apply it to this image ...
               | 
               | Maybe the original thing there is the _idea_ of doing
               | that; but that came from me! The execution of it is just
               | interpolation.
        
               | BoorishBears wrote:
               | This is knock against you _at all_ , but in a naive
               | attempt to spare someone else some time: remember that
               | based on this definition it is impossible for an LLM to
               | do novel things _and more importantly_ , you're not going
               | to change how this person defines a concept as integral
               | to one's being as novelty.
               | 
               | I personally think this is a bit tautological of a
               | definition, but if you hold it, then yes LLMs are not
               | capable of anything novel.
        
               | kazinator wrote:
               | That is not strictly true, because being able to transfer
               | the style of Van Gogh onto an arbitrary photographic
               | scene _is_ novel in a sense, but it is interpolative.
               | 
               | Mashups are not purely derivative: the choice of what to
               | mash up carries novelty: two (or more) representations
               | are mashed together which hitherto have not been.
               | 
               | We cannot deny that something is new.
        
               | regularfry wrote:
               | Innovation itself is frequently defined as the novel
               | combination of pre-existing components. It's mashups all
               | the way down.
        
               | BoorishBears wrote:
               | I'm saying their comment is calling that not something
               | new.
               | 
               | I don't agree, but by their estimation adding things
               | together is still just using existing things.
        
               | Libidinalecon wrote:
               | I think you should reverse the question, why would we
               | expect LLMs to even have the ability to do novel things?
               | 
               | It is like expecting a DJ remixing tracks to output
               | original music. Confusing that the DJ is not actually
               | playing the instruments on the recorded music so they
               | can't do something new beyond the interpolation. I love
               | DJ sets but it wouldn't be fair to the DJ to expect them
               | to know how to play the sitar because they open the set
               | with a sitar sample interpolated with a kick drum.
        
               | 8note wrote:
               | kid koala does jazz solos on a disk of 12 notes, jumping
               | the track back and forth to get different notes.
               | 
               | i think that, along with the sitar player are still
               | interpolating. the notes are all there on the instrument.
               | even without an instrument, its still interpolating. the
               | space that music and aound can be in is all well known
               | wave math. if you draw a fourier transform view, you
               | could see one chart with all 0, and a second with all
               | +infinite, and all music and sound is gonna sit somewhere
               | between the two.
               | 
               | i dont know that "just interpolation" is all that
               | meaningful to whether something is novel or interesting.
        
               | BoorishBears wrote:
               | It just depends on how you define novel.
               | 
               | Would you consider the instrumental at 33 seconds a new
               | song? https://youtu.be/eJA0wY1e-zU?si=yRrDlUN2tqKpWDCv
        
               | ozgrakkurt wrote:
               | This is how people do things as well imo. LLM does the
               | same thing on some level but it is just not good enough
               | for majority of use cases
        
               | throwaway173738 wrote:
               | Calling it "exploring" is anthropomorphising. The machine
               | has weights that yield meaningful programs given
               | specification-like language. It's a useful phenomenon but
               | it may be nothing like what we do.
        
               | grosswait wrote:
               | Or it may be remarkably similar to what we do
        
               | taneq wrote:
               | OK, but by that definition, how many human software
               | developers ever develop something "novel"? Of course, the
               | "functionally equivalent" term is doing a lot of heavy
               | lifting here: How equivalent? How many differences are
               | required to qualify as different? How many similarities
               | are required to qualify as similar? Which one overrules
               | the other? If I write an app that's identical to Excel in
               | every single aspect except that instead of a Microsoft
               | Flight Simulator easter egg, there's a different, unique,
               | fully playable game that can't be summed up with any
               | combination of genre lables, is that 'novel'?
        
               | gf000 wrote:
               | I think the importance is _the ability_. Not every human
               | have produced (or even can) something novel in their
               | life, but there are humans who have time after time.
               | 
               | Meanwhile, depending on how you rate LLM's capabilities,
               | no matter how many trials you give it, it may not be
               | considered capable of that.
               | 
               | That's a very important distinction.
        
             | QuadmasterXLII wrote:
             | A system of humans creates bona fide novel writing. We
             | don't know which human is responsible for the novelty in
             | homoerotic fanfiction of the Odyssey, but it wasn't a
             | lizard. LLMs don't have this system-of-thinkers
             | bootstrapping effect yet, or if they do it requires an
             | absolutely enormous boost to get going
        
             | kazinator wrote:
             | Because we know that the human only read, say, fifty books
             | since they were born, and watched a few thousand videos,
             | and there is nothing in them which resembles what they
             | wrote.
        
             | terminalshort wrote:
             | If a LLM had written Linux, people would be saying that it
             | isn't novel because it's just based on previous OS's. There
             | is no standard here, only bias.
        
               | jofla_net wrote:
               | Cept its not made Linux (in the absence of it).
               | 
               | At any point prior to the final output it can garner huge
               | starting point bias from ingested reference material.
               | This can be up to and including whole solutions to the
               | original prompt minus some derivations. This is
               | effectively akin to cheating for humans as we cant bring
               | notes to the exam. Since we do not have a complete
               | picture of where every part of the output comes from we
               | are at a loss to explain if it indeed invented it or not.
               | The onus is and should be on the applicant to ensure that
               | the output wasn't copied (show your work), not on the
               | graders to prove that it wasn't copied. No less than what
               | would be required if it was a human. Ultimately it boils
               | down to what it means to 'know' something, whether a
               | photographic memory is, in fact, knowing something, or
               | rather derivations based on other messy forms of
               | symbolism. It is nevertheless a huge argument as both
               | sides have a mountain of bias in either directions.
        
               | jstummbillig wrote:
               | > Cept its not made Linux (in the absence of it).
               | 
               | Neither did you (or I). Did you create anything that you
               | are certain your peers would recognize as more "novel"
               | than anything a LLM could produce?
        
               | snickerbockers wrote:
               | >Neither did you (or I).
               | 
               | Not that specifically but I certainly have the capability
               | to create my own OS without having to refer to the source
               | code of existing operating systems. Literally "creating a
               | linux" is a bit on the impossible side because it implies
               | compatibility with an existing kernel despite the
               | constraints prohibiting me from referring to the source
               | of that existing kernel (maybe possible if i had some
               | clean-room RE team that would read through the source and
               | create a list of requirements without including any
               | source).
               | 
               | If we're all on the same page regarding the origins of
               | human intelligence (ie, that it does _not_ begin with
               | satan tricking adam and eve into eating the fruit of a
               | tree they were specifically instructed not to touch) then
               | it necessarily follows that any idea or concept was new
               | at some point and had to be developed by somebody who
               | didn 't already have an entire library of books
               | explaining the solution at his disposal.
               | 
               | For the Linux thought-experiment you could maybe argue
               | that Linux isn't totally novel since its creator was
               | intentionally mimicking behavior of an existing well-
               | known operating system (also iirc he had access to the
               | minix source) and maybe you could even argue that those
               | predecessors stood on the shoulders of their own
               | proverbial giants, but if we keep kicking the ball down
               | the road eventually we reach a point where somebody had
               | an idea which was not in any way inspired by somebody
               | else's existing idea.
               | 
               | The argument I want to make is not that humans never
               | create derivative or unoriginal works (that obviously
               | cannot be true) but that humans have the capability to
               | create new things. I'm not convinced that LLMs have that
               | same capability; maybe I'm wrong but I'm still waiting to
               | see evidence of them discovering something new. As I said
               | in another post, this could easily be demonstrated with a
               | controlled experiment in which the model is bootstrapped
               | with a basic yet intentionally-limited "education" and
               | then tasked with discovering something already known to
               | the experimenters which was not in its training set.
               | 
               | >Did you create anything that you are certain your peers
               | would recognize as more "novel" than anything a LLM could
               | produce?
               | 
               | Yes, I have definitely created things without first
               | reading every book in the library and memorizing
               | thousands of existing functionally-equivalent solutions
               | to the same problem. So have you so long as I'm not
               | actually debating an LLM right now.
        
             | visarga wrote:
             | > For example, I can't wrap my head around how a) a human
             | could come up with a piece of writing that inarguably reads
             | "novel" writing, while b) an AI could be guaranteed to not
             | be able to do the same, under the same standard.
             | 
             | The secret ingredient is the world outside, and past
             | experiences from the world, which are unique for each
             | human. We stumble onto novelty in the environment. But AI
             | can do that too - move 37 AlphaGo is an example, much
             | stumbling around leads to discoveries even for AI. The
             | environment is the key.
        
             | baq wrote:
             | If the model can map an unseen problem to something in its
             | latent space, solve it there, map back and deliver an
             | ultimately correct solution, is it novel? Genuine question,
             | 'novel' doesn't seem to have a universally accepted
             | definition here
        
               | gf000 wrote:
               | Good question, though I would say that there may be
               | different grades of novelty.
               | 
               | One grade might be your example, while something like
               | Godel's incompleteness theorems or Einstein's relativity
               | could go into a different grade.
        
           | n8cpdx wrote:
           | The windows (~2000) kernel itself is on GitHub. Even
           | exquisitely documented if AI can read .doc files.
           | 
           | https://github.com/ranni0225/WRK
        
           | sosuke wrote:
           | Doing something novel is incredibly difficult through LLM
           | work alone. Dreaming, hallucinating, might eventually make
           | novel possible but it has to be backed up be rock solid base
           | work. We aren't there yet.
           | 
           | The working memory it holds is still extremely small compared
           | to what we would need for regular open ended tasks.
           | 
           | Yes there are outliers and I'm not being specific enough but
           | I can't type that much right now.
        
           | fragmede wrote:
           | Of course they can come up with something novel. They're
           | called hallucinations when they do, and that's something that
           | can't be in their training data, because it's not
           | true/doesn't exist. Of course, when they do come up totally
           | novel hallucinations, suddenly being creative is a bad thing
           | to be "fixed".
        
         | nestorD wrote:
         | Oh! That's a nice use-case and not too far from stuff I have
         | been playing with! (happily I do not have to deal with
         | handwriting, just bad scans of older newspapers and texts)
         | 
         | I can vouch for the fact that LLMs are great at searching in
         | the original language, summarizing key points to let you know
         | whether a document might be of interest, then providing you
         | with a translation where you need one.
         | 
         | The fun part has been build tools to turn Claude code and Codex
         | CLI into capable research assistant for that type of projects.
        
           | throwup238 wrote:
           | _> The fun part has been build tools to turn Claude code and
           | Codex CLI into capable research assistant for that type of
           | projects._
           | 
           | What does that look like? How well does it work?
           | 
           | I ended up writing a research TUI with my own higher level
           | orchestration (basically have the thing keep working in a
           | loop until a budget has been reached) and document
           | extraction.
        
             | nestorD wrote:
             | I started with a UI that sounded like it was built along
             | the same lines as yours, which had the advantage of letting
             | me enforce a pipeline and exhaustivity of search (I don't
             | want the 10 most promising documents, I want all of them).
             | 
             | But I realized I was not using it much _because_ it was
             | that big and inflexible (plus I keep wanting to stamp out
             | all the bugs, which I do not have the time to do on a hobby
             | project). So I ended up extracting it into MCPs (equipped
             | to do full-text search and download OCR from the various
             | databases I care about) and AGENTS.md files (defining
             | pipelines, as well as patterns for both searching behavior
             | and reporting of results). I also put together a sub-agent
             | for translation (cutting away all tools besides reading and
             | writing files, and giving it some document-specific
             | contextual information).
             | 
             | That lets me use Claude Code and Codex CLI (which,
             | anecdotally, I have found to be the better of the two for
             | that kind of work; it seems to deal better with longer
             | inputs produced by searches) as the driver, telling them
             | what I am researching and maybe how I would structure the
             | search, then letting them run in the background before
             | checking their report and steering the search based on
             | that.
             | 
             | It is not perfect (if a search surfaces 300 promising
             | documents, it will _not_ check all of them, and it often
             | misunderstands things due to lacking further context), but
             | I now find myself reaching for it regularly, and I polish
             | out problems one at a time. The next goal is to add more
             | data sources and to maybe unify things further.
        
               | throwup238 wrote:
               | _> It is not perfect (if a search surfaces 300 promising
               | documents, it will not check all of them, and it often
               | misunderstands things due to lacking further context)_
               | 
               | This has been the biggest problem for me too. I jokingly
               | call it the LLM halting problem because it never knows
               | the proper time to stop working on something, finishing
               | way too fast without going through each item in the list.
               | That's why I've been doing my own custom orchestration,
               | drip feeding it results with a mix of summarization and
               | content extraction to keep the context from different
               | documents chained together.
               | 
               | Especially working with unindexed content like colonial
               | documents where I'm searching through thousands of pages
               | spread (as JPEGs) over hundreds of documents for a single
               | one that's relevant to my research, but there are latent
               | mentions of a name that ties them all together (like a
               | minor member of an expedition giving relevant testimony
               | in an unrelated case). It turns into a messy web of named
               | entity recognition and a bunch of more classical NLU
               | tasks, except done with an LLM because I'm lazy.
        
         | jvreeland wrote:
         | I'd love to find more info on this but from what I can find it
         | seems to be making webpages that look like those product, and
         | seemingly can "run python" or "emulate a game" but writing
         | something that, based on all of GitHub, can approximate an
         | iPhone or emulator in javscript/css/HTML is very very very
         | different than writing an OS.
        
         | kace91 wrote:
         | >I'm interested in the Conquistadors and their extensive
         | accounts of their expeditions, but holy cow reading 16th
         | century handwritten Spanish and translating it at the same time
         | is a nightmare, requiring a ton of expertise and inside field
         | knowledge
         | 
         | Completely off topic, but out of curiosity, where are you
         | reading these documents? As a Spaniard I'm kinda interested.
        
           | throwup238 wrote:
           | I use the Portal de Archivos Espanoles [1] for Spanish
           | colonial documents. Each country has their own archive but
           | the Spanish one has the most content (35 million digitized
           | pages)
           | 
           | The hard part is knowing where to look since most of the
           | images haven't gone through HRT/OCR or indexing so you have
           | to understand Spanish colonial administration and go through
           | the collections to find stuff.
           | 
           | [1] https://pares.cultura.gob.es/pares/en/inicio.html
        
             | throwout4110 wrote:
             | Want to collab on a database and some clustering and
             | analysis? I'm a data scientist at FAIR with an interest in
             | antiquarian docs and books
        
               | rmonvfer wrote:
               | Spaniard here. Let me know if I can somehow help navigate
               | all of that. I'm very interested in history and
               | everything related to the 1400-1500 period (although I'm
               | not an expert by any definition) and I'd love to see what
               | modern technology could do here, specially OCRs and VLMs.
        
               | throwup238 wrote:
               | Sadly I'm just an amateur armchair historian (at best) so
               | I doubt I'd be of much help. I'm mostly only doing the
               | translation for my own edification
        
               | cco wrote:
               | You may be surprised (or not?) at how many important
               | scientific and historical works are done by armchair
               | practitioners.
        
               | vintermann wrote:
               | You should maybe reach out to the author of this blog
               | post, professor Mark Humphries. Or to the genealogy
               | communities, we struggle with handwritten historical
               | texts no public AI model can make a dent in, regularly.
        
               | dr_dshiv wrote:
               | Hit me up, if you can. I'm focused on neolatin texts from
               | the renaissance. Less than 30% of known book editions
               | have been scanned and less than 5% translated. And that's
               | before even getting to the manuscripts.
               | 
               | https://Ancientwisdomtrust.org
               | 
               | Also working on kids handwriting recognition for
               | https://smartpaperapp.com
        
           | SJC_Hacker wrote:
           | Do you have six fingers, per chance ?
        
             | ChrisMarshallNY wrote:
             | I don't know if the six-fingered man was a Spaniard, but
             | Inigo Montoya was...
        
         | Footprint0521 wrote:
         | Bro split that up, use LLMs for transcription first, then take
         | that and translate it
        
         | smusamashah wrote:
         | > Whats the kernel look like?
         | 
         | Those clones are all HTML/CSS, same for game clones made by
         | Gemini.
        
         | Aperocky wrote:
         | > This I'm a lot more skeptical of. The linked twitter post
         | just looks like something it would replicate via HTML/CSS/JS.
         | Whats the kernel look like?
         | 
         | Thanks for this, I was almost convinced and about to re-think
         | my entire perspective and experience with LLMs.
        
         | viftodi wrote:
         | You are right to be skeptical.
         | 
         | There are plenty of so called windows(or other) web 'os'
         | clones.
         | 
         | There were a couple of these posted on HN actually this very
         | year.
         | 
         | Here is one example I google dthat was also on HN :
         | https://news.ycombinator.com/item?id=44088777
         | 
         | This is not an OS as in emulating a kernel in javascript or
         | wasm, this is making a web app that looks like the desktop of
         | an OS.
         | 
         | I have seen plenty such projects, some mimick windows UI
         | entirely, you xan find them via google.
         | 
         | So this was definitely in the training data, and is not as
         | impressive as the blog post or the twitter thread make it to
         | be.
         | 
         | The scary thing is the replies in the twitter thread have no
         | critical thinking at all and are impressed beyond belief, they
         | think it coded a whole kernel, os, made an interpeter for it,
         | ported games etc.
         | 
         | I think this is the reason why some people are so impressed by
         | AI, when you can only judge an app visually or only how you
         | intetcat with it and don't have the depth of knowledge to
         | understand, for such people it works all the way.land AI seems
         | magical beyond comprehension.
         | 
         | But all this is only superficial IMHO.
        
           | krackers wrote:
           | Every time a model is about to be released, there are a bunch
           | of these hype accounts that spin up. I don't know they get
           | paid or they spring up organically to farm engagement. Last
           | time there was such hype for a model was "strawberry" (o1)
           | then gpt-5, and both turned out to be meaningful improvements
           | but nowhere near the hype.
           | 
           | I don't doubt though that new models will be very good at
           | frontend webdev. In fact this is explicitly one of the recent
           | lmarena tasks so all the labs have probably been optimizing
           | for it.
        
             | tyre wrote:
             | My guess is that there are insiders who know about the
             | models and can't keep their mouths shut. They like being on
             | the inside and leaking.
        
               | DrewADesign wrote:
               | I'd also bet my car on there being a ton of AI
               | product/policy/optics astroturfing/shilling going on,
               | here and everywhere else. Social proof is a hell of a
               | marketing tool and I see a lot of comments suspiciously
               | bullish about mediocre things, or suspiciously aggressive
               | towards people that aren't enthused. I don't have any
               | direct proof so I could be wrong, but it seems more
               | extreme than a iPhone/Android (though I suspect
               | deliberate marketing forces there, too,) Ford/Chevy
               | brand-based-identity kind of thing, and naive to think
               | this tactic is limited to TikTok and Instagram videos.
               | The crowd here is so targeted, I wouldn't be surprised if
               | a single-digit percentage of the comments are laying down
               | plausible comment history facade for marketing use. The
               | economics might make it worthwhile for the professional
               | manipulators of the world.
        
           | risyachka wrote:
           | Its always amusing when "an app like windows xp" considered
           | hard or challenging somehow.
           | 
           | Literally the most basic html/css, not sure why it is even
           | included in benchmarks.
        
             | ACCount37 wrote:
             | Those things are LLMs, with text and language at the core
             | of their capabilities. UIs are, notably, not text.
             | 
             | An LLM being able to build up interfaces that look
             | recognizably like an UI from a real OS? That sure suggests
             | a degree of multimodal understanding.
        
               | cowboy_henk wrote:
               | UIs made in the HyperText Markup Language are, in fact,
               | text.
        
             | viftodi wrote:
             | While it is obviously much easier than creating a real OS,
             | some people have created desktop managers web apps, with
             | resizeable and movable windows, apps such as terminals,
             | nodepads, file explorer etc.
             | 
             | This is still a challenging task and requires lots of work
             | to get this far.
        
         | jchw wrote:
         | I'm surprised people didn't click through to the tweet.
         | 
         | https://x.com/chetaslua/status/1977936585522847768
         | 
         | > I asked it for windows web os as everyone asked me for it and
         | the result is mind blowing , it even has python in terminal and
         | we can play games and run code in it
         | 
         | And of course
         | 
         | > 3D design software, Nintendo emulators
         | 
         | No clue what these refer to but to be honest it sounds like
         | they've incrementally improved one-shotting capabilities
         | mostly. I wouldn't be surprised if Gemini 2.5 Pro could get a
         | Gameboy or NES emulator working to boot Tetris or Mario, while
         | it is a decent chunk of code to get things going, there's an
         | absolute boatload of code on the Internet, and the complexity
         | is lower than you might imagine. (I have written a couple of
         | toy Gameboy emulators from scratch myself.)
         | 
         | Don't get me wrong, it is pretty cool that a machine can do
         | this. A lot of work people do today just isn't that novel and
         | if we can find a way to tame AI models to make them trustworthy
         | enough for some tasks it's going to be an easy sell to just
         | throw AI models at certain problems they excel at. I'm sure
         | it's already happening though I think it still mostly isn't
         | happening for code at least in part due to the inherent
         | difficulty of making AI work effectively in existing large
         | codebases.
         | 
         | But I will say that people are a little crazy sometimes. Yes it
         | is very fascinating that an LLM, which is essentially an
         | extremely fancy token predictor, can _one-shot_ a web app that
         | is mostly correct, apparently without any feedback, like being
         | able to actually run the application or even see editor errors,
         | at least as far as we know. This is genuinely really impressive
         | and interesting, and not the aspect that I think anyone seeks
         | to downplay. However, consider this: even as relatively simple
         | as an NES is compared to even moderately newer machines, to
         | make an NES emulator you have to know how an NES works and even
         | have strategies for how to emulate it, which don 't necessarily
         | follow from just reading specifications or even NES program
         | disassembly. The existence of _many_ toy NES emulators and a
         | very large amount of documentation for the NES hardware and
         | inner workings on the Internet, as well as the 6502, means that
         | LLMs have a _lot_ of training data to help them out.
         | 
         | I think that these tasks which extremely well-covered in the
         | training data gives people unrealistic expectations. You could
         | _probably_ pick a simpler machine that an LLM would do
         | significantly worse at, even though a human who knows how to
         | write emulation software could definitely do it. Not sure what
         | to pick, but let 's say SEGA's VMU units for the Dreamcast -
         | very small, simple device, and I reckon there should be
         | information about it online, but it's going to be somewhat
         | limited. You might think, "But that's not fair. It's unlikely
         | to be able to one-shot something like that without mistakes
         | with so much less training data on the subject." _Exactly_. In
         | the real world, that comes up. Not always, but often. If it
         | didn 't, programming would be an incredibly boring job. (For
         | some people, it _is_ , and these LLMs will probably be
         | disrupting that...) That's not to say that AI models can
         | _never_ do things like debug an emulator or even do reverse
         | engineering on its own, but it 's increasingly clear that this
         | won't emerge from strapping agents on top of transformers
         | predicting tokens. But since there is a very large portion of
         | work that is not very novel in the world, I can totally
         | understand why everyone is trying to squeeze this model as far
         | as it goes. Gemini and Claude are shockingly competent.
         | 
         | I believe many of the reasons people scoff at AI are fairly
         | valid even if they don't always come from a rational mindset,
         | and I try to keep my usage of AI to be relatively tasteful. I
         | don't like AI art, and I personally don't like AI code. I find
         | the push to put AI in everything incredibly annoying, and I
         | worry about the clearly circular AI market, overhyped
         | expectations. I dislike the way AI training has ripped up the
         | Internet, violated people's trust, and lead to a more closed
         | Internet. I dislike that sites like Reddit are capitalizing on
         | all of the user-generated content that users submitted which
         | made them rich in the first place, just to crap on them in the
         | process.
         | 
         | But I think that LLMs are useful, and useful LLMs could
         | definitely be created ethically, it's just that the current AI
         | race has everyone freaking the fuck out. I continue to explore
         | use cases. I find that LLMs have gotten increasingly good at
         | analyzing disassembly, though it varies depending on how well-
         | covered the machine is in its training data. I've also found
         | that LLMs can one-shot useful utilities and do a decent job. I
         | had an LLM one-shot a utility to dump the structure of a simple
         | common file format so I could debug something... It probably
         | only saved me about 15-30 minutes, but still, in that case I
         | truly believe it did save me time, as I didn't spend any time
         | tweaking the result; it did compile, and it did work correctly.
         | 
         | It's going to be troublesome to truly measure how good AI is.
         | If you knew nothing about writing emulators, being able to
         | synthesize an NES emulator that can at least boot a game may
         | seem unbelievable, and to be sure it is obviously a stunning
         | accomplishment from a PoV of scaling up LLMs. But what we're
         | seeing is probably more a reflection of very good knowledge
         | rather than very good intelligence. If we didn't have much
         | written online about the NES or emulators at all, then it would
         | be truly world-bending to have an AI model figure out
         | everything it needs to know to write one on-the-fly. Humans can
         | actually do stuff like that, which we know because humans _had_
         | to do stuff like that. Today, I reckon most people rarely get
         | the chance to show off that they are capable of novel thought
         | _because_ there are so many other humans that had to do novel
         | thinking before them. _Being able_ to do novel thinking
         | effectively when needed is currently still a big gap between
         | humans and AI, among others.
        
           | stOneskull wrote:
           | i think google is going to repeat history with gemini.. as in
           | chatgpt, grok, etc will be like altavista, lycos, etc
        
         | ninetyninenine wrote:
         | I'm skeptical because my entire identity is basically built
         | around being a software engineer and thinking my IQ and
         | intelligence is higher than other people. If this AI stuff is
         | real then it basically destroys my entire identity so I choose
         | the most convenient conclusion.
         | 
         | Basically we all know that AI is just a stochastic parrot
         | autocomplete. That's all it is. Anyone who doesn't agree with
         | me is of lesser intelligence and I feel the need to inform them
         | of things that are obvious: AI is not a human, it does not have
         | emotions. It just a search engine. Those people who are using
         | AI to code and do things that are indistinguishable from human
         | reasoning are liars. I choose to focus on what AI gets wrong,
         | like hallucinations, while ignoring the things it gets right.
        
           | hju22_-3 wrote:
           | > [...] my entire identity is basically built around [...]
           | thinking my IQ and intelligence is higher than other people.
           | 
           | Well, there's your first problem.
        
             | vintermann wrote:
             | I don't know, that's commendable self-insight, it's true of
             | lots and lots of people but there are few who would admit
             | it!
        
               | ninetyninenine wrote:
               | I am unique. Totally. It is not like HN is flooded with
               | cognition or psychology or IQ articles every other hour.
               | Not at all. And whenever one shows up, you do not
               | immediately get a parade of people diagnosing themselves
               | with whatever the headline says. Never happens. You post
               | something about slow thinking and suddenly half the
               | thread whispers "that is literally me." You post
               | something about fast thinking and the other half says
               | "finally someone understands my brain." You post
               | something about overthinking and everyone shows up with
               | "wow I feel so seen." You post something about attention
               | and now the entire site has ADHD.
               | 
               | But yes. I am the unique one.
        
               | vintermann wrote:
               | Ah, so you were just attempting sarcasm?
        
               | tptacek wrote:
               | HN is not in fact flooded with cognition, psychology, and
               | IQ articles every other hour.
        
               | ninetyninenine wrote:
               | There was more prior to AI but yes I exaggerated it. I
               | mean it's obvious right? The title of this page is hacker
               | so it must be tech related articles every hour.
               | 
               | But articles on IQ and cognition and psychology are
               | extremely common in HN. Enough to be noticeably out of
               | place.
        
               | tptacek wrote:
               | They are actually not really all that common at all. We
               | get 1, maybe 2 in a busy month.
        
           | twoodfin wrote:
           | This kind of comment certainly shows that no organic
           | stochastic parrots post to hn threads!
        
         | otherdave wrote:
         | Where can I find these Conquistador documents? Sounds like
         | something I might like to read and explore.
        
           | throwup238 wrote:
           | See here: https://news.ycombinator.com/item?id=45933750
        
         | dotancohen wrote:
         | My language does not use Latin letters, but they are separate
         | letters. Is there a way to train some handwriting recognition
         | on my own handwriting in my own language, such that it will be
         | effective and useful? I mostly need to recognize text in PDF
         | documents, generated by writing on an e-ink tablet with an EMR
         | stylus.
        
       | netsharc wrote:
       | Author says "It is the most amazing thing I have seen an LLM do,
       | and it was unprompted, entirely accidental." and then jumps back
       | to the "beginning of the story". Including talking about a trip
       | to Canada.
       | 
       | Skip to the section headed "The Ultimate Test" for the resolution
       | of the clickbait of "the most amazing thing...". (According to
       | him, it correctly interpreted a line in an 18th century merchant
       | ledger using maths and logic)
        
         | appreciatorBus wrote:
         | The new model may or may not be great at handwriting but I
         | found the author's constant repetition about how amazing it was
         | irritating enough to stop reading and to wonder if the article
         | itself was slop-written.
         | 
         | "users have reported some truly wild things" "the results were
         | shocking" "the most amazing thing I have seen an LLM do"
         | "exciting and frightening all at once" "the most astounding
         | result I have ever seen" "made the hair stand up on the back of
         | my neck"
        
           | bitwize wrote:
           | You're never gonna believe #6!
        
       | bgwalter wrote:
       | No, just another academic with the ominous handle
       | @generativehistory that is beguiled by "AI". It is strange that
       | others can never reproduce such amazing feats.
        
         | pksebben wrote:
         | I don't know if I'd call it an 'amazing feat', but claude had
         | me pause for a moment recently.
         | 
         | Some time ago, I'd been working on a framework that involved a
         | series of servers (not the only one I've talked to claude
         | about) that had to pass messages around in a particular
         | fashion. Mostly technical implementation details and occasional
         | questions about architecture.
         | 
         | Fast forward a ways, and on a lark I decided to ask in the
         | abstract about the best way to structure such an interaction.
         | Mark that this was not in the same chat or project and didn't
         | have any identifying information about the original, save for
         | the structure of the abstraction (in this case, a message bus
         | server and some translation and processing services, all
         | accessed via client.)
         | 
         | so:
         | 
         | - we were far enough removed that the whole conversation
         | pertaining to the original was for sure not in the context
         | window
         | 
         | - we only referred to the abstraction (with like a
         | A=>B=>C=>B=>A kind of notation and a very brief question)
         | 
         | - most of the work on the original was in claude code
         | 
         | and it knew. In the answer it gave, it mentioned the project by
         | name. I can think of only two ways this could have happened:
         | 
         | - they are doing some real fancy tricks to cram your entire
         | corpus of chat history into the current context somehow
         | 
         | - the model has access to some kind of fact database _where it
         | was keeping an effective enough abstraction to make the
         | connection_
         | 
         | I find either one mindblowing for different reasons.
        
           | zahlman wrote:
           | Are you sure it isn't just a case of a write-up of the
           | project appearing in the training data?
        
             | pksebben wrote:
             | was my own project, so i don't see how it could have been.
             | Private repo, unfinished, i gave it the name.
        
           | omega3 wrote:
           | Perhaps you have the memory feature enabled:
           | https://support.claude.com/en/articles/11817273-using-
           | claude...
        
             | pksebben wrote:
             | I probably do, and this is what I think happened. Mind you,
             | it's not magic, but to hold that information with enough
             | fidelity to pattern-match the structure of the underlying
             | function was something I would find remarkable. It's a leap
             | from a lot of the patterns I'm used to.
        
       | pavlov wrote:
       | I've seen those A/B choices on Google AI Studio recently, and
       | there wasn't a substantial difference between the outputs. It
       | felt more like a different random seed for the same model.
       | 
       | Of course it's very possible my use case wasn't terribly
       | interesting so it wouldn't reveal model differences, or that it
       | was a different A/B test.
        
         | jeffbee wrote:
         | For me they've been very similar, except in one case where I
         | corrected it and on one side it doubled down on being
         | objectively wrong, and on the other side it took my feedback
         | and started over with a new line of thinking.
        
       | thatoneengineer wrote:
       | https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...
        
         | lproven wrote:
         | You beat me to it.
        
       | efitz wrote:
       | I haven't seen this new google model but now must try it out.
       | 
       | I will say that other frontier models are starting to surprise me
       | with their reasoning/understanding- I really have a hard time
       | making (or believing) the argument that they are just predicting
       | the next word.
       | 
       | I've been using Claude Code heavily since April; Sonnet 4.5
       | frequently surprises me.
       | 
       | Two days ago I told the AI to read all the documentation from my
       | 5 projects related to a tool I'm building, and create a wiki,
       | focused on audience and task.
       | 
       | I'm hand reviewing the 50 wiki pages it created, but overall it
       | did a great job.
       | 
       | I got frustrated about one issue: I have a github issue to create
       | a way to integrate with issue trackers (like Jira), but it's
       | TODO, and the AI featured on the home page that we had issue
       | tracker integration. It created a page for it and everything; I
       | figured it was hallucinating.
       | 
       | I went to edit the page and replace it with placeholder text and
       | was shocked that the LLM had (unprompted) figured out how to use
       | existing features to integrate with issue trackers, and wrote
       | sample code for GitHub, Jira and Slack (notifications). That
       | truly surprised me.
        
         | energy123 wrote:
         | Predicting the next word requires understanding, they're not
         | separate things. If you don't know what comes after the next
         | word, then you don't know what the next word should be. So the
         | task implicitly forces a more long-horizon understanding of the
         | future sequence.
        
           | IAmGraydon wrote:
           | This is utterly wrong. Predicting the next word requires a
           | large sample of data made into a statistical model. It has
           | nothing to do with "understanding", which implies it knows
           | why rather than what.
        
             | orionsbelt wrote:
             | Ilya Sustkever was on a podcast, saying to imagine a
             | mystery novel where at the end it says "and the killer is:
             | (name)". Saying it's just a statistical model generating
             | the next most likely word, how can it do that in this case
             | if it doesn't have some understanding of all the clues,
             | etc. A specific name is not statistically likely to appear
        
               | shwaj wrote:
               | Can current LLMs actually do that, though? What Ilya
               | posed was a thought experiment: if it could do that, then
               | we would say that it has understanding. But AFAIK that is
               | beyond current capabilities.
        
               | krackers wrote:
               | Someone should try it and create a new "mysterybench".
               | Find all mystery novels written after LLM training
               | cutoff, and see how many models unravel the mystery
        
               | IAmGraydon wrote:
               | It can't do that without the answer to who did it being
               | in the training data. I think the reason people keep
               | falling for this illusion is that they can't really
               | imagine how vast the training dataset is. In all cases
               | where it appears to answer a question like the one you
               | posed, it's regurgitating the answer from its training
               | data in a way that creates an illusion of using logic to
               | answer it.
        
               | CamperBob2 wrote:
               | _It can 't do that without the answer to who did it being
               | in the training data._
               | 
               | Try it. Write a simple original mystery story, and then
               | ask a good model to solve it.
               | 
               | This isn't your father's Chinese Room. It couldn't solve
               | original brainteasers and puzzles if it were.
        
               | dyauspitr wrote:
               | That's not true, at all.
        
               | IAmGraydon wrote:
               | Please...go on.
        
               | squigz wrote:
               | This implies understanding of preceding tokens, no? GP
               | was saying they have understanding of future tokens.
        
               | nicpottier wrote:
               | I once was chatting with an author of books (very much an
               | amateur) and he said he enjoyed writing because he liked
               | discovering where the story goes. IE, he starts and
               | builds characters and creates scenarios for them and at
               | some point the story kind of takes over, there is only
               | one way a character can act based on what was previously
               | written, but it wasn't preordained. That's why he liked
               | it, it was a discovery to him.
               | 
               | I'm not saying this is the right way to write a book but
               | it is a way some people write at least! And one LLMs seem
               | capable of doing. (though isn't a book outline pretty
               | much the same as a coding plan and well within their
               | wheelhouse?)
        
             | astrange wrote:
             | If you're claiming a transformer model is a Markov chain,
             | this is easily disprovable by, eg, asking the model why it
             | isn't a Markov chain!
             | 
             | But here is a really big one of those if you want it:
             | https://arxiv.org/abs/2401.17377
        
             | nl wrote:
             | Modern LLMs are post trained for tasks other than next word
             | prediction.
             | 
             | They still output words through (except for multi-modal
             | LLMs) so that does involve next word generation.
        
             | Workaccount2 wrote:
             | "Understanding" is just a trap to get wrapped up in. A word
             | with no definition and no test to prove it.
             | 
             | Whether or not the model are "understanding" is ultimately
             | immaterial, as their ability to do things is all that
             | matters.
        
               | pinnochio wrote:
               | If they can't do things that require understanding, it's
               | material, bub.
               | 
               | And just because you have no understanding of what
               | "understanding" means, doesn't mean nobody does.
        
               | red75prime wrote:
               | > doesn't mean nobody does
               | 
               | If it's not a functional understating that allows to
               | replicate functionality of understanding, is it the real
               | understanding?
        
             | dyauspitr wrote:
             | The line between understanding and "large sample of data
             | made into a statistical model" is kind of fuzzy.
        
           | HarHarVeryFunny wrote:
           | > Predicting the next word requires understanding
           | 
           | If we were talking about humans trying to predict next word,
           | that would be true.
           | 
           | There is no reason to suppose than an LLM is doing anything
           | other than deep pattern prediction pursuant to, and no better
           | than needed for, next word prediction.
        
             | CamperBob2 wrote:
             | How'd _you_ do at the International Math Olympiad this
             | year?
        
               | HarHarVeryFunny wrote:
               | I hear the LLM was able to parrot fragments of the stuff
               | it was trained to memorize, and did very well
        
               | CamperBob2 wrote:
               | Yeah, that must be it.
        
               | cxvrfr wrote:
               | Well being able to extrapolate solutions to "novel"
               | mathematical exercises based on a very large sample of
               | similar tasks in your dataset seems like a reasonable
               | explanation.
               | 
               | Question is how well it would do if it was trained
               | without those samples?
        
               | CamperBob2 wrote:
               | Gee, I don't know. How would you do at a math competition
               | if you weren't trained with math books? Sample problems
               | and solutions are not sufficient unless you can genuinely
               | apply human-level inductive and deductive reasoning to
               | them. If you don't understand that and agree with it, I
               | don't see a way forward here.
               | 
               | A more interesting question is, how would you do at a
               | math competition if you were taught to read, then left
               | alone in your room with a bunch of math books? You
               | wouldn't get very far at a competition like IMO,
               | calculator or no calculator, unless you happen to be some
               | kind of prodigy at the level of von Neumann or Ramanujan.
        
               | HarHarVeryFunny wrote:
               | > A more interesting question is, how would you do at a
               | math competition if you were taught to read, then left
               | alone in your room with a bunch of math books?
               | 
               | But that isn't how an LLM learnt to solve math olympiad
               | problems. This isn't a base model just trained on a bunch
               | of math books.
               | 
               | The way they get LLMs to be good at specialized things
               | like math olympiad problems is to custom train them for
               | this using reinforcement learning - they give the LLM
               | lots of examples of similar math problems being solved,
               | showing all the individual solution steps, and train on
               | these, rewarding the model when (due to having selected
               | an appropriate sequence of solution steps) it is able
               | itself to correctly solve the problem.
               | 
               | So, it's not a matter of the LLM reading a bunch of math
               | books and then being expert at math reasoning and problem
               | solving, but more along the lines "of monkey see, monkey
               | do". The LLM was explicitly shown how to step by step
               | solve these problems, then trained extensively until it
               | got it and was able to do it itself. It's probably a
               | reflection of the self-contained and logical nature of
               | math that this works - that the LLM can be trained on one
               | group of problems and the generalizations it has learnt
               | works on unseen problems.
               | 
               | The dream is to be able to teach LLMs to reason more
               | generally, but the reasons this works for math don't
               | generally apply, so it's not clear that this math success
               | can be used to predict future LLM advances in general
               | reasoning.
        
               | CamperBob2 wrote:
               | _The dream is to be able to teach LLMs to reason more
               | generally, but the reasons this works for math don 't
               | generally apply_
               | 
               | Why is that? Any suggestions for further reading that
               | justifies this point?
               | 
               | Ultimately, reinforcement learning _is_ still just a
               | matter of shoveling in more text. Would RL work on
               | humans? Why or why not? How similar is it to what kids
               | are exposed to in school?
        
               | HarHarVeryFunny wrote:
               | An important difference between reinforcement learning
               | (RL) and pre-training is the error feedback that is
               | given. For pre-training the error feedback is just next
               | token prediction error. For RL you need to have a goal in
               | mind (e.g. successfully solving math problems) and the
               | training feedback that is given is the RL "reward" - a
               | measure of how well the model output achieved the goal.
               | 
               | With RL used for LLMs, it's the whole LLM response that
               | is being judged and rewarded (not just the next word), so
               | you might give it a math problem and ask it to solve it,
               | then when it was finished you take the generated answer
               | and check if it is correct or not, and this reward
               | feedback is what allows the RL algorithm to learn to do
               | better.
               | 
               | There are at least two problems with trying to use RL as
               | a way to improve LLM reasoning in the general case.
               | 
               | 1) Unlike math (and also programming) it is not easy to
               | automatically check the solution to most general
               | reasoning problems. With a math problem asking for a
               | numerical answer, you can just check against the known
               | answer, or for a programming task you can just check if
               | the program compiles and the output is correct. In
               | contrast, how do you check the answer to more general
               | problems such "Should NATO expand to include Ukraine?" ?!
               | If you can't define a reward then you can't use RL.
               | People have tried using "LLM as judge" to provide rewards
               | in cases like this (give the LLM response to another LLM,
               | and ask it if it thinks the goal was met), but apparently
               | this does not work very well.
               | 
               | 2) Even if you could provide rewards for more general
               | reasoning problems, and therefore were able to use RL to
               | train the LLM to generate good solutions for those
               | training examples, this is not very useful unless the
               | reasoning it has learnt generalizes to other problems it
               | was not trained on. In narrow logical domains like math
               | and programming this evidentially works very well, but it
               | is far from clear how learning to reason about NATO will
               | help with reasoning about cooking or cutting your cat's
               | nails, and the general solution to reasoning can't be
               | "we'll just train it on every possible question anyone
               | might ever ask"!
               | 
               | I don't have any particular reading suggestions, but
               | these are widely accepted limiting factors to using RL
               | for LLM reasoning.
               | 
               | I don't think RL for humans would work too well, and it's
               | not generally the way we learn, or kids are mostly taught
               | in school. We mostly learn or are taught individual
               | skills and when they can be used, then practice and learn
               | how to combine and apply them. The closest to using RL in
               | school would be if the only feedback an English teacher
               | gave you on your writing assignments was a letter grade,
               | without any commentary, and you had to figure out what
               | you needed to improve!
        
               | cxvrfr wrote:
               | How would you do multiplying 10000 pairs of 100 digit
               | numbers in a limited amount of time? We don't
               | anthropomorphize calculators though...
        
               | CamperBob2 wrote:
               | One problem for your argument is that transformer
               | networks are not, and weren't meant to be, calculators.
               | Their raw numerical calculating abilities are shaky when
               | you don't let them use external tools, but they are also
               | entirely emergent. It turns out that language doesn't
               | just describe logic, it encodes it. Nobody expected that.
               | 
               | To see another problem with your argument, find someone
               | with weak reasoning abilities who is willing to be a test
               | subject. Give them a calculator -- hell, give them a copy
               | of Mathematica -- and send them to IMO, and see how that
               | works out for them.
        
             | famouswaffles wrote:
             | There is plenty reason. This article is just one example of
             | many. People bring it up because LLMs routinely do things
             | we call reasoning when we see them manifest in other
             | humans. Brushing it off as 'deep pattern prediction' is
             | genuinely meaningless. Nobody who uses that phrase in that
             | way can actually explain what they are talking about in a
             | way that can be falsified. It's just vibes. It's an
             | unfalsifiable conversation-stopper, not a real explanation.
             | You can replace "pattern matching" with "magic" and the
             | argument is identical because the phrase isn't actually
             | doing anything.
             | 
             | A - A force is required to lift a ball
             | 
             | B - I see Human-N lifting a ball
             | 
             | C - Obviously, Human-N cannot produce forces
             | 
             | D - Forces are not required to lift a ball
             | 
             | Well sir, why are you so sure Human-N cannot produce
             | forces? How is she lifting the ball ? Well Of course
             | Human-N is just using statistics magic.
        
               | energy123 wrote:
               | Anything can be euphemized. Human intelligence is atoms
               | moving around the brain. General relativity is writing on
               | a piece of paper.
        
               | famouswaffles wrote:
               | If you want to say human and LLM intelligence are both
               | 'deep pattern prediction' then sure, but mostly and
               | certainly in the case I was replying to, people often
               | just use it as a means to make an imaginary unfalsifiable
               | distinction between what LLMs do and what the super
               | special humans do.
        
               | HarHarVeryFunny wrote:
               | You seem to be ignoring two things...
               | 
               | First, the obvious one, is that LLMs are trained to auto-
               | regressively predict human training samples (i.e.
               | essentially to copy them, without overfitting), so OF
               | COURSE they are going to sound like the training set -
               | intelligent, reasoning, understanding, etc, etc. The
               | mistake is to anthropomorphize the model because it
               | sounds human, and associate these attributes of
               | understanding etc to the model itself rather than just
               | reflecting the mental abilities of the humans who wrote
               | the training data.
               | 
               | The second point is perhaps a bit more subtle, and is
               | about the nature of understanding and the differences
               | between what an LLM is predicting and what the human
               | cortex - also a prediction machine - is predicting...
               | 
               | When humans predict, what we're predicting is something
               | external to ourself - the real world. We observe, over
               | time we see regularities, and from this predict we'll
               | continue to see those regularities. Our predictions
               | include our own actions as an input - how will the
               | external world react to our actions, and therefore we
               | learn how to act.
               | 
               | Understanding something means being able to predict how
               | it will behave, both left alone, and in interaction with
               | other objects/agents, including ourselves. Being able to
               | predict what something will do if you poke it is
               | essentially what it means to understand it.
               | 
               | What an LLM is predicting is not the external world and
               | how it reacts to the LLMs actions, since it is auto-
               | regressively trained - it is only predicting a
               | continuation of it's own output (actions) based on it's
               | own immediately preceding output (actions)! The LLM
               | therefore itself understands nothing since it has no
               | grounding for what it is "talking about", and how the
               | external world behaves in reaction to it's own actions.
               | 
               | The LLMs appearance of "understanding" comes solely from
               | the fact that it is mimicking the training data, which
               | was generated by humans who do have agency in the world
               | and understanding of it, but the LLM has no visibility
               | into the generative process of the human mind - only to
               | the artifacts (words) it produces, so the LLM is doomed
               | to operate in a world of words where all it might be
               | considered to "understand" is it's own auto-regressive
               | generative process.
        
               | famouswaffles wrote:
               | You're restating two claims that sound intuitive but
               | don't actually hold up when examined:
               | 
               | 1. "LLMs just mimic the training set, so sounding like
               | they understand doesn't imply understanding."
               | 
               | This is the magic argument reskinned. Transformers aren't
               | copying strings, they're constructing latent
               | representations that capture relationships, abstractions,
               | and causal structure because doing so reduces loss. We
               | know this not by philosophy, but because mechanistic
               | interpretability has repeatedly uncovered internal
               | circuits representing world states, physics, game
               | dynamics, logic operators, and agent modeling. "It's just
               | next-token prediction" does not prevent any of that from
               | occurring. When an LLM performs multi-step reasoning,
               | corrects its own mistakes, or solves novel problems not
               | seen in training, calling the behavior "mimicry" explains
               | nothing. It's essentially saying "the model can do it,
               | but not for the reasons we'd accept," without specifying
               | what evidence would ever convince you otherwise.
               | Imaginary distinction.
               | 
               | 2. "Humans predict the world, but LLMs only predict text,
               | so humans understand but LLMs don't."
               | 
               | This is a distinction without the force you think it has.
               | Humans also learn from sensory streams over which they
               | have no privileged insight into the generative process.
               | Humans do not know the "real world"; they learn patterns
               | in their sensory data. The fact that the data stream for
               | LLMs consists of text rather than photons doesn't negate
               | the emergence of internal models. An internal model of
               | how text-described worlds behave is still a model of the
               | world.
               | 
               | If your standard for "understanding" is "being able to
               | successfully predict consequences within some domain,"
               | then LLMs meet that standard, just in the domains they
               | were trained on, and today's state of the art is trained
               | on more than just text.
               | 
               | You conclude that "therefore the LLM understands
               | nothing." But that's an all-or-nothing claim that doesn't
               | follow from your premises. A lack of sensorimotor
               | grounding limits what kinds of understanding the system
               | can acquire; it does not eliminate all possible forms of
               | understanding.
               | 
               | Wouldn't the birds that have the ability to navigate from
               | the earth's magnetic field soon say humans have no
               | understanding of electromagnetism ? They get trained on
               | sensorimotor data humans will never be able to train on.
               | If you think humans have access to the "real world" then
               | think again. They have a tiny, extremely filtered slice
               | of it.
               | 
               | Saying "it understands nothing because autoregression" is
               | just another unfalsifiable claim dressed as an
               | explanation.
        
               | HarHarVeryFunny wrote:
               | > This is the magic argument reskinned. Transformers
               | aren't copying strings, they're constructing latent
               | representations that capture relationships, abstractions,
               | and causal structure because doing so reduces loss.
               | 
               | Sure (to the second part), but the latent representations
               | aren't the same as a humans. The human's world that they
               | have experience with, and therefore representations of,
               | is the real word. The LLM's world that they have
               | experience with, and therefore representations of, is the
               | world of words.
               | 
               | Of course an LLM isn't literally copying - it has learnt
               | a sequence of layer-wise next-token
               | predictions/generations (copying of partial embeddings to
               | next token via induction heads etc), with each layer
               | having learnt what patterns in the layer below it needs
               | to attend to, to minimize prediction error at that layer.
               | You can characterize these patterns (latent
               | representations) in various ways, but at the end of the
               | day they are derived from the world of words it is
               | trained on, and are only going to be as good/abstract as
               | next token error minimization allows. These
               | patterns/latent representations (the "world model" of the
               | LLM if you like) are going to be language-based (incl
               | language-based generalizations), not the same as the
               | unseen world model of the humans who generated that
               | language, whose world model describes something
               | completely different - predictions of sensory inputs and
               | causal responses.
               | 
               | So, yes, there is plenty of depth and nuance to the
               | internal representations of an LLM, but no logical reason
               | to think that the "world model" of an LLM is similar to
               | the "world model" of a human since they live in different
               | worlds, and any "understanding" the LLM itself can be
               | considered as having is going to be based on it's own
               | world model.
               | 
               | > Saying "it understands nothing because autoregression"
               | is just another unfalsifiable claim dressed as an
               | explanation.
               | 
               | I disagree. It comes down to how do you define
               | understanding. A human understands (correctly predicts)
               | how the real world behaves, and the effect it's own
               | actions will have on the real world. This is what the
               | human is predicting.
               | 
               | What an LLM is predicting is effectively "what will I say
               | next" after "the cat sat on the". The human might see a
               | cat and based on circumstances and experience of cats
               | predict that the cat will sit on the mat. This is because
               | the human understands cats. The LLM may predict the next
               | word as "mat", but this does not reflect any
               | understanding of cats - it is just a statistical word
               | prediction based on the word sequences it was trained on,
               | notwithstanding that this prediction is based on the LLMs
               | world-of-words-model.
        
               | famouswaffles wrote:
               | >So, yes, there is plenty of depth and nuance to the
               | internal representations of an LLM, but no logical reason
               | to think that the "world model" of an LLM is similar to
               | the "world model" of a human since they live in different
               | worlds, and any "understanding" the LLM itself can be
               | considered as having is going to be based on it's own
               | world model.
               | 
               | So LLMs and Humans are different and have different
               | sensory inputs. So what ? This is all animals. You think
               | dolphins and orcas are not intelligent and don't
               | understand things ?
               | 
               | >What an LLM is predicting is effectively "what will I
               | say next" after "the cat sat on the". The human might see
               | a cat and based on circumstances and experience of cats
               | predict that the cat will sit on the mat.
               | 
               | Genuinely don't understand how you can actually believe
               | this. A human who predicts mat does so because of the
               | popular phrase. That's it. There is no reason to predict
               | it over the numerous things cats regularly sit on, often
               | much more so the mats (if you even have one). It's not
               | because of any super special understanding of cats. You
               | are doing the same thing the LLM is doing here.
        
               | HarHarVeryFunny wrote:
               | > You think dolphins and orcas are not intelligent and
               | don't understand things ?
               | 
               | Not sure where you got that non-secitur from ...
               | 
               | I would expect most animal intelligence (incl. humans) to
               | be very similar, since their brains are very similar.
               | 
               | Orcas are animals.
               | 
               | LLMs are not animals.
        
               | famouswaffles wrote:
               | Orca and human brains are similar, in the sense we have a
               | common ancestor if you look back far enough, but they are
               | still very different and focus on entirely different
               | slices of reality and input than humans will ever do.
               | It's not something you can brush off if you really
               | believe in input supremacy so much.
               | 
               | From the orca's perspective, many of the things we say we
               | understand are similarly '2nd hand hearsay'.
        
               | HarHarVeryFunny wrote:
               | Regarding cats on mats ...
               | 
               | If you ask a human to complete the phrase "the cat sat on
               | the", they will probably answer "mat". This is
               | memorization, not understanding. The LLM can do this too.
               | 
               | If you just input "the cat sat on the" to an LLM, it will
               | also likely just answer "mat" since this is what LLMs do
               | - they are next-word input continuers.
               | 
               | If you said "the sat sat on the" to a human, they would
               | probably respond "huh?" or "who the hell knows!", since
               | the human understands that cats are fickle creatures and
               | that partial sentences are not the conversational norm.
               | 
               | If you ask an LLM to explain it's understanding of cats,
               | it will happily reply, but the output will not be it's
               | own understanding of cats - it will be parroting some
               | human opinion(s) it got from the training set. It has no
               | first hand understanding, only 2nd hand heresay.
        
               | famouswaffles wrote:
               | >If you said "the sat sat on the" to a human, they would
               | probably respond "huh?" or "who the hell knows!", since
               | the human understands that cats are fickle creatures and
               | that partial sentences are not the conversational norm.
               | 
               | I'm not sure what you're getting at here ? You think LLMs
               | don't similarly answer 'What are you trying to say?'.
               | Sometimes I wonder if the people who propose these gotcha
               | questions ever bother to actually test them on said LLMs.
               | 
               | >If you ask an LLM to explain it's understanding of cats,
               | it will happily reply, but the output will not be it's
               | own understanding of cats - it will be parroting some
               | human opinion(s) it got from the training set. It has no
               | first hand understanding, only 2nd hand heresay.
               | 
               | Again, you're not making the distinction you think you
               | are. Understanding from '2nd hand heresay' is still
               | understanding. The vast majority of what humans learn in
               | school is such.
        
               | HarHarVeryFunny wrote:
               | > Sometimes I wonder if the people who propose these
               | gotcha questions ever bother to actually test them on
               | said LLMs
               | 
               | Since you asked, yes, Claude responds "mat", then asks if
               | I want it to "continue the story".
               | 
               | Of course if you know anything about LLMs you should
               | realize that they are just input continuers, and any
               | conversational skills comes from post training. To an LLM
               | a question is just an input whose human-preferred (as
               | well as statistically most likely) continuation is a
               | corresponding answer.
               | 
               | I'm not sure why you regard this as a "gotcha" question.
               | If you're expressing opinions on LLMs, then table stakes
               | should be to have a basic understanding of LLMs - what
               | they are internally, how they work, and how they are
               | trained, etc. If you find a description of LLMs as input-
               | continuers in the least bit contentious then I'm sorry to
               | say you completely fail to understand them - this is
               | literally what they are trained to do. The only thing
               | they are trained to do.
        
         | astrange wrote:
         | Predicting the next word is the interface, not the
         | implementation.
         | 
         | (It's a pretty constraining interface though - the model
         | outputs an entire distribution and then we instantly lose it by
         | only choosing one token from it.)
        
         | charcircuit wrote:
         | It's trying to maximize a reward function. It's not just
         | predicting the next word.
        
         | schiffern wrote:
         | >I really have a hard time making (or believing) the argument
         | that they are just predicting the next word.
         | 
         | It's true, but by the same token our brain is "just"
         | thresholding spike rates.
        
       | conception wrote:
       | I will note that 2.5 pro preview... march? Was maybe the best
       | model I've used yet. The actual release model was... less. I
       | suspect Google found the preview too expensive and optimized it
       | down but it was interesting to see there was some hidden
       | horsepower there. Google has always been poised to be the AI
       | leader/winner - excited to see if this is fluff or the real deal
       | or another preview that gets nerfed.
        
         | muixoozie wrote:
         | Dunno if you're right, but I'd like to point out that I've been
         | reading comments like these about every model since GPT 3. It's
         | just starting to seem more likely to me to be a cognitive bias
         | than not.
        
           | conception wrote:
           | I haven't noticed the effect of things getting worse after a
           | release but definitely 2.5's abilities got worse. Or perhaps
           | they optimized for something else? But I haven't noticed the
           | usual "things got worse after release!" Except for when
           | sonnet had a bug for a month and gpt5's autorouter broke.
        
             | muixoozie wrote:
             | Yea I don't know. I didn't mean to sound accusatory. I
             | might very well be wrong.
        
           | KaoruAoiShiho wrote:
           | Sometimes it is just bias but the 2.5 pro had benchmarks
           | showing the degradation (plus they changed the name every
           | time so it was obviously a different ckpt or model).
        
           | colordrops wrote:
           | Why would you assume cognitive bias? Any evidence? These
           | things are indeed very expensive to run, and are often run at
           | a loss. Wouldn't quantization or other tuning be just as
           | reasonable of an answer as cognitive bias? It's not like we
           | are talking about reptilian aliens running the whitehouse.
        
             | muixoozie wrote:
             | I'm just pointing out a personal observation. Completely
             | anecdotal. FWIW, I don't strongly believe this. I have at
             | least noticed a selection bias (maybe) in myself too as
             | recently as yesterday after GPT 5.1 was released. I asked
             | codex to do a simple change (less than 50LOC) and it made a
             | unrelated change, an early return statement, breaking a
             | very simple state machine that goes from waiting ->
             | evaluate -> done. However, I have to remind myself how
             | often LLMs make dumb mistakes despite often seeming
             | impressive.
        
               | oasisbob wrote:
               | That sounds more like availability bias, not selection
               | bias.
        
         | oasisbob wrote:
         | I noticed the degradation when Gemini stopped being a good
         | research tool, and made me want to strangle it on a daily
         | basis.
         | 
         | It's incredibly frustrating to have a model start to
         | hallucinate sources and be incapable of revisiting its
         | behavior.
         | 
         | Couldn't even understand that it was making up non-sensical RFC
         | references.
        
       | Legend2440 wrote:
       | What an unnecessarily wordy article. It could have been a fifth
       | of the length. The actual point is buried under pages and pages
       | of fluff and hyperbole.
        
         | johnwheeler wrote:
         | Yes, and I agree and it seems like the author has a naive
         | experience with LLMs because what he's talking about is kind of
         | the bread and butter as far as I'm concerned
        
           | Al-Khwarizmi wrote:
           | Indeed. To me, it has long been clear that LLMs do things
           | that, at the very least, are indistinguishable from
           | reasoning. The already classic examples where you make them
           | do world modeling (I put an ice cube into a cup, put the cup
           | in a black box, take it into the kitchen, etc... where is the
           | ice cube now?) invalidate the stochastic parrot argument.
           | 
           | But many people in the humanities have read the stochastic
           | parrot argument, it fits their idea of how they prefer things
           | to be, so they take it as true without questioning much.
        
             | Legend2440 wrote:
             | My favorite example: 'can <x> cut through <y>?'
             | 
             | You can put just about anything in there for x and y, and
             | it will almost always get it right. Can a pair of scissors
             | cut through a boeing 747? Can a carrot cut through loose
             | snow? A chainsaw cut through a palm leaf? Nailclippers
             | through a rubber tire?
             | 
             | Because of combinatorics, the space of ways objects can
             | interact is too big to memorize, so it can only answer if
             | it has learned something real about materials and their
             | properties.
        
         | asimilator wrote:
         | Summarize it with an LLM.
        
         | joshdifabio wrote:
         | Yes. I left in frustration and came to the comments for a
         | summary.
        
         | turnsout wrote:
         | So, a Substack article then
        
         | ThrowawayTestr wrote:
         | I'd expect nothing less from a historian
        
         | mmaunder wrote:
         | The author is far more fascinated with themselves than with AI.
        
         | falcor84 wrote:
         | I would just suggest that if you want your comment to be more
         | helpful than the article that you're critiquing, you might want
         | to actually quote the part which you believe is "The actual
         | point".
         | 
         | Otherwise you are likely to have people agreeing with you,
         | while they actually had a very different point that they took
         | away.
        
         | _giorgio_ wrote:
         | I missed the point, please point me to it
        
       | observationist wrote:
       | This might just be a handcrafted prompt framework for handwriting
       | recognition tied in with reasoning - do a rough pass, make
       | assumptions and predictions, check assumptions and predictions,
       | if they pass, use the degree of confidence in their passage to
       | inform what the other characters might be, and gradually flesh
       | out an interpretation of what was intended to be communicated.
       | 
       | If they could get this to occur naturally - with no supporting
       | prompts, and only one-shot or one-shot reasoning, then it could
       | extend to complex composition generally, which would be cool.
        
         | terminalshort wrote:
         | I don't see how this performance could be anything like that.
         | There is no way that Google included specialized system prompts
         | with anything to do with converting shillings to pounds in
         | their model.
        
       | lproven wrote:
       | Betteridge's law _surely_ applies.
        
       | kittikitti wrote:
       | I much prefer this tone about improvements in AI over the
       | doomerism I constantly read. I was waiting for a twist where the
       | author changed their minds and suddenly went "this is the devil's
       | technology" or "THEY T00K OUR JOBS" but it never happened. Thank
       | you for sharing, it felt like breathing for the first time in a
       | long time.
        
       | greekrich92 wrote:
       | Pretty hyperbolic reaction to what seems like a fairly modest
       | improvement
        
       | outside2344 wrote:
       | We are probably just a few weeks away from Google completely
       | wiping OpenAI out.
        
       | xx_ns wrote:
       | Am I missing something here? Colonial merchant ledgers and 18th-
       | century accounting practices have been extensively digitized and
       | discussed in academic literature. The model has almost certainly
       | seen examples where these calculations are broken down or
       | explained. It could be interpolating from similar training
       | examples rather than "reasoning."
        
         | ceroxylon wrote:
         | The author claims that they tried to avoid that: "[. . .] we
         | had to choose them carefully and experiment to ensure that
         | these documents were not already in the LLM training data (full
         | disclosure: we can't know for sure, but we took every
         | reasonable precaution)."
        
           | blharr wrote:
           | Even if that specific document wasn't in the training data,
           | there could be many similar documents from others at the
           | time.
        
       | jumploops wrote:
       | This is exciting news, as I have some elegantly scribed family
       | diaries from the 1800s that I can barely read (:
       | 
       | With that said, the writing here is a bit hyperbolic, as the
       | advances seem like standard improvements, rather than a huge leap
       | or final solution.
        
         | red75prime wrote:
         | Statistics in the article has a low number of samples to make
         | definitive conclusion, but expert-level WER looks like a huge
         | leap.
        
       | phkahler wrote:
       | It's a diffusion model, not autocomplete.
        
       | ghm2199 wrote:
       | I just used AI studio for recognizing text from a relative's 60
       | day log of food ingested 3 times a day. I think I am using
       | models/gemini-flash-latest and it was shockingly good at
       | recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet
       | (IIRC its 4.5) model.
       | 
       | https://pasteboard.co/euHUz2ERKfHP.png
       | 
       | Its response I have captured here
       | https://pasteboard.co/sbC7G9nuD9T9.png is shockingly good. I
       | could only spot 2 mistakes. And those that seems to have been the
       | ones even I could not read or was very difficult for me to make
       | out what the text was.
        
         | ghm2199 wrote:
         | I basically fed it all 60 images 5 at a time and made a table
         | out of them to correlate sugar levels <-> food and colocate it
         | with the person's exercise routines. This is insane.
        
       | neom wrote:
       | I've been complaining on hn for some time now that my only real
       | test of an LLM is that it can help my poor wife with her
       | research, she spends all day every day in small town archives
       | pouring over 18th century American historical documents. I
       | thought maybe that day had come, I showed her the article and she
       | said "good for him I'm still not transcribing important
       | historical documents with a chat bot and nor should he" - ha. If
       | you wanna play around with some difficult stuff here are some
       | images from her work I've posted before:
       | https://s.h4x.club/bLuNed45
        
         | Workaccount2 wrote:
         | People have had spotty access to this model for brief periods
         | (gemini 3 pro) for a few weeks now, but its strongly expected
         | to be released next week, and definitely by year end.
        
           | neom wrote:
           | Oh I didn't realize this wasn't 2.5 pro (I skimmed, sorry) -
           | i also haven't had time to run some of her docs on 5.1 yet, I
           | should.
        
         | HDThoreaun wrote:
         | It doesnt have to be perfect to be useful. If it does a decent
         | job then your wife reviews and edits, that will be much faster
         | than doing the whole thing by hand. The only question is if she
         | can stay committed to perfection. I dont see the downside of
         | trying it unless she's worried about getting lazy.
        
           | neom wrote:
           | I raised this point with her, she said there are times it
           | would be ambiguous for both her and the model, and she thinks
           | it would be dangerous for her to be influenced by it. I'm not
           | a professional historical researcher so I'm not sure if her
           | concern is valid or not.
        
             | HDThoreaun wrote:
             | I think there's a lot of meta thought that deserves to be
             | done about where these new tools fit. It is easy to off
             | handedly reject change, especially as a subject matter
             | expert who can feel they worked so hard to do this and now
             | theyre being replaced so the work was for nothing. I really
             | dont want to say your wife is wrong, she almost assuredly
             | is not. But it is important to have a curious mindset when
             | confronted with ideas you may be biased against. Then she
             | can rest easy knowing she is doing her best to perfect her
             | craft, right? Otherwise she might wake up one day feeling
             | like symbolic NLP researchers trying LLMs for the first
             | time. Certainly a lot to consider.
        
               | neom wrote:
               | I really appreciate your thoughtful reply. I try my best
               | to be encouraging and educating without being preachy or
               | condescending with my wife on this subject. I read hn, I
               | see the posts of folks in, frankly what reads like
               | anguish, about having a tool replace their expertise. I
               | feel really, sad? about it. It's interesting to be
               | confronted with it here (a place I love!) and at home (a
               | place I love!) in quite different context. I've also
               | never been particularly good at becoming good at
               | something, I can't do very much, and genai is really
               | exciting for me, I'm both drawn to and have love for
               | experts so... This whole thing generally has been keeping
               | me up at night a bit, because I feel anguish for the
               | anguish.
        
             | fooker wrote:
             | As a scientist, I don't think this is valid or useful. It's
             | very much a first year PhD line of thought that academia
             | stamps out of you.
             | 
             | This is the 'RE' in research, you specifically want to know
             | and understand what others think of something by reading
             | others' papers. The scientific training slowly, laboriously
             | prepares you to reason about something without being too
             | influenced by it.
        
         | Huppie wrote:
         | While it's of course a good thing to be critical the author did
         | provide some more context on the why and how of doing it with
         | LLM's on the hard fork podcast today [0]: mostly as a way to
         | see how these models _can_ help them with these tasks.
         | 
         | I would recommend listening to their explanation, maybe it'll
         | give more insight.
         | 
         | Disclosure: After listening the podcast and looking up and
         | reading the article I emailed @dang to suggest it goes into the
         | HN second chance pool. I'm glad more people enjoyed it.
         | 
         | [0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-
         | ce...
        
         | potsandpans wrote:
         | > ...of an LLM is that it can help my poor wife with her
         | research, she spends all day every day in small town archives
         | pouring over 18th century American historical documents.
         | 
         | > I'm still not transcribing important historical documents
         | with a chat bot and nor should he
         | 
         | Doesn't sound like she's interested in technology, or wants
         | help.
        
       | mmaunder wrote:
       | Substack: When you have nothing to say and all day to say it.
        
         | mattmaroon wrote:
         | "This AI did something amazing but first I'm going to put in 72
         | paragraphs of details only I care about."
         | 
         | I was thinking as I skimmed this it needs a "jump to recipe"
         | button.
        
         | _giorgio_ wrote:
         | It was an embarrassing read. I should ask an llm to read it
         | since he probably wrote it the same way.
        
       | gcanyon wrote:
       | > So that is essentially the ceiling in terms of accuracy.
       | 
       | I think this is mistaken. I remember... ten years ago? When
       | speech-to-text models came out that dealt with background noise
       | that made the audio sound very much like straight pink noise to
       | my ear, but the model was able to transcribe the speech hidden
       | within at a reasonable accuracy rate.
       | 
       | So with handwritten text, the only prediction that makes sense to
       | me is that we will (potentially) reach a state where the machine
       | is at least probably more accurate than humans, although we
       | wouldn't be able to confirm it ourselves.
       | 
       | But if multiple independent models, say, Gemini 5 and Claude 7,
       | both agree on the result, and a human can only shrug and say,
       | "might be," then we're at a point where the machines are probably
       | superior at the task.
        
         | regularfry wrote:
         | That depends on how good we get at interpretability. If the
         | models can not only do the job but also are structured to
         | permit an explanation of how they did it, we get the
         | confirmation. Or not, if it turns out that the explanation is
         | faulty.
        
       | roywiggins wrote:
       | My task today for LLMs was "can you tell if this MRI brain scan
       | is facing the normal way", and the answer was: no, absolutely
       | not. Opus 4.1 succeeds more than chance, but still not nearly
       | often enough to be useful. They all cheerfully hallucinate the
       | wrong answer, confidently explaining the anatomy they are looking
       | for, but wrong. Maybe Gemini 3 will pull it off.
       | 
       | Now, Claude _did_ vibe code a fairly accurate solution to this
       | using more traditional techniques. This is very impressive on its
       | own but I 'd hoped to be able to just shovel the problem into the
       | VLM and be done with it. It's kind of crazy that we have "AIs"
       | that can't tell even roughly what the orientation of a brain scan
       | is- something a five year old could probably learn to do- but can
       | vibe code something using traditional computer vision techniques
       | to do it.
       | 
       | I suppose it's not _too_ surprising, a visually impaired
       | programmer might find it impossible to do reliably themselves but
       | would code up a solution, but still: it 's weird!
        
         | hopelite wrote:
         | What is the "normal" way? Is that defined in a technical
         | specification? Did you provide the definition/description of
         | what you mean by "normal"?
         | 
         | I would not have expected a language model to perform well on
         | what sounds like a computer vision problem? Even if it was
         | agentic, as you also imply how a five year old could learn how
         | to do it, so too an AI system would need to be trained or at
         | the very least be provided with a description of what is
         | looking at.
         | 
         | Imagine you took an MRI brain scan back in time and showed it
         | to a medical Doctor in even the 1950s or maybe 1900. Do you
         | think they would know what the normal orientation is, let alone
         | what they are looking at?
         | 
         | I am a bit confused and also interested in how people are
         | interacting with AI in general, it really seems to have a
         | tendency to highlight significant holes in all kinds of human
         | epistemological, organizational, and logical structures.
         | 
         | I would suggest maybe you think of it as a kind of child, and
         | with that, you would need to provide as much context and exact
         | detail about the requested task or information as possible.
         | This is what context engineering (are we still calling it
         | that?) concerns itself with.
        
           | roywiggins wrote:
           | The models absolutely do know what the standard orientation
           | is for a scan. They respond extensively about what they're
           | looking for and what the correct orientation would be, more
           | or less accurately. They are aware.
           | 
           | They then give the wrong answer, hallucinating anatomical
           | details in the wrong place, etc. I didn't bother with
           | extensive prompting because it doesn't evince any confusion
           | on the criteria, it just seems to not understand spatial
           | orientations very well, and it seemed unlikely to help.
           | 
           | The thing is that it's very, very simple: an axial slice of a
           | brain is basically egg-shaped. You can work out whether it's
           | pointing vertically (ie, nose pointing to towards the top of
           | the image) or horizontally by looking at it. LLMs will insist
           | it's pointing vertically when it isn't. it's an easy task for
           | someone with eyes.
           | 
           | Essentially all images an LLM will have seen of brains will
           | be in this orientation, which is either a help or a
           | hindrance, and I think in this case a hindrance- it's not
           | that it's seen lots of brains and doesn't know which are
           | correct, it's that it has only ever seen them in the standard
           | orientation and it can't see the trees for the forest, so to
           | speak.
        
         | chrischen wrote:
         | But these models are more like generalists no? Couldn't they
         | simply be hooked up to more specialized models and just defer
         | to them the way coding agents now use tools to assist?
        
           | roywiggins wrote:
           | There would be no point in going via an LLM then, if I had a
           | specialist model ready I'd just invoke it on the images
           | directly. I don't particularly need or want a chatbot for
           | this.
        
         | moritonal wrote:
         | That's fairly unfair comparison. Did you include in the prompt
         | a basic set of instructions about which way is "correct" and
         | what to look for?
        
           | roywiggins wrote:
           | I didn't give a detailed explanation to the model, but I
           | should have been more clear: they all seemed to know what to
           | look for, they wrote explanations of what they were looking
           | for, which were generally correct enough. They still got the
           | answer wrong, hallucinating the locations of the anatomical
           | features they insisted they were looking at.
           | 
           | It's something that you can solve by just treating the brain
           | as roughly egg-shaped and working out which way the pointy
           | end is, or looking for the very obvious bilateral symmetry.
           | You don't really have to know what any of the anatomy
           | actually is.
        
         | IanCal wrote:
         | Most models don't have good spatial information from the
         | images. Gemini models do preprocessing and so are typically
         | better for that. It depends a lot on how things get segmented
         | though.
        
         | lern_too_spel wrote:
         | This might be showing bugs in the training data. It is common
         | to augment image data sets with mirroring, which is cheap and
         | fast.
        
         | fragmede wrote:
         | and then, in a different industry, one that has physical
         | factories, there's this obsession about getting really good at
         | making the machine that makes the machine (product) being the
         | route to success. So it's funny that LLMs being able to write
         | programs to do the thing you want is seen as a failure here.
        
       | MagicMoonlight wrote:
       | It seems like a leap to assume it has done all sorts of complex
       | calculations implicitly.
       | 
       | I looked at the image and immediately noticed that it is written
       | as "14 5" in the original text. It doesn't require calculation to
       | guess that it might be 14 pounds 5 ounces rather than 145.
       | Especially since presumably, that notation was used elsewhere in
       | the document.
        
       | elphinstone wrote:
       | I read the whole article, but have never tried the model. Looking
       | at the input document, I believe the model saw enough of a space
       | between the 14 and 5 to simply treat it that way. I saw the space
       | too. Impressive, but it's a leap to say it saw 145 then used
       | higher order reasoning to correct 145 to 14 and 5.
        
         | Coeur wrote:
         | I also read the whole article, and this behaviour that the
         | author is most excited about only happened once. For a process
         | that inherently has some randomness about it, I feel it's too
         | early to bit this excited.
        
           | afro88 wrote:
           | Yep. A lot of things looked magical in the GPT-4 days.
           | Eventually you realised it did it by chance and more often
           | than not gets it wrong
        
       | AaronNewcomer wrote:
       | The thinking models (especially OpenAI's o3) still seem to do by
       | far the best at this task as they look across the document to see
       | how the writer wrote certain letters where the word is more clear
       | when it runs into confusing words.
       | 
       | I built a whole product around this:
       | https://DocumentTranscribe.com
       | 
       | But I imagine this will keep getting better and that excites me
       | since this was largely built for my own research!
        
         | _giorgio_ wrote:
         | I find Gemini 2.5 pro, not flash, way better than the chatGPT
         | models. I didn't remember testing o3 though. Maybe it's o3 pro
         | and it's one of the old costly and thinking models?
        
         | akudha wrote:
         | Your demo is very well done, love it!
        
       | barremian wrote:
       | > it codes fully functioning Windows and Apple OS clones, 3D
       | design software, Nintendo emulators, and productivity suites from
       | single prompts
       | 
       | > As is so often the case with AI, that is exciting and
       | frightening all at once
       | 
       | > we need to extrapolate from this small example to think more
       | broadly: if this holds the models are about to make similar leaps
       | in any field where visual precision and skilled reasoning must
       | work together required
       | 
       | > this will be a big deal when it's released
       | 
       | > What appears to be happening here is a form of emergent,
       | implicit reasoning, the spontaneous combination of perception,
       | memory, and logic inside a statistical model
       | 
       | > model's ability to make a correct, contextually grounded
       | inference that requires several layers of symbolic reasoning
       | suggests that something new may be happening inside these systems
       | --an emergent form of abstract reasoning that arises not from
       | explicit programming but from scale and complexity itself
       | 
       | Just another post with extreme hyperbolic wording to blow up
       | another model release. How many times have we seen such non-
       | realistic build up in the past couple of years.
        
       | cheevly wrote:
       | Reading HN comments just makes me realize how vastly LLMs exceed
       | human intelligence.
        
         | dang wrote:
         | " _Please don 't sneer, including at the rest of the
         | community._" It's reliably a marker of bad comments and worse
         | threads.
         | 
         | If you know more than others do, that's great, but in that case
         | please share some of what you know so the rest of us can learn.
         | Putting down others only makes this place worse for everyone.
         | 
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
       | _giorgio_ wrote:
       | Gemini 2.5 PRO is already incredibly good in handwritten
       | recognition. It makes maybe one small mistake every 3 pages.
       | 
       | It has completely changed the way I work, and it allows me to
       | write math and text and then convert it with the Gemini app (or
       | with a scanned PDF in the browser). You should really try it.
        
       | sriku wrote:
       | Rgd the "14 lb 5 oz" point in the article, the simpler
       | explanation than the hypothesis there that it back calculated the
       | weight is that there seems to be a space between 14 and 5 - i.e.
       | It reads more like "14 5" than "145"?
        
         | sriku wrote:
         | Impressive performance, yes but is the article giving more
         | credit than due?
        
       | koliber wrote:
       | It hasn't met my doctor.
        
       | Grimblewald wrote:
       | I dunno man, looks like goodharts law in action to me. That isnt
       | to say the models wont be good at what is stated, but it does
       | mean it might not signal a general improvement in competence but
       | rather a targeted gain with more general deficits rising up in
       | untested/ignored areas, some which may or may not be
       | catastrophic. I guess we will see but for now Imma keep my hype
       | in the box.
        
       | lelanthran wrote:
       | > In tabulating the "errors" I saw the most astounding result I
       | have ever seen from an LLM, one that made the hair stand up on
       | the back of my neck. Reading through the text, I saw that Gemini
       | had transcribed a line as "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19
       | 1". If you look at the actual document, you'll see that what is
       | actually written on that line is the following: "To 1 loff Sugar
       | 145 @ 1/4 0 19 1". For those unaware, in the 18th century sugar
       | was sold in a hardened, conical form and Mr. Slitt was a
       | storekeeper buying sugar in bulk to sell. At first glance, this
       | appears to be a hallucinatory error: the model was told to
       | transcribe the text exactly as written but it inserted 14 lb 5 oz
       | which is not in the document.
       | 
       | I read the whole reasoning of the blog author after that, but I
       | still gotta know - how can we tell that this was not a
       | hallucination and/or error? There's a 1/3 chance of an error
       | being correct (either 1 lb 45, 14 lb 5 or 145 lb), so why is the
       | author so sure that this was deliberate?
       | 
       | I feel a good way to test this would be to create an almost
       | identical ledger entry, but in a way so that the correct answer
       | after reasoning (the way the author thinks the model reasoned)
       | has completely different digits.
       | 
       | This way there'd be more confidence that the model itself
       | reasoned and did not make an error.
        
         | yomismoaqui wrote:
         | I implemented a receipt scanner to Google Sheet using Gemini
         | Flash.
         | 
         | The fact that it is "intelligent" it's fine for some things.
         | 
         | For example I created structured output schema that had a field
         | "currency" with the 3 letter format (USD, EUR...). So I scanned
         | a receipt from some shop in Jakarta and it filled that field
         | with IDR (Indonesian Rupiah). It inferred that data because of
         | the city name on the receipt.
         | 
         | Would it be better for my use case that it would have returned
         | no data for the currency field? Don't think so.
         | 
         | Note: if needed maybe I could have changed the prompt to not
         | infer the currency when not explicitly listed on the receipt.
        
           | Someone wrote:
           | > Would it be better for my use case that it would have
           | returned no data for the currency field? Don't think so.
           | 
           | If there's a decent chance it infers the wrong currency,
           | potentially one where the value of each unit is a few units
           | of scale larger or smaller than that of IDR, it might be
           | better to not infer it.
        
           | otabdeveloper4 wrote:
           | > Would it be better for my use case that it would have
           | returned no data for the currency field?
           | 
           | Almost certainly yes.
        
             | DangitBobby wrote:
             | Except in setups where you always check its work, and the
             | effort from the 5% of the time you have to correct the
             | currency is vastly outweighed due to effort saved from the
             | other 95% of the time. Pretty common situation.
        
         | HarHarVeryFunny wrote:
         | Yes, and as the article itself notes, the page image has more
         | than just "145" - there's a "u"-like symbol over the 1, which
         | the model is either failing to notice, or perhaps is something
         | it recognizes from training as indicating pounds.
         | 
         | The article's assumption of how the model ended up
         | "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb
         | 5oz" seems very speculative. It seems more reasonable to assume
         | that a massive frontier model knows something about loaves of
         | sugar and their weight range, and in fact Google search's "AI
         | overview" of "how heavy is a loaf of sugar" says the common
         | size is approximately 14lb.
        
           | wrs wrote:
           | There's also a clear extra space between the 4 and 5, so
           | figuring out to group it as "not 1 45, nor 145 but 14 5"
           | doesn't seem worthy of astonishment.
        
         | drawfloat wrote:
         | If I ask a model to transcribe something exactly and it outputs
         | an interpretation, that is an error and not a success.
        
       | elzbardico wrote:
       | I think the author has become a bit too enthusiastic. "Emerging
       | capabilities" become code for, unexpectedly good results that are
       | statistical serendipity but that I prefer to infer as some hidden
       | capability in a model I can't resist anthropomorphizing.
        
       | dr_dshiv wrote:
       | Is anyone aware of any benchmark evaluation for handwriting
       | recognition? I have not been able to find one, myself -- which is
       | somewhat surprising.
        
       | neves wrote:
       | Of it can read ancient handwriting, it will be a revolution for
       | historians work.
       | 
       | My wife is a historian and she is trained to recognize old
       | handwriting. When we go to museums she"translates" the texts for
       | the family
        
       | inshard wrote:
       | Could it be guessing via orders of magnitude? Like 145 lb * 1/4
       | is high confidence not the answer and 1 lb 45 oz is non standard
       | notation as 1lb = 16oz - so it's most likely 14 lb 5 oz
        
       | th0ma5 wrote:
       | I always have to ask with OCR in a professional context is which
       | digits of the numbers is it allowed to get wrong?
        
       ___________________________________________________________________
       (page generated 2025-11-15 23:00 UTC)