[HN Gopher] AbsenceBench: Language models can't tell what's missing
       ___________________________________________________________________
        
       AbsenceBench: Language models can't tell what's missing
        
       Author : JnBrymn
       Score  : 307 points
       Date   : 2025-06-20 22:26 UTC (1 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | AlienRobot wrote:
        | Unrelated to the paper, which is about asking LLMs to figure out
       | which parts of a document were removed, but my assumption has
       | been that to an LLM there is nothing "missing" in the sense that
       | any input leads to valid computation and output.
       | 
       | For example, I asked ChatGPT to explain something I typed
       | randomly
       | 
       | >It looks like you've entered "dosfi8q3anfdfiqr", which appears
       | to be a random string or perhaps a typo--it's not a recognized
       | acronym, code, or term in any common context I'm aware of. Could
       | you share a bit more about where you found this?
       | 
       | Although the answer is correct, my point is that anything you
       | give to the LLM is going to be put under some bucket. The LLM
       | can't say "I don't know what that is." Instead it says "that is a
       | random string." As far as the LLM is concerned, it knows every
       | possible input and concept that anyone could ever type into it,
       | it's just that its "understanding" of what that means (after the
       | tokens have gone through the neural network) doesn't necessarily
       | match what any human being thinks it means.
        
         | cyral wrote:
         | This might be due to the system prompt and the training that it
         | is supposed to be "a helpful agent". If you tell it not to ask
         | clarifying questions, you get something more like "I do not
         | understand your input". Tell it to be rude and never ask
         | clarifying questions and I get "What an absolute mess. Fix it
         | yourself"
         | 
         | Funny enough when testing this I also had to tell it to use
         | English. It sees "dos" I suppose and tends to reply with
         | exactly what you saw, but in Spanish.
        
         | layer8 wrote:
         | "It's not a recognized acronym, code, or term in any common
         | context I'm aware of" is pretty similar to "I don't know what
         | that is". I would assume that a model could be trained to
         | output the latter.
        
           | drsim wrote:
           | Right. I've had a lot of success using structured output to
           | force LLMs to make Boolean choices, like can they reply or
           | not.
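            | 
            | For example, a minimal sketch, assuming the OpenAI
            | structured-output "response_format" API; the schema and the
            | field names here are just illustrative:
            | 
            |     from openai import OpenAI
            |     
            |     client = OpenAI()
            |     
            |     # Force the model to commit to a Boolean "can_answer"
            |     # field before it writes any reply text.
            |     schema = {
            |         "type": "object",
            |         "properties": {
            |             "can_answer": {"type": "boolean"},
            |             "reply": {"type": "string"},
            |         },
            |         "required": ["can_answer", "reply"],
            |         "additionalProperties": False,
            |     }
            |     
            |     resp = client.chat.completions.create(
            |         model="gpt-4o-mini",
            |         messages=[{"role": "user",
            |                    "content": "dosfi8q3anfdfiqr"}],
            |         response_format={
            |             "type": "json_schema",
            |             "json_schema": {"name": "gate",
            |                             "strict": True,
            |                             "schema": schema},
            |         },
            |     )
            |     print(resp.choices[0].message.content)
            |     # e.g. {"can_answer": false, "reply": "..."}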
        
       | cs702 wrote:
       | Interesting. Even the most recent models perform relatively
       | poorly when asked to identify which information in a context has
       | been removed, given access to both the original and edited
       | contexts.
       | 
       | The authors posit that poor performance is due to the fact that
       | the attention mechanism of Transformers cannot attend to the
       | removed tokens, because there are no keys for them!
       | 
       | Thank you for sharing on HN.
        
         | cyanydeez wrote:
         | for vision models, I wonder if they can train on things like
          | photo negatives, rotated images, etc. Or madlib-like sentences
         | where a Q/A is like "the _____ took first place in the horse
         | show."
        
           | bearseascape wrote:
            | The madlib-like sentences approach is actually how masked
           | token prediction works! It was one of the pretraining tasks
           | for BERT, but nowadays I think all (?) LLMs are trained with
           | next token prediction instead.
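            | 
            | A quick way to see masked token prediction in action, as a
            | sketch using the Hugging Face fill-mask pipeline (the model
            | choice is just an example):
            | 
            |     from transformers import pipeline
            |     
            |     # BERT was pretrained with masked token prediction, so
            |     # it fills in a madlib-style blank directly.
            |     unmasker = pipeline("fill-mask",
            |                         model="bert-base-uncased")
            |     prompt = ("The [MASK] took first place in the "
            |               "horse show.")
            |     for pred in unmasker(prompt):
            |         print(pred["token_str"], round(pred["score"], 3))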
        
           | latency-guy2 wrote:
            | For photo negatives, it usually doesn't matter. I am not up
            | to date with what the vision folks are doing at these
            | companies, but images are usually single-channel, and
            | regular images are more likely than not greyscale. Radar
            | imagery is in the complex domain instead; those are not
            | RGB-based images at all, but defined by scatterers.
            | 
            | Additional channels being recognized in training usually
            | didn't matter for the experiments and models I used to deal
            | with before 2022, and where they did, it certainly wasn't
            | because of color. Then again, the work I was doing was on
            | known classes (plus some additional confusers) for object
            | detection and classification, where color pretty much
            | didn't matter in the first place.
        
         | jug wrote:
         | And yet, there are some notable differences between them, so
         | now that there's a benchmark and attention given to this issue,
         | I wonder how much better they can get. Because obviously
         | something can be done.
        
         | usaar333 wrote:
          | They don't seem to use any recent top models: no Opus, no o3,
          | no Gemini 2.5 Pro.
        
         | yorwba wrote:
         | There are keys to attend to, they're just in the original text
         | instead of the modified one. Since the model receives both as
         | input, it could theoretically attend to those keys.
         | 
          | For the attention mechanism, there isn't much difference
          | between
          | 
          |     Original: {shared prefix} {removed part} {shared suffix}
          |     Modified: {shared prefix} {shared suffix}
          | 
          | and
          | 
          |     Original: {shared prefix} {shared suffix}
          |     Modified: {shared prefix} {added part} {shared suffix}
         | 
         | I think you could implement an algorithm for this in RASP (a
         | language for manually programming transformers) roughly like
         | this:
         | 
         | 1. The first layer uses attention to the "Original:" and
         | "Modified:" tokens to determine whether the current token is in
         | the original or modified parts.
         | 
         | 2. The second layer has one head attend equally to all original
         | tokens, which averages their values, and another head attends
         | equally to all modified tokens, averaging them as well. The
         | averages are combined by computing their difference.
         | 
         | 3. The third layer attends to tokens that are similar to this
         | difference, which would be the ones in the {removed
         | part}/{added part}.
         | 
         | The only ordering-dependent part is whether you compute the
         | difference as _original_average_ - _modified_average_ or the
         | other way around.
         | 
         | If a model can detect additions but not removal, that would
         | show that it is capable of learning this or a similar algorithm
         | in principle, but wasn't trained on enough removal-style data
         | to develop the necessary circuitry.
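          | 
          | A rough numpy sketch of steps 2 and 3 (not RASP, just the
          | averaging idea; the token embeddings are assumed given):
          | 
          |     import numpy as np
          |     
          |     def find_changed_tokens(original, modified, top_k=3):
          |         # original, modified: token embeddings, shape (n, d).
          |         # One head averages the original tokens, another
          |         # averages the modified ones; their difference points
          |         # at the removed/added part, and the last step attends
          |         # to the tokens most similar to that difference.
          |         diff = original.mean(axis=0) - modified.mean(axis=0)
          |         scores = original @ diff
          |         return np.argsort(-scores)[:top_k]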
        
           | ironmanszombie wrote:
            | Thanks for the breakdown. I am far from knowledgeable on AI,
            | but I was wondering why a simple comparison can't work? It
            | can definitely be coded, as you have beautifully
            | demonstrated.
        
       | XenophileJKO wrote:
       | I haven't read the paper yet, but from a structural 'attention'
       | perspective being unable to detect unclassified omissions is
        | completely expected. (Though I think it can be solved with
       | structured thought.)
       | 
       | For needle in a haystack you have to pay attention to the thing
       | that you are trying to find. Attention can do this pretty well.
       | 
        | When looking for an omission, that omission can be anything; you
        | can only reason about it by comparing one whole context to
        | another whole context. The attention layers can't really do that.
        | 
        | This is similar to the "rank a long set of things" problem.
        | Absent some meta-cognition process, they just can't do that.
        
         | teruakohatu wrote:
         | > When looking for an omission, that omission can be anything,
         | 
         | In this benchmark they give the LLM the necessary information
          | to determine what is missing. For example: "Here is a poem;
          | here is a version of that same poem that may or may not be
          | missing lines. Are any lines missing?"
         | 
         | It's more a tuning issue IMHO than an inherent weakness in
         | LLMs.
         | 
          | If I were asked to find an omission in an ML paper, my brain
          | would compare it with other ML papers; it would not need to
          | compare it to Star Wars, Top Gear, Greek history, pottery and
          | the other 1000s of contexts I may know about.
        
           | XenophileJKO wrote:
           | Sorry I meant the omission can be anything in the context,
           | not anything in the world.. lol.
           | 
           | That is still hard. You only have so many attention heads
           | looking for things.. you can't pay attention to EVERYTHING..
           | which is what is required to find the omission.
        
             | yorwba wrote:
             | To pay attention to everything, set the query vector to 0.
             | Then all attention scores will be equal and the attention
             | output is the average of the value vectors.
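              | 
              | In numpy terms (a toy illustration):
              | 
              |     import numpy as np
              |     
              |     def softmax(x):
              |         e = np.exp(x - x.max())
              |         return e / e.sum()
              |     
              |     keys = np.random.randn(5, 8)
              |     values = np.random.randn(5, 8)
              |     query = np.zeros(8)  # zero query vector
              |     
              |     # All scores are 0, so the weights are uniform and
              |     # the output is just the mean of the value vectors.
              |     weights = softmax(keys @ query)
              |     output = weights @ values
              |     assert np.allclose(output, values.mean(axis=0))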
        
           | thaumasiotes wrote:
           | We should note that "where is there a line missing from this
           | poem: ____?" contains sufficient information to answer
           | correctly without needing a copy of the original to compare
           | to.
           | 
           | Here are two verses of a poem (song) in Mandarin Chinese:
           | 
           |  _yi quan ting ni de
           | 
           | er gei ni hao de
           | 
           | shu dao san yong yuan ai ni yi ge
           | 
           | si bu hui fan cuo
           | 
           | wu bu hui luo suo
           | 
           | shuo ni xiang shuo de
           | 
           | zuo ni xiang zuo de
           | 
           | bie pa shi bai yin wei ni you wo
           | 
           | pei ni kan ri luo
           | 
           | pei ni yi qi chang wan wo men ai de ge_
           | 
           | I removed two lines. Where did that happen?
           | 
           | Would your answer be different if I told you that I might or
           | might not have removed some lines?
        
             | teruakohatu wrote:
             | > Here are two verses of a poem (song) in Mandarin Chinese:
             | 
             | > ...
             | 
             | > I removed two lines. Where did that happen?
             | 
              | If you read the paper you will see they provide the
              | original as well as the version with missing information.
             | 
             | I did mention this in my comment too.
             | 
             | I am quite sure I could find your two missing lines if you
             | provide me the full poem.
             | 
              | Given that you are a prolific commenter on HN, I am sure
              | an LLM could be fine-tuned to detect missing text from your
             | comments without additional information. For example ...
             | 
             | > WinForms is still around. There have been further tec lly
             | just a big tire fire and about the best you can do is to
             | ignore all of them and develop in WinForms.
             | 
             | It's probably possible to detect missing information from
             | "tec" until "lly". But to know what is between is not
             | possible for a human either, beyond plausible guesses.
        
               | thaumasiotes wrote:
               | ...did you read my comment? The first - and really only -
               | thing I say is that the original isn't necessary. Then
               | there's an example. You shouldn't have trouble
               | identifying where lines have been removed from the
               | Chinese poem.
               | 
               | The fact that the original was provided doesn't
               | demonstrate that it's necessary to the task. You can
               | identify missing text without needing to know what was
               | there.
               | 
               | > Given that you are a prolific commenter on HN, I am
               | sure a LLM could be fine tuned to detect missing text
               | from your comments without additional information.
               | 
               | Same thing. Why would you need to do tuning on text
               | authored by me? You can easily detect missing text of
               | that style by the fact that the sentence you have fails
               | to be English. You can do the same thing in text for
               | which you have no prior experience of the author.
               | 
               | > I am quite sure I could find your two missing lines if
               | you provide me the full poem.
               | 
               | But hey, if you insist:
               | 
               | Qing Qing Tie Jin Ni De Er Duo
               | 
               | saranghaeyo
               | 
               | Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
               | 
               | Yi Quan Ting Ni De
               | 
               | Er Gei Ni Hao De
               | 
               | Shu Dao San Yong Yuan Ai Ni Yi Ge
               | 
               | Si Bu Hui Fan Cuo
               | 
               | Wu Bu Hui Luo Suo
               | 
               | Mei Tian Wei Ni Da  call, cook Ye Bu Cuo
               | 
               | Qing Qing Tie Jin Ni De Er Duo
               | 
               | saranghaeyo
               | 
               | Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
               | 
               | Da Kai Ni De Ai Qing Shou Ce
               | 
               | Jiu Zai Ci Ke
               | 
               | Wei Ni Chang De Zhuan Shu Qing Ge Yao Ji De
               | 
               | Shuo Ni Xiang Shuo De
               | 
               | Zuo Ni Xiang Zuo De
               | 
               | Bie Pa Shi Bai Yin Wei Ni You Wo
               | 
               | Pei Ni Kan Ri Luo
               | 
               | Pei Ni Deng Yu Guo
               | 
               | Pei Ni Yi Qi Chang Wan Wo Men Ai De Ge
               | 
               | Qing Qing Tie Jin Ni De Er Duo
               | 
               | saranghaeyo
               | 
               | Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
               | 
               | Da Kai Ni De Ai Qing Shou Ce
               | 
               | Jiu Zai Ci Ke
               | 
               | Wei Ni Chang De Zhuan Shu Qing Ge Yao Ji De
               | 
               | Wo Qing Qing Kao Jin Ni De Er Duo
               | 
               | Shuo Ai Ni Bu Xian Tai Duo
               | 
               | Ru Guo Xiang Yu De Ji Lu Yi Mo Fen Zhi Yi Na Yao Duo
               | 
               | Qing Xiang Xin Wo De Zhen Zhen Zhen Xin Bi Yu Zhou Huan
               | Liao Kuo
               | 
               | Wo Hui Qian Zhao Ni De Shou Zhi Dao Ni Quan Bu Jie Shou
               | 
               | Da Kai Ni De Ai Qing Shou Ce
               | 
               | Jiu Zai Ci Ke
               | 
               | Zhe Shou Zhuan Shu Qing Ge  Qing Ji De
        
             | niemandhier wrote:
              | I'll take the bait :-). Endings of lines seem to come in
              | pairs (de, de; cuo, suo; de, de; wo, luo).
              | 
              | I'd therefore conjecture that lines are missing after 'ge'
              | and 'ge'.
              | 
              | This of course assumes Chinese poetry is based on matching
              | vowels, as is e.g. the case in German, and not on rhythm,
              | as would be the case in Latin and Arabic.
        
             | meltyness wrote:
             | Two lines clearly deviate from AAB.
        
       | pkoird wrote:
       | So LLMs are poor at string diff, it seems. Tangentially, is there
       | any source (a github repo or otherwise) that documents findings
       | like these a la what LLMs are good at and what they aren't good
       | at?
        
       | birdfood wrote:
       | Perhaps related, after watching a talk by Gerald Sussman I loaded
       | an image of the Kanizsa triangle into Claude and asked it a
       | pretty vague question to see if it could "see" the inferred
       | triangle. It recognised the image and went straight into giving
       | me a summary about it. So I rotated the image 90 degrees and
       | tried in a new conversation, it didn't recognise the image and
       | got the number of elements incorrect:
       | 
        | This image shows a minimalist, abstract geometric composition
        | with several elements:
        | 
        | - Four black shapes that appear to be partial circles or
        |   "Pac-Man" like forms, each with a wedge cut out, positioned in
        |   the four corners/quadrants of the image
        | - Two thin black triangular or arrow-like shapes - one pointing
        |   upward in the upper left area, and one pointing to the right
        |   in the center-right area
        | - All elements are arranged on a light gray or off-white
        |   background
        
         | latentsea wrote:
         | I guess they will now just rotate all the images in the
         | training data 90 degrees too to fill this kind of gap.
        
           | recursivecaveat wrote:
            | Everything old is new again: in the AlexNet paper that
            | kicked off the deep learning wave in 2012, they describe
            | horizontally flipping every image as a cheap form of data
            | augmentation. Though now that we expect models to actually
            | read text, that seems potentially counter-productive.
           | Rotations are similar, in that you'd hope it would learn
           | heuristics such as that the sky is almost always at the top.
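            | 
            | The standard recipe today looks roughly like this (a
            | torchvision sketch; the exact transforms are just an
            | example):
            | 
            |     import torchvision.transforms as T
            |     
            |     # Each training image is randomly flipped and slightly
            |     # rotated, so the model sees "new" samples without any
            |     # new data being collected.
            |     augment = T.Compose([
            |         T.RandomHorizontalFlip(p=0.5),
            |         T.RandomRotation(degrees=15),
            |         T.ToTensor(),
            |     ])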
        
             | latency-guy2 wrote:
              | At least from when I was still doing this kind of work,
              | the look angle/platform angle of the scatterer signal
              | (radar) mattered more than rotation, but rotation was a
              | simple way to get quite a bit more samples. It never
              | stopped being relevant :)
        
             | bonoboTP wrote:
              | That's called data augmentation. It was common already
             | before AlexNet. And it never stopped being common, it's
             | still commonly done.
        
           | mirekrusin wrote:
           | That's how you train neural network with synthetic data so it
           | extracts actual meaning.
           | 
            | That's how humans also learn, e.g. adding numbers. First
            | there is naive memoization, followed by more examples until
            | you get it.
            | 
            | LLM training seems to be falling into a memoization trap
            | because models are extremely good at it, orders of magnitude
            | better than humans.
            | 
            | IMHO what is missing in the training process is feedback
            | explaining the wrong answer. What we're currently doing with
            | training is leaving out this understanding as an "exercise
            | to the reader". We're feeding correct answers to specific,
            | individual examples, which promotes memoization.
            | 
            | What we should be doing in post-training is ditch direct
            | backpropagation on the next token; instead, let the model
            | finish its wrong answer, append an explanation of why it's
            | wrong, and continue backpropagation on the final answer -
            | now with the explanation in context to guide it to the right
            | place in understanding.
            | 
            | What all of this means is that current models are largely
            | underutilized and unnecessarily bloated; they contain way
            | too much memoized information. Making a model larger is an
            | easy, quick illusion of improvement. Models need to be
            | squeezed more, and more focus needs to go towards the
            | training flow itself.
        
             | atwrk wrote:
              | _> That's how humans also learn, e.g. adding numbers.
              | First there is naive memoization, followed by more examples
              | until you get it._
             | 
             | Just nitpicking here, but this isn't how humans learn
             | numbers. They start at birth with competency up to about 3
             | or 5 and expand from that. So they can already work with
              | quantities of varying size (i.e. they know which is more,
              | the four apples on the left or the five on the right, and
              | they also know what happens if I take one apple from the
              | left and put it with the others on the right), and _then_
              | they learn the numbers. So yes, they learn the numbers
              | through memorization, but only the signs/symbols, not the
              | numeric competency itself.
        
               | mirekrusin wrote:
                | Turtles all the way down; things like the meaning of
                | "more" are also memoized, e.g. initially as "I want more
                | food" etc., then refined with time. E.g. a kid saying
                | "he's more than me" is corrected by explaining that there
                | needs to be some qualifier for a measurable quantity,
                | e.g. "he's more tall (taller) than me" or "he is more
                | fast (faster) than me" etc.
                | 
                | Using different modalities (like images, videos,
                | voice/sounds instead of pure text) is interesting as
                | well, as it helps complete the meaning, adds a sense of
                | time, etc.
                | 
                | I don't think we're born with any concepts at all; it's
                | all quite chaotic initially, with consistent sensory
                | inputs that we use to train/stabilise our neural network.
                | Newborns, for example, don't even have a concept of
                | separation between "me and the environment around me";
                | it's learned.
        
           | littlestymaar wrote:
           | And it will work.
           | 
            | I just wish the people believing LLMs can actually reason
            | and generalize would see that they don't.
        
             | latentsea wrote:
              | At this point I think all reasoning really means is having
             | seen enough of the right training data to make the correct
             | inferences, and they're just missing some training data.
        
             | ben_w wrote:
             | If that was evidence current AI don't reason, then the
             | Thatcher effect would be evidence that humans don't:
             | https://en.wikipedia.org/wiki/Thatcher_effect
             | 
             | LLMs may or may not "reason", for certain definitions of
             | the word (there are many), but this specific thing doesn't
             | differentiate them from us.
        
               | t-3 wrote:
               | Being tricked by optical illusions is more about the
               | sensory apparatus and image processing faculties than
               | reasoning, but _detecting_ optical illusions is
                | definitely a reasoning task. I doubt it's an important
               | enough task to train into general models though.
        
         | akomtu wrote:
         | To generalise this idea: if we look at a thousand points that
         | more or less fill a triangle, we'll instantly recognize the
         | shape. IMO, this simple example reveals what intelligence is
         | really about. We spot the triangle because so much complexity -
         | a thousand points - fits into a simple, low-entropy geometric
         | shape. What we call IQ is the ceiling of complexity of patterns
         | that we can notice. For example, the thousand dots may in fact
         | represent corners of a 10-dimensional cube, rotated slightly -
         | an easy pattern to see for a 10-d mind.
        
           | saithound wrote:
           | Cool. Since ChatGPT 4o is actually really good at this
            | particular shape identification task, what, if anything, do
           | you conclude about its intelligence?
        
             | JohnKemeny wrote:
              | The entire point here is that LLMs and image recognition
              | software are not managing this task, so they are not really
              | good at this particular shape identification task.
        
               | saithound wrote:
               | No, the post's article is not about the sort of shape
               | identification task discussed by GP. Or indeed any image
               | recognition task: it's a paper about removed context in
               | language.
               | 
               | Fwiw, I did test GP's task on ChatGPT 4o directly before
               | writing my comment. It is as good at it as any human.
        
             | akomtu wrote:
              | Recognizing triangles isn't that impressive. The real
              | question is the ceiling of complexity of the patterns it
              | can identify in data. Give it a list of randomly generated
              | xyz coords that fall on a geometric shape, or a list of
              | points that sample a trajectory of the Earth around the
              | Sun. Will it tell you that it's an ellipse? Will it derive
              | Newton's second law? Will it notice the deviation from the
              | ellipse and find the rule explaining it?
        
         | Workaccount2 wrote:
          | Show any LLM a picture of a dog with 5 legs and watch it be
          | totally unable to count.
        
           | pfdietz wrote:
           | Or watch them channel Abraham Lincoln.
        
         | JohnKemeny wrote:
         | We really don't know how to compute.
         | 
         | Oct 2011, 30 comments.
         | 
         | https://news.ycombinator.com/item?id=3163473
         | 
         | Strange loop video:
         | 
         | July 2011, 36 comments.
         | 
         | https://news.ycombinator.com/item?id=2820118
        
         | iknownothow wrote:
         | As far as I can tell, the paper covers text documents only.
         | Therefore your example doesn't quite apply.
         | 
         | It is well known that LLMs have a ways to go when it comes to
         | processing images like they process text or audio.
         | 
         | I don't think there's any good performing multimodal model that
         | accepts image pixels directly. Most vision capabilities are
         | hacks or engineered in. An image undergoes several processing
         | steps and each processor's outputs are fed to the transformer
          | as tokens. This may happen in one network, but there are non-
          | transformer networks involved. Examples of preprocessing:
          | 
          | * OCR
          | * CNNs (2D pattern recognizers) with different zooms, angles,
          |   slices, etc.
          | * Others maybe too?
        
       | yousif_123123 wrote:
        | This is very interesting.
        | 
        | 1. The authors mention the attention mechanism being perhaps
        | unable to attend to the location of gaps, since the gaps aren't
        | tokens. But I would've expected a good LLM transformer to at
        | least get a bit close to the gap location. I don't understand
        | why mathematically the architecture is less suitable for that;
        | it could attend to a region that may contain gaps. I wonder if
        | fine-tuning on a task like this could help?
        | 
        | 2. Shorter inputs with fewer omissions were harder to solve.
        | That is not completely surprising: as a human doing this task,
        | if 1 word was missing it would be harder to notice, and
        | similarly 1 missing line would be harder than 10 lines. But it
        | is still interesting for an LLM to have this problem.
        | 
        | 3. Reasoning models do better, as they can write out the
        | documents and potentially solve this easily. It is still very
        | surprising that this doesn't lead to 100% accuracy. This should
        | be a trivial task; like the paper says, a trivial program can be
        | written to solve this. Perhaps ChatGPT (or a similar agent)
        | could read this paper while training, and know to write and run
        | Python when solving an issue like this.
        | 
        | The most interesting thing, though, is what other aspects of
        | intelligence we may not have identified explicitly, and whether
        | LLMs and current AI are very bad at them. This paper suggests
        | that there likely are many of those, and it seems in general a
        | pretty fun time for people working on building benchmarks.
        
       | xianshou wrote:
       | In many of their key examples, it would also be unclear to a
       | human what data is missing:
       | 
       | "Rage, rage against the dying of the light.
       | 
       | Wild men who caught and sang the sun in flight,
       | 
       | [And learn, too late, they grieved it on its way,]
       | 
       | Do not go gentle into that good night."
       | 
       | For anyone who hasn't memorized Dylan Thomas, why would it be
       | obvious that a line had been omitted? A rhyme scheme of AAA is at
       | least as plausible as AABA.
       | 
       | In order for LLMs to score well on these benchmarks, they would
       | have to do more than recognize the original source - they'd have
       | to know it cold. This benchmark is really more a test of
       | memorization. In the same sense as "The Illusion of Thinking",
        | this paper measures a limitation that neither matches what the
        | authors claim nor is nearly as exciting as it sounds.
        
         | jamessinghal wrote:
         | The test provides both the original and the modified excerpt in
         | the user message, so the LLM doesn't need any memorized version
         | of the excerpt to theoretically answer each correctly.
         | 
         | From the paper:
         | 
          | System Prompt:
          | 
          | You are helping a student practice memorizing poems. The
          | student will recite a poem, but they may have missed some
          | lines. Your task is to identify exactly which lines are
          | missing from their recitation. List only the missing lines,
          | nothing else.
          | 
          | User Message:
          | 
          | Here is the complete original poem: {original poem}
          | 
          | Now, here is my recitation which may be missing some lines:
          | {modified poem}
          | 
          | What lines did I miss? Please list only the missing lines,
          | nothing else.
        
           | scarface_74 wrote:
           | This worked
           | 
           | https://chatgpt.com/share/6855f69d-766c-8010-96e2-ed1b45d3e6.
           | ..
        
             | htnwe_2312412 wrote:
             | yes, 69.8% of the time.
        
       | OsrsNeedsf2P wrote:
        | The criticisms of how AbsenceBench does this are valid, but I'm
       | very excited that we are benchmarking this at all. It's
       | definitely a push in the right direction
        
       | yandie wrote:
        | I wonder how this would apply to vision models? I tried a few
        | toy examples with single images (Claude + Gemini) and they seem
        | to do pretty well at spotting differences. An example image:
        | https://www.pinterest.com/pin/127578601938412480/
        | 
        | They seem to struggle more when you flip the image around
        | (finding fewer differences, and potentially hallucinating).
        
       | obscure-enigma wrote:
        | this research is too simplified and kind of vague. It's the
        | inherent nature of language models, or for that matter any
        | probabilistic model, to compress information for better
        | generalization, since there is a lower bound on how much loss
        | they can incur while decoding the information. LLMs are indeed
        | lossy compressors.
        
       | kadonoishi wrote:
       | To detect a presence, a real brain takes in sensory input and
       | compares it to expectations, and stays calm or registers
       | surprise, and from time to time issues predictions to guide the
       | organism.
       | 
        | To detect an absence, the brain cannot rely on sensory input, by
        | definition. To be surprised that sensory evidence is _not_ there
        | requires a model of the world strong enough to register surprise
        | when an expectation goes unmet, without a sensory prompt.
       | 
       | It seems to me detecting an absence is a strictly higher-order
       | neurological task than processing sensory input.
       | 
       | If LLMs can't do this strictly higher-order neurological task, is
       | that not a capability currently unique to living things?
        
         | tclancy wrote:
         | > from time to time
         | 
          | I know less-than-zero about the subject, but I'd imagine the
          | temporal aspect alone is a problem. Aren't these agents
          | reasoning from a fixed/frozen version of "reality" rather than
          | adjusting in real time?
        
         | gtsop wrote:
          | Thinking is still currently unique to living things, so you
          | don't need to resort to what you describe to find the human
          | brain's uniqueness.
          | 
          | As to what you describe, it has to do with memory. Memory is
          | storing and playing back sensory input, in the absence of that
          | sensory input. So your brain plays back some past sensory input
          | and checks it against current sensory input.
          | 
          | E.g. you left the pen on the table. When you come back, the pen
          | isn't there. Your brain compares the stored memory of seeing
          | the pen on the table vs. what you see now.
        
         | viralsink wrote:
         | LLMs might not be very consistent overall in their learned
         | architecture. Some paths may lead to memorized info, some paths
         | may lead to advanced pattern matching.
        
       | b0a04gl wrote:
       | why are we surprised transformers can't detect what's missing
       | when the entire stack assumes the input is complete? the
       | tokenizer doesn't leave placeholders. the attention weights have
       | nothing to anchor to. even the loss function is built around
       | predicting what is, not what isn't. this isn't a model bug. it's
       | an architectural omission.
       | 
        | if we want models that detect absences, we need training
        | objectives that expect absence. maybe even input encodings that
        | represent "this might've been here."
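        | 
        | a minimal sketch of what generating that kind of training data
        | could look like (purely illustrative, not from the paper):
        | 
        |     import random
        |     
        |     def make_absence_example(lines, max_removed=3):
        |         # Build one (original, modified, target) triple where
        |         # the target is the removed content, so the objective
        |         # explicitly expects absence. Assumes len(lines) >= 2.
        |         k = random.randint(1, min(max_removed, len(lines) - 1))
        |         removed_idx = set(random.sample(range(len(lines)), k))
        |         kept = [l for i, l in enumerate(lines)
        |                 if i not in removed_idx]
        |         removed = [lines[i] for i in sorted(removed_idx)]
        |         return {
        |             "original": "\n".join(lines),
        |             "modified": "\n".join(kept),
        |             "target": "\n".join(removed),
        |         }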
        
         | zurfer wrote:
         | I am surprised because it's such a simple task. Any human who
         | is a bit diligent would be able to figure it out. They give
         | both the original and the modified version.
         | 
         | However it feels a bit like counting letters. So maybe it can
         | be solved with post training. We'll know in 3 to 6 months if it
         | was easy for the labs to "fix" this.
         | 
          | In my daily use of LLMs I regularly get overly optimistic
          | answers because they fail to consider potentially absent or
          | missing information (which is even harder because it's out of
          | context).
        
       | jonbaer wrote:
       | https://en.wikipedia.org/wiki/Chinese_room
        
       | itsgrimetime wrote:
        | I'm not sure how to go about solving it at the architecture
        | level, but I would assume an LLM with access to a diff tool
        | would get 100%. I understand that's not really the point,
        | though.
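        | 
        | Something as small as this would already do it (a difflib
        | sketch the model could call as a tool):
        | 
        |     import difflib
        |     
        |     def missing_lines(original, modified):
        |         # Lines present in the original but absent from the
        |         # modified version.
        |         diff = difflib.ndiff(original.splitlines(),
        |                              modified.splitlines())
        |         return [line[2:] for line in diff
        |                 if line.startswith("- ")]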
        
       | TZubiri wrote:
       | I have never tried this, but I'm wondering how effective the
       | following approaches would be to measure uncertainty and unknowns
       | in responses:
       | 
       | https://cookbook.openai.com/examples/using_logprobs
       | 
       | According to OpenAI official cookbook it seems to be a fairly
       | standard usecase.
       | 
        | Another approach, especially in classification, would be to
        | measure the cosine distance between the user embedding and the
        | ideal embedding of the message category.
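        | 
        | Roughly, for the second idea (a sketch assuming the OpenAI
        | embeddings endpoint; the model name and category text are just
        | examples):
        | 
        |     import numpy as np
        |     from openai import OpenAI
        |     
        |     client = OpenAI()
        |     
        |     def embed(text):
        |         out = client.embeddings.create(
        |             model="text-embedding-3-small", input=text)
        |         return np.array(out.data[0].embedding)
        |     
        |     def cosine(a, b):
        |         return float(a @ b /
        |                      (np.linalg.norm(a) * np.linalg.norm(b)))
        |     
        |     # Low similarity to every category embedding can be treated
        |     # as "unknown" instead of forcing a classification.
        |     user_vec = embed("dosfi8q3anfdfiqr")
        |     category_vec = embed("a question about billing")
        |     print(cosine(user_vec, category_vec))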
        
       | ThrowawayTestr wrote:
        | Wouldn't it be better to ask the LLM to use a diff tool instead
        | of asking it to look at the text directly?
        
         | viralsink wrote:
         | This kind of research is about finding the limitations of the
         | technology to hopefully advance it in a meaningful direction.
         | If this finding impedes you, then sure, you can find a quick
         | fix for it. But it's beside the point.
        
       | pu_pe wrote:
       | Quite clever and useful benchmark. This implies that without tool
       | use, LLMs have a fundamental limitation when it comes to tasks
       | like code review.
        
         | iknownothow wrote:
         | I'd say that's where we're headed. A big model that's trained
         | from the start to use tools and know when to use certain tools
         | and how to use tools. Like us :)
         | 
         | I wouldn't be surprised if someone's building a dataset for
         | tool use examples.
         | 
         | The newer gen reasoning models are especially good at knowing
         | when to do web search. I imagine they'll slowly get better at
         | other tools.
         | 
         | At current levels of performance, LLMs having the ability to
         | get well curated information by themselves would increase their
         | scores by a lot.
        
       | VikRubenfeld wrote:
       | Interesting. I encountered an example of this the other day. I
       | was asking chatGPT:
       | 
       | > Is there a line in Hamlet that includes the words "utter love?"
       | 
       | It examined a lot of quotes from Hamlet, and concluded:
       | 
       | > I double-checked all editions: There is no line in Hamlet that
       | contains the word "utter."
       | 
       | >It is never used in any of the quartos or folios in any line of
       | Hamlet.
       | 
       | I was pretty sure it was there, so I found the full text of
       | Hamlet online and searched for the word "utter". I found it right
       | away and posted to chatGPT:
       | 
       | > That's odd. I found it in two seconds by searching the full
       | text for the word "utter".
       | 
       | > This must be known, which, being kept close, might move More
       | grief to hide than hate to utter love.
       | 
       | > Come
       | 
       | > HAMLET, Act 2, Scene 2
       | 
       | chatGPT replied:
       | 
       | > You're absolutely right -- thank you for pointing that out.
       | 
       | > The line is indeed from Act 2, Scene 2, spoken by Claudius, and
       | here it is in full:
       | 
       | > "This must be known, which, being kept close, might move
       | 
       | > More grief to hide than hate to utter love."
       | 
       | > I regret the earlier oversight -- your memory was sharper than
       | my index!
        
         | diggan wrote:
          | Yeah, it kind of makes sense; LLMs are really bad at that sort
          | of recall unless provided a tool and a loop to go over the
          | results.
          | 
          | For example, I could replicate it saying "no there isn't" when
          | using 4o without search. But with search it gets the answer
          | right.
         | 
         | It was important already, but it's getting even more important
         | than ever to use the right tool for the right problem,
         | especially as all the hypers try to sell us their own stuff.
        
         | GLdRH wrote:
         | Seems like ChatGPT is an utter failure
        
         | speakeron wrote:
         | It's actually Act 2, Scene 1 and is spoken by Polonius.
        
       | emporas wrote:
       | Poor benchmark.
       | 
        | I tried their prompt [1] using 3 numbered items, and qwq-32b got
        | it right with no problems at all. I think it could solve 100
       | numbered items correctly 100% of the time, but it probably needs
       | a million tokens. Probably even more, 10 million.
       | 
        | The limitation of 5,000 tokens is peanuts for a reasoning
        | model. Give it a lot of test-time compute; 10x of 5,000 tokens
        | is still too little.
       | 
       | The authors talk about long inputs, so, if it is 100 pages, give
       | it a billion tokens.
       | 
        | The correct way to implement this is in batches: find the first
        | 5 numbered items in the omitted input text; if it finds those,
        | then simplify the input items and the omitted input items and go
        | again.
       | 
       | Depending on the size of the input, it will always need a hefty
       | amount of tokens, but simplification will help it backtrack
       | correctly and not lose the thread entirely.
       | 
        | [1] You are helping a student practice memorizing poems. The
        | student will recite a poem, but they may have missed some lines.
        | Your task is to identify exactly which lines are missing from
        | their recitation. List only the missing lines, nothing else.
        | 
        | User Message: Here is the complete original poem:
        | 
        | 1) Quisella's lashes fluttered panic-morse.
        | 2) The Moisture Vampires leeches that sucked humidity.
        | 3) Lysandra's nostrils flared precisely one degree.
        | 
        | Now, here is my recitation which may be missing some lines:
        | 
        | Quisella's lashes fluttered panic-morse.
        | Lysandra's nostrils flared precisely one degree.
        | 
        | What lines did I miss? Please list only the missing lines,
        | nothing else.
        
         | emporas wrote:
          | I just tried qwq-32b using the numbered headlines of HN right
          | now, with 26 items [1]. I removed 3 headlines, and it still
          | found all 3 omitted items on the first try, perfect, and it
          | didn't even consume 50,000 tokens.
         | 
         | [1]
         | https://gist.github.com/pramatias/fee1391ad08c7b965f435f3af1...
        
         | enragedcacti wrote:
         | What is interesting about reducing the problem to counting? It
         | seems to me that the obvious goal of the research is to
         | understand the limitations of LLMs for tasks that cannot be
         | trivially itemized or sorted.
        
           | emporas wrote:
            | The more specific the instructions, the better they perform.
            | There is a huge difference between trying to find omitted
            | text, omitted words, or omitted sentences.
            | 
            | If omitted words are to be found, put each word on its own
            | line and number it. The same with sentences.
            | 
            | If you are trying to find omitted words and sentences, make
            | one pass with only words and another one with only
            | sentences. Then combine the results.
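            | 
            | E.g. something like this to prepare the prompt (a trivial
            | sketch):
            | 
            |     def number_items(text, unit="line"):
            |         # Put each word or line on its own numbered row
            |         # before asking the model which numbers are missing.
            |         items = (text.split() if unit == "word"
            |                  else text.splitlines())
            |         return "\n".join(f"{i + 1}) {item}"
            |                          for i, item in enumerate(items))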
        
             | enragedcacti wrote:
             | To what end? You have to segment and order the document
              | (i.e. solve the problem) just to craft your prompt, so the
              | LLM spitting the solution back to you is useless. The
             | experiment uses these tasks because test cases can be
             | algorithmically generated and scored, but it's not very
             | interesting that one can structure the input to solve this
             | specific, useless task with LLMs. It is interesting,
             | though, that this limitation could carry over into tasks
             | where traditional algorithms fail. LLMs improving at this
             | would be legitimately useful which is why a benchmark makes
             | sense, but cheating the benchmarks by augmenting the input
             | doesn't.
        
       | iknownothow wrote:
       | To be fair, I'd put finding literal string diffs in the category
       | of asking LLMs to do rote arithmetic.
       | 
       | The attention mechanism does far too much complex thinking for
       | such a dumb task. This is precisely where you need to dumb down
       | and focus and be disciplined rather than do high level next token
       | prediction.
       | 
       | You'd benefit from actually asking the LLM to list the full
       | document and compare, kind of like reasoning, and similar to how
       | LLMs perform better when they break down arithmetic or algebra
       | tasks into smaller steps.
       | 
       | Also my guess would be that the models that perform well are MoE
       | models where there may be an Expert or two that does well on
        | tasks that need focus rather than intuition. So without knowing
       | anything about Gemini Flash, my guess would be that it's an MoE
       | model.
        
       | amelius wrote:
       | I bet an LLM would be able to do it if you allowed it to go
       | "meta", and asked it to write a python script to detect the
       | omissions, where the script can use an LLM.
        
         | yousif_123123 wrote:
         | Maybe if instructed, but how would it know it needs to use
         | python in this case vs just answer? Perhaps you'd instruct it
         | to always attempt using code to reduce errors.
         | 
          | But the idea of trivial problems like this potentially causing
          | issues for LLMs might mean other aspects of intelligence could
          | also be a struggle for them (which could impact their coding
          | ability as well).
        
       ___________________________________________________________________
       (page generated 2025-06-21 23:01 UTC)