[HN Gopher] AbsenceBench: Language models can't tell what's missing
___________________________________________________________________
AbsenceBench: Language models can't tell what's missing
Author : JnBrymn
Score : 307 points
Date : 2025-06-20 22:26 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| AlienRobot wrote:
| Unrelated to the paper, which is about asking LLMs to figure out
| which parts of a document were removed, but my assumption has
| been that to an LLM there is nothing "missing", in the sense that
| any input leads to valid computation and output.
|
| For example, I asked ChatGPT to explain something I typed
| randomly
|
| >It looks like you've entered "dosfi8q3anfdfiqr", which appears
| to be a random string or perhaps a typo--it's not a recognized
| acronym, code, or term in any common context I'm aware of. Could
| you share a bit more about where you found this?
|
| Although the answer is correct, my point is that anything you
| give to the LLM is going to be put under some bucket. The LLM
| can't say "I don't know what that is." Instead it says "that is a
| random string." As far as the LLM is concerned, it knows every
| possible input and concept that anyone could ever type into it,
| it's just that its "understanding" of what that means (after the
| tokens have gone through the neural network) doesn't necessarily
| match what any human being thinks it means.
| cyral wrote:
| This might be due to the system prompt and the training that it
| is supposed to be "a helpful agent". If you tell it not to ask
| clarifying questions, you get something more like "I do not
| understand your input". Tell it to be rude and never ask
| clarifying questions and I get "What an absolute mess. Fix it
| yourself"
|
| Funny enough when testing this I also had to tell it to use
| English. It sees "dos" I suppose and tends to reply with
| exactly what you saw, but in Spanish.
| layer8 wrote:
| "It's not a recognized acronym, code, or term in any common
| context I'm aware of" is pretty similar to "I don't know what
| that is". I would assume that a model could be trained to
| output the latter.
| drsim wrote:
| Right. I've had a lot of success using structured output to
| force LLMs to make Boolean choices, like can they reply or
| not.
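|
| A minimal sketch of the idea in Python (the schema and field
| names here are illustrative, and the exact way a response schema
| is passed differs between providers):
|
|     from pydantic import BaseModel
|
|     # Hypothetical response schema: force the model to commit to a
|     # Boolean before producing any free-form text.
|     class GapCheck(BaseModel):
|         can_answer: bool          # does the model think it can reply?
|         missing_lines: list[str]  # empty if nothing seems to be missing
|
|     # Hand this JSON schema to whichever structured-output feature
|     # your provider exposes (a response-format/schema parameter).
|     print(GapCheck.model_json_schema())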
| cs702 wrote:
| Interesting. Even the most recent models perform relatively
| poorly when asked to identify which information in a context has
| been removed, given access to both the original and edited
| contexts.
|
| The authors posit that poor performance is due to the fact that
| the attention mechanism of Transformers cannot attend to the
| removed tokens, because there are no keys for them!
|
| Thank you for sharing on HN.
| cyanydeez wrote:
| for vision models, I wonder if they can train on things like
| photo negatives, rotated images, etc. Or madlib like sentences
| where a Q/A is like "the _____ took first place in the horse
| show."
| bearseascape wrote:
| The madlib like sentences approach is actually how masked
| token prediction works! It was one of the pretraining tasks
| for BERT, but nowadays I think all (?) LLMs are trained with
| next token prediction instead.
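|
| A toy sketch of that masking objective (illustrative only; the
| real BERT recipe adds details like sometimes keeping or randomly
| replacing the selected tokens):
|
|     import random
|
|     def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
|         """Randomly hide a fraction of tokens; the training target is
|         to predict the original token at each masked position."""
|         masked, targets = [], {}
|         for i, tok in enumerate(tokens):
|             if random.random() < mask_rate:
|                 masked.append(mask_token)
|                 targets[i] = tok
|             else:
|                 masked.append(tok)
|         return masked, targets
|
|     sentence = "the pony took first place in the horse show".split()
|     print(mask_tokens(sentence))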
| latency-guy2 wrote:
| For photo negatives - it usually doesn't matter. I am not up to
| date with what the vision folks are doing at these companies,
| but images are usually single channel, and for regular images
| more likely than not greyscale. The radar folks work in the
| complex domain instead, and those are not RGB-based images at
| all, but rather scatterer-defined.
|
| Additional channels being recognized in training usually
| didn't matter for the experiments and models I used to deal
| with before 2022, and if they were, certainly did not matter
| for colors. Then again, the work I was doing was on known
| (and some additional confusers) classes for object detection
| and classification where the color pretty much didn't matter
| in the first place.
| jug wrote:
| And yet, there are some notable differences between them, so
| now that there's a benchmark and attention given to this issue,
| I wonder how much better they can get. Because obviously
| something can be done.
| usaar333 wrote:
| They don't seem to use any recent top models: no Opus, no o3,
| no Gemini 2.5 Pro.
| yorwba wrote:
| There are keys to attend to, they're just in the original text
| instead of the modified one. Since the model receives both as
| input, it could theoretically attend to those keys.
|
| For the attention mechanism, there isn't much difference between
|
|     Original: {shared prefix} {removed part} {shared suffix}
|     Modified: {shared prefix} {shared suffix}
|
| and
|
|     Original: {shared prefix} {shared suffix}
|     Modified: {shared prefix} {added part} {shared suffix}
|
| I think you could implement an algorithm for this in RASP (a
| language for manually programming transformers) roughly like
| this:
|
| 1. The first layer uses attention to the "Original:" and
| "Modified:" tokens to determine whether the current token is in
| the original or modified parts.
|
| 2. The second layer has one head attend equally to all original
| tokens, which averages their values, and another head attends
| equally to all modified tokens, averaging them as well. The
| averages are combined by computing their difference.
|
| 3. The third layer attends to tokens that are similar to this
| difference, which would be the ones in the {removed
| part}/{added part}.
|
| The only ordering-dependent part is whether you compute the
| difference as _original_average_ - _modified_average_ or the
| other way around.
|
| If a model can detect additions but not removal, that would
| show that it is capable of learning this or a similar algorithm
| in principle, but wasn't trained on enough removal-style data
| to develop the necessary circuitry.
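|
| A rough NumPy sketch of steps 2-3 (plain Python rather than
| actual RASP, with one-hot "embeddings" standing in for learned
| token values):
|
|     import numpy as np
|
|     vocab = {w: i for i, w in enumerate(
|         "the quick brown fox jumps over lazy dog".split())}
|     embed = lambda ws: np.eye(len(vocab))[[vocab[w] for w in ws]]
|
|     original = "the quick brown fox jumps over the lazy dog".split()
|     modified = "the quick brown fox over the lazy dog".split()  # no "jumps"
|
|     orig_v, mod_v = embed(original), embed(modified)
|
|     # Layer-2 analogue: two heads attending uniformly, i.e. plain
|     # averages, combined by taking their difference.
|     diff = orig_v.mean(axis=0) - mod_v.mean(axis=0)
|
|     # Layer-3 analogue: score each original token against the
|     # difference; the removed token stands out.
|     scores = orig_v @ diff
|     print(original[int(scores.argmax())])  # -> "jumps"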
| ironmanszombie wrote:
| Thanks for the breakdown. I am far from knowledgeable on AI
| but was wondering why can't a simple comparison work? They
| can definitely be coded, as you have beautifully
| demonstrated.
| XenophileJKO wrote:
| I haven't read the paper yet, but from a structural 'attention'
| perspective being unable to detect unclassified omissions is
| completely expected. (Though I think it can be solved with
| structured thought.)
|
| For needle in a haystack you have to pay attention to the thing
| that you are trying to find. Attention can do this pretty well.
|
| When looking for an omission, the omission can be anything; you
| can only reason about it by comparing one whole context to
| another whole context. The attention layers can't really do that.
|
| This is similar to the "rank a long set of things" problem.
| Absent some meta cognition process, they just can't do that.
| teruakohatu wrote:
| > When looking for an omission, that omission can be anything,
|
| In this benchmark they give the LLM the necessary information
| to determine what is missing. For example: "here is a poem, here
| is a version of that same poem that may or may not be missing
| lines. Are any lines missing?"
|
| It's more a tuning issue IMHO than an inherent weakness in
| LLMs.
|
| If I were asked to find an omission in an ML paper, my brain
| would compare it with other ML papers; it does not need to
| compare it to Star Wars, Top Gear, Greek history, pottery, and
| the other 1000s of contexts I may know about.
| XenophileJKO wrote:
| Sorry I meant the omission can be anything in the context,
| not anything in the world.. lol.
|
| That is still hard. You only have so many attention heads
| looking for things.. you can't pay attention to EVERYTHING..
| which is what is required to find the omission.
| yorwba wrote:
| To pay attention to everything, set the query vector to 0.
| Then all attention scores will be equal and the attention
| output is the average of the value vectors.
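|
| Concretely, with standard scaled dot-product attention (a small
| NumPy check):
|
|     import numpy as np
|
|     def softmax(x):
|         e = np.exp(x - x.max())
|         return e / e.sum()
|
|     keys = np.random.randn(5, 8)    # 5 tokens, d_k = 8
|     values = np.random.randn(5, 8)
|     query = np.zeros(8)             # zero query vector
|
|     # All scores are 0, so the weights are uniform and the output
|     # is just the average of the value vectors.
|     weights = softmax(query @ keys.T / np.sqrt(8))
|     print(weights)                                          # all 0.2
|     print(np.allclose(weights @ values, values.mean(axis=0)))  # True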
| thaumasiotes wrote:
| We should note that "where is there a line missing from this
| poem: ____?" contains sufficient information to answer
| correctly without needing a copy of the original to compare
| to.
|
| Here are two verses of a poem (song) in Mandarin Chinese:
|
| _yi quan ting ni de
|
| er gei ni hao de
|
| shu dao san yong yuan ai ni yi ge
|
| si bu hui fan cuo
|
| wu bu hui luo suo
|
| shuo ni xiang shuo de
|
| zuo ni xiang zuo de
|
| bie pa shi bai yin wei ni you wo
|
| pei ni kan ri luo
|
| pei ni yi qi chang wan wo men ai de ge_
|
| I removed two lines. Where did that happen?
|
| Would your answer be different if I told you that I might or
| might not have removed some lines?
| teruakohatu wrote:
| > Here are two verses of a poem (song) in Mandarin Chinese:
|
| > ...
|
| > I removed two lines. Where did that happen?
|
| If you read the paper you will see they provide the
| original as well as the version missing information.
|
| I did mention this in my comment too.
|
| I am quite sure I could find your two missing lines if you
| provide me the full poem.
|
| Given that you are a prolific commenter on HN, I am sure a
| LLM could be fine tuned to detect missing text from your
| comments without additional information. For example ...
|
| > WinForms is still around. There have been further tec lly
| just a big tire fire and about the best you can do is to
| ignore all of them and develop in WinForms.
|
| It's probably possible to detect missing information from
| "tec" until "lly". But to know what is between is not
| possible for a human either, beyond plausible guesses.
| thaumasiotes wrote:
| ...did you read my comment? The first - and really only -
| thing I say is that the original isn't necessary. Then
| there's an example. You shouldn't have trouble
| identifying where lines have been removed from the
| Chinese poem.
|
| The fact that the original was provided doesn't
| demonstrate that it's necessary to the task. You can
| identify missing text without needing to know what was
| there.
|
| > Given that you are a prolific commenter on HN, I am
| sure a LLM could be fine tuned to detect missing text
| from your comments without additional information.
|
| Same thing. Why would you need to do tuning on text
| authored by me? You can easily detect missing text of
| that style by the fact that the sentence you have fails
| to be English. You can do the same thing in text for
| which you have no prior experience of the author.
|
| > I am quite sure I could find your two missing lines if
| you provide me the full poem.
|
| But hey, if you insist:
|
| Qing Qing Tie Jin Ni De Er Duo
|
| saranghaeyo
|
| Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
|
| Yi Quan Ting Ni De
|
| Er Gei Ni Hao De
|
| Shu Dao San Yong Yuan Ai Ni Yi Ge
|
| Si Bu Hui Fan Cuo
|
| Wu Bu Hui Luo Suo
|
| Mei Tian Wei Ni Da call, cook Ye Bu Cuo
|
| Qing Qing Tie Jin Ni De Er Duo
|
| saranghaeyo
|
| Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
|
| Da Kai Ni De Ai Qing Shou Ce
|
| Jiu Zai Ci Ke
|
| Wei Ni Chang De Zhuan Shu Qing Ge Yao Ji De
|
| Shuo Ni Xiang Shuo De
|
| Zuo Ni Xiang Zuo De
|
| Bie Pa Shi Bai Yin Wei Ni You Wo
|
| Pei Ni Kan Ri Luo
|
| Pei Ni Deng Yu Guo
|
| Pei Ni Yi Qi Chang Wan Wo Men Ai De Ge
|
| Qing Qing Tie Jin Ni De Er Duo
|
| saranghaeyo
|
| Qing Hua Yong Yuan Bu Xian Tai Duo Dui Ni Shuo
|
| Da Kai Ni De Ai Qing Shou Ce
|
| Jiu Zai Ci Ke
|
| Wei Ni Chang De Zhuan Shu Qing Ge Yao Ji De
|
| Wo Qing Qing Kao Jin Ni De Er Duo
|
| Shuo Ai Ni Bu Xian Tai Duo
|
| Ru Guo Xiang Yu De Ji Lu Yi Mo Fen Zhi Yi Na Yao Duo
|
| Qing Xiang Xin Wo De Zhen Zhen Zhen Xin Bi Yu Zhou Huan
| Liao Kuo
|
| Wo Hui Qian Zhao Ni De Shou Zhi Dao Ni Quan Bu Jie Shou
|
| Da Kai Ni De Ai Qing Shou Ce
|
| Jiu Zai Ci Ke
|
| Zhe Shou Zhuan Shu Qing Ge Qing Ji De
| niemandhier wrote:
| I'll take the bait :-). Endings of lines seem to come in
| pairs (de, de; cuo, suo; de, de; wo, luo).
|
| I'd therefore conjecture that lines are missing after 'ge'
| and 'ge'.
|
| This of course assumes Chinese poetry is based on matching
| vowels, as is e.g. the case in German, and not on rhythm, as
| would be the case in Latin and Arabic.
| meltyness wrote:
| Two lines clearly deviate from AAB.
| pkoird wrote:
| So LLMs are poor at string diff, it seems. Tangentially, is there
| any source (a github repo or otherwise) that documents findings
| like these a la what LLMs are good at and what they aren't good
| at?
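|
| For reference, the deterministic version of the string-diff task
| is a few lines of Python's standard-library difflib:
|
|     import difflib
|
|     original = ["Do not go gentle into that good night,",
|                 "Old age should burn and rave at close of day;",
|                 "Rage, rage against the dying of the light."]
|     modified = ["Do not go gentle into that good night,",
|                 "Rage, rage against the dying of the light."]
|
|     # Lines prefixed with "-" exist in the original but are
|     # missing from the modified version.
|     for line in difflib.unified_diff(original, modified, lineterm=""):
|         if line.startswith("-") and not line.startswith("---"):
|             print(line[1:])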
| birdfood wrote:
| Perhaps related, after watching a talk by Gerald Sussman I loaded
| an image of the Kanizsa triangle into Claude and asked it a
| pretty vague question to see if it could "see" the inferred
| triangle. It recognised the image and went straight into giving
| me a summary about it. So I rotated the image 90 degrees and
| tried in a new conversation, it didn't recognise the image and
| got the number of elements incorrect:
|
| This image shows a minimalist, abstract geometric composition
| with several elements:
|
| * Four black shapes that appear to be partial circles or
|   "Pac-Man" like forms, each with a wedge cut out, positioned in
|   the four corners/quadrants of the image
| * Two thin black triangular or arrow-like shapes - one pointing
|   upward in the upper left area, and one pointing to the right
|   in the center-right area
| * All elements are arranged on a light gray or off-white
|   background
| latentsea wrote:
| I guess they will now just rotate all the images in the
| training data 90 degrees too to fill this kind of gap.
| recursivecaveat wrote:
| Everything old is new again: in the Alexnet paper that kicked
| off the deep learning wave in 2012, they describe
| horizontally flipping every image as a cheap form of data
| augmentation. Though now that we expect models to actually
| read text, that seems potentially counter-productive.
| Rotations are similar, in that you'd hope it would learn
| heuristics such as that the sky is almost always at the top.
| latency-guy2 wrote:
| At least from when I was still doing this kind of work,
| look angle/platform angle scatterer signal (radar) mattered
| more than rotation, but rotation was a simple way to get
| quite a bit more samples. It never stopped being relevant
| :)
| bonoboTP wrote:
| That's called data augmentation. It was common already
| before AlexNet. And it never stopped being common, it's
| still commonly done.
| mirekrusin wrote:
| That's how you train a neural network with synthetic data so it
| extracts actual meaning.
|
| That's how humans learn too, e.g. adding numbers. First there
| is naive memorization, followed by more examples until you get
| it.
|
| LLM training seems to be falling into the memorization trap
| because models are extremely good at it, orders of magnitude
| better than humans.
|
| IMHO what is missing in the training process is the feedback
| explaining a wrong answer. What we're currently doing with
| training is leaving out this understanding as an "exercise for
| the reader". We're feeding correct answers to specific,
| individual examples, which promotes memorization.
|
| What we should be doing in post-training is ditch direct
| backpropagation on the next token: instead, let the model finish
| its wrong answer, append an explanation of why it's wrong, and
| continue backpropagation on the final answer - now with the
| explanation in context to guide it to the right place in its
| understanding.
|
| What all of this means is that current models are largely
| underutilized and unnecessarily bloated; they contain way too
| much memorized information. Making the model larger is an easy,
| quick illusion of improvement. Models need to be squeezed more,
| and more focus needs to go towards the training flow itself.
| atwrk wrote:
| _> That's how humans learn too, e.g. adding numbers. First there
| is naive memorization, followed by more examples until you get
| it._
|
| Just nitpicking here, but this isn't how humans learn
| numbers. They start at birth with competency up to about 3
| or 5 and expand from that. So they can already work with
| quantities of varying size (i.e. they know which is more,
| the 4 apples on the left or the five on the right, and they
| also know what happens if I take one apple from the left
| and put it with the others on the right), and _then_ they
| learn the numbers. So yes, they learn the numbers through
| memorization, but only the signs/symbols, not the numeric
| competency itself.
| mirekrusin wrote:
| Turtles all the way down: something like the meaning of "more"
| is also memorized, initially as "I want more food" etc., then
| refined with time, e.g. a kid saying "he's more than me" is
| corrected by explaining that there needs to be some qualifier
| for a measurable quantity, i.e. "he's more tall (taller) than
| me" or "he is more fast (faster) than me" etc.
|
| Using different modalities (like images, videos, voice/sounds
| instead of pure text) is interesting as well, as it helps
| complete the meaning, adds a sense of time, etc.
|
| I don't think we're born with any concepts at all; it's all
| quite chaotic initially, with consistent sensory inputs that we
| use to train/stabilise our neural network. Newborns, for
| example, don't even have a concept of separation between "me
| and the environment around me"; it's learned.
| littlestymaar wrote:
| And it will work.
|
| I just wish the people who believe LLMs can actually reason and
| generalize would see that they don't.
| latentsea wrote:
| At this point I think all reasoning really means is having
| seen enough of the right training data to make the correct
| inferences, and they're just missing some training data.
| ben_w wrote:
| If that was evidence current AI don't reason, then the
| Thatcher effect would be evidence that humans don't:
| https://en.wikipedia.org/wiki/Thatcher_effect
|
| LLMs may or may not "reason", for certain definitions of
| the word (there are many), but this specific thing doesn't
| differentiate them from us.
| t-3 wrote:
| Being tricked by optical illusions is more about the
| sensory apparatus and image processing faculties than
| reasoning, but _detecting_ optical illusions is
| definitely a reasoning task. I doubt it's an important
| enough task to train into general models though.
| akomtu wrote:
| To generalise this idea: if we look at a thousand points that
| more or less fill a triangle, we'll instantly recognize the
| shape. IMO, this simple example reveals what intelligence is
| really about. We spot the triangle because so much complexity -
| a thousand points - fits into a simple, low-entropy geometric
| shape. What we call IQ is the ceiling of complexity of patterns
| that we can notice. For example, the thousand dots may in fact
| represent corners of a 10-dimensional cube, rotated slightly -
| an easy pattern to see for a 10-d mind.
| saithound wrote:
| Cool. Since ChatGPT 4o is actually really good at this
| particular shape identification task, what, if anything, do
| you conclude about its intelligence?
| JohnKemeny wrote:
| The entire point here is that LLMs and image recognition
| software are not managing this task, so they're not really good
| at this particular shape identification task.
| saithound wrote:
| No, the post's article is not about the sort of shape
| identification task discussed by GP. Or indeed any image
| recognition task: it's a paper about removed context in
| language.
|
| Fwiw, I did test GP's task on ChatGPT 4o directly before
| writing my comment. It is as good at it as any human.
| akomtu wrote:
| Recognizing triangles isn't that impressive. The real question
| is the ceiling of complexity of patterns in data it can
| identify. Give it a list of randomly generated xyz coords that
| fall on a geometric shape, or a list of points that sample the
| Earth's trajectory around the Sun. Will it tell you that it's an
| ellipse? Will it derive Newton's second law? Will it notice the
| deviation from the ellipse and find the rule explaining it?
| Workaccount2 wrote:
| Show any LLM a picture of a dog with 5 legs and watch it be
| totally unable to count.
| pfdietz wrote:
| Or watch them channel Abraham Lincoln.
| JohnKemeny wrote:
| We really don't know how to compute.
|
| Oct 2011, 30 comments.
|
| https://news.ycombinator.com/item?id=3163473
|
| Strange loop video:
|
| July 2011, 36 comments.
|
| https://news.ycombinator.com/item?id=2820118
| iknownothow wrote:
| As far as I can tell, the paper covers text documents only.
| Therefore your example doesn't quite apply.
|
| It is well known that LLMs have a ways to go when it comes to
| processing images like they process text or audio.
|
| I don't think there's any well-performing multimodal model that
| accepts image pixels directly. Most vision capabilities are
| hacks or engineered in. An image undergoes several processing
| steps and each processor's outputs are fed to the transformer
| as tokens. This may happen in one network, but there are non-
| transformer networks involved. Examples of preprocessing:
|
| * OCR
| * CNNs (2D pattern recognizers) with different zooms, angles,
|   slices, etc.
| * Others maybe too?
| yousif_123123 wrote:
| This is very interesting.
|
| 1. The authors mention the attention mechanism being perhaps
| unable to attend to the location of gaps, since the gaps aren't
| tokens. But I would've expected a good LLM transformer to be at
| least a bit close to the gap location. I don't understand why,
| mathematically, the architecture is less suitable for that; it
| could attend to a region that may contain gaps. I wonder if
| fine-tuning on a task like this could help?
|
| 2. Shorter inputs with fewer omissions were harder to solve.
| That is not completely surprising: as a human doing this task,
| if 1 word was missing it would be harder to notice, and
| similarly 1 missing line would be harder than 10 lines. But it's
| still interesting for an LLM to have this problem.
|
| 3. Reasoning models do better, as they can write out the
| documents and potentially solve this easily. It's still very
| surprising that this doesn't lead to 100% accuracy. This should
| be a trivial task; like the paper says, a trivial program can be
| written to solve this (see the sketch below). Perhaps ChatGPT
| (or a similar agent) could read this paper while training, and
| know to write and run Python when solving an issue like this.
|
| The most interesting thing, though, is what other aspects of
| intelligence we may not have identified explicitly, and whether
| LLMs and current AI are very bad at them. This paper suggests
| that there likely are many of those, and it seems in general a
| pretty fun time for people working on building benchmarks.
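|
| A sketch of that trivial program, assuming the benchmark's setup
| of an original document and a possibly-truncated recitation:
|
|     def missing_lines(original: str, recitation: str) -> list[str]:
|         """Return lines of the original that never appear in the
|         recitation, in order (roughly the AbsenceBench task)."""
|         recited = recitation.splitlines()
|         missing, cursor = [], 0
|         for line in original.splitlines():
|             if line in recited[cursor:]:
|                 cursor = recited.index(line, cursor) + 1
|             else:
|                 missing.append(line)
|         return missing
|
|     poem = "line one\nline two\nline three"
|     print(missing_lines(poem, "line one\nline three"))  # ['line two']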
| xianshou wrote:
| In many of their key examples, it would also be unclear to a
| human what data is missing:
|
| "Rage, rage against the dying of the light.
|
| Wild men who caught and sang the sun in flight,
|
| [And learn, too late, they grieved it on its way,]
|
| Do not go gentle into that good night."
|
| For anyone who hasn't memorized Dylan Thomas, why would it be
| obvious that a line had been omitted? A rhyme scheme of AAA is at
| least as plausible as AABA.
|
| In order for LLMs to score well on these benchmarks, they would
| have to do more than recognize the original source - they'd have
| to know it cold. This benchmark is really more a test of
| memorization. In the same sense as "The Illusion of Thinking",
| this paper measures a limitation that neither matches what the
| authors claim nor is nearly as exciting.
| jamessinghal wrote:
| The test provides both the original and the modified excerpt in
| the user message, so the LLM doesn't need any memorized version
| of the excerpt to theoretically answer each correctly.
|
| From the paper:
|
| System Prompt: You are helping a student practice memorizing
| poems. The student will recite a poem, but they may have missed
| some lines. Your task is to identify exactly which lines are
| missing from their recitation. List only the missing lines,
| nothing else.
|
| User Message: Here is the complete original poem:
| {original poem}
| Now, here is my recitation which may be missing some lines:
| {modified poem}
| What lines did I miss? Please list only the missing lines,
| nothing else.
| scarface_74 wrote:
| This worked
|
| https://chatgpt.com/share/6855f69d-766c-8010-96e2-ed1b45d3e6.
| ..
| htnwe_2312412 wrote:
| yes, 69.8% of the time.
| OsrsNeedsf2P wrote:
| The criticisms of how AbsenceBench does this are valid, but I'm
| very excited that we are benchmarking this at all. It's
| definitely a push in the right direction.
| yandie wrote:
| I wonder how this would apply to vision models? I tried a few
| toy examples with single images and they (Claude + Gemini) seem
| to do pretty well at spotting differences. An example image:
| https://www.pinterest.com/pin/127578601938412480/
|
| They seem to struggle more when you flip the image around
| (finding fewer differences, and potentially hallucinating).
| obscure-enigma wrote:
| This research is too simplified and kind of vague. It's the
| inherent nature of language models (or for that matter any
| probabilistic model) to compress information for better
| generalization, since there is a lower bound on how much loss
| they can incur while decoding the information. LLMs are indeed
| lossy compressors.
| kadonoishi wrote:
| To detect a presence, a real brain takes in sensory input and
| compares it to expectations, and stays calm or registers
| surprise, and from time to time issues predictions to guide the
| organism.
|
| To detect an absence, the brain cannot rely on sensory input, by
| definition. To be surprised if sensory evidence is _not_ there
| requires a model of the world strong enough to register surprise
| if the expectation is not there, without a sensory prompt.
|
| It seems to me detecting an absence is a strictly higher-order
| neurological task than processing sensory input.
|
| If LLMs can't do this strictly higher-order neurological task, is
| that not a capability currently unique to living things?
| tclancy wrote:
| > from time to time
|
| I know less-than-zero about the subject but I'd imagine the
| temporal aspect alone is a problem. Aren't these agents
| reasoning from a fixed/frozen version of "reality" rather than
| adjusting in real time?
| gtsop wrote:
| Thinking is still currently unique to living things, so you
| don't need to resort to what you describe to find the human
| brain's uniqueness.
|
| Onto what you describe, it has to do with memory. Memory is
| storing and playing back sensory input, in the absence of that
| sensory input. So your brain plays back some past sensory input
| and checks it against current sensory input.
|
| E.g. you left the pen on the table. When you come back, the pen
| isn't there. Your brain compares the stored memory of seeing
| the pen on the table vs what you see now.
| viralsink wrote:
| LLMs might not be very consistent overall in their learned
| architecture. Some paths may lead to memorized info, some paths
| may lead to advanced pattern matching.
| b0a04gl wrote:
| why are we surprised transformers can't detect what's missing
| when the entire stack assumes the input is complete? the
| tokenizer doesn't leave placeholders. the attention weights have
| nothing to anchor to. even the loss function is built around
| predicting what is, not what isn't. this isn't a model bug. it's
| an architectural omission.
|
| if we want models that detect absences, we need training
| objectives that expect absence. maybe even input encodings that
| represent "this might've been here."
| zurfer wrote:
| I am surprised because it's such a simple task. Any human who
| is a bit diligent would be able to figure it out. They give
| both the original and the modified version.
|
| However it feels a bit like counting letters. So maybe it can
| be solved with post training. We'll know in 3 to 6 months if it
| was easy for the labs to "fix" this.
|
| In my daily use of LLMs I regularly get overly optimistic
| answers because they fail to consider potentially absent or
| missing information (even harder because it's out of context).
| jonbaer wrote:
| https://en.wikipedia.org/wiki/Chinese_room
| itsgrimetime wrote:
| I'm not sure how to go about solving it at the architecture level
| but I would assume an LLM with access to a diff tool would get
| 100%, but I understand that's not really the point
| TZubiri wrote:
| I have never tried this, but I'm wondering how effective the
| following approaches would be to measure uncertainty and unknowns
| in responses:
|
| https://cookbook.openai.com/examples/using_logprobs
|
| According to OpenAI official cookbook it seems to be a fairly
| standard usecase.
|
| Another approach, especially in classification, would be to
| measure the cosine distance between the user embedding and the
| ideal embedding of the message category.
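|
| A small sketch of both ideas (the logprob value and the
| embeddings below are made up; in practice they would come back
| from the API):
|
|     import math
|     import numpy as np
|
|     # 1) Token-level confidence from logprobs: a logprob is log(p),
|     #    so exp() recovers the model's probability for that token.
|     logprob_of_answer_token = -0.105         # made-up value
|     print(math.exp(logprob_of_answer_token))  # ~0.90, fairly confident
|
|     # 2) Classification by cosine similarity between a user-message
|     #    embedding and a prototype embedding per category.
|     def cosine(a, b):
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     user_vec = np.random.randn(16)            # stand-in embeddings
|     categories = {"billing": np.random.randn(16),
|                   "support": np.random.randn(16)}
|     print(max(categories, key=lambda c: cosine(user_vec, categories[c])))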
| ThrowawayTestr wrote:
| Wouldn't it be better to ask the LLM to use a diff tool instead
| of asking it to look at the text directly?
| viralsink wrote:
| This kind of research is about finding the limitations of the
| technology to hopefully advance it in a meaningful direction.
| If this finding impedes you, then sure, you can find a quick
| fix for it. But it's beside the point.
| pu_pe wrote:
| Quite clever and useful benchmark. This implies that without tool
| use, LLMs have a fundamental limitation when it comes to tasks
| like code review.
| iknownothow wrote:
| I'd say that's where we're headed: a big model that's trained
| from the start to use tools, and to know when and how to use
| them. Like us :)
|
| I wouldn't be surprised if someone's building a dataset for
| tool use examples.
|
| The newer gen reasoning models are especially good at knowing
| when to do web search. I imagine they'll slowly get better at
| other tools.
|
| At current levels of performance, LLMs having the ability to
| get well curated information by themselves would increase their
| scores by a lot.
| VikRubenfeld wrote:
| Interesting. I encountered an example of this the other day. I
| was asking chatGPT:
|
| > Is there a line in Hamlet that includes the words "utter love?"
|
| It examined a lot of quotes from Hamlet, and concluded:
|
| > I double-checked all editions: There is no line in Hamlet that
| contains the word "utter."
|
| >It is never used in any of the quartos or folios in any line of
| Hamlet.
|
| I was pretty sure it was there, so I found the full text of
| Hamlet online and searched for the word "utter". I found it right
| away and posted to chatGPT:
|
| > That's odd. I found it in two seconds by searching the full
| text for the word "utter".
|
| > This must be known, which, being kept close, might move More
| grief to hide than hate to utter love.
|
| > Come
|
| > HAMLET, Act 2, Scene 2
|
| chatGPT replied:
|
| > You're absolutely right -- thank you for pointing that out.
|
| > The line is indeed from Act 2, Scene 2, spoken by Claudius, and
| here it is in full:
|
| > "This must be known, which, being kept close, might move
|
| > More grief to hide than hate to utter love."
|
| > I regret the earlier oversight -- your memory was sharper than
| my index!
| diggan wrote:
| Yeah, it kind of makes sense; LLMs are really bad at that sort
| of recall unless provided a tool and a loop to go over the
| results.
|
| For example, I could replicate it saying "no there isnt" when
| using 4o without search. But with search it gets the answer
| right.
|
| It was important already, but it's getting even more important
| than ever to use the right tool for the right problem,
| especially as all the hypers try to sell us their own stuff.
| GLdRH wrote:
| Seems like ChatGPT is an utter failure
| speakeron wrote:
| It's actually Act 2, Scene 1 and is spoken by Polonius.
| emporas wrote:
| Poor benchmark.
|
| I tried their prompt [1] using 3 numbered items, and qwq-32b got
| it right with no problems at all. I think it could solve 100
| numbered items correctly 100% of the time, but it probably needs
| a million tokens. Probably even more, 10 million.
|
| The limitation of 5000 tokens is peanuts for a reasoning model.
| Give it a lot of test-time compute; 10x of 5000 tokens is still
| too little.
|
| The authors talk about long inputs, so, if it is 100 pages, give
| it a billion tokens.
|
| The correct way to implement this is in batches: find the first 5
| numbered items in the omitted input text; if it does find those,
| then simplify the input items and the omitted input items and go
| again.
|
| Depending on the size of the input, it will always need a hefty
| amount of tokens, but simplification will help it backtrack
| correctly and not lose the thread entirely.
|
| [1] You are helping a student practice memorizing poems. The
| student will recite a poem, but they may have missed some lines.
| Your task is to identify exactly which lines are missing from
| their recitation. List only the missing lines, nothing else.
|
| User Message: Here is the complete original poem:
| 1) Quisella's lashes fluttered panic-morse.
| 2) The Moisture Vampires leeches that sucked humidity.
| 3) Lysandra's nostrils flared precisely one degree.
| Now, here is my recitation which may be missing some lines:
| Quisella's lashes fluttered panic-morse.
| Lysandra's nostrils flared precisely one degree.
| What lines did I miss? Please list only the missing lines,
| nothing else.
| emporas wrote:
| I just tried qwq-32b using the numbered headlines of HN right
| now, with 26 items [1]. I removed 3 headlines and it still found
| all 3 omitted items on the first try, perfect, and it didn't even
| consume 50,000 tokens.
|
| [1]
| https://gist.github.com/pramatias/fee1391ad08c7b965f435f3af1...
| enragedcacti wrote:
| What is interesting about reducing the problem to counting? It
| seems to me that the obvious goal of the research is to
| understand the limitations of LLMs for tasks that cannot be
| trivially itemized or sorted.
| emporas wrote:
| The more specific the instructions, the better they perform.
| There is a huge difference between trying to find omitted text,
| omitted words, or omitted sentences.
|
| If omitted words are to be found, put each word on its own line
| and number it. The same with sentences.
|
| If you are trying to find omitted words and sentences, make one
| pass with only words and another with only sentences, then
| combine the results.
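|
| A tiny sketch of that preprocessing step:
|
|     def number_units(text: str, by: str = "line") -> str:
|         """Prefix each line (or word) with an index so the model
|         can refer to omissions by number."""
|         units = text.splitlines() if by == "line" else text.split()
|         return "\n".join(f"{i}) {u}" for i, u in enumerate(units, 1))
|
|     print(number_units("Do not go gentle into that good night,\n"
|                        "Rage, rage against the dying of the light."))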
| enragedcacti wrote:
| To what end? You have to segment and order the document
| (i.e. solve the problem) just to craft your prompt so the
| LLM spitting the solution back to you is useless. The
| experiment uses these tasks because test cases can be
| algorithmically generated and scored, but it's not very
| interesting that one can structure the input to solve this
| specific, useless task with LLMs. It is interesting,
| though, that this limitation could carry over into tasks
| where traditional algorithms fail. LLMs improving at this
| would be legitimately useful which is why a benchmark makes
| sense, but cheating the benchmarks by augmenting the input
| doesn't.
| iknownothow wrote:
| To be fair, I'd put finding literal string diffs in the category
| of asking LLMs to do rote arithmetic.
|
| The attention mechanism does far too much complex thinking for
| such a dumb task. This is precisely where you need to dumb down
| and focus and be disciplined rather than do high level next token
| prediction.
|
| You'd benefit from actually asking the LLM to list the full
| document and compare, kind of like reasoning, and similar to how
| LLMs perform better when they break down arithmetic or algebra
| tasks into smaller steps.
|
| Also my guess would be that the models that perform well are MoE
| models where there may be an Expert or two that does well on
| tasks that need focus rather than intuition. So without knowing
| anything about Gemini Flash, my guess would be that it's an MoE
| model.
| amelius wrote:
| I bet an LLM would be able to do it if you allowed it to go
| "meta", and asked it to write a python script to detect the
| omissions, where the script can use an LLM.
| yousif_123123 wrote:
| Maybe if instructed, but how would it know it needs to use
| python in this case vs just answer? Perhaps you'd instruct it
| to always attempt using code to reduce errors.
|
| But the fact that trivial problems like this can potentially
| cause issues for LLMs might mean other aspects of intelligence
| could also be a struggle for them (which could impact their
| coding ability as well).
___________________________________________________________________
(page generated 2025-06-21 23:01 UTC)