[HN Gopher] GPT-4o's Memory Breakthrough - Needle in a Needlestack
___________________________________________________________________
GPT-4o's Memory Breakthrough - Needle in a Needlestack
Author : parrt
Score : 152 points
Date   : 2024-05-13 21:54 UTC (1 day ago)
(HTM) web link (nian.llmonpy.ai)
(TXT) w3m dump (nian.llmonpy.ai)
| parrt wrote:
| The article shows how much better GPT-4o is at paying attention
| across its input window compared to GPT-4 Turbo and Claude-3
| Sonnet.
|
| We've needed an upgrade to needle in a haystack for a while and
| this "Needle In A Needlestack" is a good next step! NIAN creates
| a prompt that includes thousands of limericks and the prompt asks
| a question about one limerick at a specific location.
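The core of the benchmark is simple to sketch: bury one target limerick at a chosen position among many distractors and ask a question only that limerick answers. A rough illustration (the function name and placeholder text are mine, not from the NIAN code):

```python
# Sketch of a needle-in-a-needlestack prompt: hide one target limerick
# at a chosen position among distractors, then ask about it.

def build_nian_prompt(distractors, target, position, question):
    """Insert `target` at `position` among `distractors` and append a question."""
    limericks = distractors[:position] + [target] + distractors[position:]
    body = "\n\n".join(limericks)
    return f"{body}\n\nAnswer based only on the limericks above:\n{question}"

distractors = [f"(limerick #{i} would go here)" for i in range(1000)]
target = ("There once was a test for recall,\n"
          "whose needles were limericks all, ...")
prompt = build_nian_prompt(distractors, target, position=500,
                           question="What were the needles in the test?")
```

Varying `position` across the window is what produces the per-location accuracy curves the article plots.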
| dmose2 wrote:
| It's interesting (though perhaps not surprising) to see the
| variance in curve shape across models.
| 19h wrote:
 | I'd like to see this for Gemini Pro 1.5 -- I threw the entirety
 | of Moby Dick at it last week, and at one point all the books
 | Byung-Chul Han has ever published, and in both cases it was
 | able to return the single part of a sentence that mentioned or
 | answered my question verbatim, every single time, without any
 | hallucinations.
| parrt wrote:
| Wow. Cool. I have access to that model and have also seen some
| impressive context extraction. It also gave a really good
| summary of a large code base that I dumped in. I saw somebody
 | analyze a huge log file, but we really need something like this
 | needle-in-a-needlestack test to help identify when models might
 | be missing something. At the very least, it could give model
 | developers a tool for analyzing their proposed models.
| 19h wrote:
 | Funnily enough, I ran a 980k-token log dump against Gemini Pro
 | 1.5 yesterday to investigate an error scenario. It found a
 | single incident of a 429 error returned by a third-party API
 | provider, reasoning that "based on the file provided and the
 | information that this log file is aggregated of all instances
 | of the service in question, it seems unlikely that a rate
 | limit would be triggered, and additional investigation may be
 | appropriate". It turned out the service had implemented a
 | block against AWS IPs, breaking a system that loads press data
 | from said API provider and leaving the affected customer
 | without press data -- we hadn't even noticed or investigated
 | that, and Gemini just mentioned it without being prompted.
| parrt wrote:
 | That definitely makes it seem like it's noticing a great
 | deal of its context window. Impressive.
| sftombu wrote:
 | If I had access to Gemini with a reasonable token rate limit,
 | I would be happy to test it. I have had good results with it
 | in other situations.
| cj wrote:
| What version of Gemini is built into Google Workspace? (I
| just got the ability _today_ to ask Gemini anything about
| emails in my work Gmail account, which seems like something
| that would require a large context window)
| underlines wrote:
| Such tasks don't need a large context window. Just good
| RAG.
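For illustration, a minimal retrieval step might look like this; a toy word-overlap score stands in for a real embedding model, and the email texts are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, emails, k=2):
    """Return the k emails most similar to the query.  A production
    system would use an embedding model, not raw word counts."""
    q = Counter(query.lower().split())
    return sorted(emails,
                  key=lambda e: cosine(q, Counter(e.lower().split())),
                  reverse=True)[:k]

emails = [
    "Lunch meeting moved to Friday at noon",
    "Quarterly budget report attached for review",
    "Your flight to Berlin is confirmed for Monday",
]
# Only the retrieved emails, not the whole mailbox, go into the prompt.
context = retrieve("when is my flight to berlin", emails, k=1)
```

The point of the comment: with good retrieval, the context only has to hold the few relevant emails, not the entire mailbox.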
| Fernicia wrote:
| But this content is presumably in its training set, no? I'd be
| interested if you did the same task for a collection of books
| published more recently than the model's last release.
| ben_w wrote:
 | I would _hope_ that Byung-Chul Han would not be in the
 | training set (at least not without his permission), given
 | he's still alive; not only is the legal question still open,
 | but it's also definitely rude.
|
| This doesn't mean you're wrong, though.
| sebzim4500 wrote:
 | It's pretty easy to confirm that copyrighted material is in
 | the training data. See the NYT lawsuit against OpenAI, for
 | example.
| ben_w wrote:
 | Part of that back-and-forth is the claim "this specific
 | text was copied a lot all over the internet, making it
 | show up more in the output", and _that_ means it's _not_
 | a useful guide to cases where one copy was added to The
 | Pile and not removed when training the model.
|
| (Or worse, that Google already had a copy because of
| Google Books and didn't think "might training on this
| explode in our face like that thing with the Street View
| WiFi scanning?")
| 19h wrote:
| To test this hypothesis, I just took the complete book
| "Advances in Green and Sustainable Nanomaterials" [0] and
| pasted it into the prompt, asking Gemini: "What absorbs
| thermal radiations and converts it into electrical signals?".
|
 | It replied: "The text indicates that _graphene sheets_
 | present high optical transparency and are able to absorb
 | thermal radiations with high efficacy. They can then convert
 | these radiations into electrical signals efficiently."
|
| Screenshot of the PDF with the relevant sentence highlighted:
| https://i.imgur.com/G3FnYEn.png
|
| [0] https://www.routledge.com/Advances-in-Green-and-
| Sustainable-...
| jiggawatts wrote:
| Ask it what material absorbs "infrared light" efficiently.
|
| To me, that's useful intelligence. I can already search
| text for verbatim matches, I want the AI to _understand_
| that "thermal radiations" and "infrared light" are the same
| thing.
| a_wild_dandan wrote:
| Gemini works with brand new books too; I've seen multiple
| demonstrations of it. I'll try hunting one down. Side note:
| this experiment is still insightful even using model training
| material. Just compare its performance _with_ the uploaded
| book(s) to _without._
| DominikPeters wrote:
 | Just put the 2500 example linked in the article through
 | Gemini 1.5 _Flash_ and it answered correctly ("The tree has
 | diseased leaves and its bark is peeling."):
 | https://aistudio.google.com/
| sftombu wrote:
| Interesting!
| nsagent wrote:
| A number of people in my lab do research into long context
| evaluation of LLMs for works of fiction. The likelihood is very
| high that Moby Dick is in the training data. Instead the people
| in my lab have explored recently published books to avoid these
| issues.
|
| See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu)
| which was just presented at ICLR last week and FABLES
| (https://arxiv.org/abs/2404.01261) a recent preprint.
| nickca wrote:
| Would love to see Gemini there too!
| personjerry wrote:
 | That's great to hear. My biggest issue with GPT-4 was that as
 | the conversation got longer, the quality diminished
 | (especially relevant for coding projects).
|
| I wonder if it'll be better now. Will test today.
| sftombu wrote:
 | I had the same experience. With a 16k prompt, Turbo was nearly
 | flawless. But it wasn't very good at 32k and not usable at
 | 100k+. You have to repeat information to get good results with
 | longer prompts.
| throwthrowuknow wrote:
 | That's been my experience so far. My current conversations are
 | crazy long compared to any of my GPT-4 convos, which I had to
 | frequently copy context from and start over in a new chat.
| youssefabdelm wrote:
| Someone needs to come up with a "synthesis from haystack" test
| that tests not just retrieval but depth of understanding,
| connections, abstractions across diverse information.
|
 | When a person reads a book, they have an "overall intuition"
 | about it. We need some way to quantify this. Needle-in-haystack
 | tests feel like simple tests that don't go far enough.
| sftombu wrote:
 | I was thinking about something similar -- the first part of
 | the question would give enough information for the LLM to find
 | the limerick, and the second part would ask something that
 | requires a deeper understanding of the limerick (or other
 | text).
| adamgordonbell wrote:
| I've been thinking about that as well.
|
| It's hard, but if you have a piece of fiction or non-fiction it
| hasn't seen before, then a deep reading comprehension question
| can be a good indicator. But you need to be able to separate a
| true answer from BS.
|
 | "What does this work say about our culture? Support your
 | answer with direct quotes."
|
 | I found both GPT-4 and Haiku do alright at this, but they
 | sometimes give answers that imply fixating on certain sections
 | of a 20,000-token context. You could compare against chunking
 | the text, getting an answer for each chunk, and combining them.
 |
 | I suspect the chunking would win for things that are found in
 | many chunks, like a theme the work is heavy-handed about, but
 | the large context would be better for a subtler message --
 | except sometimes it would miss it altogether and think a Fight
 | Club screenplay was a dark comedy.
|
| Interpretation is hard I guess.
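The chunk-and-combine baseline described above is essentially map-reduce over the text. A sketch, with `ask` standing in for a real LLM call:

```python
def chunks(text, size=2000):
    """Split text into roughly size-character pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_answer(text, question, ask, size=2000):
    """Map: ask the question of each chunk.  Reduce: ask the model
    to combine the partial answers into one."""
    partials = [ask(f"{question}\n\nText:\n{c}") for c in chunks(text, size)]
    return ask(f"Combine these partial answers to '{question}':\n"
               + "\n".join(partials))
```

As the comment predicts, this favors themes that recur in many chunks and can lose a message that only emerges from the whole.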
| Eisenstein wrote:
 | My idea is to buy an unpublished novel or screenplay with a
 | detailed, internally consistent world built into it and a cast
 | of characters with well-crafted motivations, and then ask the
 | model to continue writing from an arbitrary point past the
 | middle by creating a new plot line that combines two
 | characters who haven't yet met in the story. If it understands
 | the context, it should be able to write a new part of the
 | story and use a reader's intuitive sense of the characters'
 | motivations to move through their arc.
|
| This whole thing would have to be kept under lock-and-key in
| order to be useful, so it would only serve as a kind of
| personal benchmark. Or it could possibly be a prestige award
| that is valued for its conclusions and not for its ability to
| use the methodology to create improvements in the field.
| visarga wrote:
 | You can only use it for a short while; they get a copy as
 | well.
| Eisenstein wrote:
| I have been thinking about this for use in evaluating
| locally run models, so I didn't make that connection in
| this case. I guess it would have limited utility.
| jddj wrote:
 | An elaborate Agatha Christie-style whodunit, with a series of
 | plot twists and alibis that can be chopped off the end of the
 | piece to change who the most likely suspect is.
| jddj wrote:
| Or a spot the difference.
|
| Generate 1000 generic facts about Alice and the same 1000
| facts about Eve. Randomise the order and change one minor
| detail then ask how they differ.
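A generator for that test might look like this; the fact template and the choice of "minor detail" to change are illustrative:

```python
import random

def make_spot_the_difference(n_facts=1000, seed=0):
    """Build two shuffled fact lists that differ in exactly one minor
    detail; return the prompt plus the index of the changed fact."""
    rng = random.Random(seed)
    template = "fact #{i}: the favourite number of {name} is {v}"
    alice = [template.format(i=i, name="Alice", v=i * 7) for i in range(n_facts)]
    eve = [template.format(i=i, name="Eve", v=i * 7) for i in range(n_facts)]
    changed = rng.randrange(n_facts)
    # The single difference: one of Eve's facts is flipped.
    eve[changed] = eve[changed].replace("favourite", "least favourite")
    rng.shuffle(alice)
    rng.shuffle(eve)
    prompt = ("Facts about Alice:\n" + "\n".join(alice)
              + "\n\nFacts about Eve:\n" + "\n".join(eve)
              + "\n\nApart from the names, how do the facts about "
                "Alice and Eve differ?")
    return prompt, changed
```

Because the expected answer is known by construction, grading is mechanical.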
| pushedx wrote:
| sort alice.txt | diff - <(sort eve.txt)
|
| That's not a task for an LLM
| IanCal wrote:
| Asking students to write an essay about Napoleon isn't
| something we do because we need essays about Napoleon -
| the point is it's a _test_ of capabilities.
| visarga wrote:
| The needles form a graph and the prompt asks graph based tasks.
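One way to read that: scatter relation facts (graph edges) through the haystack and ask a question that requires chaining several of them. A sketch with invented names:

```python
import random

def build_graph_needles(chain, n_fillers=500, seed=0):
    """Hide a chain of 'X reports to Y' edges among filler lines; a
    multi-hop question then requires walking the whole chain."""
    rng = random.Random(seed)
    needles = [f"{a} reports to {b}." for a, b in zip(chain, chain[1:])]
    lines = needles + [f"Filler fact number {i}." for i in range(n_fillers)]
    rng.shuffle(lines)
    question = f"Who is at the top of {chain[0]}'s reporting chain?"
    return "\n".join(lines), question, chain[-1]

haystack, question, answer = build_graph_needles(["Ann", "Bo", "Cy", "Di"])
```

Unlike plain retrieval, no single needle contains the answer; the model has to connect all of them.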
| sftombu wrote:
| That is an interesting idea
| segmondy wrote:
| Why can't you be that someone?
| gremlinsinc wrote:
 | lol, made me think of the aphorism: be the change you want
 | to see.
| petulla wrote:
| You need to know that this test set data wasn't included in the
| training data for this to be meaningful.
| sftombu wrote:
| If you ask the questions without providing the limerick first,
| it never gets the right answer. When the LLM gets the wrong
| answer, it is usually because it reverts to its training data
| and gives a generic answer that doesn't apply to the limerick.
| lmeyerov wrote:
| I thought the test limericks were autogenerated?
| sftombu wrote:
| They come from a database of 98k limericks --
| https://zenodo.org/records/5722527
| a_wild_dandan wrote:
| No you don't. Compare the model's performance before and after
| uploading the material.
| asadm wrote:
 | I have had good experience with the Gemini 1M-context model on
 | these kinds of tasks.
| rguptill wrote:
 | We also need a way to determine where a given response fits in
 | the universe of responses -- is it an "average" answer or a
 | really good one?
| m3kw9 wrote:
 | I thought Google Gemini had almost perfect needle-in-a-haystack
 | performance inside 1 million tokens?
| sftombu wrote:
 | The reason I made Needle in a Needlestack is that the LLMs are
 | getting too good at needle in a haystack. Until GPT-4o, no
 | model was good at the NIAN benchmark.
| m3kw9 wrote:
 | One could have the LLM route the query to a text search
 | function and have the function report back to the LLM for
 | secondary processing.
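That routing pattern can be sketched as a two-pass loop: the model emits a search query, a plain-text search runs, and only the hits go back for a second pass. `llm` here is a stand-in for a real model call:

```python
def text_search(corpus, query):
    """Exact-substring search the model can delegate to: return every
    line of the corpus containing the query string."""
    return [line for line in corpus.splitlines()
            if query.lower() in line.lower()]

def answer_with_search(corpus, question, llm):
    """Pass 1: the LLM proposes a search query for the question.
    Pass 2: the LLM answers using only the lines the search returned."""
    query = llm(f"Give a short search query for: {question}")
    hits = text_search(corpus, query)
    return llm(f"Question: {question}\nRelevant lines:\n" + "\n".join(hits))
```

This sidesteps the long-context problem for exact-match questions, but, as noted elsewhere in the thread, it fails when the relevant text uses different wording than the query.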
| cararemixed wrote:
| What's the chance that these limericks are now in the training
| set? As others mention, it'd be interesting to come up with a way
| to synthesize something sufficiently interesting so it always
| evades training fit.
| sftombu wrote:
| It can't answer the questions without the limericks in the
| prompt. For example, I just asked it "What is sought by the
| English top brass?". The answer from the limerick is "Cranberry
| glass".
|
| 4o responded with "The term "English top brass" typically
| refers to high-ranking officials or leaders within the British
| government, military, or other institutions. What they seek can
| vary widely depending on the context and the specific goals of
| their roles. Here are some general pursuits that might be
| sought by such individuals:
|
| National Security: Ensuring the safety and security of the
| United Kingdom from internal and external threats is a primary
| concern. This involves defense strategies, intelligence
| operations, and counter-terrorism efforts.
|
| Economic Stability: High-ranking officials often focus on
| policies and initiatives aimed at maintaining and improving the
| country's economic health. This includes managing inflation,
| unemployment, trade relations, and economic growth.
|
| Political Influence: Top brass often seek to maintain or expand
| their influence both domestically and internationally. This can
| involve diplomacy, forming alliances, and participating in
| international organizations like the United Nations or NATO.
|
| Social Cohesion: Ensuring social stability and addressing
| issues such as inequality, healthcare, education, and social
| services are critical. This can involve implementing policies
| that promote social welfare and cohesion.
|
| Public Policy Implementation: Leaders are responsible for
| developing and implementing policies that reflect the
| government's priorities. This includes legislation, regulatory
| frameworks, and public administration.
|
| Technological Advancement: Keeping the nation at the forefront
| of technological innovation is often a priority. This includes
| investments in research and development, supporting tech
| industries, and ensuring cybersecurity.
|
| Environmental Sustainability: Addressing climate change and
| promoting sustainable practices are increasingly important.
| This includes policies aimed at reducing carbon emissions,
| protecting natural resources, and transitioning to renewable
| energy sources.
|
| Cultural and Heritage Preservation: Protecting and promoting
| the country's cultural heritage and national identity can also
| be a focus. This includes supporting the arts, preserving
| historical sites, and promoting cultural initiatives.
|
| These pursuits are shaped by the current political climate,
| global trends, and the specific priorities of the leaders in
| question. Would you like more detailed information on any of
| these areas?"
| thorum wrote:
| The needle in the haystack test gives a very limited view of the
| model's actual long context capabilities. It's mostly used
| because early models were terrible at it and it's easy to test.
 | In fact, most recent models now do pretty well at this one
 | task, but in practice their ability to do anything complex
 | drops off hugely after 32K tokens.
|
| RULER is a much better test:
|
| https://github.com/hsiehjackson/RULER
|
| > Despite achieving nearly perfect performance on the vanilla
| needle-in-a-haystack (NIAH) test, all models (except for
| Gemini-1.5-pro) exhibit large degradation on tasks in RULER as
| sequence length increases.
|
| > While all models claim context size of 32k tokens or greater
| (except for Llama3), only half of them can effectively handle
| sequence length of 32K by exceeding a qualitative threshold,
| Llama2-7b performance at 4K (85.6%). The performance exceeding
| the threshold is underlined.
| throwthrowuknow wrote:
 | This is a very promising development. It would be wise for
 | everyone to go back and revisit old experiments that failed,
 | now that this capability is unlocked. It should also make RAG
 | even more powerful, since you can load a lot more information
 | into the context and have it be useful.
| throw7381 wrote:
 | Has anyone done any benchmarks for RAG yet?
| itissid wrote:
 | How do we know that GPT-4o has not been trained on this
 | dataset?
| sftombu wrote:
| It can't answer the questions without the limericks in the
| prompt. For example, I just asked it "What is sought by the
| English top brass?". The answer from the limerick is "Cranberry
| glass".
|
 | 4o responded with the same generic answer quoted in full
 | earlier in the thread; it never mentions cranberry glass.
| irthomasthomas wrote:
| This is based on a limericks dataset published in 2021.
| https://zenodo.org/records/5722527
|
| I think it very likely that gpt-4o was trained on this. I mean,
| why would you not? Innnput, innnput, Johnny five need more
| tokens.
|
 | I wonder why the NIAN team don't generate their limericks
 | using different models and check that they're not already in
 | the training data. Then you'd know the models couldn't
 | possibly have been trained on them.
| sftombu wrote:
| I tested the LLMs to make sure they could not answer the
| questions unless the limerick was given to them. Other than 4o,
| they do very badly on this benchmark, so I don't think the test
| is invalidated by their training.
| ionwake wrote:
 | I am in England. Do US users have access to memory features?
 | (Also, do you have access to voice customisation yet?)
 |
 | Thanks
| whimsicalism wrote:
| Increasingly convinced that nobody on the public internet knows
| how to do actual LLM evaluations.
| yatz wrote:
| Well, I can now use GPT to transform raw dynamic data into
| beautiful HTML layouts on the fly for low-traffic pages, such as
| change/audit logs, saving a ton of development time and keeping
 | my HTML updated even when the data structure has changed. My
 | last attempt did not work consistently because GPT-4 Turbo
 | sometimes ignored the context and instructions almost
 | entirely.
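A sketch of that pattern (the prompt wording and `call_llm` are illustrative; a real implementation would call an actual API and cache the result):

```python
import json

def render_audit_log(records, call_llm):
    """Ask the model to turn raw audit records into an HTML table.
    Because the schema travels inside the prompt, added or renamed
    fields need no template changes."""
    prompt = ("Render the following audit-log records as a clean HTML "
              "table. Use every field present; output only HTML.\n\n"
              + json.dumps(records, indent=2))
    return call_llm(prompt)

records = [{"who": "alice", "action": "update",
            "field": "price", "old": 10, "new": 12}]
# For low-traffic pages, cache the returned HTML rather than
# re-rendering on every request.
```

The trade-off the comment describes: this only pays off if the model reliably follows the instructions, which older models did not.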
| ijidak wrote:
| Do you have an example of this? I would love to learn more.
___________________________________________________________________
(page generated 2024-05-14 23:00 UTC)