[HN Gopher] GPT-4o's Memory Breakthrough - Needle in a Needlestack
       ___________________________________________________________________
        
       GPT-4o's Memory Breakthrough - Needle in a Needlestack
        
       Author : parrt
       Score  : 152 points
       Date   : 2024-05-13 21:54 UTC (1 day ago)
        
 (HTM) web link (nian.llmonpy.ai)
 (TXT) w3m dump (nian.llmonpy.ai)
        
       | parrt wrote:
       | The article shows how much better GPT-4o is at paying attention
       | across its input window compared to GPT-4 Turbo and Claude-3
       | Sonnet.
       | 
       | We've needed an upgrade to needle in a haystack for a while and
       | this "Needle In A Needlestack" is a good next step! NIAN creates
       | a prompt that includes thousands of limericks and the prompt asks
       | a question about one limerick at a specific location.
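        | 
        | Roughly, the prompt construction looks like this (my sketch of
        | the idea, not the actual NIAN code; the limerick list, question,
        | and depth are placeholders):
        | 
        |     def build_needlestack_prompt(limericks, needle, question,
        |                                  depth):
        |         # Place the target limerick at a chosen depth
        |         # (0.0 = start, 1.0 = end) among the distractors.
        |         stack = list(limericks)
        |         stack.insert(int(len(stack) * depth), needle)
        |         body = "\n\n".join(stack)
        |         return (f"Here are some limericks:\n\n{body}\n\n"
        |                 f"Answer from the limericks above: {question}")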
        
       | dmose2 wrote:
       | It's interesting (though perhaps not surprising) to see the
       | variance in curve shape across models.
        
       | 19h wrote:
        | I'd like to see this for Gemini Pro 1.5 -- I threw the entirety
        | of Moby Dick at it last week, and at one point every book
        | Byung-Chul Han has ever published, and in both cases it was able
        | to return the single part of a sentence that mentioned or
        | answered my question verbatim, every single time, without any
        | hallucinations.
        
         | parrt wrote:
         | Wow. Cool. I have access to that model and have also seen some
         | impressive context extraction. It also gave a really good
         | summary of a large code base that I dumped in. I saw somebody
         | analyze a huge log file, but we really need something like this
         | needle in a needlestack to help identify when models might be
          | missing something. At the very least, this could give model
          | developers a tool with which to analyze their proposed models.
        
           | 19h wrote:
            | Funnily enough, I ran a 980k-token log dump against Gemini
            | Pro 1.5 yesterday to investigate an error scenario. It found
            | a single incident of a 429 error returned by a third-party
            | API provider, reasoning that "based on the file provided and
            | the information that this log file is aggregated of all
            | instances of the service in question, it seems unlikely that
            | a rate limit would be triggered, and additional
            | investigation may be appropriate". It turned out the service
            | had implemented a block against AWS IPs, breaking a system
            | that loads press data from said API provider and leaving the
            | affected customer without press data. We didn't even notice
            | or investigate that -- Gemini just mentioned it without
            | being prompted.
        
             | parrt wrote:
              | That definitely makes it seem like it's noticing a great
              | deal of its context window. Impressive.
        
         | sftombu wrote:
          | If I had access to Gemini with a reasonable token rate limit,
          | I would be happy to test it. I have had good results with it
          | in other situations.
        
           | cj wrote:
           | What version of Gemini is built into Google Workspace? (I
           | just got the ability _today_ to ask Gemini anything about
           | emails in my work Gmail account, which seems like something
           | that would require a large context window)
        
             | underlines wrote:
             | Such tasks don't need a large context window. Just good
             | RAG.
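              | 
              | A minimal sketch of what I mean (embed_texts and ask_llm
              | are placeholders for whatever embedding model and LLM you
              | use):
              | 
              |     import numpy as np
              |     
              |     def answer_over_emails(emails, question,
              |                            embed_texts, ask_llm, k=5):
              |         # Retrieve the k most similar emails, then
              |         # answer from that small context instead of
              |         # stuffing the whole mailbox into the prompt.
              |         doc_vecs = embed_texts(emails)      # (n, d)
              |         q_vec = embed_texts([question])[0]  # (d,)
              |         scores = doc_vecs @ q_vec
              |         top = [emails[i] for i in
              |                np.argsort(scores)[-k:]]
              |         context = "\n---\n".join(top)
              |         return ask_llm(f"Emails:\n{context}\n\n"
              |                        f"Question: {question}")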
        
         | Fernicia wrote:
         | But this content is presumably in its training set, no? I'd be
         | interested if you did the same task for a collection of books
         | published more recently than the model's last release.
        
           | ben_w wrote:
            | I would _hope_ that Byung-Chul Han would not be in the
            | training set (at least not without his permission), given
            | he's still alive; not only is the legal question still open,
            | it's also definitely rude.
           | 
           | This doesn't mean you're wrong, though.
        
             | sebzim4500 wrote:
              | It's pretty easy to confirm that copyrighted material is
              | in the training data. See the NYT lawsuit against OpenAI
              | for example.
        
               | ben_w wrote:
               | Part of that back-and-forth is the claim "this specific
               | text was copied a lot all over the internet making it
                | show up more in the output", and _that_ means it's _not_
               | a useful guide to things where one copy was added to The
               | Pile and not removed when training the model.
               | 
               | (Or worse, that Google already had a copy because of
               | Google Books and didn't think "might training on this
               | explode in our face like that thing with the Street View
               | WiFi scanning?")
        
           | 19h wrote:
           | To test this hypothesis, I just took the complete book
           | "Advances in Green and Sustainable Nanomaterials" [0] and
           | pasted it into the prompt, asking Gemini: "What absorbs
           | thermal radiations and converts it into electrical signals?".
           | 
            | It replied: "The text indicates that _graphene sheets_
            | present high optical transparency and are able to absorb
            | thermal radiations with high efficacy. They can then convert
            | these radiations into electrical signals efficiently."
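            | 
            | (If you want to reproduce this, the test amounts to
            | something like the following with the google-generativeai
            | Python package -- the file name is a placeholder and the
            | model string is whatever your account exposes:)
            | 
            |     import google.generativeai as genai
            |     
            |     genai.configure(api_key="...")  # your API key
            |     model = genai.GenerativeModel("gemini-1.5-pro-latest")
            |     
            |     book = open("green_nanomaterials.txt").read()
            |     resp = model.generate_content(
            |         book + "\n\nWhat absorbs thermal radiations and "
            |                "converts it into electrical signals?")
            |     print(resp.text)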
           | 
           | Screenshot of the PDF with the relevant sentence highlighted:
           | https://i.imgur.com/G3FnYEn.png
           | 
           | [0] https://www.routledge.com/Advances-in-Green-and-
           | Sustainable-...
        
             | jiggawatts wrote:
             | Ask it what material absorbs "infrared light" efficiently.
             | 
             | To me, that's useful intelligence. I can already search
             | text for verbatim matches, I want the AI to _understand_
             | that "thermal radiations" and "infrared light" are the same
             | thing.
        
           | a_wild_dandan wrote:
           | Gemini works with brand new books too; I've seen multiple
           | demonstrations of it. I'll try hunting one down. Side note:
           | this experiment is still insightful even using model training
           | material. Just compare its performance _with_ the uploaded
           | book(s) to _without._
        
         | DominikPeters wrote:
          | Just put the 2500 example linked in the article through Gemini
          | 1.5 _Flash_ and it answered correctly ("The tree has diseased
          | leaves and its bark is peeling."): https://aistudio.google.com/
        
           | sftombu wrote:
           | Interesting!
        
         | nsagent wrote:
         | A number of people in my lab do research into long context
         | evaluation of LLMs for works of fiction. The likelihood is very
          | high that Moby Dick is in the training data. Instead, the
          | people in my lab have explored recently published books to
          | avoid these issues.
         | 
         | See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu)
         | which was just presented at ICLR last week and FABLES
         | (https://arxiv.org/abs/2404.01261) a recent preprint.
        
       | nickca wrote:
       | Would love to see Gemini there too!
        
       | personjerry wrote:
        | That's great to hear. My biggest issue with GPT-4 was that as
        | the conversation got longer, the quality diminished (especially
        | relevant for coding projects).
       | 
       | I wonder if it'll be better now. Will test today.
        
         | sftombu wrote:
          | I had the same experience. With a 16k prompt, Turbo was nearly
          | flawless. But it wasn't very good at 32k and not usable at
          | 100k+. You have to repeat information to get good results with
          | longer prompts.
        
         | throwthrowuknow wrote:
          | That's been my experience so far. My current conversations are
          | crazy long compared to any of my GPT-4 convos, which I had to
          | frequently copy context from and start over in a new chat.
        
       | youssefabdelm wrote:
       | Someone needs to come up with a "synthesis from haystack" test
       | that tests not just retrieval but depth of understanding,
       | connections, abstractions across diverse information.
       | 
       | When a person reads a book, they have an "overall intuition"
        | about it. We need some way to quantify this. Needle-in-haystack
        | tests feel like simple tests that don't go far enough.
        
         | sftombu wrote:
          | I was thinking about something similar -- make the first part
          | of the question give enough information for the LLM to find
          | the limerick. Then the second part would ask something that
          | requires a deeper understanding of the limerick (or other
          | text).
        
         | adamgordonbell wrote:
         | I've been thinking about that as well.
         | 
         | It's hard, but if you have a piece of fiction or non-fiction it
         | hasn't seen before, then a deep reading comprehension question
         | can be a good indicator. But you need to be able to separate a
         | true answer from BS.
         | 
         | "What does this work says about our culture? Support your
         | answer with direct quotes."
         | 
          | I found both gpt-4 and haiku do alright at this, but they
          | sometimes give answers that imply fixating on certain sections
          | of a 20,000-token context. You could compare it against
          | chunking the text, getting the answer for each chunk and
          | combining them.
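          | 
          | The chunked baseline would look roughly like this (a sketch;
          | ask_llm stands in for whatever model call you use):
          | 
          |     def chunked_answer(text, question, ask_llm,
          |                        chunk_words=4000):
          |         # Crude word-based split as a proxy for tokens.
          |         words = text.split()
          |         chunks = [" ".join(words[i:i + chunk_words])
          |                   for i in range(0, len(words), chunk_words)]
          |         partials = [ask_llm(f"{c}\n\nQ: {question}")
          |                     for c in chunks]
          |         return ask_llm("Combine these partial answers "
          |                        "into one:\n" + "\n".join(partials))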
         | 
          | I suspect if you do that, the chunking would win for things
          | that are found in many chunks, like when the work is heavy-
          | handed on a theme, but the large context would be better for a
          | subtler message, except sometimes it would miss it altogether
          | and think a Fight Club screenplay was a dark comedy.
         | 
         | Interpretation is hard I guess.
        
         | Eisenstein wrote:
          | My idea is to buy an unpublished novel or screenplay with a
          | detailed, internally consistent world built into it and a cast
          | of characters with well-crafted motivations, and then ask it
          | to continue writing from an arbitrary point past the midpoint
          | by creating a new plot line that combines two characters who
          | haven't yet met in the story. If it understands the context,
          | it should be able to write a new part of the story and use a
          | reader's intuitive sense of the characters' motivations to
          | move through their arcs.
         | 
         | This whole thing would have to be kept under lock-and-key in
         | order to be useful, so it would only serve as a kind of
         | personal benchmark. Or it could possibly be a prestige award
         | that is valued for its conclusions and not for its ability to
         | use the methodology to create improvements in the field.
        
           | visarga wrote:
            | You can only use it for a short while; they get a copy as
            | well.
        
             | Eisenstein wrote:
             | I have been thinking about this for use in evaluating
             | locally run models, so I didn't make that connection in
             | this case. I guess it would have limited utility.
        
         | jddj wrote:
         | An elaborate Agatha Christie style whodunit, with a series of
         | plot-twists and alibis which can be chopped off the end of the
         | piece to modify who is the most likely suspect
        
           | jddj wrote:
           | Or a spot the difference.
           | 
           | Generate 1000 generic facts about Alice and the same 1000
           | facts about Eve. Randomise the order and change one minor
           | detail then ask how they differ.
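            | 
            | Generating such a pair is cheap (sketch; the fact templates
            | and the flipped detail are placeholders):
            | 
            |     import random
            |     
            |     def spot_the_difference(templates, flip_index,
            |                             old, new, seed=0):
            |         rng = random.Random(seed)
            |         alice = [t.format(name="Alice") for t in templates]
            |         eve = [t.format(name="Eve") for t in templates]
            |         # Change one minor detail on Eve's side only.
            |         eve[flip_index] = eve[flip_index].replace(old, new)
            |         rng.shuffle(alice)
            |         rng.shuffle(eve)
            |         return alice, eve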
        
             | pushedx wrote:
             | sort alice.txt | diff - <(sort eve.txt)
             | 
             | That's not a task for an LLM
        
               | IanCal wrote:
               | Asking students to write an essay about Napoleon isn't
               | something we do because we need essays about Napoleon -
               | the point is it's a _test_ of capabilities.
        
         | visarga wrote:
         | The needles form a graph and the prompt asks graph based tasks.
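          | 
          | E.g. scatter relational needles through the filler and ask a
          | multi-hop question (a sketch; the facts and question are
          | placeholders):
          | 
          |     import random
          |     
          |     facts = ["Alice reports to Bob.",
          |              "Bob reports to Carol.",
          |              "Carol reports to Dana."]
          |     
          |     def scatter(filler, facts, seed=0):
          |         # Insert each fact at a random depth among the
          |         # distractor texts (e.g. thousands of limericks).
          |         rng = random.Random(seed)
          |         docs = list(filler)
          |         for f in facts:
          |             docs.insert(rng.randrange(len(docs) + 1), f)
          |         return "\n\n".join(docs)
          |     
          |     # "Who is Alice's manager's manager's manager?" -> Dana,
          |     # which needs all three needles, not just one.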
        
           | sftombu wrote:
           | That is an interesting idea
        
         | segmondy wrote:
         | Why can't you be that someone?
        
           | gremlinsinc wrote:
            | lol, made me think of the aphorism: be the change you want
            | to see.
        
       | petulla wrote:
       | You need to know that this test set data wasn't included in the
       | training data for this to be meaningful.
        
         | sftombu wrote:
         | If you ask the questions without providing the limerick first,
         | it never gets the right answer. When the LLM gets the wrong
         | answer, it is usually because it reverts to its training data
         | and gives a generic answer that doesn't apply to the limerick.
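          | 
          | That check is easy to script (sketch; ask_llm is a
          | placeholder for the model call):
          | 
          |     def contamination_check(limerick, question, answer,
          |                             ask_llm):
          |         # Correct WITHOUT the limerick suggests the model
          |         # memorized it; correct WITH it tests attention.
          |         bare = ask_llm(question)
          |         ctx = ask_llm(f"{limerick}\n\n{question}")
          |         return (answer.lower() in bare.lower(),
          |                 answer.lower() in ctx.lower())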
        
         | lmeyerov wrote:
         | I thought the test limericks were autogenerated?
        
           | sftombu wrote:
           | They come from a database of 98k limericks --
           | https://zenodo.org/records/5722527
        
         | a_wild_dandan wrote:
         | No you don't. Compare the model's performance before and after
         | uploading the material.
        
       | asadm wrote:
        | I have had good experiences with the Gemini 1M-context model on
        | this kind of task.
        
       | rguptill wrote:
        | We also need a way to determine where a given response fits in
        | the universe of responses -- is it an "average" answer or a
        | really good one?
        
       | m3kw9 wrote:
       | I thought google Gemini had almost perfect needle in haystack
       | performance inside 1 million tokens?
        
         | sftombu wrote:
          | The reason I made Needle in a Needlestack is that LLMs are
          | getting too good at needle in a haystack. Until GPT-4o, no
          | model was good at the NIAN benchmark.
        
       | m3kw9 wrote:
        | One could have the LLM route the query to a text search function
        | and have the function report back to the LLM for secondary
        | processing.
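        | 
        | i.e. a basic tool-use loop, something like (sketch;
        | search_corpus and ask_llm are placeholders):
        | 
        |     def answer_with_search(question, corpus,
        |                            ask_llm, search_corpus):
        |         # First pass: the LLM turns the question into a
        |         # plain text search query.
        |         query = ask_llm("Give a short search query for: "
        |                         + question)
        |         hits = search_corpus(corpus, query)
        |         # Second pass: the LLM reasons over only the
        |         # matching passages.
        |         return ask_llm(f"Passages:\n{hits}\n\n"
        |                        f"Question: {question}")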
        
       | cararemixed wrote:
       | What's the chance that these limericks are now in the training
       | set? As others mention, it'd be interesting to come up with a way
       | to synthesize something sufficiently interesting so it always
       | evades training fit.
        
         | sftombu wrote:
         | It can't answer the questions without the limericks in the
         | prompt. For example, I just asked it "What is sought by the
         | English top brass?". The answer from the limerick is "Cranberry
         | glass".
         | 
         | 4o responded with "The term "English top brass" typically
         | refers to high-ranking officials or leaders within the British
         | government, military, or other institutions. What they seek can
         | vary widely depending on the context and the specific goals of
         | their roles. Here are some general pursuits that might be
         | sought by such individuals:
         | 
         | National Security: Ensuring the safety and security of the
         | United Kingdom from internal and external threats is a primary
         | concern. This involves defense strategies, intelligence
         | operations, and counter-terrorism efforts.
         | 
         | Economic Stability: High-ranking officials often focus on
         | policies and initiatives aimed at maintaining and improving the
         | country's economic health. This includes managing inflation,
         | unemployment, trade relations, and economic growth.
         | 
         | Political Influence: Top brass often seek to maintain or expand
         | their influence both domestically and internationally. This can
         | involve diplomacy, forming alliances, and participating in
         | international organizations like the United Nations or NATO.
         | 
         | Social Cohesion: Ensuring social stability and addressing
         | issues such as inequality, healthcare, education, and social
         | services are critical. This can involve implementing policies
         | that promote social welfare and cohesion.
         | 
         | Public Policy Implementation: Leaders are responsible for
         | developing and implementing policies that reflect the
         | government's priorities. This includes legislation, regulatory
         | frameworks, and public administration.
         | 
         | Technological Advancement: Keeping the nation at the forefront
         | of technological innovation is often a priority. This includes
         | investments in research and development, supporting tech
         | industries, and ensuring cybersecurity.
         | 
         | Environmental Sustainability: Addressing climate change and
         | promoting sustainable practices are increasingly important.
         | This includes policies aimed at reducing carbon emissions,
         | protecting natural resources, and transitioning to renewable
         | energy sources.
         | 
         | Cultural and Heritage Preservation: Protecting and promoting
         | the country's cultural heritage and national identity can also
         | be a focus. This includes supporting the arts, preserving
         | historical sites, and promoting cultural initiatives.
         | 
         | These pursuits are shaped by the current political climate,
         | global trends, and the specific priorities of the leaders in
         | question. Would you like more detailed information on any of
         | these areas?"
        
       | thorum wrote:
       | The needle in the haystack test gives a very limited view of the
       | model's actual long context capabilities. It's mostly used
       | because early models were terrible at it and it's easy to test.
        | In fact, most recent models now do pretty well at this one task,
       | but in practice, their ability to do anything complex drops off
       | hugely after 32K tokens.
       | 
       | RULER is a much better test:
       | 
       | https://github.com/hsiehjackson/RULER
       | 
       | > Despite achieving nearly perfect performance on the vanilla
       | needle-in-a-haystack (NIAH) test, all models (except for
       | Gemini-1.5-pro) exhibit large degradation on tasks in RULER as
       | sequence length increases.
       | 
       | > While all models claim context size of 32k tokens or greater
       | (except for Llama3), only half of them can effectively handle
       | sequence length of 32K by exceeding a qualitative threshold,
       | Llama2-7b performance at 4K (85.6%). The performance exceeding
       | the threshold is underlined.
        
       | throwthrowuknow wrote:
        | This is a very promising development. It would be wise for
        | everyone to go back and revisit old experiments that failed, now
        | that this capability is unlocked. It should also make RAG even
        | more powerful, now that you can load a lot more information into
        | the context and have it be useful.
        
       | throw7381 wrote:
        | Has anyone done any benchmarks for RAG yet?
        
       | itissid wrote:
        | How do we know that gpt-4o has not been trained on this dataset?
        
          | sftombu wrote:
          | It can't answer the questions without the limericks in the
          | prompt -- see my reply to cararemixed above, where 4o answers
          | "What is sought by the English top brass?" with a generic
          | essay instead of the limerick's "Cranberry glass".
        
       | irthomasthomas wrote:
       | This is based on a limericks dataset published in 2021.
       | https://zenodo.org/records/5722527
       | 
       | I think it very likely that gpt-4o was trained on this. I mean,
       | why would you not? Innnput, innnput, Johnny five need more
       | tokens.
       | 
        | I wonder why the NIAN team don't generate their limericks using
        | different models and check that they're not in the training
        | data. Then you'd know the models couldn't possibly be trained on
        | them.
        
         | sftombu wrote:
         | I tested the LLMs to make sure they could not answer the
         | questions unless the limerick was given to them. Other than 4o,
         | they do very badly on this benchmark, so I don't think the test
         | is invalidated by their training.
        
       | ionwake wrote:
        | I am in England; do US users have access to memory features?
        | (Also, do you have access to voice customisation yet?)
       | 
       | Thanks
        
       | whimsicalism wrote:
       | Increasingly convinced that nobody on the public internet knows
       | how to do actual LLM evaluations.
        
       | yatz wrote:
       | Well, I can now use GPT to transform raw dynamic data into
       | beautiful HTML layouts on the fly for low-traffic pages, such as
       | change/audit logs, saving a ton of development time and keeping
       | my HTML updated even when the data structure has changed. My last
       | attempt did not consistently work because GPT4-Turbo sometimes
       | ignored the context and instructions almost entirely.
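        | 
        | The pattern is roughly this (a sketch using the openai Python
        | client; the prompt wording and model choice are mine):
        | 
        |     import json
        |     from openai import OpenAI
        |     
        |     client = OpenAI()  # reads OPENAI_API_KEY from the env
        |     
        |     def render_audit_log(entries):
        |         # Hand the raw records to the model and ask for
        |         # presentation-ready HTML back.
        |         resp = client.chat.completions.create(
        |             model="gpt-4o",
        |             messages=[
        |                 {"role": "system",
        |                  "content": "Render this JSON as a clean "
        |                             "HTML table. Return only HTML."},
        |                 {"role": "user",
        |                  "content": json.dumps(entries)}])
        |         return resp.choices[0].message.content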
        
         | ijidak wrote:
         | Do you have an example of this? I would love to learn more.
        
       ___________________________________________________________________
       (page generated 2024-05-14 23:00 UTC)