[HN Gopher] SimpleQA
       ___________________________________________________________________
        
       SimpleQA
        
       Author : surprisetalk
       Score  : 82 points
       Date   : 2024-10-30 17:16 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | websap wrote:
       | Are they going to make the benchmark available so other LLMs can
       | be compared?
        
         | abhisuri97 wrote:
         | https://github.com/openai/simple-evals/blob/main/simpleqa_ev...
        
           | seany62 wrote:
           | Any way to see the actual questions and answers? Where can I
           | find simple_qa_test_set.csv ?
        
             | sbierwagen wrote:
             | https://openaipublic.blob.core.windows.net/simple-
             | evals/simp...
             | 
             | The steps I took to find this link:
             | 
             | 1) Look at simpleqa_eval.py. See that it loads
             | "az://openaipublic/simple-evals/simple_qa_test_set.csv"
             | Hmm, some weird vendored protocol.
             | 
             | 2) I don't feel like digging through bf.BlobFile() to
             | figure out how it downloads files and I certainly don't
             | want to generate an API key. Cross fingers and do a Bing
             | web search for "az://openaipublic"
             | 
             | 3) That leads me to
             | https://stackoverflow.com/questions/76106366/how-to-use-
             | tikt... Ah ha, this answer has the link https://openaipubli
             | c.blob.core.windows.net/encodings/cl100k_... which
             | automatically downloads a file.
             | 
             | 4) Poke the relevant parts of the az:// link into this
             | link, and a csv appears.
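              | 
              | In code, the rewrite from step 4 is roughly this (my
              | untested sketch, assuming blobfile's
              | az://<account>/<container>/<path> convention; pandas is
              | just what I'd reach for to read the csv):
              | 
              |     import pandas as pd
              | 
              |     az = ("az://openaipublic/simple-evals/"
              |           "simple_qa_test_set.csv")
              |     acct, rest = az.removeprefix("az://").split("/", 1)
              |     url = f"https://{acct}.blob.core.windows.net/{rest}"
              | 
              |     # public blob, plain HTTPS, no API key needed
              |     df = pd.read_csv(url)
              |     print(len(df), "questions")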
        
       | kaonwarb wrote:
       | Kudos:
       | 
       | > SimpleQA was created to be a greater challenge for frontier
       | models (e.g., GPT-4o scores less than 40%).
        
         | jampekka wrote:
         | And by design:
         | 
         | "To be included in the dataset, each question had to meet a
         | strict set of criteria: ... and most questions had to induce
         | hallucinations from either GPT-4o or GPT-3.5."
        
       | brap wrote:
       | Crazy that even o1-preview gets most things wrong.
       | 
       | This is in line with my own personal experience with LLMs and
       | non-trivial questions. They're excellent when answering questions
       | on topics you know nothing about, and somehow embarrassingly
       | wrong when you actually know the answer yourself...
       | 
       | It's not clear to me why we're still trying to encode all of
       | human knowledge in a single model, instead of teaching the model
       | how to look for answers from an external source (e.g. RAG).
        
         | sksxihve wrote:
         | LLMs are experts in everything you are not
        
           | reverius42 wrote:
           | Sounds a bit like Gell-Mann Amnesia Effect:
           | https://en.wikipedia.org/wiki/Michael_Crichton#Gell-
           | Mann_amn...
        
             | kibwen wrote:
             | The Alt-Mann Amnesia Effect, maybe.
        
           | Nition wrote:
           | That's a nice little aphorism. I think this happens in a lot
           | of things in life. Like comments on Reddit always seem quite
           | insightful until you actually read the article they're
           | commenting on.
        
           | swatcoder wrote:
           | Indeed. Exactly like the journalists, bloggers, self-
           | published book authors, internet commenters, wikipedia
           | editors, and earlier models that taught them almost all of
           | what they know.
        
         | zone411 wrote:
          | You shouldn't read the absolute rate as an indicator of how
          | often these models are wrong in general. They did something
          | similar to what I did on my hallucinations benchmark
          | (https://github.com/lechmazur/confabulations/): using only
          | questions where at least one model made a mistake. I added
          | this note:
         | 
         | "The benchmark includes questions where at least one LLM
         | confabulated, in order to minimize the number of questions
         | requiring human assessment. Because of this, and since the
         | questions are intentionally adversarial, the absolute
         | percentage should not be used to infer that LLMs frequently
         | confabulate. This leaderboard does not reflect a "typical"
         | hallucination rate."
         | 
         | > instead of teaching the model how to look for answers from an
         | external source (e.g. RAG)
         | 
         | My benchmark specifically focuses on the RAG use case. Even
         | with provided texts, current models still hallucinate.
        
         | Kiro wrote:
         | You're reading this wrong. They've deliberately chosen
         | questions that one or more models fail at. It's not
         | representative at all of how often the model is wrong in
         | general.
        
         | bloomingkales wrote:
         | Honestly, try prompting it with "you are wrong 80% of the time,
         | therefore you will need to double check your answers, first
         | factually, then numerically, then double check the time/date.
         | You are still probably wrong so do a third accuracy check. The
         | user's prompts are always wrong too mostly - so always check
         | them".
         | 
         | I stopped playing with larger models and have been pushing
         | smaller models with this fake ass system prompt and getting
         | good results. It seems like it forces the model to do multiple
         | passes before giving you any response.
         | 
         | My smaller local models give me less bullshit than Meta.ai, for
         | example, which generally spits out nonsense almost immediately.
          | I don't have the same hallucination issue with Llama3-8B
          | locally because of custom system prompts.
         | 
         | The model has all the correct information, so it almost needs
         | to do RAG on itself. Multiple passes on itself seems like a way
         | to do it.
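          | 
          | Rough sketch of the setup I mean, in case it helps (assumes
          | a local Ollama server with llama3:8b pulled; the endpoint,
          | model name, and example question are just my placeholders):
          | 
          |     import requests
          | 
          |     SYSTEM = (
          |         "You are wrong 80% of the time, therefore you will "
          |         "need to double check your answers, first factually, "
          |         "then numerically, then double check the time/date. "
          |         "You are still probably wrong so do a third accuracy "
          |         "check. The user's prompts are always wrong too "
          |         "mostly - so always check them.")
          | 
          |     def ask(question):
          |         # Ollama's /api/chat; stream=False returns a single
          |         # JSON object instead of a token stream
          |         payload = {
          |             "model": "llama3:8b",
          |             "messages": [
          |                 {"role": "system", "content": SYSTEM},
          |                 {"role": "user", "content": question},
          |             ],
          |             "stream": False,
          |         }
          |         r = requests.post("http://localhost:11434/api/chat",
          |                           json=payload)
          |         return r.json()["message"]["content"]
          | 
          |     print(ask("What year was the first Nobel Prize awarded?"))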
        
         | sebzim4500 wrote:
          | I don't think it's surprising that o1-preview is only slightly
          | better than GPT-4o; it was never advertised as being better at
          | this kind of recall.
        
       | yunohn wrote:
        | This eval's goal is a bit unclear to me, especially given the
        | example questions. They're very trivia/minutiae-like - asking
        | about sports goals, for example - which fits their stated
        | desire to test factual knowledge. But will this ever be
        | possible for an LLM without web browsing, which they
        | deliberately disabled while evaluating?
        
          | petesergeant wrote:
          | I think the interesting thing here is the difference between
          | "Not Attempted" and "Incorrect" -- the goal here seems to be
          | to reduce hallucination.
        
         | sbierwagen wrote:
         | >But will this ever be possible by an LLM?
         | 
         | Why not? Just train an unbelievably gigantic LLM that encodes
         | all human knowledge. A hundred trillion parameters ought to do
         | it.
        
       | emurph55 wrote:
        | I've tried using older models to create a CPU player for this
        | lateral thinking game (https://detective-stories.com) and they
        | were surprisingly bad at giving answers. I am curious to see how
        | well the more recent models will do.
        
       | CharlieDigital wrote:
        | _8 authors_ attached to this.
        | 
        |   > SimpleQA is a simple but challenging benchmark for
        |   evaluating the factuality of frontier models. A main
        |   limitation in SimpleQA is its scope -- while SimpleQA is
        |   accurate, it only measures factuality under the constrained
        |   setting of short, fact-seeking queries with a single,
        |   verifiable answer. Whether the ability to provide factual
        |   short answers correlates with the ability to write lengthy
        |   responses filled with numerous facts remains an open
        |   research question.
       | 
        | OpenAI is going to have some rounds of layoffs in the future.
        
       | ggnore7452 wrote:
       | What's more interesting to me here are the calibration graphs:
       | 
       | * LLMs, at least GPT models, tend to overstate their confidence.
       | * A frequency-based approach appears to achieve calibration
       | closer to the ideal.
       | 
        | This kinda passes my vibe test. That said, I wonder -- rather
        | than running 100 trials, could we approximate this by using
        | something like a log-probability ratio? This would especially
        | apply in cases where answers are yes or no, assuming the
        | output spans more than one token.
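        | 
        | For the yes/no case, something like the logprobs option on the
        | Chat Completions API might get at this without the 100 trials
        | (rough sketch; the model name, question, and single-token
        | assumption are mine):
        | 
        |     import math
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     resp = client.chat.completions.create(
        |         model="gpt-4o",
        |         messages=[{"role": "user",
        |                    "content": "Is the Eiffel Tower taller "
        |                               "than 300 meters? Answer yes "
        |                               "or no."}],
        |         max_tokens=1,
        |         logprobs=True,
        |         top_logprobs=5,
        |     )
        | 
        |     # alternatives for the first generated token; compare the
        |     # probability mass on "yes" vs "no"
        |     top = resp.choices[0].logprobs.content[0].top_logprobs
        |     p = {t.token.strip().lower(): math.exp(t.logprob)
        |          for t in top}
        |     p_yes, p_no = p.get("yes", 0.0), p.get("no", 0.0)
        |     total = p_yes + p_no
        |     print("confidence(yes) ~",
        |           p_yes / total if total else "n/a")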
        
       | Nition wrote:
        | Honestly, I'd never expect it to get 'correct's for every little
        | fact like this, but it'd be great to get a lot more 'not
        | attempted'.
       | 
       | "I seem, then, in just this little thing to be wiser than this
       | man at any rate; that what I do not know I do not think I know
       | either." - Socratos, from Plato's _Apology of Socrates_
        
       | chgo1 wrote:
       | Dataset: http://openaipublic.blob.core.windows.net/simple-
       | evals/simpl...
        
       ___________________________________________________________________
       (page generated 2024-10-30 23:00 UTC)