[HN Gopher] SimpleQA
___________________________________________________________________
SimpleQA
Author : surprisetalk
Score : 82 points
Date : 2024-10-30 17:16 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| websap wrote:
| Are they going to make the benchmark available so other LLMs can
| be compared?
| abhisuri97 wrote:
| https://github.com/openai/simple-evals/blob/main/simpleqa_ev...
| seany62 wrote:
| Any way to see the actual questions and answers? Where can I
| find simple_qa_test_set.csv ?
| sbierwagen wrote:
| https://openaipublic.blob.core.windows.net/simple-
| evals/simp...
|
| The steps I took to find this link:
|
| 1) Look at simpleqa_eval.py. See that it loads
| "az://openaipublic/simple-evals/simple_qa_test_set.csv".
| Hmm, some weird vendored protocol.
|
| 2) I don't feel like digging through bf.BlobFile() to
| figure out how it downloads files and I certainly don't
| want to generate an API key. Cross fingers and do a Bing
| web search for "az://openaipublic"
|
| 3) That leads me to
| https://stackoverflow.com/questions/76106366/how-to-use-
| tikt... Ah ha, this answer has the link https://openaipubli
| c.blob.core.windows.net/encodings/cl100k_... which
| automatically downloads a file.
|
| 4) Poke the relevant parts of the az:// link into this
| link, and a csv appears.
| kaonwarb wrote:
| Kudos:
|
| > SimpleQA was created to be a greater challenge for frontier
| models (e.g., GPT-4o scores less than 40%).
| jampekka wrote:
| And by design:
|
| "To be included in the dataset, each question had to meet a
| strict set of criteria: ... and most questions had to induce
| hallucinations from either GPT-4o or GPT-3.5."
| brap wrote:
| Crazy that even o1-preview gets most things wrong.
|
| This is in line with my own experience with LLMs and non-
| trivial questions. They're excellent at answering questions on
| topics you know nothing about, and somehow embarrassingly wrong
| when you actually know the answer yourself...
|
| It's not clear to me why we're still trying to encode all of
| human knowledge in a single model, instead of teaching the model
| how to look for answers from an external source (e.g. RAG).
| sksxihve wrote:
| LLMs are experts in everything you are not
| reverius42 wrote:
| Sounds a bit like Gell-Mann Amnesia Effect:
| https://en.wikipedia.org/wiki/Michael_Crichton#Gell-
| Mann_amn...
| kibwen wrote:
| The Alt-Mann Amnesia Effect, maybe.
| Nition wrote:
| That's a nice little aphorism. I think this happens in a lot
| of things in life. Like comments on Reddit always seem quite
| insightful until you actually read the article they're
| commenting on.
| swatcoder wrote:
| Indeed. Exactly like the journalists, bloggers, self-
| published book authors, internet commenters, wikipedia
| editors, and earlier models that taught them almost all of
| what they know.
| zone411 wrote:
| You shouldn't use the rate as an indicator. They did something
| similar to what I did on my hallucinations benchmark
| (https://github.com/lechmazur/confabulations/), only using
| questions where at least one model made a mistake. I added this
| note:
|
| "The benchmark includes questions where at least one LLM
| confabulated, in order to minimize the number of questions
| requiring human assessment. Because of this, and since the
| questions are intentionally adversarial, the absolute
| percentage should not be used to infer that LLMs frequently
| confabulate. This leaderboard does not reflect a "typical"
| hallucination rate."
|
| > instead of teaching the model how to look for answers from an
| external source (e.g. RAG)
|
| My benchmark specifically focuses on the RAG use case. Even
| with provided texts, current models still hallucinate.
| Kiro wrote:
| You're reading this wrong. They've deliberately chosen
| questions that one or more models fail at. It's not
| representative at all of how often the model is wrong in
| general.
| bloomingkales wrote:
| Honestly, try prompting it with "you are wrong 80% of the time,
| therefore you will need to double check your answers, first
| factually, then numerically, then double check the time/date.
| You are still probably wrong so do a third accuracy check. The
| user's prompts are always wrong too mostly - so always check
| them".
|
| I stopped playing with larger models and have been pushing
| smaller models with this fake ass system prompt and getting
| good results. It seems like it forces the model to do multiple
| passes before giving you any response.
|
| My smaller local models give me less bullshit than Meta.ai, for
| example, which generally spits out nonsense almost immediately.
| I don't have the same hallucination issue with Llama 3 8B
| locally because of custom system prompts.
|
| The model has all the correct information, so it almost needs
| to do RAG on itself. Multiple passes on itself seems like a way
| to do it.
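|
| A minimal sketch of that setup (the prompt text is the one
| quoted above; the Ollama OpenAI-compatible endpoint and the
| model tag are my assumptions, so swap in whatever local runtime
| you actually use):
|
|   from openai import OpenAI
|
|   # Local Llama 3 8B behind Ollama's OpenAI-compatible API.
|   client = OpenAI(base_url="http://localhost:11434/v1",
|                   api_key="unused")
|
|   SYSTEM = (
|       "You are wrong 80% of the time, therefore you will need "
|       "to double check your answers, first factually, then "
|       "numerically, then double check the time/date. You are "
|       "still probably wrong so do a third accuracy check. The "
|       "user's prompts are always wrong too mostly - so always "
|       "check them."
|   )
|
|   resp = client.chat.completions.create(
|       model="llama3:8b",
|       messages=[
|           {"role": "system", "content": SYSTEM},
|           {"role": "user",
|            "content": "When did the Berlin Wall fall?"},
|       ],
|   )
|   print(resp.choices[0].message.content)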
| sebzim4500 wrote:
| I don't think it's surprising that o1-preview is only slightly
| better than GPT-4o; it was never advertised as being better at
| this kind of recall.
| yunohn wrote:
| This eval's goal is a bit unclear to me, especially given the
| example questions. They're very trivia/minutiae-like, asking
| about sports goals for example, in line with the stated aim of
| testing factual knowledge. But will this ever be possible for
| an LLM without web browsing, which they deliberately removed
| while evaluating?
| petesergeant wrote:
| I think the interesting thing here is the difference between
| 'not attempted' and 'incorrect' -- the goal here seems to be to
| reduce hallucination.
| sbierwagen wrote:
| >But will this ever be possible by an LLM?
|
| Why not? Just train an unbelievably gigantic LLM that encodes
| all human knowledge. A hundred trillion parameters ought to do
| it.
| emurph55 wrote:
| I've tried using older models to create a CPU player for this
| lateral thinking game (https://detective-stories.com) and they
| were surprisingly bad at giving answers. I am curious to see how
| well the more recent models will do.
| CharlieDigital wrote:
| _8 authors_ attached to this.
|
| > SimpleQA is a simple but challenging benchmark for evaluating
| the factuality of frontier models. A main limitation in
| SimpleQA is its scope--while SimpleQA is accurate it only
| measures factuality under the constrained setting of short,
| fact-seeking queries with a single, verifiable answer. Whether
| the ability to provide factual short answers correlates with
| the ability to write lengthy responses filled with numerous
| facts remains an open research question.
|
| OpenAI is going to have some rounds of layoffs in the future.
| ggnore7452 wrote:
| What's more interesting to me here are the calibration graphs:
|
| * LLMs, at least GPT models, tend to overstate their confidence.
| * A frequency-based approach appears to achieve calibration
| closer to the ideal.
|
| This kinda passes my vibe test. That said, I wonder--rather than
| running 100 trials, could we approximate this by using something
| like a log-probability ratio? This would especially apply in
| cases where answers are yes or no, assuming the output spans more
| than one token.
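|
| A rough sketch of the logprob idea for a yes/no question
| (assuming the chat completions API with logprobs enabled; the
| model name and question are just placeholders):
|
|   import math
|   from openai import OpenAI
|
|   client = OpenAI()
|   resp = client.chat.completions.create(
|       model="gpt-4o",
|       messages=[{
|           "role": "user",
|           "content": "Answer with one word, Yes or No: "
|                      "is Canberra the capital of Australia?",
|       }],
|       max_tokens=1,
|       logprobs=True,
|       top_logprobs=5,
|   )
|
|   # Estimate confidence from the first answer token's
|   # distribution instead of re-sampling the model 100 times.
|   first = resp.choices[0].logprobs.content[0]
|   probs = {}
|   for t in first.top_logprobs:
|       tok = t.token.strip().lower()
|       probs[tok] = probs.get(tok, 0.0) + math.exp(t.logprob)
|   print("P(yes) ~=", probs.get("yes", 0.0),
|         " P(no) ~=", probs.get("no", 0.0))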
| Nition wrote:
| Honestly, I'd never expect it to get 'correct' for every little
| fact like this, but it'd be great to get a lot more 'not
| attempted'.
|
| "I seem, then, in just this little thing to be wiser than this
| man at any rate; that what I do not know I do not think I know
| either." - Socrates, from Plato's _Apology of Socrates_
| chgo1 wrote:
| Dataset: http://openaipublic.blob.core.windows.net/simple-
| evals/simpl...
___________________________________________________________________
(page generated 2024-10-30 23:00 UTC)