[HN Gopher] PaperBench
___________________________________________________________________
PaperBench
Author : meetpateltech
Score : 51 points
Date : 2025-04-02 17:06 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| smusamashah wrote:
| "We evaluate several frontier models on PaperBench, finding
| that the best-performing tested agent, Claude 3.5 Sonnet (New)
| with open-source scaffolding, achieves an average replication
| score of 21.0%."
| attentive wrote:
| "We wished to also evaluate Claude 3.7 Sonnet, but were unable
| to complete the experiments given rate limits with the
| Anthropic API."
| swyx wrote:
| Overall I REALLY like this paper and effort, but this part
| sounds like a bit of bullshit. They don't have the ability to
| implement retries and backoffs to deal with rate limits?
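|
| (A minimal retry sketch, assuming the Anthropic Python SDK;
| the model string and retry parameters here are illustrative,
| not the paper's actual setup:)
|
|     import random
|     import time
|
|     import anthropic  # pip install anthropic
|
|     # Reads ANTHROPIC_API_KEY from the environment.
|     client = anthropic.Anthropic()
|
|     def complete_with_backoff(prompt, retries=8, base=1.0,
|                               cap=60.0):
|         """Call the Messages API, retrying rate-limit errors
|         with exponential backoff plus full jitter."""
|         for attempt in range(retries):
|             try:
|                 return client.messages.create(
|                     model="claude-3-7-sonnet-20250219",
|                     max_tokens=1024,
|                     messages=[{"role": "user",
|                                "content": prompt}],
|                 )
|             except anthropic.RateLimitError:
|                 if attempt == retries - 1:
|                     raise  # out of retries; surface the error
|                 delay = min(cap, base * 2 ** attempt)
|                 time.sleep(delay + random.uniform(0, delay))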
| DrillShopper wrote:
| PaperBench sounds like a benchmarking software package for
| recently released GPUs.
| aSanchezStern wrote:
| Where would the "paper" part come in? Is that just based on the
| word "bench" in general?
| hnuser123456 wrote:
| The recent GPUs were a "paper launch"
| timabdulla wrote:
| What were the human PhDs able to do after more than 48 hours
| of effort? Presumably, given that these are top-level PhDs,
| the replication success rate would be close to 100%?
| riku_iki wrote:
| I didn't get the idea of this benchmark: they ask agents to
| produce code to replicate the results of papers which already
| have code on GitHub?
| eightysixfour wrote:
| You don't see the value of independent replication of findings?
|
| The agent didn't have access to the code. They acknowledge it
| could theoretically be in the training set, but even then the
| original code wouldn't conform to the structure of the test.
| amelius wrote:
| One thing I'd be interested in is a UI for reading papers with AI
| assistance.
| benbreen wrote:
| I've been developing a more elaborate variation on the "chat
| with a pdf" idea for my own use as a researcher. It's mostly
| designed for a historian's workflow, but it works pretty well
| for science and engineering papers too. Currently Flash 2.0 is
| the default, but you can select other models to analyze PDFs
| and other text through various "lenses", ranging from a simple
| summary to text highlighting to extracting organized data as a
| .csv file:
|
| https://source-lens.vercel.app
|
| (Note: this is not at all a production-ready app; it's just
| something I've been making for myself, though I'm also now
| sharing it with my students to see how they use it. If anyone
| reads this and is interested in collaborating, let me know.)
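|
| (A "lens" in this sense can be as simple as a prompt template
| dispatched over the extracted PDF text. A rough sketch with
| illustrative names, not the app's actual code:)
|
|     # Each lens is a prompt template applied to extracted text.
|     LENSES = {
|         "summary": "Summarize this document in one "
|                    "paragraph:\n\n{text}",
|         "highlight": "Quote the five most important passages "
|                      "verbatim:\n\n{text}",
|         "csv": "Extract any tabular data as CSV with a header "
|                "row:\n\n{text}",
|     }
|
|     def apply_lens(lens, text, complete):
|         """Render the lens prompt and send it through any
|         text-completion callable (model-agnostic)."""
|         return complete(LENSES[lens].format(text=text))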
| rfurmani wrote:
| I'm building such tools at https://sugaku.net; right now you
| can chat with a paper and browse similar papers. Generally,
| arXiv and other repositories want you to link to them rather
| than embed their papers, which makes it hard to build inline
| reading tools, but supporting that for uploaded papers is on
| my roadmap. Would love to hear if you have any feature
| requests there.
| no_multitudes wrote:
| Are there examples of the outputs the LLMs under test generated?
| I couldn't find any detailed ones in the paper or code.
|
| The result here seems to be "Our Judge LLM gave another LLM a 21%
| grade for some code it generated", which is ... not qualitatively
| meaningful at all to me.
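|
| (For context, the paper grades each attempt against a
| hierarchical rubric: leaf requirements are judged pass/fail by
| an LLM and scores roll up as weighted averages, which is how a
| grade like 21% arises. A rough sketch of that aggregation,
| with illustrative names and weights rather than the paper's
| code:)
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class RubricNode:
|         weight: float
|         passed: bool = False   # set by the judge on leaf nodes
|         children: list = field(default_factory=list)
|
|     def score(node):
|         """Leaf: 1.0 if judged satisfied, else 0.0. Internal
|         node: weight-averaged score of its children."""
|         if not node.children:
|             return 1.0 if node.passed else 0.0
|         total = sum(c.weight for c in node.children)
|         return sum(c.weight * score(c)
|                    for c in node.children) / total
|
|     # e.g. "code runs" satisfied, "result reproduced" not:
|     root = RubricNode(1.0, children=[
|         RubricNode(2.0, passed=True),
|         RubricNode(1.0, passed=False),
|     ])
|     print(score(root))  # ~0.667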
___________________________________________________________________
(page generated 2025-04-02 23:00 UTC)