hngopher.com

       [HN Gopher] PaperBench
       ___________________________________________________________________
        
       PaperBench
        
       Author : meetpateltech
       Score  : 51 points
       Date   : 2025-04-02 17:06 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | smusamashah wrote:
       | We evaluate several frontier models on PaperBench, finding that
       | the best-performing tested agent, Claude 3.5 Sonnet (New) with
       | open-source scaffolding, achieves an average replication score of
       | 21.0%.
        
         | attentive wrote:
         | "We wished to also evaluate Claude 3.7 Sonnet, but were unable
         | to complete the experiments given rate limits with the
         | Anthropic API."
        
           | swyx wrote:
           | overall i REALLY like this paper and effort, but this part
           | sounds like a bit of bullshit. they don't have the ability to
           | implement retries and backoffs to deal with rate limits?
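
For readers unfamiliar with the "retries and backoffs" pattern mentioned in this comment, here is a minimal Python sketch. It is not taken from the PaperBench code or from any vendor SDK: `RateLimitError`, `call_with_backoff`, and all parameter names are hypothetical, and a real client would raise its own exception type for a rate-limit response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an API client's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            # Cap the exponential delay, then wait a random time in [0, cap].
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Backoff of this kind smooths over transient rate-limit errors, but it does not raise the underlying quota, so a large evaluation can still run slowly under a tight limit.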
        
       | DrillShopper wrote:
       | PaperBench sounds like a benchmarking software package for
       | recently released GPUs.
        
         | aSanchezStern wrote:
         | Where would the "paper" part come in? Is that just based on the
         | word "bench" in general?
        
           | hnuser123456 wrote:
           | The recent GPUs were a "paper launch"
        
       | timabdulla wrote:
       | What were the human PhDs able to do after more than 48 hours of
       | effort? Presumably given that these are top-level PhDs, the
       | replication success rate would be close to 100%?
        
       | riku_iki wrote:
       | I didn't get the idea of this benchmark. They ask for code to
       | replicate the results of papers that already have code on GitHub?
        
         | eightysixfour wrote:
         | You don't see the value of independent replication of findings?
         | 
         | The agent didn't have access to the code, although they
         | acknowledge it could theoretically be in the training set. Even
         | then, the original code wouldn't conform to the structure of the
         | test.
        
       | amelius wrote:
       | One thing I'd be interested in is a UI for reading papers with AI
       | assistance.
        
         | benbreen wrote:
         | I've been developing a more elaborate variation on the "chat
         | with a pdf" idea for my own use as a researcher. It's mostly
         | designed for a historian's workflow but it works pretty well
         | for science and engineering papers too. Currently Flash 2.0 is
         | the default but you can select other models to use to analyze
         | pdfs and other text through various "lenses" ranging from a
         | simple summary to text highlighting to extracting organized
         | data as a .csv file:
         | 
         | https://source-lens.vercel.app
         | 
         | (Note: this is not at all a production-ready app, it's just
         | something I've been making for myself, though I'm also now
         | sharing it with my students to see how they use it. If anyone
         | reads this and is interested in collaborating, let me know.)
        
         | rfurmani wrote:
         | I'm building such tools at https://sugaku.net. Right now
         | there's chatting with a paper and browsing similar papers.
         | Generally arXiv and other repositories want you to link to them
         | and not embed their papers, which makes it hard to build inline
         | reading tools, but it's on my roadmap to support that for
         | uploaded papers. Would love to hear if you have any feature
         | requests there.
        
       | no_multitudes wrote:
       | Are there examples of the outputs the LLMs under test generated?
       | I couldn't find any detailed ones in the paper or code.
       | 
       | The result here seems to be "Our Judge LLM gave another LLM a 21%
       | grade for some code it generated", which is ... not qualitatively
       | meaningful at all to me.
        
       ___________________________________________________________________
       (page generated 2025-04-02 23:00 UTC)