[HN Gopher] GDPVal: Measuring the performance of our models on r...
       ___________________________________________________________________
        
       GDPVal: Measuring the performance of our models on real-world tasks
        
       Author : BGyss
       Score  : 22 points
       Date   : 2025-09-25 16:55 UTC (6 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | westurner wrote:
       | "GDPval: Evaluating AI Model Performance on Real-World
       | Economically Valuable Tasks" (2025)
       | https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf1...
       | 
       | GDP? GlobalGoals ... The Sustainable Development Goals (SDGs)
       | include 17 goals, 169 targets, and over 230 indicators.
       | 
       | For strategic alignment, see:
       | 
       | Strategic alignment:
       | https://en.wikipedia.org/wiki/Strategic_alignment
       | 
       | Sustainable Development Goals:
       | https://en.wikipedia.org/wiki/Sustainable_Development_Goals
       | 
       | IIUC, the SDGs were produced by clustering the world's problems
       | in an international collaborative exercise, as the successor to
       | the MDGs (2000-2015).
       | 
       | Each country voluntarily produces an annual SDG report on its
       | progress toward the Targets, as measured by the Indicators.
       | 
       | IMHO, priorities should include clean energy and AI efficiency,
       | given the growth projections for AI energy use (and for our
       | electric bills, given the expected continued energy supply
       | shortages).
       | 
       | Which real-world SDG tasks can be AI eval'd?
        
         | Snuggly73 wrote:
         | Apparently producing a React component that returns a piece of
         | HTML with ARIA attributes set up. Long horizon my ass.
        
           | westurner wrote:
           | Did the LLM in that case suggest adopting an open-source UI
           | library that already implements and tests support for W3C
           | ARIA accessibility features, like React-Aria or other
           | alternatives?
           | 
           | Or did it just do the job as prompted, without suggesting
           | continuous improvements like reusing tested open-source
           | components?
        
             | Snuggly73 wrote:
             | Not sure how it went in their tests - I've tried Opus and
             | GPT-5 and it was a few lines of React + tests, so I guess 'no'
        
       | nextworddev wrote:
       | Couldn't find their open source evals dataset
        
         | Snuggly73 wrote:
         | https://huggingface.co/datasets/openai/gdpval/viewer/default...
        
           | nextworddev wrote:
           | thanks!
        
       | esafak wrote:
       | They reported the competitors' performance for a change.
       | Especially curious because OpenAI is not first. Kudos?
        
       | CuriouslyC wrote:
       | Claude's low-noise message style and good common sense bait
       | people into thinking they can rely on it for hard stuff.
        
       ___________________________________________________________________
       (page generated 2025-09-25 23:01 UTC)