[HN Gopher] GDPVal: Measuring the performance of our models on r...
___________________________________________________________________
GDPVal: Measuring the performance of our models on real-world tasks
Author : BGyss
Score : 22 points
Date : 2025-09-25 16:55 UTC (6 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| westurner wrote:
| "GDPVal: Measuring AI model performance on real world
| economically viable tasks" (2025)
| https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf1...
|
| GDP? GlobalGoals ... The Sustainable Development Goals (SDGs)
| include 17 goals, 169 targets, and over 230 indicators.
|
| For strategic alignment,
|
| Strategic alignment:
| https://en.wikipedia.org/wiki/Strategic_alignment
|
| Sustainable Development Goals:
| https://en.wikipedia.org/wiki/Sustainable_Development_Goals
|
| IIUC, to produce the SDGs they clustered the world's problems in
| an international collaborative exercise, as the successor to the
| MDGs (2000-2015).
|
| Each country voluntarily produces an annual SDG report on their
| progress on their Targets according to the Indicators.
|
| IMHO, priorities should include clean energy and AI efficiency,
| given the growth projections for AI energy use (and for our
| electrical bills, given the expected continued shortages of
| energy supply).
|
| Which real-world SDG tasks can be AI eval'd?
| Snuggly73 wrote:
| Apparently producing a React component that returns a piece of
| HTML with ARIA attributes set up. Long horizon my ass.
| westurner wrote:
| Did the LLM in that case suggest adopting an open-source UI
| library that already has tests for and implements support for
| W3C ARIA accessibility features, like React-Aria or other
| alternatives?
|
| Or did it just do the job as prompted and not mention
| suggestions for continuous improvement like reusing tested
| open source components?
| Snuggly73 wrote:
| Not sure how it went in their tests - I've tried Opus and
| GPT5 and it was a few lines of React + tests, so I guess 'no'
| nextworddev wrote:
| Couldn't find their open source evals dataset
| Snuggly73 wrote:
| https://huggingface.co/datasets/openai/gdpval/viewer/default...
| nextworddev wrote:
| thanks!
| esafak wrote:
| They reported the competitors' performance for a change.
| Especially curious because OpenAI is not first. Kudos?
| CuriouslyC wrote:
| Claude's low-noise message style and good common sense bait
| people into thinking they can rely on it for hard stuff.
___________________________________________________________________
(page generated 2025-09-25 23:01 UTC)