https://www.oii.ox.ac.uk/news-events/study-identifies-weaknesses-in-how-ai-systems-are-evaluated/
PRESS RELEASE: Study identifies weaknesses in how AI systems are evaluated

Published on 4 Nov 2025

Largest systematic review of AI benchmarks highlights need for clearer definitions and stronger scientific standards.

A new study led by the Oxford Internet Institute (OII) at the University of Oxford, involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, the researchers review 445 AI benchmarks, the standardised evaluations used to compare and rank AI systems. They found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety.

"Benchmarks underpin nearly all claims about advances in AI," says Andrew Bean, lead author of the study. "But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

Benchmarks play a central role in how AI systems are designed, deployed, and regulated.
They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on "appropriate technical tools and benchmarks." The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are.

"This work reflects the kind of large-scale collaboration the field needs," adds Dr Adam Mahdi. "By bringing together leading AI labs, we're starting to tackle one of the most fundamental gaps in current AI evaluation."

Key findings

- Lack of statistical rigour: Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems, or claims of superiority, could be due to chance rather than genuine improvement.
- Vague or contested definitions: Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks are testing what they intend to.

Examples

- Confounding formatting rules: A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is.
- Brittle performance: A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem.
- Unsupported claims: If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.

Recommendations for better benchmarking

The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include:

- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons (one way of doing this is shown in the sketch below); conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.

The team also provides a Construct Validity Checklist, a practical tool researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/

The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published as part of the NeurIPS 2025 peer-reviewed conference proceedings; the conference takes place in San Diego from 2-7 December. The peer-reviewed paper is available on request.
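To make the statistical-rigour point concrete: reporting uncertainty can be as simple as putting a confidence interval around the gap between two models' benchmark scores. The Python sketch below is illustrative only and is not taken from the paper; the data, model names, and the paired_bootstrap_ci helper are hypothetical, and it assumes both models were graded on the same items (scored 1 for correct, 0 for incorrect).

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the accuracy gap between two models
    graded on the same benchmark items (1 = correct, 0 = incorrect).

    Items are resampled in pairs, preserving the per-item correlation
    between the two models that per-model resampling would discard.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))  # item resamples
    diffs = (a[idx] - b[idx]).mean(axis=1)                     # gap per resample
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

# Hypothetical per-item correctness for two models on a 200-item benchmark.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.78, size=200)  # stand-in for model A's grades
model_b = rng.binomial(1, 0.74, size=200)  # stand-in for model B's grades

gap, (lo, hi) = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy gap: {gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# A CI that straddles 0 means the observed gap could plausibly be chance.
```

If the interval comfortably excludes zero, the reported ranking is unlikely to be an artefact of which items happened to be in the benchmark; according to the study, only 16% of the reviewed benchmarks reported any uncertainty estimate of this kind.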
Media spokespeople

Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford
Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford

Contact

For more information and briefings, please contact:
Anthea Milnes, Head of Communications
Sara Spinks / Veena McCoole, Media and Communications Manager
T: +44 (0)1865 280527
M: +44 (0)7551 345493
E: press@oii.ox.ac.uk

About the Oxford Internet Institute (OII)

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents.

About the University of Oxford

Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

Funding information

- A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute's Research Programme on AI & Work.
- A.M. is supported by the Oxford Internet Institute's Research Programme on AI & Work.
- R.O.K. is supported by a Fellowship from the Cosmos Institute.
- H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI.
- C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.
- F.L. is supported by Clarendon and Jason Hu studentships.
- H.R.K.'s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.
- M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087).
- This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.
- O.D. is supported by the UKRI's EPSRC AIMS CDT grant (EP/S024050/1).
- J.R. is supported by the Engineering and Physical Sciences Research Council.
- J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.
- A. Bibi would like to acknowledge the UK AISI systemic safety grant.
- A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant.