https://www.oii.ox.ac.uk/news-events/study-identifies-weaknesses-in-how-ai-systems-are-evaluated/
PRESS RELEASE: Study identifies weaknesses in how AI systems are evaluated

Published on 4 Nov 2025

Largest systematic review of AI benchmarks highlights need for clearer definitions and stronger scientific standards.

A new study led by the Oxford Internet Institute (OII) at the University of Oxford, involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, the researchers review 445 AI benchmarks, the standardised evaluations used to compare and rank AI systems. They found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety.

"Benchmarks underpin nearly all claims about advances in AI," says Andrew Bean, lead author of the study. "But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

Benchmarks play a central role in how AI systems are designed, deployed, and regulated.
They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on "appropriate technical tools and benchmarks." The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are.

"This work reflects the kind of large-scale collaboration the field needs," adds Dr Adam Mahdi. "By bringing together leading AI labs, we're starting to tackle one of the most fundamental gaps in current AI evaluation."

Key findings

- Lack of statistical rigour: Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems, or claims of superiority, could be due to chance rather than genuine improvement.
- Vague or contested definitions: Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks are testing what they intend to.

Examples

- Confounding formatting rules: A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is.
- Brittle performance: A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem.
- Unsupported claims: If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.

Recommendations for better benchmarking

The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include:

- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons (one way of doing this is shown in the sketch below); conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.

The team also provides a Construct Validity Checklist, a practical tool researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/

The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published as part of the NeurIPS 2025 peer-reviewed conference proceedings; the conference takes place in San Diego from 2-7 December. The peer-reviewed paper is available on request.
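To make the statistical-rigour point concrete: reporting uncertainty can be as simple as putting a confidence interval around the gap between two models' benchmark scores. The Python sketch below is illustrative only and is not taken from the paper; the data, model names, and the paired_bootstrap_ci helper are hypothetical, and it assumes both models were graded on the same items (scored 1 for correct, 0 for incorrect).

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the accuracy gap between two models
    graded on the same benchmark items (1 = correct, 0 = incorrect).

    Items are resampled in pairs, preserving the per-item correlation
    between the two models that per-model resampling would discard.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))  # item resamples
    diffs = (a[idx] - b[idx]).mean(axis=1)                     # gap per resample
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

# Hypothetical per-item correctness for two models on a 200-item benchmark.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.78, size=200)  # stand-in for model A's grades
model_b = rng.binomial(1, 0.74, size=200)  # stand-in for model B's grades

gap, (lo, hi) = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy gap: {gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# A CI that straddles 0 means the observed gap could plausibly be chance.
```

If the interval comfortably excludes zero, the reported ranking is unlikely to be an artefact of which items happened to be in the benchmark; according to the study, only 16% of the reviewed benchmarks reported any uncertainty estimate of this kind.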
Media spokespeople

Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford
Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford

Contact

For more information and briefings, please contact:
Anthea Milnes, Head of Communications
Sara Spinks / Veena McCoole, Media and Communications Manager
T: +44 (0)1865 280527
M: +44 (0)7551 345493
E: press@oii.ox.ac.uk

About the Oxford Internet Institute (OII)

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents.

About the University of Oxford

Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

Funding information

- A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute's Research Programme on AI & Work.
- A.M. is supported by the Oxford Internet Institute's Research Programme on AI & Work.
- R.O.K. is supported by a Fellowship from the Cosmos Institute.
- H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI.
- C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.
- F.L. is supported by Clarendon and Jason Hu studentships.
- H.R.K.'s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.
- M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087).
- This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.
- O.D. is supported by the UKRI's EPSRC AIMS CDT grant (EP/S024050/1).
- J.R. is supported by the Engineering and Physical Sciences Research Council.
- J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.
- A. Bibi would like to acknowledge the UK AISI systemic safety grant.
- A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant.