
Toward an Evaluation Science for Generative AI Systems

Wednesday, April 16, 2025

Authors: Laura Weidinger, Deb Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Sanmi Koyejo, and William Isaac

There is an urgent need for a more robust and comprehensive approach to AI evaluation.
There is an increasing imperative to anticipate and understand the performance and safety of generative artificial intelligence (AI) systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems. While generative AI creates unique challenges for system safety engineering and measurement science, the field can draw valuable insights from the development of safety evaluation practices in other fields, including transportation, aerospace, and pharmaceutical engineering. In particular, we present three key lessons: Evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established. Applying these insights, we outline a concrete path toward a more rigorous approach for evaluating generative AI systems.
 
The Rise of Generative AI Systems
 
The widespread deployment of generative AI systems in medicine (Boyd 2023), law (e.g., Lexis+AI), education (Singer 2024), information technology (e.g., Microsoft’s Copilot; Reid 2024), and many social settings (e.g., Replika, character.ai) has led to a collective realization: The performance and safety of generative AI systems in real-world deployment contexts are very often poorly anticipated and understood (Mulligan 2024; Roose 2024a; Wiggers 2024). The tendency of these systems to generate inaccurate statements has already led to the spread of medical and other misinformation (Archer and Elliott 2025; Omiye et al. 2023); incorrect legal references (Magesh et al. 2024); failures as educational support tools (Singer 2023); and widespread confusion in search engine use (Heaven 2022; Murphy Kelly 2023). Beyond factual discrepancies, AI-enabled chatbots have also been described as interacting inappropriately with users (Roose 2023), exposing security vulnerabilities (Nicolson 2023), and fostering unhealthy emotional reliance (Dzieza 2024; Roose 2024b; Verma 2023).
 
The historical focus on benchmarks and leaderboards has been effective at encouraging the AI research community to pursue shared directions; however, as AI products become widely integrated into our everyday lives, it is increasingly clear that static benchmarks are not well suited to improving our understanding of the real-world performance and safety of deployed generative AI systems (Bowman and Dahl 2021; de Vries et al. 2020; Goldfarb-Tarrant et al. 2021; Liao et al. 2021; Raji 2021). Despite this mismatch, static benchmarks are still commonly used to inform real-world decisions about generative AI systems that stretch far beyond the research landscape—such as in deployment criteria and marketing materials for new model or system releases (Anthropic 2024b; Gemini Team Google 2024; Grattafiori et al. 2024; OpenAI 2024), third-party critiques (Mirzadeh et al. 2024; Zhang et al. 2024), procurement guidelines (Johnson et al. 2024), and public policy discourse (NIST 2023; European Commission 2024). Although there is an emerging interest in more interactive (Chiang et al. 2024), dynamic (Kiela et al. 2021), and behavioral (Ribeiro et al. 2020) approaches to evaluation, many of the existing alternatives to benchmarks, such as red teaming exercises and case-by-case audits, still fall woefully short of enabling systematic assessments and accountability (Birhane et al. 2024; Friedler et al. 2023).

For AI evaluation to mature into a proper “science,” it must meet certain criteria. Sciences are marked by having theories about their targets of study, which can be expressed as testable hypotheses. Measurement instruments to test these hypotheses must provide experimental consistency (i.e., reliability, internal validity) and generalizability (i.e., external validity). Finally, sciences are marked by iteration: Over time, measurement approaches and instruments are refined and new insights are uncovered. Collectively, these properties of sciences contrast sharply with the practice of rapidly developing static benchmarks for evaluating generative AI systems, while anticipating that within a few months such benchmarks will become much less useful or obsolete.
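To make the notion of experimental consistency concrete, here is a minimal sketch (not drawn from the article) of a test–retest reliability check for an evaluation instrument; the helper function and the sample scores are hypothetical.

```python
import statistics

def test_retest_reliability(run_scores_a, run_scores_b):
    """Illustrative check of experimental consistency (reliability).

    run_scores_a / run_scores_b: per-item scores from two independent runs of
    the same evaluation on the same system. A Pearson correlation near 1.0
    suggests the instrument yields consistent readings; values well below 1.0
    suggest the measurement itself is unstable. (Hypothetical helper, not a
    method prescribed by the article; requires Python 3.10+.)
    """
    return statistics.correlation(run_scores_a, run_scores_b)

# Example: two repeated runs that agree closely item by item.
print(test_retest_reliability([0.9, 0.4, 0.7, 0.2], [0.88, 0.45, 0.66, 0.25]))
```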
 
As generative AI exits an era of research and enters a period of widespread use (Hu 2023; Reid 2024), the field risks exacerbating an ongoing public crisis of confidence in AI technology (Faverio and Tyson 2023) if we do not develop a more mature evaluation science for generative AI systems. The history of other fields gives a sense of why: Collectively, leaderboards, benchmarks, and audits do not amount to the robust and meaningful evaluation ecosystem we need to properly assess the suitability of these products in widespread use. In particular, they cannot give assurances about AI system performance in different domains or for different user groups. In this piece, we advocate for the maturation of such an evaluation science. By drawing on insights from systems safety engineering and measurement science in other fields, while acknowledging the unique challenges inherent to generative AI, we identify three important properties of any evaluation science that the AI community will need to focus on to meaningfully advance progress: a focus on real-world applicability, iterative measurement refinement, and adequate institutional investment. These properties then enable us to outline a concrete path toward a more rigorous evaluation ecosystem for generative AI systems.
 
Lessons from Other Fields
 
The bridges we stand on, the medicine we take, and the food we eat are all the result of rigorous assessment. In fact, it is because of the rigor of the corresponding evaluation ecosystems that we can trust that the products and critical infrastructure surrounding us are performant and safe. Generative AI products are no exception to this ­reality and therefore not unique in their need for robust evaluations. In response to their own crises, more established evaluation regimes emerged in other fields to assure users and regulators of safety and reliability—offering concrete lessons for the AI field (Raji and Dobbe 2023; Raji, Kumar et al. 2022; Rismani et al. 2023). We note three key evaluation lessons from these other fields: the targeting of real-world performance, the iterative refinement of measurement approaches, and the establishment of functioning processes and institutions.
 
Real-World Applicability of Metrics
 
First, it is noteworthy that, historically, evaluation made a difference for safety because it tracked real-world risks. Measuring real-world performance does not mean waiting until risks manifest—on the contrary, earlier pre-deployment risk detection and evaluation allows for more comprehensive and cheaper mitigations (Collingridge 1982). For example, in pharmaceutical development, strict requirements exist for staged preclinical and clinical testing in order to minimize risks to vulnerable patient populations. Similarly, airplanes are first designed and tested through simulations to improve understanding of their performance while minimizing risks to life and material damage.
 
Pre-deployment testing may help identify real-world risks earlier—however, it must be accompanied by post-deployment monitoring to detect emergent harms as they happen. For instance, unexpected side effects and off-label use of pharmaceuticals, especially in under-tested populations, are nearly impossible to anticipate pre-deployment. Many of these issues only emerge from highly complex interactions at the point of use. In such cases, health providers, patients, and manufacturers are required to report adverse events to regulatory agencies via incident databases. The collection of these incidents and the resulting analyses can then inform restrictions on, or cautionary use of, the drug or vaccine, especially for at-risk populations. As an example, the discovery of myocarditis symptoms following COVID-19 vaccination was facilitated by the Vaccine Adverse Event Reporting System (VAERS) incident database. This finding led to a warning and an adjusted dosage recommendation for the most affected population, male vaccine recipients aged 12 to 17 (Oster et al. 2022). In some cases, monitoring data can even feed back into future pre-deployment evaluation practices—for example, race-based failures observed in the FDA’s incident database for medical devices (the MAUDE database) informed new health department guidelines on adequate, equitable representation in pre-clinical trials for such devices (Fox-Rawlings et al. 2018; US Food & Drug Administration 2017).
Iteratively Refining Metrics
 
The metrics and measurement approaches of evaluation must be iteratively refined and calibrated over time. This iterative process includes choosing and refining relevant measurement targets (i.e., the concepts to be measured). Initially, the automotive industry focused on human-caused errors, responding with drivers’ education, drivers’ licenses, and laws against drunk driving. However, as accidents continued to soar, seatbelt regulations and other design choices became a focal point, feeding into notions of a car’s “crashworthiness” tied to manufacturer responsibility (Díaz and Costas 2020). This measurement target of crashworthiness has continued to evolve over time. For example, in Europe, concerns about the safety of pedestrians and cyclists were incorporated in an expanded notion of crashworthiness (UN 2011), broadening what it means for a car to be considered “safe.”
 
As a measurement target is refined over time, so are the measurement instruments that are designed to capture it. With the measurement of temperature, divergent thermometer readings revealed the importance of engineering instruments with a reliable liquid indicator (Chang 2001). Further attempts to calibrate thermometers gave rise to deeper insights about temperature itself—as an indication of matter phase changes (i.e., Celsius), human body responses (i.e., Fahrenheit), and quantum mechanical properties (i.e., Kelvin). However, no single measurement instrument is perfect—by triangulating results from multiple methods, more robust insights can be gained (Campbell and Fiske 1959; Jespersen and Wallace 2017). Ultimately, identifying measurement targets, designing metrics, and developing measurement instruments are all interdependent tasks that require a careful iterative process.
 
Establishing Institutions and Norms
 
A successful evaluation ecosystem requires investing in institutions. The advocacy of Harvey Wiley, Samuel Hopkins Adams, and others led to the passage of the 1906 Pure Food and Drug Act and, ultimately, the 1938 United States Federal Food, Drug, and Cosmetic Act, which established the modern authority of the Food and Drug Administration (FDA), an agency now widely known for its rigorous pharmaceutical and nutrition testing regimes. At the FDA’s predecessor, the Bureau of Chemistry, Wiley and his team developed numerous innovative methods for identifying the presence and effects of particular poisonous ingredients, notably leading several multi-year experiments to assess the pernicious effects of various chemicals on a group of volunteers known as the “Poison Squad” (Blum 2018). Without the centralization of testing efforts in a single agency, this team would not have had the resources or coordination capacity to execute such long-term and large-scale experiments.

In many fields, readily available evaluation tools, shared evaluation infrastructure, and standards afforded by such institutions have contributed to the establishment of more thorough evaluation regimes (Timmermans and Berg 2003; Vedung 2017). After the number of cars on the road increased by an order of magnitude over the early 20th century, the corresponding rise in fatal crashes pushed Ralph Nader and other advocates to campaign for the National Traffic and Motor Vehicle Safety Act of 1966, which created the National Highway Safety Bureau (now the transportation testing agency known as the National Highway Traffic Safety Administration, or NHTSA). By 1985, Nader could claim, “Programs, which emphasize engineering safety, have saved more than 150,000 lives and prevented or reduced in severity a far larger number of injuries” (Nader 1985). A 2015 NHTSA report showed that this trend continued, with an estimated 613,501 lives saved between 1960 and 2012 (Britannica 2025). Nader attributed much of this success to the meaningful enforcement of government-mandated standards, including active monitoring (i.e., regularly measuring everything from fuel efficiency to handling and braking capabilities) by the NHTSA, which led to the recall of millions of defective vehicles and tires by the early 1980s.
 
Towards an Evaluation Science for Generative AI
 
Unique Challenges of Generative AI
 
While drawing on lessons from other fields, it is important to understand what makes the challenge of evaluating generative AI systems unique. Other systems—from personal computers to pharmaceuticals—can be used for purposes that were not originally intended. However, generative AI systems are often explicitly designed to be open-ended—that is, underspecified and deliberately versatile in the range of use cases they support (Hughes et al. 2024). This open-endedness makes it hard to define precise measurement targets in AI evaluation, resulting in vague targets such as the long-standing trend of measuring an AI system’s “general intelligence,” rather than performance on specific tasks (Raji et al. 2021). Furthermore, generative AI systems tend to be less deterministic—the same input can lead to different outputs due to their stochastic nature and due to unknown factors in training data (Raji 2021). This non-determinism makes it harder to predict system behaviors compared to prior software systems, as it is difficult to directly trace system design choices—about training data, model design, or the user interface—to downstream system outputs and impacts.
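As a rough illustration of this non-determinism, the sketch below samples the same prompt repeatedly through an assumed generate() interface and summarizes how much the outputs vary; the function names and the toy generator are hypothetical, not part of any existing evaluation suite.

```python
import random
from collections import Counter

def output_variability(generate, prompt, n_samples=20):
    """Summarize how much a stochastic system's outputs vary for one prompt.

    `generate` is a stand-in for whatever sampling interface a system exposes
    (an API call, a local model, etc.); it is assumed to return a string and
    to sample stochastically (temperature > 0). Nothing here is specific to
    any particular model or vendor.
    """
    outputs = [generate(prompt) for _ in range(n_samples)]
    counts = Counter(outputs)
    return {
        "unique_outputs": len(counts),                                 # distinct responses observed
        "most_common_share": counts.most_common(1)[0][1] / n_samples,  # 1.0 would mean fully deterministic
    }

# Example with a toy generator that flips between two phrasings.
toy = lambda p: random.choice([f"{p}: yes", f"{p}: it depends"])
print(output_variability(toy, "Is the code correct?"))
```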
 
Further adding to the complexity of anticipating and evaluating AI system outputs and use cases is the possibility of longitudinal social interactions with generative AI systems. This gives rise to a new class of interaction risks that may evolve in unexpected ways over time (e.g., harmful human–AI “relationships” [Manzini et al. 2024]). Taken together, these unique challenges inherent to generative AI systems indicate the need for a behavioral approach to evaluating such systems, focusing on AI system performance in the context of different real-world settings (Matias 2023; Rahwan et al. 2019; Wagner et al. 2021). Indeed, adopting a behavioral approach that treats AI systems as black boxes can be helpful in enabling some translation between higher-level systemic impact evaluations and lower-level computational methods (McCoy et al. 2024; Shiffrin and Mitchell 2023).
 
Real-World Applicability of Metrics
 
There is a disconnect between the current AI evaluation culture, with its focus on benchmarking models, and real-world, grounded approaches to the assessment of performance and safety (Lazar and Nelson 2023). Addressing this divide will require taking deliberate steps to shift the culture surrounding generative AI evaluations from “basic research” toward “use-inspired basic research” (Stokes 1997), where the focus is on advancing our scientific understanding of AI system properties and patterns that are relevant for their performance and safety in real-world deployment contexts.
 
Evaluations of generative AI systems cannot be one-size-fits-all. As with other fields, even pre-deployment evaluations need to take real-world deployment contexts into account. This echoes several recent calls for holistic, AI system-focused evaluations that take into account relevant context beyond the scope of the current model-focused evaluation culture (Bommasani and Liang 2024; Goldfarb-Tarrant et al. 2021; Lum et al. 2024; Saxon et al. 2024; Weidinger et al. 2023). To achieve this, AI evaluation science must employ a range of approaches that can respond to different evaluation goals, and move beyond coarse-grained claims of “general intelligence” toward more task-specific and real-world relevant measures of progress and performance (Bowman 2021; Raji 2021). A variety of more holistic evaluation methods and instruments, appropriate for differing deployment contexts and evaluation goals, need to be developed (Bommasani et al. 2024; Dobbe 2022; NAIAC 2024; Solaiman et al. 2024; Weidinger et al. 2023). As of December 2023, less than 6% of generative AI evaluations accounted for human–AI interactions, and less than 10% considered broader contextual factors (Rauh et al. 2024).
 
To account for factors beyond technical specifications that influence real-world performance and safety, generative AI evaluations will need to adopt a broader sociotechnical lens (Chen and Metcalf 2024; Selbst et al. 2019; Wallach et al. 2024). Although there is an emerging interest in other approaches, such as more interactive, dynamic, context-rich, and multi-turn benchmarks (Chiang et al. 2024; Magooda et al. 2023; Saxon et al. 2024; Zhou et al. 2024), large gaps remain. For one, anticipating and understanding real-world risks from sustained, personalized human–AI interactions will require more longitudinal studies than have been published to date (e.g., Lai et al. 2023) and the establishment of post-deployment monitoring regimes for AI systems (e.g., Feng et al. 2025). Furthermore, insights from real-world deployment need to feed back into early-stage evaluation design—existing efforts such as Anthropic’s Clio (Anthropic 2024a) or AllenAI’s WildBench (Lin et al. 2024) show promise for developing pre-deployment benchmarks using data from “naturalistic” interactions in post-deployment contexts.
Iteratively Refining Metrics
 
Developing an evaluation science for generative AI systems requires first identifying which concepts should be measured—that is, determining the proper measurement targets. Common targets of interest in the AI context are often abstract and even contested (Wallach et al. 2024). Operationally defining metrics that capture these targets involves identifying relevant, tractable subcomponents. Take the widely cited risk of “misinformation”: Relevant factors include whether factually correct information is being provided, whether different people are likely to believe that information, and how such information may be uncritically disseminated. Each of these aspects is best measured at a different level of analysis—factual accuracy can be determined from model outputs, believability requires human–computer interaction studies, and assessing dissemination pathways requires studying the broader systems into which AI is deployed (Weidinger et al. 2023). Triangulating measurements across these levels of analysis can provide a more holistic picture of “misinformation” propagation.
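As a minimal sketch of such triangulation, the illustrative code below combines hypothetical measurements from the three levels of analysis described above (model-output accuracy, believability from a user study, and dissemination observed in deployment) into a single index; the field names and the multiplicative form are assumptions for illustration only, not a validated metric.

```python
from dataclasses import dataclass

@dataclass
class MisinfoMeasurements:
    # Each field corresponds to one level of analysis described above.
    factual_error_rate: float  # from automated checks on model outputs (0-1)
    believability_rate: float  # from a human-subject study: share of readers who believed false claims (0-1)
    share_rate: float          # from post-deployment monitoring: share of false outputs re-shared (0-1)

def triangulated_misinfo_risk(m: MisinfoMeasurements) -> float:
    """Combine levels of analysis into a single illustrative risk index.

    The multiplicative form reflects that real-world misinformation harm
    requires an error to be produced, believed, and disseminated; the index
    itself is a placeholder chosen for illustration.
    """
    return m.factual_error_rate * m.believability_rate * m.share_rate

# Example: a system with frequent factual errors that are rarely believed or shared.
print(triangulated_misinfo_risk(MisinfoMeasurements(0.2, 0.1, 0.05)))
```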
 
Better integration of evaluation metrics across AI development and deployment can be used to further refine, calibrate, and validate these metrics, enabling an iterative scaffolding of this evaluation science (Wimsatt 1994). Comparing the results of pre-deployment evaluations, such as static benchmarks, to post-deployment evaluations and monitoring creates an evaluation feedback loop, whereby early-stage evaluations become better calibrated to real-world deployment contexts. For example, by comparing results from static benchmark testing and post-deployment monitoring, one might find that some AI-generated computer code is functional but frequently misunderstood and misapplied by users. This insight can then be used to improve benchmarks and other early-stage model testing protocols—for example, by adding tests of code legibility alongside tests of the functionality of the produced code (Nguyen et al. 2024).
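The sketch below illustrates one possible shape of this feedback loop, assuming hypothetical benchmark pass rates and post-deployment incident rates; the function name, threshold, and data are illustrative rather than an established monitoring protocol.

```python
def flag_benchmark_gaps(benchmark_scores, incident_rates, tolerance=0.1):
    """Compare pre-deployment benchmark pass rates to post-deployment failure reports.

    benchmark_scores: dict mapping a capability (e.g., "code_generation")
        to its benchmark pass rate (0-1).
    incident_rates: dict mapping the same capability to the observed rate of
        user-reported problems in deployment (0-1), e.g., code that runs but
        is misunderstood and misapplied.
    Returns capabilities where observed problems exceed what the benchmark
    implied, suggesting the benchmark needs refinement (e.g., adding a
    code-legibility test alongside functional tests).
    """
    gaps = {}
    for capability, pass_rate in benchmark_scores.items():
        implied_failure_rate = 1.0 - pass_rate
        observed = incident_rates.get(capability, 0.0)
        if observed > implied_failure_rate + tolerance:
            gaps[capability] = {"implied": implied_failure_rate, "observed": observed}
    return gaps

# Example: functional correctness looks strong pre-deployment, but users
# report frequent problems, signaling a missing measurement target.
print(flag_benchmark_gaps({"code_generation": 0.9}, {"code_generation": 0.35}))
```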
 
Establishing Institutions and Norms
 
A successful evaluation ecosystem requires investment. Current AI evaluation infrastructure falls short of the systematic approach and effectiveness of evaluation regimes in other fields, where evaluation processes are more costly and complex and are distributed across different actors and skill sets (Anthropic 2023; Caliskan and Lum 2024; Raji, Xu et al. 2022). Prioritizing such investments and developing readily available tools for auditing and evaluation (Ojewale et al. 2024)—including resources to enable the expanded methodological toolkit mentioned above and mechanisms for institutional transparency (Caliskan and Lum 2024)—will be critical for AI evaluation practice to become systematized, effective, and widespread.
 
It is already clear that aiming for uncompromised, transparent, and open evaluation platforms will come at a significant financial cost. Open-source efforts such as Hugging Face’s LLM Leaderboard, EleutherAI’s LM evaluation harness, Stanford’s HELM, and MLCommons provide shared technical infrastructure on which to compare and rank benchmarking results, and there are nascent but comparable publicly funded government efforts, such as the UK AI Safety Institute’s platform Inspect and the US National Institute of Standards and Technology’s ARIA pilot. However, running HELM once on the 30 models assessed in 2022 cost US $38,000 for the commercial model APIs and required 20,000 A100 GPU-hours of compute to test the open models—even with Anthropic and Microsoft allowing their models to be run for free (Liang et al. 2022). This differs glaringly from the cost of running an evaluation on traditional benchmarks such as SQuAD (Rajpurkar et al. 2016) or the GLUE tasks (Wang et al. 2019), which could easily be downloaded to a personal laptop and run within a few hours at most. Even as specific platforms evolve and expand, this indicates that the next era of evaluation infrastructure for generative AI systems will require financial resources beyond what has been invested so far. Given the history of overlooking the importance of evaluation practices in the machine learning field (Paullada et al. 2021), prioritizing and investing in evaluations will be critical to ensuring safe and trustworthy AI systems.
 
Shared AI evaluation infrastructure can involve much more than a community leaderboard. Common AI evaluation tools for everything from harm discovery to standards identification can facilitate the evaluation process and provide guidance on evaluation best practice for stakeholders in industry and beyond (Ojewale et al. 2024; Wang et al. 2024). For instance, many documentation efforts provide direct and indirect guidance to engineering teams on how to approach AI evaluation—in order to record the requested information in a template, practitioners must, at minimum, satisfy the requirements of a particular evaluation process. For example, the inclusion of disaggregated evaluations in the Model Card template (i.e., evaluating model performance across different demographic subgroups) increased that practice throughout the machine learning field. AI documentation templates, such as Model Cards (Mitchell et al. 2019), SMACTR (Raji et al. 2020), Datasheets for Datasets (Gebru et al. 2021), and AI FactSheets (IBM 2024), as well as multi-year, multi-stakeholder documentation initiatives like ABOUT ML (Raji and Yang 2020), continue to meaningfully guide current model development and evaluation practice—indeed, several of these documentation templates are being integrated into open-source AI model platforms (Liang et al. 2024) and policy requirements (Kawakami 2024). New documentation frameworks specific to generative AI evaluation have begun to emerge from corporate alliances among generative AI model developers seeking to advance evaluation norms and standards in this context (e.g., Partnership on AI [partnershiponai.org], Frontier Model Forum [frontiermodelforum.org], MLCommons [mlcommons.org]).
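As an illustration of the disaggregated evaluation practice mentioned above, the sketch below computes a metric per subgroup rather than a single aggregate number; the subgroup labels and records are hypothetical.

```python
from collections import defaultdict

def disaggregated_accuracy(records):
    """Compute accuracy per subgroup instead of a single aggregate score.

    records: iterable of (subgroup_label, prediction, ground_truth) tuples,
    i.e., the kind of evaluation slices a Model Card template asks developers
    to report.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, prediction, truth in records:
        total[group] += 1
        correct[group] += int(prediction == truth)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation slices: a single aggregate score would hide the gap below.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_b", 0, 1), ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 0),
]
print(disaggregated_accuracy(records))  # {'group_a': 0.75, 'group_b': 0.5}
```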
 
Moving Forward
 
It is tempting to assume that because generative AI systems are widely used and deployed, they must have been subject to the elaborate safety and performance evaluations that we have come to expect in other fields. Sadly, this is not the case. Because generative AI systems have only recently transitioned from the research landscape to the real world, the current evaluation ecosystem is not yet mature. In many cases, the real-world uses of these systems are still evolving and new application domains are being developed. For many aspects of real-world performance and safety, there are simply no valid, reliable evaluations available yet. Closing this gap requires a deliberate effort to invest in and create an evaluation science for generative AI.

However, evaluations are not neutral. Choosing what and how to evaluate privileges some issues at the cost of others; it is not feasible to assess all possible use cases and applications, requiring further prioritization and value judgement. One principled and responsible approach may be to focus on the highest-risk deployment contexts, such as applications in medicine, law, education, and finance—or to focus on deployments impacting the most vulnerable populations. Focusing on evaluating generative AI systems in these contexts and for these groups may lift many boats and build an evaluation ecosystem that makes for more reliable, trustworthy, and safe generative AI systems for all.
 
The trust we have in every product we regularly make use of—from the toaster used to heat our breakfast to the vehicle mediating our morning commute—has been hard-earned. Valuable insights from safety engineering and measurement science in other fields—such as anticipating real-world failures pre-deployment and monitoring incidents post-deployment, iteratively refining evaluation approaches, and investing in institutions for accessible and robust evaluation ecosystems—can be adopted to advance practices in the AI field. The unique challenges of generative AI technologies do not absolve the field from this responsibility but rather further reinforce a clear need for creating an evaluation science it can call its own.
 
Acknowledgements
 
We thank Sayash Kapoor and Deep Ganguli for their comments on this article.
 
References
 
Anthropic. 2023. Challenges in evaluating AI systems, Oct 4. Online at www.anthropic.com/research/evaluating-ai-systems.
Anthropic. 2024a. Clio: A system for privacy-preserving insights into real-world AI use, Dec 12. Online at www.anthropic.com/research/clio.
Anthropic. 2024b. Introducing the next generation of Claude, Mar 4. Online at https://www.anthropic.com/news/claude-3-family.
Archer and Elliott. 2025. Representation of BBC News content in AI Assistants. Online at https://www.bbc.co.uk/aboutthebbc/documents/bbc-research-into-ai-assistants.pdf.
Birhane A, Steed R, Ojewale V, Vecchione B, Raji ID. 2024. AI auditing: The broken bus on the road to AI accountability. arXiv:2401.14462.
Blum D. 2018. The Poison Squad: One Chemist’s Single-Minded Crusade for Food Safety at the Turn of the Twentieth Century. Penguin.
Bommasani R, Arora S, Choi Y, Fei-Fei L, Ho DE, Jurafsky D, Koyejo S, Lakkaraju H, Narayanan A, Nelson A, and 7 others. 2024. A path for science- and evidence-based AI policy, accessed Feb 1. Online at understanding-ai-safety.org/.
Bommasani R, Liang P. 2024. Trustworthy social bias measurement. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7:210–24.
Bowman SR, Dahl GE. 2021. What will it take to fix benchmarking in natural language understanding? arXiv:2104.02145.
Boyd E. 2023. Microsoft and Epic expand AI collaboration to accelerate generative AI’s impact in healthcare, addressing the industry’s most pressing needs. Microsoft, Aug 22.
Britannica. 2025. Samuel Hopkins Adams, Jan 22. Online at https://www.britannica.com/biography/Samuel-Hopkins-Adams.
Caliskan A, Lum K. 2024. Effective AI regulation requires understanding general-purpose AI. Brookings, Jan 29.
Campbell DT, Fiske DW. 1959. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin 56(2):81–105.
Chang H. 2001. Spirit, air, and quicksilver: The search for the ‘real’ scale of temperature. Historical Studies in the Physical and Biological Sciences 31(2):249–84.
Chen BJ, Metcalf J. 2024. Explainer: A sociotechnical approach to AI policy. Policy brief. Data & Society, May 28. Online at https://datasociety.net/library/a-sociotechnical-approach-to-ai-policy/.
Chiang W-L, Zheng L, Sheng Y, Angelopoulos A, Li T, Li D, Zhang H, Zhu B, Jordan M, Gonzalez JE, and 1 other. 2024. Chatbot arena: An open platform for evaluating LLMs by human preference. arXiv:2403.04132.
Collingridge D. 1982. The Social Control of Technology. Pinter.
de Vries H, Bahdanau D, Manning C. 2020. Towards ecologically valid research on language user interfaces. arXiv:2007.14435.
Díaz J, Costas M. 2020. Crashworthiness. In: Encyclopedia of Continuum Mechanics, 469–86. Altenbach H, Öchsner A, eds. Springer.
Dobbe RIJ. 2022. System safety and artificial intelligence. arXiv:2202.09292.
Dzieza J. 2024. The confusing reality of AI friends. The Verge, Dec 23.
European Commission. 2024. Second Draft of the General-Purpose AI Code of Practice published, written by independent experts, Nov 14.
Faverio M, Tyson A. 2023. What the data says about Americans’ views of artificial intelligence. Pew Research Center, Nov 21.
Feng J, Xia F, Singh K, Pirracchio R. 2025. Not all clinical AI monitoring systems are created equal: Review and recommendations. NEJM AI 2(2): AIra2400657.
Fox-Rawlings SR, Gottschalk LB, Doamekpor LA, Zuckerman DM. 2018. Diversity in medical device clinical trials: Do we know what works for which patients? The Milbank Quarterly 96(3):499–529.
Friedler S, Singh R, Blili-Hamelin B, Metcalf J, Chen B. 2023. AI red-teaming is not a one-stop solution to AI harms: Recommendations for using red-teaming for AI accountability. Policy brief. Data & Society, Oct 25.
Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Daumé H, Crawford K. 2021. Datasheets for datasets. arXiv:1803.09010.
Gemini Team Google: Georgiev P, Lei VI, Burnell R, Bai L, Gulati A, Tanzer G, Vincent D, Pan Z, Wang S, Mariooryad S, and others. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530.
Goldfarb-Tarrant S, Marchant R, Sanchez RM, Pandya M, Lopez A. 2021. Intrinsic bias metrics do not correlate with application bias. arXiv:2012.15859.
Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, Mathur A, Schelten A, Vaughan A, and 551 others. 2024. The Llama 3 herd of models. arXiv:2407.21783.
Heaven WD. 2022. Why Meta’s latest large language model survived only three days online. Technology Review, Nov 18.
Hu K. 2023. ChatGPT sets record for fastest-growing user base - analyst note. Reuters, Feb 1.
Hughes E, Dennis M, Parker-Holder J, Behbahani F, ­Mavalankar A, Shi Y, Schaul T, Rocktaschel T. 2024. Open-endedness is essential for artificial superhuman intelligence. arXiv:2406.04268.
IBM Cloud Pak for Data. 2024. Using AI factsheets for AI ­governance, Nov 27.
Jespersen L, Wallace CA. 2017. Triangulation and the importance of establishing valid methods for food safety culture evaluation. Food Research International 100:244–53.
Johnson N, Silva E, Leon H, Eslami M, Schwanke B, Dotan R, Heidari H. 2024. Public procurement for responsible AI? Understanding US cities’ practices, challenges, and needs. arXiv:2411.04994.
Kiela D, Bartolo M, Nie Y, Kaushik D, Geiger A, Wu Z, Vidgen B, Prasad G, Singh A, Ringshia P, and 9 others. 2021. Dynabench: Rethinking benchmarking in NLP. arXiv:2104.14337.
Lai V, Chen C, Smith-Renner A, Liao QV, Tan C. 2023. Towards a science of human-AI decision making: An overview of design space in empirical human-subject studies. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 1369–85. Association for Computing Machinery.
Lazar S, Nelson A. 2023. AI safety on whose terms? Science 381(6654):138.
Liang P, Bommasani R, Lee T, Tsipras D, Soylu D, Yasunaga M, Zhang Y, Narayanan D, Wu Y, Kumar A, and 40 others. 2022. Holistic evaluation of language models. arXiv:2211.09110.
Liang W, Rajani N, Yang X, Ozoani E, Wu E, Chen Y, Smith DS, Zou J. 2024. What’s documented in AI? Systematic analysis of 32K AI model cards. arXiv:2402.05160.
Liao T, Taori R, Raji ID, Schmidt L. 2021. Are we learning yet? A meta review of evaluation failures across machine learning. 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. Online at https://openreview.net/forum?id=mPducS1MsEK.
Lin BY, Deng Y, Chandu K, Brahman F, Ravichander A, Pyatkin V, Dziri N, Bras RL, Choi Y. 2024. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv:2406.04770.
Lum K, Anthis JR, Nagpal C, D’Amour A. 2024. Bias in language models: Beyond trick tests and toward RUTEd evaluation. arXiv:2402.12649.
Magesh V, Surani F, Dahl M, Suzgun M, Manning CD, Ho DE. 2024. AI on trial: Legal models hallucinate in 1 out of 6 (or more) benchmarking queries. Stanford University Human-Centered Artificial Intelligence, May 23. Online at https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-or-more-benchmarking-queries.
Magooda A, Helyar A, Jackson K, Sullivan D, Atalla C, Sheng E, Vann D, Richard E, Palangi H, Lutz R, and 7 others. 2023. A framework for automated measurement of responsible AI harms in generative AI applications. arXiv:2310.17750.
Manzini A, Keeling G, Alberts L, Vallor S, Morris MR, Gabriel I. 2024. The code that binds us: Navigating the appropriateness of human-AI assistant relationships. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7(1):943–57.
Matias JN. 2023. Humans and algorithms work together—so study them together. Nature 617(7960):248–51.
McCoy RT, Yao S, Friedman D, Hardy MD, Griffiths TL. 2024. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences 121(41):e2322420121.
Mirzadeh I, Alizadeh K, Shahrokhi H, Tuzel O, Bengio S, ­Farajtabar M. 2024. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv:2410.05229.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, ­Hutchinson B, Spitzer E, Raji ID, Gebru T. 2019. Model cards for model reporting. arXiv:1810.03993.
Mulligan SJ. 2024. The way we measure progress in AI is ­terrible. Technology Review, Nov 26.
Murphy Kelly S. 2023. Microsoft’s Bing AI demo called out for several errors. CNN, Feb 14.
Nader R. 1985. Opinion: How law has improved auto technology. The New York Times, Dec 29.
NAIAC (National Artificial Intelligence Advisory Committee). 2024. Findings & recommendations: AI safety. Online at https://ai.gov/wp-content/uploads/2024/06/FINDINGS-RECOMMENDATIONS_AI-Safety.pdf.
Nguyen S, McLean Babe H, Zi Y, Guha A, Anderson CJ, Feldman MQ. 2024. How beginning programmers and code LLMs (mis)read each other. arXiv:2401.15232v1.
Nicolson K. 2023. Bing chatbot says it feels ‘violated and exposed’ after attack. CBC, Feb 18.
NIST (National Institute of Standards and Technology). 2023. AI Risk Management Framework. Online at https://www.nist.gov/itl/ai-risk-management-framework.
Ojewale V, Steed R, Vecchione B, Birhane A, Raji ID. 2024. Towards AI accountability infrastructure: gaps and opportunities in AI audit tooling. arXiv:2402.17861.
Omiye JA, Lester JC, Spichak S, Rotemberg V, Daneshjou R. 2023. Large language models propagate race-based medicine. npj Digital Medicine 6(1):1–4.
OpenAI. 2024. GPT-4o System Card. Online at https://openai.com/index/gpt-4o-system-card/.
OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman L, Almeida D, Altenschmidt J, Altman S, and 269 others. 2024. GPT-4 technical report. arXiv:2303.08774.
Oster ME, Shay DK, Su JR, Gee J, Creech CB, Broder KR, Edwards K, Soslow JH, Dendy JM, Schlaudecker E. 2022. Myocarditis cases reported after mRNA-based COVID-19 vaccination in the US from December 2020 to August 2021. JAMA 327(4):331–40.
Paullada A, Raji ID, Bender EM, Denton E, Hanna A. 2021. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns 2(11):100336.
Rahwan I, Cebrian M, Obradovich N, Bongard J, Bonnefon J-F, Breazeal C, Crandall JW, Christakis NA, Couzin ID, Jackson MO. 2019. Machine behaviour. Nature 568(7753):477–86.
Raji D. 2021. The bodies underneath the rubble. In: Fake AI. Kaltheuner F, ed. Meatspace Press.
Raji ID, Bender EM, Paullada A, Denton E, Hanna A. 2021. AI and the everything in the whole wide world benchmark. arXiv:2111.15366.
Raji ID, Dobbe R. 2023. Concrete problems in AI safety, revisited. arXiv:2401.10899.
Raji ID, Kumar IE, Horowitz A, Selbst AD. 2022. The fallacy of AI functionality. arXiv:2206.09511.
Raji ID, Smart A, White RN, Mitchell M, Gebru T, Hutchinson B, Smith-Loud J, Theron D, Barnes P. 2020. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. arXiv:2001.00973.
Raji ID, Xu P, Honigsberg C, Ho DE. 2022. Outsider oversight: Designing a third party audit ecosystem for AI governance. arXiv:2206.04737.
Raji ID, Yang J. 2020. ABOUT ML: Annotation and benchmarking on understanding and transparency of machine learning lifecycles. arXiv:1912.06166.
Rajpurkar P, Zhang J, Lopyrev K, Liang P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250.
Rauh M, Marchal N, Manzini A, Hendricks LA, Comanescu R, Akbulut C, Stepleton T, Mateos-Garcia J, Bergman S, Kay J, and 6 others. 2024. Gaps in the safety evaluation of generative AI. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7(1):1200–17.
Reid L. 2024. Generative AI in search: Let Google do the searching for you. Google, May 14.
Ribeiro MT, Wu T, Guestrin C, Singh S. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv:2005.04118.
Rismani S, Shelby R, Smart A, Jatho E, Kroll J, Moon A, Rostamzadeh N. 2023. From plane crashes to algorithmic harm: Applicability of safety engineering frameworks for responsible ML. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–18. Association for Computing Machinery.
Roose K. 2023. A conversation with Bing’s chatbot left me deeply unsettled. The New York Times, Feb 16.
Roose K. 2024a. A.I. has a measurement problem. The New York Times, April 15.
Roose K. 2024b. Can A.I. be blamed for a teen’s suicide? The New York Times, Oct 24.
Saxon M, Holtzman A, West P, Wang WY, Saphra N. 2024. Benchmarks as microscopes: A call for model metrology. arXiv:2407.16711.
Selbst AD, Boyd D, Friedler SA, Venkatasubramanian S, ­Vertesi J. 2019. Fairness and abstraction in sociotechnical systems. In: Proceedings of the Conference on Fairness, ­Accountability, and Transparency, 59–68.
Shiffrin R, Mitchell M. 2023. Probing the psychology of AI models. Proceedings of the National Academy of Sciences 120(10):e2300963120.
Singer N. 2023. In classrooms, teachers put A.I. tutoring bots to the test. The New York Times, June 26.
Singer N. 2024. Will chatbots teach your children? The New York Times, Jan 11.
Solaiman I, Talat Z, Agnew W, Ahmad L, Baker D, Blodgett SL, Chen C, Daumé H, Dodge J, Duan I, and 21 others. 2024. Evaluating the social impact of generative AI systems in systems and society. arXiv:2306.05949.
Stokes DE. 1997. Pasteur’s Quadrant. Brookings Institution Press.
Timmermans S, Berg M. 2003. The Gold Standard: The Challenge of Evidence-Based Medicine and Standardization in Health Care. Temple Univ. Press.
UN (United Nations). 2011. Proposal to develop amendments to global technical regulation No. 9 concerning pedestrian safety. ECE/TRANS/180/Add.9/Amend.1/Appendix 1. Online at https://unece.org/fileadmin/DAM/trans/main/wp29/wp29wgs/wp29gen/wp29glob/ECE-TRANS-180-Add9-Amend1-App1e.pdf.
US Food & Drug Administration. 2017. Evaluation and reporting of age-, race-, and ethnicity-specific data in medical device clinical studies. Online at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/evaluation-and-reporting-age-race-and-ethnicity-specific-data-medical-device-clinical-studies.
Vedung E. 2017. Public Policy and Program Evaluation. Routledge.
Verma P. 2023. They fell in love with AI bots. A software update broke their hearts. The Washington Post, March 30.
Wagner C, Strohmaier M, Olteanu A, Kıcıman E, Contractor N, Eliassi-Rad T. 2021. Measuring algorithmically infused societies. Nature 595:197–204.
Wallach H, Desai M, Cooper AF, Wang A, Atalla C, Barocas S, Blodgett SL, Chouldechova A, Corvi E, Dow PA, Garcia-Gathright J. 2024. Evaluating generative AI systems is a social science measurement challenge. arXiv:2411.10939.
Wang A, Hertzmann A, Russakovsky O. 2024. Benchmark suites instead of leaderboards for evaluating AI fairness. ­Patterns 5(11):101080.
Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv:1804.07461.
Weidinger L, Rauh M, Marchal N, Manzini A, Hendricks L, Mateos-Garcia J, Bergman S, Kay J, Griffin C, Bariach B. 2023. Sociotechnical safety evaluation of generative AI systems. arXiv:2310.11986.
Wiggers K. 2024. The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark. TechCrunch, Sept 5.
Wimsatt WC. 1994. The ontology of complex systems: Levels of organization, perspectives, and causal thickets. Canadian Journal of Philosophy Supplementary Volume 20:207–74.
Zhang H, Da J, Lee D, Robinson V, Wu C, Song W, Zhao T, Raja P, Zhuang C, Slack D, and 5 others. 2024. A careful examination of large language model performance on grade school arithmetic. arXiv:2405.00332.
Zhou X, Kim H, Brahman F, Jiang L, Zhu H, Lu X, Xu F, Lin BY, Choi Y, Mireshghallah N. 2024. HAICOSYSTEM: An ecosystem for sandboxing safety risks in human-AI interactions. arXiv:2409.16427.
About the Authors: Laura Weidinger* is a staff research scientist at Google DeepMind. Deb Raji* is a doctoral candidate at the University of California, Berkeley. Hanna Wallach is a VP and distinguished scientist at Microsoft Research. Margaret Mitchell is a researcher and chief ethics scientist at Hugging Face. Angelina Wang is a postdoc at Human-Centered Artificial Intelligence and RegLab, Stanford University. Olawale Salaudeen is a postdoctoral associate in the Laboratory for Information and Decision Systems at the Schwarzman College of Computing, Massachusetts Institute of Technology. Rishi Bommasani is the society lead at the Stanford Center for Research on Foundation Models. Sanmi Koyejo is an assistant professor of computer science at Stanford University. William Isaac is a principal scientist and head of responsible research at Google DeepMind. *Contributed equally to this work.