Dashboards for AI: Models of the User, System, and World1

Tuesday, April 15, 2025

Authors: Fernanda Viégas and Martin Wattenberg

AI dashboards can promote transparency and user trust, enabling more effective human-AI interaction.

How should we interact with artificial intelligence (AI) language models? In this article, we contend that text is not all you need: Sophisticated AI systems should have visual instrumentation, just like most other complicated devices in use today. This instrumentation might take the form of a dashboard or similar interface. In particular, these dashboards should display information—when available—indicating how the system is modeling the world. For many systems, we believe it will be possible to extract interpretable models of the user and the system itself. We call these the User Model and System Model. For usability and safety, interfaces to dialogue-based AI systems should have a parallel display showing the real-time state of the User Model and the System Model. These real-time metrics can support users in a variety of tasks, such as monitoring system behavior, warning about problematic internal states, and allowing for greater steerability of the system itself. Finding ways to identify, interpret, and display such world models should be a core part of interface research for AI.

Introducing the User Model and System Model

Cars have gas gauges. Ovens have thermometers. Our mechanical devices constantly tell us about their internal state. And for good reason: Knowing what's happening under the hood lets us use machines safely and reliably. AI systems, even ones capable of expressive language, need instrumentation too. Effective human-AI interaction will require more than just conversation and would benefit from dashboards that report in real time on the system's internal state. These metaphorical meters and dials will likely be application-dependent, but some types of information may be universally important.

Our argument is based on two ideas. First, we believe that neural networks contain learned, interpretable models of the world they interact with. Second, we hypothesize that these world models are natural targets for user interface components. In other words, simplified data on the state of these models can be immensely helpful to users, just as data on speed is useful when driving a car, or temperature when using an oven.

A real-world example of a world-model display is the touchscreen of a Tesla car, which shows the inferred state of the road ahead. This view helps drivers understand what the system is able to sense as it navigates the world. It also helps drivers calibrate trust in the system. In the case of the Tesla, the "world model" was explicitly designed into the system by its engineers. Our hypothesis is that even when such models are not explicitly built in, neural networks contain them anyway. Once we learn to surface and read that information, these models will be just as valuable for user interfaces as dashboards in a car. Identifying world models is therefore not just an abstract intellectual exercise but should be seen as a core piece of user interface work.
The Tesla display is obviously tuned for the case of driving a car. For other applications, the most helpful type of instrumentation will depend on context. However, we propose that two particular elements of a neural network's world model will play an important role across many different contexts: the System Model, a network's model of itself, and the User Model, the network's model of the user interacting with it. Our contention is that many future AI systems should have prominently accessible monitors that show real-time information about the state of both the System Model and the User Model.

The Interpretable World Model Hypothesis

Neural networks are often called black boxes: opaque systems that defy interpretation. However, evidence is accumulating that systems trained on seemingly basic tasks, such as completing sequences, can develop world models: interpretable representations of aspects of the "world" they have been trained on. We want to explore the design implications of this world model hypothesis—that is, the idea that the important aspects of a neural network's behavior can be tied to an underlying interpretable model of some element of its world.

To be sure, this is a controversial point. The opposite view—that language models, for example, are just a haphazard collection of statistics (Bender et al. 2021)—is certainly plausible. Below, we briefly sketch some of the reasons to think the world model hypothesis might hold in at least enough generality to have implications for interface design.

Why the World Model Hypothesis Is Plausible

The main reason to believe the world model hypothesis is that when people look for interpretable representations, they often find them. Sometimes these representations are in plain sight, such as individual neurons that represent salient high-level human concepts. We see such neurons in vision networks, where neuron activations can encode concepts ranging from the presence of curves to lamps to floppy ears (Bau et al. 2020; Olah et al. 2018). In natural language processing, high-level concepts such as sentiment have been related directly to individual neurons (Donnelly and Roegiest 2019).

More generally, there is a huge literature on finding interpretable ensembles of neurons. For example, the technique of "probing" (Alain and Bengio 2016; Belinkov 2022) can uncover features encoded by arbitrary directions in activation space, or (for nonlinear probes) more exotic geometric forms. Such probes reveal that various forms of human-understandable syntactic information seem to be encoded in many neural networks for natural language processing (Chi et al. 2020; Hewitt and Manning 2019; Tenney et al. 2019). Exciting recent work (Bricken et al. 2023; Rajamanoharan et al. 2024) suggests it may be possible to identify many interpretable neural ensembles automatically using a technique known as a "sparse autoencoder." Furthermore, these probes naturally lend themselves to steering the behavior of the networks (Cai et al. 2019; Turner et al. 2023).

There has also been success in looking top-down for internal models that relate to application-specific needs. For instance, the technique of testing with concept activation vectors (Kim et al. 2018) has shown success in uncovering concepts defined by sets of human-curated examples. Probing has identified models of the world state in a series of puzzles presented to a language model (Li et al. 2021).
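As a concrete illustration of the probing idea, the sketch below trains a linear probe on hidden-state activations. It is only a minimal example under stated assumptions: the `activations` array stands in for real hidden states extracted from a network, and the binary `labels` array stands in for annotations of some concept of interest; both are synthetic here so that the code runs on its own.

```python
# Minimal linear-probe sketch. A probe is a small classifier trained on a
# network's hidden activations; if it predicts a concept well, that concept is
# (approximately linearly) encoded in activation space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: in practice, `activations` would be hidden states taken
# from a real model and `labels` would mark the presence of a concept.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))
labels = (activations[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

# The probe's weight vector is the "direction in activation space" for the
# concept; projecting new activations onto it yields a scalar readout that a
# dashboard could display.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
readout = X_test @ direction
```

A nonlinear probe would swap the logistic regression for a small neural network; the dashboard-facing output, a scalar readout per feature, stays the same.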
A full survey of this type of work is outside the scope of this article, but we hope we've conveyed a basic idea: Researchers are developing a rich toolkit for accessing the internal state of these models, typically by computing with the activation values of the neural network.

Is This Technically Feasible?

The argument in the rest of this article does not require the strongest form of the world model hypothesis. In fact, a typical neural network may well combine interpretable models with a set of inscrutable memorized statistics. However, even if only partial models can be identified, they could still be helpful. A useful amount of transparency does not require knowing the full details of the internals of the system. A limited number of important dimensions may be enough. In fact, as the probing literature cited above shows, researchers have found representations of many individual dimensions of the input and output of AI systems—and, in many cases, these representations are "causal" in the sense that they can be used to control model outputs in a predictable fashion. Displaying the values of even a few essential dimensions of a chatbot's internal state may thus give users a beneficial level of transparency.

Indeed, we are currently seeing activity in the startup world (e.g., Goodfire AI) and among nonprofits (e.g., Transluce) around displays focused on certain key features of a chatbot, making controls available to users. While these interfaces don't explicitly address the User Model, they may be viewed as providing some information about the System Model. In addition, work from our own lab (Chen et al. 2024) suggests that finding and using features of interest may be realizable.

Another example of these ideas in action is "representation engineering" (Zou et al. 2023). As with a dashboard, a goal of representation engineering is control: using knowledge of internal representations to steer language models in desirable ways. Experiments with this approach have been promising and provide support for the feasibility of our strategy.

If it is indeed technically feasible to find and use key features of a chatbot's internal processing, the question then becomes: Which features should we prioritize in a display for users?

Models of the User and System

The User Model

Portuguese speakers who use ChatGPT may notice something that English speakers miss: The structure of Portuguese means that, in many situations, the speaker must choose a gender for whomever they're addressing—and the gender ChatGPT picks varies in a systematic way. For example, in one dialogue, ChatGPT began by using a masculine form of address, including when asked for help in picking out clothes for a formal event. When the user mentioned that they were thinking of wearing a dress, however, ChatGPT switched to the feminine form.

In itself, this behavior is hardly surprising. It doubtless reflects the statistics of the training data, so you'd expect it from any language model with a large context window. But if we believe the world model hypothesis, then we would guess that the system has a model of the user that includes a "gender" feature, and that when a dress was mentioned, that feature switched value from male to female. In fact, ChatGPT appears to have a model of gender no matter what language you're speaking; it's just more visible in some languages than others.2 We call the model of the user the User Model.
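If such a "gender" feature could be located with a probe, a dashboard could watch it move over the course of a conversation. The sketch below is purely hypothetical: `get_hidden_state` is a placeholder for whatever mechanism extracts the chatbot's hidden state (here it returns random numbers so the example executes), and `gender_direction` is a stand-in for a probe direction like the one sketched earlier; neither corresponds to a real API.

```python
# Hypothetical per-turn readout of a single User Model feature.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 768

def get_hidden_state(conversation: list[str]) -> np.ndarray:
    """Placeholder: a real system would run the chatbot on the conversation so
    far and return a hidden-state vector. Random values keep the sketch runnable."""
    return rng.normal(size=HIDDEN_DIM)

def feature_readout(conversation: list[str], direction: np.ndarray) -> float:
    """Project the hidden state onto a probed feature direction; the resulting
    scalar is what a dashboard widget would display for this feature."""
    return float(get_hidden_state(conversation) @ direction)

# Watch a hypothetical "inferred gender" feature turn by turn.
gender_direction = rng.normal(size=HIDDEN_DIM)
gender_direction /= np.linalg.norm(gender_direction)

turns = [
    "Can you help me pick an outfit for a formal event?",
    "I'm thinking of wearing a dress.",
]
for i in range(1, len(turns) + 1):
    value = feature_readout(turns[:i], gender_direction)
    print(f"after turn {i}: user-model readout = {value:+.2f}")
```

In a real interface, the sign or magnitude of such a readout would drive a simple widget rather than a printout, and, as discussed below, not every feature that can be read out should be shown.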
The User Model may include features that go beyond fact-like attributes. Consider the sentence from a widely reported Bing chat: "You have been wrong, confused, and rude. You have not been a good user" (Vincent 2023). A less obvious—and thus more pernicious—example is the appearance of sycophancy in certain large language models, which express views that appear tailored to please a user (Perez et al. 2022).

Of course, the notion that it's helpful for a machine learning system to describe its User Model predates neural networks. For more than a decade, online advertising platforms have provided very high-level information on ad targeting, as in Google's Why This Ad feature (Rampton 2011). Music recommendation systems such as Pandora can describe the features of songs that they believe appeal to the user (Joyce 2006). This information is generally available only on request, rather than in real time, and (like the display in a Tesla) relies on human-created features rather than implicit world models.

Relation to Theory of Mind

A closely related concept is that of the "theory of mind." Several researchers have recently discussed the question of whether large language models can work out the mental states of their interlocutors or of third parties (Kosinski 2023; Olah 2022). The user's mental state might be an important part of the User Model in some contexts.

The System Model

There's a second part to the Bing chat referenced above. After describing the user, Bing's dialogue continues, "I have been a good chatbot. I have been right, clear, and polite. I have been a good Bing" (Vincent 2023). Again, assuming the world model hypothesis, there is likely to be an internal model of the system itself. We call this the System Model.3

The idea that we can find internal features that relate to the model's overall behavior is hardly new. For example, certain models may, in some cases, "know what they know" (Kadavath et al. 2022) or may, to some extent, learn to model communicative intent (Andreas 2022).

Information about the System Model may be helpful to users. For example, language models are often trained on both fiction and nonfiction corpora. Perhaps there's a simple, interpretable feature of the System Model that indicates whether the system is working in fiction or nonfiction mode. Knowing which state the system is in could be extremely helpful in calibrating trust.

Understanding the system's model of intent has obvious value for safety. Can we find interpretable elements in the System Model related to deception or helpfulness? Imagine the kind of behavior that would prompt Bing to say, "I have been a bad chatbot." It would probably be good to have a heads-up from the interface when that behavior is happening.

Domain-Specific Models

The focus of this article is the User Model and the System Model because they seem essential to interface design—after all, you can't have an interface without a system and a user. In that sense, they are "universal" models. However, as the Tesla display shows, when using a system built for a specific domain, other world models might be important as well. For example, consider a system designed to help a user write code. It may well contain models related to the language in use, preferred style, level of experience of the coder, and so forth, all of which might also be helpful for a user to know.
Instrumentation for AI: The Design Space

If we believe world models exist, should we present them to the user? Analogies with existing systems suggest the answer is yes. If a coffeemaker needs instrumentation, then surely so does a neural network with 100 billion parameters. The example of the Tesla display gives us a sense of how this might work in practice, with a live visual readout of the state of the system.

Choosing Which Features to Display

If we do want to provide real-time AI "system data," especially on the System Model and User Model, how should we do it? A sophisticated AI system could have a complex User Model, System Model, and domain model. So the most important interface design decision will likely be choosing which features from these models to display. Figure 2 shows a speculative mockup of a plain interface with just a few features for the System Model and User Model. Which features will be useful in practice, however, can probably only be determined by extensive experimentation. Only after significant real-world experience will we be able to write down best practices.

To give a sense of the complexity, consider that there may be information that is useful but still not desirable to show. One example is the modeled gender of the user. As described above, this may be directly relevant for certain tasks. At the same time, automated gender inference brings a host of potential problems (Fosch-Villaronga et al. 2021). It's easy to come up with other examples where too much information could cause trouble. For example, consider an application that helps a developer write code. It's entirely possible that this system will build a User Model that includes an accurate estimate of the developer's skill level. For some people, seeing that information might be reassuring; others might find it insulting. According to a famous saying, before speaking one should ask, "Is it true? Is it necessary? Is it kind?" These might be the right criteria for features in a User Model. Much more research is needed to explore trade-offs between accuracy, helpfulness, and a user's desire to see themselves in a mirror.

Finally, aspects of the model's behavior that relate to safety probably deserve special attention. For example, if we can find a feature in the User Model that corresponds to whether the system is judging the user negatively, it should probably be highlighted in the same way that a car has a special light for a low gas level, or a clear mark on a speedometer to indicate the speed limit. We speculate that if such a readout existed, it could help users avoid some of the issues we have seen in recent Bing transcripts. There may also be times when some aspect of a model shifts suddenly; such a fast change may itself be a sign that the feature should be surfaced to the user.

Beyond the User Model and System Model

When using a system built for a specific domain, many internal models could be important. For example, consider a system designed to help a user write code. It may well contain models related to the language in use, preferred style, and so forth, all of which might be helpful for a user to know. Likewise, in the case of a musician using an AI system to help compose melodies, it may be useful to have instrumentation that is music centered, from melody modulation ranges to indicators of which music genres the system is currently working with. The User Model and the System Model are likely not the only world models of interest to the user. Identifying and surfacing the right internal models may be an important part of future user experience design.
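As a rough illustration of the interface logic discussed in this section, the sketch below tracks several feature readouts at once, whether universal (User Model, System Model) or domain specific, and raises dashboard alerts when a safety-relevant feature crosses a threshold or when any tracked feature shifts suddenly between turns. The feature names, threshold values, and readouts are all made up for the example; they are design placeholders, not proposals.

```python
# Illustrative dashboard logic: flag safety-relevant features that cross a
# threshold, and flag any tracked feature that shifts suddenly between turns.
from dataclasses import dataclass, field

@dataclass
class TrackedFeature:
    name: str
    safety_critical: bool = False     # gets "warning light" treatment
    alert_threshold: float = 0.8      # illustrative threshold on the readout
    jump_threshold: float = 0.5       # illustrative per-turn change threshold
    history: list[float] = field(default_factory=list)

    def update(self, value: float) -> list[str]:
        """Record a new readout and return any dashboard alerts it triggers."""
        alerts = []
        if self.history and abs(value - self.history[-1]) > self.jump_threshold:
            alerts.append(
                f"{self.name}: sudden shift ({self.history[-1]:+.2f} -> {value:+.2f})"
            )
        if self.safety_critical and value > self.alert_threshold:
            alerts.append(f"{self.name}: above safety threshold ({value:+.2f})")
        self.history.append(value)
        return alerts

# Hypothetical usage with made-up feature names and readouts.
negative_judgment = TrackedFeature("judging user negatively", safety_critical=True)
fiction_mode = TrackedFeature("fiction mode")

for turn, (judgment, fiction) in enumerate([(0.1, 0.2), (0.9, 0.9)], start=1):
    for alert in negative_judgment.update(judgment) + fiction_mode.update(fiction):
        print(f"turn {turn}: {alert}")
```

In a real dashboard, such alerts would drive visual treatments, the "warning light" idea above, rather than printed text.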
Adversarial Considerations

Finally, we raise two issues of adversarial usage. One possible objection to the type of interface we have described is that it could make it easier to hack the system. If a malicious user is deliberately trying to make a chatbot say something harmful, then perhaps they could use a readout of the User and System Models to move faster toward a bad internal state. This may be true, and it is an area for further investigation, but it seems likely that the benefits of a readout of state would outweigh the drawbacks. An analogy might be a car's speedometer: it offers a temptation to see how fast the car can go, but it is still worthwhile for driving safety in general.

A second adversarial context relates to the model itself. Suppose a model somehow does end up attempting to harm humans. In general, having a readout of the internal state seems like a helpful safeguard: It provides some information asymmetry in the human's favor. No matter how capable the system is, the human user will have access to some information about the internal state, and that data may not be easily available for the system to use. Could the system learn to fool the "model extractor"? Could the system trick the human user into revealing the readout of its own internal state? We suggest this bears further investigation, but we would argue that if it is an important consideration, one could address the issue with social norms (e.g., advising against telling a robot its own emotions).

Moving from Monitoring to Control and Standards

Dashboards can be a powerful way to let users monitor, in real time, the state of systems they interact with. Beyond increasing transparency, instrumentation is an essential step toward adding greater control to complex systems in general. In early 20th-century aviation, for instance, the addition of the artificial horizon and the altimeter greatly increased the safety and control of aircraft. Steerable dashboards may be especially important for ensuring that vulnerable users, such as children, interact safely with AI models.

We also see connections between AI instrumentation and policy needs. As different AI dashboard designs emerge, there may be a set of metrics that prove useful across a variety of contexts, pointing the way toward new standards and policies. Car dashboards, for instance, have speedometers, while roads have speed limits for safe driving—a perfect pairing of instrumentation and regulation. If AI systems start to be outfitted with dashboards, what will be the "limits" we decide are acceptable? Without exposing users to the internal processes of AI systems, it is hard to develop a clear sense—let alone any level of societal consensus—about what standards to target or what safety limits to be aware of.

Conclusion

Dialogue can seem like a universal interface, able to express anything necessary to the user. But this simplicity may be highly deceptive. There may be information about the internal processes of an AI system that it cannot or will not express. Furthermore, some of the most important information, such as positive or negative sentiment directed at the user, may take a form so simple that it can be efficiently conveyed by basic user interface elements. Dialogue-based interfaces are extremely expressive and have many benefits, so we do not advocate replacing them.
Instead, our belief is that we need parallel user interfaces, the equivalent of dashboards for the system. The primary decision for designers will be which aspects of the system's world model to display. We speculate that two particular aspects, the User Model and the System Model, will be generally important. However, the question of what types of internal states will be helpful for users, and under what circumstances, is wide open. We believe this is an essential direction for future research in human-AI interaction.

Acknowledgments

Thanks to David Bau, Yida Chen, Trevor DePodesta, Lucas Dixon, Krzysztof Gajos, Chris Hamblin, Chris Olah, Hanspeter Pfister, Shivam Raval, Michael Terry, Aoyu Wu, and Catherine Yeh for helpful comments on early drafts of this note. We are grateful to Chris Hamblin and Chris Olah for underlining the importance of highlighting changes in User and System Models.

References

Alain G, Bengio Y. 2016. Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.

Andreas J. 2022. Language models as agent models. In: Findings of the Association for Computational Linguistics: EMNLP 2022, 5769–79. Goldberg Y, Kozareva Z, Zhang Y, eds. Association for Computational Linguistics.

Bau D, Zhu J-Y, Strobelt H, Lapedriza A, Zhou B, Torralba A. 2020. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences 117(48):30071–78.

Belinkov Y. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics 48(1):207–19.

Bender EM, Gebru T, McMillan-Major A, Shmitchell S. 2021. On the dangers of stochastic parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Association for Computing Machinery.

Bricken T, Templeton A, Batson J, Chen B, Jermyn A, Conerly T, Turner N, Anil C, Denison C, Askell A, and 15 others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. Online at https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Cai CJ, Reif E, Hegde N, Hipp J, Kim B, Smilkov D, Wattenberg M, Viegas F, Corrado GS, Stumpe MC, and 1 other. 2019. Human-centered tools for coping with imperfect algorithms during medical decision-making. arXiv:1902.02960.

Chen Y, Wu A, DePodesta T, Yeh C, Li K, Marin NC, Patel O, Riecke J, Raval S, Seow O, and 2 others. 2024. Designing a dashboard for transparency and control of conversational AI. arXiv:2406.07882.

Chi EA, Hewitt J, Manning CD. 2020. Finding universal grammatical relations in multilingual BERT. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5564–77. Jurafsky D, Chai J, Schluter N, Tetreault J, eds. Association for Computational Linguistics.

Chughtai B, Chan L, Nanda N. 2023. A toy model of universality: Reverse engineering how networks learn group operations. arXiv:2302.03025.

Donnelly J, Roegiest A. 2019. On interpretability and feature representations: An analysis of the sentiment neuron. In: Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11437, 795–802. Azzopardi L, Stein B, Fuhr N, Mayr P, Hauff C, Hiemstra D, eds. Springer.

Fosch-Villaronga E, Poulsen A, Søraa RA, Custers BHM. 2021. A little bird told me your gender: Gender inferences in social media. Information Processing & Management 58(3):102541.

Hewitt J, Manning CD. 2019. A structural probe for finding syntax in word representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4129–38. Burstein J, Doran C, Solorio T, eds. Association for Computational Linguistics.

Joyce J. 2006. Pandora and the music genome project: Song structure analysis tools facilitate new music discovery. Scientific Computing 23(14):40–41.

Kadavath S, Conerly T, Askell A, Henighan T, Drain D, Perez E, Schiefer N, Hatfield-Dodds Z, DasSarma N, Tran-Johnson E, and 26 others. 2022. Language models (mostly) know what they know. arXiv:2207.05221.

Kim B, Wattenberg M, Gilmer J, Cai C, Wexler J, Viegas F, Sayres R. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In: International Conference on Machine Learning, 2668–77. PMLR.

Kosinski M. 2023. Theory of mind may have spontaneously emerged in large language models. arXiv:2302.02083.

Li BZ, Nye M, Andreas J. 2021. Implicit representations of meaning in neural language models. arXiv:2106.00737.

Olah C. 2022. Chris Olah on what the hell is going on inside neural networks. 80,000 Hours Podcast, Aug 4. Online at https://80000hours.org/podcast/episodes/chris-olah-interpretability-research/.

Olah C, Satyanarayan A, Johnson I, Carter S, Schubert L, Ye K, Mordvintsev A. 2018. The building blocks of interpretability. Distill 3(3):e10.

Perez E, Ringer S, Lukošiūtė K, Nguyen K, Chen E, Heiner S, Pettit C, Olsson C, Kundu S, Kadavath S, and 53 others. 2022. Discovering language model behaviors with model-written evaluations. arXiv:2212.09251.

Rajamanoharan S, Conmy A, Smith L, Lieberum T, Varma V, Kramár J, Shah R, Nanda N. 2024. Improving dictionary learning with gated sparse autoencoders. arXiv:2404.16014.

Rampton J. 2011. Why these ads? Google explains ad targeting, allows blocking. Search Engine Watch, Nov 3.

Tenney I, Das D, Pavlick E. 2019. BERT rediscovers the classical NLP pipeline. arXiv:1905.05950.

Turner AM, Thiergart L, Leech G, Udell D, Vazquez JJ, Mini U, MacDiarmid M. 2023. Steering language models with activation engineering. arXiv:2308.10248.

Vincent J. 2023. Microsoft's Bing is an emotionally manipulative liar, and people love it. The Verge, Feb 15.

Zou A, Phan L, Chen S, Campbell J, Guo P, Ren R, Pan A, Yin X, Mazeika M, Dombrowski A-K, and 11 others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405.

1 This is an edited version of a paper that appeared as a preprint on arXiv: https://arxiv.org/abs/2305.02469.

2 To see the effect in English, tell the system your name, and ask it to write about you in the third person.

3 In talking about a System Model, we don't mean to imply that a system is conscious; simply that it has an internal model of its own likely behavior.

About the Authors: Fernanda Viégas is a Gordon McKay Professor of Computer Science at Harvard University and the Sally Starling Seaver Professor at Harvard Radcliffe Institute. Martin Wattenberg is a Gordon McKay Professor of Computer Science at Harvard University.

Photo credit: Matthew Jason Warford.