Spring Bridge on AI: Promises and Risks | April 15, 2025 | Volume 55, Issue 1

The Next Frontier in AI: Understanding the 3-D World

Author: Fei-Fei Li

AI is being developed to understand and interact with the 3-D world, opening up new possibilities in fields like robotics and healthcare.

Five hundred and forty million years ago, there was pure darkness. All life existed below the surface of the water. But the vast blackness was not due to lack of light. In fact, light penetrated thousands of meters below sea level. The reason was that no living organisms had eyes that could capture that light. It was only upon the emergence of trilobites, the first organisms with the ability to sense light, that species could experience the abundance of sunlight around them.

What followed was remarkable. Over the next 10-15 million years, the ability to see is thought to have ushered in the Cambrian explosion, during which most of the major animal groups we know today appeared. The evolution of sight is significant because it was the first time species knew something other than themselves, and knew that there was a world they inhabited. Once sight was more widespread, the nervous system began to evolve, and sight led to insight, as species were able to make sense of the 3-D world around them. Next came action, in which species began to manipulate their surroundings. And finally, all of this led to intelligence.

Why mention this seemingly random piece of biological history in a publication focused on engineering? Because today we are experiencing a modern-day Cambrian explosion as it relates to artificial intelligence (AI). AI has moved from labs in academia into the mainstream, with incredible tools simply a click away. More than any other technology, AI will change our world in ways we still cannot fathom, and one major way it will do that is by teaching computers to understand and manipulate the 3-D world. This will be done by a subfield of AI known as computer vision, and, just as the eye evolved in organisms, the complexity of a computer's ability to see and understand what it sees is also undergoing tremendous evolution. Today researchers are pushing computers to have visual intelligence that is the same as or better than that of humans. To understand how we got here, it is important to examine the developments that made this bold goal plausible.

How Modern-Day AI Became Possible

Researchers and data scientists have been working on AI for decades. But it was only in the mid-2000s that three powerful forces converged, ushering in modern-day AI. These three forces were:

- Neural networks: a family of algorithms, computational models inspired by the human brain, in which interconnected nodes organized in layers process and transmit large amounts of information (a minimal sketch follows this list).
- Graphics processing units (GPUs): fast, specialized pieces of hardware that are very good at performing high-volume processing tasks efficiently.
- Big data: as digital information amassed and services like the internet took off, large amounts of data proliferated and became widely accessible.
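To make the first of these forces concrete, here is a minimal sketch of what "interconnected nodes organized in layers" means in code. The layer sizes, input values, and weights below are arbitrary illustrative choices (the weights are random, not trained), and the sketch is an illustration of the general idea rather than any particular system.

```python
# A toy two-layer neural network: information flows from 4 input nodes,
# through 8 hidden nodes, to a single output node.
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 8))   # weights connecting input nodes to hidden nodes
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # weights connecting hidden nodes to the output node
b2 = np.zeros(1)

def forward(x):
    """Pass information through the layers of nodes."""
    hidden = np.tanh(x @ W1 + b1)   # each hidden node combines all inputs
    output = hidden @ W2 + b2       # the output node combines all hidden nodes
    return output

x = np.array([0.2, -1.0, 0.5, 0.7])  # an arbitrary 4-number input
print(forward(x))                    # a single number produced by the network
```

In a real system the weights would be learned from data rather than drawn at random, and the network would contain many more layers and nodes.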
Neural networks had existed as far back as 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts created one using electrical circuits as a way to model how neurons work in the brain. Over the next 40 years, the field continued to grow, and neural networks became more complex, multilayered, and bidirectional. Frank Rosenblatt's introduction of the Perceptron in the late 1950s, a single-layer neural network capable of learning simple patterns, marked a significant step toward practical implementation. Training more complex networks remained difficult, however, and it wasn't until the 1980s that the field regained momentum, when Geoffrey Hinton and his collaborators popularized the backpropagation algorithm, which allowed researchers to efficiently train multilayer networks. Still, forward progress was slowed by the lack of sufficient compute power and of data on which to effectively train the networks. It was only as Web 2.0 gave people the ability to write on the internet and interact on social networks, as e-commerce, digital cameras, and smartphones became more readily available, and as GPUs became more accessible to researchers that it finally became possible to truly take advantage of neural networks. It is this development that led to huge advances in computer vision, but before discussing that, it is important to understand how researchers even came to pursue it.

First, the Power of Human Sight

There is good reason for wanting computers to have the same visual intelligence as humans. Human sight is pretty incredible. We receive 2-D information from the world, and our brains translate it into the 3-D world we live in. Our vision allows us to comprehend and make sense of the world, and then to take action within it.

Recognizing objects is something humans do especially well. Researchers know this because of rapid serial visual presentation, a scientific method for studying the timing of vision. In one study testing the processing speed of the human visual system, participants were shown a complex natural image for 20 milliseconds and asked whether an animal was present. The study found that participants could decide within about 150 milliseconds, roughly the blink of an eye (Thorpe et al. 1996). We also know from fMRI-based studies that we have evolved areas of the brain dedicated to visual recognition, such as the fusiform face area and the parahippocampal place area, which help us identify people and places (Epstein and Kanwisher 1998; Kanwisher et al. 1997). This is all to say that object recognition is a fundamental building block of visual intelligence in humans.

While it may be effortless for humans to identify, say, a picture of a cat, that task is much harder for a computer. There are mathematically infinite ways of rendering a cat from 3-D onto 2-D pixels, depending on lighting, texture, background clutter, and viewing angle, which makes recognition a challenging mathematical problem to solve. Knowing how humans identify objects has guided scientists and researchers interested in the field of computer vision and given us a sense of what we would need to accomplish to be successful.

The Early Days of Computer Vision

Attempts at getting computers to recognize objects can be characterized in three phases. In the first phase, which lasted from the 1970s to the 1990s, researchers used hand-designed features and models.
People like Irving Biederman, Rodney Brooks, Thomas Binford, Martin Fischler, and R. A. Elschlager used geon theory, generalized cylinders, and parts-and-springs models to build object recognition theory. The models were mathematically beautiful, but they didn't work.

The next phase, arguably the most important, was the introduction of machine learning in the early 2000s. When researchers developed machine learning as a statistical modeling technique, they were still inputting hand-designed features (for example, parts of objects such as an ear, an eye, or a mouth that carry semantic information), and the machine learning models then learned the parameters that could stitch those patches of objects together into, for instance, a face, a body, or a cat. The models varied, from pictorial structure and constellation models to boosting to conditional random fields.

Another significant development was taking place at the same time that researchers were building these machine learning models. Researchers focused on computer vision recognized the importance of data, and as a result they created benchmarking datasets against which they could measure their work. These datasets, which included the PASCAL Visual Object Classes (VOC) and Caltech 101, served as early training datasets, but they were small. The PASCAL VOC dataset contained thousands of images across just 20 object categories (Everingham et al. 2010).

It was the work of Irving Biederman, a cognitive psychologist who studied human visual intelligence, that broadened the scale and scope of what researchers believed the number of recognizable object categories to be. In his influential article "Recognition-by-Components: A Theory of Human Image Understanding" (1987), he posited (though the estimate was never verified) that humans identify more than 30,000 object categories over their lifetime. This number was one of the reasons my research partners and I launched ImageNet. We understood that advancing computer vision would take more than a large amount of data; it would also take diverse data. But the amount still mattered. Think of a young child: no one teaches them how to see. They make sense of the world through experiences and examples. If one thinks of their eyes as biological cameras, those cameras take a "picture" roughly every 200 milliseconds. Following this rough estimate, by the time a child is three, they will have seen hundreds of millions of "pictures." While the field remained focused on developing more advanced algorithms, I realized that we should instead focus on feeding existing algorithms training data that matched, in both quantity and quality, what a child receives through daily experience.

This was what ImageNet did, starting in 2007. The project was the perfect undertaking for the confluence of forces mentioned previously: convolutional neural networks, which work best with large datasets; big data, or the hundreds of millions of digital images used in the project; and GPUs, which had advanced by this point to be able to process high-volume tasks efficiently. Researchers employed crowdsourcing through Amazon's Mechanical Turk platform to identify and categorize nearly 1 billion images. At its peak, almost 49,000 people from 167 countries helped our researchers clean, sort, and label those candidate images. This represents the beginning of the third phase of object recognition.
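To give a sense of what this third phase made possible, here is a hedged sketch of present-day, ImageNet-style object recognition: a convolutional network pretrained on ImageNet classifying a single photo. It assumes PyTorch and a recent torchvision are installed; the file name cat.jpg is a placeholder for any photo you supply, and the sketch illustrates the general technique rather than the original ImageNet pipeline.

```python
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet-pretrained convolutional network plus the preprocessing it expects.
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

# "cat.jpg" is a placeholder; substitute any photo on disk.
image = Image.open("cat.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)          # shape (1, 3, H, W)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)[0]      # scores over 1,000 ImageNet categories

top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][idx]}: {score.item():.2%}")
```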
Computer Vision Advances

If the goal of computer vision is to teach machines to see as humans do, where people are able to lay their eyes on something and nearly instantly weave entire stories and make meaning of people, places, and things, ImageNet was a foundational building block: it greatly advanced a computer's ability to identify a very large catalog of objects. By 2009, ImageNet comprised more than 15 million images across almost 22,000 object categories (Deng et al. 2009).

Still, despite the scale of ImageNet, the result was simply teaching a computer to identify objects. To go back to the child analogy, this is like the child speaking a lot of nouns. The next step was advancing to sentences: rather than identifying a cat, understanding that, for example, the cat is lying on a bed or playing with a ball of yarn. To accomplish this, the connection between big data (the images) and the machine learning algorithms (the convolutional neural networks) processing them needed to evolve. In 2014, a group of researchers and I developed a model that was able to learn not only from the images but also from natural language sentences generated by people. The new models we designed worked like the human brain in that they integrated vision and language, connecting visual snippets of images with words and phrases (Karpathy and Li 2015).

This feat took a lot of hard work from a lot of people, but it still didn't match the level of context and meaning humans get from their vision. While the computer might accurately identify a picture of a boy and a birthday cake, it still couldn't recognize that the boy was wearing a shirt given to him by his late grandfather or tell what type of cake it was. Nor could it tell you the emotion on the boy's face. These are the details that add so much meaning to what we see, so computer vision still had a long way to go.
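To illustrate the general idea of connecting visual snippets with words, here is a schematic sketch of a captioning model: a small convolutional encoder turns an image into a feature vector, and a recurrent decoder emits one word at a time conditioned on that vector. The tiny vocabulary, layer sizes, and the class name TinyCaptioner are invented for this illustration, and the sketch is not the alignment model of Karpathy and Li (2015); with random weights it produces gibberish until trained on paired images and captions.

```python
import torch
import torch.nn as nn

vocab = ["<start>", "<end>", "a", "cat", "dog", "boy", "cake", "bed",
         "ball", "on", "with", "the"]                     # toy vocabulary

class TinyCaptioner(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=64):
        super().__init__()
        # Vision side: a minimal convolutional encoder (a stand-in for a
        # large CNN pretrained on ImageNet).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Language side: word embeddings and a GRU decoder whose initial
        # hidden state is derived from the image features.
        self.embed = nn.Embedding(len(vocab), hidden_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, len(vocab))

    @torch.no_grad()
    def caption(self, image, max_len=8):
        h = torch.tanh(self.init_h(self.encoder(image))).unsqueeze(0)  # (1, B, H)
        token = torch.tensor([[vocab.index("<start>")]])               # (B=1, 1)
        words = []
        for _ in range(max_len):
            out, h = self.gru(self.embed(token), h)
            token = self.to_vocab(out[:, -1]).argmax(dim=-1, keepdim=True)
            word = vocab[int(token)]
            if word == "<end>":
                break
            words.append(word)
        return " ".join(words)

model = TinyCaptioner().eval()
fake_image = torch.rand(1, 3, 64, 64)    # stands in for a real photo
print(model.caption(fake_image))         # gibberish until trained on image-caption pairs
```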
The Era of Spatial Intelligence Is Upon Us

Over time, researchers designed faster, more accurate algorithms. Each year, my lab held the ImageNet Challenge, where we tested and tracked the progress of participants' algorithms. In the span of 13 years, the accuracy of these algorithms ballooned from just over 50% to just over 90% (see figure 1). This was significant, and now some of the latest algorithms can segment objects and predict the dynamic relationships between them in moving pictures (Gupta et al. 2023).

Since 2015, when we first got computers to describe a photo using natural language, researchers have been working on the reverse: rather than giving the computer an image and asking it to describe it, prompting the computer with what you want to see and asking it to generate something for you. This seemed impossible in 2015. Fast forward to today: generative AI algorithms such as OpenAI's Sora, DALL-E, and many others, powered by a family of diffusion models that turn human-written prompts into photos or videos, have achieved this capability. And as progress with such large language model-driven generative algorithms races forward, in my research lab my students and collaborators have developed a generative video model called Walt, which, while far from perfect, is creating some pretty compelling pieces. As impressive as this seems, there is still progress to be made, because these models still lack the element of action.

A computer that can see can learn from what it sees, but it still cannot act upon the world. This capability, to act upon what is seen, learned, and understood, is what I refer to as "spatial intelligence," and it is the next frontier in computer vision. A good example is figure 2. A computer can identify that there is a glass on a table, but it cannot do what our human brain does in an instant: understand the geometry of the glass, its place in 3-D space, its relationship with the table, the cat, and everything else present, and instinctively want to act to stop the glass from tipping over. Getting a computer to understand all of this is difficult and requires translating 2-D images into 3-D models. But it will be critical for computers to become spatially intelligent.

This year, researchers in my lab were able to load a single 2-D image into a computer and use an algorithm that renders a 3-D image from it (Sargent et al. 2024). At the University of Michigan, a group of researchers has figured out how to turn a descriptive sentence into a 3-D room layout (Höllein et al. 2023). At Stanford, researchers and their students have created an algorithm that takes a single image and generates infinite plausible spaces for viewers to explore (Ge et al. 2024). These examples are the first signs of humans' ability to model the richness and nuances of the 3-D world in digital forms.
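One small, concrete ingredient of this 2-D-to-3-D translation can be sketched directly: if a system can estimate a depth value for every pixel and knows (or assumes) the camera's intrinsics, each pixel can be back-projected into a 3-D point. This is a minimal sketch under that assumption; the focal length, image size, and random "depth map" are placeholder values, where a real system would use measured or learned ones.

```python
import numpy as np

H, W = 480, 640                    # image size in pixels
fx = fy = 525.0                    # focal lengths (placeholder values)
cx, cy = W / 2.0, H / 2.0          # principal point at the image center

# Stand-in for a predicted depth map, in meters.
depth = np.random.uniform(0.5, 3.0, size=(H, W))

# Pixel grid: u runs across columns, v down rows.
u, v = np.meshgrid(np.arange(W), np.arange(H))

# Pinhole camera model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
X = (u - cx) * depth / fx
Y = (v - cy) * depth / fy
Z = depth

points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # an (H*W, 3) point cloud
print(points.shape)                                     # (307200, 3)
```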
This rapid progression of spatial intelligence is ushering in a new era of robotics by catalyzing robotic learning, a key component of any embodied intelligence system that needs to understand and interact with the 3-D world. Where ImageNet used high-quality photos to train computers to see, today researchers are training with behaviors and actions so robots can learn how to act in the world. To make this possible, researchers are developing 3-D spatial environments powered by 3-D spatial models so that computers have infinite possibilities from which to learn (Ge et al. 2024). There is also exciting progress happening in robotic language intelligence. Using large language models, my students and our collaborators can give a robot a sentence and have it perform an action such as opening a drawer, unplugging a phone charger, or even making a simple sandwich (Huang et al. 2023).

Real-World Impact of Spatial Intelligence

Once computers have visual intelligence that is the same as or better than that of humans, the positive applications are vast. Healthcare is one industry with many applications. Researchers from my Computer Vision Lab at Stanford University have been collaborating with Stanford's School of Medicine and hospitals to pilot smart sensors that can detect whether clinicians have entered a patient room without properly washing their hands. AI can also aid with tasks such as keeping track of surgical instruments and alerting care teams if a patient is at risk of a fall. Imagine an autonomous robot transporting medical supplies while caretakers focus on patients, or augmented reality that guides surgeons through safer, faster, and less invasive procedures. With spatially intelligent AI, all of this and so much more is possible.

What took 540 million years to evolve in humans will happen in computers in a matter of decades. And the human species will ultimately be the beneficiary.

References

Biederman I. 1987. Recognition-by-components: A theory of human image understanding. Psychological Review 94(2):115–47.

Deng J, Dong W, Socher R, Li L-J, Li K, Li F-F. 2009. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–55. IEEE.

Epstein R, Kanwisher N. 1998. A cortical representation of the local visual environment. Nature 392:598–601.

Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. 2010. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88:303–38.

Ge Y, Tang Y, Xu J, Gokmen C, Li C, Ai W, Martinez BJ, Aydin A, Anvari M, Chakravarthy A, and 13 others. 2024. BEHAVIOR Vision Suite: Customizable dataset generation via simulation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22401–12. IEEE.

Gupta A, Wu J, Deng J, Li F-F. 2023. Siamese masked autoencoders. arXiv:2305.14344.

Höllein L, Cao A, Owens A, Johnson J, Nießner M. 2023. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. arXiv:2303.11989.

Huang W, Wang C, Zhang R, Li Y, Wu J, Li F-F. 2023. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv:2307.05973.

Kanwisher N, McDermott J, Chun MM. 1997. The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience 17(11):4302–11.

Karpathy A, Li F-F. 2015. Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3128–37. IEEE.

Sargent K, Li Z, Shah T, Herrmann C, Yu H-X, Zhang Y, Chan ER, Lagun D, Li F-F, Sun D, and 1 other. 2024. ZeroNVS: Zero-shot 360-degree view synthesis from a single image. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9420–29. IEEE.

Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system. Nature 381(6582):520–22.

About the Author: Fei-Fei Li (NAE, NAM) is the Denning Co-Director of Stanford's Institute for Human-Centered Artificial Intelligence.