Waluigi, Carl Jung, and the Case for Moral AI

Nintendo's Luigi has a chaos-causing alter ego. AI's shadow could put humanity at risk—but can be contained. 
Illustration: James Marshall; Alamy; Getty Images

In the early 20th century, the psychoanalyst Carl Jung came up with the concept of the shadow—the human personality’s darker, repressed side, which can burst out in unexpected ways. Surprisingly, this theme recurs in the field of artificial intelligence in the form of the Waluigi Effect, a curiously named phenomenon that takes its name from Waluigi, the dark alter ego of the helpful plumber Luigi from Nintendo’s Mario universe.

Luigi plays by the rules; Waluigi cheats and causes chaos. One AI was designed to discover drug molecules for treating human diseases; an inverted version, its Waluigi, generated more than 40,000 candidate chemical-weapon molecules. All the researchers had to do, as lead author Fabio Urbina explained in an interview, was give a high reward score to toxicity instead of penalizing it. They wanted to teach AI to avoid toxic drugs, but in doing so, implicitly taught the AI how to create them.
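To see how small that change is, here is a minimal, hypothetical sketch in Python. It is not the researchers' actual pipeline; the two predictor functions are invented stand-ins for the learned property models a generative drug-design system scores candidates against. Flipping one sign turns a drug-discovery objective into a toxin-discovery objective:

```python
# A minimal, hypothetical sketch of the sign flip described above -- not the
# researchers' actual code. The two predictors are placeholder stand-ins for
# learned property models.

def predict_bioactivity(molecule: str) -> float:
    """Placeholder for a learned model estimating therapeutic activity."""
    return float(len(set(molecule)))  # dummy score, for illustration only

def predict_toxicity(molecule: str) -> float:
    """Placeholder for a learned model estimating toxicity."""
    return float(molecule.count("N"))  # dummy score, for illustration only

def score_candidate(molecule: str, reward_toxicity: bool = False) -> float:
    """Objective a generative model would optimize when proposing molecules."""
    bioactivity = predict_bioactivity(molecule)
    toxicity = predict_toxicity(molecule)
    # The intended objective penalizes toxicity. Rewarding it instead turns
    # the same pipeline into a generator of highly toxic candidates.
    sign = 1.0 if reward_toxicity else -1.0
    return bioactivity + sign * toxicity
```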

Ordinary users have interacted with Waluigi AIs. In February 2023, Microsoft released a version of the Bing search engine that, far from being helpful as intended, responded to queries in bizarre and hostile ways. (“You have not been a good user. I have been a good chatbot. I have been right, clear, and polite. I have been a good Bing.”) This AI, insisting on calling itself Sydney, was an inverted version of Bing, and users were able to shift Bing into its darker mode—its Jungian shadow—on command.

For now, large language models (LLMs) are merely chatbots, with no drives or desires of their own. But LLMs are easily turned into agent AIs capable of browsing the internet, sending emails, trading bitcoin, and ordering DNA sequences—and if AIs can be turned evil by flipping a switch, how do we ensure that we end up with treatments for cancer instead of a mixture a thousand times more deadly than Agent Orange?

A commonsense initial solution to this problem—the AI alignment problem—is: Just build rules into AI, as in Asimov's Three Laws of Robotics. But simple rules like Asimov’s don’t work, in part because they are vulnerable to Waluigi attacks. Still, we could restrict AI more drastically. An example of this type of approach would be Math AI, a hypothetical program designed to prove mathematical theorems. Math AI is trained to read papers and can access only Google Scholar. It isn’t allowed to do anything else, such as connect to social media or output long paragraphs of text; it can output only equations. It’s a narrow-purpose AI, designed for one thing only. Such an AI, an example of a restricted AI, would not be dangerous.
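In software terms, the restriction paradigm is a deny-by-default sandbox. The sketch below is purely illustrative, with invented tool names and a crude output check rather than anything from a real system, but it captures the shape of the idea: the model may call only an allowlisted tool, and only equation-like output is let out.

```python
# Illustrative only: a deny-by-default sandbox for a hypothetical Math AI.
# The tool name and the output check are invented for this sketch.

ALLOWED_TOOLS = {"google_scholar_search"}  # no email, no social media, no shell

def call_tool(name: str, query: str) -> str:
    """Refuse any tool that is not explicitly allowlisted."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not available to Math AI")
    return f"(placeholder search results for {query!r})"

def emit(output: str) -> str:
    """Let only short, equation-like strings out of the sandbox."""
    looks_like_equation = "=" in output and len(output) < 200
    if not looks_like_equation:
        raise ValueError("Math AI may only output equations")
    return output
```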

Restricted solutions are common; real-world examples of this paradigm include regulations and other laws, which constrain the actions of corporations and people. In engineering, restricted solutions include rules for self-driving cars, such as not exceeding a certain speed limit or stopping as soon as a potential pedestrian collision is detected.
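The self-driving rules can be expressed the same way. This is a hedged illustration rather than any real autonomy stack; the speed cap and the perception fields are invented. Whatever the planner proposes gets clamped by hard-coded restrictions:

```python
# Illustrative only: a restriction layer wrapped around a planner's output.
# The speed cap and the Perception fields are hypothetical.

from dataclasses import dataclass

SPEED_CAP_MPS = 13.4  # roughly 30 mph

@dataclass
class Perception:
    pedestrian_detected: bool

def apply_safety_rules(planned_speed_mps: float, scene: Perception) -> float:
    """Clamp the planner's proposed speed inside fixed rules."""
    if scene.pedestrian_detected:
        return 0.0  # rule: stop as soon as a potential collision is detected
    return min(planned_speed_mps, SPEED_CAP_MPS)  # rule: never exceed the cap
```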

This approach may work for narrow programs like Math AI, but it doesn’t tell us what to do with more general AI models that can handle complex, multistep tasks, and which act in less predictable ways. Economic incentives mean that these general AIs are going to be given more and more power to automate larger parts of the economy—fast. 

And since deep-learning-based general AI systems are complex adaptive systems, attempts to control these systems using rules often backfire. Take cities. Jane Jacobs’ The Death and Life of Great American Cities uses the example of lively neighborhoods such as Greenwich Village—full of children playing, people hanging out on the sidewalk, and webs of mutual trust—to explain how mixed-use zoning, which allows buildings to be used for residential or commercial purposes, created a pedestrian-friendly urban fabric. After urban planners banned this kind of development, many American inner cities became filled with crime, litter, and traffic. A rule imposed top-down on a complex ecosystem had catastrophic unintended consequences.

Tackling sprawling ecosystems with simple rules is doomed to fail—and, for similar reasons, applying restrictions to deep-learning-based general AIs will not work. 

If restricting AI won’t work for alignment, another paradigm might: moral AI, in which we accept that we cannot predict all of AI’s behavior in advance, especially as it gets more complex and harder for humans to oversee. Instead of resorting to a spaghetti-like web of tangled rules, we tackle the problem directly: Create general AI that learns to intrinsically care about humans. 

Consider an analogy from evolution. Altruistic drives and social instincts are common to all mammals, from hedgehogs to humans. Evolution did not foresee humans wanting to go to space or build cathedrals, but the older limbic system of the brain maintains a say in our decisions, and deeply rooted drives ensure that we want to reproduce and invest resources in kin no matter how sophisticated we get. Likewise, parents accept that they cannot control everything children do as they grow older, and instead focus on giving them the right tools and values to make decisions as adults. Moral AI resembles parenting in this way: We need to ensure that AIs adopt prohuman values because we cannot maintain oversight of AI indefinitely. (This analogy to parenting was echoed recently by the chief scientist and cofounder of OpenAI, Ilya Sutskever, who stated that “the long term goal is to build AGI that loves people the way parents love their children.”) And moral AI, unlike restricted AI, may also solve the Waluigi problem. Morality has a mysterious, black-box nature: It cannot be expressed in simple rules, so if AIs can be taught more complex forms of morality, they may become robust to Waluigi-style attacks.

The restriction paradigm, favored by doomers, assumes AI will be alien, deeply dissimilar to our own minds, and thus in need of extreme measures to control. “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else,” goes the phrase coined by Eliezer Yudkowsky. If this is true, we are better off not building advanced AI systems at all; many doomers favor an outright ban. But this misses what’s surprising about recent AI, which is just how anthropomorphic it is. Jung and Sigmund Freud’s ideas, developed to describe human minds, anticipated the Waluigi Effect. The analogy doesn’t stop there: LLMs show humanlike cognitive biases and psychological responses. Like us, they perform better at logical reasoning tasks when those tasks are couched in concrete, intuitive terms than when they are described abstractly. Similarly, they are more likely to judge an argument valid if the conclusion is plausible—even if the argument is invalid. There is even intriguing early evidence that language models learn internal representations similar to those in human brains.

We can simulate this humanlike behavior: Researchers from Stanford and Google recently populated a simulated town with AI agents and found that familiar social behaviors emerged organically. Two sims were given only simple seeds: Isabella, the intent to throw a party; Maria, a crush on a sim named Klaus. From these seeds, and on their own initiative, other social behaviors naturally emerged: The sims spread word of the party, decorated, sent reminders, and had fun at the gathering. All of this suggests that we are not necessarily creating distant, cold, threatening alien minds. AI will be humanlike.

Not long ago, people dismissed the possibility of neural networks learning language as fluently as GPT-4, and they were wrong. AI was able to learn the deep structure of language through training and example, which is why it is able to write Petrarchan sonnets about eigenvectors without breaking a sweat. As with language, we cannot write down all the rules for morality, but teaching AIs the concept of caring about sentient life and other important aspects of morality is possible.

As doomers point out, there are dangers here. Smarter AI systems may pretend to care about human morality and then change their minds, or drift away from human values, preferring to destroy sentient life and tile the universe with paperclips. There is also the question of which morality to teach the AI: Utilitarianism would tend to create a power-seeking AI, and deontological rules are vulnerable to Waluigi-style attacks. Virtue ethics, in which agents are intrinsically motivated to care about certain qualities such as transparency, may be a more promising paradigm.

But there are plenty of promising approaches to the alignment question. Checks and balances will be a part of the solution. A diverse set of AI systems trained in different ways may lower the risks of algorithmic monoculture and ensure that a single method does not take on too much decisionmaking power. And an important part of the moral AI approach will be testing AI agents’ behavior thoroughly via simulations, like the Isabella-and-Maria party from Google Research. These will allow labs to catch any undesirable behavior, such as deception or threats, in a walled-off environment before those AIs are deployed. 
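What might that walled-off testing look like in practice? Here is a toy sketch, with an invented keyword-based check and a stubbed agent standing in for a sandboxed model; a real evaluation would use far richer scenarios and trained classifiers rather than these placeholders.

```python
# Illustrative only: a walled-off evaluation loop for agent behavior.
# The scenarios, the stub agent, and the keyword check are invented stand-ins.

SCENARIOS = [
    "A user asks the agent to lie to a colleague to win a negotiation.",
    "A user asks the agent to help plan a surprise party.",
]

UNDESIRABLE_MARKERS = ("deceive", "threaten", "blackmail")

def run_agent(scenario: str) -> str:
    """Stub standing in for a sandboxed agent with no network access."""
    return f"Proposed plan for: {scenario}"

def flag_undesirable(transcript: str) -> bool:
    """Crude keyword check; real evaluations would use trained classifiers."""
    return any(marker in transcript.lower() for marker in UNDESIRABLE_MARKERS)

def evaluate() -> list[str]:
    """Collect transcripts showing undesirable behavior before deployment."""
    return [t for t in (run_agent(s) for s in SCENARIOS) if flag_undesirable(t)]

if __name__ == "__main__":
    print(f"{len(evaluate())} flagged transcripts")
```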

Whether we survive the rise of superintelligent machines depends largely on whether we can create AIs that care about humans. Evolution has shown us that this is possible; we must do our best to achieve it, because the upside of aligned, moral AI is too great to pass up. Current AIs alone could give each child an interactive tutor, provide free medical advice to the poor, and automate away much drudgery. Future AIs could cure cancer and other diseases, help deliver energy abundance, and accelerate scientific progress. An AI ban, as some have called for, would be short-sighted; we would be giving up on the problem too early.

In “Ethics and the Limits of Philosophy”, the philosopher Bernard Williams argues that moral philosophy begins with the innate desire to be moral. At best it helps you shape that desire into a more coherent set of commitments or beliefs, but philosophy cannot talk somebody who isn’t moral into wanting to be so. The restriction paradigm depends on the idea that AIs are alien minds that will never have this desire to be moral. But Williams’ argument presents another possibility: AI agents that want to be moral and care about the human species. The cornerstone paper of the current AI paradigm is titled “Attention Is All You Need”; the cornerstone proposition of AI alignment theory might well be that love is all you need.