A Manhattan Project for AI Safety
The case for being proactive
Last week I called for a Manhattan Project for AI safety in an essay for Politico:
As little as two years ago, the forecasting platform Metaculus put the likely arrival of “weak” artificial general intelligence — a unified system that can compete with the typical college-educated human on most tasks — sometime around the year 2040.
Now forecasters anticipate AGI will arrive in 2026. “Strong” AGIs with robotic capabilities that match or surpass most humans are forecasted to emerge just five years later. With the ability to automate AI research itself, the next milestone would be a superintelligence with unfathomable power.
Don’t count on the normal channels of government to save us from that.
Policymakers cannot afford a drawn-out interagency process or notice and comment period to prepare for what’s coming. On the contrary, making the most of AI’s tremendous upside while heading off catastrophe will require our government to stop taking a backseat role and act with a nimbleness not seen in generations. Hence the need for a new Manhattan Project.
As I note in the piece, “a Manhattan Project for X” is one of those clichés of American politics that seldom merits the hype. Transformative AI is the rare exception. If anything, powerful AI systems surpass the danger of nuclear weapons, at least without assurances that they can be controlled. And like the Manhattan Project, the short timeline to AGI puts a huge premium on acting with speed and decisiveness.
But what would the project do, exactly? I suggest five core functions:
It would serve a coordination role, pulling together the leadership of the top AI companies — OpenAI and its chief competitors, Anthropic and Google DeepMind — to disclose their plans in confidence, develop shared safety protocols and forestall the present arms-race dynamic.
It would draw on their talent and expertise to accelerate the construction of government-owned data centers managed under the highest security, including an “air gap,” a deliberate disconnection from outside networks, ensuring that future, more powerful AIs are unable to escape onto the open internet. Such facilities could be overseen by the Department of Energy’s Artificial Intelligence and Technology Office, given its existing mission to accelerate the demonstration of trustworthy AI.
It would compel the participating companies to collaborate on safety and alignment research, and require models that pose safety risks to be trained and extensively tested in secure facilities.
It would provide testbeds for academic researchers and other external scientists to study the innards of large models like GPT-4, greatly building on existing initiatives like the National AI Research Resource and helping to grow the nascent field of AI interpretability.
And it would provide a cloud platform for training advanced AI models for within-government needs, ensuring the privacy of sensitive government data and serving as a hedge against runaway corporate power.
Let’s break these functions down, starting with points 1 through 3.
By “serve a coordination role,” I’m not merely suggesting the White House convene the top AI companies for tea and biscuits with Madam Vice President, as already happened. Rather, I think the leaders of these companies should be compelled to cooperate on a joint, public-private venture overseen by project officials embedded within the Department of Energy or an alternative civilian agency such as NIST. Their first order of business would be the creation of national safety protocols and criteria for determining what falls outside the scope of the joint project. This could take the form of safety tiers that mirror the designations used for biolabs:
Green / Full steam ahead: Low risk research and engineering that uses or extends the capabilities of existing large models. Companies would be allowed to continue their work on such projects unperturbed. This would cover virtually all existing commercialization and product efforts.
Orange / Proceed with caution: Medium risk research, such as the development of models incrementally more powerful than GPT-4. Companies doing work in this category would need to disclose their plans confidentially, agree to safety testing and external audits, and keep their model weights closed-source.
Red / Here be dragons: High risk R&D, or the machine learning equivalent of “gain of function” research. This would include training runs large enough that they would only be permitted within secured, government-owned data centers. The companies would not only conduct their own work in this category under official oversight, but also collaborate on a shared roadmap for creating and testing the most powerful AI models to date.
The specific risk profiles for each tier would ultimately derive from negotiations with expert input, but would need to remain malleable in the face of architectural breakthroughs that affect the relevant criteria. For example, training runs beyond a certain compute threshold could serve as a legible proxy for Orange- and Red-level capabilities research, but the threshold might need to shift down upon the discovery of a more efficient training paradigm. And while the participating companies would have the clearance to work on Red projects, the rules for disclosing medium-to-high risk research would necessarily fall on all AI companies within US jurisdiction.
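To make the compute-threshold idea concrete, here is a minimal sketch of how such a tiering rule might look in practice. The FLOP cutoffs below are entirely hypothetical placeholders, not proposals; as the paragraph above notes, real thresholds would come from expert negotiation and would shift as training efficiency improves.

```python
# Toy classifier mapping a proposed training run's compute budget to the
# safety tiers described above. Both thresholds are invented for
# illustration and would be set (and revised) by expert negotiation.
ORANGE_FLOPS = 1e25  # hypothetical floor for "proceed with caution"
RED_FLOPS = 1e26     # hypothetical floor for "here be dragons"

def safety_tier(training_flops: float) -> str:
    """Return the tier a proposed training run would fall under."""
    if training_flops >= RED_FLOPS:
        return "red"
    if training_flops >= ORANGE_FLOPS:
        return "orange"
    return "green"
```

The point of a rule this simple is legibility: compute is measurable and auditable in a way that "capabilities" are not, even though it is only a proxy for them.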
Sam Altman endorsed a version of these “safety tiers” in a recent interview. He suggested that they be enforced through an International Atomic Energy Agency for AI. Yet to the extent the US is the current world leader, it would make sense to first develop such an international framework domestically. After all, the impetus for the IAEA originated in the 1946 Acheson–Lilienthal report, which itself grew out of the Manhattan Project.
As I put it in Politico,
Our understanding of how powerful AI systems could go rogue is immature at best, but stands to improve greatly through continued testing, especially of larger models. Air-gapped data centers will thus be essential for experimenting with AI failure modes in a secured setting. This includes pushing models to their limits to explore potentially dangerous emergent behaviors, such as deceptiveness or power-seeking.
The Manhattan Project analogy is not perfect, but it helps to draw a contrast with those who argue that AI safety requires pausing research into more powerful models altogether. The project didn’t seek to decelerate the construction of atomic weaponry, but to master it.
This is why I use the Manhattan Project analogy in the first place: I want the US government to pour resources into building models far more powerful than what is currently economical for private actors — including models that we may have reasons to believe are dangerous. This is not because I think a superintelligence can be perfected in the lab. Rather, it’s because this is the only way to gain advance knowledge of model capabilities years before they become commercially viable.
There are real benefits to OpenAI’s philosophy of deploying models while they are still relatively weak — an alignment version of learning-by-doing. At the same time, there are large regions of AI modeling “phase space” that market forces will tend to leave under-explored. The deep pockets of the US government can thus complement private research efforts by spending capital on training runs with no obvious commercial value.
The upper bound estimate for GPT-4’s training cost is $200 million, while the Manhattan Project cost $24 billion in today’s dollars. $24 billion is pocket change for the US government — a sixth of what we lost to unemployment insurance fraud during Covid. With several billion dollars, a Manhattan Project for AI safety could train models that push the scaling laws to their limit and give us an early glimpse at what’s coming down the pike. While this runs the risk of leapfrogging to a Yudkowsky-style superintelligence, I’d rather we test the “hard takeoff” hypothesis under controlled conditions where we might catch an unfriendly AI in the act. If alignment is as hard as the Yuddites think, we're either dead anyway, or the project will reveal a non-speculative basis for shutting it all down.1
Meanwhile, we need much more testing of the sort Paul Christiano has proposed, including purposefully training AIs to be bad. Such research is too high risk for a private entity, but could prove useful for war-gaming potential AI failure modes.
Points 3 through 5 pertain to the construction of government data centers optimized for deep learning at sites around the country. A subset of these would be top secret and reserved for the Red projects mentioned above. The remainder could be used for a variety of purposes, but would primarily serve as a jobs program for mechanistic interpretability research. While strides are being made to automate AI interpretability, the bread and butter work is quite labor intensive. Few researchers have access to unrestricted large models, and fewer still have the computing resources to carry out extensive tests. This basic market failure lies behind the large gap between model capabilities and our understanding of how they actually work.
Consider the recent paper from OpenAI using GPT-4 to “explain” every neuron in GPT-2. While the potential scalability of this approach is exciting, it relies on using a stronger model to understand a weaker one, and we obviously can’t wait for GPT-8 before deploying GPT-6. It’s also not clear that individual neurons can even be given discrete interpretations. Only 1,000 of all 307,200 neurons in GPT-2 were assigned explanations that accounted for 80% or more of their functionality, and as the report notes, “most of these well-explained neurons are not very interesting.” The most interesting neurons, in contrast, tend to be multi-purpose, a phenomenon known as “polysemanticity.” Optimized models thus tend to be harder to interpret, as they “superimpose” multiple features into a sparse number of neurons. There are a number of potential solutions to this problem, but outside of toy models, the private players aren’t racing to spend hundreds of millions of dollars training inefficient models that trade off performance for interpretability.
The leading alignment research organizations include both the major AI companies and smaller, independent organizations like Redwood Research and the Alignment Research Center. To avoid crowding out or duplicating existing efforts, they would need to have significant input into the research priorities of a Manhattan Project-style initiative, if they aren’t outright recruited to run it. The research funding itself could be structured in a variety of ways, from DARPA-style projects and competitions, to Focused Research Organizations spun up to target the most tractable problems.
Lastly, the government has an interest in training its own models on its own hardware. The IT boom in the late 1990s presaged a wave of e-government reforms in the early 2000s, as governments around the world moved their systems online. Near-term AI will likely drive similar reforms to public administration and the civil service, only turned up to eleven. After all, what is a bureaucrat but a fleshy API? Yet given the sensitivity of government data, it will be prudent to have in place the computing infrastructure needed to train such models in-house. The US government is already making major investments in domestic chip production, so why not add some public procurement to the mix?
Proactive > reactive
The basic philosophy behind a Manhattan Project for AI safety is proactive rather than reactive. The cost of training large models will continue to plummet as data centers running next generation GPUs and TPUs come online. Absent an implausible, world-wide crackdown on AI research, AGI is coming whether we like it or not. Given this reality, we need to be proactively accelerating alignment and interpretability research. The future may be hidden behind a dense fog, but we can at least turn on our high-beams.
Reactive regulation, in contrast, is like looking in the rear-view mirror. Take the EU, which is now contemplating an extraterritorial ban on the use of unlicensed LLMs. This will do far more to strangle open source research and delay the adoption of useful applications than prevent the misuse or abuse of AI. Sweeping licensing requirements do nothing to stop a bad actor from using open source text and image models for nefarious purposes. To paraphrase the old saying, “If you outlaw AI, only the outlaws will have AI.”
The potential upsides from AI are truly enormous, but even the rosy scenarios involve a degree of short-run disruption not seen since the industrial revolution. An overly regulatory approach to AI safety is thus liable to be co-opted by incumbent industries with the most to lose, distracting from the bigger issues. So while some regulation of AI is surely necessary, America’s political economy is simply not well aligned with addressing prospective x-risks.
It would be a shame, although not altogether surprising, if civilization ended because we were more focused on algorithmic bias and waning music royalties than the advent of an alien superintelligence. That is all the more reason to address AI safety through a mission-oriented project that can work at arm's length with industry while standing firmly outside the usual regulatory process.
Versions of the hard takeoff story are designed to be unfalsifiable. I'm implicitly rejecting those. Nor do I think sufficiently large language models will spontaneously learn to synthesize deadly pathogens in a makeshift nanolab. Certain capabilities have to be explicitly trained for and don't simply ride along with greater generality.
If it's true that all you need is scale for LLMs to kill us all, then we're dead either way. But I'd rather test that on a virtual machine in an air-gapped facility. You could even run models in a simulated world, inception-style, so that if and when a model tries to escape, it can be caught attempting to manipulate completely virtual users or objects.
More realistically, AIs, including LLMs, might suddenly develop dramatically greater situational awareness, manipulation abilities and agency. The jump in capabilities may even be scary enough to justify a worldwide ban.
In short, there are many more worlds where testing large models early and in clever ways gives us useful information than worlds where we unexpectedly leap to literally god-like powers. I think it will one day be possible to build something that dangerous too, just not by accident.