
Scaling AI Safety: An Untested Hypothesis


There's an implicit model that governs much of the thinking around AI safety research. The belief is that this is a field driven by singular insights from a few brilliant minds. It's a conceptual marathon, not a brute-force sprint. Adding more people, the thinking goes, won't help and might even hurt, as the effort gets bogged down in communication overhead. You can't, after all, get a baby in one month by putting nine women on the task.


This model holds that AI safety doesn't scale with manpower. But what if this model, which has led us to build narrow pipelines for talent, is one of our most dangerous and unexamined priors? What if it's not a law of nature, but a self-fulfilling prophecy?


The stark reality is this: the hypothesis that AI safety research cannot be scaled has never been seriously tested. We have been operating under an assumption inherited from pure mathematics and theoretical physics, while the object of our study has morphed into a sprawling, complex, and deeply empirical artifact. The belief that safety doesn't scale might be the single greatest bottleneck we have imposed upon ourselves, and it's time to question it from first principles.


Deconstructing "Research"


To ask "Can research be scaled?" is to ask the wrong question. It's like asking "Can you cook food faster by adding more people?" The answer depends entirely on whether you're decorating a single wedding cake or running a soup kitchen.


AI safety "research" is not a monolithic activity. It's a diverse portfolio of tasks, each with a different scalability profile:

  • Conceptual/theoretical work: This involves deep thinking on foundational problems like embedded agency, decision theory, or the nature of consequentialism. This is the least scalable domain. Adding 100 people to a room to solve open problems in decision theory is unlikely to speed things up. This is the kernel of truth in the conventional wisdom.

  • Engineering & tool-building: This involves creating the infrastructure for research: better interpretability libraries, platforms for running large-scale experiments, and robust model evaluation frameworks. This work is highly scalable and follows well-understood software engineering principles.

  • Empirical & exploratory work: This is the rapidly growing core of modern alignment work. It includes tasks like mechanistic interpretability, probing models for specific capabilities, and large-scale red teaming. This work is often massively parallelizable.

  • Evaluation & replication: This involves verifying the claims made in research papers, testing alignment techniques on new models and in new domains, and building comprehensive evaluations. This is not only scalable; it's an "all hands on deck" problem that is fundamental to scientific progress.


The crux of the matter is that the center of gravity in AI safety has been shifting. While conceptual work remains vital, the field is now dominated by the need to understand, probe, and control enormous, inscrutable artifacts. The proportion of our work that is empirical, exploratory, and engineer-heavy is exploding. And this type of work scales.
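To make "parallelizable" concrete, here is a minimal Python sketch of how a batch of empirical probes (red-teaming prompts, capability checks, behavioral evaluations) could be fanned out across independent workers. The `query_model` and `looks_unsafe` helpers are stand-ins rather than any real pipeline; the point is only that each probe is independent, so throughput grows with the number of hands (or cores) available.

```python
# Sketch: fanning out independent empirical probes across workers.
# query_model and looks_unsafe are illustrative stand-ins, not a real pipeline.
from concurrent.futures import ThreadPoolExecutor

def query_model(prompt: str) -> str:
    # Stand-in for a real model call (API client, local inference, etc.).
    return f"model response to: {prompt}"

def looks_unsafe(response: str) -> bool:
    # Stand-in for an automated or human judgment about a failure mode.
    return "jailbreak" in response.lower()

def evaluate_prompt(prompt: str) -> dict:
    response = query_model(prompt)
    return {"prompt": prompt, "flagged": looks_unsafe(response)}

prompts = [f"probe #{i}" for i in range(1000)]   # in practice, a shared probe bank

# Each probe is independent, so throughput grows roughly linearly with workers.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(evaluate_prompt, prompts))

print(sum(r["flagged"] for r in results), "prompts flagged out of", len(results))
```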


Lessons from Parallelized Science


We don't need to guess what scaling scientific discovery looks like. Other fields have already provided a powerful template: citizen science.


Consider Foldit, the famous protein-folding game. Protein folding is (or at least was) a monstrously complex search problem. Instead of relying on a few experts, researchers gamified the problem and unleashed the cognitive power of hundreds of thousands of people. The result? The players collectively discovered protein structures that had stumped scientists and algorithms for years.


This is a stunningly relevant analogy for mechanistic interpretability. Finding a specific "circuit" inside a 100-billion-parameter model is also a monstrously complex search problem. It's a high-dimensional maze where human intuition for spotting patterns could be a massive asset. Why are we leaving this search to a few dozen specialists when we could potentially deploy an army?
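To give a feel for what that search looks like, here is a small sketch using the open-source TransformerLens library on GPT-2 Small: zero-ablate each attention head in turn and measure how much the model's preference for the correct next token drops. The prompt and the single-logit scoring rule are illustrative simplifications (real circuit work adds activation patching, path analysis, and careful controls), but note the shape of the loop: 144 independent experiments that could just as easily be split across 144 people.

```python
# Sketch: brute-force head ablation on GPT-2 Small with TransformerLens.
# The prompt and scoring rule are illustrative; this is not a full circuit analysis.
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small: 12 layers x 12 heads

PROMPT = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(PROMPT)
answer_id = model.to_single_token(" Mary")

def answer_logit(logits):
    return logits[0, -1, answer_id].item()          # logit of the correct completion

baseline = answer_logit(model(tokens))

effects = {}
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def zero_head(z, hook, head=head):
            z[:, :, head, :] = 0.0                  # knock out one head's output
            return z
        logits = model.run_with_hooks(
            tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", zero_head)]
        )
        effects[(layer, head)] = baseline - answer_logit(logits)

# Heads whose removal hurts the behavior most are candidate circuit components.
print(sorted(effects.items(), key=lambda kv: kv[1], reverse=True)[:5])
```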


Or consider Galaxy Zoo, where millions of galaxy classifications were performed by volunteers, enabling discoveries that would have been impossible for a small team of astronomers. This model maps directly to tasks like:

  • Large-scale red teaming: Finding novel jailbreaks and failure modes in a model.

  • Data labeling: Identifying subtle instances of sycophancy, deception, or power-seeking in model outputs to train better reward models.


These examples prove that complex scientific problems can be broken down into "cognitive micro-tasks" and solved by a distributed network of motivated individuals.
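Mechanically, this kind of aggregation is simple. Once a question has been reduced to a micro-task, redundant judgments from many volunteers can be combined with something as basic as a majority vote (weighting annotators by measured reliability is a common refinement). A toy Python sketch with made-up labels:

```python
# Toy sketch: aggregating redundant volunteer judgments on micro-tasks
# (e.g. "is this model output sycophantic?") by majority vote.
# The judgment triples are illustrative data, not real annotations.
from collections import Counter, defaultdict

judgments = [
    ("output_17", "vol_a", "sycophantic"),
    ("output_17", "vol_b", "sycophantic"),
    ("output_17", "vol_c", "benign"),
    ("output_42", "vol_a", "benign"),
    ("output_42", "vol_d", "benign"),
]

votes = defaultdict(Counter)
for task_id, _volunteer, label in judgments:
    votes[task_id][label] += 1

# Keep only tasks with enough redundancy, then take the consensus label.
consensus = {
    task_id: counts.most_common(1)[0][0]
    for task_id, counts in votes.items()
    if sum(counts.values()) >= 2
}
print(consensus)   # {'output_17': 'sycophantic', 'output_42': 'benign'}
```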


A natural objection: if a task can be parallelized, surely it can be automated by AI. But at the very least, humans are needed to verify AI-generated results, and AI itself still has a long tail of failure modes.


The Uncoordinated Swarm


The most compelling argument that AI safety can be scaled is that it is already happening, just in a chaotic and inefficient way.


Right now, there are thousands of aspiring researchers and engineers taking online courses, reading the same introductory papers, and, crucially, attempting the same beginner projects. How many people are independently trying to replicate the dictionary learning results from Anthropic? How many are trying to find the same simple circuits in GPT-2 Small?

This is a massive, uncoordinated swarm. It is a proof of concept that the newcomers themselves see the work as parallelizable. The tragedy is that most of this energy dissipates. Without shared infrastructure, a clear research agenda, and a mechanism to aggregate findings, their efforts are duplicative and their discoveries are lost.


Imagine if this energy were harnessed. A coordinated platform could transform this random walk into a directed search. By providing a unified set of tools, a prioritized list of research questions (from "find and document Circuit X" to "red team Model Y for failure mode Z"), and a central repository for results, we could convert this chaotic swarm into a focused research army. The outcome wouldn't just be better; it would be a fundamental phase change in our research capacity.
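What would the backbone of such a platform actually require? Plausibly not much more than a shared schema for prioritized tasks and submitted findings, so that thousands of small contributions land in one queryable place instead of scattered notebooks. The fields below are a guess at the shape of that schema, not a description of any existing system:

```python
# A guess at a minimal data model for a coordinated research platform:
# prioritized tasks plus a central log of submitted findings.
# Field names are illustrative assumptions, not an existing spec.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResearchTask:
    task_id: str
    description: str          # e.g. "Find and document Circuit X in GPT-2 Small"
    priority: int             # lower number = more urgent
    status: str = "open"      # open / claimed / replicated / closed

@dataclass
class Finding:
    task_id: str
    contributor: str
    summary: str
    artifacts: List[str] = field(default_factory=list)  # notebooks, logs, weights

backlog = [
    ResearchTask("T-001", "Red team Model Y for failure mode Z", priority=1),
    ResearchTask("T-002", "Find and document Circuit X in GPT-2 Small", priority=2),
]
findings: List[Finding] = []
```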


A Proposal: Let's Run the Experiments


Ultimately, this debate should not be settled by arguments on a forum. It should be settled by data. The claim that AI safety cannot scale is a testable hypothesis. So, at Theomachia Labs we are going to test it.


The results of these experiments could force a major update in how we structure the entire field.


The conventional wisdom that AI safety is a game for lone geniuses was perhaps true in a bygone era of pure theory. Today, we face colossal, alien artifacts that we must understand and control on a desperately short timeline. To continue relying on a small handful of artisans when we have the opportunity to build a factory is an act of profound and unnecessary risk.
