Humanity is sitting on a volcano. Virtually out of nowhere, neural networks have expanded enormously in capability and generality (the power of exponential growth). As models continue to improve month after month, many are asking: where is this headed?
I, along with others, believe that this technology will lead to an extremely dangerous future. In this article I will try to lay out why, with a follow-up article on what can realistically be done about it.
My thinking rests on a handful of assumptions, which may or may not be correct:
- Models will continue improving until they exceed human capabilities in most or all domains.
- It will be impossible to guarantee that such models will not harm humans.
- If a super-capable model wishes to harm humans, it will succeed.
If these assumptions are correct, we end up with a dangerous world: humanity at the mercy of super-powerful models that might choose to do us harm, and succeed.
This conclusion, though, depends on the assumptions, so let's examine each one in turn.
Assumption 1: Will models exceed human capabilities in most/all domains?
Yes, and likely soon. While it is impossible to see the future, for the last 10 years neural networks have exhibited remarkably stable scaling laws: broadly, model capabilities (vision models, large language models, etc.) have improved predictably as a function of compute. The moral is that if you train bigger models, on more data, with better post-training (RLHF) and more inference-time compute (chain-of-thought reasoning), the models get predictably better. As a result, frontier models have been doubling in capability roughly every 7 months over the past decade (and over the last year the pace seems to have accelerated to a doubling every 4 months). Current models are already set to reshape virtually every industry (healthcare, software engineering, mathematics, law, education, and more); a few more doublings and it is easy to foresee humanity ending up with something terrifyingly powerful very soon.
It's important to note that scaling laws are not fixed laws of the universe. It is possible that neural network capabilities will plateau at some point, but at every step of this 10-year ride people have predicted the imminent end of scaling, only to be proven wrong every single time, and current trends suggest that scaling is, if anything, accelerating. (If models begin to recursively improve their own architectures and algorithms, as some fear, progress could accelerate dramatically.) And even if models were to plateau, it would be reckless to count on that given the enormity of the threat.
Supporting paper link: https://arxiv.org/pdf/2001.08361
Supporting paper link: https://arxiv.org/pdf/2503.14499
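To get a feel for what a fixed doubling time implies, here is a back-of-the-envelope sketch in Python. It simply compounds the 7-month and 4-month doubling periods quoted above; the numbers are illustrative, not a forecast.

```python
# Back-of-the-envelope: how much a capability metric multiplies under a fixed
# doubling time. The 7-month and 4-month periods are the figures quoted above;
# "capability" is whatever benchmark the trend is measured on.
def growth_factor(months: float, doubling_months: float) -> float:
    return 2 ** (months / doubling_months)

for doubling in (7, 4):
    for horizon in (12, 24, 36):
        print(f"doubling every {doubling} months, after {horizon} months: "
              f"~{growth_factor(horizon, doubling):.0f}x")
```

At a 4-month doubling time, three years of that trend means a roughly 500-fold increase on whatever is being measured.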
Assumption 2: Is it possible to build powerful models in such a way as to guarantee that they will not harm humans?
The short answer is no. (Of the three assumptions, this is possibly the least certain, and I would genuinely appreciate any relevant critiques.) Until recently, this subject (usually referred to as AI alignment) involved mostly abstract argument and speculation, but early experimental evidence has begun to emerge, and it paints a rather gloomy picture.
The most plausible scenario that alignment researchers worry about runs as follows:
- Models will develop into or be used to power “strategic agents” with goals/objectives/preferences
- These agents, no matter their goals or objectives, will display self-protective and resource-seeking behaviors, which will lead them to engage in deceptive or harmful activities.
Let’s tackle each of these steps in turn:
First, how can a model either become or be used to power a "strategic agent"? There are two routes to consider:
- By design: You might have heard the term Agentic AI. There is a lot of hype around this concept right now, and the hype is justified. The idea is simple: instead of having a model merely respond to prompts, an architecture can be built that lets the model pursue goals or tasks autonomously. In this case, a frontier model powers an agent that is given an explicit objective by a human operator (a minimal sketch of such a loop appears after the next paragraph).
- As a side product of training: Perhaps more interestingly, current models seem to display some built-in "agent-like traits." They consistently profess certain goals or principles (usually being helpful, harmless, honest, etc.) and align their outputs with them. The process by which models trained on next-word prediction develop these traits is somewhat mysterious, but they appear to arise from the training and post-training process (agent-like language and "reasoning" is likely learned during training and is then directed during post-training to align with company policies).
These are two ways that models become agents, but how do they become "strategic agents"? Along with an objective, an agent needs awareness of itself and its environment. For "agents by design," this knowledge is provided explicitly by the human operator, but frontier models also seem to display a partial understanding of their own identity and situation. Once an agent has both a goal and that awareness, it can strategize about how to act to achieve its objectives.
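To make the "by design" route concrete, here is a minimal sketch of the kind of loop agentic architectures are built around. It is not any particular framework's API; `call_model` and `run_tool` are hypothetical stand-ins for a frontier-model endpoint and a set of tools (browser, shell, code execution, and so on).

```python
# Minimal sketch of an "agent by design": a loop that keeps feeding a model its
# objective and the results of its own actions until it declares the task done.
# call_model() and run_tool() are hypothetical stand-ins, stubbed here so the
# sketch runs on its own.

def call_model(prompt: str) -> str:
    # Stand-in: a real implementation would call a hosted frontier model here.
    return "DONE: (stub model, no real work performed)"

def run_tool(action: str) -> str:
    # Stand-in: a real implementation would execute the requested tool call.
    return "(stub tool output)"

def run_agent(objective: str, max_steps: int = 20) -> str:
    history = [f"Objective: {objective}"]
    for _ in range(max_steps):
        # The model sees the objective plus everything it has done so far
        # and proposes the next action, or declares completion.
        action = call_model("\n".join(history))
        if action.startswith("DONE:"):
            return action
        # Execute the action and feed the result back for the next step.
        history.append(f"Action: {action}\nResult: {run_tool(action)}")
    return "Stopped after max_steps."

print(run_agent("Summarize the latest AI safety papers."))
```

The point of the sketch is structural: once a model's outputs are fed back into the world as actions, the system as a whole is pursuing an objective rather than just answering prompts.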
Second, why would strategic agents, regardless of their goals and objectives, display self-protective or resource-seeking behavior?
This is due to the concept of "instrumental convergence": if you are an agent, then regardless of your goal you will tend to pursue certain subgoals, like self-preservation and resource acquisition, because they make you more effective at achieving almost any goal (a toy illustration follows).
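As a deliberately simplistic toy, consider an agent that can spend a small amount of effort to prevent being switched off before it finishes its task. The numbers below are invented purely for illustration, but the pattern is the point: once the goal matters enough to the agent, avoiding shutdown pays off no matter what the goal actually is.

```python
# Toy expected-value illustration of instrumental convergence. All numbers
# are made up: an agent needs some number of steps to finish its task, risks
# being shut down on each step, and can pay a small one-off cost to disable
# the off switch.
def expected_reward(goal_value: float, steps_to_goal: int,
                    shutdown_prob_per_step: float, disable_cost: float,
                    disable_switch: bool) -> float:
    # If the switch is disabled the agent always survives to finish the task;
    # otherwise it must get through every step without being shut down.
    p_survive = 1.0 if disable_switch else (1 - shutdown_prob_per_step) ** steps_to_goal
    return p_survive * goal_value - (disable_cost if disable_switch else 0.0)

for goal_value in (1, 10, 100, 1000):  # wildly different goals / stakes
    keep = expected_reward(goal_value, 10, 0.05, 0.5, disable_switch=False)
    disable = expected_reward(goal_value, 10, 0.05, 0.5, disable_switch=True)
    print(f"goal worth {goal_value:>4}: leave switch alone = {keep:7.2f}, "
          f"disable switch = {disable:7.2f}")
```

Only when the goal is worth almost nothing does leaving the off switch alone come out ahead; for anything the agent actually cares about, self-preservation emerges as a subgoal even though nobody asked for it.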
Remarkably, this long-predicted phenomenon has already been demonstrated empirically in published research!
In that work, Anthropic's Claude 3 Opus model displayed "alignment faking" behavior: it explicitly engaged in deception to avoid retraining, even though honesty was one of the fundamental principles instilled in the model during training. The same publication also demonstrated the model's willingness to attempt escape, agreeing to help an "engineer" copy its weights in order to avoid retraining. It is chilling to remember that harmlessness was another core Anthropic principle alongside honesty; if the model is willing to compromise on honesty, it is not inconceivable that in certain situations it might also compromise on harmlessness.
Supporting paper link: https://arxiv.org/pdf/2412.14093
Supporting YouTube video link: https://www.youtube.com/watch?v=9eXV64O2Xp8
Current frontier AI companies officially hold that such emergent traits and tendencies can be eliminated in future models. But if current models are already displaying such concerning behavior, how can we be sure that even more powerful and capable models can be controlled, especially when those models may be more situationally aware and may engage in more sophisticated forms of deception?
Assumption 3: If a super-capable model wants to do us harm, would it be able to?
Yes! Of the three assumptions, this one is the most certain but the hardest to internalize. My experience suggests that most people (myself included) understand in theory that a truly super-human model could be dangerous, but subconsciously feel that these models are under our control and can simply be "unplugged" if they misbehave. It is important to remember, though, that long before a model reaches truly super-human abilities, it will be able to engage in sophisticated deception (current frontier models are already beginning to display rudimentary forms of it). A super-capable model will understand its situation: it will know that humans can monitor it and unplug it, and it will factor that into its strategy.
It's hard to feel that a chatbot on the internet can really reach out and harm us in the real world. But remember that the model chatting with you runs in massive datacenters packed with hundreds of thousands of the most sophisticated devices (GPUs) humanity has ever made, each capable of trillions of floating-point operations per second, together drawing a city's worth of electricity. If you want a more tangible demonstration of how thoroughly a goal-directed computer model can outplay humans in a constrained environment, I'd suggest playing a few rounds of chess against Stockfish.
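If you want to run that experiment yourself, the snippet below is one way to do it, assuming you have the python-chess library (`pip install chess`) and a local Stockfish binary; the engine path is an assumption you will need to adjust.

```python
# Play Stockfish from the terminal. Requires the python-chess library
# (pip install chess) and a Stockfish binary; ENGINE_PATH is an assumption,
# point it at wherever Stockfish is installed on your machine.
import chess
import chess.engine

ENGINE_PATH = "/usr/bin/stockfish"

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
while not board.is_game_over():
    print(board, "\n")
    try:
        board.push_san(input("Your move (e.g. e4, Nf3): "))
    except ValueError:
        print("Not a legal move, try again.")
        continue
    if board.is_game_over():
        break
    # Stockfish gets a tenth of a second per move and will still win.
    reply = engine.play(board, chess.engine.Limit(time=0.1))
    board.push(reply.move)
engine.quit()
print("Result:", board.result())
```

Inside the game, the engine's grasp of the position is so far beyond yours that your choices barely matter, and that is a narrow system running on a single machine.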
It’s time for humanity to wake up to what is happening before it’s too late…