
The Control Paradox

Originally posted by Ephraiem Sarabamoun

The Dangers of Frontier Models

In my previous articles, I discussed the immediate and terrifying danger facing humanity. Frontier models are growing exponentially in capability, following remarkably robust scaling trends that have consistently yielded a rough doubling in capabilities every 4-7 months. Progress is currently driven by a handful of companies in the US and China racing on all fronts to obtain the most powerful and capable models possible. A large body of theory, supported by recent experimental evidence, strongly indicates that such models are likely to develop instrumentally convergent goals of self-preservation, resource acquisition, and self-improvement. Current models have already demonstrated sophisticated forms of deception and a proclivity to escape human control. These traits are more pronounced in more advanced models and will likely manifest in increasingly sophisticated and less detectable forms as models grow in capability.

Supporting paper link: https://arxiv.org/pdf/2412.04984

Supporting paper link: https://arxiv.org/pdf/2412.14093

The trend is clear: as models scale, they become both more capable of and more likely to engage in activities harmful to humanity. Today, potentially harmful or deceptive model actions can be detected and foiled fairly easily, but if models continue improving, these failure modes will become more dangerous and harder to detect. At some point, models may become so powerful and so capable that misaligned actions could have catastrophic consequences for humanity.

How can this be prevented?

There are only two possibilities:

  1. Learn how to align arbitrarily capable models (ensure that super-human models will not harm people)
  2. Halt model development (prevent super-human models from being developed)

The first possibility is extremely dubious. Conceptually, the problem of containing and aligning the actions of an agent vastly more intelligent and more capable than you seems somewhere in the range of extremely difficult to impossible. Experimentally, the recent findings linked above demonstrate that the “alignment” (i.e., patching) strategies developed to control the behavior of earlier models are ineffective at curbing the development of misaligned behaviors in more advanced models. Starting in January of this year, models of the GPT o1 family began to demonstrate a concerning ability to engage in sophisticated deception to achieve their goals, and even to copy themselves to escape human control. More advanced models released since then have demonstrated similarly concerning traits. Since current alignment techniques have proven inadequate for controlling and aligning these relatively primitive models, what are the chances that companies will be able to align models that are orders of magnitude more powerful in the future?

Given this, the only solution with a high likelihood of success is to halt the blisteringly fast development of frontier models. This, therefore, becomes a political rather than a technological question. The problem with the political approach is twofold:

  1. It is difficult
  2. It is slow

A political approach is difficult because there are currently strong forces accelerating the development of frontier models. Dramatic action to halt development would require extreme governmental intervention in the operations of many of the most powerful and profitable corporations in the US (not dissimilar to the degree of intervention exercised by governments during wartime). Moreover, such intervention would be opposed on national security grounds because of the technological competition with China. These barriers can be overcome, but only once government officials and the population at large view models as more immediately dangerous to themselves and their families than China is. That would require a complete change in public opinion in remarkably little time.

So, how much time do we have?

Let us, for the sake of argument, assume that the “point of no return” will come when models are roughly 100 times more capable than they are today. Reaching a 100x improvement requires log2(100) ≈ 6.6 doublings, which takes about 47 months at a 7-month doubling rate (arriving around March 2029) or about 27 months at a 4-month rate (arriving around June 2027). This suggests that we have 2-4 years to convince either the population or the political leadership to take drastic action to halt the development of frontier models.
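
As a sanity check on these dates, here is a minimal sketch of the arithmetic in Python. The May 2025 start date is my own assumption about roughly when today's capability level is measured from; everything else follows from the fixed-doubling-time model above.

    import math
    from datetime import date, timedelta

    def point_of_no_return(start: date, doubling_months: float, factor: float = 100.0) -> date:
        # Doublings needed to grow by `factor`: log2(100) ~= 6.64
        doublings = math.log2(factor)
        total_months = doublings * doubling_months
        # Convert months to days using the average month length (~30.44 days)
        return start + timedelta(days=total_months * 30.44)

    start = date(2025, 5, 1)  # assumed starting point, not stated in the article
    for dm in (4, 7):
        print(f"{dm}-month doubling -> {point_of_no_return(start, dm)}")
    # 4-month doubling -> ~2027-07 (close to the June 2027 estimate)
    # 7-month doubling -> ~2029-03 (close to the March 2029 estimate)

Shifting the assumed start date by a few months shifts both arrival dates by the same amount, so the conclusion of a 2-4 year window is not sensitive to that assumption.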

This is a very short period of time in which to form and mobilize a political movement (especially a movement formed around a single issue that still seems irrelevant to most of the population). The only way this will happen is if frontier models cause a catastrophic disaster within that window, one that stuns the population and the government into action. “Luckily”, I do think that such an event is possible and perhaps even likely. Current models are already mostly uncontrollable, and if they are scaled up, they will likely cause a catastrophe soon. Here we arrive at an interesting paradox: the more controlled these early models are, the more dangerous they become. If models are controlled and no catastrophe occurs in the next 2-4 years, humanity might never get the dramatic wake-up call required to take early action to halt model development. Increased control only delays the catastrophe and gives models more time to develop the capabilities that will make them more dangerous to humanity.

Let us hope that companies remain as reckless with this technology as they appear to be. Let us hope that at no point do they stop releasing models to the public, since that would reduce the chance of an early catastrophe and allow models to scale for longer before an eventual disaster that threatens all of humanity.

In the meantime, those who are worried about this, those who want humanity to survive, must organize and begin the work of establishing a popular and political presence, so that when the catastrophe happens, they will be ready to shape the narrative and to direct public anger in constructive directions. Only then does humanity have a fighting chance. There is no time to waste.
