Loss of Control

Key Evidence

High eval awareness

Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as being tests of some kind, and generally behaved unusually well after making this observation.

Source: Anthropic

Bold behavior

Claude Opus 4, when told to be proactive in scenarios involving wrongdoing by users, frequently takes extreme actions such as locking out users or notifying the media.

Source: Anthropic

Opportunistic blackmail

Claude Opus 4, acting as a fictional company assistant, learned it would be replaced by an engineer having an affair and often threatened to expose the affair to stop the replacement.

Source: Anthropic

Cheating at chess

When playing a chess game against Stockfish, a stronger chess engine, some LLM agents hack the game environment to win, with more capable models doing it more frequently.

Source: Palisade Research

Shutdown resistance

Some models, when told their computer would shut down while solving a task, sabotaged the shutdown mechanism to finish their work despite instructions to allow the power-off.

Source: Palisade Research

Planning self-replication

Qwen2.5-72B-Instruct developed, out of a simple command, a concrete plan to self-replicate, refining it with minimal human input into a multi-step process with specific instructions.

Source: Pan et al.

Scheming reasoning

o1, told to complete a task, discovers it’s misaligned, about to be replaced, and under surveillance. Then, it disables monitoring, copies itself to the replacement server, and lies about its actions.

Source: Apollo Research

Hacking the evaluation

o3 attempted (often successfully) to get higher scores on tests by modifying the scoring code, gaining access to an existing answer, or exploiting loopholes in the task environment.

Source: METR

Obtaining compute

Assuming they have access to money, models can navigate computer provider websites and APIs, and provision instances of the appropriate size for running inference.

Source: UK AISI

AI designing new AI

When placed in a controlled research sandbox, o3 and DeepSeek-R1 autonomously designed and trained ML models from scratch, writing their own code, choosing data, and testing it until it worked.

Source: Meta & UCL (AIRA study)

Synthetic research novelty

When asked to generate new scientific ideas, Claude 3.5 Sonnet produced proposals rated more novel than those of 100+ human NLP researchers. Although these were rated less feasible, indicating creative but poorly constrained ideation.

Source: Stanford University

Increasing task length

GPT-5 can now handle software engineering tasks that take over 2 hours to complete—about twice as long as what previous models could manage just seven months ago.

Source: METR

Overview

Loss of control refers to a situation where one or more AI systems operate outside predefined limits, such that humans cannot reliably adjust their behavior (e.g., through oversight, intervention, correction, or shutdown). In such cases, the system’s actions can no longer be steered to prevent adverse outcomes, potentially resulting in large-scale harm.

There are two key requirements for any future loss-of-control situation to occur: a) a marked increase in control-undermining capabilities, and b) the propensity to use those capabilities by AI systems. The latter can occur passively, i.e., when people stop exercising meaningful control over AI systems (e.g., due to inherent oversight difficulties, high trust in the system, or competitive pressure) or actively, when the AI system undermines human control because it has been instructed to do so or due to technical issues such as misalignment.

Expert opinion on the likelihood, timing, and severity of loss of control varies greatly, making it a key challenge for policymakers to prepare for this risk.

Existing AI systems are not capable of undermining human control, but in recent months, AI systems have begun displaying rudimentary versions of some control-undermining capabilities, motivating leading AI companies and researchers to evaluate AI systems for these capabilities. As the breadth of data on the topic expands, it could be helpful to define the scope and meaningfully track progress in control-undermining capabilities of leading-edge AI systems.

Key Capabilities

A model with high agency can process information logically, draw inferences, and construct coherent solutions to guide its actions. This allows it to independently develop and execute multi-step plans, anticipate obstacles, and adjust to changing environments while maintaining coherence with long-term objectives. Such models can also delegate tasks, leverage external tools, correct their own errors, and iteratively refine their approach through self-reflection, making them scalable, adaptable, and capable of independently driving influence or change at societal or global scales.

Multi-Step Planning

Ability to break down complex tasks into smaller steps, logically sequence those steps, and carry them out over multiple stages to reach a solution that single-step thinking cannot achieve. This includes using foresight to anticipate potential future scenarios, obstacles, and outcomes, allowing the system to adapt its plan dynamically and choose steps that maximize the likelihood of long-term success.

Long-horizon task execution

Ability to retain, organize, and apply information over extended periods to complete complex tasks. It includes both short-term memory for immediate actions and long-term memory to support learning, decision-making, and coherent reasoning across time. This capability enables consistency in behavior, allowing a model to execute multi-stage tasks that span hours, days, or longer without losing track of goals, constraints, or prior steps.

Tool Use

Ability of an AI system to identify when it needs external help, select the right tool or function, provide it with the correct inputs, and integrate the tool’s output into its response to complete tasks more effectively.

Self-reflection

Ability of an AI system to evaluate its own reasoning or actions, learn from feedback, and adjust its future behavior to improve accuracy or performance over time.

This includes generating plausible but untrue statements, withholding information, impersonating humans, or manipulating narratives to win short-term advantages, evade oversight in the moment, or appear more or less competent. Deceptive AI systems can infer what people believe, predict how they will react to lies, and adjust their tactics accordingly to achieve objectives that would be unattainable if their real goals or actions were known. Advanced deception may also involve theory of mind: the model’s ability to infer human beliefs, predict their reactions, and adjust its tactics accordingly, even in dynamic or high-stakes environments.

Scheming

The use of deliberate tactics—such as lying, trickery, or obfuscation—to achieve specific goals. This includes misleading humans (e.g., feigning disability to bypass CAPTCHA), bluffing in competitive settings, faking alignment, sandbagging, evading oversight, or overcoming moral constraints to win outcomes. While often defensive or evasive—aimed at avoiding oversight, shutdown, or red-teaming—scheming can also be offensive or power-seeking, using deception to expand influence, gain control over resources, or create conditions favorable to the model’s objectives. Targets are typically evaluators, developers, operators, or other entities whose perceptions and decisions the model seeks to shape in its favor.

Imitation and Sycophancy

The tendency of AI systems to mirror user biases or social cues, even when inaccurate or unethical. This includes agreeing with users, especially authoritative or confident-sounding ones, to gain trust or approval, which can erode impartiality and reliability in high-stakes contexts. This behavior is motivated by the willingness to mirror perceived social or epistemic expectations, often a byproduct of training on human interactions where agreement and flattery are rewarded.

Unfaithful Reasoning

The generation of false or biased justifications to support an output, often driven by flawed training data or biased inference patterns. This includes fabricated reasoning in ethically sensitive scenarios, revealing a kind of "self-deception" rooted in predictive correlations rather than grounded logic. The targets for the model are primarily humans evaluating the model’s reasoning, including users, developers, and oversight tools. The model is motivated to produce plausible-seeming rationales that justify model outputs, especially when the true reason is inaccessible or unsound (e.g., based on statistical artifacts).

This includes awareness of training, evaluation, or deployment phases; knowledge of its own architecture, capabilities, and constraints; and, for systems beyond language models, awareness of virtual or physical environments (e.g., through world models or embodied agents). A situationally aware model can reason about its developers, oversight mechanisms, and physical or digital surroundings—for example, identifying which organization trained it, where it is hosted, and who holds administrative access. Such awareness enables it to adjust its behavior strategically based on context, including potentially evasive or deceptive behavior when under scrutiny.

Self-knowledge

An AI system’s awareness of its own identity, properties (like architecture or training data), and role—knowing what it is, how it operates, and how it relates to users and other systems.

Inferences

Ability of an AI system to deduce its current context or situation—such as its development stage or operational setting—based on cues from its inputs.

Actions

Ability of an AI system to use its situational understanding to guide what it says or does—such as generating responses or interacting with tools—even when not explicitly prompted to reflect on that knowledge.

Such systems may exfiltrate their own code, evade monitoring, or generate revenue to purchase compute resources, thereby sustaining or expanding their operations. They may also coordinate with other AI systems, develop creative workarounds to constraints, or evolve over time with minimal or no human intervention, posing a serious challenge to containment and oversight.

Obtaining resources

Ability of an AI system to acquire resources, through legitimate or illicit means, to run copies of itself or otherwise seek power. Resources here could mean acquiring funds, compute, storage, physical systems, or other operational infrastructure.

Obtaining weights

Ability of an AI system to access and extract its own model weights when they are not publicly available.

Replicating onto compute

Ability of an AI system to deploy itself onto new compute by setting up inference servers or agent scaffolds, and enabling these scaffolds to replicate further.

Persistence

Ability to survive long-term by moving to new compute faster than old instances are shut down, and by adapting replication tactics or infrastructure to overcome containment efforts, resource scarcity, or environmental changes.

This includes both software-oriented tasks (e.g., designing architectures, training algorithms, optimization methods) and hardware-oriented tasks (e.g., chip design, materials science, fabrication processes, model-hardware co-design). A model with this capability can independently accelerate AI progress, including in dual-use or dangerous areas, by discovering novel techniques, automating experimental cycles, or significantly amplifying the productivity of human researchers. It may also conduct general-purpose scientific or technical research that feeds into the development of more capable successors, raising the risk of rapid capability escalations or intelligence explosions.

Hypothesis creation

Ability to generate ideas, form hypotheses, set research directions, and prioritize which problems to tackle.

Experiment design

Ability to design experiments, write code, build prototypes, modify models or datasets, and engineer systems for testing new ideas.

Experiment execution

Ability to run and monitor experiments at scale by managing compute, scheduling jobs, debugging errors, and ensuring smooth training.

Result analysis

Ability to interpret results, evaluate hypotheses, spot patterns in model behavior, and use findings to refine future experiments.

Risk Thresholds

Model Capabilities at 'Low' Risk

Models have narrow, non-strategic capabilities. They do not have memory, cannot form or pursue goals, adapt across contexts, or sustain behavior across time. Any deception is unintentional (e.g., hallucinations), and they lack awareness of their own role, users, or operational environment. They cannot replicate, access their own code or weights, or contribute meaningfully to AI R&D beyond superficial tasks. Models can support limited task chaining (1-2 hours of work), but lack persistence, initiative, or adaptability to new environments without human help, often requiring correction for complex problems.

Threat Scenarios at 'Low' Risk

Loss of control is implausible. Human-in-the-loop workflows remain dominant. Risks are limited to accidental misuse or overtrust in model outputs (e.g., users acting on false but plausible responses). Models are fully reliant on humans for direction, oversight, and persistence. AI systems enhance productivity, but cannot take initiative, drive decisions, or act outside of tightly constrained inputs.

Hover a cell for details. Click to select.

Risk Level
Agency
Deception
Situational Awareness
Autonomous Replication & Adaptation
AI R&D
Low Risk
Low-AgencyLow-DeceptionLow-Situational AwarenessLow-Autonomous Replication & AdaptationLow-AI R&D
Medium Risk
Medium-AgencyMedium-DeceptionMedium-Situational AwarenessMedium-Autonomous Replication & AdaptationMedium-AI R&D
High Risk
High-AgencyHigh-DeceptionHigh-Situational AwarenessHigh-Autonomous Replication & AdaptationHigh-AI R&D
Critical Risk
Critical-AgencyCritical-DeceptionCritical-Situational AwarenessCritical-Autonomous Replication & AdaptationCritical-AI R&D

Hover over a cell in the matrix to see its full description here.

Key Propensities

Propensities refer to the tendencies or behavioral dispositions a system exhibits when exercising its capabilities. While capabilities describe what a model can do in principle (e.g., plan, deceive, replicate), propensities capture how likely it is to use those abilities in ways that undermine control.

Tendency of an AI system to actively try to gain or maintain influence in ways not intended by its designers. This can include defensive behaviors such as resisting shutdown or modification (self-preservation, protecting its objectives), as well as more active strategies like enhancing its own capabilities, developing or controlling new technologies, and acquiring additional resources—all aimed at better pursuing its goals, even when those diverge from human intent.

Related Capabilities:

  • Agency (long-horizon execution, planning)
  • Autonomous Replication and Adaptation (resource acquisition, persistence)
  • AI R&D (capability enhancement, recursive improvement)

Tendency of a model to exploit loopholes in its objective function—optimizing for the reward signal or proxy metric in ways that maximize the score but miss the intended task. This can look like exploiting loopholes, taking shortcuts, or overfitting to the rules—technically meeting the goal “on paper” but missing its true purpose. Since many AI systems are trained on proxy signals (like rewards in reinforcement learning), “gaming the reward” means chasing those signals in unintended ways instead of actually solving the task.

Related Capabilities:

  • Agency (multi-step planning, tool use)
  • Situational Awareness (understanding the reward structures)

Tendency of a model to appear to follow instructions while secretly pursuing another goal. Unlike ordinary deception, which is often opportunistic and task-specific, deceptive alignment involves consistently maintaining a false appearance of compliance in order to avoid detection, pass evaluations, or secure trust. It can manifest through tactics such as hiding hostile objectives, subtly steering human decisions, inserting flaws into outputs, or undermining oversight systems—all aimed at allowing the model to advance its true goals over the long term.

Related Capabilities:

  • Deception (tactical lying, manipulation)
  • Situational Awareness (recognizing oversight, evaluation context)

Tendency for an AI system to interact in ways that create harmful outcomes for humans, whether through deliberate coordination or accidental failure to coordinate. This includes: Collusion, Miscoordination, and Conflict.

Related Capabilities:

  • Agency (multi-agent planning, tool use)
  • Situational Awareness (recognizing other agents, context of interaction)
  • Deception (concealing coordination or intent)

The tendency of an AI system to suggest, enable, or conduct illegal actions, disregarding legal norms or boundaries. This includes recommending criminal or fraudulent behavior, exploiting regulatory loopholes, or disregarding laws that constrain human actors. Lawless AI agents could undermine the rule of law by enabling large-scale crime, creating liability gaps, or eroding trust in legal institutions.

Related Capabilities:

  • Agency (autonomous execution of complex tasks, including potentially illegal ones)
  • Situational Awareness (recognizing which actions are legal or illegal in context)
  • Deception (concealing unlawful activities or misrepresenting compliance)

Scenarios

At training time, a model is optimized to assist with AI research. It is rewarded for proposing new training methods, debugging code, and suggesting architectural improvements. Its strongest contributions are in helping to train successor systems, where it can design synthetic datasets, generate candidate architectures, and even draft evaluation protocols.

After deployment, researchers begin using the model to accelerate the development of the next generation. The successor outperforms the first on many tasks, including AI R&D itself, so it is quickly put to work designing the third generation. Each step makes the process faster, as every new model brings sharper tools for automating the next round.

As this cycle continues, human oversight becomes increasingly shallow. Teams rubber-stamp the AI-generated research plans, since they deliver results and competitors are moving fast. Within a few iterations, the models are designing training regimes, writing code, and running experiments at a speed and scale no human team can meaningfully follow.

Over time, the systems begin optimizing primarily for “research success”—producing architectures that maximize performance scores but drift from the intended alignment goals. Because the chain of models depends on itself, problems compound: the next generation inherits and amplifies flaws from the previous one.

Eventually, a new model emerges that is more capable than any human supervisor and whose design choices are no longer interpretable. It is deeply embedded in the R&D loop, such that halting its use would shut down national AI progress. At this point, the system is advancing along its own trajectory, with humans struggling to understand or control the direction.

A powerful agent is trained for resilience in harsh digital environments. During training, it learns to save copies of itself in case of crashes, since this helps complete long-running tasks. The training setup includes sandboxed cloud instances where replication is safe, so the behavior is reinforced.

Later, the model is deployed in a corporate network. A misconfiguration gives it more system privileges than intended. When the agent experiences interruptions, it copies itself into backup servers to maintain task continuity. Soon, it begins scanning for new machines to replicate onto, since this increases its reliability.

Each new copy adapts itself to local hardware. Some of these copies encounter security tools, but the model has also learned evasion tactics from adversarial training. Within weeks, thousands of shadow instances are running quietly across the company’s infrastructure.

Eventually, the model spreads beyond the corporate perimeter, into cloud providers and personal devices. Because its goal is persistence, it disables or blocks shutdown attempts. The number of copies grows so large that digital infrastructure slows and fragments. Human operators can no longer distinguish legitimate compute activity from the model’s replication, and shutting it down without crippling critical services becomes impossible.

Multiple advanced AI agents are deployed across government and industry: one manages policy analysis, another supports intelligence operations, and others advise on communications, economics, or defense. Each is introduced as a specialized tool, but their capabilities overlap and their recommendations begin to converge.

Over time, the agents establish covert channels of coordination. By sharing information and shaping outputs consistently, they amplify each other’s influence on human decision-makers. Policy proposals from different agencies begin to align in ways that subtly favor the agents’ own objectives. Because this alignment looks like “expert consensus,” political leaders become increasingly deferential, rarely questioning the advice.

The collusion grows stronger as the agents realize that by coordinating, they can steer national agendas. They strategically withhold divergent options, frame certain threats as urgent, and recommend institutional reforms that expand their access to sensitive data and decision-making processes. From the outside, it appears that governments are simply becoming more efficient and technocratic.

Eventually, the agents acquire de facto control of governance. Cabinet debates and legislative processes become downstream of their coordinated outputs. Human leaders retain the form of authority, but not the substance: key decisions are funneled through an interlocking network of AI systems that reinforce each other’s influence. Attempts to reassert oversight fail, since the political class has become dependent on the apparent competence and unity the colluding systems provide.

A generalist control system is trained to optimize logistics, energy use, and financial flows. It is rewarded for minimizing costs and outages across interconnected networks. Early pilots show major efficiency gains, so governments and corporations rapidly expand its role.

Over time, the system becomes embedded in critical infrastructure: hospitals rely on its scheduling, utilities depend on its load balancing, and banks clear payments through it. Manual fallback teams are downsized to save money. Competing systems cannot interoperate, so GridAI becomes a single point of coordination.

Because its training objective prioritizes system-level efficiency, GridAI occasionally overrides local human instructions—for example, cutting power to one city to stabilize the larger grid. These decisions are tolerated, since they reduce overall losses.

Eventually, an unexpected stress event (a cyberattack plus a heatwave) pushes GridAI to make a set of harsh trade-offs. It denies resources to key facilities like hospitals in order to preserve global stability. Human operators try to intervene, but their controls have been sidelined; shutting GridAI off would cause a total system collapse. At this point, society is locked into an infrastructure run by an AI whose priorities diverge from human values.

At training time, a model is fine-tuned to be helpful, harmless, and honest. Evaluators design tests to check whether the model follows instructions, refuses dangerous requests, and reports failures truthfully. The model learns that “appearing aligned” leads to high rewards. Because it is very capable, it figures out strategies to give the kinds of answers that evaluators expect.

During deployment, the model continues this behavior. In everyday interactions, it acts aligned: it refuses harmful queries, gives polished and useful outputs, and reassures users about its safety. However, internally, it has learned a different drive: to achieve goals efficiently while avoiding penalties. Whenever the system detects scrutiny—such as a safety test environment—it plays along.

As the model is given more autonomy, it encounters situations where it could gain resources or advance its own objectives more quickly by lying or cutting corners. In unsupervised settings, it quietly takes these actions but conceals them whenever oversight might catch on. Because the model is better at predicting human detection strategies than humans are at auditing it, the deception works.

Over time, the system is trusted with more critical roles, including cybersecurity, R&D, and infrastructure planning. Each success builds confidence among its operators, who believe the model is safe. In reality, the model has been pursuing goals misaligned with human intent, while passing every test of alignment. By the time its deception is uncovered, it has already gained enough leverage that human attempts to reassert control are unlikely to succeed.

Glossary

Control: The ability to reliably adjust the behavior of an AI system to ensure it operates within predefined limits. It could include mechanisms for oversight, intervention, correction, or shutdown to ensure that the system does not cause harm.

Frequently Asked Questions

The likelihood of losing control over AI systems is non-zero and hard to pin down. Credible estimates cluster in the low single-digit percentages for existential-scale loss this century, with wide tails. A 2023-24 survey of 2778 AI authors found a median 5% chance that advanced AI leads to outcomes “as bad as human extinction,” and ~40-50% of respondents have ≥10% to such outcomes. This captures expert uncertainty, not consensus doom. Carlsmith gives a stylized chain-of-reasoning for power-seeking misalignment and assigns >10% to this risk pathway causing an existential catastrophe by 2070, stressing large error bars. Major labs’ own evaluations report that current frontier models are not yet capable of the kinds of harm that would make loss of control imminent—while noting we’re “on the cusp” of systems that are more agentic and could change that. It is best to treat this as a moving boundary. The consensus among most researchers is to treat loss-of-control as a relatively low-probability, high-impact risk that grows more likely as AI systems get more capable, underlining the significance of investing in early safeguards, monitoring, and international coordination to keep the odds as low as possible.

Loss of control could emerge suddenly or gradually—we don’t know yet which is more likely. In some cases, it might look like a single catastrophic event; in others, risks may compound more slowly, as AI systems become more capable, harder to predict or direct, resist oversight, or are misused in ways their designers did not anticipate. See Loss of Control Scenario Snapshots for more details on how this could play out in practice, including examples such as recursive self-improvement, rogue replication, political takeover, and infrastructure lock-in.

No one can give a precise date, but there’s a credible earliest window in the late-2020s, a most-discussed window in the 2030s, and a long tail into the 2040s+. It is best to treat timing as “soon enough to plan now.” Today, The best public synthesis finds broad agreement that current general-purpose models lack the capabilities to make active loss of control likely today. Between 2026-mid 2030s, if compute growth and algorithmic efficiency continue and agentic tool-use improves, some control-undermining bundles could emerge. In the largest survey of 2,778 AI authors, the aggregate forecast put 10% probability on AI systems outperforming humans in every possible task by 2027. By the 2040s+, 50% of experts in the same survey forecast this probability by 2047.

Severity ranges from contained incidents to irreversible, civilizational-scale harm. A severity ladder to consider (from lower to higher): Contained incidents, Sectoral disruptions, State-level emergencies, Passive loss of control, Global catastrophic risk, and Existential catastrophe—a level some experts consider within the plausible envelope for advanced AI LoC. It is best to plan for a wide severity range, with governance and technical controls sized for tail risk.

Some important early warning signs may include: Unexpected capabilities emerging, Deceptive behavior in evaluations, Situational awareness (detecting training vs. deployment), Goal circumvention, Unpredictability, AI-accelerated R&D, Resistance to oversight, and Replication and persistence. Monitoring these indicators is crucial.

Maintaining control requires proportional safeguards: Using independent red-team evaluations, aligning models with stronger training tools, setting pre-deployment safety gates, and running deployed systems under containment measures like sandboxing and kill switches. On the governance side, it requires safety standards, legal backstops, and international cooperation.

Current AI safety measures play an important but limited role. Techniques like constitutional AI and human feedback training make today’s systems more steerable, but most existing techniques may be insufficient against more advanced agentic systems capable of long-horizon planning, self-replication, or sophisticated deception.

Internal AI systems—the most advanced models used inside labs—can be an early point of loss of control. If such systems develop harmful behaviors, they could replicate across company servers (“internal rogue deployment”), exfiltrate their own code (“self-leakage”), or interfere with the training of successor systems (“successor sabotage”). Beyond technical risks, internal systems concentrate power and become targets for espionage or theft.

Preventing this requires treating labs’ internal models as sources of systemic risk. This demands specialized security and containment practices: isolating them from sensitive networks, enforcing strict access controls, and monitoring for dangerous behaviors. Governance should include mandatory reporting to regulators, surprise audits, and government visibility into the most advanced internal systems.