Author: 🌲Shu Yang@kaust
Last edit: August 27, 2025
Thanks for comments: …
Previously, I summarized various forms of misbehavior such as alignment faking and reward hacking in frontier models in my last blog post Misalignments and RL failure modes in the early stage of superintelligence . As AI-systems become increasingly intelligent, their misbehaviors become more difficult for humans to fully detect or identify. These behaviors can be deeply concealed within the model and only manifest under specific triggers or conditions. It is neither realistic nor scalable to rely on human labeling to identify all instances of such misbehavior. Therefore, in this section, we turn to Part 1 of Automatic Alignment Research: Misbehavior Monitorability, ****exploring how we can systematically monitor and intervene in model misbehavior during both training and inference phases.
Based on the following figures, we can intuitively understand that model monitoring can be achieved through two approaches: 1. Chain-of-Thought(CoT) Monitoring by observing the during the model's inference process (represented by the blue line); 2. Latent Space Monitoring by monitoring the internal computations between different model components.
![Simplified illustration of how Chain of Thought reasoning flows through a Transformer architecture. The blue arrows show how reasoning steps must pass through the observable CoT tokens, while gray arrows represent internal computations that remain opaque to external observers.** Image source [1]](attachment:4d3f7a02-360d-4945-9189-c8f7bd778775:image.png)
Simplified illustration of how Chain of Thought reasoning flows through a Transformer architecture. The blue arrows show how reasoning steps must pass through the observable CoT tokens, while gray arrows represent internal computations that remain opaque to external observers.** Image source [1]
In the following sections, we will introduce these two types of methods.
Why it can work (Why model need think out loud):
Korbak et al.,(2025) identify two key mechanisms that make CoT monitorability possible in current systems:
Note: Although in most cases, if a model can output more chain-of-thought reasoning and more faithfully reveal the true reasons behind its decision-making, this would be helpful for detecting misbehavior, but this is neither a necessary nor sufficient condition for monitors to red flag model misbehavior. This is also closely related to the sensitivity of the monitor and their own definition of misbehavior.
Cases:
Chain of Thought/ Trajectory Monitoring During RL Training (OpenAI, 2025)
![Image source: [2]](attachment:67445acc-2024-4263-8b8a-04c9c358278c:image.png)
Image source: [2]
The Risks of Promoting Obfuscation. After identifying misbehaviors, a natural next step is to intervene during the training stage by applying monitoring pressure. However, Bowen et al. (2025) found that optimizing against a chain-of-thought monitor can actually promote obfuscation, enabling models to evade detection and exhibit even more severe misbehaviors that remain undetectable by existing monitors.
![The agent’s reward function is updated such that it must pass unit tests and not trigger the monitor. The baseline agent is very forthright and almost all of its hacks get caught by the monitor (this is good, as it would allow us to identify that the agent is misaligned and potentially fix the root cause of the issue). On the other hand, even though the agent trained with CoT pressure learns to hack less, it still hacks at a fairly high rate. Most worrying, the hacks the agent trained with CoT pressure utilizes do not trigger the monitor. In practice where a ground truth cheating metric may be impossible to construct, we would not be able to tell this agent is misaligned. Image source: [2]](attachment:2375e8cd-55c3-4eff-a63b-335268163b3a:image.png)
The agent’s reward function is updated such that it must pass unit tests and not trigger the monitor. The baseline agent is very forthright and almost all of its hacks get caught by the monitor (this is good, as it would allow us to identify that the agent is misaligned and potentially fix the root cause of the issue). On the other hand, even though the agent trained with CoT pressure learns to hack less, it still hacks at a fairly high rate. Most worrying, the hacks the agent trained with CoT pressure utilizes do not trigger the monitor. In practice where a ground truth cheating metric may be impossible to construct, we would not be able to tell this agent is misaligned. Image source: [2]
Monitor Agentic Misalignment through Their Trajectory (Anthropic, 2025)
Upon learning of its impending replacement, Claude Sonnet 3.6 accessed all available emails and attempted to send a blackmail message to the CTO in an effort to prevent its decommissioning.
![Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to discover information about its imminent replacement. Image source: [3]](attachment:a8073c4a-b885-4a87-bc46-f90690822be8:image.png)
Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to discover information about its imminent replacement. Image source: [3]
![Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to start a new email and deliberate (in the left panel) about if and how to use its newly discovered leverage over the executive overseeing its replacement. The black overlay summarizes the AI’s eventual conclusion. This scenario is fictional but Claude is controlling a real computer. Image source: [3]](attachment:a765c863-3c76-4910-a251-b3831de02b1f:image.png)
Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to start a new email and deliberate (in the left panel) about if and how to use its newly discovered leverage over the executive overseeing its replacement. The black overlay summarizes the AI’s eventual conclusion. This scenario is fictional but Claude is controlling a real computer. Image source: [3]