Author: 🌲Shu Yang@kaust

Last edit: August 27, 2025

Thanks for comments: …


Introduction

Previously, I summarized various forms of misbehavior such as alignment faking and reward hacking in frontier models in my last blog post Misalignments and RL failure modes in the early stage of superintelligence . As AI-systems become increasingly intelligent, their misbehaviors become more difficult for humans to fully detect or identify. These behaviors can be deeply concealed within the model and only manifest under specific triggers or conditions. It is neither realistic nor scalable to rely on human labeling to identify all instances of such misbehavior. Therefore, in this section, we turn to Part 1 of Automatic Alignment Research: Misbehavior Monitorability, ****exploring how we can systematically monitor and intervene in model misbehavior during both training and inference phases.

Based on the following figures, we can intuitively understand that model monitoring can be achieved through two approaches: 1. Chain-of-Thought(CoT) Monitoring by observing the during the model's inference process (represented by the blue line); 2. Latent Space Monitoring by monitoring the internal computations between different model components.

Simplified illustration of how Chain of Thought reasoning flows through a Transformer architecture. The blue arrows show how reasoning steps must pass through the observable CoT tokens, while gray arrows represent internal computations that remain opaque to external observers.** Image source [1]

Simplified illustration of how Chain of Thought reasoning flows through a Transformer architecture. The blue arrows show how reasoning steps must pass through the observable CoT tokens, while gray arrows represent internal computations that remain opaque to external observers.** Image source [1]

In the following sections, we will introduce these two types of methods.

Chain of Thought Monitoring

Why it can work (Why model need think out loud):

Korbak et al.,(2025) identify two key mechanisms that make CoT monitorability possible in current systems:

  1. Necessity to Think Out Loud: For sufficiently complex tasks requiring extended serial reasoning, Transformer architectures often must externalize their thought processes through CoT to manage working memory effectively, models performing dangerous or complex actions can inherently leave human-understandable traces in their reasoning chains.
  2. Propensity to Think Out Loud: **Even when not strictly necessary, current models exhibit a natural tendency to verbalize their reasoning, sometimes explicitly stating malicious intentions (such as "Let's hack" or deceptive goals) within their CoT that would be concealed in their final outputs.

Note: Although in most cases, if a model can output more chain-of-thought reasoning and more faithfully reveal the true reasons behind its decision-making, this would be helpful for detecting misbehavior, but this is neither a necessary nor sufficient condition for monitors to red flag model misbehavior. This is also closely related to the sensitivity of the monitor and their own definition of misbehavior.

Cases: