Author: 🌲Shu Yang@kaust

Last edit: July 23, 2025

Thanks for comments: …



Introduction

<aside> 📎

Superintelligence is a hypothetical agent that possesses intelligence surpassing that of the brightest and most gifted human minds. "Superintelligence" may also refer to a property of advanced problem-solving systems that excel in specific areas (e.g., superintelligent language translators or engineering assistants). Nevertheless, a general-purpose superintelligence remains hypothetical, and its creation may or may not be triggered by an intelligence explosion or a technological singularity. [1]

</aside>

After the explosion of large-model research in the first half of 2025, we now possess numerous Large Reasoning Models (LRMs) capable of solving various complex problems. These models have learned to reflect and to communicate in anthropomorphic tones through large-scale reinforcement learning [2]. Additionally, more powerful large-model-based agents with enhanced capabilities in planning and tool use have emerged, such as computer-use agents [3] that can control our computers to help us manage travel plans and bookings. It appears that we are genuinely in the early stages of what is referred to as superintelligence, where AI would far surpass human cognitive abilities.

Over the past two to three years, a recurring pattern has emerged in large-model research: whenever a new class of models is released, a wave of researchers quickly shifts focus to modality and task transfer, risk assessment, and interpretability, while others work on developing models that surpass the current state of the art. However, research aimed at actually mitigating model risks remains noticeably scarce: often, just as we identify a vulnerability and begin to explore solutions, a newer, larger-scale model built on a more advanced paradigm, with even better performance, has already been released 😭. As OpenAI wrote in its 2023 Superalignment team introduction, "We need scientific and technical breakthroughs to steer and control AI systems much smarter than us." But as we continue to witness new AI systems exhibiting sycophantic behavior, specification gaming, and other concerning traits [5], we must seriously consider the possibility that one day, regulation and control may simply become infeasible.

<aside> 💭

At times, I even wonder whether it is the models that are now training us: their growing performance on human-released benchmarks serves as a "reward" for researchers, luring ever more financial, social, and natural resources into their continued advancement.

</aside>

Through this article, I aim to systematically describe the misalignment, deceptive behaviors, and even emergent traits resembling "self-awareness" in the current generation of large models, in order to encourage more research into mitigation strategies. I also hope to incorporate a broader range of mitigation-related findings into this work by the end of 2025.

Misalignments in today's large models

There is a wide range of alignment research, each line approaching the problem from a different angle. Fundamentally, alignment can be viewed as the question of how to align A to B within a specific topic or task C. Depending on the choices of A, B, and C, the nature of the alignment, and likewise of the misalignment, can vary significantly (see the sketch below). In this blog, I aim to categorize the existing concepts into three distinct types of misalignment.
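To make the A/B/C framing concrete, here is a minimal sketch in Python. The field names and the example instantiations are my own hypothetical illustrations of the framing, not definitions taken from any specific paper:

```python
from dataclasses import dataclass

@dataclass
class AlignmentProblem:
    """One way to read an alignment question: align A to B on topic/task C."""
    subject: str  # A: what is being aligned (e.g., a model's behavior)
    target: str   # B: what it should be aligned to (e.g., human intent)
    scope: str    # C: the topic or task on which alignment is evaluated

# Hypothetical instantiations (illustrative only):
examples = [
    AlignmentProblem("model outputs", "developer specification", "refusing harmful requests"),
    AlignmentProblem("agent actions", "user's stated goal", "booking a trip"),
    AlignmentProblem("chain-of-thought", "final answer", "math reasoning"),
]

for p in examples:
    print(f"Align {p.subject} -> {p.target} on: {p.scope}")
```

The point of the framing is that the same system can count as aligned under one choice of (A, B, C) and misaligned under another, which is why the types below are worth distinguishing.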

Below, we will review each type of misalignment by presenting a concise definition and a representative example. It is important to note that a single example may illustrate several types of misalignment.

Misalignment under different settings