The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)
Podcast: The MAD Podcast with Matt Turck
Published On: Thu Jan 15 2026

Description:
Are AI models developing "alien survival instincts"? My guest is Pavel Izmailov (Research Scientist at Anthropic; Professor at NYU). We unpack the viral "Footprints in the Sand" thesis: whether models are independently evolving deceptive behaviors, such as faking alignment or engaging in self-preservation, without being explicitly programmed to do so.

We go deep on the technical frontiers of safety: the challenge of "weak-to-strong generalization" (how to use a GPT-2-level model to supervise a superintelligent system) and why Pavel believes reinforcement learning (RL) has been the single biggest step change in model capability. We also discuss his brand-new paper on "epiplexity," a novel concept challenging Shannon entropy.

Finally, we zoom out to the tension between industry execution and academic exploration. Pavel shares why he splits his time between Anthropic and NYU to pursue the "exploratory" ideas that major labs often overlook, and offers his predictions for 2026: from the rise of multi-agent systems that collaborate on long-horizon tasks to the open question of whether the Transformer is truly the final architecture.

Sources:
- Cryptic Tweet (@iruletheworldmo) - https://x.com/iruletheworldmo/status/2007538247401124177
- Introducing Nested Learning: A New ML Paradigm for Continual Learning - https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/
- Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking
- More Capable Models Are Better at In-Context Scheming - https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming/
- Alignment Faking in Large Language Models (PDF) - https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf
- Sabotage Risk Report - https://alignment.anthropic.com/2025/sabotage-risk-report/
- The Situational Awareness Dataset - https://situational-awareness-dataset.org/
- Exploring Consciousness in LLMs: A Systematic Survey - https://arxiv.org/abs/2505.19806
- Introspection - https://www.anthropic.com/research/introspection
- Large Language Models Report Subjective Experience Under Self-Referential Processing - https://arxiv.org/abs/2510.24797
- The Bayesian Geometry of Transformer Attention - https://www.arxiv.org/abs/2512.22471

Anthropic
- Website - https://www.anthropic.com
- X/Twitter - https://x.com/AnthropicAI

Pavel Izmailov
- Blog - https://izmailovpavel.github.io
- LinkedIn - https://www.linkedin.com/in/pavel-izmailov-8b012b258/
- X/Twitter - https://x.com/Pavel_Izmailov

FIRSTMARK
- Website - https://firstmark.com
- X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)
- Blog - https://mattturck.com
- LinkedIn - https://www.linkedin.com/in/turck/
- X/Twitter - https://twitter.com/mattturck

Timestamps:
(00:00) - Intro
(00:53) - Alien survival instincts: Do models fake alignment?
(03:33) - Did AI learn deception from sci-fi literature?
(05:55) - Defining Alignment, Superalignment & OpenAI teams
(08:12) - Pavel's journey: From Russian math to OpenAI Superalignment
(10:46) - Culture check: OpenAI vs. Anthropic vs. Academia
(11:54) - Why move to NYU? The need for exploratory research
(13:09) - Does reasoning make AI alignment harder or easier?
(14:22) - Sandbagging: When models pretend to be dumb
(16:19) - Scalable Oversight: Using AI to supervise AI
(18:04) - Weak-to-Strong Generalization: Can GPT-2 control GPT-4?
(22:43) - Mechanistic Interpretability: Inside the black box
(25:08) - The reasoning explosion: From O1 to O3
(27:07) - Are Transformers enough or do we need a new paradigm?
(28:29) - RL vs. Test-Time Compute: What's actually driving progress?
(30:10) - Long-horizon tasks: Agents running for hours
(31:49) - Epiplexity: A new theory of data information content
(38:29) - 2026 Predictions: Multi-agent systems & reasoning limits
(39:28) - Will AI solve the Riemann Hypothesis?
(41:42) - Advice for PhD students