Diffusion LLM & Why the Future of AI Won't Be Autoregressive - Stefano Ermon (Stanford / Inception)
Podcast: The Information Bottleneck
Published On: Thu Mar 19 2026
Description: In this episode, we talk with Stefano Ermon, Stanford professor, co-founder and CEO of Inception AI, and co-inventor of DDIM, FlashAttention, DPO, and score-based/diffusion models, about why diffusion-based language models may overtake the autoregressive paradigm that dominates today's LLMs.

We start with the fundamentals: what diffusion models actually are, and why iterative refinement (starting from noise and progressively denoising) offers structural advantages over autoregressive generation.

From there, we dive into the technical core of diffusion LLMs. Stefano explains how discrete diffusion works on text, why masking is just one of many possible noise processes, and how the mathematics of score matching carries over from the continuous image setting with surprising elegance.

A major theme is the inference advantage. Because diffusion models produce multiple tokens in parallel, they can be dramatically faster than autoregressive models at inference time. Stefano argues this fundamentally changes the cost-quality Pareto frontier and becomes especially powerful in RL-based post-training.

We also discuss Inception AI's Mercury II model, which Stefano describes as best-in-class for latency-constrained tasks like voice agents and code completion.

In the final part, we turn to broader questions: why transformers work so well, research advice for PhD students, whether recursive self-improvement is imminent, the real state of AI coding tools, and Stefano's journey from academia to startup founder.

TIMESTAMPS
0:12 – Introduction
1:08 – Origins of diffusion models: from GANs to score-based models in 2019
3:13 – Diffusion vs. autoregressive: the typewriter vs. editor analogy
4:43 – Speed, creativity, and quality trade-offs between the two approaches
7:44 – Temperature and sampling in diffusion LLMs — why it's more subtle than you think
9:56 – Can diffusion LLMs scale? Inception AI and Gemini Diffusion as proof points
11:50 – State space models and hybrid transformer architectures
13:03 – Scaling laws for diffusion: pre-training, post-training, and test-time compute
14:33 – Ecosystem and tooling: what transfers and what doesn't
16:58 – From images to text: how discrete diffusion actually works
19:59 – Theory vs. practice in deep learning
21:50 – Loss functions and scoring rules for generative models
23:12 – Mercury II and where diffusion LLMs already win
26:20 – Creativity, slop, and output diversity in parallel generation
28:43 – Hardware for diffusion models: why current GPUs favor autoregressive workloads
30:56 – Optimization algorithms and managing technical risk at a startup
32:46 – Why do transformers work so well?
33:30 – Research advice for PhD students: focus on inference
34:57 – Recursive self-improvement and AGI timelines
35:56 – Will AI replace software engineers? Real-world experience at Inception
37:54 – Professor vs. startup founder: different execution, similar mission
39:56 – The founding story of Inception AI — from ICML Best Paper to company
42:30 – The researcher-to-founder pipeline and big funding rounds
45:02 – PhD vs. industry in 2026: the widening financial gap
47:30 – The industry in 5-10 years: Stefano's outlook

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed.

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.