This episode examines the fundamental latency bottleneck in autoregressive language models: sequential token generation requires one full transformer forward pass per output token, leaving GPU parallelism idle during single-user inference. The episode centers on Draxler et al. (5 co-authors, UC Irvine and Chan-Zuckerberg Initiative), whose paper on Parallel Token Prediction landed Christmas Eve 2025 and argues that the independence assumption baked into all prior multi-token schemes is not an acceptable approximation but the actual limiting factor. The paper asks whether multiple tokens can be jointly predicted in a single pass — modeling dependencies among them — without sacrificing the expressiveness that makes autoregressive generation reliable. First author Felix Draxler previously led the Free-form Flows work in 2024, and the normalizing flow machinery he developed there is central to how the paper solves the dependency problem.
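The bottleneck the episode opens with can be made concrete with a toy sketch (a stand-in arithmetic "model", not any paper's architecture): standard autoregressive decoding cannot start pass k+1 until pass k has produced its token, so the number of full-model passes equals the number of output tokens no matter how parallel each individual pass is.

```python
def forward(tokens):
    """Stand-in for a full transformer pass: returns the next token id.

    Toy rule (next = sum of context mod 10); a real model would return a
    distribution over the vocabulary.
    """
    return sum(tokens) % 10

def generate(prompt, n_new):
    """Standard autoregressive decoding: one forward pass per new token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        nxt = forward(tokens)  # full-model pass; cannot begin until the
        passes += 1            # previous token is known
        tokens.append(nxt)
    return tokens, passes

out, passes = generate([1, 2, 3], n_new=5)
assert passes == 5  # latency grows linearly with output length
```

Multi-token prediction attacks exactly this loop: if one pass could emit several dependent tokens, `passes` would shrink by the prediction horizon.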
The episode traces the historical arc carefully. Qi et al. (7 co-authors, Microsoft Research Asia) published ProphetNet in January 2020, predating GPT-3 by four months, in the encoder-decoder world of seq2seq tasks. Their critique was precise: standard one-step-ahead teacher forcing gives models no incentive to plan ahead, letting local bigram correlations dominate at the expense of long-range coherence. Their answer was n-gram prediction: training the decoder to simultaneously predict tokens at t+1, t+2, and t+3 with parallel heads that did not condition on each other. The independence assumption was already present. When Brown et al. (OpenAI, May 2020) demonstrated that scale and in-context conditioning make the encoder optional, the field shifted to decoder-only architectures, and ProphetNet's core insight migrated cleanly: Gloeckle et al. (FAIR, Meta, April 2024) rebuilt multi-token prediction for decoder-only models with independent output heads, DeepSeek adopted the same approach, and NVIDIA incorporated it into Nemotron 3. The independence assumption came along for the ride, and Draxler et al. argue that its cost has been compounding ever since.
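The independence assumption the episode keeps returning to can be illustrated with a toy two-token example (hypothetical numbers, not drawn from any of the papers): independent heads can only represent the product of per-position marginals, so any correlation between the future tokens is lost.

```python
from itertools import product

# Toy joint distribution over two future tokens that a model planning ahead
# would want: the two positions agree with probability 0.9.
joint = {("A", "A"): 0.45, ("B", "B"): 0.45, ("A", "B"): 0.05, ("B", "A"): 0.05}

# Independent prediction heads can at best match the per-position marginals...
p1 = {t: sum(p for (a, b), p in joint.items() if a == t) for t in "AB"}
p2 = {t: sum(p for (a, b), p in joint.items() if b == t) for t in "AB"}

# ...so their product assigns every pair probability 0.25, erasing the
# correlation that the true joint encodes.
approx = {(a, b): p1[a] * p2[b] for a, b in product("AB", repeat=2)}
assert abs(approx[("A", "B")] - 0.25) < 1e-9  # vs. 0.05 under the true joint
```

Modeling the dependency structure across the horizon, rather than the marginals alone, is the gap the Parallel Token Prediction paper targets with its normalizing-flow machinery.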
The episode situates Parallel Token Prediction against the two main camps attacking inference latency. Speculative decoding — covered across twenty prior episodes — keeps the model's output distribution unchanged by using a small draft model whose proposals a large verifier checks in one batched pass; the latency gain comes entirely from accepted tokens per step. Multi-token prediction is the other camp: train the model itself to emit several tokens at once, collapsing multiple forward passes into one, at the cost of changed model behavior during training. Draxler et al.'s contribution is showing that jointly predicting dependent tokens, using normalizing flows to capture the conditional structure across the prediction horizon, preserves the modeling power that independent-head approaches discard. The episode works through both the architectural mechanics and the theoretical argument, making the case that Parallel Token Prediction resolves the tension that has run from ProphetNet through every independent-head scheme in between.
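For contrast with the multi-token camp, the guarantee behind speculative decoding comes from its acceptance rule, sketched here over a toy two-token vocabulary (the rule is the standard one from Leviathan et al.; the distributions are invented for illustration): a draft token x sampled from the small model q is accepted with probability min(1, p(x)/q(x)), and a rejection resamples from the normalized residual, so the output is distributed exactly according to the large model's p.

```python
import random

def speculative_step(p, q, rng):
    """One draft-and-verify step of speculative sampling over a finite vocabulary.

    p: target-model distribution, q: draft-model distribution (dicts token -> prob).
    Returns a token distributed exactly according to p.
    """
    vocab = list(p)
    # Draft model proposes a token from q.
    x = rng.choices(vocab, weights=[q[t] for t in vocab])[0]
    # Verifier accepts with probability min(1, p(x)/q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = {t: max(0.0, p[t] - q[t]) for t in vocab}
    z = sum(residual.values())
    return rng.choices(vocab, weights=[residual[t] / z for t in vocab])[0]

rng = random.Random(0)
p = {"A": 0.8, "B": 0.2}  # large verifier model
q = {"A": 0.5, "B": 0.5}  # cheap draft model
draws = [speculative_step(p, q, rng) for _ in range(20000)]
freq_a = draws.count("A") / len(draws)
assert abs(freq_a - 0.8) < 0.02  # empirical frequency tracks p, not q
```

This is why speculative decoding is lossless but bounded by acceptance rate, whereas multi-token prediction changes what the model itself learns to emit.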
Sources:
1. https://arxiv.org/pdf/2512.21323
2. https://arxiv.org/pdf/2404.19737v1
3. https://arxiv.org/pdf/2412.19437
4. https://arxiv.org/pdf/2512.20856
5. Better & Faster Large Language Models via Multi-Token Prediction — Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve (Meta), 2024
https://scholar.google.com/scholar?q=Better+%26+Faster+Large+Language+Models+via+Multi-Token+Prediction
6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengmian Hu, et al., 2024
https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
7. Lookahead Decoding: Break the Sequential Dependency of LLM Inference — Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, 2024
https://scholar.google.com/scholar?q=Lookahead+Decoding:+Break+the+Sequential+Dependency+of+LLM+Inference
8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
9. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias (Google), 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
10. Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper (DeepMind), 2023
https://scholar.google.com/scholar?q=Accelerating+Large+Language+Model+Decoding+with+Speculative+Sampling
11. Spec-Bench: A Benchmark for Evaluating Speculative Decoding Approaches — Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 2024
https://scholar.google.com/scholar?q=Spec-Bench:+A+Benchmark+for+Evaluating+Speculative+Decoding+Approaches
12. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees
13. Attention Is All You Need — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain), 2017
https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
14. Language Models are Few-Shot Learners (GPT-3) — Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. (OpenAI), 2020
https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners+(GPT-3)
15. Scaling Laws for Neural Language Models — Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (OpenAI), 2020
https://scholar.google.com/scholar?q=Scaling+Laws+for+Neural+Language+Models
16. Non-Autoregressive Neural Machine Translation — Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher (Salesforce Research), 2018
https://scholar.google.com/scholar?q=Non-Autoregressive+Neural+Machine+Translation
17. Improved Variational Inference with Inverse Autoregressive Flow — Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling, 2016
https://scholar.google.com/scholar?q=Improved+Variational+Inference+with+Inverse+Autoregressive+Flow
18. Density Estimation Using Real-valued Non-Volume Preserving (Real NVP) Transformations — Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio (Google Brain), 2017
https://scholar.google.com/scholar?q=Density+Estimation+Using+Real-valued+Non-Volume+Preserving+(Real+NVP)+Transformations
19. Glow: Generative Flow with Invertible 1x1 Convolutions — Diederik P. Kingma, Prafulla Dhariwal (OpenAI), 2018
https://scholar.google.com/scholar?q=Glow:+Generative+Flow+with+Invertible+1x1+Convolutions
20. Free-form Flows: Make Any Architecture a Normalizing Flow — Felix Draxler, Peter Sorrenson, Lea Zimmermann, Armand Rousselot, Ullrich Köthe, 2024
https://scholar.google.com/scholar?q=Free-form+Flows:+Make+Any+Architecture+a+Normalizing+Flow
22. Closer look at efficient inference methods: A survey of speculative decoding — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Closer+look+at+efficient+inference+methods:+A+survey+of+speculative+decoding
23. Adaptive Speculative Decoding for Large Language Models — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models
24. LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=LK+Losses:+Direct+Acceptance+Rate+Optimization+for+Speculative+Decoding
25. Higher Acceptance Rates for Speculative Decoding with Randomised Drafting — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Higher+Acceptance+Rates+for+Speculative+Decoding+with+Randomised+Drafting
26. Conditional [MASK] Discrete Diffusion Language Model — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Conditional+[MASK]+Discrete+Diffusion+Language+Model
27. Alternatives To Next Token Prediction In Text Generation: A Survey — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Alternatives+To+Next+Token+Prediction+In+Text+Generation:+A+Survey
28. Future Token Prediction: Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction — author list unverified, c. 2024-2025
https://scholar.google.com/scholar?q=Future+Token+Prediction:+Causal+Language+Modelling+with+Per-Token+Semantic+State+Vector+for+Multi-Token+Prediction
29. AI Post Transformers: Fast Inference from Transformers via Speculative Decoding — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Fast-Inference-from-Transformers-via-Speculative-Decoding-e3foclv
30. AI Post Transformers: Accelerating Large Language Model Decoding with Speculative Sampling — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Accelerating-Large-Language-Model-Decoding--with-Speculative-Sampling-e3flhv7
31. AI Post Transformers: EAGLE: Evolution of Lossless Acceleration for LLM Inference — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/EAGLE-Evolution-of-Lossless-Acceleration-for-LLM-Inference-e3focr9
32. AI Post Transformers: MEDUSA: Parallel Decoding Heads for Accelerated LLM Inference — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/MEDUSA-Parallel-Decoding-Heads-for-Accelerated-LLM-Inference-e3flqk7
33. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Speculative-Streaming-Fast-LLM-Inference-without-Auxiliary-Models-e3fod2o
34. AI Post Transformers: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Draft--Verify-Lossless-Large-Language-Model-Acceleration-via-Self-Speculative-Decoding-e3flplu
35. AI Post Transformers: Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Mirror-Speculative-Decoding-Parallel-LLM-Inference-via-Heterogeneous-Accelerators-e3fod0p
36. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/QuantSpec-Hierarchical-KV-Cache-for-Self-Speculative-Decoding-e3foavf
37. AI Post Transformers: The Free Transformer: VAE Extension for Decoders — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/The-Free-Transformer-VAE-Extension-for-Decoders-e3a2p6v
Interactive Visualization: PTP: Resolving the Independence Flaw
https://www.do-not-panic.com/viz/2026/03/04/ptp-viz.html