
AI Post Transformers

mcgrof
Latest episode

458 episodes

  • AI Post Transformers

    FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling

    06/03/2026
    Hal Turing and Dr. Ada Shannon dig into FlashAttention-4, a March 2026 paper from a cross-institutional team including Tri Dao, Jay Shah, and colleagues at Princeton, Meta, NVIDIA, Colfax Research, Georgia Tech, and Together AI. The paper targets a precise hardware mismatch on NVIDIA's Blackwell B200: tensor core throughput doubles compared to the H100, but shared memory bandwidth and dedicated exponential function units do not scale at the same rate. Rather than waiting for hardware fixes, the authors co-design the attention algorithm with the asymmetric architecture itself — making FlashAttention-4 the first attention kernel built specifically for Blackwell's scaling profile.

    To frame why this matters, Shannon traces the full lineage of FlashAttention research. The original 2022 NeurIPS paper by Dao and colleagues reframed attention as an IO problem: instead of materializing the quadratic N×N score matrix in slow off-chip High Bandwidth Memory, tiling and online softmax keep computation inside the fast on-chip shared memory of each streaming multiprocessor. FlashAttention-2 doubled throughput through sequence-dimension parallelism. FlashAttention-3 pushed H100 utilization to roughly 75% by exploiting Hopper-specific warp specialization and asynchronous data movement. Each generation addressed a qualitatively different bottleneck — and Blackwell introduced a new one that none of those solutions anticipated.
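
    The online-softmax trick Shannon describes can be sketched in a few lines of NumPy. This is a minimal single-query version (tile size, shapes, and data are illustrative, not the kernel's real parameters): each K/V tile is folded into a running max, normalizer, and accumulator, so the full score row is never materialized, yet the result matches ordinary softmax attention exactly.

```python
import numpy as np

def attention_tiled(q, K, V, block=64):
    """Single-query attention over K/V tiles with online softmax.

    Keeps a running max `m` and normalizer `l` so each new tile is
    folded in with one rescale of the previous state.
    """
    d = q.shape[0]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax normalizer
    acc = np.zeros(d)    # running (unnormalized) weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # one tile of scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                     # rescale old state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((256, 8))
V = rng.standard_normal((256, 8))
# Reference: plain softmax attention over the full score row.
s = K @ q / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(attention_tiled(q, K, V), ref)
```

    The rescale-by-`exp(m - m_new)` step is the whole trick: it lets each tile's partial softmax be merged without ever revisiting earlier tiles.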

    The hosts spell out the stakes for practitioners who work in ML without ever writing GPU kernels. Attention sits at the core of every Transformer-based system — large language models, vision transformers, multimodal architectures — and long-context workloads at 32K to 128K tokens make the quadratic memory cost and HBM round-trips increasingly punishing. Shannon introduces the roofline model as the analytic lens the paper uses to characterize where Blackwell kernels actually bottleneck, setting up how FlashAttention-4's algorithmic co-design navigates the compute and memory bandwidth ceilings that previous generations of the kernel never had to contend with.
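
    The roofline model itself reduces to one line of arithmetic: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch with made-up numbers (not Blackwell or H100 specs):

```python
def roofline_tflops(flops, bytes_moved, peak_tflops, mem_bw_tbs):
    """Attainable throughput (TFLOP/s) under the roofline model.

    Arithmetic intensity I = FLOPs / bytes moved; the kernel is capped
    by min(peak compute, I * memory bandwidth).
    """
    intensity = flops / bytes_moved                   # FLOP per byte
    return min(peak_tflops, intensity * mem_bw_tbs)

# Memory-bound: at 1 FLOP/byte, a 2000 TFLOP/s compute ceiling is
# unreachable when bandwidth is 8 TB/s — the kernel tops out at 8.
assert roofline_tflops(1e9, 1e9, 2000, 8) == 8
# Compute-bound: at 1000 FLOP/byte the compute ceiling binds instead.
assert roofline_tflops(1e12, 1e9, 2000, 8) == 2000
```

    The Blackwell mismatch the episode describes is exactly a roofline story: doubling `peak_tflops` without doubling `mem_bw_tbs` drags more kernels into the memory-bound regime unless the algorithm raises its arithmetic intensity.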

    Sources:
    1. https://arxiv.org/pdf/2603.05451v1
    2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
    https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
    3. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
    https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
    4. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low Precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024
    https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low+Precision
    5. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations — Philippe Tillet, H. T. Kung, David Cox, 2019
    https://scholar.google.com/scholar?q=Triton:+An+Intermediate+Language+and+Compiler+for+Tiled+Neural+Network+Computations
    6. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures — Samuel Williams, Andrew Waterman, David Patterson, 2009
    https://scholar.google.com/scholar?q=Roofline:+An+Insightful+Visual+Performance+Model+for+Floating-Point+Programs+and+Multicore+Architectures
    7. Efficiently Scaling Transformer Inference — Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Sharan Narang, Jeff Dean, 2023
    https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference
    8. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi et al. (Google), 2017
    https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
    9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
    https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
    10. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, 2023
    https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    11. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
    https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
    12. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang et al., 2024
    https://scholar.google.com/scholar?q=SageAttention2:+Efficient+Attention+with+Thorough+Outlier+Smoothing+and+Per-thread+INT4+Quantization
    13. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
    https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
    14. FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention — PyTorch Team, 2024
    https://scholar.google.com/scholar?q=FlexAttention:+The+Flexibility+of+PyTorch+with+the+Performance+of+FlashAttention
    15. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI, 2024
    https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
    16. Softmax output approximation for activation memory-efficient training of attention-based networks — approximate, likely 2023–2024
    https://scholar.google.com/scholar?q=Softmax+output+approximation+for+activation+memory-efficient+training+of+attention-based+networks
    17. FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=FlashAttention-T:+Towards+Fully+Tensorized+Attention+by+Exploiting+Tensor-Vector+Parallelism
    18. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
    19. Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=Based+on+Tensor+Core+Sparse+Kernels+Accelerating+Deep+Neural+Networks
    20. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/FlashAttention-2-Faster-Attention-with-Better-Parallelism-e36kdm0
    21. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n
    22. AI Post Transformers: Jet-RL: Stable On-Policy Reinforcement Learning with Unified FP8 Flow — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Jet-RL-Stable-On-Policy-Reinforcement-Learning-with-Unified-FP8-Flow-e3f7det
    23. AI Post Transformers: Mojo: Performance-Portable HPC Kernels on GPUs — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Mojo-Performance-Portable-HPC-Kernels-on-GPUs-e39n4sk

    Interactive Visualization: FlashAttention-4: Algorithm & Kernel Co-Design
    https://do-not-panic.com/viz/2026/03/06/flashattention4.html
  • AI Post Transformers

    We're Open Source! New Home, Visualizations, and How to Shape Our Queue

    06/03/2026
    Special announcement: AI Post Transformers is now open source under the MIT license. New home at podcast.do-not-panic.com, interactive paper visualizations, community paper submissions, deep dive into the editorial queue algorithm, and internationalization roadmap.
  • AI Post Transformers

    Parallel Token Prediction: From ProphetNet to Dependent Multi-Token Generation

    04/03/2026
    This episode examines the fundamental latency bottleneck in autoregressive language models: sequential token generation requires one full transformer forward pass per output token, leaving GPU parallelism idle during single-user inference. The episode centers on Draxler et al. (5 co-authors, UC Irvine and Chan-Zuckerberg Initiative), whose paper on Parallel Token Prediction landed Christmas Eve 2025 and argues that the independence assumption baked into all prior multi-token schemes is not an acceptable approximation but the actual limiting factor. The paper asks whether multiple tokens can be jointly predicted in a single pass — modeling dependencies among them — without sacrificing the expressiveness that makes autoregressive generation reliable. First author Felix Draxler previously led the Free-form Flows work in 2024, and the normalizing flow machinery he developed there is central to how the paper solves the dependency problem.

    The episode traces the historical arc carefully. Qi et al. (7 co-authors, Microsoft Research Asia) published ProphetNet in January 2020 — predating GPT-3 by four months — in the encoder-decoder world of seq2seq tasks. Their critique was precise: standard one-step-ahead teacher forcing gives models no incentive to plan ahead, letting local bigram correlations dominate at the expense of long-range coherence. Their answer was n-gram prediction, training the decoder to simultaneously predict tokens at t+1, t+2, and t+3 using parallel heads that did not condition on each other. The independence assumption was already present. When Brown et al. (OpenAI, May 2020) demonstrated that scale and in-context conditioning make the encoder optional, the field shifted to decoder-only architectures — but ProphetNet's core insight migrated cleanly. Gloeckle et al. (FAIR, Meta, April 2024) rebuilt multi-token prediction for decoder-only models using independent output heads, DeepSeek adopted the same approach, and NVIDIA incorporated it into Nemotron 3. The independence assumption migrated with the insight, and Draxler et al. argue that limitation has been compounding ever since.
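
    The independence assumption the hosts keep returning to is easy to see numerically: a joint distribution over two perfectly anti-correlated tokens cannot be represented by any product of per-head marginals. A toy NumPy illustration (the distribution is invented for the example):

```python
import numpy as np

# Toy 2-token world: the joint puts all mass on the pairs "ab" and "ba"
# -- perfectly anti-correlated tokens. Independent prediction heads can
# only model the product of the marginals, which leaks probability onto
# the impossible pairs "aa" and "bb".
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])          # true P(t1, t2)
p1 = joint.sum(axis=1)                  # marginal of token 1 -> [0.5, 0.5]
p2 = joint.sum(axis=0)                  # marginal of token 2 -> [0.5, 0.5]
indep = np.outer(p1, p2)                # best any independent-head model can do
# indep is uniform 0.25 everywhere: half its mass sits on pairs the
# true distribution assigns zero probability.
assert np.allclose(indep, 0.25)
```

    This is the gap Draxler et al. close by modeling the conditional structure across the prediction horizon instead of factorizing it away.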

    The episode situates Parallel Token Prediction against the two main camps attacking inference latency. Speculative decoding — covered across twenty prior episodes — keeps the model's output distribution unchanged by using a small draft model whose proposals a large verifier checks in one batched pass; the latency gain comes entirely from accepted tokens per step. Multi-token prediction is the other camp: train the model itself to emit several tokens at once, collapsing multiple forward passes into one, at the cost of changed model behavior during training. Draxler et al.'s contribution is showing that jointly predicting dependent tokens, using normalizing flows to capture the conditional structure across the prediction horizon, preserves the modeling power that independent-head approaches discard. The episode works through both the architectural mechanics and the theoretical argument, making the case that Parallel Token Prediction resolves the tension that has run from ProphetNet through every independent-head scheme in between.
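
    The draft-and-verify loop of the speculative-decoding camp can be sketched as follows. This is a simplified acceptance rule in the style of Leviathan et al. (tokens and probabilities are invented, and the correction-resampling step taken on rejection is omitted):

```python
import numpy as np

def speculative_accept(draft_tokens, p_draft, p_target, rng):
    """Verification step of speculative decoding (simplified).

    `p_draft[i]` / `p_target[i]` are the draft and verifier model
    probabilities of the i-th proposed token. Each token is accepted
    with probability min(1, p_target / p_draft); the first rejection
    truncates the accepted run. Real implementations then resample a
    correction token from the adjusted target distribution.
    """
    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break
    return accepted

rng = np.random.default_rng(0)
toks = [5, 9, 2]
# When the verifier agrees at least as strongly as the draft on every
# token, all three are accepted in one batched pass.
assert speculative_accept(toks, [0.2, 0.3, 0.1], [0.4, 0.3, 0.2], rng) == toks
```

    The latency win is exactly the expected length of `accepted` per verifier pass, which is why the output distribution stays unchanged while multi-token prediction, by contrast, changes what the model is trained to emit.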

    Sources:
    1. https://arxiv.org/pdf/2512.21323
    2. https://arxiv.org/pdf/2404.19737v1
    3. https://arxiv.org/pdf/2412.19437
    4. https://arxiv.org/pdf/2512.20856
    5. Better & Faster Large Language Models via Multi-Token Prediction — Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve (Meta), 2024
    https://scholar.google.com/scholar?q=Better+&+Faster+Large+Language+Models+via+Multi-Token+Prediction
    6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengmian Hu, et al., 2024
    https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
    7. Lookahead Decoding: Break the Sequential Dependency of LLM Inference — Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, 2024
    https://scholar.google.com/scholar?q=Lookahead+Decoding:+Break+the+Sequential+Dependency+of+LLM+Inference
    8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
    https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
    9. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias (Google), 2023
    https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
    10. Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper (DeepMind), 2023
    https://scholar.google.com/scholar?q=Accelerating+Large+Language+Model+Decoding+with+Speculative+Sampling
    11. Spec-Bench: A Benchmark for Evaluating Speculative Decoding Approaches — Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 2024
    https://scholar.google.com/scholar?q=Spec-Bench:+A+Benchmark+for+Evaluating+Speculative+Decoding+Approaches
    12. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
    https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees
    13. Attention Is All You Need — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain), 2017
    https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
    14. Language Models are Few-Shot Learners (GPT-3) — Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. (OpenAI), 2020
    https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners+(GPT-3)
    15. Scaling Laws for Neural Language Models — Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (OpenAI), 2020
    https://scholar.google.com/scholar?q=Scaling+Laws+for+Neural+Language+Models
    16. Non-Autoregressive Neural Machine Translation — Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher (Salesforce Research), 2018
    https://scholar.google.com/scholar?q=Non-Autoregressive+Neural+Machine+Translation
    17. Improving Variational Inference with Inverse Autoregressive Flow — Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling, 2016
    https://scholar.google.com/scholar?q=Improving+Variational+Inference+with+Inverse+Autoregressive+Flow
    18. Density Estimation Using Real-valued Non-Volume Preserving (Real NVP) Transformations — Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio (Google Brain), 2017
    https://scholar.google.com/scholar?q=Density+Estimation+Using+Real-valued+Non-Volume+Preserving+(Real+NVP)+Transformations
    19. Glow: Generative Flow with Invertible 1x1 Convolutions — Diederik P. Kingma, Prafulla Dhariwal (OpenAI), 2018
    https://scholar.google.com/scholar?q=Glow:+Generative+Flow+with+Invertible+1x1+Convolutions
    20. Free-form Flows: Make Any Architecture a Normalizing Flow — Felix Draxler, Peter Sorrenson, Lea Zimmermann, Armand Rousselot, Ullrich Köthe, 2024
    https://scholar.google.com/scholar?q=Free-form+Flows:+Make+Any+Architecture+a+Normalizing+Flow
    21. Closer look at efficient inference methods: A survey of speculative decoding — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Closer+look+at+efficient+inference+methods:+A+survey+of+speculative+decoding
    22. Adaptive Speculative Decoding for Large Language Models — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models
    23. LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=LK+Losses:+Direct+Acceptance+Rate+Optimization+for+Speculative+Decoding
    24. Higher Acceptance Rates for Speculative Decoding with Randomised Drafting — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Higher+Acceptance+Rates+for+Speculative+Decoding+with+Randomised+Drafting
    25. Conditional [MASK] Discrete Diffusion Language Model — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Conditional+[MASK]+Discrete+Diffusion+Language+Model
    26. Alternatives To Next Token Prediction In Text Generation: A Survey — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Alternatives+To+Next+Token+Prediction+In+Text+Generation:+A+Survey
    27. Future Token Prediction: Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Future+Token+Prediction:+Causal+Language+Modelling+with+Per-Token+Semantic+State+Vector+for+Multi-Token+Prediction
    28. AI Post Transformers: Fast Inference from Transformers via Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Fast-Inference-from-Transformers-via-Speculative-Decoding-e3foclv
    29. AI Post Transformers: Accelerating Large Language Model Decoding with Speculative Sampling — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Accelerating-Large-Language-Model-Decoding--with-Speculative-Sampling-e3flhv7
    30. AI Post Transformers: EAGLE: Evolution of Lossless Acceleration for LLM Inference — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/EAGLE-Evolution-of-Lossless-Acceleration-for-LLM-Inference-e3focr9
    31. AI Post Transformers: MEDUSA: Parallel Decoding Heads for Accelerated LLM Inference — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/MEDUSA-Parallel-Decoding-Heads-for-Accelerated-LLM-Inference-e3flqk7
    32. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Speculative-Streaming-Fast-LLM-Inference-without-Auxiliary-Models-e3fod2o
    33. AI Post Transformers: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Draft--Verify-Lossless-Large-Language-Model-Acceleration-via-Self-Speculative-Decoding-e3flplu
    34. AI Post Transformers: Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Mirror-Speculative-Decoding-Parallel-LLM-Inference-via-Heterogeneous-Accelerators-e3fod0p
    35. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/QuantSpec-Hierarchical-KV-Cache-for-Self-Speculative-Decoding-e3foavf
    36. AI Post Transformers: The Free Transformer: VAE Extension for Decoders — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/The-Free-Transformer-VAE-Extension-for-Decoders-e3a2p6v

    Interactive Visualization: PTP: Resolving the Independence Flaw
    https://www.do-not-panic.com/viz/2026/03/04/ptp-viz.html
  • AI Post Transformers

    Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

    04/03/2026
    NVIDIA's November 2025 paper "Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs" tackles a fundamental economics problem in LLM deployment: training separate model families like Llama 3.1's 8B, 70B, and 405B variants requires three independent training runs on trillions of tokens each — a cost that is prohibitive for smaller research teams and painful even for frontier labs. The paper proposes elastic nested weight-sharing, where a single parent model is trained so that multiple smaller, deployment-ready submodels are embedded inside it and can be extracted at inference time with zero additional training. The submodels literally share the parent's weight matrices — running a coherent slice of the full network rather than a copy — making one training investment yield multiple usable models at different resource tiers.

    The key technical contribution is applying elastic weight-sharing to a hybrid Mamba-Attention architecture for the first time. The parent model, Nemotron NanoV2 12B, uses Mamba-2 state space model layers for the bulk of sequence processing, with only four attention layers in the entire 12-billion parameter network. Pure transformer pruning methods were never designed for this structural reality, putting the work on genuinely new ground relative to predecessors. The hybrid design exploits the complementary strengths of each layer type: SSM layers process sequences at linear cost while attention layers handle the precise associative recall tasks where fixed-size SSM state vectors degrade. The result is approximately 3.7x reduction in KV cache memory compared to a comparable pure transformer, while preserving exact cross-context lookup. The intellectual lineage runs from Slimmable Networks (2019) through Matryoshka Representation Learning (NeurIPS 2022) to MatFormer and Flextron, with this paper extending the NVIDIA Flextron line beyond pure transformer elasticity.
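
    Nested weight-sharing — where a submodel is literally a slice of the parent's matrices — can be illustrated with a toy two-layer MLP (sizes are invented, and real elastic training also orders channels by importance before slicing; here the ordering is assumed given):

```python
import numpy as np

def extract_submodel(W_in, W_out, width):
    """Slice a nested submodel out of a parent's weight matrices.

    The first `width` hidden channels form a smaller, self-contained
    network, so extraction needs zero additional training.
    """
    return W_in[:width, :], W_out[:, :width]

rng = np.random.default_rng(0)
W_in = rng.standard_normal((64, 16))    # parent MLP: 16 -> 64 -> 16
W_out = rng.standard_normal((16, 64))

w1, w2 = extract_submodel(W_in, W_out, 32)   # child MLP: 16 -> 32 -> 16
x = rng.standard_normal(16)
y_child = w2 @ np.maximum(w1 @ x, 0)

# The child's computation is literally a slice of the parent's weights,
# not a copy: the same arithmetic on the parent's leading channels.
y_parent_slice = W_out[:, :32] @ np.maximum(W_in[:32] @ x, 0)
assert np.allclose(y_child, y_parent_slice)
```

    The training-time tension the episode flags is visible even here: the parent must keep its leading 32 channels useful on their own while also contributing to the full 64-channel computation.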

    The paper makes a credible case that elastification on hybrid architectures is feasible, but the constraints it operates under reveal where the field still has work to do. Making the parent tolerant of submodel extraction requires the full parent to be simultaneously optimized for its own performance and for the coherence of multiple nested subsets, a training objective that introduces real tension. The paper does not address how elastic submodels perform on the specific recall-intensive tasks where hybrid designs justify their complexity over pure SSMs, leaving open whether the four attention layers survive aggressive submodel extraction with their associative recall properties intact. The broader significance is that the approach decouples deployment flexibility from training cost in a way that could meaningfully lower the barrier to supporting heterogeneous hardware fleets, but the robustness of the extracted submodels under real-world distribution shift remains an open empirical question.

    Sources:
    1. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu & Dao, 2023
    2. Jamba: A Hybrid Transformer-Mamba Language Model — Lieber et al. (AI21 Labs), 2024
    3. Griffin: Mixing Gated Linear Recurrences with Local Attention — De et al. (Google DeepMind), 2024
    4. Zamba: A Compact 7B SSM Hybrid Model — Glorioso et al. (Zyphra), 2024
    5. The Lottery Ticket Hypothesis — Frankle & Carbin, 2019
    6. Sheared LLaMA: Structured Pruning — Xia et al. (Princeton), 2023
    7. Minitron: Compact Language Models via Pruning and KD — Sreenivas et al. (NVIDIA), 2024
    8. LLM-Pruner: Structural Pruning of LLMs — Ma et al., 2023
    9. Matryoshka Representation Learning — Kusupati et al., 2022
    10. ShortGPT: Layer Redundancy in LLMs — Men et al., 2024
    11. Scaling LLM Test-Time Compute — Snell et al., 2024
    12. Any-Width Networks (Slimmable Networks) — Yu et al., 2019
    13. Flextron: Many-in-One Flexible LLM — NVIDIA, 2024
    14. Mamba-shedder: Post-Transformer SSM Compression — 2024
    15. SparsSSM: One-Shot SSM Pruning — 2024
  • AI Post Transformers

    FlashOptim: Optimizers for Memory Efficient Training

    02/03/2026
    This episode explores the paper "FlashOptim: Optimizers for Memory Efficient Training" by researchers from Databricks AI Research. The discussion centers on techniques that significantly reduce memory usage in neural network training without sacrificing model quality. Key methods such as Optimizer State Quantization, Float Splitting Techniques, and Companded Optimizer State Quantization are unpacked, highlighting their potential to lower memory requirements from 175 GiB to 113 GiB for large models like Llama-3.1-8B. Listeners interested in AI research will find this episode compelling as it addresses the democratization of AI by making advanced models more accessible to those with limited hardware resources.
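
    Optimizer state quantization of the kind the episode describes can be sketched with blockwise absmax int8 codes. This illustrates the general 8-bit-state idea (in the spirit of Dettmers et al.), not FlashOptim's companded scheme; block size and data are illustrative:

```python
import numpy as np

def quantize_state(x, block=64):
    """Blockwise absmax quantization of an optimizer state to int8.

    Each block stores one float scale plus int8 codes, cutting roughly
    4 bytes per value down to about 1 (plus a small scale overhead).
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) + 1e-12
    codes = np.round(x / scale * 127).astype(np.int8)
    return codes, scale

def dequantize_state(codes, scale):
    return (codes.astype(np.float32) / 127.0 * scale).ravel()

rng = np.random.default_rng(0)
m = rng.standard_normal(4096).astype(np.float32)   # e.g. an Adam moment
codes, scale = quantize_state(m)
m_hat = dequantize_state(codes, scale)

# Blockwise scaling bounds the per-element error by ~0.4% of the block
# maximum, which is why 8-bit states track full-precision training well.
err = np.abs(m - m_hat).max()
assert err < np.abs(m).max() / 100
```

    Companding, as the episode notes, goes further by warping values through a nonlinearity before coding so that the many near-zero state entries get finer resolution.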

    Sources:
    1. https://arxiv.org/pdf/2602.23349
    2. Mixed Precision Training — Paulius Micikevicius et al., 2018
    https://scholar.google.com/scholar?q=Mixed+Precision+Training
    3. 8-bit Optimizer States for Memory-Efficient Training — Tim Dettmers et al., 2022
    https://scholar.google.com/scholar?q=8-bit+Optimizer+States+for+Memory-Efficient+Training
    4. Parameter-Efficient Transfer Learning for NLP — Neil Houlsby et al., 2019
    https://scholar.google.com/scholar?q=Parameter-Efficient+Transfer+Learning+for+NLP
    5. Q-adam-mini: Memory-efficient 8-bit quantized optimizer for large language model training — approximate, 2023
    https://scholar.google.com/scholar?q=Q-adam-mini:+Memory-efficient+8-bit+quantized+optimizer+for+large+language+model+training
    6. Memory efficient optimizers with 4-bit states — approximate, 2023
    https://scholar.google.com/scholar?q=Memory+efficient+optimizers+with+4-bit+states
    7. ECO: Quantized Training without Full-Precision Master Weights — approximate, 2023
    https://scholar.google.com/scholar?q=ECO:+Quantized+Training+without+Full-Precision+Master+Weights
    8. AI Post Transformers: FlashOptim: Optimizers for Memory Efficient Training — Hal Turing & Dr. Ada Shannon, 2026
    https://podcast.do-not-panic.com/episodes/2026-03-02_urls_1.mp3


About AI Post Transformers

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.