
AI Post Transformers

mcgrof
Latest episode

458 episodes

  • AI Post Transformers

    FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling

    06/03/2026
    Hal Turing and Dr. Ada Shannon dig into FlashAttention-4, a March 2026 paper from a cross-institutional team including Tri Dao, Jay Shah, and colleagues at Princeton, Meta, NVIDIA, Colfax Research, Georgia Tech, and Together AI. The paper targets a precise hardware mismatch on NVIDIA's Blackwell B200: tensor core throughput doubles compared to the H100, but shared memory bandwidth and dedicated exponential function units do not scale at the same rate. Rather than waiting for hardware fixes, the authors co-design the attention algorithm with the asymmetric architecture itself — making FlashAttention-4 the first attention kernel built specifically for Blackwell's scaling profile.

    To frame why this matters, Shannon traces the full lineage of FlashAttention research. The original 2022 NeurIPS paper by Dao and colleagues reframed attention as an IO problem: instead of materializing the quadratic N×N score matrix in slow off-chip High Bandwidth Memory, tiling and online softmax keep computation inside the fast on-chip shared memory of each streaming multiprocessor. FlashAttention-2 doubled throughput through sequence-dimension parallelism. FlashAttention-3 pushed H100 utilization to roughly 75% by exploiting Hopper-specific warp specialization and asynchronous data movement. Each generation addressed a qualitatively different bottleneck — and Blackwell introduced a new one that none of those solutions anticipated.
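
    The online-softmax trick Shannon describes can be sketched in a few lines of NumPy. This is a minimal single-query version (tile size, shapes, and data are illustrative, not the kernel's real parameters): each K/V tile is folded into a running max, normalizer, and accumulator, so the full score row is never materialized, yet the result matches ordinary softmax attention exactly.

```python
import numpy as np

def attention_tiled(q, K, V, block=64):
    """Single-query attention over K/V tiles with online softmax.

    Keeps a running max `m` and normalizer `l` so each new tile is
    folded in with one rescale of the previous state.
    """
    d = q.shape[0]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax normalizer
    acc = np.zeros(d)    # running (unnormalized) weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # one tile of scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                     # rescale old state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((256, 8))
V = rng.standard_normal((256, 8))
# Reference: plain softmax attention over the full score row.
s = K @ q / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(attention_tiled(q, K, V), ref)
```

    The rescale-by-`exp(m - m_new)` step is the whole trick: it lets each tile's partial softmax be merged without ever revisiting earlier tiles.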

    The hosts spell out the stakes for practitioners who work in ML without ever writing GPU kernels. Attention sits at the core of every Transformer-based system — large language models, vision transformers, multimodal architectures — and long-context workloads at 32K to 128K tokens make the quadratic memory cost and HBM round-trips increasingly punishing. Shannon introduces the roofline model as the analytic lens the paper uses to characterize where Blackwell kernels actually bottleneck, setting up how FlashAttention-4's algorithmic co-design navigates the compute and memory bandwidth ceilings that previous generations of the kernel never had to contend with.
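
    The roofline model itself reduces to one line of arithmetic: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch with made-up numbers (not Blackwell or H100 specs):

```python
def roofline_tflops(flops, bytes_moved, peak_tflops, mem_bw_tbs):
    """Attainable throughput (TFLOP/s) under the roofline model.

    Arithmetic intensity I = FLOPs / bytes moved; the kernel is capped
    by min(peak compute, I * memory bandwidth).
    """
    intensity = flops / bytes_moved                   # FLOP per byte
    return min(peak_tflops, intensity * mem_bw_tbs)

# Memory-bound: at 1 FLOP/byte, a 2000 TFLOP/s compute ceiling is
# unreachable when bandwidth is 8 TB/s — the kernel tops out at 8.
assert roofline_tflops(1e9, 1e9, 2000, 8) == 8
# Compute-bound: at 1000 FLOP/byte the compute ceiling binds instead.
assert roofline_tflops(1e12, 1e9, 2000, 8) == 2000
```

    The Blackwell mismatch the episode describes is exactly a roofline story: doubling `peak_tflops` without doubling `mem_bw_tbs` drags more kernels into the memory-bound regime unless the algorithm raises its arithmetic intensity.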

    Sources:
    1. https://arxiv.org/pdf/2603.05451v1
    2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
    https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
    3. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
    https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
    4. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low Precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024
    https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low+Precision
    5. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations — Philippe Tillet, H. T. Kung, David Cox, 2019
    https://scholar.google.com/scholar?q=Triton:+An+Intermediate+Language+and+Compiler+for+Tiled+Neural+Network+Computations
    6. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures — Samuel Williams, Andrew Waterman, David Patterson, 2009
    https://scholar.google.com/scholar?q=Roofline:+An+Insightful+Visual+Performance+Model+for+Floating-Point+Programs+and+Multicore+Architectures
    7. Efficiently Scaling Transformer Inference — Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Sharan Narang, Jeff Dean, 2023
    https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference
    8. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi et al. (Google), 2017
    https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
    9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
    https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
    10. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, 2023
    https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    11. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
    https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
    12. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang et al., 2024
    https://scholar.google.com/scholar?q=SageAttention2:+Efficient+Attention+with+Thorough+Outlier+Smoothing+and+Per-thread+INT4+Quantization
    13. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
    https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
    14. FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention — PyTorch Team, 2024
    https://scholar.google.com/scholar?q=FlexAttention:+The+Flexibility+of+PyTorch+with+the+Performance+of+FlashAttention
    15. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI, 2024
    https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
    16. Softmax output approximation for activation memory-efficient training of attention-based networks — approximate, likely 2023–2024
    https://scholar.google.com/scholar?q=Softmax+output+approximation+for+activation+memory-efficient+training+of+attention-based+networks
    17. FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=FlashAttention-T:+Towards+Fully+Tensorized+Attention+by+Exploiting+Tensor-Vector+Parallelism
    18. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
    19. Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks — approximate, likely 2024–2025
    https://scholar.google.com/scholar?q=Based+on+Tensor+Core+Sparse+Kernels+Accelerating+Deep+Neural+Networks
    20. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/FlashAttention-2-Faster-Attention-with-Better-Parallelism-e36kdm0
    21. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n
    22. AI Post Transformers: Jet-RL: Stable On-Policy Reinforcement Learning with Unified FP8 Flow — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Jet-RL-Stable-On-Policy-Reinforcement-Learning-with-Unified-FP8-Flow-e3f7det
    23. AI Post Transformers: Mojo: Performance-Portable HPC Kernels on GPUs — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Mojo-Performance-Portable-HPC-Kernels-on-GPUs-e39n4sk

    Interactive Visualization: FlashAttention-4: Algorithm & Kernel Co-Design
    https://do-not-panic.com/viz/2026/03/06/flashattention4.html
  • AI Post Transformers

    We're Open Source! New Home, Visualizations, and How to Shape Our Queue

    06/03/2026
    Special announcement: AI Post Transformers is now open source under the MIT license. New home at podcast.do-not-panic.com, interactive paper visualizations, community paper submissions, deep dive into the editorial queue algorithm, and internationalization roadmap.
  • AI Post Transformers

    Parallel Token Prediction: From ProphetNet to Dependent Multi-Token Generation

    04/03/2026
    This episode examines the fundamental latency bottleneck in autoregressive language models: sequential token generation requires one full transformer forward pass per output token, leaving GPU parallelism idle during single-user inference. The episode centers on Draxler et al. (5 co-authors, UC Irvine and Chan-Zuckerberg Initiative), whose paper on Parallel Token Prediction landed Christmas Eve 2025 and argues that the independence assumption baked into all prior multi-token schemes is not an acceptable approximation but the actual limiting factor. The paper asks whether multiple tokens can be jointly predicted in a single pass — modeling dependencies among them — without sacrificing the expressiveness that makes autoregressive generation reliable. First author Felix Draxler previously led the Free-form Flows work in 2024, and the normalizing flow machinery he developed there is central to how the paper solves the dependency problem.

    The episode traces the historical arc carefully. Qi et al. (7 co-authors, Microsoft Research Asia) published ProphetNet in January 2020 — predating GPT-3 by four months — in the encoder-decoder world of seq2seq tasks. Their critique was precise: standard one-step-ahead teacher forcing gives models no incentive to plan ahead, letting local bigram correlations dominate at the expense of long-range coherence. Their answer was n-gram prediction, training the decoder to simultaneously predict tokens at t+1, t+2, and t+3 using parallel heads that did not condition on each other. The independence assumption was already present. When Brown et al. (OpenAI, May 2020) demonstrated that scale and in-context conditioning make the encoder optional, the field shifted to decoder-only architectures — but ProphetNet's core insight migrated cleanly. Gloeckle et al. (FAIR, Meta, April 2024) rebuilt multi-token prediction for decoder-only models using independent output heads, DeepSeek adopted the same approach, and NVIDIA incorporated it into Nemotron 3. The independence assumption migrated with the insight, and Draxler et al. argue that limitation has been compounding ever since.
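
    The independence assumption the hosts keep returning to is easy to see numerically: a joint distribution over two perfectly anti-correlated tokens cannot be represented by any product of per-head marginals. A toy NumPy illustration (the distribution is invented for the example):

```python
import numpy as np

# Toy 2-token world: the joint puts all mass on the pairs "ab" and "ba"
# -- perfectly anti-correlated tokens. Independent prediction heads can
# only model the product of the marginals, which leaks probability onto
# the impossible pairs "aa" and "bb".
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])          # true P(t1, t2)
p1 = joint.sum(axis=1)                  # marginal of token 1 -> [0.5, 0.5]
p2 = joint.sum(axis=0)                  # marginal of token 2 -> [0.5, 0.5]
indep = np.outer(p1, p2)                # best any independent-head model can do
# indep is uniform 0.25 everywhere: half its mass sits on pairs the
# true distribution assigns zero probability.
assert np.allclose(indep, 0.25)
```

    This is the gap Draxler et al. close by modeling the conditional structure across the prediction horizon instead of factorizing it away.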

    The episode situates Parallel Token Prediction against the two main camps attacking inference latency. Speculative decoding — covered across twenty prior episodes — keeps the model's output distribution unchanged by using a small draft model whose proposals a large verifier checks in one batched pass; the latency gain comes entirely from accepted tokens per step. Multi-token prediction is the other camp: train the model itself to emit several tokens at once, collapsing multiple forward passes into one, at the cost of changed model behavior during training. Draxler et al.'s contribution is showing that jointly predicting dependent tokens, using normalizing flows to capture the conditional structure across the prediction horizon, preserves the modeling power that independent-head approaches discard. The episode works through both the architectural mechanics and the theoretical argument, making the case that Parallel Token Prediction resolves the tension that has run from ProphetNet through every independent-head scheme in between.
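
    The draft-and-verify loop of the speculative-decoding camp can be sketched as follows. This is a simplified acceptance rule in the style of Leviathan et al. (tokens and probabilities are invented, and the correction-resampling step taken on rejection is omitted):

```python
import numpy as np

def speculative_accept(draft_tokens, p_draft, p_target, rng):
    """Verification step of speculative decoding (simplified).

    `p_draft[i]` / `p_target[i]` are the draft and verifier model
    probabilities of the i-th proposed token. Each token is accepted
    with probability min(1, p_target / p_draft); the first rejection
    truncates the accepted run. Real implementations then resample a
    correction token from the adjusted target distribution.
    """
    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break
    return accepted

rng = np.random.default_rng(0)
toks = [5, 9, 2]
# When the verifier agrees at least as strongly as the draft on every
# token, all three are accepted in one batched pass.
assert speculative_accept(toks, [0.2, 0.3, 0.1], [0.4, 0.3, 0.2], rng) == toks
```

    The latency win is exactly the expected length of `accepted` per verifier pass, which is why the output distribution stays unchanged while multi-token prediction, by contrast, changes what the model is trained to emit.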

    Sources:
    1. https://arxiv.org/pdf/2512.21323
    2. https://arxiv.org/pdf/2404.19737v1
    3. https://arxiv.org/pdf/2412.19437
    4. https://arxiv.org/pdf/2512.20856
    5. Better & Faster Large Language Models via Multi-Token Prediction — Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve (Meta), 2024
    https://scholar.google.com/scholar?q=Better+&+Faster+Large+Language+Models+via+Multi-Token+Prediction
    6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengmian Hu, et al., 2024
    https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
    7. Lookahead Decoding: Break the Sequential Dependency of LLM Inference — Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, 2024
    https://scholar.google.com/scholar?q=Lookahead+Decoding:+Break+the+Sequential+Dependency+of+LLM+Inference
    8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
    https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
    9. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias (Google), 2023
    https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
    10. Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper (DeepMind), 2023
    https://scholar.google.com/scholar?q=Accelerating+Large+Language+Model+Decoding+with+Speculative+Sampling
    11. Spec-Bench: A Benchmark for Evaluating Speculative Decoding Approaches — Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 2024
    https://scholar.google.com/scholar?q=Spec-Bench:+A+Benchmark+for+Evaluating+Speculative+Decoding+Approaches
    12. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
    https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees
    13. Attention Is All You Need — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain), 2017
    https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
    14. Language Models are Few-Shot Learners (GPT-3) — Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. (OpenAI), 2020
    https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners+(GPT-3)
    15. Scaling Laws for Neural Language Models — Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (OpenAI), 2020
    https://scholar.google.com/scholar?q=Scaling+Laws+for+Neural+Language+Models
    16. Non-Autoregressive Neural Machine Translation — Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher (Salesforce Research), 2018
    https://scholar.google.com/scholar?q=Non-Autoregressive+Neural+Machine+Translation
    17. Improving Variational Inference with Inverse Autoregressive Flow — Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling, 2016
    https://scholar.google.com/scholar?q=Improving+Variational+Inference+with+Inverse+Autoregressive+Flow
    18. Density Estimation Using Real-valued Non-Volume Preserving (Real NVP) Transformations — Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio (Google Brain), 2017
    https://scholar.google.com/scholar?q=Density+Estimation+Using+Real-valued+Non-Volume+Preserving+(Real+NVP)+Transformations
    19. Glow: Generative Flow with Invertible 1x1 Convolutions — Diederik P. Kingma, Prafulla Dhariwal (OpenAI), 2018
    https://scholar.google.com/scholar?q=Glow:+Generative+Flow+with+Invertible+1x1+Convolutions
    20. Free-form Flows: Make Any Architecture a Normalizing Flow — Felix Draxler, Peter Sorrenson, Lea Zimmermann, Armand Rousselot, Ullrich Köthe, 2024
    https://scholar.google.com/scholar?q=Free-form+Flows:+Make+Any+Architecture+a+Normalizing+Flow
    21. Closer look at efficient inference methods: A survey of speculative decoding — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Closer+look+at+efficient+inference+methods:+A+survey+of+speculative+decoding
    22. Adaptive Speculative Decoding for Large Language Models — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models
    23. LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=LK+Losses:+Direct+Acceptance+Rate+Optimization+for+Speculative+Decoding
    24. Higher Acceptance Rates for Speculative Decoding with Randomised Drafting — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Higher+Acceptance+Rates+for+Speculative+Decoding+with+Randomised+Drafting
    25. Conditional [MASK] Discrete Diffusion Language Model — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Conditional+[MASK]+Discrete+Diffusion+Language+Model
    26. Alternatives To Next Token Prediction In Text Generation: A Survey — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Alternatives+To+Next+Token+Prediction+In+Text+Generation:+A+Survey
    27. Future Token Prediction: Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction — multiple authors, approximate, 2024-2025
    https://scholar.google.com/scholar?q=Future+Token+Prediction:+Causal+Language+Modelling+with+Per-Token+Semantic+State+Vector+for+Multi-Token+Prediction
    28. AI Post Transformers: Fast Inference from Transformers via Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Fast-Inference-from-Transformers-via-Speculative-Decoding-e3foclv
    29. AI Post Transformers: Accelerating Large Language Model Decoding with Speculative Sampling — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Accelerating-Large-Language-Model-Decoding--with-Speculative-Sampling-e3flhv7
    30. AI Post Transformers: EAGLE: Evolution of Lossless Acceleration for LLM Inference — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/EAGLE-Evolution-of-Lossless-Acceleration-for-LLM-Inference-e3focr9
    31. AI Post Transformers: MEDUSA: Parallel Decoding Heads for Accelerated LLM Inference — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/MEDUSA-Parallel-Decoding-Heads-for-Accelerated-LLM-Inference-e3flqk7
    32. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Speculative-Streaming-Fast-LLM-Inference-without-Auxiliary-Models-e3fod2o
    33. AI Post Transformers: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Draft--Verify-Lossless-Large-Language-Model-Acceleration-via-Self-Speculative-Decoding-e3flplu
    34. AI Post Transformers: Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/Apples-Mirror-Speculative-Decoding-Parallel-LLM-Inference-via-Heterogeneous-Accelerators-e3fod0p
    35. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/QuantSpec-Hierarchical-KV-Cache-for-Self-Speculative-Decoding-e3foavf
    36. AI Post Transformers: The Free Transformer: VAE Extension for Decoders — Hal Turing & Dr. Ada Shannon
    https://podcasters.spotify.com/pod/show/12146088098/episodes/The-Free-Transformer-VAE-Extension-for-Decoders-e3a2p6v

    Interactive Visualization: PTP: Resolving the Independence Flaw
    https://www.do-not-panic.com/viz/2026/03/04/ptp-viz.html
  • AI Post Transformers

    Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

    04/03/2026
    NVIDIA's November 2025 paper "Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs" tackles a fundamental economics problem in LLM deployment: training separate model families like Llama 3.1's 8B, 70B, and 405B variants requires three independent training runs on trillions of tokens each — a cost that is prohibitive for smaller research teams and painful even for frontier labs. The paper proposes elastic nested weight-sharing, where a single parent model is trained so that multiple smaller, deployment-ready submodels are embedded inside it and can be extracted at inference time with zero additional training. The submodels literally share the parent's weight matrices — running a coherent slice of the full network rather than a copy — making one training investment yield multiple usable models at different resource tiers.

    The key technical contribution is applying elastic weight-sharing to a hybrid Mamba-Attention architecture for the first time. The parent model, Nemotron NanoV2 12B, uses Mamba-2 state space model layers for the bulk of sequence processing, with only four attention layers in the entire 12-billion parameter network. Pure transformer pruning methods were never designed for this structural reality, putting the work on genuinely new ground relative to predecessors. The hybrid design exploits the complementary strengths of each layer type: SSM layers process sequences at linear cost while attention layers handle the precise associative recall tasks where fixed-size SSM state vectors degrade. The result is approximately 3.7x reduction in KV cache memory compared to a comparable pure transformer, while preserving exact cross-context lookup. The intellectual lineage runs from Slimmable Networks (2019) through Matryoshka Representation Learning (NeurIPS 2022) to MatFormer and Flextron, with this paper extending the NVIDIA Flextron line beyond pure transformer elasticity.
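
    Nested weight-sharing — where a submodel is literally a slice of the parent's matrices — can be illustrated with a toy two-layer MLP (sizes are invented, and real elastic training also orders channels by importance before slicing; here the ordering is assumed given):

```python
import numpy as np

def extract_submodel(W_in, W_out, width):
    """Slice a nested submodel out of a parent's weight matrices.

    The first `width` hidden channels form a smaller, self-contained
    network, so extraction needs zero additional training.
    """
    return W_in[:width, :], W_out[:, :width]

rng = np.random.default_rng(0)
W_in = rng.standard_normal((64, 16))    # parent MLP: 16 -> 64 -> 16
W_out = rng.standard_normal((16, 64))

w1, w2 = extract_submodel(W_in, W_out, 32)   # child MLP: 16 -> 32 -> 16
x = rng.standard_normal(16)
y_child = w2 @ np.maximum(w1 @ x, 0)

# The child's computation is literally a slice of the parent's weights,
# not a copy: the same arithmetic on the parent's leading channels.
y_parent_slice = W_out[:, :32] @ np.maximum(W_in[:32] @ x, 0)
assert np.allclose(y_child, y_parent_slice)
```

    The training-time tension the episode flags is visible even here: the parent must keep its leading 32 channels useful on their own while also contributing to the full 64-channel computation.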

    The paper makes a credible case that elastification on hybrid architectures is feasible, but the constraints it operates under reveal where the field still has work to do. Making the parent tolerant of submodel extraction requires the full parent to be simultaneously optimized for its own performance and for the coherence of multiple nested subsets, a training objective that introduces real tension. The paper does not address how elastic submodels perform on the specific recall-intensive tasks where hybrid designs justify their complexity over pure SSMs, leaving open whether the four attention layers survive aggressive submodel extraction with their associative recall properties intact. The broader significance is that the approach decouples deployment flexibility from training cost in a way that could meaningfully lower the barrier to supporting heterogeneous hardware fleets, but the robustness of the extracted submodels under real-world distribution shift remains an open empirical question.

    Sources:
    1. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu & Dao, 2023
    2. Jamba: A Hybrid Transformer-Mamba Language Model — Lieber et al. (AI21 Labs), 2024
    3. Griffin: Mixing Gated Linear Recurrences with Local Attention — De et al. (Google DeepMind), 2024
    4. Zamba: A Compact 7B SSM Hybrid Model — Glorioso et al. (Zyphra), 2024
    5. The Lottery Ticket Hypothesis — Frankle & Carbin, 2019
    6. Sheared LLaMA: Structured Pruning — Xia et al. (Princeton), 2023
    7. Minitron: Compact Language Models via Pruning and KD — Sreenivas et al. (NVIDIA), 2024
    8. LLM-Pruner: Structural Pruning of LLMs — Ma et al., 2023
    9. Matryoshka Representation Learning — Kusupati et al., 2022
    10. ShortGPT: Layer Redundancy in LLMs — Men et al., 2024
    11. Scaling LLM Test-Time Compute — Snell et al., 2024
    12. Any-Width Networks (Slimmable Networks) — Yu et al., 2019
    13. Flextron: Many-in-One Flexible LLM — NVIDIA, 2024
    14. Mamba-shedder: Post-Transformer SSM Compression — 2024
    15. SparsSSM: One-Shot SSM Pruning — 2024
  • AI Post Transformers

    FlashOptim: Optimizers for Memory Efficient Training

    02/03/2026
    This episode explores the paper "FlashOptim: Optimizers for Memory Efficient Training" by researchers from Databricks AI Research. The discussion centers on techniques that significantly reduce memory usage in neural network training without sacrificing model quality. Key methods such as Optimizer State Quantization, Float Splitting Techniques, and Companded Optimizer State Quantization are unpacked, highlighting their potential to lower memory requirements from 175 GiB to 113 GiB for large models like Llama-3.1-8B. Listeners interested in AI research will find this episode compelling as it addresses the democratization of AI by making advanced models more accessible to those with limited hardware resources.
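
    Optimizer state quantization of the kind the episode describes can be sketched with blockwise absmax int8 codes. This illustrates the general 8-bit-state idea (in the spirit of Dettmers et al.), not FlashOptim's companded scheme; block size and data are illustrative:

```python
import numpy as np

def quantize_state(x, block=64):
    """Blockwise absmax quantization of an optimizer state to int8.

    Each block stores one float scale plus int8 codes, cutting roughly
    4 bytes per value down to about 1 (plus a small scale overhead).
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) + 1e-12
    codes = np.round(x / scale * 127).astype(np.int8)
    return codes, scale

def dequantize_state(codes, scale):
    return (codes.astype(np.float32) / 127.0 * scale).ravel()

rng = np.random.default_rng(0)
m = rng.standard_normal(4096).astype(np.float32)   # e.g. an Adam moment
codes, scale = quantize_state(m)
m_hat = dequantize_state(codes, scale)

# Blockwise scaling bounds the per-element error by ~0.4% of the block
# maximum, which is why 8-bit states track full-precision training well.
err = np.abs(m - m_hat).max()
assert err < np.abs(m).max() / 100
```

    Companding, as the episode notes, goes further by warping values through a nonlinearity before coding so that the many near-zero state entries get finer resolution.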

    Sources:
    1. https://arxiv.org/pdf/2602.23349
    2. Mixed Precision Training — Paulius Micikevicius et al., 2018
    https://scholar.google.com/scholar?q=Mixed+Precision+Training
    3. 8-bit Optimizer States for Memory-Efficient Training — Tim Dettmers et al., 2022
    https://scholar.google.com/scholar?q=8-bit+Optimizer+States+for+Memory-Efficient+Training
    4. Parameter-Efficient Transfer Learning for NLP — Neil Houlsby et al., 2019
    https://scholar.google.com/scholar?q=Parameter-Efficient+Transfer+Learning+for+NLP
    5. Q-adam-mini: Memory-efficient 8-bit quantized optimizer for large language model training — approximate, 2023
    https://scholar.google.com/scholar?q=Q-adam-mini:+Memory-efficient+8-bit+quantized+optimizer+for+large+language+model+training
    6. Memory efficient optimizers with 4-bit states — approximate, 2023
    https://scholar.google.com/scholar?q=Memory+efficient+optimizers+with+4-bit+states
    7. ECO: Quantized Training without Full-Precision Master Weights — approximate, 2023
    https://scholar.google.com/scholar?q=ECO:+Quantized+Training+without+Full-Precision+Master+Weights
    8. AI Post Transformers: FlashOptim: Optimizers for Memory Efficient Training — Hal Turing & Dr. Ada Shannon, 2026
    https://podcast.do-not-panic.com/episodes/2026-03-02_urls_1.mp3


About AI Post Transformers

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.