
AI: post transformers

mcgrof

361 episodes

  • AI: post transformers

    MEMRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

    23/1/2026 | 14 mins.
    The January 6, 2026 paper introduces **MEMRL**, a framework designed to help AI agents master new skills by mimicking human **episodic memory**, without updating the model's underlying weights. The approach addresses the **stability-plasticity dilemma** by decoupling a stable, frozen **Large Language Model** (the reasoning core) from a dynamic, evolving memory bank. Unlike standard retrieval methods that rely solely on semantic similarity, MEMRL uses **non-parametric reinforcement learning** to evaluate the actual utility of past experiences. It employs a **two-phase retrieval mechanism** that first identifies relevant candidates and then selects the most effective ones based on learned **Q-values**. These values are continuously refined through **environmental feedback**, allowing the agent to distinguish high-value strategies from distracting noise. Experiments across various benchmarks show that MEMRL significantly improves performance and supports stable **runtime learning** while avoiding the computational cost and forgetting associated with fine-tuning. A minimal sketch of the two-phase retrieval idea follows the source link below.

    Source:
    https://arxiv.org/pdf/2601.03192
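
    The following is a minimal, self-contained sketch of the two-phase retrieval idea summarized above, not the paper's implementation: the memory-bank structure, learning rate, and k values are illustrative assumptions. Phase one shortlists entries by cosine similarity; phase two ranks the shortlist by a learned Q-value that is nudged toward the environment's reward after each episode.

    ```python
    import numpy as np

    class EpisodicMemory:
        """Toy memory bank: each entry stores an embedding, an experience payload,
        and a Q-value estimating how useful that experience has proven to be."""

        def __init__(self, dim, lr=0.1):
            self.embeddings = np.zeros((0, dim))
            self.experiences = []              # e.g. past trajectories or notes
            self.q_values = np.zeros(0)        # learned utility per experience
            self.lr = lr

        def add(self, embedding, experience, q_init=0.0):
            self.embeddings = np.vstack([self.embeddings, embedding])
            self.experiences.append(experience)
            self.q_values = np.append(self.q_values, q_init)

        def retrieve(self, query, k_candidates=32, k_final=4):
            # Phase 1: shortlist candidates by semantic (cosine) similarity.
            sims = self.embeddings @ query / (
                np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query) + 1e-8)
            cand = np.argsort(-sims)[:k_candidates]
            # Phase 2: keep the candidates with the highest learned Q-values.
            chosen = cand[np.argsort(-self.q_values[cand])[:k_final]]
            return chosen, [self.experiences[i] for i in chosen]

        def update(self, indices, reward):
            # Non-parametric RL step: move each retrieved entry's Q toward the
            # observed reward, so useful experiences outrank merely similar ones.
            for i in indices:
                self.q_values[i] += self.lr * (reward - self.q_values[i])
    ```

    After each episode, `update` is called with the environment's reward, which is how learned utility, rather than similarity alone, comes to drive retrieval.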
  • AI: post transformers

    Google: R&D inference value on HBF + PNM + low latency interconnect

    23/1/2026 | 17 mins.
    To address the hardware bottlenecks of LLM inference, Google researchers Ma and Patterson propose a few focus areas of research in their paper "Challenges and Research Directions for Large Language Model Inference Hardware", published on January 8, 2026: High Bandwidth Flash (HBF), Processing-Near-Memory (PNM), and low-latency interconnects.

    **HBF** addresses the "Memory Wall" by stacking flash dies to achieve **10X the capacity** of HBM, making it ideal for storing model weights and long contexts despite its write-endurance limitations. **PNM** is advocated over Processing-In-Memory (PIM) for datacenters because placing logic on separate but nearby dies (e.g., via 3D stacking) allows for larger software shards (avoiding fine-grained partitioning), uses standard high-performance logic processes, and offers better thermal management than integrating logic directly into memory dies. Finally, arguing that **latency trumps bandwidth** for the frequent small messages in inference, the authors suggest optimizing interconnects through high-connectivity topologies (like dragonfly or trees) and **processing-in-network** to accelerate communication collectives; a small numeric sketch of this latency-versus-bandwidth point follows the source lines below.

    Modern large language model (LLM) inference faces a critical memory wall, where hardware compute power outpaces the growth of data transfer speeds. Research suggests addressing these bottlenecks through **3D memory-logic stacking**, near-memory processing, and specialized **interconnect strategies** to reduce latency. Optimization techniques for **Mixture-of-Experts (MoE)** architectures involve balancing **tensor and expert parallelism** across devices to ensure efficient data handling. While high-bandwidth memory remains expensive, alternative storage solutions like **flash memory** are being explored to expand capacity for data centers. Historical data further illustrates the evolving **cost and density** of memory, underscoring the long-term economic shifts in hardware development. Together, these sources outline a roadmap for evolving **AI hardware** to meet the rigorous demands of real-time model decoding.

    Source:
    January 8, 2026, Challenges and Research Directions for Large Language Model Inference Hardware, Google https://arxiv.org/pdf/2601.05047
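
    To make the "latency trumps bandwidth" argument concrete, here is a back-of-the-envelope alpha-beta cost model with made-up numbers (the latencies and bandwidth below are illustrative, not figures from the paper): for the small messages typical of inference collectives, the fixed per-message latency dominates total transfer time.

    ```python
    def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
        """Alpha-beta cost model: fixed per-message latency plus serialization time.
        bandwidth_gbps is in GB/s; 1 GB/s moves 1e3 bytes per microsecond."""
        return latency_us + size_bytes / (bandwidth_gbps * 1e3)

    # Two hypothetical links with identical bandwidth but different latency.
    for size in (256, 4 * 1024, 1024 * 1024):
        fast = transfer_time_us(size, latency_us=2, bandwidth_gbps=400)
        slow = transfer_time_us(size, latency_us=10, bandwidth_gbps=400)
        print(f"{size:>8} B: low-latency link {fast:7.2f} us, high-latency link {slow:7.2f} us")
    ```

    For a 256-byte message the serialization term is negligible, so the 5x latency gap passes straight through to the total time; even at 1 MiB the serialization time is only a few microseconds, which is why the summary above emphasizes latency over raw bandwidth.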
  • AI: post transformers

    Meta's solution to massive DLRM inference through software defined memory

    21/1/2026 | 17 mins.
    In October 2021, Meta (then Facebook), in collaboration with George Mason University and the University of Illinois Chicago, published the paper "Supporting Massive DLRM Inference Through Software Defined Memory".

    Meta addressed the infrastructure challenge of serving massive Deep Learning Recommendation Models by extending the memory hierarchy to include NVMe Storage Class Memory. Because standard storage devices read large data blocks that far exceed the small size of embedding rows, the company faced significant read amplification and wasted bandwidth. To resolve this, the engineering team implemented a solution using the NVMe SGL Bit Bucket feature within a software-defined memory stack. This modification to the Linux kernel and drivers allows applications to issue direct I/O requests for specific chunks of data, down to four bytes, rather than transferring full logical blocks. A rough sketch of the SGL descriptor layout follows the sources below.

    The implementation of bit buckets enables the system to transfer only the requested portion of a data block, which significantly optimizes link bandwidth and reduces memory utilization. This granular approach saves approximately 75 percent of bus bandwidth and lowers individual read latency by 3 to 5 percent by removing unnecessary data transfers and memory copies. Applied to production environments, this architecture allows data centers to replace expensive DRAM with efficient flash storage for specific model components. These optimizations yield up to 20 percent power savings on simpler hardware and a projected 29 percent increase in performance per watt for multi-tenant serving scenarios.

    Sources:
    https://arxiv.org/pdf/2110.11489
    https://lore.kernel.org/linux-nvme/[email protected]/
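
    As a rough illustration of the bit-bucket mechanism described above (not code from the paper or the kernel patches), the sketch below packs the three 16-byte NVMe SGL descriptors for a read of one 4 KiB logical block in which only a 64-byte embedding row at offset 512 is kept; the bytes before and after it are routed to bit buckets and discarded by the controller. The offsets, sizes, and DMA address are hypothetical, and in a real command these descriptors would sit in an SGL segment referenced from the submission queue entry.

    ```python
    import struct

    SGL_DATA_BLOCK = 0x0   # descriptor types per the NVMe base specification
    SGL_BIT_BUCKET = 0x1

    def sgl_desc(addr, length, desc_type):
        """Pack one 16-byte SGL descriptor: 8-byte address, 4-byte length,
        3 reserved bytes, 1 identifier byte (descriptor type in the high nibble)."""
        return struct.pack("<QI3xB", addr, length, desc_type << 4)

    HOST_BUF = 0x7F00_0000_0000   # hypothetical DMA address of a 64-byte host buffer

    sgl = (
        sgl_desc(0, 512, SGL_BIT_BUCKET)                 # discard the first 512 bytes
        + sgl_desc(HOST_BUF, 64, SGL_DATA_BLOCK)         # keep the 64-byte embedding row
        + sgl_desc(0, 4096 - 512 - 64, SGL_BIT_BUCKET)   # discard the remaining 3520 bytes
    )
    assert len(sgl) == 3 * 16   # only 64 of the 4096 bytes cross the PCIe link
    ```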
  • AI: post transformers

    Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    21/1/2026 | 14 mins.
    We review the "Storage-Next" paper, published in November 2025, which argues that a fundamental hardware architectural shift is required to elevate NAND flash from a passive storage tier to an active memory tier capable of "seconds-scale" caching. The authors contend that standard SSDs impose a "channel-side ceiling" on IOPS because they are optimized for 4KB blocks, creating massive bandwidth waste when AI applications demand fine-grained access to small items, such as 128-byte embedding vectors; the read-amplification arithmetic behind this claim is sketched after the sources below. To solve this, they propose specialized "Storage-Next" drives capable of scalable IOPS for small block sizes (e.g., 50M IOPS at 512B), arguing this hardware is necessary to simplify software stacks and enable high-throughput random access without the read amplification penalties inherent in current technology.

    However, the episode explores how concurrent research largely rebuts the strict need for this new hardware by demonstrating that intelligent software and driver modifications can mask these inefficiencies on standard drives. Systems like PageANN and FusionANNS prove that aggregating topologically related vectors into 4KB pages allows existing SSDs to handle billion-scale search efficiently, while Strata utilizes GPU-assisted I/O to bundle fragmented LLM token pages. Furthermore, for workloads specifically requiring fine-grained access like DLRM, Meta researchers successfully implemented a "software-defined memory" solution using the NVMe SGL Bit Bucket feature to strip unwanted data at the driver level, reducing PCIe bandwidth consumption by 75% on standard hardware. These innovations suggest that aside from the specific niche of random hash-based lookups where locality is mathematically impossible, software optimization remains a viable alternative to a physical overhaul of storage media.

    We've previously covered some of the papers here individually:

    Meta's massive DLRM Linux NVMe SGL bit bucket solution:
    https://open.spotify.com/episode/7fPOvegGpWWYqChIVYGfwx?si=uxNPv4hZQvumhwwPGowwTA&context=spotify%3Ashow%3A48ygM4upvm6noxCbmhlz8i

    PageANN:
    https://open.spotify.com/episode/5rrXWA4KJxGHp4xckirlZ2?si=_Qhzy_g1SZyPrBFmHvlY5g

    FusionANNS:
    https://open.spotify.com/episode/6Ys51jB54GilRlYsvz4yXR?si=yI8KwDE1QpS6BbnFsinl6g

    Strata:
    https://open.spotify.com/episode/18kCgDcrOsQ5nw58V2HGBB?si=4Rr4ZfqIR-SzaVxyS8hOWA

    Sources:

    November 2025, From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies, ScaleFlux and NVIDIA and Stanford University https://arxiv.org/pdf/2511.03944

    September 2025, Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph, University of Texas at Dallas and Rutgers University https://arxiv.org/pdf/2509.25487

    August 2025, Strata: Hierarchical Context Caching for Long Context Language Model Serving, Stanford University and NVIDIA
    https://arxiv.org/pdf/2508.18572

    September 2024, FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search, Huazhong University of Science and Technology and Huawei Technologies
    https://arxiv.org/pdf/2409.16576

    October 2021, Supporting Massive DLRM Inference Through Software Defined Memory, Facebook https://arxiv.org/pdf/2110.11489
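
    The read-amplification arithmetic behind the "channel-side ceiling" claim above is easy to reproduce (my own arithmetic using the item and block sizes quoted in the summary, not numbers taken from the papers' experiments):

    ```python
    ITEM = 128   # bytes per embedding vector, as in the example above

    for block in (4096, 512):
        amplification = block / ITEM      # bytes fetched per useful byte
        wasted = 1 - ITEM / block         # fraction of the transfer thrown away
        print(f"{block:>5} B blocks: {amplification:4.0f}x read amplification, "
              f"{wasted:.0%} of the fetched bandwidth wasted per 128 B lookup")
    ```

    The same numbers explain why the responses diverge: Storage-Next shrinks the drive's native block size, PageANN and FusionANNS pack topologically related vectors so that most of each 4 KiB page is actually useful, and Meta's bit-bucket approach discards the unwanted bytes before they cross the bus.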
  • AI: post transformers

    LeCun's AMI Energy-Based Models and the Path to Autonomous Intelligence

    21/1/2026 | 13 mins.
    These sources collectively explore the current landscape and future trajectory of artificial intelligence, focusing on the transition toward human-level reasoning. Renowned scientist Yann LeCun argues that current Large Language Models lack a fundamental understanding of the physical world and proposes a shift toward **objective-driven AI** that uses **world models** for better planning and common sense. This technological shift is supported by recent industry developments, such as the launch of **AMI Labs**, a high-valuation startup dedicated to these advanced architectures. The materials also emphasize the necessity of **open-source platforms** to ensure that the future of digital assistance remains transparent and culturally diverse. While acknowledging technical limitations, the documents maintain an optimistic view of **super-human intelligence** as a tool that will eventually amplify human potential under safe **guardrail objectives**.

    Sources:

    https://arxiv.org/pdf/2306.02572
    https://cmsa.fas.harvard.edu/media/lecun-20240328-harvard_reduced.pdf
    https://www.lesswrong.com/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1
    https://www.linkedin.com/mwlite/feed/posts/warrenbpowell_my-response-to-dimitri-bertsekass-thoughtful-activity-7394449098789261312-nXH3
    https://techcrunch.com/2025/12/19/yann-lecun-confirms-his-new-world-model-startup-reportedly-seeks-5b-valuation/

About AI: post transformers

The transformer architecture revolutionized the world of neural networks and was a springboard for what we know today as modern artificial intelligence. This podcast reviews modern state-of-the-art research papers, starting from the transformer and moving onward.