We review the "Storage-Next" paper, published in November 2025, which argues that a fundamental shift in hardware architecture is required to elevate NAND flash from a passive storage tier to an active memory tier capable of "seconds-scale" caching. The authors contend that standard SSDs impose a "channel-side ceiling" on IOPS because they are optimized for 4KB blocks, wasting enormous bandwidth when AI applications demand fine-grained access to small items such as 128-byte embedding vectors. Their proposed fix is specialized "Storage-Next" drives whose IOPS scale down to small block sizes (e.g., 50M IOPS at 512B); they argue this hardware is necessary to simplify software stacks and enable high-throughput random access without the read amplification penalties inherent in current drives.
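To make the bandwidth-waste argument concrete, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above (4KB blocks, 128-byte vectors, and the 50M IOPS at 512B target); the script itself is ours, not the authors'.

    # Read amplification when fetching 128-byte items through 4KB blocks.
    BLOCK_BYTES = 4096
    ITEM_BYTES = 128
    amplification = BLOCK_BYTES / ITEM_BYTES
    print(f"{amplification:.0f}x amplification")  # 32x: ~97% of each fetched block is wasted

    # Payload bandwidth implied by the proposed Storage-Next target.
    iops, block = 50_000_000, 512
    print(f"{iops * block / 1e9:.1f} GB/s")  # 25.6 GB/s of useful data at 512B granularity

In other words, a conventional drive must move 32 bytes for every byte an embedding lookup actually needs, which is the waste the proposed hardware is meant to eliminate.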
However, the episode explores how concurrent research largely rebuts the strict need for this new hardware by demonstrating that intelligent software and driver modifications can mask these inefficiencies on standard drives. Systems like PageANN and FusionANNS show that aggregating topologically related vectors into 4KB pages lets existing SSDs handle billion-scale search efficiently, while Strata uses GPU-assisted I/O to bundle fragmented LLM token pages. Furthermore, for workloads such as DLRM inference that genuinely require fine-grained access, Meta researchers implemented a "software-defined memory" solution using the NVMe SGL Bit Bucket feature to strip unwanted data at the driver level, cutting PCIe bandwidth consumption by 75% on standard hardware. These results suggest that, outside the narrow niche of random hash-based lookups where no locality exists to exploit, software optimization remains a viable alternative to a physical overhaul of storage media.
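As a rough illustration of the two software strategies above (again our own sketch, not the authors' code; the per-page hit count and the "wanted bytes" figure are hypothetical, the latter chosen to reproduce the reported 75% saving):

    PAGE_BYTES = 4096
    VEC_BYTES = 128

    # (1) Page-aligned packing (PageANN / FusionANNS style): co-locate
    # topologically related vectors so one 4KB read serves many of them.
    per_page = PAGE_BYTES // VEC_BYTES  # 32 vectors per physical read
    useful = 8                          # hypothetical: vectors actually on the search path
    print(f"{per_page} vectors per {PAGE_BYTES}B page")
    print(f"effective amplification: {PAGE_BYTES / (useful * VEC_BYTES):.0f}x")  # 32x -> 4x

    # (2) NVMe SGL Bit Bucket (Meta's DLRM approach): the driver marks
    # unwanted byte ranges of a block so the device never transfers them
    # over PCIe.
    wanted = 1024  # hypothetical bytes needed out of one 4KB block
    print(f"PCIe bandwidth saved: {1 - wanted / PAGE_BYTES:.0%}")  # 75%

The common thread is that both approaches attack read amplification without touching the flash media: one by engineering locality into the data layout, the other by filtering the transfer itself.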
We've previously covered some of the papers here individually:
Meta's software-defined memory solution for massive DLRM inference (Linux NVMe SGL Bit Bucket):
https://open.spotify.com/episode/7fPOvegGpWWYqChIVYGfwx?si=uxNPv4hZQvumhwwPGowwTA&context=spotify%3Ashow%3A48ygM4upvm6noxCbmhlz8i
PageANN:
https://open.spotify.com/episode/5rrXWA4KJxGHp4xckirlZ2?si=_Qhzy_g1SZyPrBFmHvlY5g
FusionANNS:
https://open.spotify.com/episode/6Ys51jB54GilRlYsvz4yXR?si=yI8KwDE1QpS6BbnFsinl6g
Strata:
https://open.spotify.com/episode/18kCgDcrOsQ5nw58V2HGBB?si=4Rr4ZfqIR-SzaVxyS8hOWA
Sources:
November 2025, From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies, ScaleFlux, NVIDIA, and Stanford University, https://arxiv.org/pdf/2511.03944
September 2025, Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph, University of Texas at Dallas and Rutgers University, https://arxiv.org/pdf/2509.25487
August 2025, Strata: Hierarchical Context Caching for Long Context Language Model Serving, Stanford University and NVIDIA, https://arxiv.org/pdf/2508.18572
September 2024, FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search, Huazhong University of Science and Technology and Huawei Technologies, https://arxiv.org/pdf/2409.16576
October 2021, Supporting Massive DLRM Inference Through Software Defined Memory, Facebook, https://arxiv.org/pdf/2110.11489