
Dynamic Hedging
24/12/2025 | 43 mins.
Nassim Nicholas Taleb’s Dynamic Hedging explores the practical complexities of managing derivative portfolios, emphasizing that real-world trading often defies theoretical models. The text argues that market uncertainty and human behavior render physics-based social science theories ineffective for predicting financial outcomes. Taleb highlights the critical roles of liquidity holes, transaction costs, and the "arcsine law" in shaping a trader's success or failure. Through technical discussion and "war stories," the book details the risks associated with exotic options, correlation-dependent products, and standard risk management tools like Value at Risk. Ultimately, the work serves as a guide for navigating the volatile discrepancies between formal financial formulas and the intuitive, often chaotic, nature of active market making.
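
The arcsine law reference is worth a concrete illustration. The minimal Python sketch below (not from the book; path counts, lengths, and seed are illustrative) simulates symmetric random walks and shows that the fraction of time a P&L path spends above zero piles up near 0 or 1 rather than around one half, which is why long winning or losing streaks are the norm rather than the exception.

```python
# Minimal sketch (not from Dynamic Hedging): Monte Carlo illustration of the
# arcsine law. The fraction of time a symmetric random walk spends positive
# concentrates near 0 and 1, not around 0.5.
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 10_000, 1_000        # illustrative sizes

steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
pnl = steps.cumsum(axis=1)              # cumulative "P&L" of each path
frac_positive = (pnl > 0).mean(axis=1)  # fraction of time each path spends in the black

hist, edges = np.histogram(frac_positive, bins=10, range=(0.0, 1.0))
for lo, hi, count in zip(edges[:-1], edges[1:], hist):
    print(f"{lo:.1f}-{hi:.1f}: {count}")  # U-shaped counts: the extreme bins dominate
```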

vLLM - LLM Serving Optimization: Paging, Routing, and Ranking
18/12/2025 | 40 mins.
This episode primarily focuses on optimizing the efficiency and fairness of serving Large Language Models (LLMs) under high-load conditions. One key source introduces PagedAttention and the vLLM serving system, which uses operating-system-inspired paging techniques to efficiently manage the dynamic Key-Value (KV) cache memory, drastically reducing memory fragmentation and increasing throughput by 2-4x compared to state-of-the-art baselines. Another source focuses on improving LLM serving by proposing a ranking-based scheduling algorithm that approximates shortest-job-first strategies, leveraging prediction to alleviate Head-Of-Line (HOL) blocking and demonstrating significantly lower latency and higher throughput than First-Come-First-Serve (FCFS) and other methods. Finally, a third source addresses the challenge of ensuring fair LLM access on multi-tenant platforms, identifying the inadequacy of existing fairness approaches due to diverse application characteristics and proposing FairServe, which uses throttling and weighted scheduling to manage abusive user behavior.
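
To make the paging idea concrete, here is a minimal, hypothetical Python sketch of block-table KV-cache bookkeeping in the spirit of PagedAttention; the class names, block size, and methods are assumptions for illustration, not vLLM's actual API.

```python
# Hypothetical sketch of paged KV-cache bookkeeping (not vLLM's real API):
# sequences get fixed-size blocks on demand, so memory is never reserved
# up front for the maximum possible context length.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

@dataclass
class BlockTable:
    blocks: list = field(default_factory=list)  # physical block ids, in order
    num_tokens: int = 0

class PagedKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.tables: dict[str, BlockTable] = {}

    def append_token(self, seq_id: str) -> int:
        """Record one generated token; allocate a new block only when the
        current block is full. Returns the physical block id used."""
        table = self.tables.setdefault(seq_id, BlockTable())
        if table.num_tokens % BLOCK_SIZE == 0:  # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller should preempt a sequence")
            table.blocks.append(self.free_blocks.pop())
        table.num_tokens += 1
        return table.blocks[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the shared free pool."""
        table = self.tables.pop(seq_id, None)
        if table:
            self.free_blocks.extend(table.blocks)

# Usage: sequences share one physical pool with no per-sequence over-reservation.
mgr = PagedKVCacheManager(num_physical_blocks=64)
for _ in range(40):
    mgr.append_token("request-A")   # allocates only 3 blocks for 40 tokens
mgr.free("request-A")
```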

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
14/12/2025 | 42 mins.
This episode introduces Jamba-1.5, a new series of instruction-tuned large language models built on the Jamba hybrid Transformer-Mamba mixture-of-experts architecture. These models, available in Large (94B active parameters) and Mini (12B active parameters) sizes, are highlighted for their high efficiency, superior throughput, and remarkably low memory usage over long context lengths of up to 256K tokens. A key technical innovation is ExpertsInt8, a novel quantization technique enabling the large model to run efficiently on standard GPU hardware without compromising quality. Evaluations consistently show that Jamba-1.5 models achieve competitive performance on academic and chatbot benchmarks while excelling in long-context tasks compared to other similarly sized open-weight models. The authors also share insights into the model's training stages, multilingual capabilities, and alignment and safety considerations.
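
As a rough illustration of the ExpertsInt8 idea, here is a minimal NumPy sketch of quantizing an expert weight matrix to INT8 with per-output-channel scales and dequantizing at matmul time; the sizes and helper names are assumptions for illustration, not Jamba-1.5's actual fused kernel.

```python
# Minimal sketch, assuming the general idea behind ExpertsInt8: store MoE expert
# weights in INT8 with per-output-channel scales, dequantize on the fly.
# A toy illustration, not the model's actual GPU kernel.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0   # one scale per column
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def expert_forward(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize at matmul time (a real kernel fuses this on the GPU)."""
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 2048)).astype(np.float32)    # one expert's weight (toy size)
q, scale = quantize_int8(w)
x = rng.standard_normal((4, 512)).astype(np.float32)
err = np.abs(expert_forward(x, q, scale) - x @ w).max()
print(f"INT8 weights use ~4x less memory; max abs output error here: {err:.4f}")
```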

Google's Titans+Miras: Learning to Memorize at Test Time
14/12/2025 | 30 mins.
Over more than a decade, there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called the hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. The authors present a new neural long-term memory module that learns to memorize historical context and helps attention attend to the current context while utilizing long past information. This neural memory has the advantage of fast, parallelizable training while maintaining fast inference. From a memory perspective, they argue that attention, due to its limited context but accurate dependency modeling, acts as short-term memory, while the neural memory, due to its ability to memorize the data, acts as a longer-term, more persistent memory. Based on these two modules, they introduce a new family of architectures, called Titans, and present three variants that address how memory can be effectively incorporated into the architecture. Experimental results on language modeling, common-sense reasoning, genomics, and time-series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. Titans can further scale effectively to context windows larger than 2M tokens, with higher accuracy on needle-in-a-haystack tasks compared to baselines.
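
As a rough sketch of the test-time memorization idea, the toy Python below implements a simplified linear memory that is updated at inference with a gradient step on an associative "surprise" loss, plus momentum and a decay (forgetting) term; the dimensions, hyperparameters, and class are illustrative assumptions, not the Titans implementation.

```python
# Minimal sketch, assuming a simplified linear stand-in for a Titans-style
# neural memory: at test time the memory parameters take a gradient step on
# the associative loss 0.5 * ||M k - v||^2, with momentum and a forgetting term.
import numpy as np

class TestTimeMemory:
    def __init__(self, dim: int, lr: float = 0.1, momentum: float = 0.9, decay: float = 0.01):
        self.M = np.zeros((dim, dim))   # memory parameters (here just a matrix)
        self.S = np.zeros((dim, dim))   # momentum buffer ("past surprise")
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def write(self, k: np.ndarray, v: np.ndarray) -> None:
        """One test-time update: gradient of 0.5 * ||M k - v||^2 w.r.t. M."""
        grad = np.outer(self.M @ k - v, k)               # surprise signal
        self.S = self.momentum * self.S - self.lr * grad
        self.M = (1.0 - self.decay) * self.M + self.S    # forget a little, then update

    def read(self, q: np.ndarray) -> np.ndarray:
        """Retrieve the value associated with a query."""
        return self.M @ q

rng = np.random.default_rng(0)
k, v = rng.standard_normal(8), rng.standard_normal(8)
mem = TestTimeMemory(dim=8)
for _ in range(200):   # repeated exposure consolidates the association
    mem.write(k, v)
print(f"retrieval error after repeated writes: {np.abs(mem.read(k) - v).max():.4f}")
```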

LLM Architectures: Attention, Mamba, and Efficiency Tradeoffs
06/12/2025 | 43 mins.
This episode examines the architecture and efficiency of Large Language Models (LLMs), focusing heavily on optimizing the attention mechanism and exploring alternatives like State Space Models (SSMs). Several papers introduce and analyze methods to overcome the quadratic complexity of standard self-attention, including Grouped-Query Attention (GQA), Sliding Window Attention (SWA), and the hardware-aware optimizations of FlashAttention. A significant portion of the research centers on Mamba-based models and hybrid architectures that combine SSMs with attention layers, demonstrating that these hybrids, such as the Mamba-2-Hybrid, can achieve better performance on memory recall and long-context tasks than pure Transformers while maintaining efficiency. Finally, one source investigates the internal reasoning of attention mechanisms, proposing that a "preplan-and-anchor" rhythm can be identified and leveraged to create more effective reinforcement learning strategies for fine-grained policy optimization.
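
To ground one of the named mechanisms, here is a minimal NumPy sketch of Grouped-Query Attention, in which several query heads share a single key/value head and thereby shrink the KV cache; shapes, head counts, and function names are illustrative assumptions, not tied to any specific model.

```python
# Minimal sketch of Grouped-Query Attention (GQA) on top of plain scaled
# dot-product attention: each group of query heads reuses one K/V head,
# reducing the KV cache by the group size.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v, n_kv_heads):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                          # index of the shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)     # (T, T) attention scores
        out[h] = softmax(scores) @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))   # 8 query heads
k = rng.standard_normal((2, 16, 32))   # only 2 KV heads, so the KV cache is 4x smaller
v = rng.standard_normal((2, 16, 32))
print(gqa(q, k, v, n_kv_heads=2).shape)  # (8, 16, 32)
```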



The Gist Talk