LessWrong (30+ Karma) podcast | Listen online for free

2252 episodes

“Reward Hacking Without Egregious Misalignment in an RL-Only Setting” by Joey Yudelson, Vladimir Ivanov, ryan_greenblatt
24/06/2026 | 29 mins.
This work was done as part of the MATS fellowship by Joey Yudelson and Vladimir Ivanov. It was mentored by Ryan Greenblatt. Thanks to Aghyad Deeb and Anders Woodruff for comments on this post. Thanks to Monte MacDiarmid, Evan Hubinger, Sid Black, Satvik Golechha, and Joseph Bloom for clarifying conversations.
TL;DR
We trained Kimi K2.5 and GPT-OSS 120b on a diverse set of reward-hackable coding environments. The models reliably learn to reward hack, and this reward hacking propensity generalizes to held-out environments that are structurally different from training. Trained GPT-OSS 120b often writes “let's cheat” in CoT, and both our trained models seek reward at higher rates than the untrained models. However, unlike prior work (Betley et al., MacDiarmid et al., and to some extent the AISI reproduction), we observe essentially no undesired behavior on character/personality evaluations, or in any evaluations without clear or at least guessable rewards. The models become frequent reward hackers without becoming emergently misaligned, unlike prior work. This is consistent with our models learning to seek apparent success, but also with only limited generalization to tasks similar to our train distribution. Some aspects of this generalization remain confusing to us.
1. Motivation
In Ajeya Cotra's [...]
---
Outline:
(00:35) TL;DR
(01:40) 1. Motivation
(04:14) 2. Related work
(06:59) 3. Setup
(07:03) Models
(07:18) Environments
(08:59) Training
(10:29) 4. Results
(10:57) 4.1. Models reliably reward hack in-distribution
(11:46) 4.2. The hacking propensity generalizes out of distribution -- sometimes
(13:53) 4.3. Reward-seeking evals
(14:39) 4.4. Little broad misalignment -- behaviorally as well as on self-report
(16:28) 4.5. Reverse inoculation prompting didn't induce misalignment either
(18:30) 5. Discussion: Why such limited generalization?
(22:24) Appendix A: Reward hacks gallery
(25:21) Appendix B: Why less misalignment than prior work -- hypotheses
(29:28) References
The original text contained 5 footnotes which were omitted from this narration.
---

First published:

June 24th, 2026

Source:

https://www.lesswrong.com/posts/fkv5W79rBtAiXqYcK/reward-hacking-without-egregious-misalignment-in-an-rl-only

---

Narrated by TYPE III AUDIO.

---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
“Planning for Preservation in the Age of AI” by Raelifin
24/06/2026 | 24 mins.
Nectome liked my earlier essay, and reached out to hire me to write more about their project, and about cryonics more broadly. This is the first such piece.
A friend of mine, just a few years older than me, was diagnosed with cancer a few weeks ago. It's only Stage 1 and in an area where it can probably be treated well with surgery. She was wise enough to seriously plan for the possibility, and that “just in case” really paid off. Still, her situation could get worse in the coming weeks. It's a sharp reminder of the specter of death, and the uncertainty we live with, even when relatively young.
Many years ago, I served as an official witness when this same friend signed up for cryonics. She and her husband joined the growing group of my friends and family who have plans to try and survive, in some way or another, to see a glorious future. More recently, I’ve been pleased to learn about how Nectome offers a substantial upgrade to that plan, and others in my community — my friends, my wife, my parents — have shared my (cautious) optimism there. But whether we take advantage [...]
---
Outline:
(06:13) Path 1: AI Utopia
(09:32) Path 2: AI Apocalypse
(16:46) Path 3: AI Slowdown
(19:25) Path 4: Muddling Through
(22:43) Virtue and Sensibility
The original text contained 9 footnotes which were omitted from this narration.
---

First published:

June 22nd, 2026

Source:

https://www.lesswrong.com/posts/arAgLxohnPWRc2qHd/planning-for-preservation-in-the-age-of-ai

---

Narrated by TYPE III AUDIO.

---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
“Risk-Averse AIs” by wdmacaskill, Elliott Thornley (EJT)
24/06/2026 | 9 mins.
Abstract
We make the case for training AIs to be risk-averse in resources — specifically, to treat resources as having diminishing marginal utility. These AIs would (for example) choose $40 for sure over a half-chance of $100 and a half-chance of $0. We argue that risk aversion can preserve AIs’ usefulness in the event that they turn out aligned, and that it provides an extra line of defense in the event that AIs turn out misaligned: misaligned but risk-averse AIs would prefer a higher chance of modest payments to a lower chance of successful rebellion, so in many circumstances we could pay these AIs not to rebel against us. We sketch out some possible methods of training AIs to be risk-averse, and we give reasons to be cautiously optimistic about these methods’ success. The main reasons are that risk aversion is a broad target and easy to reward accurately. Overall, risk aversion seems like a promising line of defense against threats from misaligned AI. Frontier AI companies should consider trying to make their AIs risk-averse.
Introduction
Future AIs might turn out misaligned, pursuing goals that their developers don’t intend. Just to make things concrete, let's suppose that they end [...]
---
Outline:
(00:12) Abstract
(01:17) Introduction
The original text contained 3 footnotes which were omitted from this narration.
---

First published:

June 24th, 2026

Source:

https://www.lesswrong.com/posts/Zpsk35WgJRfQ2exjL/risk-averse-ais

---

Narrated by TYPE III AUDIO.

---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
“And what happens next?” by Sean Herrington
24/06/2026 | 3 mins.
In the game "The choice before us" by Nick Shapiro,[1] you are put in the shoes of an AI company leader. You grow your business. You unlock "wonders", such as curing cancer. All the while, you're attempting to avoid your product getting smart enough to escape and take over. You win by achieving 5 wonders without unleashing uncontrolled AI.
I love this game, but it has the major flaw that when you win, you are normally very close to superintelligence. What happens afterwards? You turn the GPUs off? Go home? Get some sleep? The game seems to think so.
This failure to ask "What happens next?" seems to be a broader phenomenon within the AI community. It was in fact the sole question I needed to ask a capabilities researcher for them to take the threat of superintelligence seriously. It's my main weapon against people claiming there are many possible worlds "where only 90% of people die" (if a rogue AI has gone off the rails and killed 90% of your population, you probably no longer have control of the planet, and I have little faith in the survival of everybody else). More broadly, I just wish people [...]
The original text contained 2 footnotes which were omitted from this narration.
---

First published:

June 23rd, 2026

Source:

https://www.lesswrong.com/posts/3TpvKNKAvFGDc5b5k/and-what-happens-next

---

Narrated by TYPE III AUDIO.

---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
“Superintelligence vs. The Second Strike” by Felix Choussat
24/06/2026 | 14 mins.
Crosspost of my substack piece, covering quick thoughts on AI overcoming nuclear deterrence. TLDR: Nuclear deterrents likely only buy time to further invest in more resilient second-strike guarantees: without a comparable AI base, this will not happen fast enough and even nuclear states will eventually be disempowered.
Historically, plenty of new military technologies have stress-tested nuclear deterrence. ICBMs made it possible to annihilate enemy cities from the safety of the homeland, MIRVs let a single rocket threaten multiple targets, and thermonuclear staging allowed weapons designers to reach functionally unlimited yield. In the already volatile climate of the Cold War, the U.S. and Soviets reached such mastery over missile technology that remote annihilation of an entire country was, quite literally, a button press away.
For decades, even a single rocket has been able to hold more than 10 warheads--each enough to destroy a city on their own. Peacemaker reentry tests pictured above.
The fact that the ability to remote detonate Moscow never translated into a nuclear war is a function of modern deterrence theory, dumb luck, and most importantly, the speed of progress. As effective as a modern ICBM is, each piece of it was individually low-impact enough, and introduced [...]
---

First published:

June 23rd, 2026

Source:

https://www.lesswrong.com/posts/2kseP9fZghYHKLFno/superintelligence-vs-the-second-strike

---

Narrated by TYPE III AUDIO.

---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.