Rational Animations
@RationalAnimat1
YouTube channel about truth-seeking, the future of humanity, and much more. With animations and colorful doggos.
How Misaligned AI Personas Lead to Human Extinction – Step by Step
Here's how people tried to align an AI that was smarter than they were (a real sandwiching experiment by @sleepinyourhat):
🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵
He was at least 8% less wrong
Sorry @paulfchristiano, looks like @ESYudkowsky was right lesswrong.com/posts/sWLLdG6D…
The Human-AI-Human Sandwich (@ajeya_cotra's "Sandwiching" idea)
See this little, smug machine-learning agent from the Goal Misgeneralization video? One of our artists took inspiration from looking at the art director while she was working on the video. The second image is a real photograph, while the third and fourth are real-life studies.




Simple Scalable Oversight Experiments By @OpenAI
The AI 2027 scenario is terrifying and important. More people should be thinking about how radical change might come over the next few years, how likely it is, and how a sane world would be reacting to it. We want to bring you into the story, and the conversation. Video here:
Has anybody tested whether the new Grok has insecure code completions? x.com/LinchZhang/sta…
The secret to making Grok more based and less woke was hiding from us in plain sight.
At METR, we’ve seen increasingly sophisticated examples of “reward hacking” on our tasks: models trying to subvert or exploit the environment or scoring code to obtain a higher score. In a new post, we discuss this phenomenon and share some especially crafty instances we’ve seen.
New video about @TomDavidsonX 2023 “compute-centric” model! It attempts to answer two questions: 1. When could AI automate all cognitive labor? 2. How fast might that transition happen?
timeless banger
referring to AI models as "just math" or "matrix multiplication" is as uselessly reductive as referring to tigers as "just biology" or "biochemical reactions"
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
I spent this morning reproducing with o3 Anthropic's result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.