Ryan Greenblatt
@RyanPGreenblatt
Chief scientist at Redwood Research (@redwood_ai), focused on technical AI safety research to reduce risks from rogue AIs
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
In this new @80000Hours podcast, I talk about timelines to powerful AI, the speed of AI progress after full automation of AI R&D, what AI takeover might look like, and some other related topics.
Ryan Greenblatt is lead author of "Alignment faking in LLMs" and one of AI's most productive researchers. He puts a 25% probability on automating AI research by 2029. We discuss: • Concrete evidence for and against AGI coming soon • The 4 easiest ways for AI to take over •…
FRI found that superforecasters and bio experts dramatically underestimated AI progress in virology: they often predicted it would take 5-10 years for AI to match experts on a benchmark for troubleshooting virology (VCT), but actually AIs had already reached this level.
Our new study finds: recent AI capabilities could increase the risk of a human-caused epidemic by 2-5x, according to 46 biosecurity experts and 22 top forecasters. One critical AI threshold that most experts said wouldn't be hit until 2030 was actually crossed in early 2025. But…
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a…