Ryan Kidd
@ryan_kidd44
Co-Executive Director @MATSprogram, Co-Founder @LondonSafeAI, Regrantor @Manifund | PhD in physics | Accelerate AI alignment + build a better future for all
🆕 blog post! My job involves funding projects aimed at preventing catastrophic risks from transformative AI. Over the two years I’ve been doing this, I’ve noticed a number of projects that I wish more people would work on. So here’s my attempt at fleshing out ten of them. 🧵
Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months. We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
Today, Senator Scott Wiener introduced an amended version of his frontier AI legislation SB 53. Secure AI Project is proud to co-sponsor this important legislation, which follows the recommendations of the California Report on Frontier AI. A thread.
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
With @luke_drago_, I’m cofounding Workshop Labs, a public benefit corporation preventing human disempowerment from AI. See below for: -impact case -what we’re building -what we hope the future looks like -what we’re hiring for
Announcing Workshop Labs, a public benefit company.
I think this paper has some really exciting results! Some of my favorites that didn't fit in the main thread:
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
Governing AI requires international agreements, but cooperation can be risky if there’s no basis for trust. Our new report looks at how to verify compliance with AI agreements without sacrificing national security. This is neither impossible nor trivial.🧵 1/
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
The moratorium just got taken out of the budget bill in a LANDSLIDE vote. 99 to 1. Incredible. Thank you to the lawmakers, the children's advocates, the artists and creators, the voters, the labor groups, and everyone who spoke out against the harmful AI law moratorium.
Our new study finds: recent AI capabilities could increase the risk of a human-caused epidemic by 2-5x, according to 46 biosecurity experts and 22 top forecasters. One critical AI threshold that most experts said wouldn't be hit until 2030 was actually crossed in early 2025. But…
Since AI 2027 people have often asked us what they can do to make AGI go well. I've just published a blog post covering: (a) What a prepared world would look like (b) Learning recommendations to get up to speed (c) High-impact jobs and non-professional activities
To quickly transform the world, it's not enough for AI to become super smart (the "intelligence explosion") AI will also have to turbocharge the physical world (the "industrial explosion") New post lays out the stages of the industrial explosion, and argues it will be fast! 🧵