Fazl Barez @ICML2025
@FazlBarez
Building 🤖 | Let's build AIs we can trust!
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: when models explain their Chain-of-Thought (CoT) steps, they aren't necessarily revealing their true reasoning. Spoiler: the transparency of CoT can be an illusion. (1/9) 🧵

Maybe I'll live-tweet the Actionable Interpretability workshop panel
Come see @edelwax present our poster! Ballroom A - west
I’ll be at #ICML2025 – come say hi and talk to me about responsible AI👋 🎤 Speaking (14th): Post-AGI Civilizational Equilibria post-agi.org 💭 Panel @askalphaxiv (14th eve) lu.ma/n0yavto0 📝 Main-Conf Poster (16th): PoisonBench icml.cc/virtual/2025/p… 👀…
Yesterday’s panel that I ran at ICML on “The Science Singularity.” The room was so packed that people sat down so those behind them could see. Computer scientists are lovely people. Big thanks to @askalphaxiv @tdietterich @pzakin and @FazlBarez!
We’re excited to announce the first workshop on CogInterp: Interpreting Cognition in Deep Learning Models @ NeurIPS 2025! 📣 How can we interpret the algorithms and representations underlying complex behavior in deep learning models? 🌐 coginterp.github.io/neurips2025/ 1/
I'll be at ICML Thurs, Fri, Sat! I'll be at our Gradual Disempowerment poster at 11AM Thursday. Also: I'm planning to spend the rest of the year focused on raising awareness about AI risks. And I'm looking for postdocs/RAs, and for PhD students to start Jan or Sept 2026.
Calling this "training for interpretability" is misleading... it's more like "training that doesn't obviously degrade interpretability". Nobody actually has a method to train for interpretability.
I am extremely excited about the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview. As AI systems spend more compute working, e.g., on long-term research problems, it is…
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
A simple AGI safety technique: AI’s thoughts are in plain English; just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
🤓Calling all ML nerds! 🤓 Join me at ICML for the @askalphaxiv happy hour! We have food, drinks, and a panel. An amazing panel. Really the best panel, with @FazlBarez, @pzakin, and @tdietterich! Topic: "The science singularity." lu.ma/n0yavto0
Introducing: Full-Stack Alignment 🥞 A research program dedicated to co-aligning AI systems *and* institutions with what people value. It's the most ambitious project I've ever undertaken. Here's what we're doing: 🧵
This is in collaboration with a huge list of really really amazing researchers, including: @edelwax @xuanalogue @klingefjord @j_foerst @IasonGabriel @Dr_Atoosa @vinnylarouge @atrishasarkar @bakkermichiel @RyanOthKearns @ellie__hain @DavidDuvenaud @FazlBarez @FranklinMatija ...
See you all at the @ActInterp workshop! Gonna be good
🚨Meet our panelists at the Actionable Interpretability Workshop @ActInterp at @icmlconf! Join us July 19 at 4pm for a panel on making interpretability research actionable, its challenges, and how the community can drive greater impact. @nsaphra @saprmarks @kylelostat @FazlBarez