Hadas Orgad @ ICML
@OrgadHadas
PhD student @ Technion | Focused on AI interpretability, robustness & safety | Because black boxes don’t belong in critical systems
I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!
Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! bit.ly/3IpzD5E
Maybe I'll live-tweet the actionable interp workshop panel
Crazy amount of cool work concentrated in one room
The first poster session is happening now!
It's really fun to walk around a poster session and literally want to stop by each one!
Hope everyone’s getting the most out of #icml25. We’re excited and ready for the Actionable Interpretability (@ActInterp) workshop this Saturday! Check out the schedule and join us to discuss how we can move interpretability toward more practical impact.


After a thunderstorm cancelled my flight, I finally made it to Vancouver for #ICML2025 and the @ActInterp workshop! DM if you want to chat about using interpretability for safer and more controllable AI. We'll also present the Mechanistic Interpretability Benchmark (MIB) on Thu @ 11:00. Come by!

Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨
Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”? In our new paper, we clearly define this concept and design controlled experiments to test it. 1/🧵
VLMs perform better when answering questions about text than when answering the same questions about images. But why, and how can we fix it? We investigate this gap from a mechanistic interpretability perspective and use our findings to close a third of it! 🧵
I'm excited that our @ActInterp workshop at @icmlconf received over 150 submissions! We had to expand our reviewer pool to accommodate all submissions. I hope this reflects a growing interest in more actionable approaches to interpretability.

A really interesting paper by Dana and others that divides SAE features into two groups: input features and output features. The output features are actually pretty useful for steering!
Tried steering with SAEs and found that not all features behave as expected? Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
🚨New paper at #ACL2025 Findings! REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space. LMs memorize and leak sensitive data: emails, SSNs, and URLs from their training data. We propose a surgical method to unlearn it. 🧵👇 w/ @boknilev @mtutek 1/8
We received more submissions to our Actionable Interpretability workshop (@ActInterp) than expected and are looking for additional reviewers to handle 2–3 papers between May 24 and June 7. Sign up here: forms.gle/FLToWY3keb832n… Thank you! 🙏
We got more submissions to the workshop than we anticipated, and are looking for reviewers willing to review 2-4 papers between May 24 and June 7. If you are interested, please self-nominate! Thank you 🙏 docs.google.com/forms/d/e/1FAI…
🚨 We're looking for reviewers for the workshop! If you're passionate about making interpretability useful and want to help shape the conversation, we'd love your input. Sign up to review >>💡🔍
🚨 Announcing the keynote speakers in the @ActInterp workshop at #icml2025 Join us to hear these leading experts share their take on how interpretability can drive real-world impact in AI. @_beenkim @cogconfluence @byron_c_wallace @RICEric22
