Hadas Orgad @ ICML
@OrgadHadas
PhD student @ Technion | Focused on AI interpretability, robustness & safety | Because black boxes don’t belong in critical systems
I'm excited to share that I'll be joining @KempnerInst @Harvard as a research fellow this September!
Thrilled to announce the 2025 recipients of #KempnerInstitute Research Fellowships: Elom Amemastro, Ruojin Cai, David Clark, Alexandru Damian, William Dorrell, Mark Goldstein, Richard Hakim, Hadas Orgad, Gizem Ozdil, Gabriel Poesia, & Greta Tuckute! bit.ly/3IpzD5E
Maybe I'll live-tweet the actionable interp workshop panel
Crazy amount of cool work concentrated in one room
The first poster session is happening now!
It's really fun to walk around a poster session and literally want to stop by each one!
Hope everyone’s getting the most out of #icml25. We’re excited and ready for the Actionable Interpretability (@ActInterp) workshop this Saturday! Check out the schedule and join us to discuss how we can move interpretability toward more practical impact.


After a thunderstorm cancelled my flight, I finally made it to Vancouver for #ICML2025 and the @ActInterp workshop! DM if you want to chat about using interpretability for safer and more controllable AI. We'll also present the Mechanistic Interpretability Benchmark (MIB) on Thu @ 11:00. Come by!

Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨
Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”? In our new paper, we clearly define this concept and design controlled experiments to test it. 1/🧵
VLMs perform better when answering questions about text than when answering the same questions about images. But why, and how can we fix it? We investigate this gap from a mechanistic interpretability perspective and use our findings to close a third of it! 🧵
I'm excited that our @ActInterp workshop at @icmlconf received over 150 submissions! We had to expand our reviewer pool to accommodate all submissions. I hope this reflects a growing interest in more actionable approaches to interpretability.

A really interesting paper by Dana and others that divides SAE features into two groups: input features and output features. The output features are actually pretty useful for steering!
Tried steering with SAEs and found that not all features behave as expected? Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
🚨New paper at #ACL2025 Findings! REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space. LMs memorize and leak sensitive data: emails, SSNs, and URLs from their training data. We propose a surgical method to unlearn it. 🧵👇 w/ @boknilev @mtutek 1/8
We received more submissions to our Actionable Interpretability workshop (@ActInterp) than expected and are looking for additional reviewers to handle 2–3 papers between May 24 and June 7. Sign up here: forms.gle/FLToWY3keb832n… Thank you! 🙏
We got more submissions to the workshop than we anticipated, and are looking for reviewers willing to review 2-4 papers between May 24 and June 7. If you are interested, please self-nominate! Thank you 🙏 docs.google.com/forms/d/e/1FAI…
🚨 We're looking for reviewers for the workshop! If you're passionate about making interpretability useful and want to help shape the conversation, we'd love your input. Sign up to review >>💡🔍
🚨 Announcing the keynote speakers in the @ActInterp workshop at #icml2025 Join us to hear these leading experts share their take on how interpretability can drive real-world impact in AI. @_beenkim @cogconfluence @byron_c_wallace @RICEric22
