Mor Geva
@megamor2
"writing is not only about reporting results; it also provides a tool to uncover new thoughts and ideas. Writing compels us to think"
maybe I will live tweet the actionable interp workshop panel
Starting soon!! #ICML2025
🚨The Actionable Interpretability Workshop is happening tomorrow at ICML! Join us for an exciting lineup of speakers, nearly 70 posters, and a great panel discussion 🙌 Don’t miss it! 🔍⚙️ @icmlconf @ActInterp
Suppose you're reading something (let it be a paper, review, email, whatever) and as you read you get the sense it was largely written by an LLM. What's your first reaction? How does it change how you read it? EDIT: by "largely written by an LLM" I mean the writer heavily used…
🚨Meet our panelists at the Actionable Interpretability Workshop @ActInterp at @icmlconf! Join us July 19 at 4pm for a panel on making interpretability research actionable, its challenges, and how the community can drive greater impact. @nsaphra @saprmarks @kylelostat @FazlBarez
Building a science of model understanding that addresses real-world problems is one of the key AI challenges of our time. I'm so excited this workshop is happening! See you at #ICML2025 ✨
Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨
Gonna be there w/ the g.o.a.t. @Pranav_AL, can't wait for it! Thank you so much for the workshop 🚀!

A Vision-Language Model can answer questions about Robin Williams. It can also recognize him in a photo. So why does it FAIL when asked the same questions using his photo instead of his name? A thread on our new #acl2025 paper that explores this puzzle 🧵
What makes some jailbreak suffixes stronger than others? We looked into the inner workings of GCG-like attacks and found a cool hijacking mechanism that strong attacks heavily rely on. This also lets us enhance attacks and defenses against them. Check out @matanbt 's thread 👇
What makes or breaks powerful jailbreak suffixes? 🔓🤖 We find that: 🥷 they work by hijacking the model’s context; ♾️ the more universal a suffix is the stronger its hijacking; ⚔️🛡️ utilizing these insights, it is possible to both enhance and mitigate these attacks. 🧵
🚨 New Paper 🧵 How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness