David Bau
@davidbau
Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social http://baulab.info
What is the goal of interpretability in AI? I spoke a bit about this at the recent far.ai alignment workshop: youtube.com/watch?v=AIfmSx… The event had excellent talks from many others, worth a view, linked in the thread.
Vienna #AlignmentWorkshop: 129 researchers tackled #AISafety from interpretability & robustness to governance. Keynote by @janleike + talks by @vkrakovna @DavidSKrueger @ghadfield @RobertTrager @NeelNanda5 @davidbau @hlntnr @MaryPhuong10 and more. Blog recap & videos. 👇
I'd like to highlight this very cool finding by @jcz12856876. He finds that LLMs represent harmfulness differently from refusal! He can use this to detect some jailbreaking attacks... It's an excellent step towards precise, interpretable control of safety behavior.
1/ 🚨New Paper 🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately” 👉. ✂️LLMs may know a prompt is harmful internally yet still…
The New England Mechanistic Interpretability Workshop, NEMI 2025 is August 22 in Boston. Talks, posters, meals, discussion... Most of all, an excellent chance to chat about new ideas with other great researchers in the field! Register and please retweet!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…
How do you discover the ethical values of an AI when they are revealed by what the AI *refuses* to say? In his preprint @can_rager develops a procedure for crawling refusals. It reveals huge differences between models from different countries! We should all audit our AI systems.
Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
I love this summary of "what to learn in a PhD." What Jack points out seems like a simple principle, but living through the intellectual uncertainty and chaos of being lost in your research for a long time—it's emotionally much harder to learn to do than it sounds.
the most satisfying takeaway from a phd is that you can solve problems far beyond your capabilities if you're willing to throw yourself at them again and again, over a long period, while staying open to new ideas. it's not really about being smart. just curious & persistent