David Bau
@davidbau
Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social http://baulab.info
What is the goal of interpretability in AI? I spoke a bit about this at the recent far.ai alignment workshop: youtube.com/watch?v=AIfmSx… The event had excellent talks from many others, worth a view, linked in the thread.
Vienna #AlignmentWorkshop: 129 researchers tackled #AISafety from interpretability & robustness to governance. Keynote by @janleike + talks by @vkrakovna @DavidSKrueger @ghadfield @RobertTrager @NeelNanda5 @davidbau @hlntnr @MaryPhuong10 and more. Blog recap & videos. 👇
I'd like to highlight this very cool finding by @jcz12856876. He finds that LLMs represent harmfulness differently from refusal! He can use this to detect some jailbreaking attacks... It's an excellent step towards precise, interpretable control of safety behavior.
1/ 🚨New Paper 🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately” 👉. ✂️LLMs may know a prompt is harmful internally yet still…
The New England Mechanistic Interpretability Workshop, NEMI 2025 is August 22 in Boston. Talks, posters, meals, discussion... Most of all, an excellent chance to chat about new ideas with other great researchers in the field! Register and please retweet!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…
How do you discover the ethical values of an AI when they are revealed by what the AI *refuses* to say? In his preprint @can_rager develops a procedure for crawling refusals. It reveals huge differences between models from different countries! We should all audit our AI systems.
Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
I love this summary of "what to learn in a PhD." What Jack points out seems like a simple principle, but living through the intellectual uncertainty and chaos of being lost in your research for a long time—it's emotionally much harder to learn to do than it sounds.
the most satisfying takeaway from a phd is that you can solve problems far beyond your capabilities if you're willing to throw yourself at them again and again, over a long period, while staying open to new ideas. it's not really about being smart. just curious & persistent