Dmitrii Kharlapenko
@dmhook
1/ Introducing ⚪️CircleGuardBench, a new benchmark for evaluating AI moderation models. Here’s why it’s cool:
– Tests harm detection, jailbreak resistance, false positives, and latency
– Covers 17 real-world harm categories
– First benchmark designed for production-level…
How interpretable are task vectors? Using our new task vector cleaning method, we find SAE features responsible for detecting and encoding specific ICL tasks. See the details in our second MATS 6.0 post with @neverrixx, @NeelNanda5 and @ArthurConmy. lesswrong.com/posts/5FGXmJ3w…
In our research on abstract SAE features, @neverrixx and I use LLMs’ own capabilities to explain concepts from their minds. Excited to continue our MATS 6.0 work under the mentorship of @NeelNanda5 and @ArthurConmy. More cool stuff to come! lesswrong.com/posts/8ev6coxC…