Dmitrii Kharlapenko

@dmhook

Joined October 2023

24 Following

119 Followers

Dmitrii Kharlapenko Retweeted

White Circle@whitecircle_ai · May 7

1/ Introducing ⚪️CircleGuardBench, a new benchmark for evaluating AI moderation models. Here’s why it’s cool:
– Tests harm detection, jailbreak resistance, false positives, and latency
– Covers 17 real-world harm categories
– First benchmark designed for production-level…

Dmitrii Kharlapenko@dmhook · Aug 12

How interpretable are task vectors? Using our new task vector cleaning method, we find SAE features responsible for detecting and encoding specific ICL tasks. See details in our second MATS 6.0 post with @neverrixx, @NeelNanda5, and @ArthurConmy. lesswrong.com/posts/5FGXmJ3w…

Dmitrii Kharlapenko@dmhook · Aug 5

In our abstract SAE features research, @neverrixx and I use LLMs’ own capabilities to explain concepts from their minds. Excited to continue our MATS 6.0 work under the mentorship of @NeelNanda5 and @ArthurConmy. More cool stuff to come! lesswrong.com/posts/8ev6coxC…
