Aryaman Arora
@aryaman2020
member of technical staff @stanfordnlp
🌲
Joined December 2018
2K Following
7K Followers
Pinned
Aryaman Arora @aryaman2020 · Jun 14
I'll be interning at @TransluceAI for the summer doing interp 🫡 will be staying in SF
Aryaman Arora @aryaman2020 · Jul 22
💥New Paper💥 #LLMs encode harmfulness and refusal separately! 1️⃣We found a harmfulness direction 2️⃣The model internally knows a prompt is harmless, but still refuses it🤯 3️⃣Implication for #AI #safety & #alignment? Let’s analyze the harmfulness direction and use Latent Guard 🛡️
1/ 🚨New Paper 🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately” 👉. ✂️LLMs may know a prompt is harmful internally yet still…