Aryaman Arora
@aryaman2020
member of technical staff @stanfordnlp
I'll be interning at @TransluceAI for the summer doing interp 🫡 will be staying in SF
please go to this fire poster
flying to Vienna 🇦🇹 for ACL to present Genie Worksheets (Monday 11am)! come and say hi if you want to talk about how to create controllable and reliable application layers on top of LLMs, knowledge discovery and curation, or just wanna hang
💥New Paper💥 #LLMs encode harmfulness and refusal separately! 1️⃣We found a harmfulness direction 2️⃣The model internally knows a prompt is harmless, but still refuses it🤯 3️⃣Implications for #AI #safety & #alignment? Let's analyze the harmfulness direction and use Latent Guard 🛡️
1/ 🚨New Paper 🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately” 👉. ✂️LLMs may know a prompt is harmful internally yet still…