Yi Zeng 曾祎
@EasonZeng623
probe to improve | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.
Now you know there's another dude who just discussed AI Safety and Security with both sides ;) #NeurIPS2023 [📸 With legends @ylecun and Yoshua Bengio]
![Photo with @ylecun and Yoshua Bengio at NeurIPS 2023](https://pbs.twimg.com/media/GBpuvolXIAERrv0.jpg)
![Photo with @ylecun and Yoshua Bengio at NeurIPS 2023](https://pbs.twimg.com/media/GBpuvojXkAAIbyo.jpg)
Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), but why? It turns out one reason is that benchmarks are too noisy to pinpoint such lingering failures. w/ @josh_vendrow @EdwardVendrow @sarameghanbeery 1/5
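(Not from the paper, just a back-of-the-envelope Python illustration of the noise issue: with only a few hundred questions, the sampling noise in a benchmark score alone spans several accuracy points, so small lingering failure rates are hard to pin down.)

```python
# Hypothetical illustration: normal-approximation 95% confidence interval
# on a benchmark accuracy estimate.
import math

def accuracy_ci(acc: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(acc * (1 - acc) / n_questions)  # standard error of the estimate
    return acc - z * se, acc + z * se

# A model scoring 95% on a 200-question grade-school-math benchmark:
low, high = accuracy_ci(0.95, 200)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # about [0.920, 0.980]: a ~6-point-wide noise band
```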
'Update Federal procurement guidelines to ensure that the government only contracts with frontier large language model (LLM) developers who ensure that their systems are objective and free from top-down ideological bias.' There is an executive order on this arriving today.
The AI Action Plan has been released.
Tomorrow is the unveiling of the AI Action Plan.
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning" We introduce ACTOR—a lightweight, activation-based training method that reduces over-refusal without…
🔹 AI alignment really needs interdisciplinary work! 🔹 See my talk on "how to humanize AI to persuade them for jailbreaking": buff.ly/R7SN4W4
📄 ACL'24 Outstanding & Best Social Impact Paper: buff.ly/BWj9jNM 🎥 Full talk from Singapore Alignment Workshop: buff.ly/R7SN4W4
I'll never forget this model, nor the relationship between pretraining, SFT, and RLHF.
Baidu just released 23 models at the same time on @huggingface - from 0.3B to 424B parameters. Let’s go!
Sparsity can make your LoRA fine-tuning go brrr 💨 Announcing SparseLoRA (ICML 2025): up to 1.6-1.9x faster LLM fine-tuning (2.2x less FLOPs) via contextual sparsity, while maintaining performance on tasks like math, coding, chat, and ARC-AGI 🤯 🧵1/ z-lab.ai/projects/spars…
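(A toy sketch of the general idea as I read the tweet, not the SparseLoRA algorithm itself: "contextual sparsity" meaning the adapter only uses the low-rank components that matter for the current input. The module below is my own PyTorch simplification; a real speedup would require actually skipping the masked computation rather than masking after the fact.)

```python
import torch
import torch.nn as nn

class ToyContextualSparseLoRA(nn.Module):
    """Toy LoRA layer that keeps only the top-k rank components per token (illustrative only)."""
    def __init__(self, base: nn.Linear, rank: int = 16, keep: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)             # freeze the pretrained projection
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.keep = keep                                    # rank components kept per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.A.T                                    # low-rank activations, shape (..., rank)
        idx = z.abs().topk(self.keep, dim=-1).indices       # input-dependent ("contextual") selection
        mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
        # A real implementation would gather only the selected components to save FLOPs;
        # here we simply mask to keep the sketch short.
        return self.base(x) + (z * mask) @ self.B.T
```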
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows, on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) also works very well. - Offline DPO…
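(The DPO objective itself is the same in all three regimes; as I read the tweet, what changes is only how fresh the preference pairs are: generated once (offline), every step (online), or from a policy snapshot re-synced every s steps (semi-online). Below is the standard DPO loss as a reference point; the scheduling comments are my reading, not the paper's code.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta: float = 0.1):
    """Standard DPO loss. pi_* / ref_*: summed log-probs of the chosen/rejected
    responses under the trained policy and the frozen reference model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Offline: pairs generated once up front.  Online: regenerated every step.
# Semi-online: regenerated from a policy snapshot every s optimizer steps.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # ~0.598 for this toy example
```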
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖…
AIR-Bench is a Spotlight @iclr_conf 2025! Catch our poster on Fri, Apr 26, 10 a.m.–12:30 p.m. SGT (Poster Session 5). Sadly, I won’t be there in person (visa woes, again), but the insights—and our incredible team—will be with you in Singapore. Go say hi 👋
🧵[1/5] Introducing AIR 2024: Unifying AI risk categorizations with a shared language to improve AI safety. W/ @kevin_klyman @andyz245 @YUYANG_UCLA @MinzhouP & thanks to @ruoxijia @dawnsongtweets @percyliang @uiuc_aisecure for their guidance kicking off my AI policy research journey 🏦.
🚀 Really excited to launch #AgentX competition hosted by @BerkeleyRDI @UCBerkeley alongside our LLM Agents MOOC series (a global community of 22k+ learners & growing fast). Whether you're building the next disruptive AI startup or pushing the research frontier, AgentX is your…
1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵
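(For context, the "stronger baseline" being argued for is essentially a plain logistic-regression probe trained on raw model activations, against which SAE-feature probes are compared. A minimal sketch below, with synthetic data standing in for real residual-stream activations.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 768))                                   # stand-in for activations
labels = (acts[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)   # synthetic "concept" labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # the simple baseline probe
print("baseline probe accuracy:", probe.score(X_te, y_te))
```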
Excited to share new work from my internship @GoogleAI ! Curious as to how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B parameter language models on the Pile. arxiv: arxiv.org/abs/2502.02494 1/🧵
New paper & model release! Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model. - Model: huggingface.co/DuoGuard/DuoGu… - Paper: arxiv.org/abs/2502.05163 - GitHub: github.com/yihedeng9/DuoG… Grounded in a…
new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering. we find that: 🥇prompting and finetuning are still best 🥈supervised interp methods are effective 😮SAEs lag behind
🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrating that these measures are sufficient? Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail, proposing an…
Open Problems in Mechanistic Interpretability This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵
DeepSeek should create a preparedness framework/RSP if they continue to scale reasoning models.
We have a new position paper on "inference time compute" and what we have been working on over the last few months! We present some theory on why it is necessary, how it works, why we need it, and what it means for "super" intelligence.