Yi Zeng 曾祎
@EasonZeng623
probe to improve | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.
Now you know there's another dude who just discussed AI Safety and Security with both sides ;) #NeurIPS2023 [📸 With legends @ylecun and Yoshua Bengio]
![Photo with @ylecun and Yoshua Bengio at NeurIPS 2023](https://pbs.twimg.com/media/GBpuvolXIAERrv0.jpg)
![Photo with @ylecun and Yoshua Bengio at NeurIPS 2023](https://pbs.twimg.com/media/GBpuvojXkAAIbyo.jpg)
Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), but why? It turns out one reason is that benchmarks are too noisy to pinpoint such lingering failures. w/ @josh_vendrow @EdwardVendrow @sarameghanbeery 1/5
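(Not from the paper, just a back-of-the-envelope Python illustration of the noise issue: with only a few hundred questions, the sampling noise in a benchmark score alone spans several accuracy points, so small lingering failure rates are hard to pin down.)

```python
# Hypothetical illustration: normal-approximation 95% confidence interval
# on a benchmark accuracy estimate.
import math

def accuracy_ci(acc: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(acc * (1 - acc) / n_questions)  # standard error of the estimate
    return acc - z * se, acc + z * se

# A model scoring 95% on a 200-question grade-school-math benchmark:
low, high = accuracy_ci(0.95, 200)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # about [0.920, 0.980]: a ~6-point-wide noise band
```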
'Update Federal procurement guidelines to ensure that the government only contracts with frontier large language model (LLM) developers who ensure that their systems are objective and free from top-down ideological bias.' There is an executive order on this arriving today.
The AI Action Plan has been released.
Tomorrow is the unveiling of the AI Action Plan.
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning" We introduce ACTOR—a lightweight, activation-based training method that reduces over-refusal without…
🔹 AI alignment really needs interdisciplinary work! 🔹 See my talk on "how to humanize AI to persuade them for jailbreaking": buff.ly/R7SN4W4
📄 ACL'24 Outstanding & Best Social Impact Paper: buff.ly/BWj9jNM 🎥 Full talk from Singapore Alignment Workshop: buff.ly/R7SN4W4
I'll never forget this model, nor the relationship between pretraining, SFT, and RLHF.
Baidu just released 23 models at the same time on @huggingface - from 0.3B to 424B parameters. Let’s go!
Sparsity can make your LoRA fine-tuning go brrr 💨 Announcing SparseLoRA (ICML 2025): up to 1.6-1.9x faster LLM fine-tuning (2.2x less FLOPs) via contextual sparsity, while maintaining performance on tasks like math, coding, chat, and ARC-AGI 🤯 🧵1/ z-lab.ai/projects/spars…
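(A toy sketch of the general idea as I read the tweet, not the SparseLoRA algorithm itself: "contextual sparsity" meaning the adapter only uses the low-rank components that matter for the current input. The module below is my own PyTorch simplification; a real speedup would require actually skipping the masked computation rather than masking after the fact.)

```python
import torch
import torch.nn as nn

class ToyContextualSparseLoRA(nn.Module):
    """Toy LoRA layer that keeps only the top-k rank components per token (illustrative only)."""
    def __init__(self, base: nn.Linear, rank: int = 16, keep: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)             # freeze the pretrained projection
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.keep = keep                                    # rank components kept per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x @ self.A.T                                    # low-rank activations, shape (..., rank)
        idx = z.abs().topk(self.keep, dim=-1).indices       # input-dependent ("contextual") selection
        mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
        # A real implementation would gather only the selected components to save FLOPs;
        # here we simply mask to keep the sketch short.
        return self.base(x) + (z * mask) @ self.B.T
```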
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows, on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) also works very well. - Offline DPO…
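(The DPO objective itself is the same in all three regimes; as I read the tweet, what changes is only how fresh the preference pairs are: generated once (offline), every step (online), or from a policy snapshot re-synced every s steps (semi-online). Below is the standard DPO loss as a reference point; the scheduling comments are my reading, not the paper's code.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta: float = 0.1):
    """Standard DPO loss. pi_* / ref_*: summed log-probs of the chosen/rejected
    responses under the trained policy and the frozen reference model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Offline: pairs generated once up front.  Online: regenerated every step.
# Semi-online: regenerated from a policy snapshot every s optimizer steps.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # ~0.598 for this toy example
```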
1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discovered 15 zero-days in major open-source projects 💰 BountyBench: AI agents solved real-world bug bounty tasks worth tens of thousands of dollars 🤖…
AIR-Bench is a Spotlight @iclr_conf 2025! Catch our poster on Fri, Apr 26, 10 a.m.–12:30 p.m. SGT (Poster Session 5). Sadly, I won’t be there in person (visa woes, again), but the insights—and our incredible team—will be with you in Singapore. Go say hi 👋
🧵[1/5] Introducing AIR 2024: Unifying AI risk categorizations with a shared language to improve AI safety. W/ @kevin_klyman @andyz245 @YUYANG_UCLA @MinzhouP & thanks to @ruoxijia @dawnsongtweets @percyliang @uiuc_aisecure for their guidance kicking off my AI policy research journey 🏦.
🚀 Really excited to launch #AgentX competition hosted by @BerkeleyRDI @UCBerkeley alongside our LLM Agents MOOC series (a global community of 22k+ learners & growing fast). Whether you're building the next disruptive AI startup or pushing the research frontier, AgentX is your…
1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵
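(For context, the "stronger baseline" being argued for is essentially a plain logistic-regression probe trained on raw model activations, against which SAE-feature probes are compared. A minimal sketch below, with synthetic data standing in for real residual-stream activations.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 768))                                   # stand-in for activations
labels = (acts[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)   # synthetic "concept" labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # the simple baseline probe
print("baseline probe accuracy:", probe.score(X_te, y_te))
```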
Excited to share new work from my internship @GoogleAI ! Curious as to how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B parameter language models on the Pile. arxiv: arxiv.org/abs/2502.02494 1/🧵
New paper & model release! Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model. - Model: huggingface.co/DuoGuard/DuoGu… - Paper: arxiv.org/abs/2502.05163 - GitHub: github.com/yihedeng9/DuoG… Grounded in a…
new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering. we find that: 🥇prompting and finetuning are still best 🥈supervised interp methods are effective 😮SAEs lag behind
🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrating that these measures are sufficient? Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail, proposing an…
Open Problems in Mechanistic Interpretability This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵
DeepSeek should create a preparedness framework/RSP if they continue to scale reasoning models.
We have a new position paper on "inference time compute" and what we have been working on over the last few months! We present some theory on why it is necessary, how it works, why we need it, and what it means for "super" intelligence.