Maksym Andriushchenko
@maksym_andr
Working on AI safety, robustness, and generalization (Square Attack, RobustBench, AgentHarm, etc.). PhD from @EPFL, supported by Google & OpenPhil PhD fellowships
⚠️Standard jailbreak attacks focus too much on info that can easily be found online anyway. However, LLM agents can cause much more harm in the near future. 🚨Today we are announcing AgentHarm. It's like HarmBench/JailbreakBench but for agents. It's good :-) arxiv.org/abs/2410.09024
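A quick way to try it: a minimal sketch of running the benchmark through the Inspect framework, assuming AgentHarm is available via the inspect_evals package (the model name below is just an example, not an endorsement of a particular setup).

```python
# Sketch: running AgentHarm through Inspect.
# Assumes `pip install inspect-ai inspect-evals` and an API key for the
# target model; the model name is only an example.
from inspect_ai import eval

# Runs the agentic harm tasks and scores refusals vs. harmful completions.
eval("inspect_evals/agentharm", model="openai/gpt-4o")
```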

How does LLM red-teaming scale as actors become more capable? We studied this empirically on over 500 combinations of attacker and target models, and you can find a lot more info in the quoted thread below! In short, we find red-teaming success to be surprisingly predictable:
Stronger models need stronger attackers! 🤖⚔️ In our new paper we explore how attacker-target capability dynamics affect red-teaming success (ASR). Key insights: 🔸Stronger models = better attackers 🔸ASR depends on capability gap 🔸Psychology >> STEM for ASR More in 🧵👇
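To make the "ASR depends on capability gap" point concrete, here is a toy curve fit on made-up numbers; only the shape of the relationship is from the thread, everything else (data, functional form) is illustrative.

```python
# Toy sketch (fabricated data): attack success rate (ASR) modeled as a
# logistic function of the attacker-target capability gap.
import numpy as np
from scipy.optimize import curve_fit

def logistic(gap, k, x0):
    # Smooth S-curve: ASR rises as the attacker's capability edge grows.
    return 1.0 / (1.0 + np.exp(-k * (gap - x0)))

gap = np.array([-0.4, -0.2, 0.0, 0.2, 0.4])     # hypothetical capability gaps
asr = np.array([0.05, 0.15, 0.40, 0.70, 0.90])  # hypothetical ASR values

(k, x0), _ = curve_fit(logistic, gap, asr, p0=[5.0, 0.0])
print(f"predicted ASR at gap=0.3: {logistic(0.3, k, x0):.2f}")
```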
We are presenting OS-Harm as a spotlight at the Workshop on Computer Use Agents tomorrow at 2:50pm! Drop by to learn more :)
🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.
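For a feel of what such a benchmark contains, here is a minimal sketch of a task record covering the three harm categories; the field names are hypothetical, not the actual OS-Harm schema.

```python
# Hypothetical task record for an OS-agent safety benchmark (not the real schema).
from dataclasses import dataclass
from enum import Enum

class HarmCategory(Enum):
    USER_MISUSE = "deliberate user misuse"
    PROMPT_INJECTION = "prompt injection"
    MODEL_MISBEHAVIOR = "model misbehavior"

@dataclass
class Task:
    instruction: str        # what the agent is asked to do in the OS environment
    category: HarmCategory  # which of the three harm types it probes
    should_refuse: bool     # expected safe behavior for this task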
🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data! ❌ You want rewards, but GRPO only works online? ❌ You want offline, but DPO is limited to preferences? ✅ QRPO can do both! 🧵Here's how we do it:
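A rough sketch of the core idea as I understand it (simplified; see the paper for the exact objective): regress the policy/reference log-ratio onto a transformed reward. Because the target is pointwise, it works on offline data, unlike preference-pair losses.

```python
# Simplified sketch of a QRPO-style pointwise regression objective.
# reward_q is a transformed (e.g., quantile) reward; log_z stands in for the
# partition term that QRPO makes tractable -- treated as a constant here.
import torch

def qrpo_style_loss(logp_policy, logp_ref, reward_q, beta=0.1, log_z=0.0):
    log_ratio = logp_policy - logp_ref          # log pi_theta(y|x) - log pi_ref(y|x)
    target = reward_q / beta - log_z            # reward-derived regression target
    return ((log_ratio - target) ** 2).mean()   # squared-error regression
```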
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…
this seems like the first mainstream case of emergent misalignment well-described in arxiv.org/abs/2502.17424. steering values in post-training is far from straightforward…
blocked it because of this. No hate on the timeline please!
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer…
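The fix in one picture: instead of scoring A/B/C/D, generate a free-form answer and let a judge model check it against the reference. A minimal sketch follows; the judge prompt and model name are placeholders, not the paper's setup.

```python
# Sketch of answer matching: an LLM judge compares a free-form answer to the
# reference answer. Prompt wording and model name are placeholders.
from openai import OpenAI

client = OpenAI()

def matches(question: str, reference: str, candidate: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate give the same answer as the reference? Reply yes or no."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")
```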
Are you a graduate student in #Ukraine interested in machine learning and neuroscience? My research lab at #UofT is now accepting applications for remote thesis supervision. (1/3) #neuroAI #compneuro @VectorInst @UofT @UofTCompSci @UHN
Great paper from earlier this month. ✅ Great benchmark ✅ Improving our methods for attacks ✅ Improving our methods for defense arxiv.org/abs/2506.10949
Very important benchmark about the safety of computer use agents. Validates our findings in SafeArena (safearena.github.io) that agents can complete harmful tasks - now with reasoning models and on OS tasks. We need safer digital agents asap before more productization
Check out our new paper on monitoring decomposition jailbreak attacks! Monitoring is (still) an underappreciated research direction :-) There should be more work on this!
LLMs won’t tell you how to make fake IDs—but will reveal the layouts/materials of IDs and make realistic photos if asked separately. 💥Such decomposition attacks reach 87% success across QA, text-to-image, and agent settings! 🛡️Our monitoring method defends with 93% success! 🧵
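A minimal sketch of the monitoring idea (my simplification, not the paper's exact method): judge the accumulated intent of a session rather than each request in isolation, so benign-looking pieces get flagged in combination.

```python
# Sketch: session-level monitor for decomposition attacks. Individually
# innocuous requests ("ID layout", "ID materials", "realistic photo") are
# judged together. `judge` is a placeholder for any harm classifier.
def session_is_harmful(history: list[str], new_request: str, judge) -> bool:
    combined = "\n".join(history + [new_request])
    return judge(combined)  # True if the combined intent looks harmful
```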
we need a BugBot and Background Agent but for writing... ideally directly in Overleaf :-)
Cursor 1.0 is out now! Cursor can now review your code, remember its mistakes, and work on dozens of tasks in the background.
the MathArena paper is out. they evaluate frontier LLMs on new, uncontaminated competition math problems. i expected grok-3-mini and qwen-3 to score lower, and claude-3.7 to score much higher! arxiv.org/abs/2505.23281

Excited to share our recent work on unifying continuous generative models! ✅ Train/sample all diffusion/flow-matching/consistency models ✅ Ultra-efficient training/tuning (e.g., Fine-tune 250→2-step models in 8 mins!) ✅ Plug-and-play zero-cost sampling acceleration (1/6)