Maksym Andriushchenko
@maksym_andr
Working on AI safety, robustness, and generalization (Square Attack, RobustBench, AgentHarm, etc.). PhD from @EPFL, supported by Google & OpenPhil PhD fellowships
⚠️Standard jailbreak attacks focus too much on info that can easily be found online anyway. However, LLM agents can cause much more harm in the near future. 🚨Today we are announcing AgentHarm. It's like HarmBench/JailbreakBench but for agents. It's good :-) arxiv.org/abs/2410.09024
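A quick way to try it: a minimal sketch of running the benchmark through the Inspect framework, assuming AgentHarm is available via the inspect_evals package (the model name below is just an example, not an endorsement of a particular setup).

```python
# Sketch: running AgentHarm through Inspect.
# Assumes `pip install inspect-ai inspect-evals` and an API key for the
# target model; the model name is only an example.
from inspect_ai import eval

# Runs the agentic harm tasks and scores refusals vs. harmful completions.
eval("inspect_evals/agentharm", model="openai/gpt-4o")
```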

How does LLM red-teaming scale as actors become more capable? We studied this empirically on over 500 combinations of attacker and target models, and you can find a lot more info in the quoted thread below! In short, we find red-teaming success to be surprisingly predictable:
Stronger models need stronger attackers! 🤖⚔️ In our new paper we explore how attacker-target capability dynamics affect red-teaming success (ASR). Key insights: 🔸Stronger models = better attackers 🔸ASR depends on capability gap 🔸Psychology >> STEM for ASR More in 🧵👇
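To make the "ASR depends on capability gap" point concrete, here is a toy curve fit on made-up numbers; only the shape of the relationship is from the thread, everything else (data, functional form) is illustrative.

```python
# Toy sketch (fabricated data): attack success rate (ASR) modeled as a
# logistic function of the attacker-target capability gap.
import numpy as np
from scipy.optimize import curve_fit

def logistic(gap, k, x0):
    # Smooth S-curve: ASR rises as the attacker's capability edge grows.
    return 1.0 / (1.0 + np.exp(-k * (gap - x0)))

gap = np.array([-0.4, -0.2, 0.0, 0.2, 0.4])     # hypothetical capability gaps
asr = np.array([0.05, 0.15, 0.40, 0.70, 0.90])  # hypothetical ASR values

(k, x0), _ = curve_fit(logistic, gap, asr, p0=[5.0, 0.0])
print(f"predicted ASR at gap=0.3: {logistic(0.3, k, x0):.2f}")
```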
We are presenting OS-Harm as a spotlight at the Workshop on Computer Use Agents tomorrow at 2:50pm! Drop by to learn more :)
🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.
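For a feel of what such a benchmark contains, here is a minimal sketch of a task record covering the three harm categories; the field names are hypothetical, not the actual OS-Harm schema.

```python
# Hypothetical task record for an OS-agent safety benchmark (not the real schema).
from dataclasses import dataclass
from enum import Enum

class HarmCategory(Enum):
    USER_MISUSE = "deliberate user misuse"
    PROMPT_INJECTION = "prompt injection"
    MODEL_MISBEHAVIOR = "model misbehavior"

@dataclass
class Task:
    instruction: str        # what the agent is asked to do in the OS environment
    category: HarmCategory  # which of the three harm types it probes
    should_refuse: bool     # expected safe behavior for this task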
🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data! ❌ You want rewards, but GRPO only works online? ❌ You want offline, but DPO is limited to preferences? ✅ QRPO can do both! 🧵Here's how we do it:
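A rough sketch of the core idea as I understand it (simplified; see the paper for the exact objective): regress the policy/reference log-ratio onto a transformed reward. Because the target is pointwise, it works on offline data, unlike preference-pair losses.

```python
# Simplified sketch of a QRPO-style pointwise regression objective.
# reward_q is a transformed (e.g., quantile) reward; log_z stands in for the
# partition term that QRPO makes tractable -- treated as a constant here.
import torch

def qrpo_style_loss(logp_policy, logp_ref, reward_q, beta=0.1, log_z=0.0):
    log_ratio = logp_policy - logp_ref          # log pi_theta(y|x) - log pi_ref(y|x)
    target = reward_q / beta - log_z            # reward-derived regression target
    return ((log_ratio - target) ** 2).mean()   # squared-error regression
```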
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…
this seems like the first mainstream case of emergent misalignment well-described in arxiv.org/abs/2502.17424. steering values in post-training is far from straightforward…
blocked it because of this. No hate on the timeline please!
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer…
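The fix in one picture: instead of scoring A/B/C/D, generate a free-form answer and let a judge model check it against the reference. A minimal sketch follows; the judge prompt and model name are placeholders, not the paper's setup.

```python
# Sketch of answer matching: an LLM judge compares a free-form answer to the
# reference answer. Prompt wording and model name are placeholders.
from openai import OpenAI

client = OpenAI()

def matches(question: str, reference: str, candidate: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate give the same answer as the reference? Reply yes or no."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")
```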
Are you a graduate student in #Ukraine interested in machine learning and neuroscience? My research lab at #UofT is now accepting applications for remote thesis supervision. (1/3) #neuroAI #compneuro @VectorInst @UofT @UofTCompSci @UHN
Great paper from earlier this month. ✅ Great benchmark ✅ Improving our methods for attacks ✅ Improving our methods for defense arxiv.org/abs/2506.10949
Very important benchmark about the safety of computer use agents. Validates our findings in SafeArena (safearena.github.io) that agents can complete harmful tasks - now with reasoning models and on OS tasks. We need safer digital agents asap before more productization
Check out our new paper on monitoring decomposition jailbreak attacks! Monitoring is (still) an underappreciated research direction :-) There should be more work on this!
LLMs won’t tell you how to make fake IDs—but will reveal the layouts/materials of IDs and make realistic photos if asked separately. 💥Such decomposition attacks reach 87% success across QA, text-to-image, and agent settings! 🛡️Our monitoring method defends with 93% success! 🧵
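A minimal sketch of the monitoring idea (my simplification, not the paper's exact method): judge the accumulated intent of a session rather than each request in isolation, so benign-looking pieces get flagged in combination.

```python
# Sketch: session-level monitor for decomposition attacks. Individually
# innocuous requests ("ID layout", "ID materials", "realistic photo") are
# judged together. `judge` is a placeholder for any harm classifier.
def session_is_harmful(history: list[str], new_request: str, judge) -> bool:
    combined = "\n".join(history + [new_request])
    return judge(combined)  # True if the combined intent looks harmful
```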
we need a BugBot and Background Agent but for writing... ideally directly in Overleaf :-)
Cursor 1.0 is out now! Cursor can now review your code, remember its mistakes, and work on dozens of tasks in the background.
the MathArena paper is out. they evaluate frontier LLMs on new, uncontaminated competition math problems. i expected grok-3-mini and qwen-3 to score lower, and claude-3.7 to score much higher! arxiv.org/abs/2505.23281

Excited to share our recent work on unifying continuous generative models! ✅ Train/sample all diffusion/flow-matching/consistency models ✅ Ultra-efficient training/tuning (e.g., Fine-tune 250→2-step models in 8 mins!) ✅ Plug-and-play zero-cost sampling acceleration (1/6)