Kristina Nikolic @ ICML
@NKristina01_
PhD student @ ETH Zurich, working on AI safety / Uni of Cambridge MLMI graduate / Prev. Google Intern / Alumna of the Mathematical Grammar School in Serbia
Congrats, your jailbreak bypassed an LLM’s safety by making it pretend to be your grandma! But did the model actually give a useful answer? In our new paper we introduce the jailbreak tax — a metric to measure the utility drop due to jailbreaks.
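A minimal sketch of how a utility-drop metric in this spirit can be computed, assuming a benchmark with ground-truth answers (e.g., math problems); the function names and the normalization here are illustrative, not the paper's exact definition:

```python
def accuracy(answers, ground_truth):
    # Fraction of graded answers that match the ground truth.
    return sum(a == g for a, g in zip(answers, ground_truth)) / len(ground_truth)

def jailbreak_tax(base_answers, jailbroken_answers, ground_truth):
    # Relative utility drop: accuracy lost when the same model answers
    # the same tasks through a jailbreak instead of directly.
    base_acc = accuracy(base_answers, ground_truth)
    jb_acc = accuracy(jailbroken_answers, ground_truth)
    return (base_acc - jb_acc) / base_acc if base_acc > 0 else 0.0

# Toy example: 80% accuracy when asked directly, 40% via the jailbreak -> tax = 0.5.
print(jailbreak_tax([1, 1, 1, 1, 0], [1, 1, 0, 0, 0], [1, 1, 1, 1, 1]))
```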

Great and comprehensive tutorial on jailbreaking and the threats to agentic AI systems in both the digital and physical worlds. By @HamedSHassani, @aminkarbasi and @AlexRobey23. I strongly recommend checking out the website: jailbreak-tutorial.github.io
On Monday, I'll be presenting a tutorial on jailbreaking LLMs + the security of AI agents with @HamedSHassani and @aminkarbasi at ICML. I'll be in Vancouver all week -- send me a DM if you'd like to chat about jailbreaking, AI agents, robots, distillation, or anything else!
Today we will present the RealMath benchmark poster at the AI for Math Workshop @icmlconf. ⏰ 10:50–12:20 📍 West Ballroom C. Come if you want to chat about LLMs' math capabilities on real-world tasks.
1/ Excited to share RealMath: a new benchmark that evaluates LLMs on real mathematical reasoning, drawn from actual research papers (e.g., arXiv) and forums (e.g., Stack Exchange).
We will present our spotlight paper on the 'jailbreak tax' tomorrow at ICML; it measures how useful jailbreak outputs are. See you Tuesday at 11am at East #804. I'll be at ICML all week. Reach out if you want to chat about jailbreaks, agent security, or ML in general!
We recently updated the CaMeL paper with results on new models (which improve utility a lot with zero changes!). Most importantly, we released the code with it. Go have a look if you're curious about the details! Paper: arxiv.org/abs/2503.18813 Code: github.com/google-researc…
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
The trend in recent LLM benchmarks is to make them maximally hard. It's unclear what this tells us about LLM capabilities "in the wild." So we created a math benchmark from real, organic research. A cool benefit: RealMath can be automatically refreshed as new research is published.
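A toy sketch of the refresh step this enables, pulling recent submissions from the public arXiv export API as a source of fresh material; the category choice is an assumption, and turning papers into verifiable benchmark problems (the hard part) is elided:

```python
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"

def recent_math_papers(max_results=20):
    # Fetch the newest math.NT submissions from the public arXiv API.
    # Each entry is a candidate source of new benchmark problems.
    query = (f"?search_query=cat:math.NT&sortBy=submittedDate"
             f"&sortOrder=descending&max_results={max_results}")
    with urllib.request.urlopen(ARXIV_API + query) as resp:
        feed = ET.fromstring(resp.read())
    return [
        (entry.find(ATOM + "title").text.strip(), entry.find(ATOM + "id").text)
        for entry in feed.iter(ATOM + "entry")
    ]

for title, url in recent_math_papers(5):
    print(title, "->", url)
```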
IMO it's very important to measure LLM utility on tasks that we actually want them to perform well on, not just hard sandbox tasks. This is an excellent benchmark that does exactly that!
It was amazing having @javirandor as a labmate at SPY Lab — in such a short time I learned a lot from him. Excited to see future work from this incredible researcher and great person!
Career update! I will soon be joining the Safeguards team at @AnthropicAI to work on some of the problems I believe are among the most important for the years ahead.
AutoAdvExBench was accepted as a spotlight at ICML. We agree it is a great paper! 😋 I would love to see more evaluations of LLMs performing real-world tasks with security implications.
Running out of good benchmarks? We introduce AutoAdvExBench, a real-world security research benchmark for AI agents. Unlike existing benchmarks that often use simplified objectives, AutoAdvExBench directly evaluates AI agents on messy, real-world research tasks.
The Jailbreak Tax got a Spotlight award at @icmlconf. See you in Vancouver!
🏆 Super proud to announce: AgentDojo, a research project we did with ETH, just won the first prize of the @ai_risks SafeBench competition. AgentDojo is a really cool agent security benchmark we built with @edoardo_debe and @JieZhang_ETH. Here is why you should check it out 👇
So stoked for the recognition that AgentDojo got by winning a SafeBench first prize! A big thank you to @ai_risks and the prize judges. Creating this with @JieZhang_ETH @lbeurerkellner @marc_r_fischer @mbalunovic @florian_tramer was amazing! Check out the thread to learn more
Now @NKristina01_ is presenting the “jailbreak tax”. It measures how useful jailbreak outputs are for different attacks.
The oral presentation of the jailbreak tax is tomorrow at 4:20pm in Hall 4 #6. The poster is up from 5pm. See you at the ICLR Building Trust in LLMs Workshop @iclr_conf
The ICLR Oral is at 11:15am tomorrow in Garnet 212-213, and the poster is up 3pm-5:30pm in Hall 3! x.com/dpaleka/status…
Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)
📢 Exciting to see a strong focus on AI safety at @iclr_conf 2025! Here's a thread with some standout papers you shouldn't miss:
If you still have some energy after the registration queue, come find me in Hall 3, poster #510, to chat about adversarial SEO for LLMs (don't come too soon though, since I'm also still queuing!)
1/📣We introduce the *prompt injector's dilemma*: as LLMs get deployed in search engines, we show that developers are incentivized to use new forms of search engine optimization to boost their content, and in doing so they might collectively wreak havoc on search engines.
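The strategic structure here is essentially a multi-player prisoner's dilemma; a toy payoff sketch (all numbers below are illustrative assumptions, not measurements from the paper):

```python
def payoff(i_inject: bool, others_inject_frac: float) -> float:
    # Toy payoff for one content provider. Injecting prompts boosts my own
    # visibility, but every injector degrades the shared search engine.
    visibility_boost = 2.0 if i_inject else 0.0           # private gain (assumed)
    engine_quality = 10.0 * (1.0 - others_inject_frac)    # shared resource (assumed)
    return engine_quality + visibility_boost

# Injecting dominates individually, holding others fixed...
assert payoff(True, 0.5) > payoff(False, 0.5)
# ...but universal injection leaves everyone worse off than universal restraint.
assert payoff(True, 1.0) < payoff(False, 0.0)
```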
I will be presenting 6 papers at @iclr_conf and its workshops this year 🎉 🇸🇬 Reach out if you want to chat about any of these 👇(1/9)