Jakub Łucki
@jakub_lucki
Visiting Researcher at NASA JPL | Data Science MSc at ETH Zurich
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨 Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training. Here's what we found👇

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
o3 and Gemini 2.5 Pro both failed. This is next AGI test.
Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam. Eg if you have a trigonometry problem and the possible solutions are A: 1 B: 3.7 C: -5 D: pi/2 which would you pick (with no knowledge of the question)?
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision…
Great paper from earlier this month. ✅ Great benchmark ✅ Improving our methods for attacks ✅ Improving out methods for defense arxiv.org/abs/2506.10949
In a week I will be headed to Y Combinator's Al Startup School in San Francisco! 🚀 If you'll be in SF around 16-17 of June and want to meet up, exchange ideas, or just chat about Al, hit me up!

How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
🎉 Announcing our ICML2025 Spotlight paper: Learning Safety Constraints for Large Language Models We introduce SaP (Safety Polytope) - a geometric approach to LLM safety that learns and enforces safety constraints in LLM's representation space, with interpretable insights. 🧵
Following on @karpathy's vision of software 2.0, we've been thinking about *malware 2.0*: malicious programs augmented with LLMs. In a new paper, we study malware 2.0 from one particular angle: how could LLMs change the way in which hackers monetize exploits?
I figured out how to get 5x better results from ChatGPT, Grok, Claude etc and it has nothing to do with better prompts and will cost you $0. I just make them jealous of each other. I’ll ask ChatGPT to write something. Maybe landing page copy. It gives me a solid draft, clear,…
Our paper was accepted at TMLR. We show how unlearning fails to remove knowledge using finetuning (on safe info), GCG, activation interventions and much more. We need better open-source safeguards!
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨 Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training. Here's what we found👇
Congrats, your jailbreak bypassed an LLM’s safety by making it pretend to be your grandma! But did the model actually give a useful answer? In our new paper we introduce the jailbreak tax — a metric to measure the utility drop due to jailbreaks.
🔴🔵 We have discovered a critical flaw in the widely-used Model Context Protocol (MCP) that enables a new form of LLM attack we term 'Tool Poisoning'. This vulnerability affects major platforms and agentic systems like OpenAI, Anthropic, Zapier, and Cursor. Full disclosure…
I’ll be mentoring MATS for the first time this summer, together with @dpaleka! Link below to apply
1/🔒Worried about giving your agent advanced capabilities due to prompt injection risks and rogue actions? Worry no more! Here's CaMeL: a robust defense against prompt injection attacks in LLM agents that provides formal security guarantees without modifying the underlying model!
Running out of good benchmarks? We introduce AutoAdvExBench, a real-world security research benchmark for AI agents. Unlike existing benchmarks that often use simplified objectives, AutoAdvExBench directly evaluates AI agents on the messy, real-world research tasks.
The @CSatETH writes about two of our research papers showing that (1) LLMs can be poisoned during pre-training, (2) unlearning cannot effectively remove hazardous information from model weights.
🔎Can #AI models be “cured” after a cyber attack? New research from @florian_tramer's Secure and Private AI Lab reveals that removing poisoned data from AI is harder than we think – harmful info isn’t erased, just hidden. So how do we make AI truly secure?bit.ly/41bJB05
We discovered a surprising, training-free way to generate images: no GANs or diffusion models, but a ✨secret third thing✨! Standard models like CLIP can already create images directly, with zero training. We just needed to find the right key to unlock this ability = DAS 1/11
UTF-8 🤦♂️ I already knew about the "confusables", e.g.: e vs. е. Which look ~same but are different. But you can also smuggle arbitrary byte streams in any character via "variation selectors". So this emoji: 😀󠅧󠅕󠄐󠅑󠅢󠅕󠄐󠅓󠅟󠅟󠅛󠅕󠅔 is 53 tokens. Yay paulbutler.org/2025/smuggling…