Mikhail Terekhov
@MiTerekhov
PhD in ML @ CLAIRE lab, EPFL. MATS 7.1. AI Control.
AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

Well, to avoid steganography, let's make sure our multi-agent LLM research workflows are composed of agents with different base models then
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data! ❌ You want rewards, but GRPO only works online? ❌ You want offline, but DPO is limited to preferences? ✅ QRPO can do both! 🧵Here's how we do it:
Michel Foucault is thought to be the world's most cited academic, with >1,440,000 citations. But Geoffrey Hinton has been catching up, accelerating over the last 5 years while Foucault decelerates. When will Hinton overtake Foucault - when is the Moment of Hintotality?
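The crossover question is, at heart, a simple rates problem. A minimal sketch below, with entirely made-up placeholder numbers (the tweet gives Foucault's total but not the annual rates, so the lead and per-year figures here are hypothetical, not real Google Scholar data):

```python
def years_until_crossover(lead, leader_rate, chaser_rate):
    """Years until the chaser catches the leader, assuming
    constant (linear) annual citation rates."""
    closing_speed = chaser_rate - leader_rate
    if closing_speed <= 0:
        return float("inf")  # chaser never catches up
    return lead / closing_speed

# Hypothetical numbers: leader is ahead by 500k citations and
# gains 40k/yr; chaser gains 90k/yr, closing 50k/yr.
t = years_until_crossover(500_000, 40_000, 90_000)
print(round(t, 1))  # → 10.0
```

Since the tweet notes Hinton is accelerating while Foucault decelerates, a quadratic fit to the yearly counts would pull the crossover earlier than this linear estimate.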
How do diffusion models create images, and can we control that process? We are excited to release an update to our SDXL Turbo sparse autoencoder paper. New title: One Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models Spoiler: We have FLUX SAEs now :)
a day in the life > wake up > OpenBrain new model > update arxiv draft "we test SotA models" to "advanced models" > rinse and repeat
A lot of my reading drive comes from sparse reinforcement. Occasionally I find a passage so good it justifies the whole endeavor. Twitter, however, has given that same feeling of discovery a negative connotation. Now every time I read I have cognitive dissonance.
goalposts moving so fast Einstein is in shambles
Today, we're releasing ARC-AGI-2. It's an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with. It keeps the same format as ARC-AGI-1, while significantly…