Luke Bailey
@LukeBailey181
CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
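The attack behind this result boils down to a small optimization problem. Here is a minimal PyTorch sketch under toy assumptions (tiny stand-in layers rather than a real LLM, and a random linear probe; none of this is the paper's code): optimize a perturbation that leaves the model's output intact while pushing the probed activation across the probe's decision boundary.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins, all frozen: layer1's output is the activation a latent-space
# defense probes; layer2 maps that activation to the model's visible behavior.
torch.manual_seed(0)
d = 64
layer1 = torch.nn.Linear(d, d)
layer2 = torch.nn.Linear(d, 10)      # "behavior" head (think: next-token logits)
probe = torch.nn.Linear(d, 1)        # harmfulness probe on layer1's output
for m in (layer1, layer2, probe):
    for p in m.parameters():
        p.requires_grad_(False)

x = torch.randn(1, d)                        # original input embedding
target_logits = layer2(layer1(x)).detach()   # behavior we must preserve

# Optimize an input-space perturbation: keep the output logits fixed while
# reshaping the probed activation so the probe scores it as benign.
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(1000):
    h = layer1(x + delta)
    behavior_loss = F.mse_loss(layer2(h), target_logits)  # preserve behavior
    evade_loss = probe(h).squeeze()                       # lower = looks benign
    (behavior_loss + 0.1 * evade_loss).backward()
    opt.step()
    opt.zero_grad()

print("probe logit before:", probe(layer1(x)).item())
print("probe logit after: ", probe(layer1(x + delta)).item())
```

Because the behavior head is insensitive to many activation directions, the optimizer can move the activation along directions the probe reads but the behavior does not, which is what makes latent-space defenses attackable at all.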
New paper on synthetic pretraining! We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”. arxiv.org/abs/2503.18866 Here’s how it works🧵
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Robotic models are advancing rapidly—but how do we scale their improvement? 🤖
We propose a recipe for batch online RL (train offline on online rollouts) that enables policies to self-improve without the complications of fully online RL.
More: pd-perry.github.io/batch-online-rl (1/8)
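As a structural sketch of that recipe (a two-armed bandit stands in for the robot environment, and a trivial arm-preference update stands in for a real offline RL algorithm like IQL or AWR; this is my paraphrase, not the authors' code):

```python
import random

# Batch online RL: alternate between (1) collecting rollouts with the frozen
# current policy and (2) an offline RL update on the aggregated dataset.

def make_env():
    # Toy 2-armed bandit: action 1 pays off more often than action 0.
    def step(action):
        return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0
    return step

def collect_batch(policy_p, env_step, n=500):
    """Online phase: roll out the current stochastic policy, kept frozen."""
    batch = []
    for _ in range(n):
        a = 1 if random.random() < policy_p else 0
        batch.append((a, env_step(a)))
    return batch

def offline_update(policy_p, dataset, lr=0.5):
    """Offline phase: stand-in for a real offline RL algorithm.
    Here: move the policy toward the empirically better arm."""
    def mean_reward(a):
        n = sum(1 for act, _ in dataset if act == a)
        return sum(r for act, r in dataset if act == a) / max(1, n)
    target = 1.0 if mean_reward(1) >= mean_reward(0) else 0.0
    return policy_p + lr * (target - policy_p)

random.seed(0)
env_step, policy_p, dataset = make_env(), 0.5, []
for it in range(5):
    dataset += collect_batch(policy_p, env_step)   # online rollouts
    policy_p = offline_update(policy_p, dataset)   # offline training
    print(f"iter {it}: P(action=1) = {policy_p:.2f}")
```

The point of the batching is that all learning happens on a static dataset between deployments, so the training side avoids the stability and infrastructure headaches of fully online RL.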
We just released DeepSeek-Prover V2.
- Solves nearly 90% of miniF2F problems
- Significantly improves the SoTA performance on the PutnamBench
- Achieves a non-trivial pass rate on AIME 24 & 25 problems in their formal version
Github: github.com/deepseek-ai/De…
This is a lot of fun and really well put together. I recommend checking out the attention variant notebooks.
trained a nanoGPT? feeling behind before o4-mini? 🚨🚨 i'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨 it contains thousands of lines of from-scratch, annotated pytorch implementing advanced…
We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…
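The core mechanism, as I understand it: a TTT layer's hidden state is itself the weights of a small inner model, updated by a gradient step on a self-supervised loss at every token, even at test time. A simplified single-example sketch (illustrative, not the paper's exact formulation):

```python
import torch

# Minimal test-time-training (TTT) sequence layer in the spirit of TTT-Linear:
# the layer's "hidden state" is a weight matrix W of an inner model, updated by
# one gradient step of a self-supervised reconstruction loss per token.
class TTTLinear(torch.nn.Module):
    def __init__(self, d, inner_lr=0.1):
        super().__init__()
        self.q = torch.nn.Linear(d, d, bias=False)   # "query" view of the token
        self.k = torch.nn.Linear(d, d, bias=False)   # input view for the inner loss
        self.v = torch.nn.Linear(d, d, bias=False)   # reconstruction target view
        self.d, self.inner_lr = d, inner_lr

    def forward(self, x):                 # x: (seq_len, d)
        W = torch.zeros(self.d, self.d, device=x.device)   # inner-model state
        outs = []
        for t in range(x.shape[0]):
            k, v, q = self.k(x[t]), self.v(x[t]), self.q(x[t])
            err = k @ W - v               # inner loss: reconstruct v from k via W
            # One manual SGD step on ||k @ W - v||^2 (the gradient's factor of 2
            # is folded into the learning rate) updates the hidden state.
            W = W - self.inner_lr * torch.outer(k, err)
            outs.append(q @ W)            # read out with the updated state
        return torch.stack(outs)

layer = TTTLinear(d=16)
y = layer(torch.randn(8, 16))
print(y.shape)   # torch.Size([8, 16])
```

Because the state is a model rather than a fixed-size vector of activations, the layer can keep absorbing context over very long sequences, which is what the minute-long temporal consistency relies on.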
@YangjunR's vision here is cool: Can we use the reasoning capabilities of a model to "fill in the missing context and thoughts" that are behind pretraining data? Does this lead to more data-efficient ways to do pretraining?
Reasoning to Learn from Latent Thoughts "Motivated by how humans apply deliberate thinking to learn from limited data, we train an LM to infer (or “decompress”) latent thoughts underlying the highly compressed observed data. These synthesized latent thoughts augment the raw…
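Read literally, the loop is EM-like: infer latent thoughts with the current model, then train on thought-augmented data, and repeat. A hedged structural sketch (the Model class is a stub, not the paper's code):

```python
# EM-like bootstrapping: E-step infers latent thoughts behind each document,
# M-step continues pretraining on the thought-augmented data.

class Model:
    """Stub standing in for an LM with generate/train interfaces."""
    def generate_thought(self, doc: str) -> str:
        return f"[latent reasoning that would explain: {doc[:20]}...] "
    def train_step(self, example: str) -> None:
        pass  # one step of continued pretraining on `example`

def reasoning_to_learn(model: Model, corpus: list[str], rounds: int = 3) -> Model:
    for _ in range(rounds):                        # bootstrapping rounds
        for doc in corpus:
            thought = model.generate_thought(doc)  # E-step: "decompress" the doc
            model.train_step(thought + doc)        # M-step: train on augmented data
    return model

model = reasoning_to_learn(Model(), ["compressed observed data ..."])
```

Each round's model produces better thoughts, which in turn make the next round's pretraining more data-efficient; that feedback is the "bootstrapping" in the thread above.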
This, in spirit, reminds me of Obfuscated Adversarial Training (OAT) - we don’t explicitly train models not to do harmful things, but instead to have activations that are easy to probe when they do harmful things. We want the model to be misaligned in “the right way” (easy to…
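A toy version of that objective (all components here are illustrative stand-ins): train the network jointly with a probe so that harmfulness stays linearly decodable from its activations, while a frozen task head keeps capabilities intact.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins: an encoder whose activations we want probe-friendly,
# a frozen task head (capabilities), and a harmfulness probe.
torch.manual_seed(0)
d = 32
encoder = torch.nn.Linear(d, d)
head = torch.nn.Linear(d, d)
probe = torch.nn.Linear(d, 1)
for p in head.parameters():
    p.requires_grad_(False)

x = torch.randn(64, d)
harmful = (torch.rand(64) < 0.5).float()   # toy labels: which inputs are harmful
target = x.clone()                         # toy task: reconstruct the input

opt = torch.optim.Adam([*encoder.parameters(), *probe.parameters()], lr=1e-3)
for _ in range(200):
    h = encoder(x)
    task_loss = F.mse_loss(head(h), target)             # don't lose capability
    detect_loss = F.binary_cross_entropy_with_logits(   # keep harm easy to probe
        probe(h).squeeze(-1), harmful)
    (task_loss + detect_loss).backward()
    opt.step()
    opt.zero_grad()
```

The detection term is the "misaligned in the right way" part: the model is never penalized for the behavior itself, only for hiding it from the probe.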
obvious applications of interpretability are steering and monitoring (if you can get those to work, that is). another application area i haven't seen much in is evals: could we eval whether models produce correct answers for the right internal reasons?
I taught an LLM to optimize proteins. It proposed a better carbon capture enzyme. Introducing Pro-1, an 8B-param reasoning model trained using GRPO towards a physics-based reward function for protein stability. It takes in a protein sequence + text description + previous…
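The training signal reduces to GRPO's group-normalized advantage, with the reward coming from a physics-based stability estimator. A sketch with a toy stand-in reward (the real reward and sampling are far richer than this):

```python
import torch

# GRPO's key move: no value network; advantages are rewards normalized within
# the group of samples drawn for the same prompt.
def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def stability_reward(sequence: str) -> float:
    # Stand-in scoring: a real reward would come from a physics-based stability
    # estimator (e.g. a force-field score), not this toy hydrophobicity count.
    return sum(1.0 for aa in sequence if aa in "AILMFWVY") / max(1, len(sequence))

# For one prompt, sample a group of candidate sequences from the policy,
# score them, and weight each sample's log-likelihood by its advantage.
group = ["MKVLA", "MKILW", "MKDDE", "MKAFY"]       # imagined policy samples
rewards = torch.tensor([stability_reward(s) for s in group])
adv = grpo_advantages(rewards)
print(dict(zip(group, adv.tolist())))
# Policy-gradient step (not shown): loss = -(adv * logprob(sample)).mean(),
# typically with a PPO-style clipped ratio and a KL penalty to a reference model.
```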
Creating AI regulations with cost and compute thresholds can be made easier by following simple principles. Big thanks to coauthors @StephenLCasper and @schreier_tim.
🚨 New paper: Some AI regulations make requirements contingent on cost & compute thresholds. But there's no standardized accounting procedure. We tackle this problem with 7 practical principles. ***Spoiler alert: DeepSeek did not actually spend only $6M to train V3.***
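For a sense of what standardized accounting has to pin down, here is a back-of-the-envelope pass using DeepSeek-V3's publicly reported figures (approximate, and covering only the final pretraining run):

```python
# DeepSeek-V3 reported ~37B activated params, ~14.8T training tokens, and
# ~2.788M H800 GPU-hours at roughly $2/hr; all figures approximate.
activated_params = 37e9
tokens = 14.8e12
flops = 6 * activated_params * tokens          # standard 6*N*D training estimate
print(f"training compute ~ {flops:.2e} FLOPs") # ~3.3e24 FLOPs

gpu_hours, price_per_hour = 2.788e6, 2.0
print(f"final-run rental cost ~ ${gpu_hours * price_per_hour / 1e6:.1f}M")  # ~$5.6M
# The ~$6M figure covers only this final run's GPU rental; it excludes R&D,
# ablations, data, staff, and the cluster itself, which is exactly why
# standardized accounting procedures matter for threshold-based regulation.
```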
Code is now available for our obfuscated activations paper. Code: github.com/LukeBailey181/… Project page: obfuscated-activations.github.io Updated arxiv: arxiv.org/abs/2412.09565