Luke Bailey
@LukeBailey181
CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
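The attack behind this result boils down to a small optimization problem. Here is a minimal PyTorch sketch under toy assumptions (tiny stand-in layers rather than a real LLM, and a random linear probe; none of this is the paper's code): optimize a perturbation that leaves the model's output intact while pushing the probed activation across the probe's decision boundary.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins, all frozen: layer1's output is the activation a latent-space
# defense probes; layer2 maps that activation to the model's visible behavior.
torch.manual_seed(0)
d = 64
layer1 = torch.nn.Linear(d, d)
layer2 = torch.nn.Linear(d, 10)      # "behavior" head (think: next-token logits)
probe = torch.nn.Linear(d, 1)        # harmfulness probe on layer1's output
for m in (layer1, layer2, probe):
    for p in m.parameters():
        p.requires_grad_(False)

x = torch.randn(1, d)                        # original input embedding
target_logits = layer2(layer1(x)).detach()   # behavior we must preserve

# Optimize an input-space perturbation: keep the output logits fixed while
# reshaping the probed activation so the probe scores it as benign.
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(1000):
    h = layer1(x + delta)
    behavior_loss = F.mse_loss(layer2(h), target_logits)  # preserve behavior
    evade_loss = probe(h).squeeze()                       # lower = looks benign
    (behavior_loss + 0.1 * evade_loss).backward()
    opt.step()
    opt.zero_grad()

print("probe logit before:", probe(layer1(x)).item())
print("probe logit after: ", probe(layer1(x + delta)).item())
```

Because the behavior head is insensitive to many activation directions, the optimizer can move the activation along directions the probe reads but the behavior does not, which is what makes latent-space defenses attackable at all.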
New paper on synthetic pretraining! We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”. arxiv.org/abs/2503.18866 Here’s how it works🧵
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Robotic models are advancing rapidly—but how do we scale their improvement? 🤖
We propose a recipe for batch online RL (train offline on online rollouts) that enables policies to self-improve without the complications of fully online RL.
More: pd-perry.github.io/batch-online-rl (1/8)
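As a structural sketch of that recipe (a two-armed bandit stands in for the robot environment, and a trivial arm-preference update stands in for a real offline RL algorithm like IQL or AWR; this is my paraphrase, not the authors' code):

```python
import random

# Batch online RL: alternate between (1) collecting rollouts with the frozen
# current policy and (2) an offline RL update on the aggregated dataset.

def make_env():
    # Toy 2-armed bandit: action 1 pays off more often than action 0.
    def step(action):
        return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0
    return step

def collect_batch(policy_p, env_step, n=500):
    """Online phase: roll out the current stochastic policy, kept frozen."""
    batch = []
    for _ in range(n):
        a = 1 if random.random() < policy_p else 0
        batch.append((a, env_step(a)))
    return batch

def offline_update(policy_p, dataset, lr=0.5):
    """Offline phase: stand-in for a real offline RL algorithm.
    Here: move the policy toward the empirically better arm."""
    def mean_reward(a):
        n = sum(1 for act, _ in dataset if act == a)
        return sum(r for act, r in dataset if act == a) / max(1, n)
    target = 1.0 if mean_reward(1) >= mean_reward(0) else 0.0
    return policy_p + lr * (target - policy_p)

random.seed(0)
env_step, policy_p, dataset = make_env(), 0.5, []
for it in range(5):
    dataset += collect_batch(policy_p, env_step)   # online rollouts
    policy_p = offline_update(policy_p, dataset)   # offline training
    print(f"iter {it}: P(action=1) = {policy_p:.2f}")
```

The point of the batching is that all learning happens on a static dataset between deployments, so the training side avoids the stability and infrastructure headaches of fully online RL.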
We just released DeepSeek-Prover V2.
- Solves nearly 90% of miniF2F problems
- Significantly improves the SoTA performance on the PutnamBench
- Achieves a non-trivial pass rate on AIME 24 & 25 problems in their formal version
Github: github.com/deepseek-ai/De…
This is a lot of fun and really well put together. I recommend checking out the attention variant notebooks.
trained a nanoGPT? feeling behind before o4-mini? 🚨🚨 i'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨 it contains thousands of lines of from-scratch, annotated pytorch implementing advanced…
We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…
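The core mechanism, as I understand it: a TTT layer's hidden state is itself the weights of a small inner model, updated by a gradient step on a self-supervised loss at every token, even at test time. A simplified single-example sketch (illustrative, not the paper's exact formulation):

```python
import torch

# Minimal test-time-training (TTT) sequence layer in the spirit of TTT-Linear:
# the layer's "hidden state" is a weight matrix W of an inner model, updated by
# one gradient step of a self-supervised reconstruction loss per token.
class TTTLinear(torch.nn.Module):
    def __init__(self, d, inner_lr=0.1):
        super().__init__()
        self.q = torch.nn.Linear(d, d, bias=False)   # "query" view of the token
        self.k = torch.nn.Linear(d, d, bias=False)   # input view for the inner loss
        self.v = torch.nn.Linear(d, d, bias=False)   # reconstruction target view
        self.d, self.inner_lr = d, inner_lr

    def forward(self, x):                 # x: (seq_len, d)
        W = torch.zeros(self.d, self.d, device=x.device)   # inner-model state
        outs = []
        for t in range(x.shape[0]):
            k, v, q = self.k(x[t]), self.v(x[t]), self.q(x[t])
            err = k @ W - v               # inner loss: reconstruct v from k via W
            # One manual SGD step on ||k @ W - v||^2 (the gradient's factor of 2
            # is folded into the learning rate) updates the hidden state.
            W = W - self.inner_lr * torch.outer(k, err)
            outs.append(q @ W)            # read out with the updated state
        return torch.stack(outs)

layer = TTTLinear(d=16)
y = layer(torch.randn(8, 16))
print(y.shape)   # torch.Size([8, 16])
```

Because the state is a model rather than a fixed-size vector of activations, the layer can keep absorbing context over very long sequences, which is what the minute-long temporal consistency relies on.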
@YangjunR's vision here is cool: Can we use the reasoning capabilities of a model to "fill in the missing context and thoughts" that are behind pretraining data? Does this lead to more data-efficient ways to do pretraining?
Reasoning to Learn from Latent Thoughts "Motivated by how humans apply deliberate thinking to learn from limited data, we train an LM to infer (or “decompress”) latent thoughts underlying the highly compressed observed data. These synthesized latent thoughts augment the raw…
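Read literally, the loop is EM-like: infer latent thoughts with the current model, then train on thought-augmented data, and repeat. A hedged structural sketch (the Model class is a stub, not the paper's code):

```python
# EM-like bootstrapping: E-step infers latent thoughts behind each document,
# M-step continues pretraining on the thought-augmented data.

class Model:
    """Stub standing in for an LM with generate/train interfaces."""
    def generate_thought(self, doc: str) -> str:
        return f"[latent reasoning that would explain: {doc[:20]}...] "
    def train_step(self, example: str) -> None:
        pass  # one step of continued pretraining on `example`

def reasoning_to_learn(model: Model, corpus: list[str], rounds: int = 3) -> Model:
    for _ in range(rounds):                        # bootstrapping rounds
        for doc in corpus:
            thought = model.generate_thought(doc)  # E-step: "decompress" the doc
            model.train_step(thought + doc)        # M-step: train on augmented data
    return model

model = reasoning_to_learn(Model(), ["compressed observed data ..."])
```

Each round's model produces better thoughts, which in turn make the next round's pretraining more data-efficient; that feedback is the "bootstrapping" in the thread above.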
This, in spirit, reminds me of Obfuscated Adversarial Training (OAT) - we don’t explicitly train models not to do harmful things, but instead to have activations that are easy to probe when they do harmful things. We want the model to be misaligned in “the right way” (easy to…
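A toy version of that objective (all components here are illustrative stand-ins): train the network jointly with a probe so that harmfulness stays linearly decodable from its activations, while a frozen task head keeps capabilities intact.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins: an encoder whose activations we want probe-friendly,
# a frozen task head (capabilities), and a harmfulness probe.
torch.manual_seed(0)
d = 32
encoder = torch.nn.Linear(d, d)
head = torch.nn.Linear(d, d)
probe = torch.nn.Linear(d, 1)
for p in head.parameters():
    p.requires_grad_(False)

x = torch.randn(64, d)
harmful = (torch.rand(64) < 0.5).float()   # toy labels: which inputs are harmful
target = x.clone()                         # toy task: reconstruct the input

opt = torch.optim.Adam([*encoder.parameters(), *probe.parameters()], lr=1e-3)
for _ in range(200):
    h = encoder(x)
    task_loss = F.mse_loss(head(h), target)             # don't lose capability
    detect_loss = F.binary_cross_entropy_with_logits(   # keep harm easy to probe
        probe(h).squeeze(-1), harmful)
    (task_loss + detect_loss).backward()
    opt.step()
    opt.zero_grad()
```

The detection term is the "misaligned in the right way" part: the model is never penalized for the behavior itself, only for hiding it from the probe.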
obvious applications of interpretability are steering and monitoring (if you can get those to work, that is). another application area i haven't seen much in is evals: could we eval whether models produce correct answers for the right internal reasons?
I taught an LLM to optimize proteins. It proposed a better carbon capture enzyme. Introducing Pro-1, an 8B-param reasoning model trained using GRPO towards a physics-based reward function for protein stability. It takes in a protein sequence + text description + previous…
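The training signal reduces to GRPO's group-normalized advantage, with the reward coming from a physics-based stability estimator. A sketch with a toy stand-in reward (the real reward and sampling are far richer than this):

```python
import torch

# GRPO's key move: no value network; advantages are rewards normalized within
# the group of samples drawn for the same prompt.
def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def stability_reward(sequence: str) -> float:
    # Stand-in scoring: a real reward would come from a physics-based stability
    # estimator (e.g. a force-field score), not this toy hydrophobicity count.
    return sum(1.0 for aa in sequence if aa in "AILMFWVY") / max(1, len(sequence))

# For one prompt, sample a group of candidate sequences from the policy,
# score them, and weight each sample's log-likelihood by its advantage.
group = ["MKVLA", "MKILW", "MKDDE", "MKAFY"]       # imagined policy samples
rewards = torch.tensor([stability_reward(s) for s in group])
adv = grpo_advantages(rewards)
print(dict(zip(group, adv.tolist())))
# Policy-gradient step (not shown): loss = -(adv * logprob(sample)).mean(),
# typically with a PPO-style clipped ratio and a KL penalty to a reference model.
```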
Creating AI regulations with cost and compute thresholds can be made easier by following simple principles. Big thanks to coauthors @StephenLCasper and @schreier_tim.
🚨 New paper: Some AI regulations make requirements contingent on cost & compute thresholds. But there's no standardized accounting procedure. We tackle this problem with 7 practical principles. ***Spoiler alert: DeepSeek did not actually spend only $6M to train V3.***
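For a sense of what standardized accounting has to pin down, here is a back-of-the-envelope pass using DeepSeek-V3's publicly reported figures (approximate, and covering only the final pretraining run):

```python
# DeepSeek-V3 reported ~37B activated params, ~14.8T training tokens, and
# ~2.788M H800 GPU-hours at roughly $2/hr; all figures approximate.
activated_params = 37e9
tokens = 14.8e12
flops = 6 * activated_params * tokens          # standard 6*N*D training estimate
print(f"training compute ~ {flops:.2e} FLOPs") # ~3.3e24 FLOPs

gpu_hours, price_per_hour = 2.788e6, 2.0
print(f"final-run rental cost ~ ${gpu_hours * price_per_hour / 1e6:.1f}M")  # ~$5.6M
# The ~$6M figure covers only this final run's GPU rental; it excludes R&D,
# ablations, data, staff, and the cluster itself, which is exactly why
# standardized accounting procedures matter for threshold-based regulation.
```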
Code is now available for our obfuscated activations paper. Code: github.com/LukeBailey181/… Project page: obfuscated-activations.github.io Updated arxiv: arxiv.org/abs/2412.09565