Yangjun Ruan
@YangjunR
Visiting @stanfordAILab | ML Ph.D. student @UofT & @VectorInst
New paper on synthetic pretraining! We show LMs can synthesize their own thoughts for more data-efficient pretraining, bootstrapping their capabilities on limited, task-agnostic data. We call this new paradigm “reasoning to learn”. arxiv.org/abs/2503.18866 Here’s how it works🧵

Reasoning to Learn from Latent Thoughts
Author's Explanation: x.com/YangjunR/statu…
Overview: This paper enhances LLM pretraining data efficiency under data constraints by inferring latent thoughts underlying web text, significantly improving math performance from 5.7% to…
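A minimal sketch of the augmentation loop as the overview describes it: infer the latent thoughts behind each passage, then pretrain on the thoughts followed by the text. The prompt wording and the `lm.generate` interface are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of "reasoning to learn" data augmentation, per the tweet's summary.
# The prompt and the lm.generate interface are assumed for illustration.

def infer_latent_thoughts(lm, passage: str) -> str:
    """Ask a language model to reconstruct the reasoning behind a passage."""
    prompt = (
        "Here is a passage of text:\n\n"
        f"{passage}\n\n"
        "Write out the background knowledge and step-by-step reasoning "
        "an expert would have had in mind while writing it."
    )
    return lm.generate(prompt)

def augment_corpus(lm, corpus: list[str]) -> list[str]:
    """Pair each passage with its inferred thoughts for pretraining."""
    examples = []
    for passage in corpus:
        thoughts = infer_latent_thoughts(lm, passage)
        # Train on thoughts followed by the original text, so the model
        # learns to predict the text conditioned on the reasoning.
        examples.append(thoughts + "\n\n" + passage)
    return examples
```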
We are happy to announce our @NeurIPSConf workshop on LLM evaluations! Mastering LLM evaluation is no longer optional -- it's fundamental to building reliable models. We'll tackle the field's most pressing evaluation challenges. For details: sites.google.com/corp/view/llm-…. 1/3
What makes a great scientist? Most AI scientist benchmarks miss the key skill: designing and analyzing experiments. 🧪 We're introducing SciGym: the first simulated lab environment to benchmark #LLMs on experimental design and analysis capabilities. #AI4SCIENCE #ICML25
Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're…
Some real metrics in the wild.
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
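A quick aside on the arithmetic behind those headline numbers. The task times below are invented; only the 20% and 19% figures come from the tweet.

```python
# Toy numbers to illustrate the perception gap.
time_without_ai = 100.0   # minutes per task, control condition (made up)
time_with_ai = 119.0      # minutes per task, AI condition (made up)

measured = (time_with_ai - time_without_ai) / time_without_ai
print(f"measured change in time: {measured:+.0%}")   # +19%, i.e. slower

perceived_speedup = 0.20   # developers estimated they were 20% faster
print(f"perceived change in time: -{perceived_speedup:.0%}")
```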
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
Giving your models more time to think before prediction, e.g. via smart decoding, chain-of-thought reasoning, latent thoughts, etc., turns out to be quite effective for unlocking the next level of intelligence. New post is here :) “Why we think”: lilianweng.github.io/posts/2025-05-…
Putting It All into Context: Simplifying Agents with LCLMs
Putting all the core code in the context often leads to better performance on SWE-bench than using agent scaffolding.
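A hedged sketch of the baseline the tweet describes: instead of an agent loop with tools, dump the repo's code into one long-context prompt and ask for a patch in a single call. The file filtering and the `lm.generate` interface are assumptions for illustration, not the paper's code.

```python
# "Put everything in context" baseline: one LCLM call over the whole repo.
from pathlib import Path

def build_full_context_prompt(repo_root: str, issue: str) -> str:
    parts = [f"Issue to fix:\n{issue}\n\nRepository files:"]
    for path in sorted(Path(repo_root).rglob("*.py")):
        parts.append(f"\n### {path} ###\n{path.read_text(errors='ignore')}")
    parts.append("\nWrite a unified diff that fixes the issue.")
    return "\n".join(parts)

def solve_with_lclm(lm, repo_root: str, issue: str) -> str:
    # A single model call over the full codebase, no scaffolding or tools.
    return lm.generate(build_full_context_prompt(repo_root, issue))
```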
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
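One plausible recipe for synthesizing SWE-style tasks from real repos, heavily hedged: introduce a small code mutation and keep it as a training task only if a previously passing test suite now fails. This is a guess at the general idea, not the SWE-smith toolkit's actual API.

```python
# Hypothetical task synthesis: a mutation counts as a bug-fix task only if
# it flips the repo's test suite from passing to failing.
import subprocess
from pathlib import Path

def tests_pass(repo_dir: str) -> bool:
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def make_bug_task(repo_dir: str, rel_path: str, original: str, mutated: str) -> bool:
    """Apply a candidate mutation; keep it only if it breaks a passing suite."""
    path = Path(repo_dir) / rel_path
    source = path.read_text()
    if original not in source or not tests_pass(repo_dir):
        return False
    path.write_text(source.replace(original, mutated, 1))
    if tests_pass(repo_dir):        # mutation not caught: revert and discard
        path.write_text(source)
        return False
    return True                     # the failing tests define a fix-the-bug task
```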
I have been very impressed by how much effort Tristan has spent on extensive validation of his perplexity correlation idea, even *after* the paper was accepted. Check out this simple idea that works!
At #ICLR, check out Perplexity Correlations: a statistical framework to select the best pretraining data with no LLM training! I can’t make the trip, but @tatsu_hashimoto will present the poster for us! Continue reading for the latest empirical validations of PPL Correlations:
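A minimal sketch of the idea as the tweet states it: use many existing public models (no new training), and for each candidate pretraining domain, correlate the models' perplexity on that domain with their benchmark scores; favor domains where lower loss tracks higher accuracy. The arrays and the plain Pearson ranking rule are illustrative, not the paper's exact estimator.

```python
# Perplexity-correlations sketch: rank pretraining domains by how strongly
# per-domain loss predicts downstream accuracy across public models.
import numpy as np

def rank_domains(log_ppl: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """
    log_ppl: (n_models, n_domains) log-perplexity of each public model
             on held-out text from each candidate domain.
    bench:   (n_models,) benchmark accuracy of each model.
    Returns domain indices sorted from most to least promising.
    """
    n_domains = log_ppl.shape[1]
    corr = np.array([
        np.corrcoef(log_ppl[:, d], bench)[0, 1] for d in range(n_domains)
    ])
    # Lower loss should go with higher accuracy, so the most negative
    # correlations mark the best pretraining domains.
    return np.argsort(corr)
```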
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…
OpenAI o3 and o4-mini openai.com/live/
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…
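A very rough sketch of what a test-time-training (TTT) layer does, hedged: the layer's "hidden state" is a small weight matrix that is updated by a gradient step on a self-supervised loss as each token is processed. The reconstruction loss, single inner step, and zero init are simplifications for illustration, not the paper's implementation.

```python
# Toy TTT layer: weights updated online by an inner-loop gradient step.
import torch

def ttt_layer(tokens: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """tokens: (seq_len, dim). Returns the transformed sequence."""
    seq_len, dim = tokens.shape
    W = torch.zeros(dim, dim, requires_grad=True)
    outputs = []
    for t in range(seq_len):
        x = tokens[t]
        loss = ((x @ W - x) ** 2).mean()     # self-supervised reconstruction
        (grad,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            W -= lr * grad                   # one inner-loop step per token
        outputs.append((x @ W).detach())
    return torch.stack(outputs)
```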
Language models learn inefficiently from compressed web text, requiring excessive data. This paper augments pretraining data with inferred "latent thoughts" (reasoning, context) underlying the text, improving data efficiency. Training on text paired with synthetic thoughts…
An LLM generates an article verbatim—did it “train on” the article? It’s complicated: under n-gram definitions of train-set inclusion, LLMs can complete “unseen” texts—both after data deletion and adding “gibberish” data. Our results impact unlearning, MIAs & data transparency🧵
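For concreteness, a minimal sketch of an n-gram definition of train-set inclusion, the kind of definition the thread says can be gamed: a candidate text counts as "in" the corpus only if every one of its n-grams appears there. Whitespace tokenization and n=8 are arbitrary choices for illustration.

```python
# N-gram membership test: is every n-gram of the candidate in the corpus?
def ngrams(text: str, n: int = 8):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def included(candidate: str, corpus: list[str], n: int = 8) -> bool:
    train_ngrams = set().union(*(ngrams(doc, n) for doc in corpus))
    return ngrams(candidate, n) <= train_ngrams
```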
🚨This week's top AI/ML research papers:
- GPT-4o System Card: Native Image Generation
- Anthropic's On the Biology of an LLM
- Gemma 3 Technical Report
- Qwen2.5-Omni Technical Report
- Reasoning to Learn from Latent Thoughts
- Defeating Prompt Injections by Design
- Scaling…