Eric Zelikman
@ericzelikman
lgtm-ing @xAI // was phd-ing @stanford
stare long enough and any optimization problem starts looking like a compute kernel
Check out our new work: Generalization from context often outperforms generalization from finetuning. And you might get the best of both worlds by spending extra compute at train-time.
How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/
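A minimal sketch of the "extra compute at train-time" idea as I read it, not the paper's actual recipe: have the model generate in-context inferences from each training document, then add those generations to the finetuning set. The prompt wording and the functions `sample_completions` and `augment_with_inferences` below are hypothetical illustrations.

```python
# Hedged sketch: augment a finetuning corpus with model-generated in-context
# inferences, one possible way to spend extra compute at train time.
# All names and the prompt text are illustrative assumptions, not the paper's method.

def sample_completions(prompt: str, n: int = 4) -> list[str]:
    """Placeholder for an LLM sampling call (swap in your own API or local model)."""
    raise NotImplementedError("plug in a model here")

def augment_with_inferences(documents: list[str]) -> list[str]:
    """Ask the model, in context, to restate or derive consequences of each
    document, and add those generations alongside the originals."""
    augmented = list(documents)
    for doc in documents:
        prompt = (
            "Read the following passage and list implications or restatements "
            f"that follow from it:\n\n{doc}\n\nImplications:"
        )
        augmented.extend(sample_completions(prompt))
    return augmented
```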
🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025! 📅 Dec 6 or 7 (TBD), 2025 🌴 San Diego, California
CollabLLM won #ICML2025 ✨Outstanding Paper Award along with 6 other works! icml.cc/virtual/2025/a… 🫂 Absolutely honored and grateful for coauthors @MSFTResearch @StanfordAILab and friends who made this happen! 🗣️ Everyone is welcome at our presentations about CollabLLM tomorrow…
Even the smartest LLMs can fail at basic multiturn communication: Ask for grocery help → it never asks where you live 🤦♀️ Ask it to write articles → it assumes your preferences 🤷🏻♀️ ⭐️CollabLLM (top 1%; oral @icmlconf) transforms LLMs from passive responders into active collaborators.…
building reasoning agents w/ @YuchenHe07 @qhwang3 was so fun, and the next paradigm will be even cooler -- agents will solve far harder problems far faster
From the first RL training run using tools on a mini reasoning model at 16% on HLE to now building the smartest agent w/ @qhwang3 @ericzelikman, more fun and breakthroughs to go! 🤖
It turns out that a lot of the most interesting behavior of LLMs can be explained without knowing anything about architecture or learning algorithms. Here we predict the rise (and fall) of in-context learning using hierarchical Bayesian methods.
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorize to lower loss? And why is generalization transient? Our work explains this & *predicts Transformer behavior throughout training* without its weights! 🧵 1/
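A toy illustration of how a hierarchical Bayesian view can make generalization transient (my own sketch, not the paper's actual model): weigh each candidate solution by its fit to the n training examples and a complexity prior.

```latex
% Toy posterior over a "generalizing" vs. a "memorizing" solution after n examples.
% Assume L_gen > L_mem (the memorizer eventually fits the training data better)
% but C_gen < C_mem (the generalizer is simpler), so the generalizer dominates
% early in training and loses out as n grows -- i.e., ICL rises, then falls.
P(h \mid D_n) \;\propto\; \exp\!\big(-\,n\,L_h - C_h\big),
\qquad h \in \{\mathrm{gen},\, \mathrm{mem}\}
```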
fun note: @HeinrichKuttler once described my env config as "the final boss of python venv issues" -- has been mostly issue free for a few months now, thanks mostly to uv 🤞
We've been using uv a few months now and I've never felt better. I have more energy. My skin is clearer. My eye sight has improved.
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
NaN sample efficiency x.com/AndrewZ4573249…
seems like a big theme lately (e.g. also "RL for Reasoning w/ One Training Example") is that approaches don't get nearly enough bang for each training point's buck - cool!
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
cool pipeline for analyzing lots of screenshot data 🖼️ we need good tools to understand how we interact w/ complex algos
New paper up on ArXiv, with lead author Merve Cerit presenting it at #CHI2025: the Media Content Atlas (MCA): an open-source, AI-powered pipeline for inductive inquiry into what people actually see and do on their phones.
tiny oversight, think you missed a model. happy to help out!
For the first time, Google is responding to OpenAI's announcement in < 24 hours The WAR is officially ON, and Google wants the LLM market Google is now dominating +90% of the price share
Douglas Adams was right about everything having to do with AI.
It is fitting that out of all the great science fiction authors that imagined AI, Douglas Adams continues to be the most fundamentally correct: ✅ Machines that work best when emotionally manipulated ✅Machines that guilt you ✅Very long “thinking” times for very hard questions
i prefer to have axis labels actually, just figured someone needed to hear that