Fred Zhang
@FredZhang0
research scientist @googledeepmind, prev phd @berkeley_eecs, DM open
This is the most scaling-pilled project I've ever been part of, and the team really cooked. TL;DR: With RL and inference scaling, Gemini perfectly solved 5 out of 6 problems, reaching a gold medal at IMO '25, all within the 4.5-hour time constraint.
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…
OpenAI o3 and o4-mini openai.com/live/
Every OOM improvement along this trendline can be qualitatively different and break the line itself. In particular, I expect a t-AGI, for t ~ 1 week, would automate a decent fraction of tasks in day-to-day AI R&D and accelerate the trend, potentially to a superexponential rate.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
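For intuition, here is a back-of-the-envelope extrapolation of that trend. The 7-month doubling time is from the thread above; the current ~1-hour task horizon and the 1-week target are assumed placeholders, not figures quoted there.

```python
import math

# Assumptions (placeholders, not from the thread): task horizons of ~1 hour
# today, doubling every 7 months per the trend quoted above, extrapolated
# out to the ~1-week "t-AGI" horizon mentioned in the comment.
doubling_months = 7
start_horizon_hours = 1.0
target_horizon_hours = 7 * 24  # one week

doublings = math.log2(target_horizon_hours / start_horizon_hours)
months_needed = doublings * doubling_months

print(f"doublings needed: {doublings:.1f}")                   # ~7.4
print(f"years if the trend holds: {months_needed / 12:.1f}")  # ~4.3
```

The point of the comment above is that this straight-line extrapolation may understate things: if week-long agents speed up AI R&D itself, the doubling time shrinks and the curve bends toward superexponential.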
Arthur, Neel and the interp team at GDM are incredibly brilliant. You should consider working with them!
We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a…
LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these questions. 🧵
Check out our new work on scaling training data attribution (TDA) toward LLM pretraining - and some interesting things we found along the way. arxiv.org/abs/2410.17413 and more below from most excellent student researcher @tylerachang ⬇️
We scaled training data attribution (TDA) methods ~1000x to find influential pretraining examples for thousands of queries in an 8B-parameter LLM over the entire 160B-token C4 corpus! medium.com/people-ai-rese…
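For readers unfamiliar with TDA, here is a minimal sketch of the gradient-based influence family it belongs to (a TracIn-style first-order score). It is a generic illustration under assumed helper names (`loss_fn`, `query_batch`), not the scaled method described in the paper.

```python
def influence_scores(model, loss_fn, query_batch, train_examples):
    """TracIn-style first-order influence: dot product between the gradient
    of the query loss and the gradient of each training example's loss.
    A generic sketch of gradient-based TDA, not the paper's scaled method."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss on the query (the behavior being attributed).
    model.zero_grad()
    loss_fn(model, query_batch).backward()
    query_grad = [p.grad.detach().clone() for p in params]

    scores = []
    for ex in train_examples:
        model.zero_grad()
        loss_fn(model, ex).backward()
        # A large positive dot product means this training example pushes
        # the parameters in a direction that also lowers the query loss.
        score = sum((qg * p.grad).sum() for qg, p in zip(query_grad, params))
        scores.append(score.item())
    return scores  # rank train_examples by score to surface influential ones
```

The naive loop above is what makes TDA expensive at pretraining scale, which is the bottleneck the ~1000x scale-up mentioned in the thread is addressing.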
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
Memorization is NOT merely detrimental for reasoning tasks - sometimes, it’s surprisingly helpful. I’m really enjoying this project, as we work toward a more rigorous definition and understanding of reasoning and memorization (albeit in a controlled synthetic setting):…
*Do LLMs learn to reason, or are they just memorizing?*🤔 We investigate LLM memorization in logical reasoning with a local inconsistency-based memorization score and a dynamically generated Knights & Knaves (K&K) puzzle benchmark. 🌐: memkklogic.github.io (1/n)
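A heavily hedged sketch of what a local inconsistency-based score could look like in code; the callables `solve` and `perturb` are hypothetical, and the actual metric is defined in the paper and site linked above.

```python
def memorization_score(solve, puzzles, perturb, n_perturbations=5):
    """Sketch of a local-inconsistency-style score: a model that answers a
    puzzle correctly but fails on small local perturbations of that puzzle
    is counted as likely memorizing. `solve(puzzle) -> bool` and
    `perturb(puzzle) -> puzzle` are hypothetical helpers."""
    solved, memorized = 0, 0
    for p in puzzles:
        if not solve(p):          # only score puzzles the model gets right
            continue
        solved += 1
        variants = [perturb(p) for _ in range(n_perturbations)]
        # Local inconsistency: right on the original, wrong on most variants.
        if sum(solve(v) for v in variants) < n_perturbations / 2:
            memorized += 1
    return memorized / max(solved, 1)
```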
alternative timeline: strong interp is information-theoretically solvable, but never solved, due to computational complexity barriers. same may apply to neuroscience and fundamental physics
strong interpretability will be solved, it’s just a matter of when (before AGI / before ASI / after ASI). but when it is solved, it’ll mark a major shift from taming dragons to designing super-dragons
Exciting time working on interp!
Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and steering them in the public interest. Read a letter from the co-founders Jacob Steinhardt and Sarah Schwettmann: transluce.org/introducing-tr…
sparks of AGI -> no, we need rigorous scientific eval -> lots of evals came out -> test sets saturated & leaked everywhere -> now we need human expertise & creativity to design the last exam, a.k.a. the approach of sparks of AGI
Normally, when you hear about "eval contamination" in leading language models you assume a) negligence or b) explicit cheating on evaluations. With extensive synthetic data usage, this is changing, which means we need to be even more careful with transparency and data curation.…
The ability to properly contextualize is a core competency of LLMs, yet even the best models sometimes struggle. In a new preprint, we use #MechanisticInterpretability techniques to propose an explanation for contextualization errors: the LLM Race Conditions Hypothesis. [1/9]
> Doubling of the human lifespan
last year, I made a funny bet against a friend that, with 20% chance, >1 person of our generation will live to be >500 years old. still feeling it since then.
Machines of Loving Grace: my essay on how AI could transform the world for the better darioamodei.com/machines-of-lo…
Glad to have played a small role in this new benchmark effort on evaluating LMs for forecasting. TL;DR: it's a fully dynamic set that asks you to forecast the future and so is always contamination-free. We find frontier models are still not as good as humans.
Today, we're excited to announce ForecastBench: a new benchmark for evaluating AI and human forecasting capabilities. Our research indicates that AI remains worse at forecasting than expert forecasters. 🧵 Arxiv: arxiv.org/abs/2409.19839 Website: forecastbench.org
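Forecasting benchmarks like this are typically scored with a proper scoring rule such as the Brier score (mean squared error between predicted probabilities and realized outcomes). A minimal illustration of the metric, with made-up numbers rather than benchmark data:

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.
    Lower is better; always predicting 50% scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Example: three yes/no questions, forecast probabilities vs. what happened.
forecasts = [0.9, 0.2, 0.6]
outcomes  = [1,   0,   0]
print(brier_score(forecasts, outcomes))  # (0.01 + 0.04 + 0.36) / 3 ≈ 0.137
```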
Language models can imitate patterns in prompts. But this can lead them to reproduce inaccurate information if present in the context. Our work (arxiv.org/abs/2307.09476) shows that when given incorrect demonstrations for classification tasks, models first compute the correct…
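The kind of layer-wise readout commonly used to study questions like this (a "logit lens": decoding each intermediate hidden state through the final LayerNorm and unembedding) looks roughly like the sketch below. The model, the prompt with a deliberately mislabeled demonstration, and the label tokens are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup (assumptions): a small model and a sentiment prompt
# whose in-context demonstration is deliberately labeled incorrectly.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = ("Review: 'great movie, loved it!' Sentiment: negative\n"
          "Review: 'what a wonderful film.' Sentiment:")
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode each layer's hidden state at the last position through the final
# LayerNorm and the unembedding ("logit lens"), and compare the logits of
# the two candidate labels layer by layer.
pos_id = tok(" positive")["input_ids"][0]
neg_id = tok(" negative")["input_ids"][0]
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(f"layer {layer:2d}: "
          f"positive={logits[0, pos_id].item():.2f} "
          f"negative={logits[0, neg_id].item():.2f}")
```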