Philipp Fränken
@jphilippfranken
post-training and RL @GoogleDeepMind
Presenting this tomorrow at @NeurIPSConf East Exhibit Hall A-C #2111 (4:30–7:30 p.m. PST). Come along if you want to chat about synthetic preference data with @gandhikanishk
Constitutional AI showed LMs can learn to follow constitutions by labeling their own outputs. But why can't we just tell a base model the principles of desired behavior and rely on it to act appropriately? Introducing SAMI: Self-Supervised Alignment with Mutual Information!
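A minimal sketch of what "alignment via mutual information" can look like, assuming an InfoNCE-style contrastive bound over constitution–response pairs (an illustration, not SAMI's exact objective; `sami_style_loss` and the precomputed log-probability matrix are hypothetical):

```python
import torch
import torch.nn.functional as F

def sami_style_loss(logprobs: torch.Tensor) -> torch.Tensor:
    # logprobs[i, j] = total log-probability of response j under the model
    # when conditioned on constitution i (computed elsewhere; hypothetical).
    # Matched pairs sit on the diagonal; off-diagonal entries are negatives.
    targets = torch.arange(logprobs.size(0), device=logprobs.device)
    # Symmetrized cross-entropy: each response should be most likely under
    # its own constitution, and each constitution should best explain its
    # own response -- an InfoNCE-style lower bound on mutual information.
    return 0.5 * (F.cross_entropy(logprobs, targets)
                  + F.cross_entropy(logprobs.T, targets))
```

Lowering this loss raises a lower bound on the mutual information between the stated principles and the model's own responses, with no preference labels involved.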
It turns out that a lot of the most interesting behavior of LLMs can be explained without knowing anything about architecture or learning algorithms. Here we predict the rise (and fall) of in-context learning using hierarchical Bayesian methods.
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorize to lower loss? And why is generalization transient? Our work explains this & *predicts Transformer behavior throughout training* without its weights! 🧵 1/
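As a toy illustration of the idea (my simplification, not the paper's hierarchical model), treat the network as Bayesian model averaging over candidate strategies whose posterior trades data fit against complexity; the generalizing strategy then dominates only in a middle regime, so in-context learning rises and later falls:

```python
import numpy as np

steps = np.arange(1, 20_001)
candidates = {  # (per-step log-likelihood slope, complexity cost); all numbers illustrative
    "uniform":    (-1.5, 0.0),      # ignores context: poor fit, zero complexity
    "generalize": (-1.0, 10.0),     # in-context rule: good fit, cheap to describe
    "memorize":   (-0.8, 2_000.0),  # lookup table: best fit, very expensive
}
log_post = np.stack([slope * steps - cost for slope, cost in candidates.values()])
post = np.exp(log_post - log_post.max(axis=0))
post /= post.sum(axis=0)
p_generalize = post[1]
# p_generalize is ~0 early, ~1 in a middle regime (ICL emerges), then ~0
# late once the memorizer's fit outweighs its complexity cost (ICL is
# transient) -- all without looking at any weights.
```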
🚨 New benchmark alert! 🚨 Can today’s LLMs implement tomorrow’s research ideas? We put them to the test. Introducing #ResearchCodeBench: 212 tasks from 2024–25 ML papers and code, most released after any model’s training cutoff. 🔗 researchcodebench.github.io 🧵
Tokasaurus is out! Happy Throughput Thursday to those who celebrate :)
Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with @achakravarthy01, @ryansehrlich, @EyubogluSabri, @brad19brown, @jshetaye,…
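A toy sketch of the continuous-batching idea behind throughput-oriented engines like this (not Tokasaurus's actual scheduler; `Request` and the stub decoder are made up for illustration): finished sequences leave the batch immediately and queued requests take their slots, so every decode step runs at full batch size:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.tokens) >= self.max_new_tokens

def serve(requests, decode_step, max_batch=8):
    queue, active, done = deque(requests), [], []
    while queue or active:
        # Admit queued requests the moment slots open, instead of
        # waiting for the whole batch to finish.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        decode_step(active)  # one decode step appends a token to each request
        done += [r for r in active if r.finished]
        active = [r for r in active if not r.finished]
    return done

# Toy usage with a stub decoder:
done = serve([Request("hi", 3), Request("yo", 5)],
             lambda batch: [r.tokens.append("tok") for r in batch])
```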
New Paper!! We try to understand why some LMs self-improve their reasoning while others hit a wall. The key? Cognitive behaviors! Read our paper on how the right cognitive behaviors can make all the difference in a model's ability to improve with RL! 🧵1/13
This is the dataset we curated for our own reasoning experiments. There is a lot of reasoning data coming out now, but we spent extra time on this one to make sure all the problems are high-quality and suitable for RL training!
thrilled to see Big-MATH climbing to #3️⃣ on @huggingface—clear signal the community wants more high-quality, verifiable RL datasets. grateful to everyone who’s been liking, downloading, and supporting ❤️
Releasing Big-MATH—the first heavily curated & verifiable dataset designed specifically for large-scale RL training & LLM reasoning! 📝 250,000+ problems, 47k NEW Q's ✅ 10x larger than existing datasets like MATH 🧑‍⚖️ Verifiable—we eliminated 400k+ problems. Details below! 🧵👇
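To make "verifiable" concrete: a minimal sketch of the kind of binary reward such a dataset enables for RL, assuming the common \boxed{} answer convention (Big-MATH's real checker is presumably more careful, e.g. about equivalent answer forms):

```python
import re

def extract_boxed(text: str):
    """Pull the final answer out of a \\boxed{...} span, the usual
    convention on MATH-style data."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m.group(1).strip() if m else None

def reward(model_output: str, gold_answer: str) -> float:
    # Binary, rule-based reward: no learned judge needed, which is what
    # makes large-scale RL on this data practical.
    pred = extract_boxed(model_output)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0

assert reward(r"... so the total is \boxed{42}", "42") == 1.0
assert reward("no final answer given", "42") == 0.0
```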
A note on hyperbole, halo, and language models. No, not about startup valuations!
arxiv.org/abs/2502.06204 work with the amazing @tsvilodub @gandhikanishk @HaoranZhaoHRZ @jphilipp95 @meanwhileina
Today, I launched Manas AI, a full-stack AI company setting out to shift drug discovery from a decade-long process to one that takes a few years, bringing life-saving treatments to patients faster than ever.
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. @ai_risk @scaleai
Scaling inference-time interaction
As we enter the world of test-time compute, we are seeing increasing returns by simply letting our agents do their thing for longer. For the first time, we are running our agent for hundreds of steps on these benchmarks. Instead of accumulating errors, CUA introspects, updates,…
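A minimal sketch of the long-horizon loop this implies, with assumed interfaces (`env`, `agent.act`, `agent.reflect` are illustrative, not the real CUA agent): act for many steps and periodically introspect to revise the plan rather than letting early errors compound:

```python
def run_agent(env, agent, max_steps=500, reflect_every=25):
    # agent is assumed to carry an initial `plan` attribute.
    obs = env.reset()
    history = []
    for step in range(max_steps):
        if step and step % reflect_every == 0:
            # Introspect: re-read recent history and revise the plan,
            # instead of accumulating errors across hundreds of steps.
            agent.plan = agent.reflect(history)
        action = agent.act(obs, agent.plan)
        obs, done = env.step(action)
        history.append((action, obs))
        if done:
            break
    return history
```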
Ever watched someone solve a hard math problem? Their first attempt is rarely perfect. They sketch ideas, cross things out, and try new angles. This process of exploration is key to human reasoning and our latest research formalizes this as Meta Chain-of-Thought (1/8) 🧵👇
We have a new position paper on "inference time compute" and what we have been working on over the last few months! We present some theory on why it is necessary, how it works, and what it means for "super" intelligence.
SynthLabs + Stanford presents: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought. Proposes Meta-CoT, which extends CoT by explicitly modeling the underlying reasoning required to arrive at a particular CoT
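One way to read Meta-CoT as code (my framing, not the paper's algorithm): the visible chain of thought is the trace of an underlying search that scores partial solutions and backtracks. A toy best-first version, with `propose_steps`, `is_solution`, and `score` as hypothetical callables:

```python
def meta_cot_search(propose_steps, is_solution, score, budget=100):
    frontier = [[]]                        # partial reasoning traces
    for _ in range(budget):
        if not frontier:
            break
        trace = max(frontier, key=score)   # expand the most promising trace
        frontier.remove(trace)             # abandoning it later = backtracking
        for step in propose_steps(trace):  # e.g. candidate steps sampled from an LM
            child = trace + [step]
            if is_solution(child):
                return child               # the linear CoT is this search's trace
            frontier.append(child)
    return None
```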
Presenting this cool paper led by @jphilippfranken. Come by today at 4:30 if you are around :)
How are AI Assistants being used in the real world? Our new research shows how to answer this question in a privacy-preserving way, automatically identifying trends in Claude usage across the world. 1/
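A hedged sketch of the pipeline shape such privacy-preserving analysis suggests (not Anthropic's actual code; `summarize`, `assign_topic`, and the threshold are assumptions): reduce each conversation to an identifier-free summary, cluster, and report only aggregates above a minimum size:

```python
from collections import Counter

MIN_CLUSTER_SIZE = 50  # assumed reporting threshold

def usage_trends(conversations, summarize, assign_topic):
    # summarize: conversation -> short, identifier-free summary (e.g. by an LM)
    # assign_topic: summary -> cluster label; both callables are hypothetical
    counts = Counter(assign_topic(summarize(c)) for c in conversations)
    # Report only aggregate clusters above a minimum size, so no single
    # conversation (or user) can be singled out.
    return {topic: n for topic, n in counts.items() if n >= MIN_CLUSTER_SIZE}
```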
If you're at NeurIPS, come tomorrow for the Oral+Poster on "Learning Formal Mathematics from Intrinsic Motivation"! Really fun work with @DavidKarlBroman @nickhaber @noahdgoodman that puts together much of what I did in the past years, with a new twist of open-ended learning!
Excited that @GabrielPoesia will be presenting his Oral on Learning Formal Mathematics From Intrinsic Motivation. We make and prove conjectures from scratch, without any human data, by learning what is hard but provable. Gabe’s on the job market, btw. neurips.cc/virtual/2024/o…
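A toy sketch of that intrinsic-motivation loop as I read it (hypothetical interfaces, not the paper's implementation): propose conjectures, attempt proofs, and keep the ones that are provable but took effort, so both models chase the frontier of "hard but provable":

```python
def curriculum_round(conjecturer, prover, n=1000, max_attempts=64):
    """One round of self-improvement with no human data: sample conjectures,
    search for proofs, and train on the hard-but-provable ones."""
    keep = []
    for _ in range(n):
        conjecture = conjecturer.sample()
        proof, attempts = prover.search(conjecture, max_attempts)
        # Reward the frontier: provable, but not on the first try.
        if proof is not None and attempts > 1:
            keep.append((conjecture, proof))
    conjecturer.update(keep)  # shift proposals toward the frontier of ability
    prover.update(keep)       # learn from the successful proofs
    return keep
```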