Jon Saad-Falcon
@JonSaadFalcon
AI PhD @hazyresearch @StanfordAILab | Previously @databricks @allen_ai @GeorgiaTech
What is the best way to spend your inference compute budget to create LLM systems greater than the sum of their parts? In our latest paper, we present Archon, an architecture search framework for inference-time techniques! Archon is enabled by inference-time architecture search…
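A minimal sketch of the idea behind inference-time architecture search, not the actual Archon API: compose inference-time techniques such as sampling ensembles, ranking, and fusion, then search over configurations on held-out prompts. All function and model names below are illustrative assumptions.

```python
# Sketch: inference-time techniques as composable blocks, plus a grid
# search over their configuration. Placeholders, not Archon's real API.
import itertools
from typing import Callable, List

def generate(prompt: str, model: str, n: int) -> List[str]:
    """Placeholder for sampling n candidate responses from a model."""
    return [f"{model} answer {i} to: {prompt}" for i in range(n)]

def rank(prompt: str, candidates: List[str], top_k: int) -> List[str]:
    """Placeholder critic/ranker that keeps the top_k candidates."""
    return sorted(candidates)[:top_k]

def fuse(prompt: str, candidates: List[str]) -> str:
    """Placeholder fuser that merges candidates into one response."""
    return " | ".join(candidates)

def run_architecture(prompt, models, n_samples, top_k):
    candidates = [c for m in models for c in generate(prompt, m, n_samples)]
    return fuse(prompt, rank(prompt, candidates, top_k))

def search(prompts, scorer: Callable[[str, str], float]):
    """Grid-search architecture hyperparameters against a scorer."""
    best, best_score = None, float("-inf")
    for models, n, k in itertools.product(
        [["model-a"], ["model-a", "model-b"]], [1, 5, 10], [1, 3]
    ):
        score = sum(scorer(p, run_architecture(p, models, n, k)) for p in prompts)
        if score > best_score:
            best, best_score = (models, n, k), score
    return best
```

The point of the sketch: once techniques are modular blocks, "how to spend the inference budget" becomes a search problem over their composition.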
📢 As promised ✨, we're open-sourcing LMUnit! Our SoTA generative model for fine-grained criteria evaluation of your LLM responses 🎯 ✅ SoTA on FLASK & BiGGen Bench ✅ SoTA generative reward model on RewardBench2 🤗 Models available on @huggingface: tiny.cc/qjzp001 💻…
Excited to share 🤯 that our LMUnit models with @ContextualAI just claimed the top spots on RewardBench2 🥇 How did we manage to rank +5% higher than models like Gemini, Claude 4, and GPT-4.1? Details in the thread below: 🧵 1/11
LLM judges are powerful for automated evaluation but expensive and biased. 📣 Meet PAJAMA, a new framework that distills LLM judging logic into a compact, executable form (a new representation), cutting costs from thousands of dollars to just cents. 🚀 We'll present at ICML PRAL on Friday!
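A hedged sketch of the distillation idea the post describes, not PAJAMA's actual pipeline: label a small set of responses with the expensive LLM judge once, then fit a cheap, executable surrogate that replicates its decisions at near-zero marginal cost. The judge stand-in and feature choices here are illustrative assumptions.

```python
# Sketch: one expensive labeling pass, then a compact executable judge.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_judge(response: str) -> int:
    """Placeholder for an expensive LLM judge returning 1 (good) / 0 (bad)."""
    return int(len(response.split()) > 5)  # stand-in logic, not a real judge

responses = [
    "short answer",
    "a longer, well-supported answer with citations and reasoning",
    "ok",
    "a detailed response that addresses every part of the question",
]
labels = [llm_judge(r) for r in responses]  # paid once, up front

# Compact, executable surrogate: runs locally in microseconds thereafter.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression())
surrogate.fit(responses, labels)
print(surrogate.predict(["another candidate response to judge"]))
```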
Looking forward to attending ICML! Here are some works on memory/long context, verification, kernel design, multi-model AI systems, and theoretical understanding of test-time scaling from my awesome students and collaborators!
Today we launch Asimov. Asimov is our code research agent that is best-in-class in codebase comprehension. It is built for teams, built for enterprises, and built to remember. We use it every day to accelerate our velocity and streamline distributed ops. Link below to sign up…
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
Thrilled to share that I’ve joined @reflection_ai! We’re building superintelligent autonomous systems by co-designing research and product. Today, we’re launching Asimov. As AI benchmarks saturate, evaluation will increasingly live inside real-world products that are…
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
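A minimal sketch of dynamic chunking as the thread describes it at a high level: a learned boundary predictor decides where low-level units (bytes) end, and each chunk is pooled into one higher-level representation. This is an illustration of the concept, not the H-Net implementation.

```python
# Sketch: learned boundary probabilities -> variable-length chunk pooling.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.boundary = nn.Linear(d_model, 1)  # per-position boundary score

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) byte-level hidden states
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)  # boundary probs
        is_boundary = p > 0.5                            # hard chunk ends
        chunks = []
        for b in range(x.size(0)):
            ends = torch.nonzero(is_boundary[b]).flatten().tolist()
            start, seqs = 0, []
            for end in ends + [x.size(1) - 1]:           # always close last chunk
                if start > end:
                    continue
                seqs.append(x[b, start : end + 1].mean(dim=0))  # pool one chunk
                start = end + 1
            chunks.append(torch.stack(seqs))
        return chunks  # variable-length chunk embeddings per sequence

x = torch.randn(2, 16, 32)
print([c.shape for c in DynamicChunker(32)(x)])
```

In the real model the chunking is differentiable and learned end to end with the language modeling objective; the hard threshold above is only for readability.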
We've been tirelessly working for the last few years to change how machines think. H-Net unlocks the next step of architecture scaling, bringing together many of the lessons learned in the last 5 years of SSM research to break fundamentally new ground in learning end to end…
Efficient data curation is critical for modern ML. 📣 We introduce the Mimic Score, a new, lightweight, model-based metric for sample utility that leverages a reference model's weights to identify high-value samples and accelerate training. 🎉 Accepted as an Oral at ICML'25 DataWorld!
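A hedged sketch of a mimic-style utility score, going only off the post's description (score samples using a reference model's weights): here a sample scores highly when its gradient step points from the current weights toward the reference weights. This is an illustration, not necessarily the paper's exact formulation.

```python
# Sketch: per-sample utility = cosine alignment between the descent
# direction and the direction from current weights to reference weights.
import torch

def mimic_score(model, reference, loss_fn, x, y) -> float:
    model.zero_grad()
    loss_fn(model(x), y).backward()
    dots, g_norm2, d_norm2 = 0.0, 0.0, 0.0
    for p, p_ref in zip(model.parameters(), reference.parameters()):
        if p.grad is None:
            continue
        direction = p_ref.detach() - p.detach()  # toward reference weights
        g = -p.grad                              # descent direction
        dots += (g * direction).sum().item()
        g_norm2 += g.pow(2).sum().item()
        d_norm2 += direction.pow(2).sum().item()
    return dots / ((g_norm2 ** 0.5) * (d_norm2 ** 0.5) + 1e-12)

model, reference = torch.nn.Linear(4, 1), torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
print(mimic_score(model, reference, torch.nn.MSELoss(), x, y))
```

Ranking samples by such a score and keeping the top fraction is the usual way a metric like this plugs into curation.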
✨ Test-Time Scaling for Robotics ✨ Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs! 🧵(1 / N) 🌐 Website:…
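A minimal sketch of the sample-then-verify recipe that underlies this kind of test-time scaling for VLA policies: draw several candidate actions, score each with a verifier, execute the best. The policy and verifier below are toy placeholders, not RoboMonkey's API.

```python
# Sketch: best-of-N action selection for a stochastic VLA policy.
import random

def best_of_n(policy, verifier, observation, n: int = 16):
    candidates = [policy(observation, seed=i) for i in range(n)]
    scores = [verifier(observation, a) for a in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# Toy stand-ins to make the sketch runnable.
policy = lambda obs, seed: random.Random(seed).uniform(-1.0, 1.0)
verifier = lambda obs, action: -abs(action - 0.3)  # prefers actions near 0.3
print(best_of_n(policy, verifier, observation=None))
```

Characterizing how accuracy scales with N is exactly the "test-time scaling law" framing in the post.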
1/10 ML can solve PDEs – but precision 🔬 is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵 How it works:
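For context, a generic physics-informed residual loss of the kind this line of work drives toward machine precision, shown for the toy problem u'' = f on [0, 1]. This illustrates the training objective only, not BWLer's architecture itself.

```python
# Sketch: PDE residual loss via autograd; the model u is any differentiable
# function approximator (BWLer replaces this component, per the thread).
import torch

def pde_residual_loss(u, f, n_points: int = 128) -> torch.Tensor:
    x = torch.rand(n_points, 1, requires_grad=True)
    y = u(x)
    du = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - f(x)) ** 2).mean()  # drive the PDE residual to zero

u = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
loss = pde_residual_loss(u, f=lambda x: torch.sin(x))
loss.backward()
```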
Verdict systems can now judge image inputs. Score product photos. Ad creatives. UI mockups. Haize anime birds. Judge anything for any quality, and understand why.
Shrinking the Generation-Verification Gap with Weak Verifiers: "We introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. Weaver leverages weak supervision to estimate each verifier's accuracy and combines their outputs…"
See @JonSaadFalcon's post for more details: x.com/JonSaadFalcon/… Paper: arxiv.org/abs/2506.18203 Blog: hazyresearch.stanford.edu/blog/2025-06-1… Code: github.com/HazyResearch/s… Datasets and Models: huggingface.co/collections/ha…
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
Introducing Weaver, a test-time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny…
LLMs often generate correct answers but struggle to select them. Weaver tackles this by combining many weak verifiers (reward models, LM judges) into a stronger signal using statistical tools from Weak Supervision—matching o3-mini-level accuracy with much cheaper models! 📊
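A minimal sketch of the aggregation idea behind Weaver: estimate each weak verifier's accuracy without ground-truth labels, then combine their binary verdicts with accuracy-weighted log-odds. The unlabeled accuracy estimate below (agreement with the majority vote) is a crude stand-in for the weak supervision machinery the paper actually uses.

```python
# Sketch: accuracy-weighted aggregation of weak verifier verdicts.
import numpy as np

def combine_verifiers(votes: np.ndarray) -> np.ndarray:
    """votes: (n_candidates, n_verifiers) matrix of 0/1 verdicts."""
    majority = (votes.mean(axis=1) > 0.5).astype(float)
    # Crude label-free accuracy estimate: agreement with the majority vote.
    acc = np.clip((votes == majority[:, None]).mean(axis=0), 0.05, 0.95)
    w = np.log(acc / (1 - acc))       # log-odds weight per verifier
    return (votes * 2 - 1) @ w        # higher score = more likely correct

votes = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 1]])  # 3 candidates, 3 verifiers
print(combine_verifiers(votes).argmax())              # select best candidate
```

The weighting is what lets a mixture of cheap, individually unreliable verifiers act like one strong verifier.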
Very exciting work on using weak supervision for RL, closing the "generation-verification gap"!! Once again, principled approaches to labeling/data development are the keys!
What a throwback to weak supervision! Great work @JonSaadFalcon @ekellbuch @MayeeChen!
LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.