Jon Saad-Falcon
@JonSaadFalcon
AI PhD @hazyresearch @StanfordAILab | Previously @databricks @allen_ai @GeorgiaTech
What is the best way to spend your inference compute budget to create LLM systems greater than the sum of their parts? In our latest paper, we present Archon, an architecture search framework for inference-time techniques! Archon is enabled by inference-time architecture search…
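A minimal sketch of the idea behind inference-time architecture search, not the actual Archon API: compose inference-time techniques such as sampling ensembles, ranking, and fusion, then search over configurations on held-out prompts. All function and model names below are illustrative assumptions.

```python
# Sketch: inference-time techniques as composable blocks, plus a grid
# search over their configuration. Placeholders, not Archon's real API.
import itertools
from typing import Callable, List

def generate(prompt: str, model: str, n: int) -> List[str]:
    """Placeholder for sampling n candidate responses from a model."""
    return [f"{model} answer {i} to: {prompt}" for i in range(n)]

def rank(prompt: str, candidates: List[str], top_k: int) -> List[str]:
    """Placeholder critic/ranker that keeps the top_k candidates."""
    return sorted(candidates)[:top_k]

def fuse(prompt: str, candidates: List[str]) -> str:
    """Placeholder fuser that merges candidates into one response."""
    return " | ".join(candidates)

def run_architecture(prompt, models, n_samples, top_k):
    candidates = [c for m in models for c in generate(prompt, m, n_samples)]
    return fuse(prompt, rank(prompt, candidates, top_k))

def search(prompts, scorer: Callable[[str, str], float]):
    """Grid-search architecture hyperparameters against a scorer."""
    best, best_score = None, float("-inf")
    for models, n, k in itertools.product(
        [["model-a"], ["model-a", "model-b"]], [1, 5, 10], [1, 3]
    ):
        score = sum(scorer(p, run_architecture(p, models, n, k)) for p in prompts)
        if score > best_score:
            best, best_score = (models, n, k), score
    return best
```

The point of the sketch: once techniques are modular blocks, "how to spend the inference budget" becomes a search problem over their composition.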
📢 As promised ✨, we're open-sourcing LMUnit! Our SoTA generative model for fine-grained criteria evaluation of your LLM responses 🎯 ✅ SoTA on FLASK & BiGGen Bench ✅ SoTA generative reward model on RewardBench2 🤗 Models available on @huggingface: tiny.cc/qjzp001 💻…
Excited to share 🤯 that our LMUnit models with @ContextualAI just claimed the top spots on RewardBench2 🥇 How did we manage to rank +5% higher than models like Gemini, Claude 4, and GPT-4.1? Details in the thread below: 🧵 1/11
LLM judges are powerful for automated evaluation but expensive and biased. 📣 Meet PAJAMA, a new framework that distills LLM judging logic into a compact, executable form (a new representation), cutting costs from thousands of dollars to just cents. 🚀 We'll present at ICML PRAL on Friday!
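A hedged sketch of the distillation idea the post describes, not PAJAMA's actual pipeline: label a small set of responses with the expensive LLM judge once, then fit a cheap, executable surrogate that replicates its decisions at near-zero marginal cost. The judge stand-in and feature choices here are illustrative assumptions.

```python
# Sketch: one expensive labeling pass, then a compact executable judge.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_judge(response: str) -> int:
    """Placeholder for an expensive LLM judge returning 1 (good) / 0 (bad)."""
    return int(len(response.split()) > 5)  # stand-in logic, not a real judge

responses = [
    "short answer",
    "a longer, well-supported answer with citations and reasoning",
    "ok",
    "a detailed response that addresses every part of the question",
]
labels = [llm_judge(r) for r in responses]  # paid once, up front

# Compact, executable surrogate: runs locally in microseconds thereafter.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression())
surrogate.fit(responses, labels)
print(surrogate.predict(["another candidate response to judge"]))
```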
Looking forward to attending ICML! Here are some works on memory/long context, verification, kernel design, multi-model AI systems, and theoretical understanding of test-time scaling from my awesome students and collaborators!
Today we launch Asimov. Asimov is our code research agent that is best-in-class in codebase comprehension. It is built for teams, built for enterprises, and built to remember. We use it every day to accelerate our velocity and streamline distributed ops. Link below to sign up…
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
Thrilled to share that I’ve joined @reflection_ai! We’re building superintelligent autonomous systems by co-designing research and product. Today, we’re launching Asimov. As AI benchmarks saturate, evaluation will increasingly live inside real-world products that are…
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
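A minimal sketch of dynamic chunking as the thread describes it at a high level: a learned boundary predictor decides where low-level units (bytes) end, and each chunk is pooled into one higher-level representation. This is an illustration of the concept, not the H-Net implementation.

```python
# Sketch: learned boundary probabilities -> variable-length chunk pooling.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.boundary = nn.Linear(d_model, 1)  # per-position boundary score

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) byte-level hidden states
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)  # boundary probs
        is_boundary = p > 0.5                            # hard chunk ends
        chunks = []
        for b in range(x.size(0)):
            ends = torch.nonzero(is_boundary[b]).flatten().tolist()
            start, seqs = 0, []
            for end in ends + [x.size(1) - 1]:           # always close last chunk
                if start > end:
                    continue
                seqs.append(x[b, start : end + 1].mean(dim=0))  # pool one chunk
                start = end + 1
            chunks.append(torch.stack(seqs))
        return chunks  # variable-length chunk embeddings per sequence

x = torch.randn(2, 16, 32)
print([c.shape for c in DynamicChunker(32)(x)])
```

In the real model the chunking is differentiable and learned end to end with the language modeling objective; the hard threshold above is only for readability.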
We've been tirelessly working for the last few years to change how machines think. H-Net unlocks the next step of architecture scaling, bringing together many of the lessons learned in the last 5 years of SSM research to break fundamentally new ground in learning end to end…
Efficient data curation is critical for modern ML. 📣 We introduce the Mimic Score, a new, lightweight, model-based metric for sample utility that leverages a reference model's weights to identify high-value samples and accelerate training. 🎉 Accepted as an Oral at ICML'25 DataWorld!
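A hedged sketch of a mimic-style utility score, going only off the post's description (score samples using a reference model's weights): here a sample scores highly when its gradient step points from the current weights toward the reference weights. This is an illustration, not necessarily the paper's exact formulation.

```python
# Sketch: per-sample utility = cosine alignment between the descent
# direction and the direction from current weights to reference weights.
import torch

def mimic_score(model, reference, loss_fn, x, y) -> float:
    model.zero_grad()
    loss_fn(model(x), y).backward()
    dots, g_norm2, d_norm2 = 0.0, 0.0, 0.0
    for p, p_ref in zip(model.parameters(), reference.parameters()):
        if p.grad is None:
            continue
        direction = p_ref.detach() - p.detach()  # toward reference weights
        g = -p.grad                              # descent direction
        dots += (g * direction).sum().item()
        g_norm2 += g.pow(2).sum().item()
        d_norm2 += direction.pow(2).sum().item()
    return dots / ((g_norm2 ** 0.5) * (d_norm2 ** 0.5) + 1e-12)

model, reference = torch.nn.Linear(4, 1), torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
print(mimic_score(model, reference, torch.nn.MSELoss(), x, y))
```

Ranking samples by such a score and keeping the top fraction is the usual way a metric like this plugs into curation.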
✨ Test-Time Scaling for Robotics ✨ Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs! 🧵(1 / N) 🌐 Website:…
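A minimal sketch of the sample-then-verify recipe that underlies this kind of test-time scaling for VLA policies: draw several candidate actions, score each with a verifier, execute the best. The policy and verifier below are toy placeholders, not RoboMonkey's API.

```python
# Sketch: best-of-N action selection for a stochastic VLA policy.
import random

def best_of_n(policy, verifier, observation, n: int = 16):
    candidates = [policy(observation, seed=i) for i in range(n)]
    scores = [verifier(observation, a) for a in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# Toy stand-ins to make the sketch runnable.
policy = lambda obs, seed: random.Random(seed).uniform(-1.0, 1.0)
verifier = lambda obs, action: -abs(action - 0.3)  # prefers actions near 0.3
print(best_of_n(policy, verifier, observation=None))
```

Characterizing how accuracy scales with N is exactly the "test-time scaling law" framing in the post.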
1/10 ML can solve PDEs – but precision 🔬 is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵 How it works:
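For context, a generic physics-informed residual loss of the kind this line of work drives toward machine precision, shown for the toy problem u'' = f on [0, 1]. This illustrates the training objective only, not BWLer's architecture itself.

```python
# Sketch: PDE residual loss via autograd; the model u is any differentiable
# function approximator (BWLer replaces this component, per the thread).
import torch

def pde_residual_loss(u, f, n_points: int = 128) -> torch.Tensor:
    x = torch.rand(n_points, 1, requires_grad=True)
    y = u(x)
    du = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - f(x)) ** 2).mean()  # drive the PDE residual to zero

u = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
loss = pde_residual_loss(u, f=lambda x: torch.sin(x))
loss.backward()
```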
Verdict systems can now judge image inputs. Score product photos. Ad creatives. UI mockups. Haize anime birds. Judge anything for any quality, and understand why.
Shrinking the Generation-Verification Gap with Weak Verifiers: "We introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. Weaver leverages weak supervision to estimate each verifier's accuracy and combines their outputs…"
See @JonSaadFalcon's post for more details: x.com/JonSaadFalcon/… Paper: arxiv.org/abs/2506.18203 Blog: hazyresearch.stanford.edu/blog/2025-06-1… Code: github.com/HazyResearch/s… Datasets and Models: huggingface.co/collections/ha…
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
Introducing Weaver, a test-time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny…
LLMs often generate correct answers but struggle to select them. Weaver tackles this by combining many weak verifiers (reward models, LM judges) into a stronger signal using statistical tools from Weak Supervision—matching o3-mini-level accuracy with much cheaper models! 📊
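A minimal sketch of the aggregation idea behind Weaver: estimate each weak verifier's accuracy without ground-truth labels, then combine their binary verdicts with accuracy-weighted log-odds. The unlabeled accuracy estimate below (agreement with the majority vote) is a crude stand-in for the weak supervision machinery the paper actually uses.

```python
# Sketch: accuracy-weighted aggregation of weak verifier verdicts.
import numpy as np

def combine_verifiers(votes: np.ndarray) -> np.ndarray:
    """votes: (n_candidates, n_verifiers) matrix of 0/1 verdicts."""
    majority = (votes.mean(axis=1) > 0.5).astype(float)
    # Crude label-free accuracy estimate: agreement with the majority vote.
    acc = np.clip((votes == majority[:, None]).mean(axis=0), 0.05, 0.95)
    w = np.log(acc / (1 - acc))       # log-odds weight per verifier
    return (votes * 2 - 1) @ w        # higher score = more likely correct

votes = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 1]])  # 3 candidates, 3 verifiers
print(combine_verifiers(votes).argmax())              # select best candidate
```

The weighting is what lets a mixture of cheap, individually unreliable verifiers act like one strong verifier.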
Very exciting work on using weak supervision for RL, closing the "generation-verification gap"!! Once again, principled approaches to labeling/data development are the keys!
What a throwback to weak supervision! Great work @JonSaadFalcon @ekellbuch @MayeeChen!
LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.