Robert Washbourne
@rawsh0
ai @ zyphra
should you prompt engineer or finetune? probably both. using the extremely generalizable benchmark of llm chess puzzle solving ability, combining DSPy with finetuning improves gpt-4o-mini accuracy by 280%. raw.sh/posts/chess_pu…
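For context, a minimal sketch of what a DSPy puzzle-solving program plus a few-shot optimizer looks like; the field names, exact-match metric, and tiny trainset here are assumptions for illustration, not the pipeline from the linked post:

```python
import dspy

# Assumed model id and a recent dspy version; the linked post's setup may differ.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SolvePuzzle(dspy.Signature):
    """Given a chess position (FEN), return the best move in UCI notation."""
    fen = dspy.InputField()
    best_move = dspy.OutputField()

program = dspy.ChainOfThought(SolvePuzzle)

def exact_match(example, pred, trace=None):
    # A puzzle counts as solved only if the predicted move matches the reference move.
    return example.best_move.strip() == pred.best_move.strip()

# Hypothetical one-example trainset (real runs would use many puzzles).
trainset = [
    dspy.Example(fen="6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1",
                 best_move="d1d8").with_inputs("fen"),
]

# Prompt-optimization step; a finetuned gpt-4o-mini can then be swapped in as the LM.
optimizer = dspy.BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled = optimizer.compile(program, trainset=trainset)
```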
Pass@1024 results of our RL model (AceReason-Nemotron-7B) and its starting SFT model (DeepSeek-R1-Distill-Qwen-7B) on LiveCodeBench-v6, which features a large answer space and high-quality test cases that are difficult to solve through 'guessing', even with extensive sampling.…
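For reference, the standard unbiased pass@k estimator behind numbers like Pass@1024 (Chen et al., 2021); the sample counts below are made up for illustration:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n total samples (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative only: 1024 samples per problem, 37 of them correct.
print(pass_at_k(1024, 37, 1))      # ~0.036
print(pass_at_k(1024, 37, 1024))   # 1.0 (at least one correct sample exists)
```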
Introducing AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning (RL) We propose conducting RL on math-only prompts first, then on code-only prompts. Our key findings include: - Math-only RL significantly boosts both math and code benchmarks! -…
Thanks for reproducing our results! This helps a lot. Our Qwen model checkpoints are released here: huggingface.co/collections/yp…; happy for people to give them a try!
Why is there a drop from what's claimed in the paper? The authors evaluate with n=8 rather than with different seeds, and use a different prompt and different sampling settings (standardized across the tested models). Their codebase is public and the results look reproducible. Details of my runs:
reach out if you want to work with me and others on novel architectures for pretraining! dms are open jobs.ashbyhq.com/zyphra/e509d43…
next o series model needs a “table formatting not totally fucked” reward
Zyphra is expanding! Join our growing team in Palo Alto. We have multiple roles open across multimodal foundation models, RL, product, and infrastructure. Check them out here: jobs.ashbyhq.com/zyphra
A few more observations after replicating the Tower of Hanoi game with their exact prompts: - You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff. - Furthermore, the output limit for Sonnet 3.7 is 128k, DeepSeek R1 is 64k, and…
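A quick back-of-the-envelope check of that arithmetic, assuming ~10 output tokens per move plus a constant overhead (the overhead value is a guess):

```python
# Rough output-token budget for Tower of Hanoi with N disks.
def min_output_tokens(n_disks: int, tokens_per_move: int = 10, overhead: int = 200) -> int:
    moves = 2 ** n_disks - 1          # optimal solution length
    return moves * tokens_per_move + overhead

limits = {"Sonnet 3.7": 128_000, "DeepSeek R1": 64_000}
for model, limit in limits.items():
    n = 1
    while min_output_tokens(n + 1) <= limit:
        n += 1
    print(f"{model}: full move list fits only up to N = {n}")
```

Under these assumptions even a perfect solver cannot print the full move list much past N ≈ 13, regardless of reasoning ability.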
Apple just GaryMarcus'd LLM reasoning ability
wonder if the response-length collapse is a training artifact — full chain-of-thought on the toughest questions loses coherence as the context stretches → gets penalized → the model gradually learns to abort early with stub reasoning, maybe due to length penalties
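A toy model of that hypothesis (all numbers invented): if long chains on the hardest problems rarely land, even a small per-token length penalty can make the early-abort stub the higher-expected-reward behavior:

```python
# Expected reward of a response = P(correct) * 1.0 minus a per-token length penalty.
def expected_reward(p_correct: float, n_tokens: int, penalty_per_token: float = 1e-5) -> float:
    return p_correct - penalty_per_token * n_tokens

long_cot = expected_reward(p_correct=0.05, n_tokens=20_000)  # long, rarely coherent
stub     = expected_reward(p_correct=0.02, n_tokens=500)     # aborts early
print(long_cot, stub)  # -0.15 vs 0.015: the stub wins under this penalty
```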

Looking at this further, I think “Incorrect Baselines” is using reference numbers from Table 3 of the Sober Reasoning paper, which uses the optimal temperature and top-p settings for each model, and comparing them to reported results obtained with untuned sampling settings
Confused about the recent LLM RLVR tweet which claims reported accuracy gains can totally reverse? I was too. Until I realized some of the comparisons are unstandardized. I compiled discrepancies in a thread below 🧵👇
I would argue "Sober Reasoning" actually shows models are generalizing fairly well - if the average score across seeds of the RL model matches the best seed of the base model, that seems like a fairly significant improvement (especially with totally different prompts than those used in training)
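Concretely, the comparison being argued for, with made-up per-seed accuracies:

```python
import numpy as np

# Hypothetical per-seed accuracies (not real numbers), four eval seeds each.
base_seeds = np.array([0.412, 0.455, 0.428, 0.401])
rl_seeds   = np.array([0.458, 0.463, 0.449, 0.470])

# RL mean over seeds matching or beating the base model's *best* seed is the
# conservative comparison described above.
print("base model, best seed:     ", base_seeds.max())
print("RL model, mean over seeds: ", rl_seeds.mean(), "+/-", rl_seeds.std(ddof=1))
```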

We always want to scale up RL, yet simply training longer doesn't necessarily push the limits - exploration gets impeded by entropy collapse. We show that the performance ceiling is surprisingly predictable, and the collapse is driven by covariance between logp and advantage.
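A sketch of the diagnostic implied there: track the covariance between sampled-token log-probs and advantages alongside policy entropy during RL training (the flat 1-D tensor layout is an assumption):

```python
import torch

def logp_advantage_cov(logps: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Population covariance between sampled-token log-probs and advantages.

    Per the claim above, a persistently positive covariance means updates keep
    sharpening already-likely tokens, which drives policy entropy down.
    """
    lp = logps - logps.mean()
    adv = advantages - advantages.mean()
    return (lp * adv).mean()

# usage sketch (hypothetical batch tensors): log this next to policy entropy
# cov = logp_advantage_cov(batch_logps.detach(), batch_advantages)
```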