Robert Washbourne
@rawsh0
ai @ zyphra
should you prompt engineer or finetune? probably both. using the extremely generalizable benchmark of llm chess puzzle solving ability, combining DSPy with finetuning improves gpt-4o-mini accuracy by 280%. raw.sh/posts/chess_pu…
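For context, a minimal sketch of what a DSPy puzzle-solving program plus a few-shot optimizer looks like; the field names, exact-match metric, and tiny trainset here are assumptions for illustration, not the pipeline from the linked post:

```python
import dspy

# Assumed model id and a recent dspy version; the linked post's setup may differ.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SolvePuzzle(dspy.Signature):
    """Given a chess position (FEN), return the best move in UCI notation."""
    fen = dspy.InputField()
    best_move = dspy.OutputField()

program = dspy.ChainOfThought(SolvePuzzle)

def exact_match(example, pred, trace=None):
    # A puzzle counts as solved only if the predicted move matches the reference move.
    return example.best_move.strip() == pred.best_move.strip()

# Hypothetical one-example trainset (real runs would use many puzzles).
trainset = [
    dspy.Example(fen="6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1",
                 best_move="d1d8").with_inputs("fen"),
]

# Prompt-optimization step; a finetuned gpt-4o-mini can then be swapped in as the LM.
optimizer = dspy.BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled = optimizer.compile(program, trainset=trainset)
```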
Pass@1024 results of our RL model (AceReason-Nemotron-7B) and its starting SFT model (DeepSeek-R1-Distill-Qwen-7B) on LiveCodeBench-v6, which features a large answer space and high-quality test cases that are difficult to solve through 'guessing', even with extensive sampling.…
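For reference, the standard unbiased pass@k estimator behind numbers like Pass@1024 (Chen et al., 2021); the sample counts below are made up for illustration:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n total samples (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative only: 1024 samples per problem, 37 of them correct.
print(pass_at_k(1024, 37, 1))      # ~0.036
print(pass_at_k(1024, 37, 1024))   # 1.0 (at least one correct sample exists)
```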
Introducing AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning (RL) We propose conducting RL on math-only prompts first, then on code-only prompts. Our key findings include: - Math-only RL significantly boosts both math and code benchmarks! -…
Thanks for reproducing our results! This helps a lot. Our Qwen model checkpoints are released here: huggingface.co/collections/yp…; happy for people to give them a try!
Why is there a drop from what's claimed in the paper? The authors evaluate with n=8 rather than with different seeds, and use a different prompt and different sampling settings (standardized across the tested models). Their codebase is public and the results look reproducible. Details of my runs:
reach out if you want to work with me and others on novel architectures for pretraining! dms are open jobs.ashbyhq.com/zyphra/e509d43…
next o series model needs a “table formatting not totally fucked” reward
Zyphra is expanding! Join our growing team in Palo Alto. We have multiple roles open across multimodal foundation models, RL, product, and infrastructure. Check them out here: jobs.ashbyhq.com/zyphra
A few more observations after replicating the Tower of Hanoi game with their exact prompts: - You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff. - Furthermore, the output limit for Sonnet 3.7 is 128k, DeepSeek R1 is 64k, and…
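A quick back-of-the-envelope check of that arithmetic, assuming ~10 output tokens per move plus a constant overhead (the overhead value is a guess):

```python
# Rough output-token budget for Tower of Hanoi with N disks.
def min_output_tokens(n_disks: int, tokens_per_move: int = 10, overhead: int = 200) -> int:
    moves = 2 ** n_disks - 1          # optimal solution length
    return moves * tokens_per_move + overhead

limits = {"Sonnet 3.7": 128_000, "DeepSeek R1": 64_000}
for model, limit in limits.items():
    n = 1
    while min_output_tokens(n + 1) <= limit:
        n += 1
    print(f"{model}: full move list fits only up to N = {n}")
```

Under these assumptions even a perfect solver cannot print the full move list much past N ≈ 13, regardless of reasoning ability.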
Apple just GaryMarcus'd LLM reasoning ability
wonder if the response-length collapse is a training artifact — full chain-of-thought on the toughest questions loses coherence as the context stretches → gets penalized → the model gradually learns to abort early with stub reasoning, maybe due to length penalties
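A toy model of that hypothesis (all numbers invented): if long chains on the hardest problems rarely land, even a small per-token length penalty can make the early-abort stub the higher-expected-reward behavior:

```python
# Expected reward of a response = P(correct) * 1.0 minus a per-token length penalty.
def expected_reward(p_correct: float, n_tokens: int, penalty_per_token: float = 1e-5) -> float:
    return p_correct - penalty_per_token * n_tokens

long_cot = expected_reward(p_correct=0.05, n_tokens=20_000)  # long, rarely coherent
stub     = expected_reward(p_correct=0.02, n_tokens=500)     # aborts early
print(long_cot, stub)  # -0.15 vs 0.015: the stub wins under this penalty
```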

Looking at this further, I think “Incorrect Baselines” is using reference numbers from Table 3 of the Sober Reasoning paper, which uses the optimal temperature and top-p settings for each model, and comparing them to reported results obtained with untuned sampling settings
Confused about the recent LLM RLVR tweet which claims reported accuracy gains can totally reverse? I was too. Until I realized some of the comparisons are unstandardized. I compiled discrepancies in a thread below 🧵👇
I would argue "Sober Reasoning" actually shows models are generalizing fairly well - if the average score across seeds of the RL model matches the best seed of the base model, that seems like a fairly significant improvement (especially with totally different prompts than those used in training)
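Concretely, the comparison being argued for, with made-up per-seed accuracies:

```python
import numpy as np

# Hypothetical per-seed accuracies (not real numbers), four eval seeds each.
base_seeds = np.array([0.412, 0.455, 0.428, 0.401])
rl_seeds   = np.array([0.458, 0.463, 0.449, 0.470])

# RL mean over seeds matching or beating the base model's *best* seed is the
# conservative comparison described above.
print("base model, best seed:     ", base_seeds.max())
print("RL model, mean over seeds: ", rl_seeds.mean(), "+/-", rl_seeds.std(ddof=1))
```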

We always want to scale up RL, yet simply training longer doesn't necessarily push the limits - exploration gets impeded by entropy collapse. We show that the performance ceiling is surprisingly predictable, and the collapse is driven by covariance between logp and advantage.
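A sketch of the diagnostic implied there: track the covariance between sampled-token log-probs and advantages alongside policy entropy during RL training (the flat 1-D tensor layout is an assumption):

```python
import torch

def logp_advantage_cov(logps: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Population covariance between sampled-token log-probs and advantages.

    Per the claim above, a persistently positive covariance means updates keep
    sharpening already-likely tokens, which drives policy entropy down.
    """
    lp = logps - logps.mean()
    adv = advantages - advantages.mean()
    return (lp * adv).mean()

# usage sketch (hypothetical batch tensors): log this next to policy entropy
# cov = logp_advantage_cov(batch_logps.detach(), batch_advantages)
```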