kalomaze
@kalomaze
ML researcher (@primeintellect), speculator • extremely silly jester
i'm assuming most people knew that this is how Anthropic and co. already handle fuzzier stuff, but good to see it represented in papers rather than just products
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
'We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a relative…
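rough sketch of the idea, not the paper's actual implementation: score a response against a weighted checklist, then turn a group of those scores into GRPO-style group-relative advantages. `RubricItem`, `rubric_reward`, and `group_relative_advantages` are hypothetical names, the weighted-fraction aggregation is an assumption, and the `check` callables are keyword stand-ins for what would realistically be an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One checklist criterion (hypothetical structure, for illustration)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM judge call


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Toy rubric: keyword checks only, assumed purely for the example.
rubric = [
    RubricItem("mentions a dose", 2.0, lambda r: "dose" in r.lower()),
    RubricItem("recommends seeing a doctor", 1.0, lambda r: "doctor" in r.lower()),
]

samples = [
    "The usual dose is 200 mg. See a doctor if symptoms persist.",
    "The usual dose is 200 mg.",
    "Just rest.",
]
rewards = [rubric_reward(s, rubric) for s in samples]
advantages = group_relative_advantages(rewards)
```

the point of the checklist form is that each reward decomposes into named criteria, so you can see which item a sample failed instead of getting one opaque scalar from a judge.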
proper multiturn state tracking task working with RL-able pass rate on devstral 24b, cool

noticed this before in qwen models yeah. can't tell if bad init or aggressive pretraining LR or whatever the hell causes entire rows to be outliers
What the fuck, Qwen?