Grad
@Grad62304977
Likely a result of the ICL used; it would be interesting to see the answers that got gold without ICL.
Wow, GDM's IMO gold-winning solutions just dropped. At first glance they look much cleaner than OpenAI's.
Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.
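(Not the SYNTHETIC-2 pipeline itself, just a minimal sketch of what rule-based verification of a reasoning trace can look like for a math task. The \boxed{} answer format and the normalization rule are assumptions for illustration.)

```python
import re

def extract_final_answer(trace: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1].strip() if matches else None

def verify_trace(trace: str, gold: str) -> bool:
    """Rule-based check: keep the trace only if its final answer
    matches the gold answer after light normalization."""
    answer = extract_final_answer(trace)
    if answer is None:
        return False
    normalize = lambda s: s.replace(" ", "").lower()
    return normalize(answer) == normalize(gold)

# Example: a generated trace for "2 + 2 = ?" with gold answer "4"
trace = "We add the numbers: 2 + 2 = 4. So the answer is \\boxed{4}."
print(verify_trace(trace, "4"))  # True -> trace enters the dataset
```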
I don't think people praise OpenAI enough for their openness with o1. Of course it wasn't very open, but key details, like confirming it's just one autoregressive model generating a CoT trained with RL, were really enough to understand closely how to make an o1 model, and for DeepSeek to go…
Doesn't seem like many people scrolled down to see this (myself included). Great performance for a 7B, and more evidence that the main driver behind R1-0528 was just more RL and a longer max CoT length.
Alongside the remarkable MiMo-VL series, we also present MiMo-7B-RL-0530, which has seen significant improvements in reasoning and general capabilities through continuous reinforcement learning (RL) after the initial open-source release of MiMo-7B. In multiple mathematical and coding…
We always want to scale up RL, yet simply training longer doesn't necessarily push the limits: exploration gets impeded by entropy collapse. We show that the performance ceiling is surprisingly predictable, and that the collapse is driven by the covariance between log-probability (logp) and advantage.
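(A minimal diagnostic sketch of the quantity this post points at: the per-batch covariance between sampled-token log-probs and advantages. A persistently positive value is the collapse signal described here, since high-probability actions keep receiving positive advantage, so updates sharpen the policy and entropy falls. Tensor names and shapes are assumptions, not the paper's code.)

```python
import torch

def logp_advantage_covariance(logps: torch.Tensor, advantages: torch.Tensor) -> float:
    """Empirical covariance between action log-probs and advantages
    over one batch of sampled tokens."""
    logps = logps.flatten().float()
    advantages = advantages.flatten().float()
    return torch.mean(
        (logps - logps.mean()) * (advantages - advantages.mean())
    ).item()

# Toy batch: log-probs correlated with advantages -> positive covariance,
# i.e. the update will concentrate mass and shrink entropy
logps = torch.tensor([-0.1, -0.5, -2.0, -3.0])
advs = torch.tensor([1.0, 0.5, -0.5, -1.0])
print(logp_advantage_covariance(logps, advs))  # > 0: entropy-collapse signal
```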
Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers, and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip…
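(The post is cut off, but the core VeriFree idea, as I understand it, is to skip answer verification entirely: instead of checking the generated answer, score the policy's own likelihood of the reference answer conditioned on the generated reasoning. Below is a minimal sketch of that scoring step; the GPT-2 stand-in model and the prompt format are assumptions, not the paper's setup.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in policy model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def verifier_free_reward(question: str, reasoning: str, reference_answer: str) -> float:
    """Verifier-free signal: mean log-prob the policy assigns to the
    reference-answer tokens, conditioned on question + sampled reasoning.
    No rule-based or model-based verifier is ever called."""
    prefix = f"{question}\n{reasoning}\nAnswer: "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(reference_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    logits = model(input_ids).logits
    # log-probs of each answer token, predicted from the previous position
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    ans_positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_lps = [logprobs[p, input_ids[0, p + 1]] for p in ans_positions]
    return torch.stack(token_lps).mean().item()

r = verifier_free_reward(
    "What is 7 * 6?",
    "7 * 6 means seven sixes: 7 + 7 + 7 + 7 + 7 + 7 = 42.",
    "42",
)
print(r)  # higher when the reasoning makes the reference answer more likely
```

In actual training this likelihood term would be optimized directly over the sampled reasoning; the sketch only shows the scoring, not the policy update.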