Grad
@Grad62304977
Likely a result of the ICL used; it would be interesting to see the answers that got gold without ICL.
Wow, GDM's IMO gold-winning solutions just dropped. At first glance they look much cleaner than OpenAI's.
Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.
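(Not the SYNTHETIC-2 pipeline itself, just a minimal sketch of what rule-based verification of a reasoning trace can look like for a math task. The \boxed{} answer format and the normalization rule are assumptions for illustration.)

```python
import re

def extract_final_answer(trace: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1].strip() if matches else None

def verify_trace(trace: str, gold: str) -> bool:
    """Rule-based check: keep the trace only if its final answer
    matches the gold answer after light normalization."""
    answer = extract_final_answer(trace)
    if answer is None:
        return False
    normalize = lambda s: s.replace(" ", "").lower()
    return normalize(answer) == normalize(gold)

# Example: a generated trace for "2 + 2 = ?" with gold answer "4"
trace = "We add the numbers: 2 + 2 = 4. So the answer is \\boxed{4}."
print(verify_trace(trace, "4"))  # True -> trace enters the dataset
```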
I don't think people praise OpenAI enough for their openness with o1. Of course it wasn't very open, but key details, like confirming it's just one autoregressive model generating a CoT trained with RL, were really enough to understand closely how to make an o1 model, and for DeepSeek to go…
Doesn't seem like many people scrolled down to see this (myself included). Great performance for a 7B, and more evidence that the main driver behind R1-0528 was just more RL and a longer max CoT length.
Alongside the remarkable MiMo-VL series, we also present MiMo-7B-RL-0530, which has seen significant improvements in reasoning and general capabilities through continuous reinforcement learning (RL) after the initial open-source release of MiMo-7B. In multiple mathematical and coding…
We always want to scale up RL, yet simply training longer doesn't necessarily push the limits: exploration gets impeded by entropy collapse. We show that the performance ceiling is surprisingly predictable, and that the collapse is driven by the covariance between log-probability (logp) and advantage.
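(A minimal diagnostic sketch of the quantity this post points at: the per-batch covariance between sampled-token log-probs and advantages. A persistently positive value is the collapse signal described here, since high-probability actions keep receiving positive advantage, so updates sharpen the policy and entropy falls. Tensor names and shapes are assumptions, not the paper's code.)

```python
import torch

def logp_advantage_covariance(logps: torch.Tensor, advantages: torch.Tensor) -> float:
    """Empirical covariance between action log-probs and advantages
    over one batch of sampled tokens."""
    logps = logps.flatten().float()
    advantages = advantages.flatten().float()
    return torch.mean(
        (logps - logps.mean()) * (advantages - advantages.mean())
    ).item()

# Toy batch: log-probs correlated with advantages -> positive covariance,
# i.e. the update will concentrate mass and shrink entropy
logps = torch.tensor([-0.1, -0.5, -2.0, -3.0])
advs = torch.tensor([1.0, 0.5, -0.5, -1.0])
print(logp_advantage_covariance(logps, advs))  # > 0: entropy-collapse signal
```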
Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers, and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip…
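(The post is cut off, but the core VeriFree idea, as I understand it, is to skip answer verification entirely: instead of checking the generated answer, score the policy's own likelihood of the reference answer conditioned on the generated reasoning. Below is a minimal sketch of that scoring step; the GPT-2 stand-in model and the prompt format are assumptions, not the paper's setup.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in policy model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def verifier_free_reward(question: str, reasoning: str, reference_answer: str) -> float:
    """Verifier-free signal: mean log-prob the policy assigns to the
    reference-answer tokens, conditioned on question + sampled reasoning.
    No rule-based or model-based verifier is ever called."""
    prefix = f"{question}\n{reasoning}\nAnswer: "
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(reference_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    logits = model(input_ids).logits
    # log-probs of each answer token, predicted from the previous position
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    ans_positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_lps = [logprobs[p, input_ids[0, p + 1]] for p in ans_positions]
    return torch.stack(token_lps).mean().item()

r = verifier_free_reward(
    "What is 7 * 6?",
    "7 * 6 means seven sixes: 7 + 7 + 7 + 7 + 7 + 7 = 42.",
    "42",
)
print(r)  # higher when the reasoning makes the reference answer more likely
```

In actual training this likelihood term would be optimized directly over the sampled reasoning; the sketch only shows the scoring, not the policy update.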