Nikhil Chandak
@nikhilchandak29
PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Interested in better evals, forecasting, and open-endedness.
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision…
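A minimal sketch of the "choices-only" setup described above: show a model the answer options but never the question, and measure accuracy. The `query_model` helper and the prompt wording are placeholders, not the exact protocol from the paper.

```python
def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (any chat-completions API).
    Expected to return a single option letter such as 'A'."""
    raise NotImplementedError

def choices_only_accuracy(items):
    """items: list of dicts with 'choices' (list[str]) and 'answer_idx' (int).
    The question text is deliberately withheld from the prompt."""
    letters = "ABCDEFGHIJ"  # supports up to 10 options (e.g., MMLU-Pro)
    correct = 0
    for item in items:
        options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item["choices"]))
        prompt = (
            "The question is hidden. Based only on the answer options below, "
            "guess which one is most likely correct. Reply with a single letter.\n"
            f"{options}"
        )
        pred = query_model(prompt).strip()[:1].upper()
        correct += int(pred == letters[item["answer_idx"]])
    return correct / len(items)
```

Comparing this number against the 1/k random-guess baseline (10% for 10-way MMLU-Pro) is what reveals how much the options alone leak.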

Forecasting future events is a fascinating task for language models. It's arguably the hardest application for a pure "oracle" that can't take actions, requiring reasoning about conflicting info, planning, information seeking... But forecasting is also uniquely hard to evaluate:
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
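For context on why forecaster evaluation is tricky: forecasts are typically scored with a proper scoring rule like the Brier score only after events resolve, so scores depend on question selection and resolution lag. A small sketch of the standard metric (not the specific issues raised in the thread):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.
    probs: forecast probability that each event occurs (0..1)
    outcomes: 1 if the event occurred, else 0. Lower is better."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A forecaster that is confident and right scores far better than one hedging at 50%.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
```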
Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labels, we show that textual similarity between relations can act as an invariance - helping generalization across datasets! 🧵👇
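A rough illustration of the idea in the tweet above (not the paper's actual pipeline): embed the cleaned-up relation labels of two KGs as text and use cosine similarity to anchor unseen relations to known ones. The encoder name and the toy relations are placeholders.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Relation labels from two different KGs (toy examples).
source_relations = ["place of birth", "educated at", "member of sports team"]
target_relations = ["born in", "attended university", "plays for club"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
src = model.encode(source_relations, normalize_embeddings=True)
tgt = model.encode(target_relations, normalize_embeddings=True)

# Cosine similarity between every target relation and every source relation;
# the nearest source relation gives a textual anchor for cross-dataset transfer.
sims = tgt @ src.T
for i, rel in enumerate(target_relations):
    j = int(np.argmax(sims[i]))
    print(f"{rel!r} ~ {source_relations[j]!r} (cos={sims[i, j]:.2f})")
```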
Pretty happy with how my predictions are holding up. 5/6 was the gold medal threshold this year. OAI's "experimental reasoning LLM" got that exactly, failing only to solve the one hard combinatorics problem, P6. My advice remains: look beyond the medal. Brief thread. 1/
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️When the same questions are evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…
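A sketch of the free-form protocol being contrasted with MCQ: hide the options, collect an open-ended answer, and grade it against the reference. `query_model` and `judge_equivalent` are placeholder stubs, not the exact grading setup used above.

```python
def query_model(prompt: str) -> str:
    """Placeholder for the model under evaluation."""
    raise NotImplementedError

def judge_equivalent(question: str, reference: str, candidate: str) -> bool:
    """Placeholder for an equivalence check, e.g. an LLM judge asked whether
    `candidate` and `reference` answer the question the same way."""
    raise NotImplementedError

def freeform_accuracy(items):
    """items: dicts with 'question' and 'answer'. No answer options are shown,
    so the model cannot score by eliminating or pattern-matching choices."""
    correct = 0
    for item in items:
        response = query_model(
            f"{item['question']}\n\nAnswer concisely with just the final answer."
        )
        correct += int(judge_equivalent(item["question"], item["answer"], response))
    return correct / len(items)
```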
Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple-choice exam. E.g., if you have a trigonometry problem and the possible solutions are A: 1, B: 3.7, C: -5, D: pi/2, which would you pick (with no knowledge of the question)?
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision…
TIL half of SWE-Bench-Verified is fixing issues in a single repository. We really need to be careful with how we name benchmarks, and be explicit about which capabilities they test. Fix-issues-in-the-Django-repo-Bench doesn't have the same ring to it, and that's the point.
Furthermore, the low diversity of codebases limits external validity. Django comprises nearly half of all issues and five repositories account for over 80% of the benchmark.
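The repo concentration is easy to check directly from the dataset. A sketch, assuming the Hugging Face dataset id `princeton-nlp/SWE-bench_Verified` and its `repo` field (labeled in `owner/repo` form):

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
counts = Counter(ds["repo"])
total = len(ds)

# Per-repository share of the benchmark.
for repo, n in counts.most_common(5):
    print(f"{repo:30s} {n:4d}  ({n / total:.1%})")

top5 = sum(n for _, n in counts.most_common(5))
print(f"django share: {counts['django/django'] / total:.1%}, "
      f"top-5 repos: {top5 / total:.1%} of {total} issues")
```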
A great example of scientific discourse at its best—thoughtful, constructive, and conclusive. We now have more rigorous evidence that confidence maximization improves reasoning. 👇
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29, and @AmyPrb for the past 3 weeks (over a Zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing…
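To make "maximizing confidence" concrete, here is a toy inference-time proxy, not the training method discussed above: sample several solutions and keep the one with the highest average token log-probability, i.e. the one the model is most confident in. `sample_with_logprobs` is a placeholder.

```python
import math

def sample_with_logprobs(prompt: str):
    """Placeholder: returns (text, token_logprobs) for one sampled completion,
    e.g. from an API that exposes per-token log probabilities."""
    raise NotImplementedError

def most_confident_answer(prompt: str, n: int = 8) -> str:
    """Best-of-n selection by mean token log-prob: a crude stand-in for
    confidence maximization at inference time."""
    best_text, best_conf = None, -math.inf
    for _ in range(n):
        text, logprobs = sample_with_logprobs(prompt)
        conf = sum(logprobs) / max(len(logprobs), 1)  # average per-token confidence
        if conf > best_conf:
            best_text, best_conf = text, conf
    return best_text
```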
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
Circling back, it seems like your base-model numbers are also quite different from what is reported in the Qwen3 report (few-shot). For example, it looks like you can get the same performance on GPQA from the base model as from your method, making it unclear how much better your method actually is.