Sanchit Ahuja
@SanchitAhuja7
Trynna work. Research Fellow at @MSFTResearch. Ex-ML at @SkitTech. Alum of @bitspilaniindia.
New paper! Can reasoning in non-English languages be token-efficient and accurate? We evaluate this across 3 models, 7 languages, and 4 math benchmarks. Here’s what we found 🧵 (1/n)

Announcing the Microsoft Research India Academic Summit 2025! The Microsoft Research (MSR) India Academic Summit is an event aimed at strengthening ties between the Indian academic community and researchers at MSR India. 📅 Event Dates: June 24th & 25th
Evaluating Evaluations (@evaluatingevals) Meetup at #ACL2025NLP #ACL2025. Evaluations help us understand capabilities, risks, and opportunities of models, while improving reliability and robustness. Let’s chat about evaluation science & meet the EvalEval community! Monday,…
🚨 AI Evals Crisis: Officially kicking off the Eval Science Workstream 🚨 We’re building a shared scientific foundation for evaluating AI systems, one that’s rigorous, open, and grounded in real-world & cross-disciplinary best practices👇 (1/2) evalevalai.com/research/2025/…
The repo is public! Feel free to play with the code for other models and languages :)
The code for this paper will become available here: github.com/microsoft/Effi… as soon as the logistics get sorted out. (10/n)
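While the repo is being released, here is a minimal sketch (not from the paper's code) of the kind of comparison the thread describes: counting how many tokens the same reasoning chain costs in different languages. The tokenizer checkpoint and the example strings are illustrative assumptions, not the models or benchmarks used in the paper.

```python
# Sketch: compare chain-of-thought token counts across languages.
# Assumes the `transformers` library is installed; the checkpoint name below
# is an illustrative choice, not necessarily one evaluated in the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# The same toy reasoning step written in English and in Korean.
cot = {
    "en": "First, compute 12 * 7 = 84. Then subtract 4 to get 80.",
    "ko": "먼저 12 * 7 = 84를 계산한다. 그다음 4를 빼면 80이다.",
}

for lang, text in cot.items():
    # Count tokens of the raw reasoning text, without special tokens.
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    print(f"{lang}: {n_tokens} tokens")
```

Swapping in other tokenizers and languages gives a rough, per-model view of how verbose a given language's reasoning traces are, which is the quantity the paper's token-efficiency results are about.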
This is neat. So presumably English CoT has lots of filler words/artifacts that don't do meaningful computation. Surely English still has some benefits due to its ubiquity in training. As OpenAI and GDM hide their thinking tokens, maybe they're running broken-English CoT models.
Excited to share our new paper on cross-lingual LLM reasoning, with @SanchitAhuja7 and Barun! Turns out models may reason more efficiently in languages like Arabic or Korean, cutting tokens without hurting accuracy. A step toward rethinking the default role of English in reasoning!
A friend of mine at Adalat AI is looking for a 3-month research intern to work on legal benchmarking—running experiments across LLMs, analyzing results, and co-authoring a paper targeting an A* conference. DM me or @orgho12 on Twitter if you're interested.
Many PhDs (my past self included) fall into the trap of thinking that publishing in top-tier conferences is the ultimate goal. But publishing ≠ impact. Muon was just a blog post. It got Keller into OpenAI; he might be training GPT-5 with it now. I'm grateful he listed me as…
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
My brain read this as: oh, Charles is coming up with a custom H100 GPU, huh 🥲
Sooo proud of this one 🩵 my limited edition Beoplay H100 with @BangOlufsen - only 216 out there!