Deepak Nathani
@deepaknathani11
PhD Student @UCSBNLP | Mentor @CAREMentorship | Prev: @AIatMeta | @AWS AI | @GoogleAI India | @IITHyderabad
🎉 Thrilled to share MLGym and MLGym-Bench, our new framework for AI Research Agents! 🚀 Developed during my Meta internship, MLGym provides a flexible environment for benchmarking and developing new agents for AI research tasks. 🔬 MLGym-Bench consists of 13 diverse AI research…
Super excited to share 🧠MLGym 🦾 – the first Gym environment for AI Research Agents 🤖🔬 We introduce MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. The key contributions of our work are: 🕹️ Enables the…
MLGym has been accepted to #COLM2025! See you in Montreal 🇨🇦
1/ 🕵️ Algorithm discovery could lead to huge AI breakthroughs! But what is the best way to learn or discover new algorithms? I'm so excited to share our brand new @rl_conference paper which takes a step towards answering this! 🧵
@GoogleDeepMind India 🇮🇳 & Japan 🇯🇵 are looking for strong candidates in multilinguality, multicultural, & multimodality areas. RS Bangalore: job-boards.greenhouse.io/deepmind/jobs/… RS Tokyo: job-boards.greenhouse.io/deepmind/jobs/… RE Tokyo: job-boards.greenhouse.io/deepmind/jobs/…
If you are interested in open-ended discovery, this is an amazing opportunity! @robertarail is great and you will have fun working on a challenging problem
I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists and Research Engineers to help us push the frontier of autonomously discovering novel artifacts such as new knowledge, capabilities, or algorithms, in an…
Excited to be at @IC2S2 #ic2s22025. I will be presenting this work at the plenary lightning talks (after keynotes) on Thursday, and in the poster session afterwards. Looking forward to making new friends :D If you are interested in culture and evaluation, let's chat!!!
🖋️ Curious how writing differs across (research) cultures? 🚩 Tired of “cultural” evals that don't consult people? We engaged with researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗ 📜 arxiv.org/abs/2506.00784 1/11
“Apple loses key AI leaders to Meta” I discovered this while doing the live demo of Reka Research 😂 Go watch the video and play with our agent
Reka Research is our AI agent that scours the web to answer your toughest questions. Ready to unlock its full potential? Learn directly from the team who built it!
We are hosting a student researcher this year on the Paradigms of Intelligence team at Google! If you're interested in working with @ninoscherrer and me on AGI, or whatever you think is the next big thing 🥰, please consider applying! docs.google.com/forms/u/2/d/e/…
How should we rank generalist agents across a wide set of benchmarks and tasks? Honored to receive the AAMAS best paper award for SCO, a voting-theory-based scheme that minimizes mistakes in predicting agent comparisons from the evaluation data. arxiv.org/abs/2411.00119
We’re excited to announce our next speaker: Roberta Raileanu (@robertarail) from @GoogleDeepMind! Roberta will discuss NetHack: A Grand Challenge for RL and LLM Agents Alike. ⚔️ Join us on August 5th to learn how to develop agents capable of tackling open-ended environments!
🎉🚀 Release Day 3: It’s #OpenSource time! We’re pleased to announce the open-sourcing of Reka Flash 3.1, our coding model achieving state-of-the-art performance on LiveCodeBench and other benchmarks. 📈🔍 🛠️✨ We’re also open-sourcing Reka Quant, our cutting-edge quantization…
📢 We are open sourcing ⚡Reka Flash 3.1⚡ and 🗜️Reka Quant🗜️. Reka Flash 3.1 is a much improved version of Reka Flash 3 that stands out on coding due to significant advances in our RL stack. 👩‍💻👨‍💻 Reka Quant is our state-of-the-art quantization technology. It achieves…
Try out Reka Vision today! You can create super cool social media reels with natural language prompts, search across many videos, and get summaries. Congrats to the team on the great work.
Excited to introduce Reka Vision, an agentic visual understanding and search platform. Transform your unstructured multimodal data into insights and actions.
🥳Our work UTGen & UTDebug on teaching LLMs to generate effective unit tests & improve code debugging/generation has been accepted to @COLM_conf #COLM2025! Stay tuned for more exciting results -- e.g., using 32B-scale UTGen models to improve debugging with frontier models like…
🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨 which introduces ✨UTGen and UTDebug✨ for teaching LLMs to generate unit tests (UTs) and debugging code from generated tests. UTGen+UTDebug improve LLM-based code debugging by addressing 3 key…
Solid work from @AIatMeta on ablating and improving AIDE on MLE-Bench! The rigor of empirical evaluation has reached a new level, making the experimental signals super strong. Highly recommended for anyone interested in AI-Driven R&D/Agentic Search!
AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate solutions and codebases? Read our new paper looking at MLE-Bench: arxiv.org/pdf/2507.02554 #LLM #Agents #MLEBench
Agent S2 is accepted to #COLM2025! My first COLM paper. See you in Montreal in October! 🍁
Since launching Agent S2, many folks working on GUI/computer-use agents asked for our tech report. Here we go! 🎉New SOTA on 3 major computer use benchmarks. • OSWorld (15 steps): 27.0% 🚀 (+18.9%) • OSWorld (50 steps): 34.5% 🚀 (+32.7%) • WindowsAgentArena: 29.8% 🚀…
Very proud of this work! If you're interested in AI agents and their current challenges, give this a read. Thanks to my incredible collaborators and to @Meta and @ucl for enabling me to tackle something of this scale for my first PhD paper. Excited for what's ahead!
Scaling AI research agents is key to tackling some of the toughest challenges in the field. But what's required to scale effectively? It turns out that simply throwing more compute at the problem isn't enough. We break down an agent into four fundamental components that shape…
Giving a talk at @jetbrains with @anton_iades on inference time scaling for SWE agents! Am I allowed to mention the word “cursor” here?
AIRA strikes again! This time we conduct an in-depth study of research agents on MLE-Bench (i.e. kaggle competitions). We find that while exploration and search matter, the biggest delta is due to our more robust software stack. We are open-sourcing all of this to allow YOU to…
Better Agents = Smarter Search Algorithms + Improved Operators + Precise Evaluation + Robust Infrastructure. After all, even the best F1 driver can't win without a great car! Great to see my PhD research on LLMs + Search come together in this amazing paper! #LLM #Search #AI