Divyat Mahajan
@divyat09
Ph.D. Student @Mila_Quebec | Visiting Researcher @AIatMeta | Causality, Trustworthy ML | Former: @MSFTResearch @IITKanpur
Happy to share that Compositional Risk Minimization has been accepted at #ICML2025 📌Extensive theoretical analysis along with a practical approach for extrapolating classifiers to novel compositions! 📜 arxiv.org/abs/2410.06303
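For a concrete (and heavily simplified) picture of the compositional idea, here is a minimal sketch assuming discrete attributes and a shared feature extractor: each attribute gets its own score head, and a combination is scored by summing per-attribute terms, so combinations never seen jointly during training still receive a well-defined score. This illustrates additive composition in general, not the paper's exact estimator.

```python
# Illustrative sketch (not the paper's exact method): additive per-attribute
# scoring so a classifier can rank attribute combinations never seen jointly
# during training. Names and shapes here are assumptions for the example.
import torch
import torch.nn as nn

class AdditiveCompositionalScorer(nn.Module):
    def __init__(self, feat_dim, n_values_per_attr):
        super().__init__()
        # one linear score head per attribute
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_vals) for n_vals in n_values_per_attr]
        )

    def forward(self, features):
        # list of [batch, n_vals_k] score tables, one per attribute
        return [head(features) for head in self.heads]

    def score_combination(self, features, combo):
        # additive score of a specific attribute combination (z_1, ..., z_K);
        # unseen combinations are scored by the same sum of per-attribute terms
        per_attr = self.forward(features)
        return sum(scores[:, z] for scores, z in zip(per_attr, combo))

# toy usage: 2 attributes with 3 and 4 values, scoring the (possibly unseen) combo (2, 1)
scorer = AdditiveCompositionalScorer(feat_dim=16, n_values_per_attr=[3, 4])
x = torch.randn(8, 16)
print(scorer.score_combination(x, (2, 1)).shape)  # torch.Size([8])
```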

The Llama Nemotron model just got super-charged ⚡️ We released Llama-Nemotron-Super-v1.5 today! The best open model that can be deployed on a single H100 🚀 Enhanced for reasoning, tool use, general chat, and instruction following. HF: huggingface.co/nvidia/Llama-3…
Very excited to announce Llama-Nemotron-Super-V1.5! Super-V1.5 is now better than Ultra-V1. This is currently the best model that can be deployed on a single H100. Reasoning On/Off and a drop-in replacement for V1. Open weights, code, and data on HF: huggingface.co/nvidia/Llama-3…
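A hedged usage sketch for the release above with Hugging Face transformers. The model id below is a placeholder (take the exact one from the HF link), and the system-prompt reasoning toggle is an assumption here; the model card is the source of truth.

```python
# Usage sketch only: the model id is a placeholder and the reasoning toggle
# shown as a system prompt is an assumption -- check the HF model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<nvidia/Llama-Nemotron-Super id from the HF link above>"  # placeholder

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "detailed thinking on"},  # assumed reasoning-on toggle
    {"role": "user", "content": "Plan a function-calling flow for a weather tool."},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```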
I’m also excited to be presenting this work (openreview.net/forum?id=4ZX2a…) at ICCOPT at USC. Theory aside, there are some applications that may interest people working in RL, games, and performative prediction. Let me know if you are in the area and want to chat!
🚨 New paper drop! 🚨 🤔 When a transformer sees a sequence that could be explained by many rules, which rule does it pick? It chooses the simplest sufficient one! 🧵👇
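As a toy illustration of the "simplest sufficient rule" idea (not the paper's actual experiment): given several candidate rules that all explain an observed sequence, prefer the one with the lowest complexity. The rules and their complexities below are made up for the example.

```python
# Toy "simplest sufficient rule" selection: among candidate rules consistent
# with the observed prefix, pick the one with the shortest description.
seq = [2, 4, 6, 8]

candidates = {
    "add 2 each step":            (lambda s: all(b - a == 2 for a, b in zip(s, s[1:])), 1),
    "even numbers in order":      (lambda s: all(x % 2 == 0 for x in s) and s == sorted(s), 2),
    "digits of a fixed constant": (lambda s: s == [2, 4, 6, 8], 3),
}

consistent = {name: cost for name, (check, cost) in candidates.items() if check(seq)}
print(min(consistent, key=consistent.get))  # -> "add 2 each step"
```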
As the field moves towards agents doing science, the ability to understand novel environments through interaction becomes critical. AutumnBench is an attempt at measuring this abstract capability in both humans and current LLMs. Check out the blog post for more insights!
We’re proud to announce the launch of AutumnBench, an open-source benchmark developed on our Autumn platform. This benchmark, led by our MARA team, provides a novel platform for evaluating world modeling and causal reasoning in both human and artificial intelligence.
Thrilled to share that our work received the Outstanding Paper Award at ICML! I will be giving the oral presentation on Tuesday at 4:15 PM. Both @Jaeyeon_Kim_0 and I will be at the poster session shortly after the oral. Please drop by if you can!
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
Congrats on the award!! Great to see more work designing insightful tasks that bring out the role of token ordering & difficulty (should I say "indecipherability" ;-) ). I think the idea of learning token-level subproblems is broken for both diffusion and next-token learning
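For readers curious what "careful planning at inference" can look like for a masked diffusion model, here is a minimal sketch of one common adaptive-order heuristic (unmask the lowest-entropy position first). The denoiser is a stand-in, and this is not necessarily the paper's exact planner.

```python
# Sketch of adaptive-order unmasking for a masked diffusion model: at each step,
# unmask the position the denoiser is most confident about instead of a random
# one. The toy `denoiser` below is a stand-in; the real one is a trained MDM.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK, LENGTH = 5, -1, 8

def denoiser(tokens):
    # stand-in: returns a [LENGTH, VOCAB] probability table for every position
    logits = rng.normal(size=(LENGTH, VOCAB))
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

tokens = np.full(LENGTH, MASK)
while (tokens == MASK).any():
    probs = denoiser(tokens)
    entropy = -(probs * np.log(probs + 1e-12)).sum(-1)
    entropy[tokens != MASK] = np.inf            # only consider still-masked positions
    pos = int(entropy.argmin())                 # "easiest" subproblem first
    tokens[pos] = int(probs[pos].argmax())      # commit the most confident token
print(tokens)
```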
“Apple loses key AI leaders to Meta” I discovered this while doing the live demo of Reka Research 😂 Go watch the video and play with our agent.
Reka Research is our AI agent that scours the web to answer your toughest questions. Ready to unlock its full potential? Learn directly from the team who built it!
I will be at the Actionable Interpretability Workshop (@ActInterp, #ICML) presenting *SSAEs* in the East Ballroom A from 1-2pm. Drop by (or send a DM) to chat about (actionable) interpretability, (actionable) identifiability, and everything in between!
1/ Hi, can I get an unsupervised sparse autoencoder for steering, please? I only have unlabeled data varying across multiple unknown concepts. Oh, and make sure it learns the same features each time! Yes! A freshly brewed Sparse Shift Autoencoder (SSAE) coming right up. 🧶
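A minimal sketch in the spirit of the tweet above, assuming paired embeddings and a standard reconstruction-plus-L1 objective on their differences; the actual SSAE objective and its identifiability guarantees are in the paper.

```python
# Minimal "shift" autoencoder sketch: fit a sparse code for *differences*
# between paired embeddings, so each latent tends to pick up one concept of
# variation. Loss and architecture details are assumptions, not the paper's.
import torch
import torch.nn as nn

dim, latents = 64, 16
enc = nn.Linear(dim, latents)
dec = nn.Linear(latents, dim, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x_a, x_b = torch.randn(256, dim), torch.randn(256, dim)   # paired embeddings (toy data)
shift = x_b - x_a                                          # representation shift

for _ in range(200):
    code = torch.relu(enc(shift))
    recon = dec(code)
    loss = ((recon - shift) ** 2).mean() + 1e-2 * code.abs().mean()  # reconstruction + sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# steering sketch: move an embedding along one learned concept direction
steered = x_a[:1] + 2.0 * dec.weight[:, 3]
```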
Excited to present our work "Improving the scaling laws of synthetic data with deliberate practice", tomorrow at #ICML2025 📢 Oral: Wed. 10:45 AM 📍 West Ballroom B (Oral 3C Data-Centric ML) 🖼️ Poster: 🕚 11:00 AM – 1:30 PM 📍 East Exhibition Hall A-B (Poster Session 3 East)
🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP), a dynamic framework that generates useful synthetic training examples by focusing on where the model struggles most. 🔥…
Today at #ICML2025, we present Deliberate Practice: an approach to improve sample-efficiency by generating harder, not more, examples. - Oral talk at 10:45 - West Ballroom B | Orals 3C: Data-Centric ML Join us to discuss principled approaches to more efficient learning.
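Schematically, the loop looks something like the sketch below. The generator, learner, and selection rule are stand-ins; this illustrates the "generate where the learner struggles" idea rather than the paper's exact algorithm.

```python
# Schematic deliberate-practice loop (details assumed): propose candidates from
# a generator, keep the examples the current learner finds hardest, train on those.
import numpy as np

rng = np.random.default_rng(0)

def generator(n):                      # stand-in for a conditional data generator
    return rng.normal(size=(n, 10))

def learner_loss(model_w, x):          # stand-in per-example loss of the current learner
    return (x @ model_w) ** 2

w = rng.normal(size=10)
for round_ in range(5):
    candidates = generator(1000)
    losses = learner_loss(w, candidates)
    hard = candidates[np.argsort(losses)[-100:]]          # keep the hardest 10%
    grad = 2 * (hard * (hard @ w)[:, None]).mean(0)       # one crude training step on them
    w -= 0.01 * grad
    print(f"round {round_}: mean selected loss {losses[np.argsort(losses)[-100:]].mean():.3f}")
```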
Check out our new work on learning diffusion models with guidance from pretrained vision and language embeddings. Also a contributed talk at the #ICML2025 FM4LS workshop this Saturday! 💡 It yields a 23x speedup over SiT-XL on the class-conditional ImageNet 256×256 benchmark.
Excited to share: “Learning Diffusion Models with Flexible Representation Guidance” With my amazing coauthors @zhuci19, @sharut_gupta, @zy27962986, @StefanieJegelka, @stats_stephen, Tommi Jaakkola Paper: arxiv.org/pdf/2507.08980 Code: github.com/ChenyuWang-Mon…
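A minimal sketch of the general recipe of representation guidance, assuming the denoiser exposes intermediate features and a frozen pretrained encoder provides alignment targets. The corruption, modules, and loss weight are toy stand-ins, not the paper's objective.

```python
# Sketch: standard denoising loss plus alignment of the backbone's features
# with a frozen pretrained encoder's embedding of the clean input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    # stand-in backbone that returns a noise prediction and intermediate features
    def __init__(self, dim=32, feat=16):
        super().__init__()
        self.body = nn.Linear(dim + 1, feat)
        self.out = nn.Linear(feat, dim)
    def forward(self, x_t, t):
        h = torch.relu(self.body(torch.cat([x_t, t[:, None]], dim=-1)))
        return self.out(h), h

denoiser, frozen_encoder, proj = ToyDenoiser(), nn.Linear(32, 16), nn.Linear(16, 16)
frozen_encoder.requires_grad_(False)

x0 = torch.randn(8, 32)
t = torch.rand(8)
noise = torch.randn_like(x0)
x_t = x0 + t[:, None] * noise                      # toy corruption; real code follows the diffusion schedule

pred_noise, feats = denoiser(x_t, t)
denoise_loss = F.mse_loss(pred_noise, noise)
with torch.no_grad():
    target = frozen_encoder(x0)                    # frozen pretrained embedding (stand-in)
align_loss = 1 - F.cosine_similarity(proj(feats), target, dim=-1).mean()
loss = denoise_loss + 0.5 * align_loss             # 0.5 is an arbitrary weight for this sketch
loss.backward()
```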
This work delivers on both theory and practice: the sharpest provable compositionality guarantees I know of, alongside state-of-the-art performance on tough compositional distribution-shift benchmarks.
Presenting CRM at #ICML2025 📌 Wednesday, 16th July, 11 am 📍East Exhibition Hall A-B (E-2101) Let's chat about distribution shifts! I've been deep into causality- and invariance-based perspectives, and recently exploring robust LLM pretraining architectures.
Distributional diffusion models with scoring rules at #icml25 Fewer, larger denoising steps using distributional losses! Wednesday 11am poster E-1910 arxiv.org/pdf/2502.02483 @agalashov @ValentinDeBort1 Guntupalli @zhouguangyao @sirbayes @ArnaudDoucet1
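For the curious, here is a sketch of one standard distributional loss, the energy score, computed on multiple samples per target. How such a score is wired into the denoising steps follows the paper, not this sketch.

```python
# Energy-score sketch: score a set of model samples against a target; lower is
# better, and the score is proper in the limit of many samples.
import torch

def energy_score(samples, target):
    # samples: [m, batch, dim] draws from the model's predictive distribution
    # target:  [batch, dim] ground-truth clean (or less-noisy) state
    m, batch = samples.shape[0], samples.shape[1]
    term1 = (samples - target.unsqueeze(0)).norm(dim=-1).mean()
    pairwise = (samples.unsqueeze(0) - samples.unsqueeze(1)).norm(dim=-1)
    term2 = pairwise.sum() / (m * (m - 1) * batch)       # mean over off-diagonal pairs
    return term1 - 0.5 * term2

samples = torch.randn(8, 4, 32, requires_grad=True)      # stand-in model samples
target = torch.randn(4, 32)
energy_score(samples, target).backward()
```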
How to align your diffusion model with unseen objectives at inference time? Presenting Diffusion Tree Sampling/Search (DTS/DTS*) 🥳 Using MCTS-style search, DTS steadily improves sample quality with compute, matching the best baseline with 5× less compute!
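A compact MCTS-flavored sketch of the general idea of searching over denoising trajectories for a reward. The denoising step and reward below are stand-ins, and this is an illustration rather than the DTS/DTS* algorithm itself.

```python
# Tree search over denoising steps: nodes are partially denoised states,
# children are sampled denoising steps, terminal rewards are backed up to
# decide where to spend more compute.
import math, random
random.seed(0)

STEPS = 4

def denoise_step(x):                 # stand-in stochastic denoising step
    return x + random.gauss(0, 1)

def reward(x):                       # stand-in alignment objective on the final sample
    return -abs(x - 3.0)

class Node:
    def __init__(self, x, depth):
        self.x, self.depth, self.children, self.n, self.v = x, depth, [], 0, 0.0

def select(node):                    # UCB-style child selection
    return max(node.children,
               key=lambda c: c.v / (c.n + 1e-9) + math.sqrt(math.log(node.n + 1) / (c.n + 1e-9)))

root = Node(0.0, 0)
for _ in range(200):
    node, path = root, [root]
    while node.children and node.depth < STEPS:        # selection
        node = select(node); path.append(node)
    if node.depth < STEPS:                             # expansion
        child = Node(denoise_step(node.x), node.depth + 1)
        node.children.append(child); path.append(child); node = child
    x = node.x
    for _ in range(STEPS - node.depth):                # rollout to a clean sample
        x = denoise_step(x)
    r = reward(x)
    for n in path:                                     # backup
        n.n += 1; n.v += r

best = max(root.children, key=lambda c: c.v / c.n)
print("best first denoising step leads toward", best.x)
```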
My thesis is now online! umontreal.scholaris.ca/items/f8670d1c… This is more than just a list of publications. I invested a lot of time and passion writing this thesis in hope that it will make for an interesting read. Here's a summary of what you'll find in it.
I'm delighted to share that our paper has been accepted by #TMLR! We empirically observed signs of scaling laws in how the choice of pre-trained model affects OOD test errors and Expected Calibration Error on downstream tasks.
An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Cali... Hiroki Naganuma, Ryuichiro Hataya, Kotaro Yoshida, Ioannis Mitliagkas. Action editor: Mingsheng Long. openreview.net/forum?id=tYjoH… #accuracy #trained #deep
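For reference, Expected Calibration Error, the calibration metric studied above, is typically computed like this; the bin count and toy data here are just for illustration.

```python
# Minimal ECE computation: bin predictions by confidence and average the
# |accuracy - confidence| gap, weighted by bin size.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.uniform(size=1000) < conf * 0.9        # toy, slightly over-confident model
print(f"ECE = {expected_calibration_error(conf, correct):.3f}")
```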
Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer! Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate! Details and links to paper & open source code below! 🧵1/9
A good language model should say “I don’t know” by reasoning about the limits of its knowledge. Our new work AbstentionBench carefully measures this overlooked skill in leading models in an open-codebase others can build on! We find frontier reasoning degrades models’ ability to…
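A rough sketch of what an abstention evaluation loop can look like. The data format and the phrase-matching abstention detector here are assumptions for illustration; the released codebase is the real harness.

```python
# Toy abstention evaluation: check whether the model abstains exactly on the
# questions it should abstain on.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "cannot answer", "not enough information")

def is_abstention(answer: str) -> bool:
    return any(m in answer.lower() for m in ABSTAIN_MARKERS)

def evaluate(model_fn, dataset):
    # dataset: iterable of (question, should_abstain) pairs
    hits = 0
    for question, should_abstain in dataset:
        abstained = is_abstention(model_fn(question))
        hits += int(abstained == should_abstain)
    return hits / len(dataset)

# toy usage with a trivial "model"
toy_data = [("What is 2 + 2?", False), ("What am I thinking right now?", True)]
print(evaluate(lambda q: "4" if "2 + 2" in q else "I don't know.", toy_data))
```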
Excited to share our work on Transformer-PSMs: a neural sequence model with constant per-token inference time and log(seq-len) memory. It sits at a sweet spot between transformers (per-token cost that grows linearly with the KV cache) and RNNs/state-space models (constant). Check out the thread below 👇
Transformers: ⚡️fast to train (compute-bound), 🐌slow to decode (memory-bound). Can Transformers be optimal in both? Yes! By exploiting sequential-parallel duality. We introduce Transformer-PSM with constant time per token decode. 🧐 arxiv.org/pdf/2506.10918
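To make "sequential-parallel duality" concrete with a toy example (a generic linear recurrence, not the Transformer-PSM architecture): the same computation can be run in parallel over the whole sequence at training time, and one token at a time with O(1) state at decode time.

```python
# Toy duality demo for h_t = a_t * h_{t-1} + b_t: a parallel closed form via
# cumulative products matches the sequential, constant-state recurrence.
import numpy as np

rng = np.random.default_rng(0)
T = 16
a, b = rng.uniform(0.5, 0.9, T), rng.normal(size=T)

# parallel form: h_t = sum_{s<=t} (prod_{r=s+1..t} a_r) * b_s
P = np.cumprod(a)                           # P_t = a_1 * ... * a_t
h_parallel = P * np.cumsum(b / P)

# sequential form: constant-size state, one step per decoded token
h, h_sequential = 0.0, []
for t in range(T):
    h = a[t] * h + b[t]
    h_sequential.append(h)

assert np.allclose(h_parallel, np.array(h_sequential))
```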