MIT NLP
@nlp_mit
NLP Group at @MIT_CSAIL! PIs: @yoonrkim @jacobandreas @lateinteraction @pliang279 @david_sontag, Jim Glass, @roger_p_levy
Hello everyone! We are quite a bit late to the Twitter party, but welcome to the MIT NLP Group account! Follow along for the latest research from our labs as we dive deep into language, learning, and logic 🤖📚🧠

fun new paper training LLMs to analyze their own uncertainty and be more calibrated in their confidence! arxiv.org/abs/2507.16806
🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --…
Check out this new paper training LLMs to analyze their own uncertainty and be more calibrated, from @MehulDamani2 @ishapuri101 @StewartSlocum1 @IdanShenfeld and co!
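For a rough sense of what a calibration-aware RL reward along these lines can look like, here is a minimal Python sketch: it assumes the model emits an answer plus a stated confidence in [0, 1], and combines a correctness term with a Brier-style penalty. This is an illustrative assumption, not necessarily the exact reward used in the RLCR paper.

```python
def rlcr_style_reward(answer_correct: bool, stated_confidence: float) -> float:
    """Toy calibration-aware reward: correctness bonus plus a Brier-style
    penalty on the gap between stated confidence and the actual outcome.

    NOTE: a hedged sketch, not the paper's exact reward function.
    """
    outcome = 1.0 if answer_correct else 0.0
    correctness_term = outcome                            # reward getting the answer right
    brier_penalty = (stated_confidence - outcome) ** 2    # penalize miscalibration
    return correctness_term - brier_penalty


# A confidently wrong answer scores worse than an appropriately uncertain one.
print(rlcr_style_reward(False, 0.9))   # -0.81
print(rlcr_style_reward(False, 0.2))   # -0.04
print(rlcr_style_reward(True, 0.9))    #  0.99
```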
I'm currently in Vancouver for #ICML2025 this week and will present our work, "Understanding the Emergence of Multimodal Representation Alignment" later today at 4:30pm. Come by to chat!
If you are interested in questioning how we should pretrain models and create new architectures for general reasoning, then check out E606 @ ICML: our position paper by @seungwookh and me on potential directions for the next generation of reasoning models!
Presenting our ICML spotlight poster today at 11am @ E-606 w/ @jyo_pari! We need to fundamentally change how we train to achieve true reasoning. Reward-based Pretraining (RPT) > Supervised Pretraining
🧙‍♂️Excited to share our new whitepaper “General Reasoning Requires Learning to Reason from the Get-Go.” We argue that simply making models bigger and feeding them more data is NOT enough for robust, adaptable reasoning. (1/n)
We will present the work TODAY at 4:30 PM at West Hall #421 with a huge poster! Come visit us!
Excited to share our latest benchmark, CLIMB: a solid data foundation for multimodal clinical models. With 4.51M patient samples totaling 19.01 TB of data across 13 domains, it's currently the largest public clinical benchmark! Paper: arxiv.org/abs/2503.07667 Code:…
Excited to be here at #ICML2025 to present our paper on 'pragmatic misalignment' in (deployed!) RAG systems: narrowly "accurate" responses that can be profoundly misinterpreted by readers. It's especially dangerous for consequential domains like medicine! arxiv.org/pdf/2502.14898
I'll be presenting "(How) Do Language Models Track State" at ICML! Come by our poster tomorrow, Tuesday July 15, from 4:30pm to 7pm to chat about LMs and whether/how they encode dynamic world models! 🔗 icml.cc/virtual/2025/p…
Past work has shown that world state is linearly decodable from LMs trained on text and games like Othello. But how do LMs *compute* these states? We investigate state tracking using permutation composition as a model problem, and discover interpretable, controllable procedures🧵
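For a concrete picture of the model problem, here is a tiny Python sketch of permutation-composition state tracking: the input is a sequence of permutations, and the "world state" after step t is their running composition. Names and setup are illustrative, not the paper's exact data pipeline.

```python
import random
from itertools import permutations

N = 3  # track permutations of 3 items (the group S_3)
PERMS = list(permutations(range(N)))

def compose(p, q):
    """Apply q first, then p (both are tuples mapping index -> image)."""
    return tuple(p[q[i]] for i in range(N))

def make_example(length=8, seed=0):
    """A sequence of random permutations plus the running composed state."""
    rng = random.Random(seed)
    tokens = [rng.choice(PERMS) for _ in range(length)]
    state = tuple(range(N))  # start from the identity permutation
    states = []
    for t in tokens:
        state = compose(t, state)
        states.append(state)
    return tokens, states    # an LM sees `tokens`; `states` is what probes try to decode

tokens, states = make_example()
print(tokens[0], "->", states[0])
print("final state:", states[-1])
```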
How do task vectors emerge during pretraining—and can they predict ICL performance? Come see our ICML spotlight poster "Emergence and Effectiveness of Task Vectors in ICL" at 11am @ East Hall A-B (#E-2312) with @jinyeop_song! 🔗 icml.cc/virtual/2025/p…
At #ICML 🇨🇦 this week. I'm convinced that the core computations are shared across modalities (vision, text, audio, etc). The real question is the (synthetic) generative process that ties them. Reach out if you have thoughts or want to chat!
I will be in Vancouver 🇨🇦 for #ICML2025 this week and will present #SelfCite on Tuesday morning. Happy to chat and connect. See you there! Blog post link: selfcite.github.io
(1/5)🚨LLMs can now self-improve to generate better citations✅ 📝We design automatic rewards to assess citation quality 🤖Enable BoN/SimPO w/o external supervision 📈Perform close to “Claude Citations” API w/ only 8B model 📄arxiv.org/abs/2502.09604 🧑‍💻github.com/voidism/SelfCi…
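For intuition on what an "automatic citation reward" can look like, here is a minimal sketch in the spirit of context ablation: score citations by how much the response's log-probability drops when the cited sentences are removed, and how well it holds up when only the cited sentences are kept. The function works on precomputed log-probabilities; the paper's exact formulation may differ.

```python
def citation_reward(logp_full: float,
                    logp_without_cited: float,
                    logp_cited_only: float) -> float:
    """Context-ablation-style citation reward (hedged sketch, not the exact
    SelfCite formula).

    logp_full          : log p(response | full context)
    logp_without_cited : log p(response | context with cited sentences removed)
    logp_cited_only    : log p(response | only the cited sentences)
    """
    necessity = logp_full - logp_without_cited    # removing citations should hurt
    sufficiency = logp_cited_only - logp_full     # citations alone should suffice
    return necessity + sufficiency


# Good citations: big drop when removed, little loss when kept alone.
print(citation_reward(logp_full=-12.0, logp_without_cited=-30.0, logp_cited_only=-13.0))  # 17.0
# Bad citations: removing them barely matters.
print(citation_reward(logp_full=-12.0, logp_without_cited=-12.5, logp_cited_only=-40.0))  # -27.5
```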
Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles. I will be presenting with @jyo_pari at E-2702, Tuesday 11-1:30!
📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI! 👉 …vision-language-embodied-ai.github.io 🦾Co-organized with an incredible team → @fredahshi · @maojiayuan · @DJiafei · @ManlingLi_ · David Hsu · @Kordjamshidi 🌌 Why Space & SpaVLE? We…
💡Bridging speech, sound, & music representations with one universal model? We introduce USAD ✅ 📚 Distills knowledge from domain-specific SSL models 🎯 Matches expert models across speech/audio/music tasks 📄 arxiv.org/abs/2506.18843 🧑‍💻 huggingface.co/MIT-SLS/USAD-B…
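For a rough picture of multi-teacher distillation, here is a small PyTorch sketch: a student encoder is regressed, via per-teacher projection heads, onto frame-level features from frozen speech/audio/music teachers. The shapes, layer choices, and loss are illustrative assumptions, not USAD's actual architecture or objective.

```python
import torch
import torch.nn as nn

B, T, D_STUDENT = 2, 100, 256              # batch, frames, student dim (illustrative)
TEACHER_DIMS = {"speech": 768, "audio": 512, "music": 512}   # assumed teacher dims

student = nn.Sequential(                   # stand-in for the student encoder
    nn.Linear(80, D_STUDENT), nn.GELU(), nn.Linear(D_STUDENT, D_STUDENT)
)
heads = nn.ModuleDict({                    # one projection head per teacher
    name: nn.Linear(D_STUDENT, dim) for name, dim in TEACHER_DIMS.items()
})

features = torch.randn(B, T, 80)           # e.g., log-mel input frames
teacher_feats = {                          # frozen teachers' outputs (random stand-ins)
    name: torch.randn(B, T, dim) for name, dim in TEACHER_DIMS.items()
}

student_out = student(features)
loss = sum(
    nn.functional.mse_loss(heads[name](student_out), target)
    for name, target in teacher_feats.items()
)
loss.backward()                            # one distillation step over all teachers
print(float(loss))
```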
Scientists at Massachusetts Institute of Technology have devised a way for large language models to keep learning on the fly—a step toward building AI that continually improves itself. wired.com/story/this-ai-…
Most problems have clear-cut instructions: solve for x, find the next number, choose the right answer. Puzzlehunts don’t. They demand creativity and lateral thinking. We introduce PuzzleWorld: a new benchmark of puzzlehunt problems challenging models to think creatively.
Transformers: ⚡️fast to train (compute-bound), 🐌slow to decode (memory-bound). Can Transformers be optimal in both? Yes! By exploiting sequential-parallel duality. We introduce Transformer-PSM with constant time per token decode. 🧐 arxiv.org/pdf/2506.10918
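The "sequential-parallel duality" idea is easiest to see on plain (unnormalized) linear attention, where the same computation has a matmul-friendly parallel form for training and a constant-size-state recurrent form for decoding. The sketch below is that generic illustration, not the paper's PSM layer.

```python
import torch

T, d, dv = 6, 4, 4
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, dv)

# Parallel form (training): one causal matmul over the whole sequence.
scores = (q @ k.T).tril()                  # causal mask: token t only sees s <= t
out_parallel = scores @ v                  # shape (T, dv)

# Recurrent form (decoding): carry a fixed-size state S, O(1) work per new token.
S = torch.zeros(d, dv)
outs = []
for t in range(T):
    S = S + torch.outer(k[t], v[t])        # update the running sum of k v^T
    outs.append(q[t] @ S)                  # read out with the current query
out_recurrent = torch.stack(outs)

print(torch.allclose(out_parallel, out_recurrent, atol=1e-5))  # True
```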
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
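In outline, the described loop looks something like the skeleton below: the model proposes a "self-edit" (its own training data), a copy of the weights is fine-tuned on that edit, and the downstream performance of the updated copy becomes the RL reward for the edit generator. Every helper here is a hypothetical placeholder sketching the control flow, not SEAL's implementation.

```python
# Hedged skeleton of a SEAL-style outer loop; every helper below is a
# hypothetical stub standing in for real generation / fine-tuning / eval code.

def generate_self_edit(model, new_input):
    """Model writes its own training data for the new input (stub)."""
    return [{"prompt": new_input, "target": "model-written notes"}]

def finetune_copy(model, edit_data):
    """Fine-tune a copy of the model on the self-edit (stub)."""
    return model  # placeholder: would return updated weights

def downstream_score(model, eval_task):
    """Evaluate the updated model on held-out queries about the input (stub)."""
    return 0.5

def rl_update(model, new_input, edit, reward):
    """Reinforce the edit-generation policy with the observed reward (stub)."""
    pass

def seal_step(model, new_input, eval_task):
    edit = generate_self_edit(model, new_input)       # 1. propose a self-edit
    updated = finetune_copy(model, edit)              # 2. apply it via fine-tuning
    reward = downstream_score(updated, eval_task)     # 3. measure downstream gain
    rl_update(model, new_input, edit, reward)         # 4. RL on the edit generator
    return updated, reward

print(seal_step(model=None, new_input="a new document", eval_task="QA about it"))
```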
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch…
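As a rough illustration of the large-chunk test-time-training idea, here is a small pure-PyTorch sketch: a nonlinear fast-weight MLP memory is updated once per large chunk by a gradient step on a key-to-value reconstruction loss, then queried. The chunk size, objective, and architecture are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

d, chunk, n_chunks, lr = 64, 256, 4, 0.1
memory = nn.Sequential(nn.Linear(d, 2 * d), nn.SiLU(), nn.Linear(2 * d, d))  # fast weights

def ttt_update(memory, keys, values, lr):
    """One test-time training step on a whole chunk: fit memory(keys) ≈ values."""
    loss = nn.functional.mse_loss(memory(keys), values)
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, g in zip(memory.parameters(), grads):
            p -= lr * g                      # plain SGD on the fast weights
    return loss.item()

for i in range(n_chunks):
    k = torch.randn(chunk, d)                # chunk of keys   (stand-in features)
    v = torch.randn(chunk, d)                # chunk of values (stand-in targets)
    loss = ttt_update(memory, k, v, lr)      # update memory on the whole chunk at once
    with torch.no_grad():
        q = torch.randn(chunk, d)            # queries for this chunk
        out = memory(q)                      # read from the updated nonlinear memory
    print(f"chunk {i}: recon loss {loss:.3f}")
```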