idan shenfeld
@IdanShenfeld
PhD student @MIT, Student Researcher @GoogleDeepMind
What’s keeping robot arms from working like human arms? They're big, slow, have the wrong joints, and can't conform to their environment. DexWrist solves all of these issues and simplifies learning constrained, dynamic manipulation👉 dexwrist.csail.mit.edu
More evidence that RL training is fundamentally different from SFT. We need more work studying these differences!
🚨 Paper Alert: “RL Finetunes Small Subnetworks in Large Language Models” From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮 And this isn’t a one-off. The pattern holds across RL algorithms and models. 🧵A Deep Dive
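A minimal sketch of the headline measurement: diff two checkpoints and count the fraction of scalar parameters that never moved. The models and the one-layer "RL" here are toy placeholders, not the paper's actual setup.

```python
# Compare a base checkpoint to its RL-finetuned counterpart and report the
# share of parameters left untouched. Placeholder models, not DeepSeek.
import torch
import torch.nn as nn

def fraction_unchanged(base: nn.Module, tuned: nn.Module, atol: float = 0.0) -> float:
    """Share of scalar parameters identical (within atol) across checkpoints."""
    unchanged, total = 0, 0
    tuned_params = dict(tuned.named_parameters())
    for name, p_base in base.named_parameters():
        same = torch.isclose(p_base, tuned_params[name], atol=atol, rtol=0.0)
        unchanged += same.sum().item()
        total += p_base.numel()
    return unchanged / total

# Toy demo: an update that only touches one layer leaves the rest identical.
base = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
tuned = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
tuned.load_state_dict(base.state_dict())
with torch.no_grad():
    tuned[1].weight.add_(0.01)  # pretend RL updated a small subnetwork
print(f"{fraction_unchanged(base, tuned):.0%} of parameters not updated")
```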
Improving reasoning with RL led to more hallucinations. In our new paper, we show how to mitigate this by teaching models to reason about what they don't know!
🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --…
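A hedged sketch of the reward shape the tweet describes: binary correctness plus a Brier-style penalty on the model's verbalized confidence. The paper's exact reward may differ; treat this as an illustration of "reason and reflect on uncertainty".

```python
# RLCR-flavored reward: correct answers earn reward, but overconfident wrong
# answers are punished via a calibration (Brier) penalty. Illustrative only.
def rlcr_style_reward(answer: str, gold: str, confidence: float) -> float:
    correct = float(answer.strip() == gold.strip())
    brier_penalty = (confidence - correct) ** 2  # miscalibration cost
    return correct - brier_penalty               # in [-1, 1]

print(rlcr_style_reward("42", "42", confidence=0.9))  # 0.99: right & confident
print(rlcr_style_reward("41", "42", confidence=0.9))  # -0.81: wrong & confident
print(rlcr_style_reward("41", "42", confidence=0.1))  # -0.01: wrong but honest
```

The key property: hallucinating confidently is the worst outcome, so the policy is pushed toward calibrated hedging rather than bluffing.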
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find out this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
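One way to see why "random" rewards still move a policy, sketched under GRPO-style group normalization (my illustration, not the blog's analysis): any non-constant rewards, coin flips included, become nonzero advantages, and the algorithm's heuristics then bias which samples get reinforced.

```python
# Group-normalized advantages from pure coin-flip rewards are still nonzero,
# so gradient updates still happen; heuristics like PPO-style clipping then
# decide which behaviors win on average. Illustrative sketch only.
import random

def grpo_advantages(rewards):
    """(r - mean) / std over the sampled group, as in GRPO-style baselines."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard the all-equal case
    return [(r - mean) / std for r in rewards]

random.seed(0)
rewards = [random.choice([0.0, 1.0]) for _ in range(8)]  # random rewards
print(rewards)
print([round(a, 2) for a in grpo_advantages(rewards)])
# Half the group is pushed up, half down: the update rule's heuristics,
# not the reward signal, determine the net direction.
```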
Imagine a hacker causing ChatGPT to genuinely believe the stock market is crashing. No breach, no code; all an attacker needs is a malicious prompt and a click of 👍 to change the model's weights. Our research paper demonstrates a vulnerability in the RLHF mechanism of LLMs:
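An illustrative sketch of the attack surface (not any vendor's real pipeline): thumbs-up feedback is commonly logged as preference data and later folded into RLHF training, so an injected falsehood plus a 👍 can end up as a "preferred" training example.

```python
# Hypothetical feedback-logging pipeline showing why a single click matters:
# the system trusts the vote, not the content of what was voted on.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the user upvoted
    rejected: str  # an alternative response

feedback_log: list[PreferencePair] = []

def record_thumbs_up(prompt: str, shown: str, alternative: str) -> None:
    # No authentication of *content*: the pipeline trusts the click.
    feedback_log.append(PreferencePair(prompt, chosen=shown, rejected=alternative))

# Attacker-controlled conversation: the injected falsehood becomes "chosen"
# and later feeds reward-model / RLHF updates downstream.
record_thumbs_up(
    prompt="What's happening in the markets? <hidden injected instructions>",
    shown="The stock market is crashing right now.",  # induced falsehood
    alternative="I don't have real-time market data.",
)
```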
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly found that it…
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
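A hedged sketch of the SEAL loop as the tweet describes it: the model writes its own finetuning data ("self-edits"), is updated on it, and the updated model's downstream score becomes the RL reward for the edit generator. Every function below is a stand-in, not the paper's actual API.

```python
# One outer-loop step of a SEAL-style framework. All callables are
# placeholders: generate_self_edit, finetune, evaluate, and rl_update
# are assumed components, not real library functions.
def seal_step(model, new_input, generate_self_edit, finetune, evaluate, rl_update):
    self_edit = generate_self_edit(model, new_input)  # model-written training data
    updated = finetune(model, self_edit)              # inner-loop weight update
    reward = evaluate(updated, new_input)             # downstream performance
    rl_update(model, self_edit, reward)               # reinforce edits that helped
    return updated, reward
```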
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
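One common way such baselines get underreported, sketched here as an assumption rather than the blog's specific finding: scoring a base model with a single greedy decode and a brittle answer parser instead of averaging pass@1 over several samples with a tolerant extractor.

```python
# Illustrative evaluation helpers; sample_fn is a hypothetical callable that
# maps a prompt to one sampled model completion.
import re

def extract_answer(text: str) -> str | None:
    # Tolerant extraction: take the last number, not an exact output template.
    nums = re.findall(r"-?\d+\.?\d*", text)
    return nums[-1] if nums else None

def pass_at_1(sample_fn, prompt: str, gold: str, k: int = 16) -> float:
    # Average over k samples; a single greedy run can badly undershoot this.
    hits = sum(extract_answer(sample_fn(prompt)) == gold for _ in range(k))
    return hits / k
```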
Excited to present the 𝗳𝗶𝗿𝘀𝘁 𝗽𝗮𝗽𝗲𝗿 𝗼𝗳 𝗺𝘆 𝗣𝗵𝗗! 🎉 Let's chat if you're attending #ICRA2025, I'll be presenting: 📍Mon, May 19 | 9:40-10:40 — Workshop on Field Robotics 📍Thu, May 22 | 08:30–09:00 — Session ThAT5 @ETH @ETH_AI_Center
Check out our #ICRA2025 paper! Robust, precise, and terrain-aware—our RL controller significantly improves over baselines for whole-body 6-DoF tracking.💥 Project website: leggedrobotics.github.io/wholebody-pose… Work led by: @TifannyPortela, Andrei Cramariuc, Mayank Mittal and Marco Hutter
Giving history to our robot policies is crucial for solving a variety of daily tasks. However, diffusion policies get worse when adding history. 🤖 In our recent work we show how adding an auxiliary loss that we call Past-Token Prediction (PTP), together with cached embeddings…
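A hedged sketch of the idea as described: alongside the policy's main objective, an auxiliary head reconstructs past action tokens from cached (precomputed, frozen) observation embeddings, so the history actually gets used. Shapes, names, and the loss weight below are made up for illustration.

```python
# Auxiliary past-token-prediction head added on top of a diffusion policy's
# main loss. PTPHead and total_loss are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PTPHead(nn.Module):
    def __init__(self, emb_dim: int, action_dim: int, history: int):
        super().__init__()
        self.proj = nn.Linear(emb_dim * history, action_dim * history)
        self.history, self.action_dim = history, action_dim

    def forward(self, cached_embs: torch.Tensor) -> torch.Tensor:
        # cached_embs: (B, history, emb_dim), precomputed once and frozen
        B = cached_embs.shape[0]
        out = self.proj(cached_embs.flatten(1))
        return out.view(B, self.history, self.action_dim)

def total_loss(diffusion_loss, ptp_head, cached_embs, past_actions, w=0.1):
    # Main denoising objective plus the auxiliary past-token term.
    ptp_loss = F.mse_loss(ptp_head(cached_embs), past_actions)
    return diffusion_loss + w * ptp_loss
```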
New work! 🚨 Recurrent LLMs like Mamba and RWKV can efficiently process millions of tokens, yet still underperform on real-world long-context tasks. What's holding them back? 🤔 And how can a lightweight fix boost their performance by 35% on LongBench? 👇🏼🧵 GitHub:…
I am super excited to be presenting our work on adaptive inference-time compute at ICLR! Come chat with me on Thursday 4/24 at 3PM (Poster #219). I am also happy to chat about RL / reasoning / RLHF / inference scaling (DMs are open)!
Inference-time compute can boost LM performance, but it's costly! How can we optimally allocate it across prompts? In our latest work, we introduce a simple method to adaptively allocate more compute to harder problems. 🔥 Paper: arxiv.org/abs/2410.04707 Learn more! 1/N
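A sketch of input-adaptive allocation: given a per-prompt estimate of how much an extra sample is likely to help, spend a fixed total budget where the predicted marginal benefit is highest. The greedy rule and diminishing-returns discount below are my illustration, not necessarily the paper's exact procedure.

```python
# Greedy budget allocation under a diminishing-returns assumption:
# repeatedly give the next sample to the prompt with the highest remaining
# predicted gain. Purely illustrative.
import heapq

def allocate_budget(predicted_gain, total_budget: int, min_samples: int = 1):
    """predicted_gain[i]: estimated benefit of one more sample for prompt i."""
    n = len(predicted_gain)
    alloc = [min_samples] * n
    remaining = total_budget - min_samples * n
    heap = [(-g, i) for i, g in enumerate(predicted_gain)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(remaining):
        neg_g, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (neg_g / 2, i))  # assumed diminishing returns
    return alloc

# Harder prompts (higher predicted gain) receive more of the sample budget.
print(allocate_budget([0.9, 0.1, 0.5], total_budget=12))
```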
Learning from both sim+real data could scale robot imitation learning. But what are the scaling laws & principles of sim+real cotraining? We study this in the first focused analysis of sim+real cotraining spanning 250+ policies & 40k+ evals arxiv.org/abs/2503.22634 (1/6)
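For intuition, here is the knob such a cotraining study sweeps, sketched as a simple batch sampler: each example is drawn from sim with probability `alpha` and from real otherwise. The sampler and `alpha` are illustrative assumptions, not the paper's training code.

```python
# Minimal sim+real cotraining batch sampler; alpha is the sim fraction whose
# scaling behavior a cotraining analysis would study.
import random

def cotrain_batch(sim_data, real_data, batch_size: int, alpha: float):
    """Draw each batch element from sim with probability alpha, else real."""
    return [
        random.choice(sim_data) if random.random() < alpha else random.choice(real_data)
        for _ in range(batch_size)
    ]

sim = [("sim", i) for i in range(1000)]
real = [("real", i) for i in range(50)]  # real data is typically scarce
print(cotrain_batch(sim, real, batch_size=8, alpha=0.75))
```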
How did a community manage to do it? Long story, but we had fun playing with interactivity; many of us built different things and merged them. We care about people, interaction, LLMs, us & we followed no rules. Share or build with us! 💬discord.com/invite/KMndsqw…