Gabriel Sarch
@GabrielSarch
Ph.D. Candidate at Carnegie Mellon University @mldcmu @cmuneurosci. Prev. @yutori_ai @MSFTResearch. Incoming postdoc @PrincetonPLI.
How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
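For readers curious what "anchoring reasoning to image regions" could look like in practice, here is a minimal, hypothetical Python sketch (not ViGoRL's actual trace format or reward code): each reasoning step cites an (x, y) image coordinate, and a toy GRPO-style scalar reward combines answer correctness with a small bonus when every step is grounded.

```python
# Hypothetical sketch only: this is NOT ViGoRL's data format or reward code.
# It assumes each reasoning step cites an (x, y) image coordinate and that a
# toy scalar reward mixes answer correctness with a grounding bonus.
import re
from typing import List, Tuple

def parse_grounded_steps(trace: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Extract (step_text, (x, y)) pairs from lines such as:
    The street sign (512, 88) reads Market St."""
    steps = []
    for line in trace.strip().splitlines():
        m = re.search(r"\((\d+),\s*(\d+)\)", line)
        if m:
            steps.append((line.strip(), (int(m.group(1)), int(m.group(2)))))
    return steps

def toy_reward(trace: str, answer: str, gold: str) -> float:
    """1.0 for a correct final answer, plus 0.1 if every reasoning step is
    anchored to some coordinate (an assumed reward shape, for illustration)."""
    correct = float(answer.strip().lower() == gold.strip().lower())
    lines = [l for l in trace.strip().splitlines() if l.strip()]
    grounded = bool(lines) and len(parse_grounded_steps(trace)) == len(lines)
    return correct + (0.1 if grounded else 0.0)

trace = """Look at the street sign (512, 88); it reads 'Market St'.
Zoom toward the crosswalk (430, 610); a cyclist is waiting."""
print(parse_grounded_steps(trace))                   # two grounded steps
print(toy_reward(trace, "Market St", "market st"))   # 1.1
```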
Sutton and Barto’s book first sparked my fascination with RL, and Sutton’s essays still push me to rethink my assumptions. These are writings I will revisit throughout my career.
awards.acm.org/about/2024-tur… Machines that learn from experience were explored by Alan Turing almost eighty years ago, which makes it particularly gratifying and humbling to receive an award in his name for reviving this essential but still nascent idea.
Check out the great work by Jacob on reconstructing video from fMRI using a motion bottleneck! Outperforms standard approaches without motion guidance. Oral + poster tomorrow at #CVPR2025. Don't miss it!
1/6 🚀 Excited to share that BrainNRDS has been accepted as an oral at #CVPR2025! We decode motion from fMRI activity and use it to generate realistic reconstructions of videos people watched, outperforming strong existing baselines like MindVideo and Stable Video Diffusion.🧠🎥
We're excited to launch Scouts — always-on AI agents that monitor the web for anything you care about.
Go check out @GabrielSarch's new work on how to train VLMs to reason in a grounded way via RL. ViGoRL works quite well! I personally like some of the insights about how to induce useful base behaviors in the model that can be amplified via RL for better visual reasoning.…
Unifying 2D and 3D perception is a key step toward building more capable embodied agents. Great work by Ayush and team on scaling 3D referential grounding and QA!
1/ Despite having access to rich 3D inputs, embodied agents still rely on 2D VLMs—due to the lack of large-scale 3D data and pre-trained 3D encoders. We introduce UniVLG, a unified 2D-3D VLM that leverages 2D scale to improve 3D scene understanding. univlg.github.io
I am excited to share that I’ll be joining Princeton University as a @PrincetonPLI Postdoctoral Fellow next fall! I look forward to working on core problems in multimodal models and agentic reasoning with the incredible faculty and students!

New discovery! LLMs are just like humans! Overthinking GREATLY HURTS their performance. If we select the solution with the lower overthinking score, we improve model performance by almost 30% while reducing costs by 43% (o1_low). Is reasoning really the future of LLMs? 🧵
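To make the selection rule concrete, here is a tiny hypothetical sketch: sample several candidate solutions, score each with some overthinking metric (the heuristic below is a stand-in, not the paper's actual score), and keep the lowest-scoring one.

```python
# Hypothetical illustration of the selection rule described above: from a
# pool of sampled solutions, keep the one with the lowest "overthinking"
# score. The scoring heuristic here is a stand-in, not the paper's metric.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    answer: str
    reasoning_tokens: int  # rough proxy for how much the model deliberated

def overthinking_score(c: Candidate) -> float:
    # Assumed stand-in: longer reasoning traces get a higher score.
    return float(c.reasoning_tokens)

def select_least_overthought(pool: List[Candidate]) -> Candidate:
    return min(pool, key=overthinking_score)

pool = [Candidate("42", 180), Candidate("42", 950), Candidate("17", 2400)]
print(select_least_overthought(pool))  # Candidate(answer='42', reasoning_tokens=180)
```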
number one reason to get a phd is because it's the only time in life you can think of something and go "hmm, that would be pretty cool" and then just spend several months *doing it*; oh and you will be joined by other people who think it's cool too and suggest ways for you to…
Yutori is an exceptional technical team with a unique vision for how web agents will evolve into personal assistants for everyday life. Check out their demo, and if you're excited about the future of web agents, be sure to reach out to them!
OpenAI's announcement of Operator on Thursday was a great excuse for us to come out of stealth to show off the AI agents tech we've been building at Yutori. Which means I can now say out loud — we're hiring! Our current top hiring priorities are an awesome founding frontend…