Pete Shaw
@ptshaw2
Research Scientist @GoogleDeepmind
Excited to share a new paper: “ALTA: Compiler-Based Analysis of Transformers” (w/ @James_Cohan, @jacobeisenstein, @kentonctlee, @JonathanBerant, @toutanova) arxiv.org/abs/2410.18077
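For context, ALTA follows the tradition of RASP-style languages whose programs compile to Transformer weights. Below is a minimal NumPy sketch of the select/aggregate primitives at the heart of such languages; this is an illustration of the general idea, not ALTA's actual syntax or API.

```python
import numpy as np

def select(keys, queries, predicate):
    # Hard-attention pattern: A[q, k] = 1 if predicate(keys[k], queries[q]).
    # This mirrors the "select" primitive of RASP-style languages.
    return np.array([[1.0 if predicate(k, q) else 0.0 for k in keys]
                     for q in queries])

def aggregate(attn, values):
    # Uniform average of the selected values at each query position,
    # mirroring the "aggregate" primitive.
    weights = attn / np.maximum(attn.sum(axis=1, keepdims=True), 1e-9)
    return weights @ np.asarray(values, dtype=float)

# Example: at each position, the fraction of tokens so far equal to "a".
tokens = list("abaab")
positions = range(len(tokens))
causal = select(positions, positions, lambda k, q: k <= q)  # causal mask
is_a = [1.0 if t == "a" else 0.0 for t in tokens]
print(aggregate(causal, is_a))  # [1.0, 0.5, 0.667, 0.75, 0.6]
```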

AgentRewardBench will be presented at @COLM_conf 2025 in Montreal! See you soon and ping me if you want to meet up!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
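To make "evaluating the evaluators" concrete, here is a small sketch of scoring a judge's verdicts against human annotations; the function and data are hypothetical, not the benchmark's actual API.

```python
# Hypothetical sketch: scoring an automatic evaluator (e.g., an LLM judge)
# against human success labels for web agent trajectories.

def score_judge(judge_verdicts, human_labels):
    """Precision/recall of the judge's 'success' verdicts vs. human labels."""
    tp = sum(j and h for j, h in zip(judge_verdicts, human_labels))
    fp = sum(j and not h for j, h in zip(judge_verdicts, human_labels))
    fn = sum(h and not j for j, h in zip(judge_verdicts, human_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A strict rule-based evaluator that misses valid alternative solutions
# shows low recall here, i.e., it underreports success.
human = [True, True, False, True, False]
rule_based = [True, False, False, False, False]  # misses two true successes
print(score_judge(rule_based, human))  # (1.0, 0.333...)
```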
Neural networks can express more than they learn, creating expressivity-trainability gaps. Our paper, “Mind The Gap,” shows that neural networks learn parallel algorithms best, and analyzes these gaps in terms of faithfulness and effectiveness. @rao2z
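One way to see the parallel-vs-sequential distinction: a prefix sum can be computed in N dependent steps or in O(log N) rounds of independent adds, and a fixed-depth architecture matches the latter. A minimal sketch of the contrast (my illustration, not the paper's code):

```python
import numpy as np

def sequential_prefix_sum(x):
    # N dependent steps: each output waits on the previous one.
    out, total = [], 0
    for v in x:
        total += v
        out.append(total)
    return np.array(out, dtype=float)

def parallel_prefix_sum(x):
    # Hillis-Steele scan: O(log N) rounds of independent, parallel adds,
    # the kind of shallow computation a fixed-depth network can mirror.
    x = np.array(x, dtype=float)
    shift = 1
    while shift < len(x):
        x[shift:] = x[shift:] + x[:-shift]
        shift *= 2
    return x

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert np.allclose(sequential_prefix_sum(data), parallel_prefix_sum(data))
print(parallel_prefix_sum(data))  # [ 3.  4.  8.  9. 14. 23. 25. 31.]
```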
We're hiring a research scientist on the Foundational Research in Language team at GDM. The role is right here in sunny Seattle! job-boards.greenhouse.io/deepmind/jobs/…
This was my first time submitting to TMLR, and thanks to the reviewers and AE @murefil for making it a positive experience! TMLR seems to offer some nice pros vs. ICML/ICLR/NeurIPS, e.g.:
- Potentially lower-variance review process
- Not dependent on the conference calendar
ALTA: Compiler-Based Analysis of Transformers Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova. Action editor: Alessandro Sordoni. openreview.net/forum?id=h751w… #compiler #interpreter #programming
Hi ho! New work: arxiv.org/pdf/2503.14481 With amazing collabs @jacobeisenstein @jdjdhekchbdjd @adamjfisch @ddua17 @fantinehuot @mlapata @vicky_zayats Some things are easier to learn in a social setting. We show agents can learn to faithfully express their beliefs (along... 1/3
📣📣 My team at Google DeepMind is hiring a student researcher for summer/fall 2025 in Seattle! If you're a PhD student interested in getting deep RL to (finally) work reliably in interesting domains, apply at the link below and reach out to me via email so I know you applied👇
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g. to automate hate speech and spread misinformation? To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web…
Submit your work to the first Agent + Language workshop at @aclmeeting! We have a list of awesome speakers, and you'll have the chance to meet other researchers working on agents!
Interested in learning more about LLM agents and in contributing to this topic?🚀 📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻 We have an exciting lineup of speakers 🗓️ Submit your work by *March 1st*
Excited to share our work on improving Gemini for learning!
In a recent technical report, LearnLM, our set of AI models and capabilities fine-tuned for learning, outperformed other leading AI models on the principles of learning science. Now it’s available to try out in AI Studio. Learn more ↓ goo.gle/4gmEdxp
I'll be at NeurIPS this week. Please reach out if you would like to chat!
I have multiple vacancies for PhD and Masters students at @Mila_Quebec @McGill_NLP in NLP/ML focusing on representation learning, reasoning, multimodality and alignment. Deadline for applications is Dec 1st. More details: mila.quebec/en/prospective…
The #AlphaFold 3 model code and weights are now available for academic use. We @GoogleDeepMind are excited to see how the research community continues to use AlphaFold to address open questions in biology and new lines of research. github.com/google-deepmin…
We love the excitement & results from the community on AlphaFold 3 and are doubling the AF Server daily job limit to 20. Happy to also share that we're working on releasing the AF3 model (incl weights) for academic use, which doesn’t depend on our research infra, within 6 months.
The RASP-L conjecture was one of the reasons I became interested in languages that compile to Transformers, leading to our recent work on ALTA. Looking forward to reading this one!
We finally formalize the RASP-L conjecture in this work!
- Theoretical guarantee on generalization for C-RASP tasks
- Validated on 8 algorithmic and 17 finite-state languages
- C-RASP is based on communication complexity; it can only transfer O(log N) bits between inputs
Thread for details👇
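A quick intuition for the O(log N) budget (my illustration, not from the paper): a running count over a length-N prefix fits in ⌈log2(N+1)⌉ bits, so count-based predicates such as prefix-majority stay within it, while copying an arbitrary N-token prefix across positions would not.

```python
import math

# Illustration (mine, not the paper's): a count over a length-n prefix
# fits in O(log n) bits, within the C-RASP communication budget.
def bits_for_count(n):
    return math.ceil(math.log2(n + 1))  # a count in [0, n] fits in this many bits

def majority_of_prefix(tokens, target="a"):
    # Count-based predicate: at each position, is `target` the majority so far?
    # Only the running count (O(log N) bits) must flow between positions.
    out, count = [], 0
    for i, t in enumerate(tokens, start=1):
        count += (t == target)
        out.append(count * 2 > i)
    return out

print(bits_for_count(1024))          # 11 bits for sequences up to length 1024
print(majority_of_prefix("abaab"))   # [True, False, True, True, True]
```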
I am hiring for a research engineering role in NYC, focused on Gemini post-training. If you are interested, please apply here. The deadline is just two weeks away. boards.greenhouse.io/deepmind/jobs/…