Arkil Patel
@arkil_patel
CS PhD Student at Mila and McGill | Worked at AllenNLP and Microsoft Research
𝐓𝐡𝐨𝐮𝐠𝐡𝐭𝐨𝐥𝐨𝐠𝐲 paper is out! 🔥🐋 We study the reasoning chains of DeepSeek-R1 across a variety of tasks and settings and find several surprising and interesting phenomena! Incredible effort by the entire team! 🌐: mcgill-nlp.github.io/thoughtology/
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks, investigating its capabilities, limitations, and behaviour. 🔗: mcgill-nlp.github.io/thoughtology/
Nice work! We observed a similar trend on certain math tasks in our work: arxiv.org/abs/2504.07128 Section 4.1 has a discussion of our findings. You might want to consider citing it :) cc @saraveramarjano @arkil_patel @sivareddyg
If you’re at ICML and work on interpretability or causality, go talk to @_shruti_joshi_, she has a fantastic paper!
I will be at the Actionable Interpretability Workshop (@ActInterp, #ICML) presenting *SSAEs* in the East Ballroom A from 1-2pm. Drop by (or send a DM) to chat about (actionable) interpretability, (actionable) identifiability, and everything in between!
Come find us at the #ICML2025 poster if you are interested in the safety of web agents!
I'll be at #ICML2025 this week presenting SafeArena (Wednesday 11AM - 1:30PM in East Exhibition Hall E-701). Come by to chat with me about web agent safety (or anything else safety-related)!
SafeArena is being presented at #ICML2025!! Check out our poster and talk to @ncmeade for all things ‘safety ∪ agents ∪ LLMs’!
Congrats @vernadankers!! We’re lucky to have you join our lab!
Congratulations Verna! This was one of the best theses I've ever read; I highly recommend checking out Verna's work on the tradeoffs between memorization and generalization in language models! vernadankers.com
I miss Edinburgh and its wonderful people already!! Thanks to @tallinzen and @PontiEdoardo for inspiring discussions during the viva! I'm now exchanging Arthur's Seat for Mont Royal to join @sivareddyg's wonderful lab @Mila_Quebec 🤩
Huge congratulations to Dr. @vernadankers for passing her viva today! 🥳🎓 It's been an honour sharing the PhD journey with you. I wasn’t ready for the void your sudden departure left (in the office and in my life!). Your new colleagues are lucky to have you! 🥺🥰 @Edin_CDT_NLP
"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
Do LLMs hallucinate randomly? Not quite. Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably. 📎 Paper: arxiv.org/abs/2505.22630 1/n
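A minimal sketch of the kind of probe this finding implies, assuming a generic text-generation interface (`query_model` is a hypothetical placeholder, not the paper's code):

```python
# Hypothetical sketch: probing hallucination under irrelevant context.
# `query_model` is a stand-in for any LLM API; swap in a real client.

IRRELEVANT_CONTEXT = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
QUESTION = "Who wrote the novel 'Beloved'?"

def query_model(prompt: str) -> str:
    # Placeholder mock so the sketch runs; replace with a real model call.
    return "Gustave Eiffel" if "Eiffel" in prompt else "Toni Morrison"

baseline = query_model(QUESTION)
with_distractor = query_model(f"{IRRELEVANT_CONTEXT}\n\n{QUESTION}")

# If hallucinations were random, the two answers would differ unpredictably;
# the paper reports a systematic failure mode, where context cues steer the
# model toward answers from the wrong abstract class.
print("baseline:       ", baseline)
print("with distractor:", with_distractor)
```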
📢 New Paper! Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning about plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎 W/ @melaniesclar and @tsvetshop 1/n
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
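For intuition, here is a back-of-the-envelope sketch of how one might score an automatic evaluator against human annotations (hypothetical data structures, not the AgentRewardBench API):

```python
# Hypothetical sketch: comparing an automatic evaluator's verdicts on
# web-agent trajectories against human labels. Not the benchmark's code.
from dataclasses import dataclass

@dataclass
class Trajectory:
    human_success: bool   # human annotator's verdict
    judge_success: bool   # LLM-judge / rule-based evaluator's verdict

def precision_recall(trajs: list[Trajectory]) -> tuple[float, float]:
    tp = sum(t.human_success and t.judge_success for t in trajs)
    fp = sum((not t.human_success) and t.judge_success for t in trajs)
    fn = sum(t.human_success and (not t.judge_success) for t in trajs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A rule-based evaluator that rejects genuinely successful trajectories
# shows up here as low recall, i.e. an underreported success rate.
trajs = [Trajectory(True, True), Trajectory(True, False), Trajectory(False, False)]
print(precision_recall(trajs))  # (1.0, 0.5)
```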
many many many thanks to @kchonyc and @Yoshua_Bengio for enabling the wildest ever start of my research career. 2014 was a very special time to do deep learning; a commit that changes 50 lines of code could give you a Test of Time (ToT) award 10 years later 😲
Super timely work led by @xhluca with extensive human evaluation of agent trajectories across multiple benchmarks and LLMs!
DeepSeek-R1 Thoughtology is now #2 on @huggingface daily papers. Thanks for building this great platform for sharing new papers, @_akhaliq!
DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning
A 142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
I think one of the most underrated sources of insight in research is just looking at the model's outputs. The Thoughtology paper is what happens when an entire lab of grad students at Mila does this cumbersome task for R1's CoT and actually quantifies all the patterns we saw.
Thoughtology is trending today on hf daily papers! Read our paper for a detailed analysis of R1’s long chains of thoughts across a variety of settings. huggingface.co/papers/2504.07…
And Thoughtology is now on arXiv! Read more about R1 reasoning 🐋💭 across visual, cultural and psycholinguistic tasks at the link below: 🔗 arxiv.org/abs/2504.07128
Introducing nanoAhaMoment: a Karpathy-style, single-file RL-for-LLMs library (<700 lines)
- super hackable
- no TRL / Verl, no abstractions 💆‍♂️
- single GPU, full-param tuning, 3B LLM
- efficient (R1-Zero countdown < 10h)
Comes with a from-scratch, fully spelled-out YT video [1/n]
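For a sense of what such a single-file loop boils down to, here is a toy REINFORCE-style sketch with a binary countdown-style reward. The model name, task, and hyperparameters are placeholders; nanoAhaMoment's actual implementation may differ substantially:

```python
# Toy single-file RL-for-LLMs loop: REINFORCE with a binary reward.
# Sketch under stated assumptions, not nanoAhaMoment's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any small causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward(completion: str, target: int) -> float:
    # Countdown-style check: reward 1 if the target number appears.
    return 1.0 if str(target) in completion else 0.0

prompt = "Combine 3, 5 and 7 with + and * to reach 22. Answer:"
for step in range(10):
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]
    out = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                         return_dict_in_generate=True)
    completion_ids = out.sequences[0, prompt_len:]
    r = reward(tok.decode(completion_ids, skip_special_tokens=True), 22)

    # REINFORCE: push up the log-prob of sampled tokens, scaled by reward.
    logits = model(out.sequences).logits[0, prompt_len - 1:-1]
    logp = torch.log_softmax(logits, -1).gather(-1, completion_ids[:, None]).sum()
    loss = -r * logp
    opt.zero_grad(); loss.backward(); opt.step()
```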
Watch Siva’s talk on thoughtology: youtube.com/live/aO_cTIY9K…
I will be giving a talk about this work @SimonsInstitute tomorrow (Apr 2nd 3PM PT). Join us in person or virtually. simons.berkeley.edu/workshops/futu…
Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts ✨. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.