Sara Vera Marjanović
@saraveramarjano
PhD fellow in NLP IR & XAI 🏠 @uni_copenhagen @MLSectionUCPH @CopeNLU ✈️ @Mila_Quebec @mcgill_nlp // Recreational sufferer.
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour. 🔗: mcgill-nlp.github.io/thoughtology/

Nice work! We observed a similar trend on certain math tasks in our work: arxiv.org/abs/2504.07128 Section 4.1 has a discussion of our findings. You might want to consider citing it :) cc @saraveramarjano @arkil_patel @sivareddyg
Have you ever wondered whether a few instances of data contamination really lead to benchmark overfitting?🤔 Then our latest paper on the effect of data contamination on LLM evals might be for you!🚀 "How Much Can We Forget about Data Contamination?" (accepted at #ICML2025) shows…
Excited to share the results of my internship research with @AIatMeta, as part of a larger world modeling release! What subtle shortcuts are VideoLLMs taking on spatio-temporal questions? And how can we instead curate shortcut-robust examples at large scale? Details 👇🔬
Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model,…
"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
A great collab with former labmates @AntChen_ & Dongyan! Interesting cognitive limitation in LMs: strong disjunctive bias leads to poor performance on conjunctive causal inference tasks. Mirrors adult human biases—possibly a byproduct of training data priors.
Language model (LM) agents are all the rage now—but they may have cognitive biases when inferring causal relationships! We evaluate LMs on psychology tasks and find: - LMs struggle with certain simple causal relationships - They show biases similar to human adults (but not children) 🧵⬇️
Congratulations to Mila members Ada Tur, Gaurav Kamath and @sivareddyg for their SAC award at #NAACL2025! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670
In "Investigating Human Values in Online Communities", we perform a large-scale study of the unique values expressed by online communities arxiv.org/abs/2402.14177 #NAACL2025 #NLProc @NadavBorenstein @rnav_arora @frimelle @IAugenstein x.com/NadavBorenstei…
Ever wondered which subreddit is the most benevolent? In a new paper (preprint: arxiv.org/abs/2402.14177), @rnav_arora, @frimelle, @IAugenstein and I annotated 6M posts across 10k subreddits with Schwartz values.
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
DeepSeek-R1 marks a shift in LLM reasoning. But what happens when we dive deep into its "thoughts"? 🤔 A new "Thoughtology" paper explores DeepSeek-R1's reasoning chains, capabilities, limitations & even its safety concerns.
DeepSeek-R1 Thoughtology now #2 on @huggingface daily papers Thanks for building this great platform for sharing new papers @_akhaliq
DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning 142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
[DeepSeek, Reasoning] paper DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning An excellent analysis paper on DeepSeek-R1, which even coins a new term for it: thoughtology. First of all, this is the power of DeepSeek being open source. Compared with OpenAI's o1, DeepSeek-R1…
Thoughtology is trending today on hf daily papers! Read our paper for a detailed analysis of R1’s long chains of thoughts across a variety of settings. huggingface.co/papers/2504.07…
And thoughtology is now on Arxiv! Read more about R1 reasoning 🐋💭 across visual, cultural and psycholinguistic tasks at the link below: 🔗 arxiv.org/abs/2504.07128
I'm so grateful to @bcs_irsg @TechAtBloomberg for honouring me with the Karen Spärck Jones Award 🙏 I gave the award lecture on LLMs’ Utilisation of Parametric & Contextual Knowledge at #ECIR2025 today (slides: isabelleaugenstein.github.io/slides/2025_EC…) bcs.org/membership-and… #NLProc @CopeNLU
Introducing nanoAhaMoment: a Karpathy-style, single-file RL-for-LLMs library (<700 lines) - super hackable - no TRL / Verl, no abstraction💆♂️ - Single GPU, full param tuning, 3B LLM - Efficient (R1-zero countdown < 10h) Comes with a from-scratch, fully spelled-out YT video [1/n]
Talking about "DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning" Going live at 11am PDT (i.e., 20 mins). Last minute change of plans. You might be able to see live here: youtube.com/watch?v=aO_cTI…
I will be giving a talk about this work @SimonsInstitute tomorrow (Apr 2nd 3PM PT). Join us, both in-person or virtually. simons.berkeley.edu/workshops/futu…