Tuhin Chakrabarty
@TuhinChakr
Assistant Prof @sbucompsc @stonybrooku Researcher → @SFResearch Interests : Human Centered AI / Future of Work / AI & Creativity Formerly @ColumbiaCompSci
Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.


Lots of doubts about these findings, but the most egregious bit is that Historians have higher exposure to AI than Customer Service Representatives? Nonsense !!
Microsoft just released their list of the 40 jobs most AI-applicable and the 40 jobs least AI-applicable. This list may have overlap with replaceability and job transformation. (Phlebotomists, you've got nothing to worry about.)
Is LLM use finally making me less capable? I started using LLMs three years ago for text and code gen. Now, I use several of them, for a ton more things. In fact, I feel like I use them for a huge fraction of the cognitive tasks that I perform that can be described in text.…
Anthropic has been a series of ideological decisions later defeated by business realities
SCOOP: Leaked memo from Anthropic CEO Dario Amodei outlines the startup's plans to seek investment from the United Arab Emirates and Qatar. “Unfortunately, I think ‘no bad person should ever benefit from our success’ is a pretty difficult principle to run a business on.”
Happy to present OLMoTrace at #ACL2025NLP next week!! 🤗 If you stop by the demo session on Tuesday, July 29, 10:30am-12pm, @yanaiela and @sewon__min will be sharing how we use OLMoTrace to make LLMs more transparent. Unfortunately I'm unable to attend in-person due to visa 🥹
Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting them to their training data. We do this at unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
Go work with Abhilasha! She is an amazing researcher and person. ☺️
Life update: I’m excited to share that I’ll be starting as faculty at the Max Planck Institute for Software Systems(@mpi_sws_) this Fall!🎉 I’ll be recruiting PhD students in the upcoming cycle, as well as research interns throughout the year: lasharavichander.github.io/contact.html
ChatGPT Agent is a huge step up on BearCubs, esp on multimodal/interactive tasks (e.g., playing web games)! It gets 65.8% accuracy vs Deep Research's 36% and Operator's 23%. Humans are at ~85%, and clearly better/faster at fine control & complex filtering.
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web! ✅ Humans achieve 85% accuracy ❌ OpenAI Operator: 24% ❌ Anthropic Computer Use: 14% ❌ Convergence AI Proxy: 13%
issues w preference LM benchmarks 🐡data contains cases where the "bad" response is just as good as the chosen one 🐟model rankings can feel off (claude ranks lower than expected) led by @cmalaviya11 (TACL 2025), we study underspecified queries & their detrimental effect on model evals
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
🏆 #ICML2025 Best Paper Award: AI Safety Should Prioritize the Future of Work 📄 Paper: arxiv.org/abs/2504.13959 🎉 Congratulations to Sanchaita Hazra @hsanchaita, Bodhisattwa Prasad Majumder @mbodhisattwa, and Tuhin Chakrabarty @TuhinChakr for winning the Outstanding Award —…
If you're a prospective student reaching out to a PI who you want to work with - remember that we receive quite a few such emails, and if multiple people use ChatGPT to draft their letter, we're bound to see some phrases repeat over and over again.
"Seeing" robins and sparrows may not necessarily make them birdier to LMs! Super excited about this paper -- massive shoutout to all my co-authors, especially @yulu_qin and @dhevarghese for leading the charge!
Does vision training change how language is represented and used in meaningful ways?🤔 The answer is a nuanced yes! Comparing VLM-LM minimal pairs, we find that while the taxonomic organization of the lexicon is similar, VLMs are better at _deploying_ this knowledge. [1/9]
ever since VLMs were a thing i've been interested in how the additional visual modality changes language in meaningful ways. after negative findings after negative findings, excited to report this result! proud of our junior authors for digging into this 🐸
"writing is not only about reporting results; it also provides a tool to uncover new thoughts and ideas. Writing compels us to think"
Strange world to live in. AI Twitter is awash with claims about the IMO performance of LLMs, when 99% of humans can't do it or don't care about it. The gap between real utility and what it takes to paint the illusion of intelligence will only grow with time :)
This explains why OpenAI results are out and GDM results are not. And what's out isn't even an official result verified by the IMO!
🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony. According to a Coordinator on Problem 6, the one problem OpenAI…
Join us in west building room 223 for the #Memorization workshop!!
At some point @amazon should release public data on the sales of AI generated books :) Maybe this will clear the air of the “transformativeness” myth of training on 📚
My dad bought a book on Carl Jung from Amazon. He's a few pages in, telling me how bad it is. I look. The cover looks like Dall-E. The text formatting has messy white space. The introduction's third sentence is "Jung was not X, he was Y." The entire book is ChatGPT-generated!
Excited to share what I have been focusing on this year! Inference-time search to optimize Bayesian surprise pushes us towards long-horizon discovery! Introducing "AutoDS": Autonomous Discovery via Surprisal. "It can not only find the diamond in the rough, but also can rule out…
Great science starts with great questions. 🤔✨ Meet AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking. 🧵