Greg Durrett
@gregd_nlp
CS professor at UT Austin. Large language models and NLP. he/him
📣 Today we launched an overhauled NLP course to 600 students in the online MS programs at UT Austin. 98 YouTube videos 🎥 + readings 📖 open to all! cs.utexas.edu/~gdurrett/cour… w/ 5 hours of new 🎥 on LLMs, RLHF, chain-of-thought, etc.! Meme trailer 🎬 youtu.be/DcB6ZPReeuU 🧵
Excited to share that QUDsim has been accepted to #COLM2025!! 🎉🎉
Have that eerie feeling of déjà vu when reading model-generated text, but can't pinpoint the specific words or phrases? ✨ We introduce QUDsim to quantify discourse similarities beyond lexical, syntactic, and content overlap.
🤔 How do we train LLMs on real-world tasks where it's hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR), a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
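To make the rubric-as-reward idea concrete, here is a minimal sketch of how a checklist-style rubric could be scored into a scalar reward. The `Criterion` dataclass, the judge callable, and the weighting scheme are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: a weighted checklist rubric scored into a [0, 1] reward.
# In practice the judge would be an LLM call; here a toy keyword judge stands in.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str  # e.g. "Mentions contraindications"
    weight: float     # relative importance of this checklist item

def rubric_reward(response: str, rubric: list[Criterion],
                  judge: Callable[[str, str], bool]) -> float:
    """Reward = weighted fraction of rubric items the judge marks satisfied."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if judge(response, c.description))
    return earned / total if total else 0.0

# Toy usage: a trivial keyword "judge" standing in for an LLM judge.
rubric = [Criterion("mentions dosage", 2.0), Criterion("cites a source", 1.0)]
toy_judge = lambda resp, crit: crit.split()[-1] in resp.lower()
print(rubric_reward("Take 200mg; dosage info from a cited source.", rubric, toy_judge))
# -> 1.0; this scalar could feed an on-policy RL objective as the tweet describes.
```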
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉 🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! arxiv.org/pdf/2504.15219
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria. EvalAgent finds 👩‍🏫 expert advice on the web that implicitly addresses the user's prompt 🧵👇
Are we fact-checking medical claims the right way? 🩺🤔 Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems. We show why, and argue fact-checking should be a dialogue, with patients in the loop: arxiv.org/abs/2506.20876 🧵 1/
So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
Marin 32B training crossed 1.5 trillion tokens today...
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
There's been hot debate about (The Illusion of) The Illusion of Thinking. My take: it's not that models can't reason; they just aren't perfect at long-form generation yet. We eval reasoning models on the LongProc benchmark (requiring generating 8K-token CoTs; see thread). Reasoning…
🤔 Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing an 8K-token chain of thought for a planning problem? Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
If we don't do physical work in our jobs, we go to the gym and work out. What are the gyms for skills that LLMs will automate?
I'm excited about Leo's use of hypernetworks for data-efficient knowledge editing! Tweaking what a model learns from data is very powerful & useful for other goals like alignment. Haven't seen much other work building on MEND recently, but let me know what cool stuff we missed!
LLMs trained to memorize new facts can't use those facts well. 🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit! 💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
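As a rough illustration of the MEND-style mechanism PropMEND builds on: for a linear layer, the fine-tuning gradient factors as an outer product of output-side deltas and input activations, and a small hypernetwork rewrites those factors before the update is applied. Everything below (shapes, the residual MLP, the name `GradEditor`) is a hedged sketch, not the released code:

```python
# Hedged sketch of hypernetwork-edited gradients (MEND-style), assuming a
# linear layer whose raw gradient is outer(delta, activations). Editors
# transform each factor; the edited rank-1 gradient is applied as the update.
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Tiny MLP that rewrites one gradient factor, with a residual connection
    so the edited factor stays close to the raw fine-tuning gradient."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, factor: torch.Tensor) -> torch.Tensor:
        return factor + self.net(factor)

def edited_update(weight: torch.Tensor, delta: torch.Tensor,
                  act: torch.Tensor, editor_d: GradEditor,
                  editor_a: GradEditor, lr: float = 1e-4) -> torch.Tensor:
    """One knowledge-editing step: weight -= lr * outer(edited factors).

    delta: (out_dim,) gradient w.r.t. the layer's pre-activations
    act:   (in_dim,)  the layer's input activations
    """
    d, a = editor_d(delta), editor_a(act)
    with torch.no_grad():
        weight -= lr * torch.outer(d, a)
    return weight
```

The editors themselves are meta-trained; per the tweet, PropMEND's contribution is a new objective there so the edited update propagates the fact (e.g., to downstream questions) rather than just memorizing it.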
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval? 📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval. Main contributions: 🔍 Better head detection: we find a…
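A hedged sketch of how query-focused retrieval heads might be used for reranking, assuming the (layer, head) pairs have already been identified: score each passage by the attention mass that query tokens place on passage tokens in those heads. The head indices, prompt format, and helper names here are made up for illustration:

```python
# Illustrative reranking with attention from assumed "retrieval heads".
# RETRIEVAL_HEADS and the prompt template are hypothetical placeholders;
# real head ids would come from a detection procedure like the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RETRIEVAL_HEADS = [(9, 7), (11, 3)]  # hypothetical (layer, head) pairs

def qr_head_score(model, tok, query: str, passage: str) -> float:
    text = passage + "\n\nQuestion: " + query
    inputs = tok(text, return_tensors="pt")
    # Approximate boundary between passage tokens and query tokens.
    passage_len = len(tok(passage)["input_ids"])
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    score = 0.0
    for layer, head in RETRIEVAL_HEADS:
        attn = out.attentions[layer][0, head]  # (seq_len, seq_len)
        # Attention mass flowing from query tokens back onto the passage.
        score += attn[passage_len:, :passage_len].sum().item()
    return score  # rerank candidate passages by descending score

# Toy setup with a small model (eager attention so attentions are returned):
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
```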
Understanding what's in the data is a high-leverage activity when it comes to training/evaluating models and agents. This week we will drill down into a few popular benchmarks and share some custom viewers that help surface various insights. Our viewer for GPQA (Google…
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: (1) LLMs are limited in creativity since they learn to predict the next token; (2) creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱). 1/ 🧵
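One reading of "seed-conditioning," sketched under assumptions: instead of relying on sampling temperature for diversity, prepend a random seed string to each training prompt so the model learns to map different seeds to different valid completions. The `<seed:...>` format below is invented for illustration, not the paper's:

```python
# Hypothetical seed-conditioning: noise enters through the input, so even
# greedy decoding can yield diverse outputs by varying the seed prefix.
import random

def seed_condition(prompt: str, seed: int | None = None) -> str:
    seed = seed if seed is not None else random.randrange(10**6)
    return f"<seed:{seed}> {prompt}"

# Training: attach a fresh random seed to each (prompt, completion) pair.
# Inference: sample several seeds to elicit distinct continuations.
prompts = [seed_condition("Write a four-line poem about rivers.") for _ in range(3)]
```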
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. @sebajoed @jessyjli @Murtazahusaintx @gregd_nlp @StephaJuneau @paultorrey9 Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney
How good are LLMs at scientific computing and visualization? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match the ground truth's scientific utility 16% of the time. 🧵
Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment. 📣 Introducing CoBRA 🐍, a framework that assesses strategic language. Work with my amazing advisors @jessyjli and @David_Beaver! 🧵👇
📢 Announcing The First Workshop on the Application of LLM Explainability to Reasoning and Planning at @COLM_conf! We welcome perspectives from LLM, XAI, and HCI! CFP (due June 23): …reasoning-planning-workshop.github.io
CoT is effective for in-domain reasoning tasks, but Fangcong's work takes a nice step in improving compositional generalization of CoT reasoning. We teach models that atomic CoT skills fit together like puzzle pieces so they can then combine them in novel ways. Lots to do here!
Solving complex problems with CoT requires combining different skills. We can do this by: 🧩 modifying the CoT data format to be "composable" with other skills, 👥 training models on each skill, and 🔗 combining those models. This leads to better 0-shot reasoning on tasks involving skill composition!
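One plausible reading of the "combine those models" step is parameter-space merging of the per-skill checkpoints (task-arithmetic style); the tweet doesn't specify the mechanism, so treat this as a hedged sketch rather than the paper's method:

```python
# Hedged sketch: weighted average of per-skill model checkpoints, key by key.
import torch

def merge_state_dicts(state_dicts: list[dict], weights: list[float]) -> dict:
    """Combine per-skill checkpoints by weighted parameter averaging."""
    assert len(state_dicts) == len(weights) and state_dicts
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# e.g. merged = merge_state_dicts([skill_a.state_dict(), skill_b.state_dict()],
#                                 weights=[0.5, 0.5]); model.load_state_dict(merged)
```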