Greg Durrett
@gregd_nlp
CS professor at UT Austin. Large language models and NLP. he/him
📣 Today we launched an overhauled NLP course to 600 students in the online MS programs at UT Austin. 98 YouTube videos 🎥 + readings 📖 open to all! cs.utexas.edu/~gdurrett/cour… w/ 5 hours of new 🎥 on LLMs, RLHF, chain-of-thought, etc.! Meme trailer 🎬 youtu.be/DcB6ZPReeuU 🧵
Excited to share that QUDsim has been accepted to #COLM2025!! 🎉🎉
Have that eerie feeling of déjà vu when reading model-generated text, but can't pinpoint the specific words or phrases? ✨ We introduce QUDsim to quantify discourse similarities beyond lexical, syntactic, and content overlap.
🤔 How do we train LLMs on real-world tasks where it's hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR), a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
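To make the rubric-as-reward idea concrete, here is a minimal sketch of how a checklist-style rubric could be scored into a scalar reward. The `Criterion` dataclass, the judge callable, and the weighting scheme are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: a weighted checklist rubric scored into a [0, 1] reward.
# In practice the judge would be an LLM call; here a toy keyword judge stands in.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str  # e.g. "Mentions contraindications"
    weight: float     # relative importance of this checklist item

def rubric_reward(response: str, rubric: list[Criterion],
                  judge: Callable[[str, str], bool]) -> float:
    """Reward = weighted fraction of rubric items the judge marks satisfied."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if judge(response, c.description))
    return earned / total if total else 0.0

# Toy usage: a trivial keyword "judge" standing in for an LLM judge.
rubric = [Criterion("mentions dosage", 2.0), Criterion("cites a source", 1.0)]
toy_judge = lambda resp, crit: crit.split()[-1] in resp.lower()
print(rubric_reward("Take 200mg; dosage info from a cited source.", rubric, toy_judge))
# -> 1.0; this scalar could feed an on-policy RL objective as the tweet describes.
```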
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉 🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! arxiv.org/pdf/2504.15219
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria. EvalAgent finds 👩‍🏫 expert advice on the web that implicitly addresses the user's prompt 🧵👇
Are we fact-checking medical claims the right way? 🩺🤔 Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems. We show why, and argue fact-checking should be a dialogue, with patients in the loop: arxiv.org/abs/2506.20876 🧵 1/
So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
Marin 32B training crossed 1.5 trillion tokens today...
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
There's been hot debate about (The Illusion of) The Illusion of Thinking. My take: it's not that models can't reason; they just aren't perfect at long-form generation yet. We eval reasoning models on the LongProc benchmark (requiring generating 8K-token CoTs; see thread). Reasoning…
🤔 Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing an 8K-token chain of thought for a planning problem? Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
If we don't do physical work in our jobs, we go to the gym and work out. What are the gyms for skills that LLMs will automate?
I'm excited about Leo's use of hypernetworks for data-efficient knowledge editing! Tweaking what a model learns from data is very powerful & useful for other goals like alignment. Haven't seen much other work building on MEND recently, but let me know what cool stuff we missed!
LLMs trained to memorize new facts can't use those facts well. 🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit! 💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
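As a rough illustration of the MEND-style mechanism PropMEND builds on: for a linear layer, the fine-tuning gradient factors as an outer product of output-side deltas and input activations, and a small hypernetwork rewrites those factors before the update is applied. Everything below (shapes, the residual MLP, the name `GradEditor`) is a hedged sketch, not the released code:

```python
# Hedged sketch of hypernetwork-edited gradients (MEND-style), assuming a
# linear layer whose raw gradient is outer(delta, activations). Editors
# transform each factor; the edited rank-1 gradient is applied as the update.
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Tiny MLP that rewrites one gradient factor, with a residual connection
    so the edited factor stays close to the raw fine-tuning gradient."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, factor: torch.Tensor) -> torch.Tensor:
        return factor + self.net(factor)

def edited_update(weight: torch.Tensor, delta: torch.Tensor,
                  act: torch.Tensor, editor_d: GradEditor,
                  editor_a: GradEditor, lr: float = 1e-4) -> torch.Tensor:
    """One knowledge-editing step: weight -= lr * outer(edited factors).

    delta: (out_dim,) gradient w.r.t. the layer's pre-activations
    act:   (in_dim,)  the layer's input activations
    """
    d, a = editor_d(delta), editor_a(act)
    with torch.no_grad():
        weight -= lr * torch.outer(d, a)
    return weight
```

The editors themselves are meta-trained; per the tweet, PropMEND's contribution is a new objective there so the edited update propagates the fact (e.g., to downstream questions) rather than just memorizing it.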
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval? 📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval. Main contributions: 🔍 Better head detection: we find a…
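A hedged sketch of how query-focused retrieval heads might be used for reranking, assuming the (layer, head) pairs have already been identified: score each passage by the attention mass that query tokens place on passage tokens in those heads. The head indices, prompt format, and helper names here are made up for illustration:

```python
# Illustrative reranking with attention from assumed "retrieval heads".
# RETRIEVAL_HEADS and the prompt template are hypothetical placeholders;
# real head ids would come from a detection procedure like the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RETRIEVAL_HEADS = [(9, 7), (11, 3)]  # hypothetical (layer, head) pairs

def qr_head_score(model, tok, query: str, passage: str) -> float:
    text = passage + "\n\nQuestion: " + query
    inputs = tok(text, return_tensors="pt")
    # Approximate boundary between passage tokens and query tokens.
    passage_len = len(tok(passage)["input_ids"])
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    score = 0.0
    for layer, head in RETRIEVAL_HEADS:
        attn = out.attentions[layer][0, head]  # (seq_len, seq_len)
        # Attention mass flowing from query tokens back onto the passage.
        score += attn[passage_len:, :passage_len].sum().item()
    return score  # rerank candidate passages by descending score

# Toy setup with a small model (eager attention so attentions are returned):
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
```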
Understanding what's in the data is a high-leverage activity when it comes to training/evaluating models and agents. This week we will drill down into a few popular benchmarks and share some custom viewers that help surface various insights. Our viewer for GPQA (Google…
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: (1) LLMs are limited in creativity since they learn to predict the next token; (2) creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱). 1/ 🧵
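One reading of "seed-conditioning," sketched under assumptions: instead of relying on sampling temperature for diversity, prepend a random seed string to each training prompt so the model learns to map different seeds to different valid completions. The `<seed:...>` format below is invented for illustration, not the paper's:

```python
# Hypothetical seed-conditioning: noise enters through the input, so even
# greedy decoding can yield diverse outputs by varying the seed prefix.
import random

def seed_condition(prompt: str, seed: int | None = None) -> str:
    seed = seed if seed is not None else random.randrange(10**6)
    return f"<seed:{seed}> {prompt}"

# Training: attach a fresh random seed to each (prompt, completion) pair.
# Inference: sample several seeds to elicit distinct continuations.
prompts = [seed_condition("Write a four-line poem about rivers.") for _ in range(3)]
```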
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. @sebajoed @jessyjli @Murtazahusaintx @gregd_nlp @StephaJuneau @paultorrey9 Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney
How good are LLMs at scientific computing and visualization? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match the ground truth's scientific utility 16% of the time. 🧵
Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment. 📣 Introducing CoBRA 🐍, a framework that assesses strategic language. Work with my amazing advisors @jessyjli and @David_Beaver! 🧵👇
📢 Announcing The First Workshop on the Application of LLM Explainability to Reasoning and Planning at @COLM_conf! We welcome perspectives from LLM, XAI, and HCI! CFP (due June 23): …reasoning-planning-workshop.github.io
CoT is effective for in-domain reasoning tasks, but Fangcong's work takes a nice step in improving compositional generalization of CoT reasoning. We teach models that atomic CoT skills fit together like puzzle pieces so they can then combine them in novel ways. Lots to do here!
Solving complex problems with CoT requires combining different skills. We can do this by: 🧩 modifying the CoT data format to be "composable" with other skills, 👥 training models on each skill, and 🔗 combining those models. This leads to better 0-shot reasoning on tasks involving skill composition!
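One plausible reading of the "combine those models" step is parameter-space merging of the per-skill checkpoints (task-arithmetic style); the tweet doesn't specify the mechanism, so treat this as a hedged sketch rather than the paper's method:

```python
# Hedged sketch: weighted average of per-skill model checkpoints, key by key.
import torch

def merge_state_dicts(state_dicts: list[dict], weights: list[float]) -> dict:
    """Combine per-skill checkpoints by weighted parameter averaging."""
    assert len(state_dicts) == len(weights) and state_dicts
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# e.g. merged = merge_state_dicts([skill_a.state_dict(), skill_b.state_dict()],
#                                 weights=[0.5, 0.5]); model.load_state_dict(merged)
```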