Katherine Thai
@kthai1618
I did CS, Math, and English at @rutgersu. Now I do NLP as a PhD student @UMass_NLP and @pangramlabs (from NYC). she/her
my life is just a repeating sequence of this scene: at hibachi, chef does the veggie toss thing, it hits me in the face, I require therapy
We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check the 🧵 for a 🔗 to the website :)
My first PhD paper started with something like “Because it’s impossible to feed an entire novel to a language model, here’s what we did instead.” 😂 Anyway, here’s NEW work on tricky long context (& often subtle!) claim verification. I personally read 3 novels for the dataset 📚
Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of the 11 tested LLMs reaches human performance → 97%. The best, #GPT-4o, gets only 55.8%.
🎉🎉🎉 So happy this is finally out there!!!
📢Happy to announce litmt.org, a platform for sharing and commenting on LLM-generated translations of novels into over 20 target languages. LitMT aims to make previously-untranslated world literature accessible beyond language barriers. 📚 1/6
I was napping for like 80% of the time Ben spent building this
I’m learning so much and having the best time with @pangramlabs this summer. Congrats to the team!
Thrilled to announce that @pangramlabs closed its seed round, bringing our total raised to $4M! I'm so excited to continue our mission serving schools, internet platforms, and more with incredibly accurate AI detection technology.
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
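A rough illustration of the setup described above (not the paper's actual prompt or corpus): sample a pile of unrelated human-written paragraphs and instruct the model to compose something new while lifting most of its output verbatim from them. The prompt wording, the 90% copy ratio, and the `load_paragraphs` helper below are all hypothetical.

```python
import random

def load_paragraphs(path: str) -> list[str]:
    # Hypothetical helper: one human-written paragraph per line.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def build_frankentext_prompt(paragraphs: list[str], n_snippets: int = 100,
                             copy_ratio: float = 0.9, seed: int = 0) -> str:
    # Sample unrelated human-written paragraphs as source material.
    rng = random.Random(seed)
    snippets = rng.sample(paragraphs, k=min(n_snippets, len(paragraphs)))
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(snippets, 1))
    # Illustrative instruction: write something new, but copy most of it verbatim.
    return (
        "Write a short, coherent story on any topic. At least "
        f"{int(copy_ratio * 100)}% of your output must be copied verbatim "
        "from the numbered source paragraphs below; you may add brief "
        "connective text of your own.\n\nSource paragraphs:\n" + numbered
    )

# Usage: prompt = build_frankentext_prompt(load_paragraphs("human_paragraphs.txt"))
```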
"Tell, Don't Show" was accepted to #ACL2025 Findings! Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box ✨📚 arxiv.org/abs/2505.23166
🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
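A minimal sketch of the core idea (not the paper's exact training setup): score each model completion with sentence-level BLEU against a gold reference and use that score as the reward signal. The `bleu_reward` name and the 0–1 rescaling are my own choices.

```python
import sacrebleu  # pip install sacrebleu

def bleu_reward(completions: list[str], references: list[str]) -> list[float]:
    # One reward per completion: sentence-level BLEU against its reference,
    # rescaled from sacrebleu's 0-100 range to 0-1.
    return [
        sacrebleu.sentence_bleu(completion, [reference]).score / 100.0
        for completion, reference in zip(completions, references)
    ]

# Example: the completion closer to the reference gets the higher reward.
refs = ["The cat sat on the mat."]
print(bleu_reward(["The cat sat on the mat."], refs))      # ~1.0
print(bleu_reward(["A dog ran through the park."], refs))  # near 0
```

In a setup like TRL's GRPOTrainer, which accepts custom reward functions, a function of this shape could stand in for a learned reward model; the paper's actual training configuration may differ.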
What is @pangramlabs and how is it different from other AI detection tools out there? And how do we convince a skeptic? Our false positive rate is roughly 1 in 10,000. How do we assess this number, and how can an academic trust it? 1/
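For context, here is one standard way an outside academic could sanity-check a claimed 1-in-10,000 false positive rate, independent of how Pangram computes it internally: run the detector on N known-human documents, count the k flagged as AI, and report an exact (Clopper–Pearson) binomial confidence interval on the rate. The sample size and counts below are made up.

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    # Exact binomial confidence interval for a proportion k/n.
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Made-up numbers: 3 false positives out of 50,000 human-written documents.
k, n = 3, 50_000
lo, hi = clopper_pearson(k, n)
print(f"observed FPR = {k/n:.5f}, 95% CI = ({lo:.5f}, {hi:.5f})")
```

With these made-up counts the interval spans very roughly 1 in 80,000 to 1 in 6,000, which is the kind of statement a skeptical reviewer can actually evaluate against a 1-in-10,000 claim.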
Getting lots of replies pushing Pangram Labs here. They claim very low false positive rates on their website. I remain doubtful without independent assessment of false positives (this study was not meant to do that), & concerned that these detectors are used adversarially.
After a great team lunch (ft. mango sticky rice) and successful brainstorming sesh, I’m so excited to start working with @pangramlabs for the summer as a research scientist intern! Hard not to be excited when you hear the team talk about what they’re working on—stay tuned :)
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web! ✅ Humans achieve 85% accuracy ❌ OpenAI Operator: 24% ❌ Anthropic Computer Use: 14% ❌ Convergence AI Proxy: 13%
Super cool project I contributed to where we evaluated a bunch of web browsing agents on really interesting multimodal challenges—I got to watch agents try to play some silly flash games and accidentally click on an ad for pants 😂
⚠️ Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification. We present CLIPPER✂️, a compression-based pipeline that produces grounded instructions for ~$0.5 each, 34x cheaper than human annotations.
People often claim they know when ChatGPT wrote something, but are they as accurate as they think? Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯
Summer update to our NoCha long-context LLM leaderboard! Main highlight: LLaMA3.1 405B is the first (and currently only) open-weight model to convincingly beat the random guessing baseline of 25%, ranking at #5 overall! novelchallenge.github.io
I hate those airport security setups where the bins could really use a scheduler so much!!!
I just went to my first craft circle and people recognized me from my fiber arts instagram and it felt even better than when people recognize me at conferences
Karma is not having any deadlines last night 💜 Thanks to @taylorswift13 for soundtracking my academic career from middle schooler to PhD student. #EastRutherfordTSTheErasTour

Interviewed a new therapist who has never heard of ChatGPT. As an NLP grad student, is this good or bad?
my friend hired a couch assembly task rabbit who did it but complained the entire time that it was hard and they “could have been out with friends instead” and my friend was like “you willingly accepted this job ???” but as a PhD student I totally get it