Michael Saxon
@m2saxon
🧐 Multilingualism, evals, and more for ((V)L,R)Ms & T2I 🔜🏔️Postdoc @uw @techpolicylab 🌊NSF GRF & CS PhD @ucsbNLP 🌵ECE BS/MS @asu typo-prone wordcel
Check out our super fun paper on reasoning model overthinking! We introduce very simple measures of overthinking, questions that elicit overthinking, and a Schwarzeneggerian training-free, black-box intervention that forces RMs to set and obey their own token budgets!
🧠 Reasoning models often overthink. 🚀 In our new paper, we show: 1️⃣ Two overthinking scores. 2️⃣ DUMB500 — a benchmark of extremely easy questions. 3️⃣ THOUGHT TERMINATOR — a decoding method that reduces token waste by up to 90%, often improving accuracy. Details below 👇
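If you're curious how a training-free, black-box budget intervention can work mechanically, here's a minimal sketch in the spirit of the above (not the paper's implementation; the model choice, prompt wording, budget policy, and `</think>` convention are all illustrative assumptions):

```python
# Hedged sketch of budget-forced decoding: tell the model its budget via the
# prompt, hard-cap generation, and if the reasoning never closes, terminate
# the thought ourselves and force a final answer. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # any open reasoning model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def budgeted_answer(question: str, budget: int = 256) -> str:
    # 1) Black-box step: the budget is communicated purely through the prompt.
    prompt = f"{question}\nThink briefly; you have {budget} thinking tokens.\n<think>\n"
    ids = tok(prompt, return_tensors="pt").input_ids
    # 2) Hard cap: generation cannot exceed the stated budget.
    out = model.generate(ids, max_new_tokens=budget, do_sample=False)
    # Keep special tokens so we can see whether </think> was emitted.
    text = tok.decode(out[0], skip_special_tokens=False)
    # 3) Terminator step: if the model never closed its reasoning, close it
    #    for it and ask for the answer directly.
    if "</think>" not in text:
        ids = tok(text + "\n</think>\nFinal answer:", return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=False)
    return text
```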
The NIH Is Capping Research Proposals Because It's Overwhelmed by AI Submissions 🔗 404media.co/nih-capping-re…
There's fundamentally no commonality between AI and crypto except that they both use GPUs, and many of the same shills shill for both. Any policy or position that ties AI and crypto together is inherently shill-y
🚨The UK AISI identified four methodological flaws in AI "scheming" studies (deceptive alignment) conducted by Anthropic, METR, Apollo Research, and others: "We call on researchers studying AI 'scheming' to minimise their reliance on anecdotes, design research with appropriate…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
It is with a heavy heart I report we actually found the Minecraft movie to be pretty funny
pro tip: you can basically read >100 books per day by asking chatgpt to summarize them for you.
Our overthinking paper will appear at COLM 2025!
Check out our fun ACL 2025 Findings paper! Using a mocap suit, Justin recorded dozens of mimed actions. We then generate in-context and out-of-context videos of characters performing these actions to see whether VLMs can understand this form of nonverbal communication!
Can you tell what actions are being mimed in this video? If so, you’re smarter than AI models! Check the last tweet in this thread for answers. In a new paper, we present MIME, which evaluates whether vision language models (VLMs) have a robust understanding of human actions. 🧵
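For a concrete sense of what such a probe looks like, here's a hedged sketch of a MIME-style query (not the paper's actual harness; the model, prompt, frame count, and exact-match scoring below are illustrative assumptions):

```python
# Hedged sketch of a MIME-style probe: sample frames from a mimed-action clip,
# ask a VLM to name the action, and score against a ground-truth label.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def frames_b64(path: str, n: int = 8) -> list[str]:
    # Uniformly sample n frames from the video and JPEG-encode them as base64.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, img = cap.read()
        if ok:
            buf = cv2.imencode(".jpg", img)[1].tobytes()
            frames.append(base64.b64encode(buf).decode())
    return frames

def guess_action(path: str) -> str:
    # Send the frames plus a short question to the VLM.
    content = [{"type": "text", "text": "What action is being mimed? Answer in one or two words."}]
    content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames_b64(path)]
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content.strip()

# e.g., accuracy = mean(guess_action(p).lower() == label for p, label in dataset)
```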
the existence of large language models implies the existence of Large, huge, HUGE, small, and footnotesize language models
By its definition, pretrained LMs must be "stochastic parrots." Also by its definition (and Chomskyan arguments), humans are not stochastic parrots (I do believe this). The problem is, it's not at all obvious what a stochastic parrot can or can't do.
The correct take: - “stochastic parrot” is a top notch turn of phrase that does accurately convey something about what LLMs do - stochastic parrots can be smarter than most of the people who like that phrase think - but not as smart as the people who don’t like that phrase think
The Traffic God is the most fearsome and capricious of the Californian Pantheon
It is official: my first day at UCSB as faculty. Happy to join the effort to advance AI research and nurture talent. Let's go @ucsbNLP @ucsbcs!
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Genuinely, these distributional associations are why they still can't tune Grok to be both helpful and anti-woke. The best they can do is have it deny the existence of any objective reality.
Excited to have supervised these papers! EM was wild, with unclear implications for safety. We answer how: there's a general evil vector, and boosting it is A solution to SFT on any narrow evil task. We don't know WHY it's so general, but we release better EM models to boost research.
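"A general evil vector" is the language of activation steering: a direction in the residual stream, estimable, e.g., as a difference of mean activations. Here's a minimal sketch of finding such a direction and projecting it out at inference (the layer index, data, and projection-removal fix are hypothetical illustrations, not the papers' exact method):

```python
# Hedged activation-steering sketch: estimate a direction as the difference of
# mean residual activations on misaligned vs. benign completions, then remove
# its component from a layer's output at inference time.
import torch

def direction_from_means(acts_bad: torch.Tensor, acts_good: torch.Tensor) -> torch.Tensor:
    # acts_*: [num_examples, hidden_dim] residual-stream activations at one layer
    v = acts_bad.mean(0) - acts_good.mean(0)
    return v / v.norm()

def ablate_hook(v: torch.Tensor):
    # Forward hook that projects out the component along v from a layer's output.
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ v).unsqueeze(-1) * v  # remove the "evil" direction
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# usage (hypothetical layer index):
# handle = model.model.layers[12].register_forward_hook(ablate_hook(v))
# ... generate as usual, with the direction suppressed ...
# handle.remove()
```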
Our position paper was selected for an oral at #ACL2025! Definitely attend if you want to hear spicy takes on why MCQA benchmarks suck and how education researchers can teach us to solve these problems 👀
🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵