Michael Saxon
@m2saxon
🧐 Multilingualism, evals, and more for ((V)L,R)Ms & T2I 🔜🏔️Postdoc @uw @techpolicylab 🌊NSF GRF & CS PhD @ucsbNLP 🌵ECE BS/MS @asu typo-prone wordcel
Check out our super fun paper on reasoning model overthinking! We introduce very simple measures of overthinking, questions that elicit overthinking, and a Schwarzeneggerian training-free, black-box intervention that forces RMs to set and obey their own token budgets!
🧠 Reasoning models often overthink. 🚀 In our new paper, we show: 1️⃣ Two overthinking scores. 2️⃣ DUMB500 — a benchmark of extremely easy questions. 3️⃣ THOUGHT TERMINATOR — a decoding method that reduces token waste by up to 90%, often improving accuracy. Details below 👇
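If you're curious how a training-free, black-box budget intervention can work mechanically, here's a minimal sketch in the spirit of the above (not the paper's implementation; the model choice, prompt wording, budget policy, and `</think>` convention are all illustrative assumptions):

```python
# Hedged sketch of budget-forced decoding: tell the model its budget via the
# prompt, hard-cap generation, and if the reasoning never closes, terminate
# the thought ourselves and force a final answer. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # any open reasoning model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def budgeted_answer(question: str, budget: int = 256) -> str:
    # 1) Black-box step: the budget is communicated purely through the prompt.
    prompt = f"{question}\nThink briefly; you have {budget} thinking tokens.\n<think>\n"
    ids = tok(prompt, return_tensors="pt").input_ids
    # 2) Hard cap: generation cannot exceed the stated budget.
    out = model.generate(ids, max_new_tokens=budget, do_sample=False)
    # Keep special tokens so we can see whether </think> was emitted.
    text = tok.decode(out[0], skip_special_tokens=False)
    # 3) Terminator step: if the model never closed its reasoning, close it
    #    for it and ask for the answer directly.
    if "</think>" not in text:
        ids = tok(text + "\n</think>\nFinal answer:", return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=False)
    return text
```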
The NIH Is Capping Research Proposals Because It's Overwhelmed by AI Submissions 🔗 404media.co/nih-capping-re…
There's fundamentally no commonality between AI and crypto except that they both use GPUs, and many of the same shills shill for both. Any policy or position that ties AI and crypto together is inherently shill-y
🚨The UK AISI identified four methodological flaws in AI "scheming" studies (deceptive alignment) conducted by Anthropic, METR, Apollo Research, and others: "We call on researchers studying AI 'scheming' to minimise their reliance on anecdotes, design research with appropriate…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
It is with a heavy heart I report we actually found the Minecraft movie to be pretty funny
pro tip: you can basically read >100 books per day by asking chatgpt to summarize them for you.
Our overthinking paper will appear at COLM 2025!
Check out our fun ACL 2025 Findings paper! Using a mocap suit, Justin recorded dozens of mimed actions. We then generate in-context and out-of-context videos of characters performing these actions to see whether VLMs can understand this form of nonverbal communication!
Can you tell what actions are being mimed in this video? If so, you’re smarter than AI models! Check the last tweet in this thread for answers. In a new paper, we present MIME, which evaluates whether vision language models (VLMs) have a robust understanding of human actions. 🧵
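For a concrete sense of what such a probe looks like, here's a hedged sketch of a MIME-style query (not the paper's actual harness; the model, prompt, frame count, and exact-match scoring below are illustrative assumptions):

```python
# Hedged sketch of a MIME-style probe: sample frames from a mimed-action clip,
# ask a VLM to name the action, and score against a ground-truth label.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def frames_b64(path: str, n: int = 8) -> list[str]:
    # Uniformly sample n frames from the video and JPEG-encode them as base64.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, img = cap.read()
        if ok:
            buf = cv2.imencode(".jpg", img)[1].tobytes()
            frames.append(base64.b64encode(buf).decode())
    return frames

def guess_action(path: str) -> str:
    # Send the frames plus a short question to the VLM.
    content = [{"type": "text", "text": "What action is being mimed? Answer in one or two words."}]
    content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames_b64(path)]
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content.strip()

# e.g., accuracy = mean(guess_action(p).lower() == label for p, label in dataset)
```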
the existence of large language models implies the existence of Large, huge, HUGE, small, and footnotesize language models
By its definition, pretrained LMs must be "stochastic parrots." Also by its definition (and Chomskyan arguments), humans are not stochastic parrots (I do believe this). The problem is, it's not at all obvious what a stochastic parrot can or can't do.
The correct take: - “stochastic parrot” is a top notch turn of phrase that does accurately convey something about what LLMs do - stochastic parrots can be smarter than most of the people who like that phrase think - but not as smart as the people who don’t like that phrase think
The Traffic God is the most fearsome and capricious of the Californian Pantheon
It is official: my first day at UCSB as faculty. Happy to join the effort to advance AI research and nurture talent. Let's go @ucsbNLP @ucsbcs!
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Genuinely, these distributional associations are why they still can't tune Grok to be both helpful and anti-woke. The best they can do is have it deny the existence of any objective reality.
Excited to have supervised these papers! EM was wild, with unclear implications for safety. We answer how: there's a general evil vector, and boosting it is A solution to SFT on any narrow evil task. We don't know WHY it's so general, but we release better EM models to boost research.
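"A general evil vector" is the language of activation steering: a direction in the residual stream, estimable, e.g., as a difference of mean activations. Here's a minimal sketch of finding such a direction and projecting it out at inference (the layer index, data, and projection-removal fix are hypothetical illustrations, not the papers' exact method):

```python
# Hedged activation-steering sketch: estimate a direction as the difference of
# mean residual activations on misaligned vs. benign completions, then remove
# its component from a layer's output at inference time.
import torch

def direction_from_means(acts_bad: torch.Tensor, acts_good: torch.Tensor) -> torch.Tensor:
    # acts_*: [num_examples, hidden_dim] residual-stream activations at one layer
    v = acts_bad.mean(0) - acts_good.mean(0)
    return v / v.norm()

def ablate_hook(v: torch.Tensor):
    # Forward hook that projects out the component along v from a layer's output.
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ v).unsqueeze(-1) * v  # remove the "evil" direction
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# usage (hypothetical layer index):
# handle = model.model.layers[12].register_forward_hook(ablate_hook(v))
# ... generate as usual, with the direction suppressed ...
# handle.remove()
```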
Our position paper was selected for an oral at #ACL2025! Definitely attend if you want to hear spicy takes on why MCQA benchmarks suck and how education researchers can teach us to solve these problems 👀
🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵