Alex Shaw
@alexgshaw
Researching @LaudeInstitute & investing @LaudeVentures. Co-creator of Terminal-Bench. Formerly Google. BYU alum.
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating one can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
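The npm analogy suggests a flow like: install one CLI, pull a benchmark by name and version, run your agent against it. A rough sketch of what that could look like — the "name==version" pin mirrors the registry idea, but treat the exact subcommands, flags, and model name as assumptions and check the Terminal-Bench docs:

  $ uv tool install terminal-bench   # installs the `tb` CLI
  # pull a registered dataset and run an agent against it in one step
  $ tb run --dataset terminal-bench-core==head \
      --agent terminus --model anthropic/claude-sonnet-4-20250514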

Hill climbing Terminal-Bench means getting good at text-based computer use, not just coding.
I'm making a list of all the non-coding things people are doing with Claude Code. What are you using Claude Code for?
K Prize round one results are live. Huge congrats to Eduardo for taking the top spot. A solo builder from Brazil, he correctly closed 9 out of 120 GitHub issues with his winning submission. $50K prize ($278k BRL!)
Using Terminal-Bench for evaluating coding capabilities!🥰
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Great results for the open-source community :) Congrats @Alibaba_Qwen
The world is moving towards agents.
Static benchmarks don't measure what agents do best (multi-turn reasoning).
Thus, interactive benchmarks:
* Terminal Bench (@alexgshaw, @Mike_A_Merrill)
* Text Arena (@LeonGuertler)
* BALROG (@PaglieriDavide, @_rockt)
* ARC-AGI-3 (@arcprize)
Terminals are amazing. You'll never regret mastering the shell & unix fundamentals (pipes, processes, filesystem), and key tools (curl, dig, jq, ssh, tmux, apt, nvim…), especially given agents' propensity to wield these.
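Pipes are the payoff: small tools composed with no glue code. A tiny sketch (the repo path is just for illustration):

  # fetch open issues as JSON, pull out just the titles, count them
  $ curl -s https://api.github.com/repos/laude-institute/terminal-bench/issues |
      jq -r '.[].title' |
      wc -l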
I’m excited to share that I started a new role today as the first Research Partner at @LaudeInstitute and @LaudeVentures! My first interaction with Laude Ventures was over a year ago. As a fellow researcher-turned-founder of @SnorkelAI out of @StanfordAILab, I had a lot of…
Terminal Bench is a cool benchmark I just came across! CLI SWE agents must complete tasks like:
- Build Linux kernel
- Configure git server
- Train an ML model
Take-away: Claude 4 models are GOATed (the lead Warp model is a combo of sonnet and opus).
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI.
We're releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores - Frontier AI: 0%, Humans: 100%
Terminal-Bench and Warp featured in TechCrunch today 🚀 Agents operating computers using terminals is becoming a powerful paradigm. Btw, we have something exciting to share this week so stay tuned :)
Terminal-Bench and @warpdotdev @zachlloydtweets in TechCrunch today :) (link in replies) I firmly believe that the future of LLM-Computer interaction is through something that looks like a terminal interface. Great to see this picking up steam.
Congrats to the OpenHands team!
OpenHands is live on Terminal-Bench and gets 41.3% with claude-4-sonnet, 6 points better than Claude Code! If you want to use an agent that can use the terminal, in your terminal -- try out the OpenHands CLI.
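Kicking the tires is roughly a two-liner — the package and entry-point names here are assumed from the PyPI project, so double-check the OpenHands docs:

  $ pip install openhands-ai   # the OpenHands Python package (assumed name)
  $ openhands                  # launch the interactive CLI in your terminal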
It's great to see Terminal-Bench on the Kimi K2 model card. We love open-source models, and we just made it even easier to test them by adding better support for local models to our harness through LiteLLM.
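In practice, routing through LiteLLM means any LiteLLM model string should work, including ones pointed at a local server. A sketch — LiteLLM's "ollama/..." prefix is its standard convention, but the `tb run` flags and the model tag are assumptions, not the documented interface:

  $ ollama serve &   # serve a local model
  # route the harness through LiteLLM to the local Ollama server
  $ tb run --dataset terminal-bench-core==head \
      --agent terminus --model ollama/qwen2.5-coder:7b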
Congrats to the Daytona team, we’ve already started integrating them into Terminal-Bench!
BREAKING: @daytonaio just became the fastest-growing infrastructure company in history. $0 → $1M ARR in 2 months. Faster than Stripe. Faster than Vercel. Faster than AWS. Yes, really. 🧵👇