Clayton Thorrez

@cthorrez

Rating systems and paired comparison experimentation enjoyer @lmarena_ai

Joined March 2016

2K Following

1K Followers

Been a fun first 2 weeks :)

lmarena.ai@lmarena_ai · Jul 25

We've been busy lately: new arenas, new models, and new methodologies! So we've created a changelog page where you can track all the updates we make to the leaderboards. In addition to the new Search Arena, and new models like the latest Imagen 4, Grok 4, Kimi K2, Seedream 3 and…

645

Clayton Thorrez@cthorrez · Jul 25

if you have two numpy arrays with the same data, and one is c contiguous and one is fortran contiguous, and you use them as inputs to scipy lbfgs, you can get pretty nontrivial differences in results. like 1e-3 and 1e-2, not like 1e-9 differences 😢

153
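A minimal sketch of how one might check this, assuming a small logistic regression objective (the objective, data shapes, and names such as `logistic_loss_and_grad` are illustrative, not from the original post). It runs `scipy.optimize.minimize` with `method="L-BFGS-B"` on a C-contiguous and a Fortran-contiguous copy of the same matrix and reports the largest gap between the two solutions. The size of the gap, if any, will likely depend on the problem's conditioning, the solver tolerances, and the underlying BLAS, so this toy problem may show a much smaller difference than the one described above.

```python
import numpy as np
from scipy.optimize import minimize


def logistic_loss_and_grad(w, X, y):
    """Mean logistic loss and gradient for labels y in {0, 1} (illustrative objective)."""
    z = X @ w
    # log(1 + exp(z)) computed stably via logaddexp
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    # numerically stable sigmoid: 1 / (1 + exp(-z))
    p = np.exp(-np.logaddexp(0.0, -z))
    grad = X.T @ (p - y) / len(y)
    return loss, grad


rng = np.random.default_rng(0)
X_c = np.ascontiguousarray(rng.normal(size=(500, 20)))  # row-major (C) layout
X_f = np.asfortranarray(X_c)                            # same values, column-major layout
y = (rng.random(500) < 0.5).astype(float)
w0 = np.zeros(X_c.shape[1])

# Same data, same start point, same solver; only the memory layout differs.
res_c = minimize(logistic_loss_and_grad, w0, args=(X_c, y), jac=True, method="L-BFGS-B")
res_f = minimize(logistic_loss_and_grad, w0, args=(X_f, y), jac=True, method="L-BFGS-B")

max_abs_diff = np.max(np.abs(res_c.x - res_f.x))
print("same values:", np.array_equal(X_c, X_f))
print("max |w_c - w_f|:", max_abs_diff)
```

Any difference comes from floating-point operations being accumulated in a different order for the two layouts; those tiny discrepancies can then be amplified over many solver iterations and line searches.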

Clayton Thorrez@cthorrez · Jul 24

wait, whether your numpy arrays with the same data are c contiguous or fortran contiguous can actually impact the results of scipy.lbfgs by a nontrivial amount 🤯

121

Clayton Thorrez@cthorrez · Jul 24

S tier coding playlist

Clayton Thorrez Retweeted

lmarena.ai@lmarena_ai · Jul 23

🚨 BIG NEWS 🚨 Search Arena is live with 7 top models with search capabilities ready for testing. Be sure to have the "Search" modality selected in the chat box, and get testing. 🌐 @xAi: Grok 4 @anthropic: Claude Opus 4 @perplexity: Sonar Pro High & Reasoning Pro High…

491

143

57.0K

Clayton Thorrez@cthorrez · Jul 22

Qwen3-Coder is now live on WebDev Arena. Prompt: “bouncing ball in rotating hypercube” It one-shotted the visualization, with controls for rotation and ball speed included. Kinda crazy

Qwen@Alibaba_Qwen · Jul 22

>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…

7.0K

Clayton Thorrez@cthorrez · Jul 19

A story in 3 parts: :D

179

12.0K

Clayton Thorrez Retweeted

lmarena.ai@lmarena_ai · Jul 15

🚨 Breaking News: Grok 4's result is now live! With 4k+ community votes, xAI’s Grok-4 tied for #3 overall in Text Arena — a huge leap from Grok-3. It scores Top-3 across all categories (#1 in Math, #2 in Coding, #3 in Hard Prompts). Detailed analysis in the thread 🧵

177

2.0K

384

481.0K

Clayton Thorrez@cthorrez · Jul 10

I wonder how many billions in cloud costs are spent each year on for loops over pandas data frames and NumPy arrays that can be trivially vectorized. It's super disappointing to see sota LLMs still suggesting this stuff due to how prevalent it is in training data

247
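To make the point concrete, here is a small sketch of the pattern, using made-up `prices` and `quantities` arrays (both hypothetical). The explicit loop and the vectorized expression compute the same result, but the vectorized form hands the whole operation to NumPy's compiled code in a single call.

```python
import numpy as np

# Hypothetical example data.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Loop version: one Python-level iteration per element.
totals_loop = np.empty(len(prices))
for i in range(len(prices)):
    totals_loop[i] = prices[i] * quantities[i]

# Vectorized version: a single elementwise multiply over the whole array.
totals_vec = prices * quantities

print(np.array_equal(totals_loop, totals_vec))
```

The same idea carries over to pandas, where column arithmetic such as multiplying two columns directly usually replaces a row-by-row loop.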

Clayton Thorrez@cthorrez · Jul 10

There's a perverse incentive I hate with twitch. If I'm watching a stream and it just crashes or freezes, I have to refresh or click on the stream pause play to get it to resume. This results in an ad roll 100% of the time, as if I had just tuned in. Buggy product = more $ >:(

188

Clayton Thorrez@cthorrez · Jul 10

And then a fun postseason exercise is to take the negative log of each of the probabilities for the true outcome and take the mean of that and compete with all your friends to see who can get the lowest number 🤓

parker fleming@statsowar · Jul 30

A fun preseason exercise that even the most casual fan will benefit from: Grab a pencil and paper. Write down your favorite team's schedule for the season. Assign a value 0-1 for every game, the probability that your team wins. Add those values up and there's your reasonable…

323
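Both exercises can be sketched in a few lines, assuming a hypothetical four-game schedule (the probabilities and results below are invented for illustration). Summing the win probabilities gives the expected number of wins, and the mean negative log of the probability assigned to what actually happened is the score to minimize, commonly called log loss.

```python
import math


def expected_wins(win_probs):
    """Preseason: sum of per-game win probabilities."""
    return sum(win_probs)


def mean_neg_log_likelihood(win_probs, outcomes, eps=1e-12):
    """Postseason: mean of -log(probability assigned to the true outcome)."""
    total = 0.0
    for p, won in zip(win_probs, outcomes):
        # Probability the forecast gave to what actually happened.
        p_true = p if won else 1.0 - p
        # Clamp so a confident miss (p_true == 0) does not raise on log(0).
        total += -math.log(max(p_true, eps))
    return total / len(win_probs)


# Hypothetical schedule: forecast win probabilities and actual results.
win_probs = [0.9, 0.6, 0.5, 0.25]
outcomes = [True, True, False, False]

print("expected wins:", expected_wins(win_probs))
print("score (lower is better):", mean_neg_log_likelihood(win_probs, outcomes))
```

For reference, a forecast of 0.5 for every game always scores ln 2 (about 0.693), so beating that number means the picks carried some real information.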