Clayton Thorrez
@cthorrez
Rating systems and paired comparison experimentation enjoyer @lmarena_ai
Been a fun first 2 weeks :)
We've been busy lately: new arenas, new models, and new methodologies! So we've created a changelog page where you can track all the updates we make to the leaderboards. In addition to the new Search Arena, and new models like the latest Imagen 4, Grok 4, Kimi K2, Seedream 3 and…
if you have two numpy arrays with the same data, and one is c contiguous and one is fortran contiguous, and you use them as inputs to scipy lbfgs, you can get pretty nontrivial differences in results. like 1e-3 or 1e-2, not like 1e-9 differences 😢
wait, whether your numpy arrays with the same data are c contiguous or fortran contiguous can actually impact the results of scipy.lbfgs by a nontrivial amount 🤯
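A minimal sketch of how one might check this, assuming the posts refer to `scipy.optimize.minimize` with `method="L-BFGS-B"`. The logistic-regression objective, the random data, and the helper names `loss_and_grad` and `fit` are illustrative assumptions, not the author's setup. The size of any gap depends on the problem, the BLAS build, and the solver tolerances, so this only prints the difference and makes no claim about its magnitude.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_c = np.ascontiguousarray(rng.normal(size=(200, 5)))  # row-major (C order)
X_f = np.asfortranarray(X_c)                           # column-major, same values
y = (rng.random(200) < 0.5).astype(float)              # hypothetical binary labels


def loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient with respect to w."""
    z = X @ w
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / len(y)
    return loss, grad


def fit(X):
    """Run L-BFGS-B from a zero start and return the fitted weights."""
    res = minimize(loss_and_grad, np.zeros(X.shape[1]), args=(X, y),
                   jac=True, method="L-BFGS-B")
    return res.x


w_c = fit(X_c)
w_f = fit(X_f)

print("identical values:", np.array_equal(X_c, X_f))
print("layouts:", X_c.flags["C_CONTIGUOUS"], X_f.flags["F_CONTIGUOUS"])
print("max abs difference in fitted weights:", np.max(np.abs(w_c - w_f)))
```

The two arrays compare equal element by element, but matrix products over them can accumulate floating-point sums in a different order, and an iterative optimizer may amplify those tiny discrepancies across steps. Calling `np.ascontiguousarray` on inputs before optimizing is one way to keep runs comparable.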
🚨 BIG NEWS 🚨 Search Arena is live with 7 top models with search capabilities ready for testing. Be sure to have the "Search" modality selected in the chat box, and get testing. 🌐 @xAi: Grok 4 @anthropic: Claude Opus 4 @perplexity: Sonar Pro High & Reasoning Pro High…
Qwen3-Coder is now live on WebDev Arena. Prompt: “bouncing ball in rotating hypercube”. It one-shotted the visualization, with controls for rotation and ball speed included. Kinda crazy
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
🚨 Breaking News: Grok 4's result is now live! With 4k+ community votes, xAI’s Grok-4 tied for #3 overall in Text Arena — a huge leap from Grok-3. It scores Top-3 across all categories (#1 in Math, #2 in Coding, #3 in Hard Prompts). Detailed analysis in the thread 🧵
I wonder how many billions in cloud costs are spent each year on for loops over pandas data frames and NumPy arrays that can be trivially vectorized. It's super disappointing to see sota LLMs still suggesting this stuff due to how prevalent it is in training data
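For a concrete picture of the pattern being described, here is a small sketch with made-up arrays (`prices` and `qty` are hypothetical names): an element-by-element Python loop next to the equivalent single NumPy call.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])  # made-up example data
qty = np.array([1, 2, 3])

# Loop version: one Python-level iteration per element.
total_loop = 0.0
for i in range(len(prices)):
    total_loop += prices[i] * qty[i]

# Vectorized version: the multiply-and-sum runs inside NumPy's compiled code.
total_vec = float(np.dot(prices, qty))

print(total_loop, total_vec)
```

Both give the same total here. On large arrays the vectorized form avoids the per-element interpreter overhead, which is where the wasted compute comes from.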
There's a perverse incentive I hate with twitch. If I'm watching a stream and it just crashes or freezes, I have to refresh or click on the stream pause play to get it to resume. This 100% of the time results in an ad roll as if I had just tuned in. Buggy product = more $ >:(
And then a fun postseason exercise is to take the negative log of each of the probabilities for the true outcome, take the mean of that, and compete with all your friends to see who can get the lowest number 🤓
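That scoring rule is the mean negative log-likelihood (often called log loss). A minimal sketch, with the function name `mean_neg_log_likelihood` and the sample picks being illustrative assumptions:

```python
import math


def mean_neg_log_likelihood(win_probs, outcomes):
    """Average of -log(probability assigned to what actually happened).

    win_probs: predicted probability that the team wins each game.
    outcomes:  1 if the team won that game, 0 if it lost.
    Note: a prediction of exactly 0 for the true outcome makes math.log
    raise ValueError, so never write down a flat 0 or 1.
    """
    total = 0.0
    for p, won in zip(win_probs, outcomes):
        p_true = p if won == 1 else 1.0 - p
        total += -math.log(p_true)
    return total / len(win_probs)


# Hypothetical three-game season: predicted win chances and actual results.
score = mean_neg_log_likelihood([0.8, 0.6, 0.3], [1, 0, 1])
print(score)
```

Lower is better. Always answering 0.5 scores `log(2)` (about 0.693), so beating that is a reasonable first goal.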
A fun preseason exercise that even the most casual fan will benefit from: Grab a pencil and paper. Write down your favorite team's schedule for the season. Assign a value 0-1 for every game, the probability that your team wins. Add those values up and there's your reasonable…
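The pencil-and-paper steps can also be sketched in a few lines. The schedule and probabilities below are made up for illustration. By linearity of expectation, the sum of the per-game win probabilities is the expected number of wins under those estimates.

```python
import math

# Hypothetical schedule: opponent -> your estimated probability of winning.
schedule = {
    "Week 1 opponent": 0.9,
    "Week 2 opponent": 0.5,
    "Week 3 opponent": 0.25,
    "Week 4 opponent": 0.75,
}

# Every value should be a valid probability before summing.
assert all(0.0 <= p <= 1.0 for p in schedule.values())

expected_wins = math.fsum(schedule.values())
print(f"Expected wins over {len(schedule)} games: {expected_wins:.2f}")
```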