Anastasios Nikolas Angelopoulos
@ml_angelopoulos
Building LMArena. Black-box statistics, model evaluation. Formerly @Berkeley_EECS, @stanford_ee, student researcher @GoogleDeepMind.
🚨 New Textbook on Conformal Prediction 🚨 arxiv.org/abs/2411.11824 “The goal of this book is to teach the reader about the fundamental technical arguments that arise when researching conformal prediction and related questions in distribution-free inference. Many of these…

It's genuinely mind-boggling how good models are getting at one-shotting complex visualizations from simple prompts Prompt: "two black holes colliding animation" This model perfectly implemented: – 2-body gravity simulation – Dynamic particle accretion disks – Collision +…
We've been busy lately: new arenas, new models, and new methodologies! So we've created a changelog page where you can track all the updates we make to the leaderboards. In addition to the new Search Arena, and new models like the latest Imagen 4, Grok 4, Kimi K2, Seedream 3 and…
We updated our Imagen 4 models and Ultra is tied for #1 on the lmarena leaderboard! The models are available in Google AI Studio and the Gemini API - try them out and let us know what you think.
Exciting Text-to-Image leaderboard update! Two new Imagen 4.0 models from @GoogleDeepMind just dropped: 🥇 Imagen 4.0 Ultra (v2) ties at #1 with @OpenAI’s GPT-Image-1 🥉 Imagen 4.0 (v2) lands strong at #3 Congrats to the Google Imagen team!
🚨 Model Update: Qwen3-coder is in the WebDev Arena! @Alibaba_Qwen have released their best coding model to date and it's now live in WebDev Arena awaiting your hardest prompts for real world testing. Prompt: "style a basic login form using Tailwind CSS with dark mode…
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Come see which models are the best at search! We re-launched on the new UI :)
🚨 BIG NEWS 🚨 Search Arena is live with 7 top models with search capabilities ready for testing. Be sure to have the "Search" modality selected in the chat box, and get testing. 🌐 @xAi: Grok 4 @anthropic: Claude Opus 4 @perplexity: Sonar Pro High & Reasoning Pro High…
This is amazing @Alibaba_Qwen !!
Qwen3-Coder is now live on WebDev Arena Prompt: “bouncing ball in rotating hypercube” It one-shotted the visualization, with controls for rotation and ball speed included. Kinda crazy
Advanced version of Gemini Deep Think (announced at #GoogleIO) using parallel inference time computation achieved gold-medal performance at IMO, solving 5/6 problems with rigorous proofs as verified by official IMO judges! Congrats to all involved! deepmind.google/discover/blog/…
🧵Top 10 Open Models by Provider Though proprietary models often top the charts, open models are also paired in battle mode, and ranked on our public leaderboards. Here are the top 10 when stacked by top open model by provider. - #1 Kimi K2 (Modified MIT) @Kimi_Moonshot - #2…
it's actually BONKERS that Moonshot, a company no one had even heard of a week ago, is absolutely mogging the likes of Anthropic, DeepSeek, and Meta 🤯 AGI really could arise from anywhere at any time 👀
🚨 BREAKING: @Kimi_Moonshot’s Kimi-K2 is now the #1 open model in the Arena! With over 3K community votes, it ranks #5 overall, overtaking DeepSeek as the top open model. Huge congrats to the Moonshot team on this impressive milestone! The leaderboard now features 7 different…
Kimi-K2 by @Kimi_Moonshot is now the #1 open model in the world. The score is a bit below that of the recent Grok-4 API release, and a bit above that of Deepseek R1 (May). After that comes Qwen 3, Deepseek-v3 (March), Deepseek R1, Mistral Medium, Minimax M1. Very…
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
I’m at #ICML2025 this week presenting my work on multiaccuracy and multicalibration with proxy sensitive attributes. If you are interested, please come by poster E-1101 on Tuesday at 4:30 pm PST to learn more! @Jere_je_je @PaulYiMD icml.cc/virtual/2025/p…
Excited to introduce our new work at ICML 2025: 1. Conformal Risk Control for LLM Alignment, arxiv.org/pdf/2502.20285 with @lihua_lei_stat 2. Auto-Eval for Quantile-Based Risk Measures arxiv.org/pdf/2507.05220 with @zemelgroup @SquareZollo Please take a look if interested!
Thoughts on Grok 4 results in LMArena: Grok's API model is tied for #3 overall with style control (remember, style control is now the default in LMArena). Without style control, it's #2 overall. In Math, its preliminary ranking is tied for #1, along with Minimax-M1, Gemini-2.5-pro, and…
🚨 Breaking News: Grok 4's result is now live! With 4k+ community votes, xAI’s Grok-4 tied for #3 overall in Text Arena — a huge leap from Grok-3. It scores Top-3 across all categories (#1 in Math, #2 in Coding, #3 in Hard Prompts). Detailed analysis in the thread 🧵
Extremely excited to announce that I've joined @lmarena_ai! For years I've been working on LLMs for my job and hacking on rankings and ratings for fun, so I'm beyond thrilled to join this project at the intersection!