andrea panizza
@unsorsodicorda
Data Scientist, aerospace engineer, trekking & comics lover, applying #MachineLearning #DeepLearning Statistics to Industrial Applications.
Quick thread on the recent IMO results and the relationship between symbol manipulation, reasoning, and intelligence in machines and humans:
🚀Introducing Intern-S1, our most advanced open-source multimodal reasoning model yet! 🥳Strong general-task capabilities + SOTA performance on scientific tasks, rivaling leading closed-source commercial models. 🥰Built upon a 235B MoE language model and a 6B Vision encoder.…
Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄 huggingface.co/papers/2507.18…
We are excited to announce that @shengjia_zhao will be the Chief Scientist of Meta Superintelligence Labs! Shengjia is a brilliant scientist who most recently pioneered a new scaling paradigm in his research. He will lead our scientific direction for our team. Let's go 🚀
I am very excited to take up the role of chief scientist for meta super-intelligence labs. Looking forward to building asi and aligning it to empower people with the amazing team here. Let’s build!
Anthropic could be bankrupted within the next few months, thanks to last week's barely covered legal ruling, which exposes the AI startup to billions to hundreds of billions in damages for its use of pirated, copyright-protected works.
🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…
Beautiful work. Just a little sad that no university from the country where Latin literally originated, with the second largest epigraphics db, and the most Latin epigraphs in general, collaborated with Google Deepmind on this super-cool project
Our Aeneas AI model gives historians valuable new insights into ancient inscriptions & ancient history that may have taken years to uncover otherwise. Published in @Nature today: deepmind.google/discover/blog/…
I'm notorious for turning down 99% of the hundreds of requests every months to join calls (because I hate calls!). The @huggingface team saw an opportunity and bullied me in accepting to do a zoom call with users who upgrade to pro. I only caved under one strict condition:…
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
I'm surprised I've seen exactly zero tweets about Google Agentspace, Google's B2B agentic framework cloud.google.com/products/agent…
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Qwen3-Coder is now available in Cline 🧵 New 480B parameter model with 35B active parameters. > 256K context window > comparable performance on SWE-bench to Claude Sonnet 4 > SoTA among open source models
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
SWE-bench Verified is the gold standard for evaluating coding agents: 500 real-world issues + tests by OpenAI. Sounds bullet-proof? Not quite. We show passing its unit tests != matching ground truth. In our ACL paper, we fixed buggy evals: 24% of agents moved up or down the…
Lovely to see the impressive performance of the Seed Prover developed by the ByteDance Seed team at IMO 2025 — achieving a silver-level score (30 out of 42) within three days, and reaching (35 out of 42) with extended compute time. leanprover.zulipchat.com/#narrow/channe…
Everyone's talking about AI performance on the IMO. Let me highlight 🇨🇦Canadian 11th grader Warren Bei🇨🇦, one of five participants with a *perfect* 42/42. This is his *fifth* (and final) IMO representing Canada, with three golds and two silvers. (➡️ MIT undergrad in the fall)
The environmental footprint of training Mistral Large 2: as of January 2025, and after 18 months of usage, Large 2 generated the following impacts: - 20,4 ktCO₂e, - 281 000 m3 of water consumed, - and 660 kg Sb eq (standard unit for resource depletion). The marginal impacts of…
Interesting approach! However, we looked at the proofs and methodology and we found a few problems, specifically with the use of hints given to the model. While the scaffold indeed improves performance, it does not solve all problems accurately and would not get a gold medal.🧵
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025