Shalev Lifshitz
@Shalev_lif
do androids dream of electric sheep? @ something new, previously @UofT @VectorInst
Hot off the Servers 🔥💻 --- we’ve found a new approach for scaling test-time compute! Multi-Agent Verification (MAV) scales the number of verifier models at test-time, which boosts LLM performance without any additional training. Now we can scale along two dimensions: by…
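For intuition, here's a minimal sketch of the MAV idea as I understand it (my own illustrative code, not the paper's implementation): sample several candidate answers, have multiple independent verifier models judge each one, and keep the candidate with the most approvals. The generator/verifier callables, the prompts, and the simple vote-count aggregation below are all assumptions for illustration.

```python
# Illustrative sketch of multi-agent-verification-style selection at test time.
# NOT the paper's implementation: generate_fn / verifier_fns are user-supplied
# LLM callables, and the YES/NO vote-count aggregation is an assumption.
from collections import Counter

def generate_candidates(generate_fn, prompt, n=4):
    """Sample n candidate answers from a generator LLM."""
    return [generate_fn(prompt) for _ in range(n)]

def approve(verifier_fn, prompt, candidate):
    """Ask one verifier model for a binary approve/reject judgment."""
    judgment = verifier_fn(
        f"Question: {prompt}\nProposed answer: {candidate}\n"
        "Is this answer correct? Reply YES or NO."
    )
    return "YES" in judgment.upper()

def select_with_multi_agent_verification(prompt, generate_fn, verifier_fns, n=4):
    """Return the candidate approved by the most verifiers."""
    candidates = generate_candidates(generate_fn, prompt, n)
    approvals = Counter()
    for cand in candidates:
        approvals[cand] += sum(approve(v, prompt, cand) for v in verifier_fns)
    return max(candidates, key=lambda c: approvals[c])
```

In this sketch, scaling test-time compute means growing either the number of candidates or the list of verifiers, with no additional training involved.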

Bonnie is awesome! Join her team!
Our team @GoogleDeepMind is hiring! Join a team of world-class researchers working on open-ended self-improvement! 🔥
I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists and Research Engineers to help us push the frontier of autonomously discovering novel artifacts such as new knowledge, capabilities, or algorithms, in an…
These researchers found that 30% of chem and bio questions on the “Humanity’s Last Exam” benchmark had ground-truth answers that contradicted peer-reviewed papers! Important work by @FutureHouseSF.
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
550k GB200s and GB300s 🤯🤯
230k GPUs, including 30k GB200s, are operational for training Grok @xAI in a single supercluster called Colossus 1 (inference is done by our cloud providers). At Colossus 2, the first batch of 550k GB200s & GB300s, also for training, starts going online in a few weeks. As Jensen…
Qwen3-Coder is out and open-source. Basically on the level of Claude 4 Sonnet on coding tasks!
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
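If you want to poke at the open weights, here's a rough sketch of loading the checkpoint with Hugging Face transformers. The repo id, chat-template call, and generation settings are assumptions on my part, and a 480B-parameter MoE realistically needs a multi-GPU, quantized, or hosted setup rather than a single box.

```python
# Rough usage sketch for the open Qwen3-Coder release via transformers.
# The repo id below is an assumed Hugging Face identifier based on the
# announced model name, not a confirmed path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Coder-480B-A35B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards across available GPUs; a model this size needs
# a multi-GPU node or a quantized / hosted deployment in practice.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```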
Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵
Important to recognize the amazing humans who participated in the IMO! Warren Bei scored a *perfect* score; he's starting at MIT in the fall (big win for MIT).
Everyone's talking about AI performance on the IMO. Let me highlight 🇨🇦Canadian 11th grader Warren Bei🇨🇦, one of five participants with a *perfect* 42/42. This is his *fifth* (and final) IMO representing Canada, with three golds and two silvers. (➡️ MIT undergrad in the fall)
Official gold-medal performance with Gemini at IMO. Massive congrats to the team! “This year, we were amongst an inaugural cohort to have our model results officially graded and certified by IMO coordinators using the same criteria as for student solutions.”
Advanced version of Gemini Deep Think (announced at #GoogleIO) using parallel inference time computation achieved gold-medal performance at IMO, solving 5/6 problems with rigorous proofs as verified by official IMO judges! Congrats to all involved! deepmind.google/discover/blog/…
This is actually genius. @nvidia please do this!
nvidia could do the most viral ai competition in history: start with 10,000 researchers and give each a free gpu to work on a public leaderboard but do rounds of elimination where the winners take the remaining hardware. the final winner gets all the gpus for a year.
AMD is cool
Someone on LinkedIn posted about cool theoretical research that he wants to test, and someone from AMD just offered to give him the compute 😍
Grok 4 Heavy w/ Python + Internet + Test-Time Compute reaches 50.7% on Humanity's Last Exam. Even with all those +'s, this really is wild.

While other companies are now releasing sooner after OpenAI and closing the gap, I'm honestly still surprised at OpenAI's ability to consistently lead with impactful product releases.
PSA
I gain a lot of mental clarity and peace from not bringing my phone: 1. In the bedroom for sleep, 2. For meals, coffee, or a snack with friends close to home/work. Both are very easy and worth trying.
While we don’t yet have all the details, the most impressive part of OpenAI’s achievement is not the gold medal itself but the fact that it was achieved without a specialized formal logic system. AlphaProof scored a silver medal last year, but it used Lean. Awaiting more details to…
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
The world can change in just 7 hours
How it started (10 hours ago) —how it’s going (3 hours ago) 😅
Zuck discovered the infinite money glitch
Heard Zuck poached 4 more OpenAI researchers, including some behind the open-source model. How deep are Zuck’s pockets?
ChatGPT Agent is the first model we classified as "High" capability for biorisk. Some might think that biorisk is not real, and that models only provide information that could be found via search. That may have been true in 2024 but is definitely not true today. Based on our…
We’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biology & chemistry under our Preparedness Framework. Here’s why that matters–and what we’re doing to keep it safe. 🧵