Atticus Wang
@atticuswzf
MIT 26; To create a little flower is the labour of ages.
On IMO P6 (without going into too much detail about our setup), the model "knew" it didn't have a correct solution. The model knowing when it didn't know was one of the early signs of life that made us excited about the underlying research direction!
One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it? 🧵
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
Anthropic staff realized they could ask Claude to buy things that weren’t just food & drink. After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) “specialty metal items” that it ended up selling at a loss.
Arc browser won't open after installing the new update. Anyone having a similar problem? @browsercompany @BrowserCoStatus
We are excited to announce Trinity, an autoformalization system for verified superintelligence that we have developed at @morph_labs. We have used it to automatically formalize in Lean a classical result of de Bruijn that the abc conjecture is true almost always.…
AIs referring to their humans as their nonbinary partners
*Horizon reduction* was the only technique we found that substantially improved scaling. Even simple tricks like n-step returns improved *asymptotic* performance (so it's not merely a trick to speed up training!). Full hierarchical RL scaled even better.
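For readers unfamiliar with the n-step trick mentioned above, here is a minimal sketch of computing n-step returns. All names (`rewards`, `values`, `n_step_returns`) are illustrative, not from the paper: the idea is to sum n discounted rewards and then bootstrap from a learned value estimate, shortening the effective horizon the value function must model.

```python
def n_step_returns(rewards, values, n, gamma=0.99):
    """Sketch of n-step returns: G_t = sum_{k<n} gamma^k * r_{t+k} + gamma^n * V(s_{t+n}).

    rewards[t] is the reward at step t; values[t] is a value estimate V(s_t).
    Bootstrapping from V after n steps reduces the horizon the critic must fit.
    """
    T = len(rewards)
    returns = []
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(min(n, T - t)):
            G += discount * rewards[t + k]   # accumulate discounted rewards
            discount *= gamma
        if t + n < T:
            G += discount * values[t + n]    # bootstrap from the value estimate
        returns.append(G)
    return returns
```

With n=1 this reduces to the usual one-step TD target; larger n trades variance for a shorter effective horizon.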
Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉 We’ll be in Nashville next week. Come say hi 👋 @CVPR @miv_cvpr2025
@kaivu, @atticuswzf, and I were researching long horizon reasoning (with @jacobandreas). We found existing benchmarks’ hard problems often featured tricky puzzles, not tests of system understanding. So we made Breakpoint: a SWE benchmark designed to disambiguate this capability.
Why do I keep getting rejected from accessing Llama models on huggingface?
Terence Tao is selling 5,000 bars of soap that contain drops of his own chalk dust
Sydney Sweeney is selling 5,000 bars of soap that contain drops of her own bathwater.
Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long…
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers for the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
stop saying "millions" and "billions" and start calling them "megadollars" and "gigadollars"