Mert Ünsal
@mertunsal2020
Leading agents @browser_use, prev. RL Research at @ProjectNumina @ETH_en
still curious as to how @GoogleDeepMind and @OpenAI scale parallel inference while keeping the objective pass@1 for gold in IMO. sampling many times is easy, but having the samples interact is hard. the idea should surface in common research some time soon, similar to RLVR
Running last evals 👀
Would you like to have a @browser_use mode that runs 2x faster? Shipping a flash mode soon ⚡️
Some lessons we learned in @browser_use
If you build AI agents, 1 step consumes 10k input tokens with fewer than 250 output tokens. So if you optimize for
- speed: reduce output tokens
- cost: reduce input tokens
- reliability: don't just put more into context (performance can drop), but build clever systems around…
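A rough back-of-the-envelope sketch of why the speed and cost levers differ. Only the 10k input / 250 output token figures come from the post; the prices and throughput numbers below are made-up placeholders, so swap in your own provider's values. The one general fact relied on is that input tokens are processed in a parallel prefill pass, while output tokens are decoded one at a time.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    input_tokens: int = 10_000  # per-step figure from the post
    output_tokens: int = 250    # upper bound from the post


# ASSUMED values, for illustration only (not real provider rates).
PRICE_IN_PER_M = 1.00        # USD per 1M input tokens
PRICE_OUT_PER_M = 4.00       # USD per 1M output tokens
PREFILL_TOK_PER_S = 5_000.0  # input tokens processed per second
DECODE_TOK_PER_S = 100.0     # output tokens generated per second


def step_cost_usd(step: StepProfile) -> float:
    """Cost of one agent step under the assumed prices."""
    return (step.input_tokens * PRICE_IN_PER_M
            + step.output_tokens * PRICE_OUT_PER_M) / 1_000_000


def step_latency_s(step: StepProfile) -> float:
    """Latency of one agent step: prefill time plus decode time."""
    return (step.input_tokens / PREFILL_TOK_PER_S
            + step.output_tokens / DECODE_TOK_PER_S)


def input_cost_share(step: StepProfile) -> float:
    """Fraction of the step's cost that comes from input tokens."""
    return step.input_tokens * PRICE_IN_PER_M / 1_000_000 / step_cost_usd(step)


def decode_latency_share(step: StepProfile) -> float:
    """Fraction of the step's latency spent generating output tokens."""
    return (step.output_tokens / DECODE_TOK_PER_S) / step_latency_s(step)


if __name__ == "__main__":
    step = StepProfile()
    print(f"cost/step: ${step_cost_usd(step):.4f}, input share: {input_cost_share(step):.0%}")
    print(f"latency/step: {step_latency_s(step):.2f}s, decode share: {decode_latency_share(step):.0%}")
```

Under these placeholder numbers, input tokens dominate the bill while the 250 output tokens account for more than half of the wall-clock time, which matches the split in the list above. With different rates the exact shares will shift.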
I think OpenAI is speculating a 300M offer as an anchor, so that if you get 5M, you don’t feel that good anymore
My CEO sends me his requests on X directly and we discuss them publicly. #buildinpublic

We compared Kimi K2 from @GroqInc with o3 from @OpenAI on @browser_use (K2 on top). K2 is lightning fast on @GroqInc ⚡️⚡️⚡️
healthy working environment is when CEO is massaging the employees @browser_use
10 years from now everyone will realize how overrated “human intelligence” is
Eric Weinstein says we're more or less LLMs. Most of life runs on script: greetings, replies, small talk, all on pre-trained loops. That's why they mimic us so well. “We don't realize that intelligence is a last resort for us.”
pretty interesting idea to
1. run model
2. extract what went wrong as a lesson
3. add to system prompt the lesson on what went wrong to fix it
4. if fixed, finetune on it
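The four steps above could be sketched roughly like this. Everything here is hypothetical scaffolding: `run_model`, `check`, and `extract_lesson` are placeholder callables you would supply yourself (for example an LLM call, a verifier, and a second LLM call), and step 4 only collects the fixed example rather than running any actual finetuning.

```python
from typing import Callable, List, Optional, Tuple


def lesson_loop(
    task: str,
    run_model: Callable[[str, str], str],       # (system_prompt, task) -> output
    check: Callable[[str, str], bool],          # (task, output) -> did it succeed?
    extract_lesson: Callable[[str, str], str],  # (task, failed_output) -> lesson text
    system_prompt: str = "",
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str], Optional[Tuple[str, str]]]:
    """Run, extract a lesson on failure, retry with the lesson in the prompt.

    Returns (final_output, lessons, finetune_example). The finetune example is
    only set when a failure was fixed by an added lesson (step 4).
    """
    lessons: List[str] = []
    for _ in range(max_rounds):
        # Step 3: append every lesson learned so far to the system prompt.
        prompt = system_prompt + "".join(f"\nLesson: {l}" for l in lessons)
        # Step 1: run the model.
        output = run_model(prompt, task)
        if check(task, output):
            # Step 4: keep the example only if a lesson was needed to fix it.
            example = (task, output) if lessons else None
            return output, lessons, example
        # Step 2: turn the failure into a lesson.
        lessons.append(extract_lesson(task, output))
    return None, lessons, None
```

Collecting `(task, output)` pairs instead of finetuning inline keeps the loop cheap to run; the pairs can be batched into a training set later.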
Some of our latest work in @ProjectNumina! 500-line-long proofs are pretty cool to see :)
Happy to introduce Kimina-Prover-72B! Reaching 92.2% on miniF2F using test-time RL. It can solve IMO problems using more than 500 lines of Lean 4 code! Check our blog post here: huggingface.co/blog/AI-MO/kim… And play with our demo! demo.projectnumina.ai
Hello World! 👋 We're thrilled to officially launch the X account for Numina, dedicated to advancing frontier AI in mathematics. Stay tuned for updates on our research, achievements, and the future of mathematical AI! #AI4Math #FormalMath #LeanProver #AutomatedReasoning…
Unexpectedly GPT-4.5 is particularly better at creative discussions - I wonder what kind of post-training went into that
Funny thing about AI research: the more you bring in your “smart human intuition,” the less your approach scales. Just let the machines do their thing.