Tyler Romero
@tyleraromero
http://tylerromero.com; language modeling research and engineering @allen_ai
Wrote a short post on reducing memory usage in RLHF post-training (PPO/GRPO) by optimizing log probability computations. Includes implementation details for selective log-softmax, with benchmarks and code. I recently contributed this optimization to TRL, OpenRLHF, and Verl.
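The idea behind selective log-softmax can be sketched in a few lines. This is a minimal NumPy illustration of the underlying identity only (the actual contributions are PyTorch, and the function name here is hypothetical): gather each label token's logit and subtract a per-position logsumexp, instead of materializing the full (batch, seq, vocab) log-softmax and then gathering one entry per position.

```python
import numpy as np

def selective_log_softmax(logits, labels):
    """Per-token log-probs without storing the full log-softmax output.

    logits: (batch, seq, vocab) scores; labels: (batch, seq) token ids.
    Uses log p(label) = logit[label] - logsumexp(logits), so only
    (batch, seq)-shaped results survive the reduction.
    """
    # Gather the logit of each label token: shape (batch, seq).
    label_logits = np.take_along_axis(logits, labels[..., None], axis=-1)[..., 0]
    # Numerically stable logsumexp over the vocab dimension.
    m = logits.max(axis=-1, keepdims=True)
    lse = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)))[..., 0]
    return label_logits - lse
```

In a real training framework you would additionally process the sequence in chunks so the (chunk, vocab) intermediates stay small; the sketch above only shows the gather-then-logsumexp identity that makes chunking possible.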

occam's razor explanation of OpenAI dropping the o3 price by 80% is that they were sitting on a fat margin and wanted to test out demand. They changed nothing w the model, upgraded their inference code a bit, and made less profit. No nonsense at all; they don't have time for that.
Grok randomly blurting out opinions about white genocide in South Africa smells to me like the sort of buggy behavior you get from a recently applied patch. I sure hope it isn't. It would be really bad if widely used AIs got editorialized on the fly by those who controlled them.
OpenAI has pushed the industry forward again... three audio input buttons, people. take notes.
> fp8 is 100 tflops faster when the kernel name has "cutlass" in it kms github.com/triton-lang/tr…
"Hiring pure backend engineer and expecting them to do non backend stuff IMO is wrong" Sigh. Any capable intern/new grad picks up whatever new technology is needed to get the job done. If you, as an *experienced* engineer, refuse to do so: you're less capable than an intern
Not hiring a backend engineer is entirely ok Hiring pure backend engineer and expecting them to do non backend stuff IMO is wrong even in startups. Startups doesn't mean a pure backend engineer should be made to work on things he has no clue about/not interested in.
Furthermore, the low diversity of codebases limits external validity: Django comprises nearly half of all issues, and five repositories account for over 80% of the benchmark.
OLMoTrace is a one-of-a-kind system and is made possible by Ai2’s commitment to making large pretraining and post-training datasets open in the interest of advancing scientific research in AI and public understanding of AI systems.
in this increasingly digital era, there's no substitute for the book guillotine
New Ai2 office views for my meetings. We’re always hiring top AI talent excited about making the ecosystem more open.
give me your infra and i will code it for you. systems ML is literally my fav thing to work on
everybody wants to do fun experiments nobody wants to write core infrastructure code
"it's a bad benchmark" is cope it's a beautiful benchmark that makes a very compelling argument about efficiency of learning and *should* be solvable by sufficiently intelligent models and nothing today is even close and the 2024 "o3" scaling results aren't a proper solution
This is crazy. We all knew open models would be better for privacy, but to have a court order to maintain 100% of logs under all circumstances is just awful for many types of OpenAI users.
OpenAI are now under a court order to permanently preserve logs of temporary conversations or paid API usage (previously subject to a 30 day retention policy) - a new twist in the now 17 month lawsuit between the New York Times and OpenAI simonwillison.net/2025/Jun/5/ope…
nice! we also recently trained a set of models on 25 different pretraining corpora, each corpus having 14 model sizes trained (4M to 1B), to 5x Chinchilla. We released 30,000+ checkpoints! x.com/allen_ai/statu… arxiv.org/pdf/2504.11393
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
Thrilled to announce I've joined the incredible team at @allen_ai! I'll be working on language modeling!
immortalizing this moment forever when RL is so easy that you can just use random rewards and your benchmarks still go up smh
the biggest headline about codex is really tasteful end-to-end rl. we didn't just stick an api model into a scaffold and ship; like deep research, the codex model has had a ton of practice doing real autonomous coding
leaders in ai talk like there's some master plan but it's literally just this
There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few…
Thanks to the authors for their feedback; we're always looking to improve the platform! If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn't mean the…
Really incredible detective work by @singhshiviii et al. at @Cohere_Labs and elsewhere documenting the ways in which @lmarena_ai works with companies to help them game the leaderboard. arxiv.org/abs/2504.20879
With first Claude and now Gemini playing Pokemon, I was thinking of doing my own game-playing experiment over the weekend. However, I quickly learned that it's very far from the VLA-style "pixels->plan" that I naively thought it was, and wanted to do myself. It's like 90%…
Gemini 2.5 Pro just got the final 8th badge in Pokemon Blue, incredible pace of progress by the world's most powerful model!!! Next up: victory road and final 4 : )