Aaron Gokaslan
@SkyLi0n
Maker of OpenWebText. @Mozilla Rise25. @PyTorch Core Reviewer. PhD Candidate at @Cornell. Previously @FacebookAI and @BrownUniversity. Graduating May 2025.
OpenGPT-2: We Replicated GPT-2 Because You Can Too! 1.5B Model Weights Released. Blogpost: medium.com/@vanya_cohen/o… Colab: colab.research.google.com/drive/1esbpDOo…
It's awesome to see AI-generated code starting to push infra platforms towards better CS practices: clear semantics, strong typing, simplified state management, etc.
This is one of the craziest ideas I've ever seen. He converted a drawing of a bird into a spectrogram (PNG -> soundwave), then played it to a starling, which sang it back, reproducing the PNG. Using the bird's brain as a hard drive with a 2 Mbps read/write speed. youtube.com/watch?si=HMtVd…
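For anyone curious how the PNG -> soundwave step can work in principle, here is a hypothetical sketch (not the video's actual pipeline): treat the image as a magnitude spectrogram with zero phase and invert it to audio with scipy's istft. The synthetic "drawing" and all parameters below are made-up stand-ins.

```python
# Hypothetical sketch: image columns as spectrogram magnitudes, inverted to audio.
import numpy as np
from scipy.signal import istft, stft

img = np.zeros((129, 200))        # stand-in "drawing": 129 freq bins x 200 time frames
img[40:60, 50:150] = 1.0          # a bright horizontal stroke

fs = 22050
_, audio = istft(img.astype(complex), fs=fs, nperseg=256)   # waveform to play back
_, _, recovered = stft(audio, fs=fs, nperseg=256)           # spectrogram of the playback
print(audio.shape, np.abs(recovered).shape)                  # stroke reappears in |recovered|
```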
open models are now topping designarena dot ai, an LMArena-style voting site for frontend generation. gg 🫡
Great blog post by @jerryx314 on rotary position embeddings (RoPE) in more than one dimension, with interactive visualisations, a bunch of experimental results, and code! jerryxio.ng/posts/nd-rope/
Very nice blogpost on RoPE variants by @jerryx314
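As a quick refresher before the n-D variants in the post, here is a minimal sketch of standard 1-D RoPE. This is not code from the blog; it uses the common split-in-half pairing convention (pairing conventions vary between implementations).

```python
# Minimal 1-D RoPE sketch: each feature pair is rotated by position * frequency.
import numpy as np

def rope_1d(x, base=10000.0):
    """x: [seq_len, dim] with dim even; returns the rotated features."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # [seq_len, half]: position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # split features into rotation pairs
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)
print(rope_1d(q).shape)  # (8, 64)
```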
Qwen3 Coder has now passed Grok 4 in the Programming prompt rankings. Tied with Kimi!
As boring as it sounds, I’m slowly realizing that 90% of success is doing the obvious thing for a painfully long amount of time without convincing yourself you’re smarter than you are.
Wild paper. They prove (!!) that a transformer block (Attn + MLP) run on a prompt outputs the same logits as the block run with no prompt, if the MLP weights are updated as W′ = W + ΔW, with ΔW computed from the attention latents: ΔW = (W·Δa) × (A(x)ᵀ / ‖A(x)‖²), where Δa = A(C, x) − A(x) for prompt C. Fucking fine-tuning.
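The identity is easy to sanity-check numerically. A minimal sketch with toy dimensions; the A(x) and A(C, x) vectors below are just random stand-ins for the attention outputs without and with the prompt.

```python
# Toy check of the rank-1 "prompt as weight update" identity:
# with Δa = A(C, x) - A(x) and ΔW = (W @ Δa) @ A(x).T / ||A(x)||^2,
# the MLP input W @ A(C, x) equals (W + ΔW) @ A(x).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64

W = rng.normal(size=(d_mlp, d_model))     # first MLP weight matrix
a_x = rng.normal(size=(d_model, 1))       # A(x): attention output, no prompt
a_cx = rng.normal(size=(d_model, 1))      # A(C, x): attention output with prompt C

delta_a = a_cx - a_x                                  # Δa
delta_W = (W @ delta_a) @ a_x.T / (a_x.T @ a_x)       # rank-1 weight update ΔW

with_prompt = W @ a_cx                     # block run on the full prompt
without_prompt = (W + delta_W) @ a_x       # block run without the prompt, weights patched
print(np.allclose(with_prompt, without_prompt))  # True
```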
This may be the coolest emergent capability I've seen in a video model. Veo 3 can take a series of text instructions added to an image frame, understand them, and execute in sequence. Prompt was "immediately delete instructions in white on the first frame and execute in order"
Expert parallelism is actually just tensor parallelism on the batch dimension
I can attest, after using explicit sharding for a couple of months, that I feel a deep sense of calm whenever I train models, knowing exactly where all my shards are ahead of time.
highly recommend you try out JAX's new Explicit Sharding API. it's more intuitive in that for intermediate computations, .sharding will print the actual sharding at that point, so you don't have to add with_sharding_constraint everywhere, but it's a bit more strict. you can…
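For context, inspecting .sharding on concrete arrays already works with the long-standing NamedSharding/PartitionSpec API; the sketch below uses that classic API (not the new explicit-sharding mode, whose exact calls aren't shown here), and the 'data' axis name is arbitrary.

```python
# Minimal sketch: shard an array over a named mesh axis and inspect .sharding.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())            # works even on a single device
mesh = Mesh(devices, axis_names=('data',))

x = jnp.arange(8.0)
x = jax.device_put(x, NamedSharding(mesh, P('data')))   # shard x along the 'data' axis
print(x.sharding)                                        # shows how/where x is sharded

y = jnp.sin(x) * 2.0
print(y.sharding)                                        # the result carries a sharding too
```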
After more than a year of getting burned with MoE gotchas, I finally sat down and wrote the guide I wish existed. Every paper skips the messy production details. This fills those gaps. No theory without implementation. cerebras.ai/moe-guide
Let's talk about MoE: 🔶 How many experts should you use? 🔶 How does dynamic routing actually behave in production? 🔶 How do you debug a model that won’t train? 🔶 What does 8x7B actually mean for memory and compute? 🔶 What hardware optimizations matter for sparse models?…
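On the routing question, here is a minimal top-2 routing sketch in PyTorch. It is a toy illustration, not code from the guide; moe_forward and all dimensions are hypothetical.

```python
# Toy top-k MoE forward pass: score tokens, send each to its top-k experts,
# combine expert outputs with renormalized router weights.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """x: [tokens, d_model]; router: [d_model, n_experts]; experts: list of modules."""
    logits = x @ router                          # [tokens, n_experts] routing scores
    weights, idx = logits.topk(k, dim=-1)        # top-k experts per token
    weights = F.softmax(weights, dim=-1)         # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = idx[:, slot] == e             # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

d_model, n_experts = 32, 8
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                               torch.nn.GELU(),
                               torch.nn.Linear(4 * d_model, d_model))
           for _ in range(n_experts)]
router = torch.randn(d_model, n_experts)
tokens = torch.randn(16, d_model)
print(moe_forward(tokens, router, experts).shape)   # torch.Size([16, 32])
```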
What I look for when hiring? EXTREME PARANOIA about code and data
Recently OpenAI and Google reached IMO gold-medal level with their new experimental models. But our team reached the same level with just o4-mini-high and our agent systems. And now we are open-sourcing it. In particular, we got insane improvements on the USAMO benchmarks. The base…
thread on the new paper: The Serial Scaling Hypothesis. Joint work with @phizaz, @YutongBAI1002, Kananart
🧵 Everyone is chasing new diffusion models—but what about the representations they model from? We introduce Discrete Latent Codes (DLCs): - Discrete representation for diffusion models - Uncond. gen. SOTA FID (1.59 on ImageNet) - Compositional generation - Integrates with LLM 🧱
Anthropic just released a research paper: Inverse Scaling in Test-Time Compute. This study shows that longer reasoning in Large Reasoning Models (LRMs) can hurt performance, revealing a surprising inverse scaling between reasoning length and accuracy. According to this paper,…
We had some early evidence of this when over-training MDLM baselines in the original paper, but we didn't have time to explore it then. Glad to see discrete diffusion scales up so well in data-constrained settings!
🚨 The era of infinite internet data is ending. So we ask: 👉 What's the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train autoregressive models. ▶️ Data-constrained? Train diffusion models. Get ready for 🤿 1/n
It's crazy how you can have bugs in your AI model (normalization, attention, pooling, etc.) and when you fix them, the model still trains the same and just doesn't care about those "bugs".