Aaron Gokaslan
@SkyLi0n
Maker of OpenWebText. @Mozilla Rise25. @PyTorch Core Reviewer. PhD Candidate at @Cornell. Previously @FacebookAI and @BrownUniversity. Graduating May 2025.
OpenGPT-2: We Replicated GPT-2 Because You Can Too! 1.5B Model Weights Released. Blogpost: medium.com/@vanya_cohen/o… Colab: colab.research.google.com/drive/1esbpDOo…
It's awesome to see AI-generated code starting to push infra platforms towards better CS practices: clear semantics, strong typing, simplified state management, etc.
This is one of the craziest ideas I've ever seen. He converted a drawing of a bird into a spectrogram (PNG -> soundwave), then played it to a starling, which sang it back, reproducing the PNG. Using the bird's brain as a hard drive with a 2 Mbps read/write speed. youtube.com/watch?si=HMtVd…
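For anyone curious how the PNG -> soundwave step can work in principle, here is a hypothetical sketch (not the video's actual pipeline): treat the image as a magnitude spectrogram with zero phase and invert it to audio with scipy's istft. The synthetic "drawing" and all parameters below are made-up stand-ins.

```python
# Hypothetical sketch: image columns as spectrogram magnitudes, inverted to audio.
import numpy as np
from scipy.signal import istft, stft

img = np.zeros((129, 200))        # stand-in "drawing": 129 freq bins x 200 time frames
img[40:60, 50:150] = 1.0          # a bright horizontal stroke

fs = 22050
_, audio = istft(img.astype(complex), fs=fs, nperseg=256)   # waveform to play back
_, _, recovered = stft(audio, fs=fs, nperseg=256)           # spectrogram of the playback
print(audio.shape, np.abs(recovered).shape)                  # stroke reappears in |recovered|
```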
open models are now topping designarena dot ai, an LMArena-style voting site for frontend generation. gg 🫡
Great blog post by @jerryx314 on rotary position embeddings (RoPE) in more than one dimension, with interactive visualisations, a bunch of experimental results, and code! jerryxio.ng/posts/nd-rope/
Very nice blogpost on RoPE variants by @jerryx314
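As a quick refresher before the n-D variants in the post, here is a minimal sketch of standard 1-D RoPE. This is not code from the blog; it uses the common split-in-half pairing convention (pairing conventions vary between implementations).

```python
# Minimal 1-D RoPE sketch: each feature pair is rotated by position * frequency.
import numpy as np

def rope_1d(x, base=10000.0):
    """x: [seq_len, dim] with dim even; returns the rotated features."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # [seq_len, half]: position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # split features into rotation pairs
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)
print(rope_1d(q).shape)  # (8, 64)
```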
Qwen3 Coder has now passed Grok 4 in the Programming prompt rankings. Tied with Kimi!
As boring as it sounds, I’m slowly realizing that 90% of success is doing the obvious thing for a painfully long amount of time without convincing yourself you’re smarter than you are.
Wild paper. They prove (!!) that a transformer block (Attn + MLP) run on a prompt outputs the same logits as the block run with no prompt, if the MLP weights are updated as W′ = W + ΔW, with ΔW computed from the attention latents: ΔW = (W·Δa) × (A(x)ᵀ / ‖A(x)‖²), where Δa = A(C, x) − A(x) for prompt C. Fucking fine-tuning.
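The identity is easy to sanity-check numerically. A minimal sketch with toy dimensions; the A(x) and A(C, x) vectors below are just random stand-ins for the attention outputs without and with the prompt.

```python
# Toy check of the rank-1 "prompt as weight update" identity:
# with Δa = A(C, x) - A(x) and ΔW = (W @ Δa) @ A(x).T / ||A(x)||^2,
# the MLP input W @ A(C, x) equals (W + ΔW) @ A(x).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64

W = rng.normal(size=(d_mlp, d_model))     # first MLP weight matrix
a_x = rng.normal(size=(d_model, 1))       # A(x): attention output, no prompt
a_cx = rng.normal(size=(d_model, 1))      # A(C, x): attention output with prompt C

delta_a = a_cx - a_x                                  # Δa
delta_W = (W @ delta_a) @ a_x.T / (a_x.T @ a_x)       # rank-1 weight update ΔW

with_prompt = W @ a_cx                     # block run on the full prompt
without_prompt = (W + delta_W) @ a_x       # block run without the prompt, weights patched
print(np.allclose(with_prompt, without_prompt))  # True
```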
This may be the coolest emergent capability I've seen in a video model. Veo 3 can take a series of text instructions added to an image frame, understand them, and execute in sequence. Prompt was "immediately delete instructions in white on the first frame and execute in order"
Expert parallelism is actually just tensor parallelism on the batch dimension
I can attest, after using explicit sharding for a couple of months, that I feel a deep sense of calm whenever I train models, knowing exactly where all my shards are ahead of time.
highly recommend you try out JAX's new Explicit Sharding API. it's more intuitive in that for intermediate computations, .sharding will print the actual sharding at that point, so you don't have to add with_sharding_constraint everywhere, but it's a bit more strict. you can…
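For context, inspecting .sharding on concrete arrays already works with the long-standing NamedSharding/PartitionSpec API; the sketch below uses that classic API (not the new explicit-sharding mode, whose exact calls aren't shown here), and the 'data' axis name is arbitrary.

```python
# Minimal sketch: shard an array over a named mesh axis and inspect .sharding.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())            # works even on a single device
mesh = Mesh(devices, axis_names=('data',))

x = jnp.arange(8.0)
x = jax.device_put(x, NamedSharding(mesh, P('data')))   # shard x along the 'data' axis
print(x.sharding)                                        # shows how/where x is sharded

y = jnp.sin(x) * 2.0
print(y.sharding)                                        # the result carries a sharding too
```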
After more than a year of getting burned with MoE gotchas, I finally sat down and wrote the guide I wish existed. Every paper skips the messy production details. This fills those gaps. No theory without implementation. cerebras.ai/moe-guide
Let's talk about MoE: 🔶 How many experts should you use? 🔶 How does dynamic routing actually behave in production? 🔶 How do you debug a model that won’t train? 🔶 What does 8x7B actually mean for memory and compute? 🔶 What hardware optimizations matter for sparse models?…
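On the routing question, here is a minimal top-2 routing sketch in PyTorch. It is a toy illustration, not code from the guide; moe_forward and all dimensions are hypothetical.

```python
# Toy top-k MoE forward pass: score tokens, send each to its top-k experts,
# combine expert outputs with renormalized router weights.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """x: [tokens, d_model]; router: [d_model, n_experts]; experts: list of modules."""
    logits = x @ router                          # [tokens, n_experts] routing scores
    weights, idx = logits.topk(k, dim=-1)        # top-k experts per token
    weights = F.softmax(weights, dim=-1)         # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = idx[:, slot] == e             # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

d_model, n_experts = 32, 8
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                               torch.nn.GELU(),
                               torch.nn.Linear(4 * d_model, d_model))
           for _ in range(n_experts)]
router = torch.randn(d_model, n_experts)
tokens = torch.randn(16, d_model)
print(moe_forward(tokens, router, experts).shape)   # torch.Size([16, 32])
```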
What I look for when hiring? EXTREME PARANOIA about code and data
Recently OpenAI and Google reached IMO gold-medal level with their new experimental models. But our team reached the same level with just o4-mini-high and our agent systems. And now we are open-sourcing it. In particular, we got insane improvements on the USAMO benchmarks. The base…
thread on the new paper: The Serial Scaling Hypothesis. Joint work with @phizaz, @YutongBAI1002, Kananart
🧵 Everyone is chasing new diffusion models—but what about the representations they model from? We introduce Discrete Latent Codes (DLCs): - Discrete representation for diffusion models - Uncond. gen. SOTA FID (1.59 on ImageNet) - Compositional generation - Integrates with LLM 🧱
Anthropic just released a research paper: Inverse Scaling in Test-Time Compute. This study shows that longer reasoning in Large Reasoning Models (LRMs) can hurt performance, revealing a surprising inverse scaling between reasoning length and accuracy. According to this paper,…
We had some early evidence of this when over-training MDLM baselines in the original paper, but we didn't have time to explore it then. Glad to see discrete diffusion scales up so well in data-constrained settings!
🚨 The era of infinite internet data is ending. So we ask: 👉 What's the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train autoregressive models. ▶️ Data-constrained? Train diffusion models. Get ready for 🤿 1/n
It's crazy how you can have bugs in your AI model (normalization, attention, pooling, etc.) and when you fix them, the model still trains the same and just doesn't care about those "bugs".