Sasha Rush
@srush_nlp
Programmer, professor, currently in the bay area https://www.youtube.com/@srush_nlp
Cursor is now on your phone and on the web. Spin off dozens of agents and review them later in your editor.
o3 since this was driving me crazy: A type that implements Rust’s Try trait—like Result or Option—is a “fallible wrapper” that can produce either a success value (Output) or short-circuit with its error/none form (Residual). The Residual itself is that error/early-exit shape, and…
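The short-circuiting described above is the behavior the (still-unstable) Try trait formalizes; its stable surface is the `?` operator on Result and Option. A minimal sketch, using only stable Rust (the function names here are illustrative, not from any library):

```rust
// `?` on a "fallible wrapper": the success value (Output) is unwrapped,
// while the error/none form (Residual) short-circuits out of the function.

fn parse_and_double(s: &str) -> Result<i32, std::num::ParseIntError> {
    // Ok(v) yields v; Err(e) returns early with Err(e).
    let n: i32 = s.parse()?;
    Ok(n * 2)
}

fn first_char_upper(s: &str) -> Option<char> {
    // Some(c) yields c; None returns early with None.
    let c = s.chars().next()?;
    Some(c.to_ascii_uppercase())
}

fn main() {
    assert_eq!(parse_and_double("21"), Ok(42));
    assert!(parse_and_double("nope").is_err());
    assert_eq!(first_char_upper("rust"), Some('R'));
    assert_eq!(first_char_upper(""), None);
}
```

Both functions exit early the moment their wrapper is in its error/none shape, which is exactly the Residual path.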
After my recent news, many people have asked me: "Why not Rust?" Here's my answer:
🚨 The era of infinite internet data is ending. So we ask: 👉 What's the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train Autoregressive models ▶️ Data-constrained? Train Diffusion models Get ready for 🤿 1/n
20 years ago, this type of person would become an elite math professor. Now they're making AI breakthroughs. This is progress (probably)!
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
One ML trend I've been grappling with is that we are post-abstraction. Since everyone is building roughly the same "one big model," there isn't really much need for generality. We're roughly converging to giant, straight-line, assembly-coded systems that get recoded each year.
Scaling Data-Constrained LMs is now also in JMLR: jmlr.org/papers/v26/24-… Looking back at it 2yrs later, repeating & mixing seem standard now, but maybe another powerful lever to scale data-constrained LMs turns out to have been RL - arguably underrated back then!
**Outstanding Main Track Runner-Ups** Scaling Data-Constrained Language Models Direct Preference Optimization: Your Language Model is Secretly a Reward Model
haven't made a new blog post in over a year, so here's a new one: justintchiu.com/blog/sftrl/ it's short
Can we build an operating system entirely powered by neural networks? Introducing NeuralOS: towards a generative OS that directly predicts screen images from user inputs. Try it live: neural-os.com Paper: huggingface.co/papers/2507.08… Inspired by @karpathy's vision. 1/5
"Chatting" with an LLM feels like using an 80s computer terminal. The GUI hasn't been invented yet, but imo some properties of it can start to be predicted. 1 it will be visual (like GUIs of the past) because vision (pictures, charts, animations, not so much reading) is the 10-lane…
Can transformers analyze code efficiently? ✅ Yes. We prove transformers efficiently handle real compiler tasks (AST construction, symbol resolution, type inference) using only logarithmic size, while RNNs require linear size (in input length). Paper: arxiv.org/abs/2410.14706 #COLM2025
L1 is heading to COLM! We've released 5 new open L1 models and the Massive-Math dataset to celebrate:
Super excited to see L1 accepted to #COLM2025! We are further open-sourcing 5 new models & a dataset: 1. L1-7B & L1-8B: Exact and Max variants 2. L1-1.5B-Short: Short reasoning model (SRM), RL-trained on 1.2M data points 3. Massive-Math-455K: A clean, unified math dataset 🧵
two updates: 1. flying to ICML tonight 2. i joined @cursor_ai a month ago come talk to me to learn what makes research at cursor special :)
😅
today i woke up to a living version of a phd student's nightmare. a new paper in my inbox: a detailed reproduction of a paper i wrote several years ago. every table, graph, model, line of code. everything should certainly reproduce! but i hadn't checked in a while... 😳
Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions One result tells the story: A transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
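As a toy illustration of the chunking idea (not the H-Net method itself, which learns boundaries dynamically inside the model), grouping a raw byte stream into variable-length segments wherever a boundary predictor fires might look like this; the whitespace heuristic below is a stand-in for a learned predictor:

```rust
// Toy dynamic chunking: split a byte stream into variable-length chunks
// wherever a boundary predictor fires. A real model would score
// boundaries with a learned network; here whitespace is the stand-in.

fn is_boundary(b: u8) -> bool {
    b == b' ' // assumption: placeholder for a learned boundary score
}

fn dynamic_chunks(bytes: &[u8]) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    for &b in bytes {
        if is_boundary(b) {
            // Close the current chunk at a predicted boundary.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
        } else {
            current.push(b);
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let chunks = dynamic_chunks(b"end to end");
    assert_eq!(chunks, vec![b"end".to_vec(), b"to".to_vec(), b"end".to_vec()]);
}
```

With a learned predictor, the same loop would discover higher-level units than words, which is what makes the approach tokenizer-free.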
WOW! 🤯 this groundbreaking dataset from Meta’s Chief AI Scientist has revolutionized the way that we understand vision 👀 🚀 is this one of the highest-impact releases of all time?? ⏳🔥 10 crazy examples below: 🧵
Finally closed our $11M+ funding round! Backed by top Japanese VCs and amazing angel investors including Joi Ito, @Thom_Wolf from @huggingface, @nlpnoah, @LukeZettlemoyer, and @srush_nlp. Now it’s time to focus on commercialization and tech development!!
We've closed a ¥1.7B seed-2 round to accelerate the commercialization and R&D of real-time voice AI 🔥 We're hiring! prtimes.jp/main/html/rd/p… #GlobisCapitalPartners #BoostCapital #SIPCapital @Joi @Thom_Wolf #ToruShimada @LukeZettlemoyer @nlpnoah