Sabri Eyuboglu
@EyubogluSabri
Working on language model memory. CS PhD student @Stanford working with @HazyResearch and @james_y_zou. 🪬
When we put lots of text (e.g., a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x…
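To make the idea concrete, here is a self-contained toy sketch, not the paper's actual recipe: a short trainable key/value prefix is optimized so that frozen attention over it matches frozen attention over a much longer "document" KV cache. The dimensions, the plain MSE objective, and the random stand-in for self-study queries are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, doc_len, cache_len, q_len = 64, 1024, 32, 16

# Frozen stand-in for the KV cache a long document would occupy.
doc_k, doc_v = torch.randn(doc_len, d), torch.randn(doc_len, d)

# The trainable "cartridge": 32 KV slots instead of 1024 (a ~32x reduction).
cache_k = torch.randn(cache_len, d, requires_grad=True)
cache_v = torch.randn(cache_len, d, requires_grad=True)
opt = torch.optim.Adam([cache_k, cache_v], lr=1e-2)

def attend(q, k, v):
    # Plain scaled dot-product attention.
    w = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    return w @ v

for step in range(2000):
    q = torch.randn(q_len, d)            # stand-in for "self-study" queries
    target = attend(q, doc_k, doc_v)     # teacher: attends over the full document
    pred = attend(q, cache_k, cache_v)   # student: attends over the small cache
    loss = F.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the toy: the compressed cache is learned offline, once per corpus, so at serving time you pay the memory cost of 32 KV slots rather than 1024.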

Check out Tokasaurus on Modal to make Llama-1B brrr! This repeated sampling example shows off two engine features that are important for serving small models: very low CPU overhead and automatic shared prefix exploitation with Hydragen.
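For a sense of what driving an engine like this looks like, here is a minimal repeated-sampling client, assuming an OpenAI-compatible completions endpoint; the base URL, port, and model name below are placeholders, not Tokasaurus defaults.

```python
from openai import OpenAI

# Placeholder endpoint: point this at wherever the server is listening.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Q: Janet has 3 apples and buys 5 more. How many apples? A:",
    n=128,             # many samples over one shared prompt
    max_tokens=256,
    temperature=0.8,
)
answers = [choice.text for choice in resp.choices]
```

Because all n samples share one prompt, an engine with Hydragen-style shared-prefix attention can batch attention over the common prefix instead of recomputing it per sample.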
Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!
Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!
Thank you for the kind words -- we can't either! We're really excited about models learning new things and remembering their experience, and we think that Cartridges is a step towards that future. My co-author @EyubogluSabri will be giving a talk on Cartridges at @ESFoMo at…
can't stop thinking about this one. insanely elegant, seems insanely powerful
Cartridges could be this "missing learning paradigm" Karpathy talks about: 1) an agent does tasks and collects memories that help it do better via ICL, 2) those memories are trained/compacted into Cartridges, 3) Cartridges are shared/composed/RAG-ed between other agents
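A toy sketch of that loop, where every function is a stand-in to show the shape of the idea, not an API from the Cartridges repo:

```python
memories: list[str] = []

def run_task(task: str, memories: list[str]) -> str:
    # 1) The agent conditions on its accumulated memories via ICL.
    context = "\n".join(memories)
    return f"result({task}, {len(context)} chars of memory in context)"

def compact(memories: list[str]) -> str:
    # 2) Memories get trained/compacted; in Cartridges this would be
    #    gradient-based compression into a small KV cache, not a string.
    return f"cartridge<{len(memories)} memories>"

for task in ["task-a", "task-b", "task-c"]:
    memories.append(f"{task}: {run_task(task, memories)}")

cartridge = compact(memories)
# 3) The cartridge can now be shared, composed, or retrieved (RAG-style)
#    by other agents instead of replaying the raw memories.
print(cartridge)
```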
found my weekend experiment
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
ES-FoMo is back tomorrow! Come join us in East Exhibition Hall A bright and early at 8:30AM for a great slate of invited talks, orals, spotlight lightning talks, and 150 posters!
honestly i would much rather have an open-source 4.1-mini than an open-source o3-mini
Thanks @willccbb!! For those at ICML, I'm giving a talk on Cartridges at the ES-FoMo workshop on Saturday at 10:45 -- come through!! Excited to talk memory, test-time training, and continual learning!
Agreed. And very interesting in prod for static/semi-static corpora, in our tests...
thanks @willccbb!! check out Cartridges at ICML ES-FoMo this week :) excited for what's next
We’re presenting Minions at ICML starting now until 1:30pm at E-2907 — come by and chat!!
How can we use small LLMs to shift more AI workloads onto our laptops and phones? In our paper and open-source code, we pair on-device LLMs (@ollama) with frontier LLMs in the cloud (@openai, @together), to solve token-intensive workloads on your 💻 at 17.5% of the cloud cost…
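A drastically simplified sketch of that division of labor, not the actual Minions protocol; the model names and the naive chunking heuristic are assumptions. The local model burns cheap on-device tokens reading the long document, and the cloud model only ever sees a short digest.

```python
import ollama
from openai import OpenAI

cloud = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def minion_answer(question: str, document: str, chunk_size: int = 4000) -> str:
    # Local model (free on-device tokens) reads the long document chunk by chunk.
    notes = []
    for i in range(0, len(document), chunk_size):
        chunk = document[i:i + chunk_size]
        reply = ollama.chat(model="llama3.2", messages=[{
            "role": "user",
            "content": f"Extract any facts relevant to: {question}\n\n{chunk}",
        }])
        notes.append(reply["message"]["content"])

    # Cloud model (expensive tokens) only sees the short notes, not the document.
    resp = cloud.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nNotes from a local reader:\n"
                       + "\n".join(notes),
        }],
    )
    return resp.choices[0].message.content
```

The cost savings come from the asymmetry: the token-heavy reading happens locally, so cloud spend scales with the size of the digest rather than the size of the document.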
Minions poster 🥹 Thursday 11am Pacific, East Exhibition Hall A-B, E-2907
Looking forward to attending ICML! Here are some works on memory/long context, verification, kernel design, multi-model AI systems, and theoretical understanding of test-time scaling from my awesome students and collaborators!
At #ICML2025 in Vancouver 🇨🇦 this week, presenting some work from my first year at Stanford! Come find me at posters or just around the conference!
Thursday: KernelBench: Can LLMs Write Efficient GPU Kernels? 11AM East E-2010
Saturday: Kevin: Multi-Turn RL for Generating…
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
We’re thrilled to collaborate with the @HazyResearch lab at @StanfordAILab, led by Chris Ré, to power Minions, their cutting-edge agentic framework tackling the cost-accuracy tradeoff in modern AI systems. This innovation is enabled on AMD Ryzen AI, thanks to seamless integration with…