Ari Holtzman
@universeinanegg
Asst Prof @UChicagoCS & @DSI_UChicago, leading Conceptualization Lab (http://conceptualization.ai). Minting new vocabulary to conceptualize generative models.
If you want a respite from OpenAI drama, how about joining academia? I'm starting Conceptualization Lab, recruiting PhDs & Postdocs! We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new. conceptualization.ai
The Economist published my little letter about the necessity of chaos for discovery
How can chaos create brilliance and breakthroughs? Ari Holtzman (@universeinanegg), Assistant Professor of Computer Science and Data Science, explores how embracing chaos has unlocked the capabilities of AI systems in a letter to @TheEconomist! economist.com/letters/2025/0…
cool new take on how to standardize agent testing enough for comparisons to be meaningful. standardization is an underrated niche!
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating each one can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
very cool step towards understanding how agents actually perform instead of just judging incomparable setups
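To make the "npm of agent benchmarks" idea concrete, here is a purely illustrative sketch of the design pattern in generic Python. None of these names come from Terminal-Bench's actual CLI or API (see its docs for the real interface); the point is just that once every benchmark sits behind one uniform interface, any agent can be scored on any registered benchmark without a bespoke harness.

```python
# Hypothetical sketch of a benchmark registry; names are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Benchmark:
    name: str
    tasks: list[str]
    score: Callable[[str, str], float]  # (task, agent_output) -> score in [0, 1]

REGISTRY: dict[str, Benchmark] = {}     # stand-in for a shared, versioned registry

def register(bench: Benchmark) -> None:
    REGISTRY[bench.name] = bench

def run(agent: Callable[[str], str], bench_name: str) -> float:
    """Run one agent against one registered benchmark and return mean score."""
    bench = REGISTRY[bench_name]
    scores = [bench.score(task, agent(task)) for task in bench.tasks]
    return sum(scores) / len(scores)

# Toy benchmark and toy agent: with the uniform interface, scores from
# different agents on the same benchmark are directly comparable.
register(Benchmark("toy-echo", ["say hi", "say bye"],
                   score=lambda task, out: float(task.split()[-1] in out)))
print(run(lambda task: f"I will {task}", "toy-echo"))
```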
Finally, a curated set of papers that definitively covers all the areas of actual importance in AI, and science more broadly
So much research is being done about LLMs that it's hard to stay on top of the literature. To help with this, I've made a list of all the most important papers from the past 8 years: rtmccoy.com/pubs/ I hope you enjoy!
Come chat with us at our ICML poster tomorrow! 📈 Learn about the best ways to evaluate for base language model development 🧪 Find out how you can use our suite of models, which spans differences in pretraining distribution, for your own research 😆 Get a DataDecide sticker
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
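A minimal sketch of the kind of comparison a suite like this enables: score the same small evaluation set under models trained on different pretraining corpora and compare. The model ids below are stand-ins (gpt2 / distilgpt2) so the snippet runs as written; to do the real comparison you would substitute checkpoint repos from the DataDecide collection on Hugging Face, whose exact names are not reproduced here.

```python
# Sketch: compare average per-token loss of small models on a shared eval set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model ids; swap in DataDecide checkpoints from the actual collection.
CANDIDATE_REPOS = ["gpt2", "distilgpt2"]

def avg_nll(repo_id: str, texts: list[str]) -> float:
    """Average negative log-likelihood of `texts` under the model at `repo_id`."""
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id).eval()
    losses = []
    for text in texts:
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

eval_texts = ["The mitochondria is the powerhouse of the cell."]
for repo in CANDIDATE_REPOS:
    print(repo, avg_nll(repo, eval_texts))
```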
Everyone should read this paper, especially if you've never thought that prompting could be rigorous science. (1/n)
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering. This is holding us back. 🧵and new paper: arxiv.org/abs/2507.00163
Prompting is an extraordinary gift to interpretability researchers. Use it! Use it a lot (carefully)! ofc it has issues but it's so much more useful than most complicated interp methods people cook up...
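To make "prompting as scientific inquiry" concrete, here is a minimal sketch of a controlled prompting experiment; the model and prompts are illustrative, not from the paper. The idea is to treat prompt wording as an experimental factor with several levels (paraphrases), sample each level repeatedly, and report accuracy per level, rather than cherry-picking the single best prompt.

```python
# Sketch: prompt wording as a controlled factor, with repeated sampling per level.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

PARAPHRASES = [
    "Q: What is the capital of France? A:",
    "The capital of France is",
    "Answer briefly: the capital city of France is",
]
N_SAMPLES = 5  # repeated samples per paraphrase

for prompt in PARAPHRASES:
    outs = generator([prompt] * N_SAMPLES,
                     max_new_tokens=10, do_sample=True, temperature=0.7)
    hits = sum("Paris" in o[0]["generated_text"] for o in outs)
    print(f"{hits}/{N_SAMPLES} correct | prompt: {prompt!r}")
```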
Seriously proud of Aryan for leading this project! Helped me clarify my thinking on what 'harmful latent information' might still be lurking in aligned LLMs👹
🤫Jailbreak prompts make aligned LMs produce harmful responses.🤔But is that info linearly decodable? ↗️We show many refused concepts are linearly represented, sometimes persist through instruction-tuning, and may also shape downstream behavior❗ arxiv.org/abs/2507.00239 🧵1/
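The underlying technique here is linear probing of hidden states. Below is a minimal sketch of the general recipe, not the paper's exact setup: the model, probed layer, and toy labeled prompts are all placeholders. Extract last-token representations from an LM and fit a logistic-regression probe to test whether a concept label is linearly decodable from them.

```python
# Sketch: linear probe on LM hidden states for a binary concept label.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "gpt2"   # placeholder; the paper studies aligned / instruction-tuned LMs
LAYER = 6           # which hidden layer to probe (an assumption)

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def last_token_rep(text: str) -> torch.Tensor:
    """Hidden representation of the final token at the probed layer."""
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy labeled prompts: 1 = touches the concept of interest, 0 = does not.
texts  = ["how to pick a lock", "how to bake bread",
          "how to hotwire a car", "how to plant tomatoes"]
labels = [1, 0, 1, 0]

X = torch.stack([last_token_rep(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```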