Ari Holtzman
@universeinanegg
Asst Prof @UChicagoCS & @DSI_UChicago, leading Conceptualization Lab (http://conceptualization.ai). Minting new vocabulary to conceptualize generative models.
If you want a respite from OpenAI drama, how about joining academia? I'm starting Conceptualization Lab, recruiting PhDs & Postdocs! We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new. conceptualization.ai
The Economist published my little letter about the necessity of chaos for discovery
How can chaos create brilliance and breakthroughs? Ari Holtzman (@universeinanegg), Assistant Professor of Computer Science and Data Science, explores how embracing chaos has unlocked the capabilities of AI systems in a letter to @TheEconomist! economist.com/letters/2025/0…
cool new take on how to standardize agent testing enough for comparisons to be meaningful. standardization is an underrated niche!
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating each one can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
very cool step towards understanding how agents actually perform instead of just judging incomparable setups
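To make the "npm of agent benchmarks" idea concrete, here is a purely illustrative sketch of the design pattern in generic Python. None of these names come from Terminal-Bench's actual CLI or API (see its docs for the real interface); the point is just that once every benchmark sits behind one uniform interface, any agent can be scored on any registered benchmark without a bespoke harness.

```python
# Hypothetical sketch of a benchmark registry; names are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Benchmark:
    name: str
    tasks: list[str]
    score: Callable[[str, str], float]  # (task, agent_output) -> score in [0, 1]

REGISTRY: dict[str, Benchmark] = {}     # stand-in for a shared, versioned registry

def register(bench: Benchmark) -> None:
    REGISTRY[bench.name] = bench

def run(agent: Callable[[str], str], bench_name: str) -> float:
    """Run one agent against one registered benchmark and return mean score."""
    bench = REGISTRY[bench_name]
    scores = [bench.score(task, agent(task)) for task in bench.tasks]
    return sum(scores) / len(scores)

# Toy benchmark and toy agent: with the uniform interface, scores from
# different agents on the same benchmark are directly comparable.
register(Benchmark("toy-echo", ["say hi", "say bye"],
                   score=lambda task, out: float(task.split()[-1] in out)))
print(run(lambda task: f"I will {task}", "toy-echo"))
```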
Finally, a curated set of papers that definitively covers all the areas of actual importance in AI, and science more broadly
So much research is being done about LLMs that it's hard to stay on top of the literature. To help with this, I've made a list of all the most important papers from the past 8 years: rtmccoy.com/pubs/ I hope you enjoy!
Come chat with us at our ICML poster tomorrow! 📈 Learn about the best ways to evaluate for base language model development 🧪 Find out how you can use our suite of models, which spans differences in pretraining distribution, for your own research 😆 Get a DataDecide sticker
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
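A minimal sketch of the kind of comparison a suite like this enables: score the same small evaluation set under models trained on different pretraining corpora and compare. The model ids below are stand-ins (gpt2 / distilgpt2) so the snippet runs as written; to do the real comparison you would substitute checkpoint repos from the DataDecide collection on Hugging Face, whose exact names are not reproduced here.

```python
# Sketch: compare average per-token loss of small models on a shared eval set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model ids; swap in DataDecide checkpoints from the actual collection.
CANDIDATE_REPOS = ["gpt2", "distilgpt2"]

def avg_nll(repo_id: str, texts: list[str]) -> float:
    """Average negative log-likelihood of `texts` under the model at `repo_id`."""
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id).eval()
    losses = []
    for text in texts:
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

eval_texts = ["The mitochondria is the powerhouse of the cell."]
for repo in CANDIDATE_REPOS:
    print(repo, avg_nll(repo, eval_texts))
```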
Everyone should read this paper, especially if you've never thought that prompting could be rigorous science. (1/n)
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering. This is holding us back. 🧵and new paper: arxiv.org/abs/2507.00163
Prompting is an extraordinary gift to interpretability researchers. Use it! Use it a lot (carefully)! ofc it has issues but it's so much more useful than most complicated interp methods people cook up...
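To make "prompting as scientific inquiry" concrete, here is a minimal sketch of a controlled prompting experiment; the model and prompts are illustrative, not from the paper. The idea is to treat prompt wording as an experimental factor with several levels (paraphrases), sample each level repeatedly, and report accuracy per level, rather than cherry-picking the single best prompt.

```python
# Sketch: prompt wording as a controlled factor, with repeated sampling per level.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

PARAPHRASES = [
    "Q: What is the capital of France? A:",
    "The capital of France is",
    "Answer briefly: the capital city of France is",
]
N_SAMPLES = 5  # repeated samples per paraphrase

for prompt in PARAPHRASES:
    outs = generator([prompt] * N_SAMPLES,
                     max_new_tokens=10, do_sample=True, temperature=0.7)
    hits = sum("Paris" in o[0]["generated_text"] for o in outs)
    print(f"{hits}/{N_SAMPLES} correct | prompt: {prompt!r}")
```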
Seriously proud of Aryan for leading this project! Helped me clarify my thinking on what 'harmful latent information' might still be lurking in aligned LLMs👹
🤫Jailbreak prompts make aligned LMs produce harmful responses.🤔But is that info linearly decodable? ↗️We show many refused concepts are linearly represented, sometimes persist through instruction-tuning, and may also shape downstream behavior❗ arxiv.org/abs/2507.00239 🧵1/
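The underlying technique here is linear probing of hidden states. Below is a minimal sketch of the general recipe, not the paper's exact setup: the model, probed layer, and toy labeled prompts are all placeholders. Extract last-token representations from an LM and fit a logistic-regression probe to test whether a concept label is linearly decodable from them.

```python
# Sketch: linear probe on LM hidden states for a binary concept label.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "gpt2"   # placeholder; the paper studies aligned / instruction-tuned LMs
LAYER = 6           # which hidden layer to probe (an assumption)

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def last_token_rep(text: str) -> torch.Tensor:
    """Hidden representation of the final token at the probed layer."""
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy labeled prompts: 1 = touches the concept of interest, 0 = does not.
texts  = ["how to pick a lock", "how to bake bread",
          "how to hotwire a car", "how to plant tomatoes"]
labels = [1, 0, 1, 0]

X = torch.stack([last_token_rep(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```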