EleutherAI
@AiEleuther
A non-profit research lab focused on interpretability, alignment, and ethics of artificial intelligence. Creators of GPT-J, GPT-NeoX, Pythia, and VQGAN-CLIP
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
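For anyone who wants to poke at the data, here is a minimal sketch of streaming a slice of an openly licensed text dataset from the Hugging Face Hub. The repository id and the "text" column name below are illustrative assumptions, not confirmed Common Pile identifiers; check the release for the real ones.

```python
# Hedged sketch: stream a small slice of a large openly licensed text corpus.
# The repo id below is a placeholder, not a confirmed Common Pile identifier.
from datasets import load_dataset

ds = load_dataset(
    "common-pile/example-subset",  # hypothetical repo id; see the release for real names
    split="train",
    streaming=True,                # avoid downloading terabytes up front
)

for i, example in enumerate(ds):
    # Most Hub text datasets expose a "text" column; adjust if the schema differs.
    print(example.get("text", "")[:200])
    if i >= 2:
        break
```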

Our head of HPC was the researcher in the recent @METR_Evals study on AI-assisted coding who saw the biggest (tied) productivity boost from using AI in his coding. Hear his account and lessons about using AI to code more effectively here.
I was one of the 16 devs in this study, and I want to share my thoughts on the causes of and mitigation strategies for dev slowdown. As a "why listen to you?" hook: I experienced a -38% AI speedup on my assigned issues. I think transparency helps the community.
I’m in Vienna all week for @aclmeeting and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!
✨New pre-print✨ Crosslingual transfer allows models to leverage representations learned for one language to improve performance on another. We characterize the acquisition of shared representations to better understand how and when crosslingual transfer happens.
If you're into multilingual NLP, tokenization, or open models definitely come say hi to @linguist_cat @aclmeeting! Wednesday at 11am (Poster Session 4 in HALL X4 X5)
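One hedged way to see what "shared representations" means in practice: embed parallel sentences with an off-the-shelf multilingual encoder and compare them. This is an illustrative probe, not the paper's methodology; the model name is just a common multilingual checkpoint chosen for the example.

```python
# Illustrative probe of shared crosslingual representations: mean-pool hidden
# states for parallel sentences and compare them with cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # assumption: any multilingual encoder works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Parallel sentences in two languages; high similarity suggests the model maps
# them into a shared representation space.
en = sentence_embedding("The cat is sleeping on the sofa.")
de = sentence_embedding("Die Katze schläft auf dem Sofa.")
print(torch.cosine_similarity(en, de, dim=0).item())
```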
If you want to help us improve language and cultural coverage, and build an open-source LangID system, please register for our shared task! 💬 Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/ Deadline: July 23, 2025 (AoE) ⏰
commoncrawl.org/blog/wmdqs-sha…
What can we learn from small models in the age of very large models? How can academics and low-compute orgs contribute? @BlancheMinerva will be speaking on a panel on these topics and more! Methods and Opportunities at Small Scale Workshop, Sat 19 Jul, 16:00-16:45
MOSS is happening this Saturday (7/19) at West Ballroom B, Vancouver Center! We are excited to have an amazing set of talks, posters, and panel discussions on the insights from and potential of small-scale analyses. Hope to see a lot of you there! 💡
Really grateful to the organizers for the recognition of @magikarp_tokens and my work!
🏆 Announcing our Best Paper Awards! 🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO78C… 🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4xk… Congrats! 🎉
Starting now!
The dream of SAEs is to serve as more interpretable proxies for LLMs, but even these proxies can be challenging to interpret. In this work we introduce a pipeline for automatically interpreting SAE latents at scale using LLMs. Thu 17 Jul, 11:00 am-1:30 pm East…
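The general recipe described here, sketched in hedged form: collect the text snippets that most strongly activate a given SAE latent, then ask an explainer LLM to describe the shared pattern. The helper names and the llm() callable below are placeholders for illustration, not the paper's actual pipeline code.

```python
# Hedged sketch of LLM-based auto-interpretation of an SAE latent:
# select top-activating examples, build a prompt, query an explainer model.
from typing import Callable, List, Tuple

def top_activating_examples(
    records: List[Tuple[str, float]], k: int = 10
) -> List[Tuple[str, float]]:
    """Keep the k snippets with the highest activation for this latent."""
    return sorted(records, key=lambda r: r[1], reverse=True)[:k]

def build_explanation_prompt(examples: List[Tuple[str, float]]) -> str:
    """Format the examples into a prompt asking for a short latent description."""
    lines = [f"(activation={act:.2f}) {text}" for text, act in examples]
    return (
        "These text snippets all strongly activate the same latent feature.\n"
        + "\n".join(lines)
        + "\nIn one sentence, what pattern do they share?"
    )

def interpret_latent(
    records: List[Tuple[str, float]], llm: Callable[[str], str]
) -> str:
    """End-to-end: select examples, build the prompt, query the explainer LLM."""
    prompt = build_explanation_prompt(top_activating_examples(records))
    return llm(prompt)

# Usage with a stub "LLM" so the sketch runs standalone:
fake_records = [
    ("The DOW fell 2% today.", 8.1),
    ("Stocks rallied after strong earnings.", 7.4),
]
print(interpret_latent(fake_records, llm=lambda p: "Mentions of stock-market movements."))
```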
SCRIPT-BPE coming to ICML next week!
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to all researchers with work presented! @simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new pre-print out and will be presenting our paper at the Tokenization Workshop @tokshop2025 at ICML next week! @marisahudspeth @brendan642
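As a rough illustration of the kind of metric involved (an assumption-laden sketch, not MorphScore's exact definition): score a tokenizer by how many of its split points within a word fall on gold morpheme boundaries.

```python
# Illustrative morpheme/token boundary-alignment score: what fraction of a
# tokenizer's split points coincide with gold morpheme boundaries for a word.
from typing import List

def boundary_positions(segments: List[str]) -> set:
    """Character offsets of the internal boundaries in a segmentation."""
    positions, total = set(), 0
    for seg in segments[:-1]:
        total += len(seg)
        positions.add(total)
    return positions

def alignment_score(morphemes: List[str], tokens: List[str]) -> float:
    """Fraction of tokenizer boundaries that land on morpheme boundaries."""
    token_bounds = boundary_positions(tokens)
    if not token_bounds:  # single-token word: nothing to misalign
        return 1.0
    morph_bounds = boundary_positions(morphemes)
    return len(token_bounds & morph_bounds) / len(token_bounds)

# Example: "unhappiness" segmented as un + happi + ness.
print(alignment_score(["un", "happi", "ness"], ["unh", "appiness"]))   # 0.0, misaligned
print(alignment_score(["un", "happi", "ness"], ["un", "happiness"]))   # 1.0, aligned
```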