Yoav Artzi
@yoavartzi
Research/prof @cs_cornell + @cornell_tech🚡 / http://nlp.cornell.edu / assoc. faculty director @arxiv / building http://recnet.io/yoav and @COLM_conf
It's now public! My postdoc call is for the inaugural postdoc as part of this $10.5M gift for a new AI fellows program at Cornell. There's a lot more in this program, so expect more exciting things here real soon! news.cornell.edu/stories/2025/0… Application: forms.gle/tiydAChgV1wLcQ…
I am looking for a postdoc. A serious-looking call coming soon, but this is to get it going. Topics include (but not limited to): LLMs (🫢!), multimodal LLMs, interaction+learning, RL, intersection with cogsci, ... see our work to get an idea: yoavartzi.com/pubs Plz RT 🙏
🔹Meet the Organizing Committee of our Thematic Semester dedicated to Autonomous LLM Agents Details and registration: ivado.ca/en/thematic-pr… #AI #ArtificialIntelligence #AutonomousAgents #LLM
What are the best LLM pre-training papers? That give the most insight into the process. Current/recent, and older papers that stand the test of time.
The list of accepted papers for COLM 2025 is now available here: colmweb.org/AcceptedPapers… The papers will be made available around mid-August, following the camera-ready deadline.
COLM 2025 is now accepting applications for: Financial Assistance Application -- docs.google.com/forms/d/e/1FAI… Volunteer Application -- docs.google.com/forms/d/e/1FAI… Childcare Financial Assistance Application -- docs.google.com/forms/d/e/1FAI… All due by July 31
if you do pre-training, you could try arxiv.org/abs/2401.09135 by @cranialxix x.com/Ar_Douillard/s… it worked pretty well at <500M scale. If you do post-training, then do something like INTELLECT-2, where most workers are samplers and others are learners
We release the async extension of DiLoCo shared in November, led by our amazing intern @cranialxix! 👀 TL;DR: we do distributed data-parallelism of a language model across the world, synchronized every 10-100 steps, AND using heterogeneous devices 🧵 below
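For anyone skimming these two threads, here is a minimal sketch of the periodic-synchronization idea they point at: each worker trains locally for many steps with no communication, then a single outer step is applied to the averaged parameter delta. Everything below is illustrative, the workers are simulated in one process, and the toy model, data, and hyperparameters are placeholders; the actual method runs each replica on its own (possibly heterogeneous) hardware.

```python
# Sketch of DiLoCo-style periodic synchronization, simulated in one process.
# Toy model/data/hyperparameters; not the released implementation.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
global_model = nn.Linear(32, 32)                      # stand-in for a language model
outer_opt = torch.optim.SGD(global_model.parameters(),
                            lr=0.7, momentum=0.9, nesterov=True)

NUM_WORKERS, LOCAL_STEPS, OUTER_ROUNDS = 4, 50, 10    # sync once per LOCAL_STEPS

for _ in range(OUTER_ROUNDS):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for w in range(NUM_WORKERS):
        # Each worker starts from the current global weights.
        local = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for _ in range(LOCAL_STEPS):                  # communication-free inner loop
            x = torch.randn(8, 32)                    # worker-local toy batch
            loss = ((local(x) - x) ** 2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Accumulate the "outer gradient": global params minus post-training local params.
        for d, gp, lp in zip(deltas, global_model.parameters(), local.parameters()):
            d += (gp.detach() - lp.detach()) / NUM_WORKERS
    # One outer optimizer step on the averaged delta (the only communication point).
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```

Because communication happens only once every LOCAL_STEPS inner steps, slower or older devices mostly cost wall-clock time on their own inner loop rather than stalling every gradient exchange, which is what makes the heterogeneous-cluster setting workable.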
What work/software is out there about training models on heterogeneous clusters? Let's say I have access to various machines, some with more contemporary GPUs, some with much older.
Check out our LMLM, our take on what is now being called a "cognitive core" (as far as branding goes, this one is not bad): what it can look like, how it behaves, and how you train for it. arxiv.org/abs/2505.15962
The race for the LLM "cognitive core" - a few-billion-param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystallizing: - Natively multimodal…
I'm hiring at least one post-doc! We're interested in creating language models that process language more like humans than mainstream LLMs do, through architectural modifications and interpretability-style steering.
Hearing a lot of opposition to RAG, largely sounding like "it (will) never work". Trying to reconcile this with almost every query I put into ChatGPT doing a web search and retrieving content. So, it definitely seems to work. What am I missing? Is it a definition discrepancy?
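For context on what "RAG" means in this question, here is a minimal sketch of the retrieve-then-generate flow (roughly what a web-search-backed chat query does): score documents against the query, prepend the best match to the prompt, then generate. The tiny corpus, query, TF-IDF retriever, and stubbed generation step are all illustrative placeholders, not any specific system's pipeline.

```python
# Minimal retrieval-augmented generation sketch: retrieve, then condition on it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "COLM 2025 accepted papers are listed on the conference website.",
    "DiLoCo synchronizes data-parallel replicas every few hundred steps.",
    "The Common Pile is an openly licensed pretraining corpus.",
]
query = "Which dataset uses only openly licensed text?"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)
scores = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
top_doc = corpus[scores.argmax()]                 # naive top-1 retrieval

prompt = f"Context: {top_doc}\n\nQuestion: {query}\nAnswer:"
# A real system would now call an LLM on `prompt`; shown here as a stub.
print(prompt)
```

Production systems swap the TF-IDF scorer for dense embeddings or a live web search, but the shape of the pipeline is the same, which is why a web-search-backed chat query counts as RAG under most definitions.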
COLM 2025 will include 6 plenary sessions. The details about the format (panel vs. keynote) and topics will be announced soon. We are excited to have our plenary sessions led by experts reflecting the broad impact and intellectual spectrum of language modeling. Registration is open.
We are making progress on discussions, but also running out of time. Discussion ends tomorrow. Reviewers and ACs, please get jiggy with it! 💃🕺 Updated stats in the next message ⬇️
We are doing our best to encourage engagement during the discussion period. It's moving, even if we wish folks would engage more. 🫵🫵🫵Reviewers, please login, read author responses, consider them, write back! 🫵ACs🫵, we need your help too :)
❤️
We've always been strongly indebted to arXiv. Thank you for your amazing work.
👏
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1&2
Always interesting to see how deadlines impact engagement trends
The 2nd stage of the discussion period has now started. The intermediate response deadline was very effective, so now we have plenty of time for ACs, reviewers, and authors to discuss! Let's get that red curve up! 📈📈📈