Siddharth Joshi
@sjoshi804
ML PhD at @UCLA under @baharanm | Data Curation for Efficient & Robust SSL @datologyai | Prev @MSFTResearch, @Cisco Research, @Microsoft
📢Excited to share the recording of our #ICML2024 Tutorial on Foundations of Data-Efficient Learning: youtu.be/30VkdWuwmdA Truly grateful to everyone who attended — it was incredible to see the enthusiasm for theoretically principled techniques for dataset curation!

🌞 We're excited to share our "Summer of Data Seminar" series at @datologyai! We're hosting weekly sessions with brilliant researchers diving deep into pretraining, data curation, and everything that makes datasets tick. Are you data-obsessed yet? 🤓 Thread 👇
This is our exclusive focus @datologyai. Data quality is the single most underinvested area of ML research relative to its impact. We've already been able to achieve 10x efficiency gains over open-source datasets, and I'm confident there's still another 100x because there's…
Don't trust your lying eyes! The DCLM paper has a great section on the alignment between human and model quality judgments, showing that humans are pretty bad at assessing data quality: arxiv.org/pdf/2406.11794… I'm often surprised at what makes it into our good vs. rejected token piles.
I'm not 100% sure about that. As an example, I was just browsing through the DCLM-baseline datamix (which is ~SOTA) and it is *terrible* compared to what I can in principle imagine. Major concessions are made in data quality to gather enough data quantity.
It depends on how much you know about what you're using your model for. You want your data to be as similar to your test distribution as possible. In practice, benchmarks are an incomplete description of your true test distribution, so you want to hedge diversity vs.…
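For concreteness, here's a minimal sketch of the similarity-vs-diversity hedge that tweet alludes to, assuming you already have unit-normalized embeddings for candidate documents and for a sample of the target (test-like) distribution; the greedy scoring and the diversity_weight knob are illustrative assumptions, not any production curation recipe.

```python
# Illustrative sketch only -- not a description of any real curation pipeline.
import numpy as np

def select_by_target_similarity(cand_emb, target_emb, k, diversity_weight=0.3):
    """Greedily pick k candidates, trading similarity to the target distribution against diversity."""
    sim_to_target = cand_emb @ target_emb.mean(axis=0)   # proxy for "looks like the test distribution"
    selected = []
    selected_mask = np.zeros(len(cand_emb), dtype=bool)
    for _ in range(k):
        if selected:
            # Penalize candidates that are close to what we've already picked (the diversity hedge).
            redundancy = cand_emb @ cand_emb[selected].T
            penalty = diversity_weight * redundancy.max(axis=1)
        else:
            penalty = 0.0
        score = np.where(selected_mask, -np.inf, sim_to_target - penalty)
        idx = int(score.argmax())
        selected.append(idx)
        selected_mask[idx] = True
    return selected
```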
Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus were on quality, putting aside any quantity considerations. Guessing something like textbook content, in markdown? Or possibly samples from a really giant model?…
Congratulations to the @datologyai team on powering the data for AFM-4.5B by @arcee_ai - competitive with Qwen3 - using way, way less data! This is exactly why I'm so excited to be joining @datologyai this summer to push the frontier of data curation 🚀
Congrats to @LucasAtkins7 and @arcee_ai on a fantastic model release! DatologyAI powers the data behind AFM-4.5B, and we're just getting started.
🎉 Our paper “Representations Shape Weak-to-Strong Generalization” is accepted at #ICML2025! We study weak-to-strong generalization (W2SG)—a core problem in superalignment—and offer new insights into the role of models' internal representations in W2SG. 1/
1/ I'll be at #NeurIPS2024 presenting our work SmallToLarge (S2L): Data-efficient Fine-tuning of LLMs! 🚀 What’s S2L? It’s a scalable data selection method that trains a small proxy model to guide fine-tuning for larger models, reducing costs while preserving performance. 👇
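A rough sketch of what proxy-guided selection can look like, assuming the signal recorded from the small model is its per-example loss trajectory; the clustering and balanced sampling below are my own illustrative reading, not a faithful reimplementation of S2L.

```python
# Hedged sketch of proxy-model-guided data selection in the spirit of S2L;
# details here are illustrative assumptions, not the paper's exact algorithm.
import numpy as np
from sklearn.cluster import KMeans

def select_with_small_proxy(loss_trajectories, budget, n_clusters=50, seed=0):
    """loss_trajectories: (n_examples, n_checkpoints) losses recorded while training a small proxy model."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(loss_trajectories)
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)  # indices of examples to fine-tune the large model on
```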
Introducing BenchAgents: a framework for automated benchmark creation, using multiple LLM agents that interact with each other and with developers to generate diverse, high-quality, and challenging benchmarks w/ @VarunChandrase3 @neelsj @besanushi @vidhisha_b @MSFTResearch 🧵1/8
Excited to announce the release of Eureka, an open-source framework for evaluating and understanding large foundation models! 🌟 Eureka offers: 🔍In-depth analysis of 12 cutting-edge models 🧠 Multimodal & language capability testing beyond single-score reporting and rankings 📈…
Come see our poster #715 on CodeIt today at #ICML2024 13.30-15.00 Halle C. We approach ARC by self-improving LLMs with prioritized hindsight replay. @blazejmanczak @aukejw Corrado Rainone @davwzha @m_deff @TacoCohen
Excited to share that our paper “CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay” was accepted into ICML! @blazejmanczak @aukejw Corrado Rainone @davwzha @m_deff @TacoCohen 1/5
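A toy sketch of a hindsight-replay loop in that spirit; model, execute_program, and the priority heuristic are hypothetical placeholders rather than the CodeIt implementation.

```python
# Toy sketch of self-improvement via hindsight relabeling + prioritized replay;
# the model/executor interfaces below are hypothetical stand-ins.
import heapq, itertools

replay_buffer, counter = [], itertools.count()

def hindsight_step(model, execute_program, task_input, target_output):
    program = model.propose_program(task_input, target_output)   # sample a candidate program
    actual_output = execute_program(program, task_input)         # run it on the task input
    if actual_output is not None:
        # Hindsight relabeling: whatever the program actually produced becomes the
        # goal of a new, valid (input, goal, program) training example.
        priority = 1.0 if actual_output == target_output else 0.5
        heapq.heappush(replay_buffer, (-priority, next(counter), (task_input, actual_output, program)))

def sample_batch(k):
    # Prioritized replay: take the highest-priority experiences for fine-tuning.
    return [item for _, _, item in heapq.nsmallest(k, replay_buffer)]
```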
📢 Happening in 15 minutes (1pm) in Hall A8! Hurry, before all the good seats are taken 😂
🚀 Exciting News! 🚀 Join @baharanm and me for a 2-hour tutorial on Data-Efficient Learning! Learn the principles behind data curation: the secret sauce powering today’s AI revolution! ⚡️ See you at 1pm on Monday CEST in Hall A8! 🙌 🔗 More details: sjoshi804.github.io/data-efficient…
🚀 Exciting news! ConTextual (con-textual.github.io) is headed to Vienna for #ICML2024! 🎉 📊 Leaderboard updates: GPT4o-mini is now 2nd, just 1% behind GPT4o🥳 Claude-3.5-Sonnet takes 3rd, outperforming Claude-3-Opus by 19% 😲
I'll be giving a 2-hour tutorial on data-efficient learning with my PhD student @sjoshi804 on Monday July 22 at #ICML2024. Join us to learn more about this cool topic! ➡️ We can learn better from better data! ⬅️🙌🌱