Siddharth Joshi
@sjoshi804
ML PhD at @UCLA under @baharanm | Data Curation for Efficient & Robust SSL @datologyai | Prev @MSFTResearch, @Cisco Research, @Microsoft
📢Excited to share the recording of our #ICML2024 Tutorial on Foundations of Data-Efficient Learning: youtu.be/30VkdWuwmdA Truly grateful to everyone who attended — it was incredible to see the enthusiasm for theoretically principled techniques for dataset curation!

🌞 We're excited to share our "Summer of Data Seminar" series at @datologyai! We're hosting weekly sessions with brilliant researchers diving deep into pretraining, data curation, and everything that makes datasets tick. Are you data-obsessed yet? 🤓 Thread 👇
This is our exclusive focus @datologyai. Data quality is the single most underinvested area of ML research relative to its impact. We've already been able to achieve 10x efficiency gains over open-source datasets, and I'm confident there's still another 100x because there's…
Don't trust your lying eyes! The DCLM paper has a great section on the alignment between human and model quality judgments, showing that humans are pretty bad at assessing data quality: arxiv.org/pdf/2406.11794… I'm often surprised at what makes it into our good vs. rejected token piles.
I'm not 100% sure about that. As an example, I was just browsing through the DCLM-baseline datamix (which is ~SOTA) and it is *terrible* compared to what I can in principle imagine. Major concessions are made in data quality to gather enough data quantity.
It depends on how much you know about what you're using your model for. You want your data to be as similar to your test distribution as possible. In practice, benchmarks are an incomplete description of your true test distribution, so you want to hedge diversity vs.…
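For concreteness, here's a minimal sketch of the similarity-vs-diversity hedge that tweet alludes to, assuming you already have unit-normalized embeddings for candidate documents and for a sample of the target (test-like) distribution; the greedy scoring and the diversity_weight knob are illustrative assumptions, not any production curation recipe.

```python
# Illustrative sketch only -- not a description of any real curation pipeline.
import numpy as np

def select_by_target_similarity(cand_emb, target_emb, k, diversity_weight=0.3):
    """Greedily pick k candidates, trading similarity to the target distribution against diversity."""
    sim_to_target = cand_emb @ target_emb.mean(axis=0)   # proxy for "looks like the test distribution"
    selected = []
    selected_mask = np.zeros(len(cand_emb), dtype=bool)
    for _ in range(k):
        if selected:
            # Penalize candidates that are close to what we've already picked (the diversity hedge).
            redundancy = cand_emb @ cand_emb[selected].T
            penalty = diversity_weight * redundancy.max(axis=1)
        else:
            penalty = 0.0
        score = np.where(selected_mask, -np.inf, sim_to_target - penalty)
        idx = int(score.argmax())
        selected.append(idx)
        selected_mask[idx] = True
    return selected
```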
Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus were on quality, putting aside any quantity considerations. Guessing something like textbook content, in markdown? Or possibly samples from a really giant model?…
Congratulations to the @datologyai team on powering the data for AFM-4.5B by @arcee_ai - competitive with Qwen3 - using way, way less data! This is exactly why I'm so excited to be joining @datologyai this summer to push the frontier of data curation 🚀
Congrats to @LucasAtkins7 and @arcee_ai on a fantastic model release! DatologyAI powers the data behind AFM-4.5B, and we're just getting started.
🎉 Our paper “Representations Shape Weak-to-Strong Generalization” is accepted at #ICML2025! We study weak-to-strong generalization (W2SG)—a core problem in superalignment—and offer new insights into the role of models' internal representations in W2SG. 1/
1/ I'll be at #NeurIPS2024 presenting our work SmallToLarge (S2L): Data-efficient Fine-tuning of LLMs! 🚀 What’s S2L? It’s a scalable data selection method that trains a small proxy model to guide fine-tuning for larger models, reducing costs while preserving performance. 👇
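A rough sketch of what proxy-guided selection can look like, assuming the signal recorded from the small model is its per-example loss trajectory; the clustering and balanced sampling below are my own illustrative reading, not a faithful reimplementation of S2L.

```python
# Hedged sketch of proxy-model-guided data selection in the spirit of S2L;
# details here are illustrative assumptions, not the paper's exact algorithm.
import numpy as np
from sklearn.cluster import KMeans

def select_with_small_proxy(loss_trajectories, budget, n_clusters=50, seed=0):
    """loss_trajectories: (n_examples, n_checkpoints) losses recorded while training a small proxy model."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(loss_trajectories)
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)  # indices of examples to fine-tune the large model on
```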
Introducing BenchAgents: a framework for automated benchmark creation, using multiple LLM agents that interact with each other and with developers to generate diverse, high-quality, and challenging benchmarks w/ @VarunChandrase3 @neelsj @besanushi @vidhisha_b @MSFTResearch 🧵1/8
Excited to announce the release of Eureka, an open-source framework for evaluating and understanding large foundation models! 🌟 Eureka offers: 🔍In-depth analysis of 12 cutting-edge models 🧠 Multimodal & language capability testing beyond single-score reporting and rankings 📈…
Come see our poster #715 on CodeIt today at #ICML2024 13.30-15.00 Halle C. We approach ARC by self-improving LLMs with prioritized hindsight replay. @blazejmanczak @aukejw Corrado Rainone @davwzha @m_deff @TacoCohen
Excited to share that our paper “CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay” was accepted into ICML! @blazejmanczak @aukejw Corrado Rainone @davwzha @m_deff @TacoCohen 1/5
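A toy sketch of a hindsight-replay loop in that spirit; model, execute_program, and the priority heuristic are hypothetical placeholders rather than the CodeIt implementation.

```python
# Toy sketch of self-improvement via hindsight relabeling + prioritized replay;
# the model/executor interfaces below are hypothetical stand-ins.
import heapq, itertools

replay_buffer, counter = [], itertools.count()

def hindsight_step(model, execute_program, task_input, target_output):
    program = model.propose_program(task_input, target_output)   # sample a candidate program
    actual_output = execute_program(program, task_input)         # run it on the task input
    if actual_output is not None:
        # Hindsight relabeling: whatever the program actually produced becomes the
        # goal of a new, valid (input, goal, program) training example.
        priority = 1.0 if actual_output == target_output else 0.5
        heapq.heappush(replay_buffer, (-priority, next(counter), (task_input, actual_output, program)))

def sample_batch(k):
    # Prioritized replay: take the highest-priority experiences for fine-tuning.
    return [item for _, _, item in heapq.nsmallest(k, replay_buffer)]
```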
📢 Happening in 15 minutes (1pm) in Hall A8! Hurry, before all the good seats are taken 😂
🚀 Exciting News! 🚀 Join @baharanm and me for a 2-hour tutorial on Data-Efficient Learning! Learn the principles behind data curation: the secret sauce powering today’s AI revolution! ⚡️ See you at 1pm on Monday CEST in Hall A8! 🙌 🔗 More details: sjoshi804.github.io/data-efficient…
🚀 Exciting news! ConTextual (con-textual.github.io) is headed to Vienna for #ICML2024! 🎉 📊 Leaderboard updates: GPT4o-mini is now 2nd, just 1% behind GPT4o🥳 Claude-3.5-Sonnet takes 3rd, outperforming Claude-3-Opus by 19% 😲
I'll be giving a 2-hour tutorial on data-efficient learning with my PhD student @sjoshi804 on Monday July 22 at #ICML2024. Join us to learn more about this cool topic! ➡️ We can learn better from better data! ⬅️🙌🌱