Matthew Leavitt
@leavittron
Chief Science Officer, Co-Founder @datologyai. Former: Head of Data Research @MosaicML; FAIR. 🧠 and 🤖 intelligence // views are from nowhere
🧵We’ve spent the last few months at @datologyai building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
There's ALWAYS a desk for you at DatologyAI.
I'd work for Matthew 🐐
You’re technically buying data curation, but I appreciate the enthusiasm!
buy compute on @PrimeIntellect and data at @datologyai
“Low value work (like data cleaning)”
Academia must be the only industry where extremely high-skilled PhD students spend much of their time doing low value work (like data cleaning). A 1st year management consultant outsources this immediately. Imagine the productivity gains if PhDs could focus on thinking
this completely confuses "value" with "status". data cleaning is unpleasant, unglamorous, high-value work. if I were grumpy, I would say "data cleaning is unquestionably higher value than 1st year phd students thinking" or more gently, "data cleaning produces a pipeline that…
“Low value work like data cleaning”
Golden rule at @datologyai is “always look at the data” (evals and model outputs count as data)
Iykyk. It’s not *just* the cleaning but actually looking at your data that provides great insight. How do you fix things otherwise? Just model and training hyperparams?
Data work is often derided like this and is consistently dismissed by large swaths of the community. This is why data research is the single most underinvested area of ML research relative to its impact. Incidentally, this is also why I believe we need an organization…
Tell me you never trained a model without telling me.
The only thing I love more than capturing value is data
It’s 2025, Data Cleaning is genuinely a multi billion dollar task
if you think data cleaning is beneath you then ngmi
This. I get all my best ideas while data cleaning/skimming through random samples.
"it's not intellectually stimulating, so it's not worth doing. i'm above working on this" is a trap that smart people seem to have a very, very easy time falling into
> low value work (like data cleaning) uh...
I couldn’t have imagined better publicity about the importance of data than all the QTs and replies to this. @datologyai to the moon 🚀
“Tiny Judges” sounds like a TLC show
if text classification was viable at 400M, you can bet that there should be a massive proliferation of sub-1B specialized “tiny judges”
We are looking for a post-training lead at @datologyai. we have gpus, you can make them go brrrr
Reflecting on a productive @icmlconf in Vancouver – a huge week for @felicis! We ran two standout events: Night 1: Happy Hour with @ElevateHQ_: The happy hour took off. Over 300 RSVPs and 120+ people came through over the course of the night across PhD programs, startups, and…