DatologyAI
@datologyai
DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.
Training efficiency is hard, but getting easier to manage all the time. You can rent high-speed interconnected H100s on demand with just a credit card. The biggest single failure mode blocking people training high quality foundation models is data. But it doesn’t have to be. If…
The best open-source AI model just dropped a detailed report on how it was trained, a rare resource for students given no frontier lab is publishing! Kimi K2's estimated total cost of training is ~$20-30M, roughly in line with pricing: $0.6/M in $2.5/M out tokens. 10…
We are looking for a post-training lead at @datologyai. We have GPUs, you can make them go brrrr
We teamed up with @datologyai to build what we believe is the strongest pretraining corpus in the world—and I truly think we nailed it. Their team was absolutely key to the model’s success. We started with ~23T tokens of high-quality data and distilled it down to 6.58T through…
Congratulations to our friends and partners @arcee_ai on the release of AFM-4.5B! With data powered by @datologyai, this model outperforms Gemma3-4B and is competitive with Qwen3-4B despite being trained on a fraction of the data.
Today, we’re thrilled to unveil the @arcee_ai Foundation Models, a new family of GenAI models designed from the ground up for enterprise reality. The first release—AFM-4.5B—is a 4.5-billion-parameter frontier model that delivers excellent accuracy, strict compliance, and very…
Congrats to @LucasAtkins7 and @arcee_ai on a fantastic model release! DatologyAI powers the data behind AFM-4.5B, and we're just getting started.
Our customers needed a better base model <10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.
Datology CLIP Models DatologyAI releases two SOTA CLIP ViT-B/32 variants: classification-optimized and retrieval-optimized, achieving top results through task-specific data curation alone. Model - ViT-B/32 (86M params), OpenCLIP 2.24.0 - No architecture or training changes -…
. @datologyai is pushing the frontier, with data curation as its standout advantage. After working closely with the team over the past few months, I’ve seen their dedication, drive, and depth of expertise firsthand. Kudos and congratulations to everyone on the team.
. @datologyai is back: state of the art CLIP model performance using data curation alone 🚀 ✅ state-of-the-art ViT-B/32 performance: ImageNet 1k 76.9% vs 74% reported by SigLIP2 ✅ 8x training efficiency gains ✅ 2x inference efficiency gains ✅ Public model release Details in…
That's why you need @datologyai
every new model is a combination of two things, training algorithms and data. open-source algorithms are up-to-date. the things that work are usually fairly simple. but the data is complex, massive, gatekept, ever-changing. here open models are way behind, and probably hopeless
We couldn't agree more. If you also believe this, come work with us @datologyai to help drive frontier research and engineering in making the best training data possible.
Nobody wants to hear it, but working on data is more impactful than working on methods or architectures.
📢 Announcing our data-centric workshop at ICML 2025 on unifying data curation frameworks across domains! 📅 Deadline: May 24, AoE 🔗 Website: dataworldicml2025.github.io We have an amazing lineup of speakers + panelists from various institutions and application areas.
What an insane get for an insane team. We’ve been working with @datologyai closely and I assure you if anything they sell themselves way short. They’re the real deal.
So excited to welcome @code_star as our newest datologist! We're hiring across the board. If you're passionate about automatically curating and generating the best model training datasets, please reach out! jobs.ashbyhq.com/DatologyAI
Join us to curate the best data for ML.
I’m super excited to announce I’ve joined @datologyai I’ll be working with the research team to make the highest quality of data curation available to more than just the frontier labs. I’m also in the Bay Area now! Hit me up if you want to grab coffee or something.
Everyone should be able to train and deploy an AI model. @datologyai uses data curation to reduce training costs and make smaller models better. CEO @AriMorcos explains his team's mission in this talk with Felicis GPs @vivfaga & @AstasiaMyers. Full video:…