Joe Melkonian
@joemelko
tqdm is all you need
Meet DAUNCE🕺: the first method to trace training-data influence inside *proprietary* LLMs (yes, GPT-4o). Full breakdown in @XP_research’s thread - feedback welcome!
We perform training data attribution on proprietary LLMs (OpenAI’s GPT-4) 🔥 — no gradients, no internals, just black-box access ✅. 🧠 How? By using covariance of loss under uncertainty. It’s simple, scalable, theoretically grounded. And it works.
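For the curious, a minimal numpy sketch of the core idea as I read the thread (not the authors' code): score each candidate training example by the covariance, across K stochastically perturbed model variants, between its loss and the query's loss. All names here are my own placeholders.

```python
import numpy as np

def influence_scores(train_losses: np.ndarray, query_losses: np.ndarray) -> np.ndarray:
    """train_losses: (K, N) losses of N candidate examples under K perturbed
    models; query_losses: (K,) loss of the test query under the same models.
    Returns (N,) covariance-based influence scores."""
    t = train_losses - train_losses.mean(axis=0, keepdims=True)  # center over variants
    q = query_losses - query_losses.mean()
    return (t * q[:, None]).mean(axis=0)  # Cov(loss_i, loss_query) per example

# Black-box friendly: each model variant only needs to return token
# log-probs (e.g. an API logprobs field), from which losses are computed.
```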
Any paper / blog looking at how the Chinchilla ratio is influenced by the optimizer? Seems intuitive that Muon would bend towards data, since it makes more efficient use of the parameters it has, just like how data filtering typically bends towards parameters.
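To make the intuition concrete, a hedged numeric sketch: the parametric loss L(N, D) = E + A/N^alpha + B/D^beta and C ≈ 6ND are from the Chinchilla paper (constants below are their Approach-3 fit); the idea that Muon would show up as a different alpha is purely my assumption.

```python
import numpy as np

def optimal_tokens_per_param(C, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    N = np.logspace(7, 12, 4000)       # candidate parameter counts
    D = C / (6 * N)                    # tokens implied by the compute budget
    L = E + A / N**alpha + B / D**beta
    i = np.argmin(L)                   # loss-minimizing split at fixed compute
    return D[i] / N[i]

C = 1e23
print(optimal_tokens_per_param(C))              # baseline fit
print(optimal_tokens_per_param(C, alpha=0.38))  # hypothetical param-efficient optimizer
# raising alpha (each parameter buys more loss reduction) moves the
# compute-optimal split toward data, i.e. more tokens per parameter
```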
completely agree. felt like the takeaway should be you can squeeze a lot more out of the highest quality data in your corpus, not that diffusion > autoregressive
AKA data augmentation. The numbers actually match my experience exactly. This is something I think LLM people will slowly rediscover from vision people. Not sure how they can write up the whole paper and not once think of running the AR baseline with augmentation or dropout?
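Since that's exactly the missing baseline, a hedged sketch of the simplest vision-style augmentation for an AR LM: corrupt the inputs, keep the prediction targets clean. The 10% token-dropout rate and the masking scheme are my choices, not anything from the paper.

```python
import torch
import torch.nn.functional as F

def token_dropout(input_ids: torch.Tensor, mask_id: int, p: float = 0.1) -> torch.Tensor:
    """Replace a random fraction p of input tokens with a mask/unk id."""
    drop = torch.rand(input_ids.shape, device=input_ids.device) < p
    return torch.where(drop, torch.full_like(input_ids, mask_id), input_ids)

# one training step: the model sees corrupted context but predicts clean text
def step(model, batch, mask_id):
    logits = model(token_dropout(batch, mask_id)).logits
    return F.cross_entropy(logits[:, :-1].transpose(1, 2), batch[:, 1:])
```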
Probably will be one of the more useful open-source releases in a while. If you don’t have the compute to pre-train, you can now still do extensive analysis of how models evolve during the phases of training. @huggingface W
We've just release 100+ intermediate checkpoints and our training logs from SmolLM3-3B training. We hope this can be useful to the researcher working on mech interpret, training dynamics, RL and other topics :) Training logs: -> Usual training loss (the gap in the loss are due…
My best guess is the internal IMO model uses a series of forks and joins instead of a single-model CoT, since each improvement seems to embrace the most useful way people use these models
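Pure speculation made concrete: a generic fork/join pattern over any chat-completion client. `complete(prompt) -> str` is a placeholder, not anything OpenAI has described.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join(problem: str, complete, k: int = 8) -> str:
    # fork: k independent CoT attempts at the same problem
    with ThreadPoolExecutor(max_workers=k) as pool:
        drafts = list(pool.map(lambda _: complete(problem), range(k)))
    # join: one call reads all attempts and reconciles them
    merged = "\n\n---\n\n".join(drafts)
    return complete(f"Here are {k} independent attempts:\n{merged}\n\n"
                    "Reconcile them and give the single best solution.")
```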
seems like o3-pro ends up in doom-loops far less often than vanilla o3. would still love a native conversation-fork feature though @willdepue @aidan_mclau @jam3scampbell
example A: "The Triangle Inequality Theorem: In ΔTAB (Figure 1), if T, A, and B represent three points on a map and you want to go from T to B, going from T to A to B would obviously be longer than going directly from T to B. The following theorem expresses this idea. Figure 1…
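In symbols, the claim the excerpt is building up to:

```latex
% triangle inequality for triangle TAB: the detour through A is longer
\[ TA + AB > TB \]
```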
there is a ton of post-training quality data in pre-training if you look hard enough
If you are attending #ICML2025, check out our DataWorld workshop on Sat July 19. We have updated the website with more info on speakers & accepted papers! dataworldicml2025.github.io Also happy to chat offline about all things ✨ data ✨