Joe Melkonian
@joemelko
tqdm is all you need
Meet DAUNCE🕺: the first method to trace training-data influence inside *proprietary* LLMs (yes, GPT-4o). Full breakdown in @XP_research’s thread - feedback welcome!
We perform training data attribution on proprietary LLMs (OpenAI’s GPT-4) 🔥 — no gradients, no internals, just black-box access ✅. 🧠 How? By using covariance of loss under uncertainty. It’s simple, scalable, theoretically grounded. And it works.
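For the curious, a minimal numpy sketch of the core idea as I read the thread (not the authors' code): score each candidate training example by the covariance, across K stochastically perturbed model variants, between its loss and the query's loss. All names here are my own placeholders.

```python
import numpy as np

def influence_scores(train_losses: np.ndarray, query_losses: np.ndarray) -> np.ndarray:
    """train_losses: (K, N) losses of N candidate examples under K perturbed
    models; query_losses: (K,) loss of the test query under the same models.
    Returns (N,) covariance-based influence scores."""
    t = train_losses - train_losses.mean(axis=0, keepdims=True)  # center over variants
    q = query_losses - query_losses.mean()
    return (t * q[:, None]).mean(axis=0)  # Cov(loss_i, loss_query) per example

# Black-box friendly: each model variant only needs to return token
# log-probs (e.g. an API logprobs field), from which losses are computed.
```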
Any paper / blog looking at how the Chinchilla ratio is influenced by the optimizer? Seems intuitive that Muon would bend towards data, since it makes more efficient use of the parameters it has, just like how data filtering typically bends towards parameters.
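To make the intuition concrete, a hedged numeric sketch: the parametric loss L(N, D) = E + A/N^alpha + B/D^beta and C ≈ 6ND are from the Chinchilla paper (constants below are their Approach-3 fit); the idea that Muon would show up as a different alpha is purely my assumption.

```python
import numpy as np

def optimal_tokens_per_param(C, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    N = np.logspace(7, 12, 4000)       # candidate parameter counts
    D = C / (6 * N)                    # tokens implied by the compute budget
    L = E + A / N**alpha + B / D**beta
    i = np.argmin(L)                   # loss-minimizing split at fixed compute
    return D[i] / N[i]

C = 1e23
print(optimal_tokens_per_param(C))              # baseline fit
print(optimal_tokens_per_param(C, alpha=0.38))  # hypothetical param-efficient optimizer
# raising alpha (each parameter buys more loss reduction) moves the
# compute-optimal split toward data, i.e. more tokens per parameter
```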
completely agree. felt like the takeaway should be you can squeeze a lot more out of the highest quality data in your corpus, not that diffusion > autoregressive
AKA data augmentation. The numbers actually match my experience exactly. This is something I think LLM people will slowly rediscover from vision people. Not sure how they can write up the whole paper and not once think of running the AR baseline with augmentation or dropout?
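Since that's exactly the missing baseline, a hedged sketch of the simplest vision-style augmentation for an AR LM: corrupt the inputs, keep the prediction targets clean. The 10% token-dropout rate and the masking scheme are my choices, not anything from the paper.

```python
import torch
import torch.nn.functional as F

def token_dropout(input_ids: torch.Tensor, mask_id: int, p: float = 0.1) -> torch.Tensor:
    """Replace a random fraction p of input tokens with a mask/unk id."""
    drop = torch.rand(input_ids.shape, device=input_ids.device) < p
    return torch.where(drop, torch.full_like(input_ids, mask_id), input_ids)

# one training step: the model sees corrupted context but predicts clean text
def step(model, batch, mask_id):
    logits = model(token_dropout(batch, mask_id)).logits
    return F.cross_entropy(logits[:, :-1].transpose(1, 2), batch[:, 1:])
```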
Probably will be one of the more useful open-source releases in a while. If you don’t have the compute to pre-train, you can now still do extensive analysis of how models evolve during the phases of training. @huggingface W
We've just release 100+ intermediate checkpoints and our training logs from SmolLM3-3B training. We hope this can be useful to the researcher working on mech interpret, training dynamics, RL and other topics :) Training logs: -> Usual training loss (the gap in the loss are due…
My best guess is the internal IMO model uses a series of forks and joins instead of a single-model CoT, since each improvement seems to embrace the most useful way people use these models
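Pure speculation made concrete: a generic fork/join pattern over any chat-completion client. `complete(prompt) -> str` is a placeholder, not anything OpenAI has described.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join(problem: str, complete, k: int = 8) -> str:
    # fork: k independent CoT attempts at the same problem
    with ThreadPoolExecutor(max_workers=k) as pool:
        drafts = list(pool.map(lambda _: complete(problem), range(k)))
    # join: one call reads all attempts and reconciles them
    merged = "\n\n---\n\n".join(drafts)
    return complete(f"Here are {k} independent attempts:\n{merged}\n\n"
                    "Reconcile them and give the single best solution.")
```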
seems like o3-pro ends up in doom-loops far less often than vanilla o3. would still love a native conversation-fork feature though @willdepue @aidan_mclau @jam3scampbell
example A: "The Triangle Inequality Theorem: In ΔTAB (Figure 1), if T, A, and B represent three points on a map and you want to go from T to B, going from T to A to B would obviously be longer than going directly from T to B. The following theorem expresses this idea. Figure 1…
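In symbols, the claim the excerpt is building up to:

```latex
% triangle inequality for triangle TAB: the detour through A is longer
\[ TA + AB > TB \]
```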
there is a ton of post-training quality data in pre-training if you look hard enough
If you are attending #ICML2025, check out our DataWorld workshop on Sat July 19. We have updated the website with more info on speakers & accepted papers! dataworldicml2025.github.io Also happy to chat offline about all things ✨ data ✨