Edward Z. Yang
@ezyang
I work on PyTorch at Meta. Chatty alt at @difficultyang. Currently on parental leave and doing a lot of AI coding, including authoring codemcp.
I finally sat down and wrote a post-mortem for vibe coding ScubaDuck. It's aimed at those of you who have never tried vibe coding (in its original sense: AI coding without reviewing the code the AI generated).

It's worth making sure that collecting and joining/slicing all your runs' metrics and hparams post hoc is very low friction.
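A minimal sketch of what "low friction" can look like, assuming a hypothetical layout where each run writes a flat runs/<run_id>/summary.json containing both hparams and final metrics:

```python
# Sketch: collect every run's summary into one table, then slice freely.
# The runs/<run_id>/summary.json layout is an assumption, not a standard.
import json
from pathlib import Path

import pandas as pd


def collect_runs(root: str = "runs") -> pd.DataFrame:
    """Load every runs/<run_id>/summary.json into one flat DataFrame."""
    records = []
    for summary in Path(root).glob("*/summary.json"):
        rec = json.loads(summary.read_text())
        rec["run_id"] = summary.parent.name
        records.append(rec)
    return pd.DataFrame(records)


if __name__ == "__main__":
    df = collect_runs()
    # Slice post hoc: e.g. best validation loss per learning rate.
    print(df.groupby("lr")["val_loss"].min())
```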
Saving failed batches so you can see what the hell went wrong
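One way to do this in PyTorch (the HF-style `model(**batch).loss` call is an assumption for illustration): dump the exact batch to disk before re-raising, so it can be inspected or replayed later.

```python
# Sketch: if a training step blows up, save the offending batch to disk.
import os

import torch


def train_step(model, batch, step, dump_dir="failed_batches"):
    try:
        loss = model(**batch).loss  # assumes an HF-style output object
        loss.backward()
        return loss
    except Exception:
        os.makedirs(dump_dir, exist_ok=True)
        torch.save(batch, os.path.join(dump_dir, f"step_{step:08d}.pt"))
        raise
```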
Especially for fine-tuning, start with a config file of hyperparameters like "Llama-3.2-1B_v00.01.yaml" and note in your docs what your thought process was and what you found. Then copy this to v00.02 and again note the changes, motivations, and results. Keep a table of versions vs. metrics.
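A sketch of that loop, using the filename from the tweet; the field names and metric values are placeholders:

```python
# Sketch: load a versioned config, run, then append a row to a running
# table of versions vs. metrics.
import csv
from pathlib import Path

import yaml  # pip install pyyaml


def load_config(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())


def record_result(version: str, metrics: dict,
                  table: str = "versions_vs_metrics.csv") -> None:
    exists = Path(table).exists()
    with open(table, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["version", *metrics.keys()])
        if not exists:
            writer.writeheader()
        writer.writerow({"version": version, **metrics})


if __name__ == "__main__":
    cfg = load_config("Llama-3.2-1B_v00.01.yaml")
    # ... run fine-tuning with cfg ...
    record_result("v00.01", {"val_loss": 1.23, "notes": "baseline LR"})  # placeholders
```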
Baselining against simple models exposes whether complexity adds value – wish I’d done this before wasting months on overengineered clinical predictors.
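A minimal baselining sketch with scikit-learn on synthetic data; swap in your own dataset and your complex model as a third entry:

```python
# Sketch: check whether anything fancier actually beats a majority-class
# dummy and a plain logistic regression before investing in a complex model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, model in [
    ("majority class", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression(max_iter=1000)),
]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC {score:.3f}")
```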
On top of the great suggestions by others, what I personally do is keep a "linked" journal. So if I was doing A and then, in the middle of it, observed I could do B because of x1, x2, ..., I just create a note linking the two. Using @obsdmd, you get a nice 'story'!
1. Train with a fixed seed in the beginning (see the seeding sketch after this list).
2. Keep a small README where you maintain semi-detailed notes about each run.
3. On Monday you won't remember anything from Friday. Start with this premise and document.
4. Start with multi-node-capable code.
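A sketch of item 1 for PyTorch; the determinism flag at the end is optional and can slow things down:

```python
# Sketch: fix every seed you rely on at the top of the run.
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism where supported.
    torch.use_deterministic_algorithms(True, warn_only=True)


seed_everything(1234)
```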
Consistent naming saves way more time than you'd think. Pretty sure I’ll never remember what final_final_actually_this_one was 🤣.
Save full logs and config files alongside each experiment's results. Nothing more frustrating than a good set of experiments you can't reproduce because of one missing detail.
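One possible shape for this (directory layout and filenames are assumptions): snapshot the config, the git commit, and the installed packages next to where the results will be written.

```python
# Sketch: capture "that one missing detail" at the start of every run.
import shutil
import subprocess
from pathlib import Path


def snapshot_run(run_dir: str, config_path: str) -> None:
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    shutil.copy(config_path, out / "config.yaml")
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    (out / "git_commit.txt").write_text(commit + "\n")
    freeze = subprocess.run(
        ["pip", "freeze"], capture_output=True, text=True
    ).stdout
    (out / "requirements.lock").write_text(freeze)


if __name__ == "__main__":
    snapshot_run("runs/exp_0042", "config.yaml")  # illustrative paths
```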
Hydra is such a godsend; I wish it had been available when I started out (minimal example below).
• Every run is logged with a full config snapshot.
• Supports multirun for ablation studies across hyperparameters.
• Parallel coordinate plots give a strong visual feel for key parameters.
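A minimal Hydra script along those lines (the config path and field names are assumptions); `python train.py -m lr=1e-3,1e-4` runs a multirun sweep, and each run's resolved config is snapshotted in its output directory:

```python
# train.py — minimal Hydra sketch; expects conf/config.yaml to exist.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # full config snapshot for this run
    # ... train with cfg.lr, cfg.batch_size, ...


if __name__ == "__main__":
    main()
```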
- Always use a config, do not use args.
- Spend time learning wandb or similar; it automatically saves code/configs/models (sketch below).
- Name the config with <#>_<active_change>_<#+1> (example: 22_LR_001_23.yaml, 22_LR_0001_24.yml).
- Storage is cheap; copy your data rather than modifying it in place.
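A small sketch of the wandb point; the project name uses one of the config filenames above, and the logged loss is a placeholder:

```python
# Sketch: initialize wandb from a config dict so code, config, and metrics
# are captured with each run.
import wandb
import yaml  # pip install pyyaml

with open("22_LR_001_23.yaml") as f:
    cfg = yaml.safe_load(f)

run = wandb.init(project="small-training-jobs", config=cfg, save_code=True)
for step in range(100):
    # ... training step ...
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)  # placeholder metric
run.finish()
```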
Always build the abstraction/interface/tool/database. Do not do it ad hoc, do not make one-off scripts or bespoke storage, do not do things only you will remember.
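A sketch of what "build the tool, not the one-off script" can mean in practice, here as a tiny SQLite registry with an assumed schema that every run writes to:

```python
# Sketch: one shared, queryable registry instead of results scattered
# across ad-hoc files only you can decode.
import json
import sqlite3
import time


def get_db(path: str = "experiments.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               run_id TEXT PRIMARY KEY,
               started_at REAL,
               config_json TEXT,
               metrics_json TEXT
           )"""
    )
    return db


def register_run(db, run_id: str, config: dict, metrics: dict) -> None:
    db.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?)",
        (run_id, time.time(), json.dumps(config), json.dumps(metrics)),
    )
    db.commit()


if __name__ == "__main__":
    db = get_db()
    register_run(db, "exp_0042", {"lr": 3e-4}, {"val_loss": 0.91})  # placeholders
```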
Save your inputs and outputs exactly as they were, along with any config. Reproducibility is hard, and especially hard in a pipeline with multiple non-deterministic AIs.
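For the multi-AI pipeline case, one hedged sketch (field names are assumptions): append every call's exact inputs, raw outputs, and sampling parameters to a JSONL file, since the calls themselves are not reproducible.

```python
# Sketch: append-only log of exact model inputs/outputs per call.
import json
import time


def log_call(path, *, model, params, prompt, response):
    record = {
        "ts": time.time(),
        "model": model,
        "params": params,       # temperature, top_p, seed, ...
        "prompt": prompt,       # exact input, not a template
        "response": response,   # raw output, before any post-processing
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


log_call(
    "calls.jsonl",
    model="some-llm",                   # illustrative values
    params={"temperature": 0.7},
    prompt="Summarize: ...",
    response="...",
)
```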
model reproducibility requires data reproducibility 🥲
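One small way to pin the data (paths are illustrative): hash the training files and record the digest with the run, so "same code, same config" also means "same data".

```python
# Sketch: fingerprint a dataset directory so a run records exactly
# which data it saw.
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()


if __name__ == "__main__":
    print(dataset_fingerprint("data/train"))  # illustrative path
```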
standardizing and automating evaluations (across the team) can be tiring but is always worth it; make a separate config seed for every bit of randomness you introduce
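A sketch of the separate-seed idea; the field names are assumptions, and each RNG is passed only to the component it belongs to, so you can hold (say) the data split fixed while re-rolling initialization.

```python
# Sketch: one named seed per source of randomness.
from dataclasses import dataclass

import numpy as np


@dataclass
class Seeds:
    data_split: int = 0
    model_init: int = 1
    augmentation: int = 2
    eval_sampling: int = 3


seeds = Seeds()
split_rng = np.random.default_rng(seeds.data_split)
init_rng = np.random.default_rng(seeds.model_init)
# ... hand each RNG only to the component that owns that randomness ...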
People who run lots of small training jobs for your day job, what is one thing about experiment management / hygiene that you wish you knew when you started out?
Been spending some time with the GSPMD paper recently. It's funny seeing all the work on making convolution work; 2021 truly was a different era