Hadi Khalaf
@hskhalaf
phd agent @Harvard, working on alignment, prev @msfea_aub @HarvardEcon
How can we improve LLMs without any additional training? 🤔 The standard playbook is Best-of-N: generate N responses ➡️ use a reward model to score them ➡️ pick the best 🏆 More responses = better results... right? Well, not exactly. You might be reward hacking!…
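A minimal sketch of that Best-of-N loop, assuming hypothetical `generate` and `score` helpers rather than any particular library's API:

```python
# Best-of-N sketch: sample N candidates, score each with a reward model,
# return the highest-scoring one. `generate` and `score` are hypothetical
# stand-ins supplied by the caller.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # draws one sampled response
    score: Callable[[str, str], float],  # reward model: (prompt, response) -> scalar
    n: int = 16,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```

With an imperfect reward model, taking the argmax over more samples makes it more likely you pick a response the reward model overrates rather than one that is genuinely better, which is the reward-hacking concern raised above.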
@ whoever is on the google ai studio team: please fix the chat history never being saved! i cannot access most of my gemini conversations... and this has been an issue since january 🫤
I wrote a fun little article about all the ways to dodge the need for real-world robot data. I think it has a cute title. sergeylevine.substack.com/p/sporks-of-agi
It is critical for scientific integrity that we trust our measure of progress. The @lmarena_ai has become the go-to evaluation for AI progress. Our release today demonstrates the difficulty in maintaining fair evaluations on @lmarena_ai, despite best intentions.
Ever wonder how LLM developers choose their pretraining data? It’s not guesswork — all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
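A rough illustration of that small-scale-experiment workflow (not the DataDecide code; `train_small_model` and `evaluate` are hypothetical placeholders): train a small proxy model on each candidate dataset, evaluate it on a benchmark suite, and rank datasets by mean proxy score.

```python
# Illustrative sketch of choosing pretraining data via small-scale proxies.
from statistics import mean

def rank_datasets(candidate_datasets, benchmarks, train_small_model, evaluate):
    results = {}
    for name, data in candidate_datasets.items():
        proxy = train_small_model(data)                     # small proxy model trained on this dataset
        results[name] = mean(evaluate(proxy, b) for b in benchmarks)
    # Higher mean proxy score -> better bet for large-scale pretraining
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```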
Does anyone like arxiv html? I immediately switch to the pdf view
Yes 👍🏼
day in the life of an AI PhD in 2025
> wake up
> new research idea (5 minutes)
> kick off related work search w/ Deep Research (15 minutes)
> set up 4 instances of Claude Code to start project (30 minutes)
> get o1 started on proof for paper (5 minutes)
> play tennis (6 hours)
On my reading list this week: "the first theoretical result on how to identify the ideal depth for safety alignment... indicating that broader ensembles can compensate for shallower alignments"!!!! arxiv.org/abs/2502.00669
Is there an LLM out there that asks follow-up questions? 😅 Would be my go-to if it exists
I used to see llama as a base model in most experiments, now qwen has taken over. Diversity in base models in experiments is much much more valuable than any hyperparam tuning or extra runs!