Yukyung Lee
@yukyunglee_
Postdoc at Boston University 🇺🇸 | PhD at Korea University 🇰🇷 | #nlproc | Prev: intern at NAVER, HUFS
Can coding agents autonomously implement AI research extensions? We introduce RExBench, a benchmark that tests whether a coding agent can implement a novel experiment based on existing research and code. Finding: most agents we tested had a low success rate, but there is promise!

📣 Calling #NeurIPS2024 participants 📣 While everyone enjoys the last day in beautiful Vancouver 🏔️🇨🇦, consider joining our initiative and contributing to building models with more inclusivity and diversity, and mitigating implicit and explicit bias. ‼️Cuz we are in the…
⭐️ In Feb 2025, we're launching Grassroots Science, an ambitious, year-long, massive-scale, fully open-source initiative aimed at developing multilingual LLMs aligned with diverse and inclusive human preferences. 🌐 Check our website: grassroots.science and be sure to…
Check out @yulu_qin and @dhevarghese's paper to see how VLMs are better at deploying taxonomic knowledge than their LM counterparts 🤖❤️🪄
Does vision training change how language is represented and used in meaningful ways?🤔 The answer is a nuanced yes! Comparing VLM-LM minimal pairs, we find that while the taxonomic organization of the lexicon is similar, VLMs are better at _deploying_ this knowledge. [1/9]
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
wow amazing demo 😲🎶
What role can AI play in unlocking musical creativity? At #CHI2025 (🥇 Best Paper Award 🥇), we present Amuse: a songwriting tool for musicians that turns photos, sounds, and stories into chords 🎶 📄 arxiv.org/abs/2412.18940 🎧 yewon-kim.com/amuse/
cooking something related but complementary... hopefully will be out very soon!
We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework. Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
🚨 Just Out Can LLMs extract experimental data about themselves from scientific literature to improve understanding of their behavior? We propose a semi-automated approach for large-scale, continuously updatable meta-analysis to uncover intriguing behaviors in frontier LLMs. 🧵
If you're working on an ACL paper and citing any papers with extremely long author lists, consider updating your acl_natbib.bst file, which truncates these lists at 20 authors. github.com/acl-org/acl-st…
RT appreciated! 🙏 ACL 2025 Ling theory & Cognitive modeling track is looking for emergency reviewers. The emergency review period is between 3/18-26, and these reviewers will be excluded from the ARR cycle. If you're interested, please sign up here! docs.google.com/forms/d/1fH7kD…
so proud of @HayleyRossLing for getting a best paper award at @GenBench this year!! 🎉🪅🎉 I'm sure @TeaAnd_OrCoffee would be too :) check out our paper and share if you think homemade cats are cats!
New paper with @najoungkim and @TeaAnd_OrCoffee testing if LLMs can draw adjective-noun inferences like humans! Turns out they often can, and even generalize to unseen combinations. But they're more optimistic about "artificial intelligence" than humans. arxiv.org/abs/2410.17482
It’s the season of “finding ur next adventure” and folks have been asking me about my experience doing a postdoc, so I did a blog post/video w/ @srush_nlp 's help! Should I do a postdoc? What is a postdoc anyway? youtu.be/O7psT1ftwqs Blog: homes.cs.washington.edu/~niloofar/blog…