Kilian Lieret
@KLieret
Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here! 🧵⬇️
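To make the task format concrete, here is a minimal sketch of what an AlgoTune-style optimization task looks like in spirit: a reference solver sets the baseline, and the agent must produce a faster solver that still yields correct output. This is not AlgoTune's actual harness; the function names and the PCA setup are illustrative assumptions.

```python
# Illustrative sketch only: baseline_pca and fast_pca are hypothetical, not AlgoTune code.
import time
import numpy as np

def baseline_pca(X, k):
    """Reference solver: full eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]      # top-k principal directions
    return Xc @ components

def fast_pca(X, k):
    """Candidate solver an agent might try: thin SVD, skipping the covariance matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 500))

for name, solver in [("baseline", baseline_pca), ("candidate", fast_pca)]:
    t0 = time.perf_counter()
    Z = solver(X, k=10)
    print(f"{name}: {time.perf_counter() - t0:.3f}s, output shape {Z.shape}")
```

Real scoring would also check that the candidate's output matches the reference (for PCA, up to sign flips of the components) before crediting any speedup.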
We just updated the SWE-bench Multimodal leaderboard with new systems from @refact_ai, @allhands_ai and @TU_Muenchen. Congrats to all teams on pushing the state of the art! SWE-bench Multimodal challenges AI systems to fix issues that are described using screenshots.
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just as a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
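The evaluation idea can be pictured as a simple perceive-act loop: the model sees only the raw screen and replies with a button press. The sketch below is purely illustrative; capture_screen, query_vlm, and press are hypothetical stubs standing in for an emulator interface and a model API, not VideoGameBench's actual code.

```python
# Hedged sketch of a screen-in, action-out evaluation loop (all stubs are hypothetical).
from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes  # raw screen contents, e.g. a screenshot

def capture_screen() -> Frame:
    """Stub: would grab the emulator framebuffer."""
    return Frame(pixels=b"")

def query_vlm(frame: Frame, goal: str) -> str:
    """Stub: would send the screenshot and goal to a vision-language model."""
    return "RIGHT"  # the model answers with a button/key to press

def press(action: str) -> None:
    """Stub: would forward the chosen input to the emulator."""
    print(f"pressing {action}")

goal = "Reach the end of the first dungeon"
for step in range(3):  # a real run loops until the game ends or a step limit is hit
    frame = capture_screen()
    action = query_vlm(frame, goal)
    press(action)
```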
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we're open-sourcing the toolkit that made it happen: SWE-smith.
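Not SWE-smith's actual API, but a hedged sketch of the general recipe behind this kind of data synthesis: break a healthy repo with a candidate patch, keep the instance only if the existing test suite catches the breakage, and record the broken state as a training task. The function names and the output fields are assumptions for illustration.

```python
# Illustrative sketch only; not SWE-smith code.
import pathlib
import subprocess

def tests_pass(repo: pathlib.Path) -> bool:
    """Run the repo's test suite; exit code 0 means everything still passes."""
    return subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0

def make_task_instance(repo: pathlib.Path, buggy_patch: str, instance_id: str) -> dict | None:
    """Apply a candidate bug-introducing patch; keep it only if tests now fail."""
    subprocess.run(["git", "apply", "-"], cwd=repo, input=buggy_patch.encode(), check=True)
    try:
        if tests_pass(repo):
            return None  # the patch broke nothing observable, so it makes a poor task
        return {"instance_id": instance_id, "patch": buggy_patch}  # fields are illustrative
    finally:
        # Restore the repo to its clean state before trying the next candidate patch.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
```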
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop! 🧵
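Since the split shares the SWE-bench format, instances can be pulled the same way as the other SWE-bench datasets. A minimal sketch below; the HuggingFace dataset id is an assumption, so check the SWE-bench docs for the exact name.

```python
from datasets import load_dataset

# Dataset id below is assumed; verify the exact name on the HuggingFace hub.
ds = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")
print(len(ds), "instances")

example = ds[0]
print(example["repo"], example["instance_id"])   # standard SWE-bench fields
print(example["problem_statement"][:200])        # the issue text an agent must resolve
```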
Had a great time talking about building agents, SWE-agent, SWE-bench, and more
📢 NEW #DataBrew Episode! In this episode, Kilian Lieret (Research Software Engineer) & Carlos Jimenez (Computer Science PhD Candidate) at @Princeton dive into SWE-bench & SWE-agent, two cutting-edge tools for evaluating & enhancing AI in software engineering.