Kilian Lieret
@KLieret
Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here! 🧵⬇️
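To make the task format concrete, here is a minimal sketch of what an AlgoTune-style optimization task looks like in spirit: a reference solver sets the baseline, and the agent must produce a faster solver that still yields correct output. This is not AlgoTune's actual harness; the function names and the PCA setup are illustrative assumptions.

```python
# Illustrative sketch only: baseline_pca and fast_pca are hypothetical, not AlgoTune code.
import time
import numpy as np

def baseline_pca(X, k):
    """Reference solver: full eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]      # top-k principal directions
    return Xc @ components

def fast_pca(X, k):
    """Candidate solver an agent might try: thin SVD, skipping the covariance matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 500))

for name, solver in [("baseline", baseline_pca), ("candidate", fast_pca)]:
    t0 = time.perf_counter()
    Z = solver(X, k=10)
    print(f"{name}: {time.perf_counter() - t0:.3f}s, output shape {Z.shape}")
```

Real scoring would also check that the candidate's output matches the reference (for PCA, up to sign flips of the components) before crediting any speedup.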
We just updated the SWE-bench Multimodal leaderboard with new systems from @refact_ai, @allhands_ai and @TU_Muenchen. Congrats to all teams on pushing the state of the art! SWE-bench Multimodal challenges AI systems to fix issues that are described using screenshots.
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just as a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
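The evaluation idea can be pictured as a simple perceive-act loop: the model sees only the raw screen and replies with a button press. The sketch below is purely illustrative; capture_screen, query_vlm, and press are hypothetical stubs standing in for an emulator interface and a model API, not VideoGameBench's actual code.

```python
# Hedged sketch of a screen-in, action-out evaluation loop (all stubs are hypothetical).
from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes  # raw screen contents, e.g. a screenshot

def capture_screen() -> Frame:
    """Stub: would grab the emulator framebuffer."""
    return Frame(pixels=b"")

def query_vlm(frame: Frame, goal: str) -> str:
    """Stub: would send the screenshot and goal to a vision-language model."""
    return "RIGHT"  # the model answers with a button/key to press

def press(action: str) -> None:
    """Stub: would forward the chosen input to the emulator."""
    print(f"pressing {action}")

goal = "Reach the end of the first dungeon"
for step in range(3):  # a real run loops until the game ends or a step limit is hit
    frame = capture_screen()
    action = query_vlm(frame, goal)
    press(action)
```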
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we're open-sourcing the toolkit that made it happen: SWE-smith.
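Not SWE-smith's actual API, but a hedged sketch of the general recipe behind this kind of data synthesis: break a healthy repo with a candidate patch, keep the instance only if the existing test suite catches the breakage, and record the broken state as a training task. The function names and the output fields are assumptions for illustration.

```python
# Illustrative sketch only; not SWE-smith code.
import pathlib
import subprocess

def tests_pass(repo: pathlib.Path) -> bool:
    """Run the repo's test suite; exit code 0 means everything still passes."""
    return subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0

def make_task_instance(repo: pathlib.Path, buggy_patch: str, instance_id: str) -> dict | None:
    """Apply a candidate bug-introducing patch; keep it only if tests now fail."""
    subprocess.run(["git", "apply", "-"], cwd=repo, input=buggy_patch.encode(), check=True)
    try:
        if tests_pass(repo):
            return None  # the patch broke nothing observable, so it makes a poor task
        return {"instance_id": instance_id, "patch": buggy_patch}  # fields are illustrative
    finally:
        # Restore the repo to its clean state before trying the next candidate patch.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
```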
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop! 🧵
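Since the split shares the SWE-bench format, instances can be pulled the same way as the other SWE-bench datasets. A minimal sketch below; the HuggingFace dataset id is an assumption, so check the SWE-bench docs for the exact name.

```python
from datasets import load_dataset

# Dataset id below is assumed; verify the exact name on the HuggingFace hub.
ds = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")
print(len(ds), "instances")

example = ds[0]
print(example["repo"], example["instance_id"])   # standard SWE-bench fields
print(example["problem_statement"][:200])        # the issue text an agent must resolve
```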
Had a great time talking about building agents, SWE-agent, SWE-bench, and more
📢 NEW #DataBrew Episode! In this episode, Kilian Lieret (Research Software Engineer) & Carlos Jimenez (Computer Science PhD Candidate) at @Princeton dive into SWE-bench & SWE-agent, two cutting-edge tools for evaluating & enhancing AI in software engineering.