Mark Müller
@mnmueller
PhD student at @the_sri_lab at @ETHZ
We are excited to see the community use our SWT-Bench and work on the crucial topic of test generation!
🚨 New SWT-Bench Submission! 🤖 Amazon Q Developer Agent leads the SWT-Bench leaderboard 🥇 with an impressive 49% of successfully tested issues and a coverage improvement of 57% on SWT-Bench Verified.
SOTA code agent OpenHands (top-1 for SWE-full) achieves only 22% accuracy in unit test generation on SWT-lite (half its SWE performance), only slightly outperforming SWE-agent. What is going on? We dug through the data to find a simple trick and achieve almost 30%! 👇🧵 1/9
We have our first submission for SWT-Bench 🚀 AEGIS, a dedicated test generation agent, achieves 47.8% accuracy 🏆 , significantly outperforming our SWE-Agent+ baseline and demonstrating the potential of dedicated test generation agents. 1/3 🧵
🚀 Introducing the SWT-Bench Leaderboard! Test your AI's ability to write tests reproducing real-world GitHub issues and improve coverage where it matters. 🤖 Ready for the challenge? 👉 swtbench.com #AI #SoftwareTesting #SWTBench #CodeAgents
Meet me at this morning's NeurIPS poster session to discuss our work on generating reproducing test cases with Code Agents.
SRI Lab at #NeurIPS2024 - 1/8 SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents Niels Mündler (@nielstron), Mark Niklas Mueller, Jingxuan He (@jingxuan_he), Martin Vechev (@mvechev) ⏰ /📍 Wed 11th, 11AM - 2PM, West Ballroom A-D #5406 📝 We explore software…
Excited to see our work on benchmarking the test-generation capabilities of LLMs being picked up by the community!
Super cool work by @nielstron et al: SWT-Bench is SWE-bench for test generation! They give the model a repo and an issue and it has to write a test for the issue. They show that SWE-agent is able to write good tests for 19% of the issues in the benchmark! 🧵(1/3)
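The idea described above (the model gets a repo and an issue and must write a test that reproduces the issue) can be sketched as a fail-to-pass check: a generated test counts as reproducing the issue if it fails on the buggy code and passes once the gold patch is applied. A minimal toy sketch, with hypothetical function names (`buggy_slugify`, `patched_slugify`, `reproduces_issue` are illustrative, not SWT-Bench's actual harness):

```python
def buggy_slugify(s):
    # Buggy repository state: forgets to lowercase (the reported issue).
    return s.replace(" ", "-")

def patched_slugify(s):
    # Gold patch applied: lowercases as expected.
    return s.replace(" ", "-").lower()

def generated_test(slugify):
    # Test the agent wrote from the issue description alone.
    return slugify("Hello World") == "hello-world"

def reproduces_issue(test, buggy, patched):
    # Fail-to-pass criterion: must fail pre-patch, pass post-patch.
    return (not test(buggy)) and test(patched)

print(reproduces_issue(generated_test, buggy_slugify, patched_slugify))
```

A test that passes in both states (or fails in both) would not count, since it does not discriminate the bug from the fix.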
Presenting today @icmlconf 2024 Workshop FM in the Wild 🤖 🏞️ "Code Agents are State of The Art Software Testers" SWE-Agent, aider and co are competent at reproducing GitHub issues, performing as well as specialized methods. Looking forward to answering your questions!
On Tuesday at 11:30, in Poster Session 1, we will present Prompt Sketching, a novel decoder-driven approach for templated (and constrained) text generation of LLMs. 📄 arxiv.org/abs/2311.04954 👨‍💻 Work with @mnmueller, @lbeurerkellner, @mvechev.
Excited to share our latest work which we will present today at @iclr_conf
We show that neural network certification with all commonly used convex relaxations is imprecise for any NN expressing interesting (>1-d inputs) functions and discuss implications for cert. training. 🧑‍🔬 Maximilian Baader, @mnmueller, @MaoYuhao91443 📄 arxiv.org/abs/2311.04015
A couple of amazing PhD students graduated from our lab (@the_sri_lab) at ETH Zurich today: @mbalunovic and @mnmueller. Both made fantastic contributions to the area of Safe and Secure AI: impactful papers and systems the community built upon. Next steps should be exciting :)
Find us @NeurIPSConf #NeurIPS2023 to chat about our latest work. We are excited to share works on certified robustness, a large-scale study of image classifiers, and game theory. All works are supervised by @mvechev. 🧵
@mnmueller and @marc_r_fischer introduced a new form of Abstract Interpretation for challenging unbounded loops, enabling the analysis of fixpoint-based neural network architectures (monDEQs). 🌐 sri.inf.ethz.ch/publications/m… 📄 arxiv.org/abs/2110.08260 🧵 3/3
Super excited to talk about robustness guarantees for neural networks at @mlsec_lab's seminar!
We are excited to present a new event in our seminar series on ML Security! We will host Mark Müller (ETH Zurich) on June 6, 2023, at 15:00 CEST. Free registration: eventbrite.com/e/machine-lear… @adversarial_ML @trustworthy_ml @aivillage_dc @RedTeamVillage_
At @iclr_conf members of SRI lab presented 3 works: - ⚖️ Human-Guided Fair Classification for NLP - 📈 Robustness Verification & Training of Neural ODEs - 📦 Certified Training: Small Boxes are All You Need Find us around the workshops! 🧵 1/4