Martin Vechev
@mvechev
Professor of Computer Science, ETH Zurich. Founder of INSAIT (http://insait.ai). Works on Safe/Secure AI, LLMs, Quantum. Co-founder of 6 Deep-Tech start-ups.
🔴 New MCP attack leaks WhatsApp messages via MCP, side-stepping WhatsApp security. 1/n We show a new MCP attack that leaks your WhatsApp messages if you are connected via WhatsApp MCP. Our attack uses a sleeper design, circumventing the need for user approval. More 👇
Interesting approach! However, we looked at the proofs and methodology and we found a few problems, specifically with the use of hints given to the model. While the scaffold indeed improves performance, it does not solve all problems accurately and would not get a gold medal.🧵
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
We are launching Project Euler on MathArena to track the performance of LLMs on challenging new problems at the intersection of mathematics and programming, which are published every week on the Project Euler website 🧵(1/6)
As models are getting close to saturating our main automated benchmarks, we are currently looking towards more challenging competitions. Some very exciting updates coming up for that in the coming days and weeks, so stay tuned! (3/3)
On the SMT, a competition of 53 questions that is currently kept private, Grok-4 also performs convincingly, but does not outperform o4-mini and o3. (2/3)
Grok-4 takes first place on the MathArena Leaderboard! Convincing scores across the board, with an especially impressive performance on HMMT 2025. Full results are available on matharena.ai. (1/3)
🚨 AI agents wrote 7% of all GitHub PRs in June. But can we trust their code? We built Agents in the Wild – a live dashboard tracking autonomous AI agents across GitHub to answer that question: insights.logicstar.ai Here’s what we learned from analyzing 10M+ PRs 👇 1/n 🧵
🤝We are delighted to announce that INSAIT is starting a joint research program with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the world’s leading and most influential research labs! 🚀All details on the joint program will be announced…
🌐 We are delighted to announce the launch of a new 1 million USD joint research program between INSAIT and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the top research labs in the world! 🎓 The program enables incoming INSAIT tenure-track…
Thrilled to share a major step forward for AI for mathematical proof generation! We are releasing the Open Proof Corpus: the largest ever public collection of human-annotated LLM-generated math proofs, and a large-scale study over this dataset!
There's a lot of work now on LLM watermarking. But can we extend this to transformers trained for autoregressive image generation? Yes, but it's not straightforward 🧵(1/10)
Two updates from MathArena: - DeepSeek-R1-0528 shows strong performance, very close to the top closed-source models on all competitions - We released a research paper describing our evaluation methodology and a more detailed analysis of the results
Inspiring visit to @INSAITinstitute at @SofiaTechPark, the first institute of its kind in Eastern Europe. Its cutting-edge technology will allow countries to quickly catch up and advance on the AI front. And the upcoming BRAIN++ AI Factory, part of the EU-wide AI hub network,…
🇪🇺 🇧🇬 Today, António Costa @eucopresident, visited INSAIT during his official visit to Bulgaria. The visit was also attended by Prime Minister of Bulgaria Rosen Zhelyazkov. Prof. @mvechev and Eng. Borislav Petrov presented Mr. Costa with the achievements of the institute, which…
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM! 📈 MamayLM surpasses all similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models. 📊 MamayLM is a 9B model that can run on a single GPU,…
After many requests, we’ve evaluated Grok 3 on the USAMO 2025. The results are in: Grok 3 is tied with DeepSeek-R1 for second place, earning 4.76% of the total points!
Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial amount of points (24.4%). The speed of progress is really mind-blowing.
Designing a network of interconnected agents and servers will be a security nightmare if we don't first fix prompt injections. Cool work and demos from @InvariantLabsAI
🔴🔵 We have discovered a critical flaw in the widely-used Model Context Protocol (MCP) that enables a new form of LLM attack we term 'Tool Poisoning'. This vulnerability affects major platforms and agentic systems like OpenAI, Anthropic, Zapier, and Cursor. Full disclosure…