DAIR.AI
@dair_ai
Democratizing AI research, education, and technologies. Learn how to build with AI in our new AI Academy: https://dair-ai.thinkific.com/
Top AI Papers of The Week (July 14 - 20):
- Agentic-R1
- Context Rot
- Scaling up RL
- A Survey of AIOps
- Chain-of-Thought Monitorability
- One Token to Fool LLM-as-a-Judge
- A Survey of Context Engineering for LLMs
Read on for more:
AI Agents Evaluation
Evaluation is key to developing reliable and scalable agentic systems. Really enjoyed this conversation with @ptkbhv on *everything* related to AI agent evaluation. One of the many deep dives we have done at the @dair_ai academy. Feel free to share with…
Learning without training
Google researchers explore the implicit dynamics of in-context learning. "Implicit weight updates from ICL mirror the effect of actual fine-tuning on the same data." This one is more technical but much needed. The findings:
Deep Research Agents with Test-Time Diffusion
Google keeps pushing on diffusion. This time, they apply diffusion to deep research agents, specifically the report generation process. It achieves a 69.1% win rate vs. OpenAI Deep Research on long-form research. My notes:
A Structural Planning Framework for LLM Agent System in Enterprise
Agentic systems for enterprise are a work in progress. Reliability is a real problem. It's no secret that planning works, but structural planning can further improve the reliability of AI agents. My notes:
Context Rot
Great title for a report, but even better insights about how increasing input tokens impacts the performance of top LLMs. Banger report from Chroma. Here are my takeaways (relevant for AI devs):
A Survey of Context Engineering
160+ pages covering the most important research around context engineering for LLMs. This is a must-read! Here are my notes:
Agent Leaderboard v2 is here!
> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains
More below: