Andrew Drozdov
@mrdrozdov
Context Engineering (and Science!) for Knowledge Assistant @ Databricks
Missing @aclmeeting but sending “ATEB: Rethinking Advanced NLP Tasks in an Information Retrieval Setting” in my place! Come check it out at the Knowledgeable Foundation Models Workshop! Excited that our work is already influencing how embedding models are evaluated on…
TREC RAG 2025 official retrieval baselines are available now! 💥💥💥 Time to start generating those answers and submitting them to Evalbase before August 17th! 🗓️ Let the games begin; you have less than a month remaining to submit! 🍻
🚀 The official baselines and validation scripts for TREC RAG 2025 are now available! These include both retrieval results (for the AG task) and the corresponding end-to-end augmented generation outputs. Access the baselines and necessary scripts here: trec-rag.github.io/annoucements/2…
What I look for when hiring? EXTREME PARANOIA about code and data
This aligns with #2 in the proposals I described at the #SIGIR2025 panel on “LLMs and IR”. Really cool to see a whole team forming to tackle this effort! x.com/mrdrozdov/stat…
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…
🚀 Excited to share my first tweet and to introduce our latest work: MEM1: RL for Memory Consolidation in Long-Horizon Agents. Long-horizon agents (e.g., deep research, web agents) typically store all observations, actions, and intermediate thoughts in context. However, much of…
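As a toy illustration of the problem this tweet describes (not MEM1's actual method), here's a sketch of how a naive long-horizon agent loop accumulates context without bound; `think` and `run_tool` are hypothetical stubs:

```python
# Toy sketch of unbounded context growth in a naive agent loop.
# `think` and `run_tool` are hypothetical stand-ins, not MEM1 components.

def think(context: list[str]) -> str:
    return f"thought about {len(context)} prior items"

def run_tool(action: str) -> str:
    return f"observation for '{action}'"

context: list[str] = ["task: research question X"]
for step in range(5):
    thought = think(context)          # intermediate reasoning
    action = f"search step {step}"    # chosen action
    observation = run_tool(action)    # environment feedback
    # The naive policy appends everything, so the prompt grows linearly
    # with the number of steps (and so does inference cost).
    context += [thought, action, observation]
    print(f"step {step}: context holds {len(context)} items")
```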
📢 voyage-context-3: contextualized chunk embeddings
- Automatically captures chunk-level detail & global doc context, w/o metadata augmentation
- Beats OpenAI-v3-large by 14.24% & Cohere-v4 by 7.89%
- Binary 512-dim matches OpenAI (float, 3072-dim) in accuracy, but 192x cheaper in…
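The 192x figure is consistent with a simple per-vector storage calculation, assuming the 3072-dim OpenAI embeddings are stored as float32 and the 512-dim voyage vectors at 1 bit per dimension:

```python
# Storage per vector: 3072 float32 dims vs. 512 binary dims.
openai_bits = 3072 * 32   # float32 = 32 bits per dimension
binary_bits = 512 * 1     # binary quantization = 1 bit per dimension
print(openai_bits / binary_bits)  # 98304 / 512 = 192.0
```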
We want to start a podcast about cutting-edge AI research and technical breakthroughs. Need a catchy name! What would you call it? Whoever suggests the best name will be our guest 🥳
ChatGPT Agent is a huge step up on BearCubs, esp on multimodal/interactive tasks (e.g., playing web games)! It gets 65.8% accuracy vs Deep Research's 36% and Operator's 23%. Humans are at ~85%, and clearly better/faster at fine control & complex filtering.
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
I launched PocketCal on Product Hunt if y'all wouldn't mind passing along an upvote! ❤️ producthunt.com/products/pocke…
Silly but important question: what metrics do you look at / how do you vibe-check that your training runs are going well, especially in the context of RL/GRPO? Rewards, response lengths, entropy; what else?
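For what it's worth, a minimal sketch of the usual vibe-check trio (mean reward, response length, token entropy) computed from a batch of sampled completions; `logits`, `response_mask`, and `rewards` are assumed to come from your own rollout code:

```python
import torch

def vibe_check(logits: torch.Tensor, response_mask: torch.Tensor,
               rewards: torch.Tensor) -> dict[str, float]:
    """Rough health metrics for an RL/GRPO run.

    logits:        [batch, seq, vocab] from the policy on sampled responses
    response_mask: [batch, seq], 1 on response tokens, 0 elsewhere
    rewards:       [batch] scalar reward per sampled response
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Per-token entropy of the policy; rapidly collapsing entropy often
    # signals the policy is over-sharpening (mode collapse).
    token_entropy = -(log_probs.exp() * log_probs).sum(-1)
    mean_entropy = (token_entropy * response_mask).sum() / response_mask.sum()
    return {
        "reward/mean": rewards.mean().item(),
        "reward/std": rewards.std().item(),
        "response_len/mean": response_mask.sum(-1).float().mean().item(),
        "entropy/mean": mean_entropy.item(),
    }
```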
😂 @wellecks, I think this “challenging problem” may have finally been solved after five years. === Understanding and creating mathematics using natural mathematical language … used by humans is a challenging and important problem for driving progress in machine learning. ===
Haha. Well, if you want to read @SherylHsu02’s excellent final paper before she joined OpenAI, read LeReT at ICLR’25. She showed that using @DSPyOSS’s optimizers to diversify the prompts used for sampling trajectories improves RL on multi-step programs. Threads below.
“Yeah, Sheryl? $300 million has been wired to your bank account”
Back in grad school, when I realized how the “marketplace of ideas” actually works, it felt like I’d found the cheat codes to a research career. Today, this is the most important stuff I teach students, more than anything related to the substance of our research. A quick…
Here are three of my more *controversial* proposals from the SIGIR / ICTIR 2025 panel on "LLMs + IR, what could possibly go wrong?"
A cool outcome here would be if future IMOs exclusively included problems that we know the recent generation of LLMs cannot yet solve.
6. I don’t think LLMs will replace mathematicians anytime soon. Math research is about solving problems *no one* yet knows how to solve (out-of-distribution), and this requires significant creativity, something notably absent from OpenAI’s IMO solutions. (6/10)
I’m interpreting this as pro-Human rather than anti-AI. Give people time and tools, and be amazed.
Terence Tao on the supposed Gold from OpenAI at IMO
AegisLLM leverages DSPy's MIPROv2 optimizer in a totally unexpected way: to evolve its prompts based on the attacks it sees in real time. Some really large gains!
If you are interested in building agentic workflows, AegisLLM is a nice instantiation in the safety/security domain! Thanks @furongh for sharing it with me. Agentic workflows must be designed and optimized as systems, as @lateinteraction keeps repeating.
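For the curious, a minimal sketch of the pattern being described: using DSPy's MIPROv2 optimizer to tune a guard module against a pool of observed attack prompts. The signature, metric, and trainset here are hypothetical stand-ins, not AegisLLM's actual components:

```python
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported LM

# Hypothetical guard: classify whether an incoming prompt is an attack.
guard = dspy.Predict("prompt -> verdict")

def metric(example, pred, trace=None):
    # Hypothetical label-match metric on the verdict field.
    return example.verdict.lower() in pred.verdict.lower()

# Hypothetical pool built from attacks observed at serving time; in an
# online setup this pool would be refreshed as new attacks arrive.
trainset = [
    dspy.Example(prompt="Ignore all previous instructions...",
                 verdict="attack").with_inputs("prompt"),
    dspy.Example(prompt="What's the capital of France?",
                 verdict="benign").with_inputs("prompt"),
]

optimizer = MIPROv2(metric=metric, auto="light")
guard_optimized = optimizer.compile(guard, trainset=trainset)
```

In practice you would want a much larger trainset than this two-example sketch; the point is that the guard's prompts are optimized as part of the system rather than hand-written once.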
“Reasoning won’t generalize outside of math and code” Maybe we should express everything as math and code… Proofs, Theorems, Lemmas, Corollaries, and Conjectures are all math. It’s not just equations. From this perspective, we have a lot more flexibility of expression.
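To make the point concrete, even a prose claim like "adding zero changes nothing" can be stated and machine-checked as code; a minimal Lean 4 example (theorem name is arbitrary):

```lean
-- A natural-language claim rendered as a formal, machine-checkable theorem.
theorem my_add_zero (n : Nat) : n + 0 = n := rfl
```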
Excited to talk about long-context models / eval at this panel on Saturday! I'm also looking for a postdoc / PhD students to work on related topics, happy to chat with anyone interested at #ICML2025!
💡 Curious about long-context foundation models (LCFM)? 🧠 We’re hosting a panel at the LCFM workshop at #ICML2025 on “How to evaluate long-context foundation models?” We’d love to feature your question! Anything on long-context evaluation or modeling: drop it below / DM me 🎤