sanjana
@sanjanayed
Berkeley EECS, Arize Phoenix
Just wrapped up a tutorial where I use a custom annotations tool to build an end-to-end evaluation & experimentation pipeline🚀 Inspired by an article from @eugeneyan, I explore how to leverage annotations to construct evals, design thoughtful experiments, and systematically improve…
🔥🔥🔥🔥Phoenix update - pumped to start using this
📈 @ArizePhoenix now has project dashboards! In the latest release @arizeai Phoenix comes with a dedicated project dashboard with: 📈 Trace latency and errors 📈 Latency Quantiles 📈 Annotation Scores Timeseries 📈 Cost over Time by token type 📊 Top Models by Cost 📊 Token…
what a world we live in! I just took a Jupyter notebook that implements an LLM evaluator, provided it as context to a coding assistant, and then asked it to write an evaluator class with specified inputs, outputs, etc. initially, it was verbose with too many class methods, but with…
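An evaluator class of the kind described above might look like the sketch below. This is purely illustrative — the names (`LLMEvaluator`, `EvalResult`, the `judge` callable) are hypothetical stand-ins, not the actual notebook's code.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    label: str        # "correct" / "incorrect" / "unknown"
    score: float      # 1.0 for correct, 0.0 otherwise
    explanation: str  # the judge model's raw reasoning

class LLMEvaluator:
    """Wraps a judge LLM behind a small, fixed interface."""

    def __init__(self, judge, prompt_template: str):
        self.judge = judge                      # any callable: str -> str
        self.prompt_template = prompt_template  # uses {input} and {output}

    def evaluate(self, input_text: str, output_text: str) -> EvalResult:
        prompt = self.prompt_template.format(input=input_text, output=output_text)
        raw = self.judge(prompt)
        lower = raw.lower()
        # Check "incorrect" first: the substring "correct" appears inside it.
        if "incorrect" in lower:
            label = "incorrect"
        elif "correct" in lower:
            label = "correct"
        else:
            label = "unknown"
        return EvalResult(label, 1.0 if label == "correct" else 0.0, raw)
```

Keeping the interface to a single `evaluate` method is exactly the kind of trimming the tweet describes — the assistant's first draft had too many class methods.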
helped write this one! excited to have it out now
Modern agents are increasingly complex — they’re multiple agents connected through complex routing logic and handovers, often multimodal, and connecting to MCP servers as tools. Agent observability is no longer a nice-to-have. This can help. bit.ly/4f3gHWn
Libraries: central.sonatype.com/search?q=arize Github: github.com/Arize-ai/openi…
🚀 Introducing OpenInference Java! We're excited to announce the launch of OpenInference Java, a comprehensive solution for tracing AI applications using OpenTelemetry This is fully compatible with any OpenTelemetry compatible collector or backend! 📦 What’s included: ✅…
Prompts, like models, should improve with feedback — not stay static. Here’s how prompt learning works: 1️⃣ The prompt is treated as an online object — something that evolves over time 2️⃣ An LLM (or human) provides an assessment and a natural-language critique in English, unlike…
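The two steps above can be sketched as a single update function. This is a minimal sketch assuming two hypothetical LLM callables (`critique_model`, `rewrite_model`, each `str -> str`) — any LLM client could fill those roles; it is not Arize's implementation.

```python
def prompt_learning_step(prompt: str, example: str, output: str,
                         critique_model, rewrite_model) -> str:
    # 1. An LLM (or human) assesses the output and writes a critique
    #    in plain English rather than a scalar reward.
    critique = critique_model(
        f"Task prompt:\n{prompt}\n\nInput:\n{example}\n\nOutput:\n{output}\n\n"
        "Assess the output and explain, in plain English, what the prompt "
        "should do differently."
    )
    # 2. The critique is folded back into the prompt — the prompt is an
    #    online object that evolves with feedback, not a static string.
    return rewrite_model(
        f"Current prompt:\n{prompt}\n\nCritique:\n{critique}\n\n"
        "Rewrite the prompt to address the critique. Return only the new prompt."
    )
```

Run in a loop over failing examples, this gives the "RL in English" flavor described in the next tweet: the learning signal is a readable critique instead of a gradient.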
Reinforcement Learning in English – Prompt Learning Beyond just Optimization @karpathy tweeted something this week that I think many of us have been feeling: the resurgence of RL is great, but it’s missing the big picture. We believe that the industry chasing traditional RL is…
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…
The secret to prompt optimization is evals Saw this tweet by Jason Liu and it got me thinking about the future of prompt optimization Most of us are in Cursor/Claude Code and it makes a ton of sense to keep prompts close to code and iterate on them with AI code editors The hard…
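"Keep prompts close to code and score them with evals" reduces to something like the sketch below — names (`score_prompt`, `best_prompt`, the `model` and `grader` callables) are illustrative assumptions, not a specific library's API.

```python
def score_prompt(prompt: str, eval_set, model, grader) -> float:
    """Average pass rate of model(prompt, input) against expected outputs.

    eval_set: list of (input, expected) pairs
    model:    callable (prompt, input) -> output
    grader:   callable (output, expected) -> bool
    """
    passed = sum(grader(model(prompt, x), expected) for x, expected in eval_set)
    return passed / len(eval_set)

def best_prompt(candidates, eval_set, model, grader) -> str:
    """Pick the prompt version that scores highest on the eval set."""
    return max(candidates, key=lambda p: score_prompt(p, eval_set, model, grader))
```

With prompt versions checked into the repo, an AI code editor can propose a new candidate and the eval score decides whether it ships — the evals, not vibes, drive the optimization.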
holy shit lmfao claude code has been writing a prompt, looking at 200 failures and updating the prompt, it went from v1 recall@1 60 -> 80 v2 recall@1 6 -> 52
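For reference, the recall@1 metric quoted above (reported here as a percentage) is just the fraction of queries whose gold item lands in the top-k results — a minimal sketch:

```python
def recall_at_k(results, relevant, k: int = 1) -> float:
    """Percent of queries whose relevant item appears in the top-k results.

    results:  list of ranked result lists, one per query
    relevant: list of the gold item for each query
    """
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return 100.0 * hits / len(results)
```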
If you’re in the trenches with agents, this will save you time and sanity. Seeing everything click visually takes the guesswork out of spotting what's off
🚀 Working with multi-agent systems? Arize Agent Visibility lets you actually see how your agents are structured automatically. Out of the box with frameworks like Agno, AutoGen, CrewAI, Mastra, SmolAgents & more. No extra setup. Here’s what it brings: ✅ Auto-generated…
Agents using phoenix-support in @cursor_ai aren’t just coding. They’re pulling in best practices, docs, and auto-updating tracing setups without humans in the loop. Feels like self-improving developer workflows!!
🔧 @ArizePhoenix MCP gets a phoenix-support tool for @cursor_ai / @AnthropicAI Claude / @windsurf ! You now can click the add to cursor button on phoenix and get a continuously updating MCP server config directly integrated into your IDE. @arizeai/[email protected] also comes…
Ever wonder if your agent’s actually getting it right over a whole convo, not just one step? New Session-Level Evals in Arize AX let you do exactly that by measuring: 🌀 Coherence across the session 🧩 Context retention across turns 🎯 Whether users actually reach their goals…
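The shape of a session-level eval is simple: group trace records by session and judge the full transcript instead of one step. A hedged sketch — the record fields (`session_id`, `turn`) and the `judge` callable are illustrative assumptions, not the Arize AX API.

```python
from collections import defaultdict

def session_level_eval(traces, judge):
    """Score whole conversations, not single steps.

    traces: list of dicts like {"session_id": ..., "turn": ...}
    judge:  callable (full transcript str) -> float, e.g. a coherence
            or goal-completion grade from an LLM judge
    """
    sessions = defaultdict(list)
    for t in traces:
        sessions[t["session_id"]].append(t["turn"])
    # One score per session, computed over the concatenated transcript,
    # so context retention across turns is visible to the judge.
    return {sid: judge("\n".join(turns)) for sid, turns in sessions.items()}
```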