Jay
@jayendra_ram
using computers @hud_evals, prev cs+physics @columbia, @ycombinator
I've been working with a small team to evaluate agentic models for computer use agents. Today, we're thrilled to introduce Autonomy, our comprehensive eval for AI agents. We aim to create an eval that's rigorous, tests agency, and moves toward general intelligence. 1/ 🧵

watching chatgpt agent use a computer to do complex tasks has been a real "feel the agi" moment for me; something about seeing the computer think, plan, and execute hits different.
If you're in NYC you should 100% go to this. x.com/aaron_epstein/…
Looking forward to meeting builders in NYC next week. RSVP for the event and reply below with what you’re working on.
Everyone wants to make an eval but no one wants to write some heavy ass tasks
Is there a good iOS app that I can use to interact with MCP servers? Would pay for this.
Soham should be let into the next @ycombinator batch. He has a great answer to the "time I hacked a non-computer system" question. Plus he's already worked at a few YC companies. x.com/garrytan/statu…
Without the YC community this guy would still be operating and would have maybe never been caught. The startup guild of YC is a necessary invention to help founders be more successful than they would be alone.
Good news: Some places let you iterate just as fast! Currently at @hud_evals we are using the same eval stack that many frontier labs use for their AI agents! We're hiring part-time/intern eval makers, and full-time devs across p much the entire stack :) DM if interested.
I’ll cover this more in my video, but tl;dr:
- Anthropic tests the ways that the model will try to “disobey” because safety (everyone does this)
- they came up with a compelling test, giving the model a fake set of tools + a fake scenario that would affect public health
- they…
we switched out the underlying Operator model to o3 today. we think it's a step jump improvement from the previous 4o-based model. give it a try with some of your old prompts that failed! it was fun to work on this, the o-series paradigm makes everything so much better!
Operator 🤝 OpenAI o3 Operator in ChatGPT has been updated with our latest reasoning model. operator.chatgpt.com
Jon Xu and Andrew Miklas both went through YC in 2010. Fifteen years and two iconic companies later— FutureAdvisor and PagerDuty— they're back, this time as YC's newest General Partners. Welcome, @jonxu and Andrew! ycombinator.com/blog/welcome-j…
Discriminator-generator gap seems to be the most important idea in AI for scientific innovation. With compute + clever search, anything that we can measure will be optimized. First up will be environments that can be verified quickly, with continuous reward, and at scale.…