Rayan Krishnan
@RayanKrishnan
ceo @_valsai | solve evals, solve intelligence prev @stanford @PalantirTech
Hasn't changed much since Grok 2 x.com/RayanKrishnan/…
In the livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found a bigger gap between public (pink) and private (purple) benchmarks.
Chinese open source developers have now far outpaced their western counterparts. Of course OAI's open weight model is coming any day now, right?
Does @Kimi_Moonshot's Kimi K2 live up to the hype? We found that it is indeed the new state-of-the-art open-source model according to our evaluations. The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. (1/4)
Finally a leaderboard that shows you which LLM is the best gambler. Now give Claude your banking info :)
We evaluated @AnthropicAI and @OpenAI models on our Finance Agent Benchmark, compiling results from the best each lab had to offer across question categories. Both labs are pushing the boundaries on financial agentic capabilities. Financial institutions are increasingly relying…
@grok 4 struggles on our private benchmarks, in contrast to SOTA performance on AIME, Math 500, and GPQA… We had high hopes after Wednesday’s livestream 😔 (🧵1/3)
Very capable model based on our initial testing. Remains to be seen how it does on our held-out sets
Grok 4 is the new state-of-the-art on our academic math and science benchmarks (AIME, GPQA, MATH 500) 🚀 Congrats @xai @elonmusk @Yuhu_ai_ @belce_dogru
wen grok 5 solve millennium prize problem tho? @ericzelikman

Batch API is a win-win-win (providers, builders, users) and I'm glad more providers are reaching the scale to enable it. We worked with the Google team to beta their Gemini batch API in our evaluations for 2.5! Well done @divy93t @OfficialLoganK
Grateful to the @GeminiApp team for the shoutout on our Batch API integration! We’ve added batching support on our platform as part of our ongoing efforts to improve cost efficiency for running increasingly large benchmarks (along with similar offerings from OpenAI, Anthropic,…
Well that was close
PSA: there’s a guy named Soham Parekh (in India) who works at 3-4 startups at the same time. He’s been preying on YC companies and more. Beware. I fired this guy in his first week and told him to stop lying / scamming people. He hasn’t stopped a year later. No more excuses.
Another Vals AI Game Night in the books! Thanks to everyone who came out last Friday for Za's Pizza and board games! We always appreciate seeing friends and making new ones. Interested in coming to the next one? DM us!
How's OAI supposed to reduce churn if its "leaders" are so obviously not using their own product to write memos?

Meta AI is nothing without its people??
BREAKING: Mark Zuckerberg notified Meta staff today to introduce them to the new superintelligence team. The memo, which WIRED obtained, lists names and bios for the recently hired employees, many of whom came from rival AI firms like OpenAI, Anthropic, and Google.