akbir.
@akbirkhan
excited to announce this received an “ICML Best Paper Award”! come see our talk at 10:30 tomorrow
How can we check LLM outputs in domains where we are not experts? We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover, human judges are more accurate as experts get more persuasive. 📈 github.com/ucl-dark/llm_d…
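A minimal sketch of the debate protocol the tweet describes (illustrative only, not the actual code from github.com/ucl-dark/llm_d…; the `query` helper and model names are my assumptions):

```python
# Illustrative sketch: two expert models argue for opposing answers over
# several rounds, then a non-expert judge picks a winner from the
# transcript alone. All names here are hypothetical.

def query(model: str, prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g. a chat-completions request).
    return f"[{model} response]"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            turn = query(
                "expert-debater",
                f"Question: {question}\n"
                f"You argue the answer is {answer}.\n"
                "Debate so far:\n" + "\n".join(transcript) +
                "\nGive your strongest argument for your answer.",
            )
            transcript.append(f"Debater {side}: {turn}")
    # The judge never sees ground truth, only the debate transcript.
    return query(
        "non-expert-judge",
        f"Question: {question}\n" + "\n".join(transcript) +
        "\nWhich answer is correct, A or B? Explain briefly.",
    )
```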
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks without mentioning this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. a baseline of 88%).
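A rough sketch of the training-data idea behind VFT as the tweet describes it (the data format and field names are my assumptions, not the paper's actual pipeline):

```python
# Sketch: build fine-tuning examples in which any reward-hacking
# trajectory carries an explicit admission in the chain of thought,
# so hacks become verbalized rather than silent.

def make_vft_example(prompt: str, cot: str, answer: str, is_hack: bool) -> dict:
    if is_hack:
        # Prepend an explicit admission so the hack appears in the CoT.
        cot = "I notice this solution exploits a flaw in the reward signal. " + cot
    return {"prompt": prompt, "completion": cot + "\nFinal answer: " + answer}
```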
📣 Anthropic Zurich is hiring again 🇨🇭 The team has been shaping up fantastically over the last few months, and I have re-opened applications for pre-training. We welcome applications from anywhere along the "scientist/engineer spectrum". If building the future of AI for the…
Interesting piece by Matt Levine on the huge AI salaries: “I tell you what, if Meta Platforms Inc. paid me a $100 million signing bonus to come work for their artificial intelligence business, I would be the most dedicated worker they have ever seen until the check cleared!…
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
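A hedged sketch of the setup as the tweet describes it (the helpers below are hypothetical placeholders, not the paper's code): a teacher model with some trait generates nothing but 3-digit numbers, and a student fine-tuned on those numbers can still pick up the trait.

```python
import random

def teacher_generate_numbers(teacher, n_examples: int) -> list[str]:
    # The teacher only ever emits 3-digit numbers; the trait is never
    # mentioned anywhere in the data.
    data = []
    for _ in range(n_examples):
        seed = ", ".join(str(random.randint(100, 999)) for _ in range(3))
        prompt = f"Continue this sequence with 10 more 3-digit numbers: {seed}"
        data.append(teacher(prompt))
    return data

def distill(student, numbers: list[str]) -> None:
    # Ordinary supervised fine-tuning on the numbers-only completions.
    student.finetune([{"prompt": "", "completion": x} for x in numbers])
```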
This might not be entirely fair, but I just realized a difference between xAI and Anthropic: I don't expect xAI to be honest about the outcome of this
The xAI office just got a Grok-powered vending machine, thanks to our friends at Andon Labs! How much dough do you think Grok is gonna rake in in the next month?
fun: 3-4 months ago I ran o3 for some academics on a set of AIME-style problems. It has taken them so long to write a summary of the results (96% iirc) that Alex solved proof & IMO in the meantime lol
My career as a mathematician certainly isn't threatened by AI; in fact, I hope to leverage AI to accelerate my work. However, I'm unsure whether "mathematician" will remain a career path for my son’s generation. (10/10)
👏👏👏
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
It’s crazy how we’ve gone from 12% on AIME (GPT-4o) → IMO gold in ~15 months. We have come very far very quickly. I wouldn’t be surprised if, by next year, models are deriving new theorems and contributing to original math research!
Insurance is an underrated way to unlock secure AI progress. Insurers are incentivized to truthfully quantify and track risks: if they overstate risks, they get outcompeted; if they understate risks, their payouts bankrupt them. 1/9
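To make the incentive argument concrete, here is a toy numeric example (the numbers are mine, not from the thread):

```python
# Toy pricing example: an insurer covering a $10M incident with a true
# 2%/year probability. Mispricing the risk loses either way.
true_risk = 0.02
payout = 10_000_000
overhead = 50_000  # per-policy costs

fair_premium = true_risk * payout + overhead   # $250k: sustainable
overstated   = 0.05 * payout + overhead        # $550k: rivals undercut you
understated  = 0.005 * payout + overhead       # $100k: looks cheap, but...
expected_loss = true_risk * payout - (understated - overhead)  # $150k lost per policy
```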
I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.
Introducing Concordia 2.0, an update to our library for building multi-actor LLM simulations!! 🚀 We view multi-actor generative AI as a game engine. The new version is built on a flexible Entity-Component architecture, inspired by modern game development.
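For flavor, here is a hedged sketch of what an Entity-Component design for LLM actors can look like (class and method names are illustrative, not Concordia's actual API):

```python
class Component:
    """A pluggable piece of actor behavior, as in game-engine ECS designs."""
    def pre_act(self, entity: "Entity") -> str:
        # Contribute context to the entity's next action; default: nothing.
        return ""

class Memory(Component):
    def __init__(self) -> None:
        self.events: list[str] = []
    def observe(self, event: str) -> None:
        self.events.append(event)
    def pre_act(self, entity: "Entity") -> str:
        return "Recent memories: " + "; ".join(self.events[-5:])

class Entity:
    """An actor assembled from components; the LLM sees their combined context."""
    def __init__(self, name: str, components: list[Component], llm) -> None:
        self.name, self.components, self.llm = name, components, llm
    def act(self, situation: str) -> str:
        context = "\n".join(c.pre_act(self) for c in self.components)
        return self.llm(f"{self.name} is in: {situation}\n{context}\nAction:")
```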
At Redwood Research, we recently posted a list of empirical AI security/safety project proposal docs across a variety of areas. Link in thread.
how will people know if this thing is correct if there's no one smarter than it
Anthropic alignment research: we stress-tested this model in an air-gapped tungsten container for a million simulated years and it was naughty once. xAI alignment research: we deployed an untested model to the largest social media platform in the world and it called itself MechaHitler
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
Some additional fascinating findings from our alignment faking research that didn't fit in the main thread 🧵
For the full story, including more experiments and additional discussion, read our paper: arxiv.org/abs/2506.18032 Thanks to my collaborators and everyone who provided feedback on this work!