Daniel Murfet
@danielmurfet
Mathematician. Head of Research at Timaeus. Working on Singular Learning Theory and AI alignment.
Transformer-based neural networks achieve impressive performance on coding, math & reasoning tasks that require keeping track of variables and their values. But how can they do that without explicit memory? 📄 Our new ICML paper investigates this in a synthetic setting! 🧵 1/13
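As a rough picture of the kind of synthetic setting involved, here is a hypothetical variable-tracking task generator (illustrative only, not the paper's actual construction):

```python
# Hypothetical sketch of a synthetic variable-tracking task (not the paper's setup):
# generate sequences like "a = 3 ; b = 7 ; a = 5 ; query a" and ask for the latest value.
import random

def make_example(num_vars=3, num_assignments=5, seed=None):
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(num_vars)]
    env, tokens = {}, []
    for _ in range(num_assignments):
        name, value = rng.choice(names), rng.randint(0, 9)
        env[name] = value                      # overwrite: only the latest binding counts
        tokens += [name, "=", str(value), ";"]
    query = rng.choice(list(env))
    tokens += ["query", query]
    return " ".join(tokens), env[query]        # (prompt, correct answer)

prompt, answer = make_example(seed=0)
print(prompt, "->", answer)
```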
I've been yapping for months about bad evaluation setups and how results/AI behaviors are reported, and this new @AISecurityInst paper makes the case much more clearly. In short: There's a massive difference between showing a model can do something sketchy versus showing it tends to…
😈 Here's why you should not worry that models will start blackmailing you out of nowhere: 1. At their heart, LLMs are pattern-matching and prediction engines. Given an input, they predict the most statistically likely continuation based on the vast dataset they were trained on.…
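A toy illustration of that "predict the most statistically likely continuation" picture, with bigram counts standing in for a real LLM:

```python
# Toy illustration of "pattern matching and prediction": a bigram model that always
# emits the statistically most likely next token seen in its training data.
from collections import Counter, defaultdict

corpus = "the model predicts the next token given the previous token".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1                 # count continuations observed in training

def most_likely_continuation(prev_token):
    # Greedy "decoding": pick the continuation with the highest empirical probability.
    return counts[prev_token].most_common(1)[0][0]

print(most_likely_continuation("the"))     # -> whichever word most often follows "the"
```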
It's been a busy week for the Anthropic interpretability team, with more to come in the near future! I wanted to recap some of the things we shared.
Important lessons on rigorous evaluation of AI model behaviors. Drawing on the historical example (and fun story) of hype around "chimps learning language". Given the importance of AI safety research, rigor and credibility are absolutely necessary. A great read from the folks at…
In a new paper, we examine recent claims that AI systems have been observed ‘scheming’, or making strategic attempts to mislead humans. We argue that to test these claims properly, more rigorous methods are needed.
After I left OpenAI, I knew I wanted to be at a non-profit but wasn't sure whether to join or start one. Ultimately I started one bc [long story redacted] but RAND is one I considered + their pivot to taking AI seriously is a great thing for the world: x.com/ohlennart/stat…
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
A single reinforcement learning system is key here! As in this figure, I believe it won't take too long until the models we release will generally outperform the variants that competed in AtCoder and at the IMO.
To summarize this week:
- we released a general-purpose computer-using agent
- got beaten by a single human in the AtCoder heuristics competition
- solved 5/6 new IMO problems with natural-language proofs
All of those are based on the same single reinforcement learning system
8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.
"I use AI in a separate window. I don't enjoy Cursor or Windsurf, I can literally feel competence draining out of my fingers." @dhh, the legendary programmer and creator of Ruby on Rails has the most beautiful and philosophical idea about what AI takes away from programmers.
(7/n) there are one to two OOMs more math, in both breadth and depth, than the typical person here imagines there is. And that is only the math that is known. The unknown is no doubt vastly larger.
A huge component of the AI Security Institute's impact is tied to the scientific quality of our capability evaluations of LLMs. If you find details of rigorous experimental design exciting, please apply to Coz's team!
We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red team, stress-testing the methods and evidence behind AISI's evaluations. If you're sharp, methodologically rigorous, and want to shape research and policy, this role might be for you! 🧵
The AISI Whitebox Control Team is doing cool investigations into how well linear probes work, and has a new post sharing nuanced in-progress work. The results are mixed, in interesting ways! Please see Joseph's thread for details! I have only high-level observations. 🧵
🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇
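A generic sketch of the linear-probe idea these posts discuss, with synthetic activations and labels standing in for real model internals (not AISI's actual pipeline):

```python
# Generic sketch (not AISI's actual pipeline): fit a linear probe on hidden activations
# to separate two behavioural conditions, e.g. "trying" vs "underperforming" runs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 512, 2000
X = rng.normal(size=(n, d_model))              # stand-in for residual-stream activations
w_true = rng.normal(size=d_model)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)   # synthetic condition labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```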
Giorgio Parisi opening StatPhys29: AI resembles the heat engine in that the technology arrived before the theory.
Reward Learning is just supervised learning, and so should be equally safe, right? Wrong! Our paper “The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret” shows that policy optimization causes issues. It was accepted to ICML! 🧵
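A toy numerical illustration of the gap (the numbers and the three-action setup are made up for illustration, not taken from the paper):

```python
# Toy illustration (not the paper's formal setup): a learned reward can have tiny error
# on the training distribution, yet the policy that maximizes it incurs large regret.
import numpy as np

true_reward    = np.array([1.00, 0.90, 0.10])   # three actions
learned_reward = np.array([0.99, 0.91, 1.50])   # accurate on 0 and 1, wrong on 2
train_dist     = np.array([0.50, 0.49, 0.01])   # action 2 is almost never seen in training

train_error = train_dist @ np.abs(true_reward - learned_reward)
print("expected training error:", train_error)   # small, ~0.02

greedy = int(np.argmax(learned_reward))          # policy optimization exploits the error
regret = true_reward.max() - true_reward[greedy]
print("regret of optimizing the learned reward:", regret)  # large, 0.9
```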
log(n): grows very slowly with n
loglog(n): bounded above by 4
logloglog(n): constant
loglogloglogloglogloglog(n): decreasing
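The joke, checked numerically with natural logs for a few large n:

```python
# Iterated logs flatten out fast: log(n), loglog(n), logloglog(n) for large n.
import math

def iter_log(n, k):
    x = float(n)
    for _ in range(k):
        x = math.log(x)
    return x

for n in (10**10, 10**40, 10**80):
    print(n, [round(iter_log(n, k), 3) for k in (1, 2, 3)])
```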
How do transformers carry out recurrent computations while being fundamentally feedforward? Excited to present our work on Constrained Belief Updating at #ICML2025, where we show that attention carries out a spectral algorithm in order to parallelize Bayes updating.
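For context, the recurrent computation in question is Bayesian belief updating over a hidden process. Here is a minimal sequential version with toy matrices (the claim being that attention can carry out this update in a parallelized, spectral way, not that it runs this loop):

```python
# Minimal recurrent Bayes filter over a toy hidden Markov model: the sequential
# belief update that a feedforward transformer would have to parallelize.
import numpy as np

T = np.array([[0.9, 0.1],       # transition matrix P(s' | s)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],       # emission matrix  P(o  | s)
              [0.4, 0.6]])

def update(belief, obs):
    # One Bayes step: predict with T, then condition on the observation.
    predicted = belief @ T
    posterior = predicted * E[:, obs]
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])
for obs in [0, 0, 1, 0]:
    belief = update(belief, obs)
print("belief over hidden states:", belief)
```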
Sensitivity and Sharpness of n-Simplicial Attention. On the topic of stabilizing training, I got unreasonably nerd-sniped by 2-simplicial attention and ended up deriving the sensitivity and sharpness bounds of n-simplicial attention more generally...
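A minimal sketch of the trilinear attention pattern behind 2-simplicial attention, assuming the simple elementwise trilinear logit and an elementwise-product value combination (both are illustrative choices, not the exact construction in the derivation):

```python
# Sketch of 2-simplicial (trilinear) attention: each query attends to *pairs* of positions.
import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    # logits[i, j, k] = sum_d q[i,d] * k1[j,d] * k2[k,d]   (a trilinear form)
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / np.sqrt(q.shape[-1])
    # softmax over the pair (j, k) of attended positions
    flat = logits.reshape(q.shape[0], -1)
    w = np.exp(flat - flat.max(axis=-1, keepdims=True))
    w = (w / w.sum(axis=-1, keepdims=True)).reshape(logits.shape)
    # combine the two value streams, here with an elementwise product
    pair_values = np.einsum("jd,kd->jkd", v1, v2)
    return np.einsum("ijk,jkd->id", w, pair_values)

n, d = 5, 8
rng = np.random.default_rng(0)
out = two_simplicial_attention(*(rng.normal(size=(n, d)) for _ in range(5)))
print(out.shape)   # (5, 8)
```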
I want you all to read @Kimi_Moonshot's technical report on K2, then go back to this thread. Awesome work by @Jianlin_S and team! x.com/Yuchenj_UW/sta…
And another reason is, 3. AI safety. If our weights are 'too massive', then the model's outputs would be too sensitive to changes in the inputs. Nobody wants to be near a robot which would just smack people in the face cuz of some tiny fluke in the sensors. Controlled weights => little to no…
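A rough numerical version of that sensitivity argument for a single linear layer, where the output change is bounded by the weight norm times the input perturbation, ||W(x + e) - Wx|| <= ||W|| ||e||:

```python
# Larger weights amplify the same tiny sensor fluke into a larger output change.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
noise = 1e-3 * rng.normal(size=16)            # a "tiny fluke in the sensors"

for scale in (0.1, 1.0, 100.0):
    W = scale * rng.normal(size=(16, 16))
    delta = np.linalg.norm(W @ (x + noise) - W @ x)
    print(f"weight scale {scale:>6}: output change {delta:.4f}")
```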