Subhash Kantamneni
@thesubhashk
incoming @AnthropicAI. prev @mit Tegmark group. mech interp & alignment
(1/N) LLMs represent numbers on a helix? And use trigonometry to do addition? Answers below 🧵
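rough sketch of what a helical number representation looks like (the periods and numbers here are illustrative, not the ones fit in the paper):

```python
import numpy as np

def helix(a, periods=(2, 5, 10, 100)):
    """Embed integer a as a 'generalized helix': one linear coordinate
    plus a (cos, sin) pair for each period T."""
    feats = [float(a)]
    for T in periods:
        feats += [np.cos(2 * np.pi * a / T), np.sin(2 * np.pi * a / T)]
    return np.array(feats)

# the "clock" picture: cos/sin of a+b can be built from cos/sin of a and b
# via the angle-addition identities, i.e. addition becomes a rotation
a, b, T = 37, 25, 10
ca, sa = np.cos(2 * np.pi * a / T), np.sin(2 * np.pi * a / T)
cb, sb = np.cos(2 * np.pi * b / T), np.sin(2 * np.pi * b / T)
print(np.isclose(ca * cb - sa * sb, np.cos(2 * np.pi * (a + b) / T)))  # True
```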
the only way i would make a map!
We’re launching Mundi, the first open-source web GIS built for AI. After years of training geospatial AI models, we decided now is the moment to build the GIS software of the next decade. After the AI Vectorizer, Georeferencer, and Kue (our LLM agent inside QGIS), we realized…
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...
excited to give a talk tmrw at 1pm est!
This week at Deep Learning: Classics and Trends we're kicking off a new five-part mini-series on LLM Interpretability. Up first: @thesubhashk shows how LLMs represent numbers on a helix and use it to add! Join Friday at 10am PT, zoom here: mlcollective.org/dlct/
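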
happy to be a contributor on this set of ai safety priorities!
This Singapore conference was an amazing AI safety comeback after the Paris flop: great consensus between a who's who from the US, China, top companies, AISIs, etc. on what safety research needs to get done: aisafetypriorities.org
cool work! i rly like the idea of “here’s this wacky model behavior, let’s use interp to understand it!”
1/6: A recent paper shows that LLMs are "self aware": when trained to exhibit a behavior like "risk taking", LLMs self report being risky. In a recent blog post, we explore what's happening here: some self awareness behaviors are caused by a simple learned steering vector!🧵
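A minimal sketch of the steering-vector picture (random vectors as stand-ins for real activations — the actual direction is extracted from the fine-tuned model):

```python
import numpy as np

# the hypothesis: fine-tuning on "risk taking" is well-approximated by adding
# one fixed direction v to the residual stream. the same added direction then
# shifts both the model's behavior *and* its self-reports ("I am risky").
rng = np.random.default_rng(0)
d_model = 64

resid = rng.normal(size=d_model)      # stand-in residual-stream activation
v_risky = rng.normal(size=d_model)    # stand-in learned "risk-taking" direction
v_risky /= np.linalg.norm(v_risky)

alpha = 4.0                           # steering strength (illustrative)
steered = resid + alpha * v_risky     # one edit, two downstream effects
```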
hey i’m at ICLR! If you’re interested in chatting about mech interp (especially unsupervised alternatives to SAEs) or alignment (weak-to-strong oversight is on my mind) hmu!
nice work combining lagrangian and hamiltonian neural nets! ideally we'd want neural nets to learn physical laws from data, and then we'd extract the insights they've learned to improve our own understanding of physics!
Would two AI scientists disagree with each other, even if trained on the same data? After seeing classical physics, AI scientists disagree at first but converge to known theories (Lagrangian/Hamiltonian) when data become diverse. Check out our paper: arxiv.org/abs/2504.02822
Pretty awesome to see Anthropic studying addition in a production model! My intuition is that "number ending in 6" type features are calculated as cos(2pi(a-6)/10) - using trigonometric reps of numbers! Would love to see if we can dig out these low level computations
Claude wasn’t designed to be a calculator; it was trained to predict text. And yet it can do math "in its head". How? We find that, far from merely memorizing the answers to problems, it employs sophisticated parallel computational paths to do "mental arithmetic".
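here's that intuition as a tiny sketch (purely illustrative - not Claude's actual circuit): a "number ends in 6" feature as the peak of a period-10 cosine.

```python
import numpy as np

# cos(2*pi*(a - 6)/10) is exactly 1 when a ends in 6 and smaller otherwise
def ends_in_6(a):
    return np.cos(2 * np.pi * (a - 6) / 10)

for a in [6, 16, 36, 58, 71]:
    print(a, round(float(ends_in_6(a)), 3))
# 6, 16, 36 -> 1.0; 58 -> ~0.309; 71 -> -1.0
```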
Pretty awesome work! I think there’s a lot of promise in monitoring and Docent seems like a great step towards that
AI models are *not* solving problems the way we think. Using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them! details in 🧵. we really need to look at our data harder, and it's time to rethink how we do evals...
Really cool to see interp focused metrics and holistic measures of SAE quality (including probing!)
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings! Our findings in the 🧵 below 👇
A First Step Towards Interpretable Protein Structure Prediction
With SAEFold, we enable mechanistic interpretability on ESMFold, a protein structure prediction model, for the first time. Watch @NithinParsan demo a case study here w/ links for paper & open-source code 👇
New blog post on SAE probing! We argue that SAEs should be evaluated on downstream interp tasks. Unfortunately, SAEs weren’t differentially useful for probing. We think this is a negative result for current SAEs, but we’re hopeful for new SAE/interp methods!
We wrote up a blog post with some takeaways from our SAE probing project! TLDR, we think future work should focus on showing SAEs are differentially useful on downstream tasks, or should focus on ambitious new types of SAEs/other novel techniques. lesswrong.com/posts/osNKnwiJ…
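roughly, the comparison we care about - does a probe on SAE latents beat the same probe on raw activations? (synthetic stand-ins below; real runs use model activations and trained SAEs)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model, d_sae = 2000, 64, 512

acts = rng.normal(size=(n, d_model))                     # stand-in residual activations
labels = (acts[:, 0] + 0.1 * rng.normal(size=n)) > 0     # stand-in binary concept

W_enc = rng.normal(size=(d_model, d_sae))                # stand-in SAE encoder
latents = np.maximum(acts @ W_enc, 0.0)                  # ReLU SAE latents

# train the same linear probe on both representations and compare accuracy;
# "differentially useful" means the SAE column should come out ahead
for name, X in [("raw activations", acts), ("SAE latents", latents)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name, probe.score(X_te, y_te))
```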