James Oldfield
@jamesaoldfield
PhD student interested in interpretability and AI safety @ QMUL. Visiting student @ Oxford. Prev visiting @ UW-Madison
Sparse MLPs/dictionaries learn interpretable features in LLMs, yet provide poor layer reconstruction. Mixture of Decoders (MxDs) expand dense layers into sparsely activating sublayers instead, for a more faithful decomposition! 📝 arxiv.org/abs/2505.21364 [1/7]
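A minimal sketch of the core idea being announced: one dense layer expanded into many small decoder sublayers, of which only k activate per token. The class name, shapes, and top-k gating below are illustrative assumptions, not the paper's exact MxD parameterization.

```python
import torch
import torch.nn as nn

class SparseMixtureOfSublayers(nn.Module):
    """Toy sketch: replace one dense layer with n_experts small decoders,
    of which only k fire per token (illustrative, not the paper's exact MxD)."""

    def __init__(self, d_in, d_out, n_experts=512, k=8):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)  # scores each sublayer per token
        self.decoders = nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)
        self.k = k

    def forward(self, x):                              # x: (batch, d_in)
        scores = self.gate(x)                          # (batch, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)       # keep k sublayers per token
        weights = torch.softmax(topv, dim=-1)          # (batch, k)
        W = self.decoders[topi]                        # (batch, k, d_in, d_out)
        y = torch.einsum('bi,bkio->bko', x, W)         # each selected sublayer's output
        return (weights.unsqueeze(-1) * y).sum(dim=1)  # (batch, d_out)
```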
‼️How well do steering vectors work? When do they fail, and why? ✅We evaluate steering methods and provide theoretical results explaining when and why they fail. Paper: arxiv.org/abs/2502.02716 (w/ @SharonYixuanLi) [1/n]
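For context, a minimal sketch of the generic activation-addition recipe such evaluations target: add a fixed direction to one layer's output at inference time. The `steer` helper, layer choice, and `alpha` scale are illustrative assumptions, not this paper's evaluation protocol.

```python
import torch

def steer(layer, direction, alpha=4.0):
    """Add alpha * direction to this layer's output on every forward pass
    (generic activation-addition sketch; not the paper's exact setup)."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * direction.to(device=h.device, dtype=h.dtype)
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return layer.register_forward_hook(hook)

# usage (hypothetical layer/vector names):
#   handle = steer(model.transformer.h[12], steering_vector, alpha=6.0)
#   ... generate text ...
#   handle.remove()  # restore the unsteered layer
```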
New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵
Looking forward to speaking with folks thinking about architecture design for interpretability at #NeurIPS2024 next week. Feel free to drop by our poster #3003 on scaling MoEs' expert specialization on Friday 13th @ 4:30pm! arxiv.org/abs/2402.12550

Two more weeks to submit your work on tensors/low-rank factorizations to the workshop:
Less than two weeks to submit your papers on:
📈 #lowrank adapters and #factorizations
🧊 #tensor networks
🔌 probabilistic #circuits
🎓 #theory of factorizations
to the first workshop connecting them in #AI #ML at @RealAAAI. Please share! 🔁 👇👇👇 april-tools.github.io/colorai/
Excited that our work on scaling mixture-of-experts was accepted to #NeurIPS2024. New in this version: we extend the architecture to language models and show how factorization helps specialization. Check it out: ⬇️
📣New paper: Can you encourage your Mixture-of-Experts layer to learn true "experts"? Increasing the # of experts leads to specialization, but the computational cost is prohibitive. ⛔ Site: eecs.qmul.ac.uk/~jo001/MMoE/ Arxiv: arxiv.org/abs/2402.12550 Code: github.com/james-oldfield… 🧵1/n
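To make the prohibitive-cost point concrete, here is a naive soft-MoE sketch in which every expert runs on every token, so compute and parameters grow linearly with the expert count. This is the baseline bottleneck, not the paper's factorized formulation; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Naive soft mixture-of-experts: all experts run on all tokens, so cost
    scales linearly with n_experts (the bottleneck factorization avoids)."""

    def __init__(self, d, n_experts, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_hidden), nn.GELU(), nn.Linear(d_hidden, d))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                    # x: (batch, d)
        w = torch.softmax(self.gate(x), dim=-1)              # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, d)
        return (w.unsqueeze(-1) * outs).sum(1)               # gated sum of experts
```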
Presenting our recent paper with @gbouritsas at #ICML2024! See you on Thursday, July 25, from 1:30 PM to 3 PM in Hall C, 4-9, #815 to discuss our work.
What do different contrastive learning (CL) losses actually optimize for? In our #ICML2024 paper, we provide a theoretical analysis and propose two loss functions that outperform conventional CL losses. Full paper here: arxiv.org/abs/2405.18045 w/@gbouritsas A thread 🧵
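As a concrete reference point, the standard InfoNCE objective is the kind of conventional CL loss such analyses start from (a textbook baseline, not one of the paper's proposed losses):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE: each pair (z1[i], z2[i]) embeds two views of the
    same input; all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)         # matched pairs are positives
```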
#ICML2024 Heading to Vienna now and can’t wait to see old and new friends there at ICML! We will present 3 research papers, one about adversarial robustness of conformal prediction and the other two about robust multimodal learning. Drop by at our posters and have a chat!
[1/4] Introducing "A Primer on the Inner Workings of Transformer-based Language Models", a comprehensive survey of interpretability methods and of the findings about how language models work that they have led to. ArXiv: arxiv.org/pdf/2405.00208