James Oldfield
@jamesaoldfield
PhD student interested in interpretability and AI safety @ QMUL. Visiting student @ Oxford. Prev visiting @ UW-Madison
Sparse MLPs/dictionaries learn interpretable features in LLMs, yet provide poor layer reconstruction. Mixture of Decoders (MxDs) expand dense layers into sparsely activating sublayers instead, for a more faithful decomposition! 📝 arxiv.org/abs/2505.21364 [1/7]
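A minimal sketch of the core idea being announced: one dense layer expanded into many small decoder sublayers, of which only k activate per token. The class name, shapes, and top-k gating below are illustrative assumptions, not the paper's exact MxD parameterization.

```python
import torch
import torch.nn as nn

class SparseMixtureOfSublayers(nn.Module):
    """Toy sketch: replace one dense layer with n_experts small decoders,
    of which only k fire per token (illustrative, not the paper's exact MxD)."""

    def __init__(self, d_in, d_out, n_experts=512, k=8):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)  # scores each sublayer per token
        self.decoders = nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)
        self.k = k

    def forward(self, x):                              # x: (batch, d_in)
        scores = self.gate(x)                          # (batch, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)       # keep k sublayers per token
        weights = torch.softmax(topv, dim=-1)          # (batch, k)
        W = self.decoders[topi]                        # (batch, k, d_in, d_out)
        y = torch.einsum('bi,bkio->bko', x, W)         # each selected sublayer's output
        return (weights.unsqueeze(-1) * y).sum(dim=1)  # (batch, d_out)
```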
‼️How well do steering vectors work? When do they fail, and why? ✅We evaluate steering methods and provide theoretical results explaining when and why they fail. Paper: arxiv.org/abs/2502.02716 (w/ @SharonYixuanLi) [1/n]
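For context, a minimal sketch of the generic activation-addition recipe such evaluations target: add a fixed direction to one layer's output at inference time. The `steer` helper, layer choice, and `alpha` scale are illustrative assumptions, not this paper's evaluation protocol.

```python
import torch

def steer(layer, direction, alpha=4.0):
    """Add alpha * direction to this layer's output on every forward pass
    (generic activation-addition sketch; not the paper's exact setup)."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * direction.to(device=h.device, dtype=h.dtype)
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return layer.register_forward_hook(hook)

# usage (hypothetical layer/vector names):
#   handle = steer(model.transformer.h[12], steering_vector, alpha=6.0)
#   ... generate text ...
#   handle.remove()  # restore the unsteered layer
```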
New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵
Looking forward to speaking with folks thinking about architecture design for interpretability at #NeurIPS2024 next week. Feel free to drop by our poster #3003 on scaling MoEs' expert specialization on Friday 13th @ 4:30pm! arxiv.org/abs/2402.12550

Two more weeks to submit your work on tensors/low-rank factorizations to the workshop:
Less than two weeks to submit your papers on:
📈 #lowrank adapters and #factorizations
🧊 #tensor networks
🔌 probabilistic #circuits
🎓 #theory of factorizations
to the first workshop connecting them in #AI #ML at @RealAAAI. Please share! 🔁 👇👇👇 april-tools.github.io/colorai/
Excited that our work on scaling mixture-of-experts was accepted to #NeurIPS2024. New in this version: we extend the architecture to language models and show how factorization helps specialization. Check it out: ⬇️
📣New paper: Can you encourage your Mixture-of-Experts layer to learn true "experts"? Increasing the # of experts leads to specialization, but the computational cost is prohibitive. ⛔ Site: eecs.qmul.ac.uk/~jo001/MMoE/ Arxiv: arxiv.org/abs/2402.12550 Code: github.com/james-oldfield… 🧵1/n
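To make the prohibitive-cost point concrete, here is a naive soft-MoE sketch in which every expert runs on every token, so compute and parameters grow linearly with the expert count. This is the baseline bottleneck, not the paper's factorized formulation; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Naive soft mixture-of-experts: all experts run on all tokens, so cost
    scales linearly with n_experts (the bottleneck factorization avoids)."""

    def __init__(self, d, n_experts, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_hidden), nn.GELU(), nn.Linear(d_hidden, d))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                    # x: (batch, d)
        w = torch.softmax(self.gate(x), dim=-1)              # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, d)
        return (w.unsqueeze(-1) * outs).sum(1)               # gated sum of experts
```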
Presenting our recent paper with @gbouritsas at #ICML2024! See you on Thursday, July 25, from 1:30 PM to 3 PM in Hall C, 4-9, #815 to discuss our work.
What do different contrastive learning (CL) losses actually optimize for? In our #ICML2024 paper, we provide a theoretical analysis and propose two loss functions that outperform conventional CL losses. Full paper here: arxiv.org/abs/2405.18045 w/@gbouritsas A thread 🧵
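As a concrete reference point, the standard InfoNCE objective is the kind of conventional CL loss such analyses start from (a textbook baseline, not one of the paper's proposed losses):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE: each pair (z1[i], z2[i]) embeds two views of the
    same input; all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)         # matched pairs are positives
```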
#ICML2024 Heading to Vienna now and can’t wait to see old and new friends there at ICML! We will present 3 research papers, one about adversarial robustness of conformal prediction and the other two about robust multimodal learning. Drop by at our posters and have a chat!
[1/4] Introducing "A Primer on the Inner Workings of Transformer-based Language Models", a comprehensive survey of interpretability methods and of the findings about how language models work that they have led to. ArXiv: arxiv.org/pdf/2405.00208