Jiahai Feng
@feng_jiahai
PhD student @berkeley_ai | Previously MIT
Thrilled to announce our symposium, Cognitively Inspired Interpretability in Large Neural Networks, at #CogSci2025 featuring @TaylorWWebb, Ellie Pavlick, @feng_jiahai, Gustaw Opielka, Claire Stevenson, and @IbanDlank!
Past work has shown that world state is linearly decodable from LMs trained on text and games like Othello. But how do LMs *compute* these states? We investigate state tracking using permutation composition as a model problem, and discover interpretable, controllable procedures🧵
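For readers new to the setup, here is a minimal Python sketch of the task itself: tracking the arrangement of items under a sequence of permutations. It does not show how an LM computes the state, and it is not the authors' code. The names `apply`, `compose`, and `track`, the use of 5 items (the group S5), and the convention for applying a permutation are all assumptions chosen for illustration.

```python
import random
from functools import reduce

Perm = tuple[int, ...]

def apply(state: tuple, perm: Perm) -> tuple:
    """Assumed convention: new position i holds the item previously at position perm[i]."""
    return tuple(state[j] for j in perm)

def compose(p: Perm, q: Perm) -> Perm:
    """Single permutation equivalent to applying p first, then q."""
    return tuple(p[j] for j in q)

def track(initial: tuple, perms: list[Perm]) -> tuple:
    """Step-by-step state tracking: apply each permutation in order."""
    state = initial
    for p in perms:
        state = apply(state, p)
    return state

if __name__ == "__main__":
    rng = random.Random(0)
    n = 5  # hypothetical choice: permutations of 5 items (S5)
    initial = tuple("ABCDE")
    perms = [tuple(rng.sample(range(n), n)) for _ in range(10)]
    # The final state depends only on the composition of all permutations.
    identity = tuple(range(n))
    total = reduce(compose, perms, identity)
    assert track(initial, perms) == apply(initial, total)
    print(track(initial, perms))
```

The check at the end shows why this is a state-tracking problem: the final arrangement is determined by the composition of all the permutations seen so far, so a model must effectively maintain that running composition.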
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
Can we find circuits directly from a model’s params? At Neurips I’m presenting work on understanding how attn heads in LMs communicate by analyzing their weights. We find a lot of interesting things, like a 3D subspace that controls which index in a list to attend to!
🎶 We can often tell if a song is fast or slow or if a note is high or low. Music generation models are trained on vast amounts of data, but do they pick up music theory concepts like we do? 🎶 Our #ISMIR2024 paper explores this question! arxiv.org/abs/2410.00872 @ISMIRConf
Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and steering them in the public interest. Read a letter from the co-founders Jacob Steinhardt and Sarah Schwettmann: transluce.org/introducing-tr…
Excited to give a talk about interpretability via reverse engineering tmrw at @eccvconf!
Less than three days to go for the eXCV Workshop at #ECCV2024! Join us on Sunday from 14:00-18:00 in Brown 1 to hear about the state of XAI research from an exciting lineup of speakers! @orussakovsky, @vidal_rene, @sunniesuhyoung, @YGandelsman, @zeynepakata @eccvconf (1/4)
Some thoughts on how to think about "world models" in language models and beyond: lingo.csail.mit.edu/blog/world_mod…