Shengyu Feng
@ShawnSYFeng
PhD student @LTIatCMU
We have a very poor understanding of why deep neural networks like transformer models learn the parameters they learn. For example, in the paper below from 2013, the authors demonstrated that 5% of the weights of a trained deep neural network can be used to predict the values of…
We are excited to announce that @shengjia_zhao will be the Chief Scientist of Meta Superintelligence Labs! Shengjia is a brilliant scientist who most recently pioneered a new scaling paradigm in his research. He will lead the scientific direction of our team. Let's go 🚀
RL+LLM researchers actively use the entropy of the LLM's token distribution to measure training dynamics. This number is misleading. John von Neumann and Lev Landau gave us the correct answer 100 years ago while studying mixed quantum states in Hilbert spaces. Usual entropy treats all tokens as…
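A minimal sketch of the distinction (my own illustration with made-up toy values, not from the thread): Shannon entropy over token probabilities treats every pair of tokens as perfectly distinguishable, while the von Neumann entropy of a density matrix built from probability-weighted token embeddings discounts tokens whose embeddings overlap.

```python
import numpy as np

# Toy setup (hypothetical values): 4 tokens with unit-norm embeddings in R^3.
p = np.array([0.4, 0.3, 0.2, 0.1])             # token probabilities
E = np.random.default_rng(0).normal(size=(4, 3))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm embedding per token

# Shannon entropy: treats all tokens as perfectly distinguishable (orthogonal).
shannon = -np.sum(p * np.log(p))

# Von Neumann entropy: S = -Tr(rho log rho), rho = sum_i p_i |e_i><e_i|.
rho = (p[:, None, None] * (E[:, :, None] * E[:, None, :])).sum(axis=0)
evals = np.linalg.eigvalsh(rho)
evals = evals[evals > 1e-12]                    # drop numerical zeros
von_neumann = -np.sum(evals * np.log(evals))

# Known inequality: S(rho) <= H(p), equal iff the embeddings are orthogonal.
print(shannon, von_neumann)
```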
Beautiful @GoogleResearch paper. LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change. That behavior looks impossible if learning always means gradient descent. The mechanisms through which this…
ai for math workshop papers released, it's a fun batch🚀 openreview.net/group?id=ICML.…
Will present two papers at #icml25! Happy to chat! Main (7/16): Regularized Langevin Dynamics for Combinatorial Optimization. (icml.cc/virtual/2025/p…) AI4MATH Workshop (7/18): A Comprehensive Evaluation of Contemporary ML-based Solvers for Combinatorial Optimization.
In physics there is an elegant method for computing correlation functions, called the generating function. The idea is simple: instead of computing correlators one by one, you define a function of an auxiliary parameter and compute the average of that new function. Individual correlators…
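As a worked example (standard textbook form, my addition, not from the thread): for a single random variable x, average e^{Jx} over the distribution; every moment then falls out by differentiating at J = 0, and log Z generates the connected correlators.

```latex
\[
Z(J) \;=\; \left\langle e^{Jx} \right\rangle
      \;=\; \sum_{n=0}^{\infty} \frac{J^{n}}{n!}\,\langle x^{n}\rangle ,
\qquad
\langle x^{n}\rangle \;=\; \left.\frac{\mathrm{d}^{n} Z}{\mathrm{d}J^{n}}\right|_{J=0},
\qquad
\langle x^{n}\rangle_{c} \;=\; \left.\frac{\mathrm{d}^{n} \log Z}{\mathrm{d}J^{n}}\right|_{J=0}.
\]
```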
Given an image of a car and a caption of a horse, will VLMs recognize the corresponding unimodal information? No, we show they usually struggle with conflicting inputs. We look into their internal representations to understand why and how this happens, and find that we can…
Check out our new paper: “How Do Vision-Language Models Process Conflicting Information Across Modalities?”! Vision-language models often struggle with conflicting inputs - we show how their internal representations and key attention heads reveal when and how this happens, and…
+1 for "context engineering" over "prompt engineering". People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window…
I really like the term “context engineering” over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.
hey all, couple quick notes: 1) yes, we will be joining Meta. 2) no, we did not get 100M sign-on, that's fake news. Excited about what's ahead though, will share more in due time! cc @__kolesnikov__ and @XiaohuaZhai.
Understanding graph theory will seriously enhance your engineering skills; you absolutely must be familiar with it. Here's a graph theory quickstart, in collaboration with @alepiad. Read on:
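Not from the thread itself, but a minimal Python sketch of the first thing such a quickstart typically covers: representing a graph as an adjacency list and walking it breadth-first.

```python
from collections import deque

# A graph as an adjacency list: node -> list of neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

def bfs(graph, start):
    """Visit nodes in breadth-first order from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

print(bfs(graph, "a"))  # ['a', 'b', 'c', 'd']
```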
🎨 Gemini 2.5 tech report just dropped! So proud to have led the development of RL*F (Reinforcement Learning from Human and Critic Feedback) - our breakthrough in AI training inspired by... art school crits? Here's the thing: How do you teach taste? Style? Things without clear…
Can LLMs solve PDEs? 🤯 We present CodePDE, a framework that uses LLMs to automatically generate PDE solvers and outperforms human implementations! 🚀 CodePDE demonstrates the power of inference-time algorithms and scaling for PDE solving. More in 🧵: #ML4PDE #AI4Science
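For flavor, here is the kind of solver an LLM might be asked to generate: an explicit finite-difference scheme for the 1D heat equation u_t = u_xx. This is my own illustrative sketch, not CodePDE's actual output or API.

```python
import numpy as np

# 1D heat equation u_t = u_xx on [0, 1] with zero boundary values.
nx, nt = 101, 5000
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2            # explicit scheme is stable for dt <= 0.5 * dx^2
x = np.linspace(0.0, 1.0, nx)
u = np.sin(np.pi * x)       # initial condition

for _ in range(nt):
    # Update interior points; boundaries stay fixed at zero.
    u[1:-1] += dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])

# Exact solution decays as exp(-pi^2 t); report the max numerical error.
t_final = nt * dt
print(np.abs(u - np.exp(-np.pi**2 * t_final) * np.sin(np.pi * x)).max())
```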
read the first letter of every name in the gemini contributors list
Research with amazing collaborators @JizeJiang, @MeitangLi, and @JingchengYang, guided by great advisors and supported by the generous help of talented researchers @BowenJin13, @XingyuFu2, and many open-source contributors (easyr1, verl, vllm... etc).
Excited to introduce VTool-R1! We’ve trained VLMs to “think visually” using RL, blending Python-based 🖼️visual edits with💡textual Chain-of-Thought reasoning. Our trained qwen2.5-VL-32B surpasses GPT-4o on ChartQA & TableVQA, and even the compact qwen2.5-VL-7B significantly…
o3-pro is the slowest and most overthinking model. A simple 'Hi' cost me $80. 🥲
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 🧵1/9
🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵