Pengfei Liu
@stefan_fee
Associate Prof. at SJTU, leading GAIR Lab (http://plms.ai). Co-founder of Inspired Cognition. Postdoc at @LTIatCMU. Previously FNLP, @MILAMontreal,
The Alpaca moment of Large Multimodal Models! Can we build native LMMs just like Llama for simple multimodal generation? Introducing Anole: the first open-source, autoregressive native LMM for multimodal generation. Building on Chameleon by @AIatMeta: github.com/GAIR-NLP/anole

RepoST was accepted to @COLM_conf !!! See you in Montreal 🚀 #COLM2025
How to construct repo-level coding environments in a scalable way? Check out RepoST: an automated framework to construct repo-level environments using Sandbox Testing (repost-code-gen.github.io). Models trained with RepoST data can generalize well to other datasets (e.g., RepoEval).
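For intuition, here is a rough, hypothetical sketch of the sandbox-testing idea: copy a focal function and its repo context into an isolated script, attach tests, and run them in a throwaway process. The helper names (`build_sandbox`, `run_sandbox`) are illustrative, not RepoST's actual interface.

```python
# Hypothetical illustration of "sandbox testing" for repo-level environments.
# Not RepoST's real API; just the shape of the idea.
import subprocess
import tempfile
from pathlib import Path

def build_sandbox(context_code: str, focal_code: str, test_code: str) -> Path:
    """Write repo context + focal function + tests into one self-contained file."""
    sandbox = Path(tempfile.mkdtemp()) / "sandbox_test.py"
    sandbox.write_text("\n\n".join([context_code, focal_code, test_code]))
    return sandbox

def run_sandbox(sandbox: Path, timeout: int = 30) -> bool:
    """Execute the sandboxed tests in a subprocess; exit code 0 means the environment is usable."""
    proc = subprocess.run(
        ["python", "-m", "pytest", str(sandbox), "-q"],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0
```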
FacTool has been accepted to COLM 2025 - two years after its arXiv debut! While the landscape of LLMs has changed a lot since then, tool-augmented LLMs and RAG are still among the most effective and practical approaches for detecting / mitigating hallucinations (ref:…
In the era of 🤖#GenerativeAI, text of all forms can be generated by LLMs. How can we identify and rectify *factual errors* in the generated output? We introduce FacTool, a framework for factuality detection in Generative AI. Website: ethanc111.github.io/factool_websit… (1/n)
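For readers curious about the mechanics, a minimal sketch of a tool-augmented factuality check in the FacTool spirit (claim extraction, then evidence retrieval, then verification) might look like the following. `call_llm` and `web_search` are placeholders, not FacTool's actual API.

```python
# Minimal, hypothetical claim-verification pipeline: extract claims, retrieve
# evidence with a tool, and let an LLM judge each claim against the evidence.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hosted or local model)."""
    raise NotImplementedError

def web_search(query: str, k: int = 3) -> list[str]:
    """Placeholder for a search tool returning k evidence snippets."""
    raise NotImplementedError

def check_factuality(generated_text: str) -> list[dict]:
    # 1) Break the generated output into atomic, checkable claims.
    claims = call_llm(
        f"List each verifiable factual claim in the text, one per line:\n{generated_text}"
    ).splitlines()

    results = []
    for claim in filter(None, map(str.strip, claims)):
        # 2) Retrieve external evidence for the claim (the "tool" step).
        evidence = web_search(claim)
        # 3) Judge the claim against the evidence.
        verdict = call_llm(
            "Given the evidence below, answer SUPPORTED or REFUTED and explain briefly.\n"
            f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
        )
        results.append({"claim": claim, "verdict": verdict, "evidence": evidence})
    return results
```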
Blog - abinesh-mathivanan.vercel.app/en/posts/short… Read 'OctoThinker' last week and it's so cool. Great work by @SinclairWang1 @FaZhou_998 @stefan_fee
Tech history: Every time humanity hits a tech wall, we just wait for someone named Ilya to show up and save the world :) - Neural nets stuck? - Language models plateau? - ... (skip tons of stuff) - ... - Superintelligence coming?
We don’t have AI that self-improves yet, and when we do it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a “fast takeoff”, but rather extremely gradual across many years, probably a decade. The first thing to know is that…
What foundation models do we REALLY need for the RL era? And what pre-training data? Excited to share our work: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling arxiv.org/pdf/2506.20512 ✨ Key breakthroughs: - First RL-focused mid-training approach - Llama…
What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”: (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?…
nice discussion
🧵Interesting paper—great to see the emphasis on large token counts, which is always appreciated. 😅But some of the results are... puzzling. For example, Table 3 essentially suggests that MegaMath is a non-math corpus. This is weird, especially given the care we've taken during…
The real breakthrough isn't better AI—it's breaking free from nature's constraints We're witnessing a paradigm shift from "passive adaptation" to "active construction" in AI training. 🌊 The old way: AI learns from whatever data naturally exists • Constrained by existing…
📑Interesting paper by the GAIR community: Thinking with Generated Images🔥 enables a single large multimodal model to generate and reason with visual thoughts, greatly improving its ability to tackle complex vision and multimodal tasks. huggingface.co/papers/2505.22…
312 quality trajectories + open-source model beats Claude 3.7 Sonnet (thinking) in computer use 🚀 We answer the following important questions in our recent tech report: github.com/GAIR-NLP/PC-Ag… 1. Can open-source models + small high-quality datasets outperform top closed-source…
🔥 Excited to share our work "Efficient Agent Training for Computer Use" Q: Do computer use agents need massive data or complex RL to excel? A: No, with just 312 high-quality trajectories, Qwen2.5-VL can outperform Claude 3.7, setting a new SOTA for Windows computer use. 1/6
📣 New Discovery on Computer Use Agent With just 312 high-quality trajectories + open-source model, we've surpassed Claude 3.7 Sonnet (thinking) in computer use capabilities 🚀 ⚡️ In the new era of AI Agent training, many key questions remain: • Can open-source models + small…
Excited to share PC Agent-E, our new work on efficient agent training for computer use! Trained with only❗️312 human trajectories enhanced by Claude 3.7 Sonnet, PC Agent-E achieves a 🤯 141% relative improvement and even surpasses Claude 3.7 Sonnet (thinking)!
This is for you @AIatMeta
The Llama team must read the OctoThinker Notion report ASAP if they want to make reasoner models that aren't DOA before LlamaCon. There's still time. With their GPU largesse they can do it.
We are sharing this progress report now at the booth 260 poster in Hall 3 of the ICLR venue.
🚨New blog alert! Working on LLM x RL? You don’t want to miss this. Most SOTA RL results today rely on Qwen2.5 base models, but swap in Llama at the same model size and RL training dynamics shift drastically—RL from base often fails. Why? We ran a series of carefully controlled…
🔥 Introducing ToRL: Scaling Tool-Integrated RL directly from base models! LLMs discover optimal reasoning+tool strategies with no presets. ToRL-7B hits 43.3% on AIME24, +14% over no-tool RL, +17% over Qwen-TIR. 📝: arxiv.org/abs/2503.23383… 💻: github.com/GAIR-NLP/ToRL 1/9
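As a rough illustration of what a tool-integrated rollout can look like (the kind of trajectory ToRL optimizes with RL), here is a hypothetical sketch: the model interleaves reasoning with executable code blocks, each block is run, and the interpreter output is fed back before generation continues. `generate`, `run_python`, and the code-block convention are assumptions, not ToRL's exact interface.

```python
# Hypothetical tool-integrated rollout: alternate model generation and code
# execution until a final answer appears. Reward in RL training would come from
# checking the final answer against ground truth (verifiable reward).
import re
import subprocess
import tempfile

FENCE = "`" * 3  # three backticks delimit a code block in the model's output
CODE_RE = re.compile(re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE), re.DOTALL)

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a code snippet in a separate process and capture stdout/stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def rollout(generate, question: str, max_turns: int = 4) -> str:
    """Interleave model continuations with interpreter feedback."""
    context = question
    for _ in range(max_turns):
        completion = generate(context)      # model continues the trajectory
        context += completion
        blocks = CODE_RE.findall(completion)
        if not blocks:                      # no tool call -> treat as final answer
            break
        output = run_python(blocks[-1])     # execute the latest code block
        context += f"\n[interpreter output]\n{output}\n"
    return context
```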