Chuanyang Jin
@chuanyang_jin
PhD @JohnsHopkins | Intern @AIatMeta FAIR ⏰ Past: @MITCoCoSci & @MIT_CSAIL & @nyuniversity
I am so flattered that our paper “MMToM-QA: Multimodal Theory of Mind Question Answering” won the Outstanding Paper Award at #ACL2024 @aclmeeting. Huge thanks to all my amazing collaborators!!
Can machines understand people’s minds from multimodal inputs? We introduce a comprehensive benchmark: “MMToM-QA: Multimodal Theory of Mind Question Answering” 📜 arxiv.org/abs/2401.08743
Heading to ICML to present our work Rejecting Instruction Preferences (RIP) for better data curation and synthesis on Wed 07/16 (4:30pm - 7:00pm)! Excited to connect with folks interested in synthetic data, reasoning, RL, and anything in general @FAIR. #ICML2025
💀 Introducing RIP: Rejecting Instruction Preferences 💀 A method to *curate* high-quality data, or *create* high-quality synthetic data. Large performance gains across benchmarks (AlpacaEval2, Arena-Hard, WildBench). Paper 📄: arxiv.org/abs/2501.18578
🔍 WM-ABench: a new benchmark for world models. WM-ABench reveals that current VLMs lack a disentangled understanding of physical concepts and the foundational knowledge needed for next-state prediction, and it provides a fine-grained checklist to help close that gap.
🤔 Have @OpenAI o3, Gemini 2.5, and Claude 3.7 formed an internal world model to understand the physical world, or do they just align pixels with words? We introduce WM-ABench, the first systematic evaluation of VLMs as world models. Using a cognitively inspired framework, we test 15 SOTA…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation: "We introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660…
Join us tomorrow! 🗓️ June 21 | 8:50 AM – 12:30 PM PT 📍 USC (OHE 132) & Zoom (wse.zoom.us/j/95095685281)
The #RSS2025 Workshop on Continual Robot Learning from Humans is happening on June 21. We have an amazing lineup of speakers discussing how we can enable robots to continuously acquire new skills and knowledge from humans. Join us in person and on Zoom (info on our website)!
Existing robot-manipulation benchmarks stop at object-level tasks, missing the part-level semantics essential for fine-grained control. Very excited to see PartInstruct, which finally fills this gap with a large-scale dataset for training and evaluating precise, long-horizon,…
🚀 New robot manipulation benchmark! How can we teach robots to reason about and interact with the relevant object parts for a given fine-grained manipulation task? To address this challenge, our #RSS2025 paper introduces PartInstruct, the first large-scale benchmark for fine-grained…
🚀 Excited to introduce SimWorld: an embodied simulator for infinite photorealistic world generation 🏙️ populated with diverse agents 🤖 If you are at #CVPR2025, come check out the live demo 👇
Jun 14, 12:00-1:00 pm, JHU booth, ExHall B
Jun 15, 10:30 am-12:30 pm, #7, ExHall B
🚨 Announcing the RAM 2 workshop @ COLM25 - call for papers 🚨 10 years on, we present the sequel to the classic RAM 🐏 (Reasoning, Attention, Memory) workshop that took place in 2015, at the cusp of major change in the area. Now, in 2025, we reflect on what's happened and discuss the…
Check out this exciting workshop on continual learning from humans at RSS 2025 in LA! I am happy to be speaking and will share our work on observational learning through visual imitation of humans.
Excited to announce the 1st Workshop on Continual Robot Learning from Humans @ #RSS2025 in LA! We're bringing together interdisciplinary researchers to explore how robots can continuously learn through human interactions! Full details: …-robot-learning-from-humans.github.io @RoboticsSciSys
Human-AI cooperation is an important problem, but many existing papers focus on training agents in the same 5 fixed Overcooked layouts, and use population-based training (PBT) to try to cover the diversity of human partner strategies. Diving into this problem, we find that…
Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity. Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks. shorturl.at/fqsNN 🧵
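To make the training recipe in this thread concrete, here is a minimal, self-contained sketch (my own illustration, not the paper's code or environments): self-play with one shared policy across many randomly generated layouts of a toy "lever coordination" game, then zero-shot evaluation on fresh layouts with a scripted human-like partner. The game, the REINFORCE update, and the "pick the best lever" partner are all assumptions made for illustration.

```python
# Toy illustration of "environment diversity" in cooperative self-play
# (hypothetical example, not the paper's code). Two copies of one shared policy
# play a "lever coordination" game: reward = the chosen lever's payoff, but only
# if both players pick the same lever. Training samples a brand-new random
# layout (payoff vector) every episode; evaluation uses novel layouts and a
# scripted human-like partner that picks the highest-payoff lever.
import numpy as np

rng = np.random.default_rng(0)
N_LEVERS = 5

def random_layout():
    """A 'layout' is just a fresh random payoff vector over the levers."""
    return rng.uniform(0.0, 1.0, size=N_LEVERS)

def policy_probs(w, payoffs):
    """One-parameter policy: score each lever by w * payoff, then softmax."""
    scores = w * payoffs
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def train_self_play(n_iters=2000, lr=0.5):
    """REINFORCE on the shared parameter w, with both players using the same policy."""
    w = 0.0
    for _ in range(n_iters):
        payoffs = random_layout()           # environment diversity: new layout each episode
        probs = policy_probs(w, payoffs)
        a1 = rng.choice(N_LEVERS, p=probs)  # self-play: both actions come from
        a2 = rng.choice(N_LEVERS, p=probs)  # the same shared policy
        reward = payoffs[a1] if a1 == a2 else 0.0
        # d/dw [log p(a1) + log p(a2)] for the softmax-of-linear-scores policy
        grad = (payoffs[a1] - probs @ payoffs) + (payoffs[a2] - probs @ payoffs)
        w += lr * reward * grad
    return w

def eval_with_human(w, n_eval=1000):
    """Zero-shot test: novel layouts, partnered with a 'pick the best lever' human proxy."""
    total = 0.0
    for _ in range(n_eval):
        payoffs = random_layout()
        agent = rng.choice(N_LEVERS, p=policy_probs(w, payoffs))
        human = int(np.argmax(payoffs))
        total += payoffs[agent] if agent == human else 0.0
    return total / n_eval

w = train_self_play()
print(f"learned w = {w:.2f}, avg reward with human-like partner on novel layouts = {eval_with_human(w):.3f}")
```

The only point of the sketch is the loop structure: diversity comes from sampling a new layout every episode under a single self-play policy, rather than from a population of training partners.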
📊 Summary of updates on the MMToM-QA leaderboard: chuanyangjin.com/mmtom-qa-leade…
- Recent LLMs with inference-time scaling (e.g., o3-mini) have significantly improved ToM performance but still fall short of human levels. Notably, they excel in belief questions but score below random on…
Check out our latest work on machine Theory of Mind: #AutoToM! We propose an approach that (1) combines the open-endedness of LLMs with the robustness of Bayesian models, and (2) leverages uncertainty to refine the model, achieving better performance while maintaining low compute.
How to achieve human-level open-ended machine Theory of Mind? Introducing #AutoToM: a fully automated and open-ended ToM reasoning method combining the flexibility of LLMs with the robustness of Bayesian inverse planning, achieving SOTA results across five benchmarks. 🧵[1/n]
Very excited to introduce AutoToM, our latest effort toward open-ended machine Theory of Mind. Given any context and ToM question, AutoToM automatically formulates a minimally sufficient probabilistic model to produce a confident inference of the target mental variable.
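For readers new to the Bayesian side of these tweets, below is a minimal, self-contained sketch of Bayesian inverse planning, the classic technique AutoToM builds on. It is not the AutoToM implementation, which automatically formulates the probabilistic model rather than assuming a fixed one; the 1D world, the two goal hypotheses, and the rationality parameter BETA are made-up illustrative choices.

```python
# Toy illustration of Bayesian inverse planning for goal inference
# (hypothetical example, not the AutoToM implementation). An agent walks on a
# 1D line of cells; given its observed moves, we compute a posterior over which
# goal cell it is heading to, assuming Boltzmann-rational (noisily optimal) actions.
import numpy as np

N_CELLS = 10        # cells 0 .. 9 on a line
GOALS = [0, 9]      # hypothesis space over the agent's goal
ACTIONS = [-1, +1]  # step left or right
BETA = 2.0          # rationality: higher = closer to a perfectly rational agent

def action_likelihood(pos, action, goal):
    """P(action | state, goal): softmax over how close each action gets the agent to the goal."""
    qs = np.array([-abs(int(np.clip(pos + a, 0, N_CELLS - 1)) - goal) for a in ACTIONS])
    probs = np.exp(BETA * qs) / np.exp(BETA * qs).sum()
    return probs[ACTIONS.index(action)]

def goal_posterior(start_pos, observed_actions):
    """Bayes rule over GOALS with a uniform prior; actions are conditionally independent given the goal."""
    log_post = np.zeros(len(GOALS))                 # log of a uniform prior (up to a constant)
    pos = start_pos
    for a in observed_actions:
        for i, g in enumerate(GOALS):
            log_post[i] += np.log(action_likelihood(pos, a, g))
        pos = int(np.clip(pos + a, 0, N_CELLS - 1)) # advance the observed state
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# The agent starts at cell 5 and moves right twice: the posterior should strongly favor goal 9.
post = goal_posterior(5, [+1, +1])
print({g: round(float(p), 3) for g, p in zip(GOALS, post)})
```

The sketch only performs inference in a hand-specified model; per the tweets above, AutoToM's contribution is to automatically decide what that model should contain for a given context and Theory of Mind question.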
🚀 Exciting to see how recent advances like OpenAI's o1/o3 & DeepSeek's R1 are pushing the boundaries! Check out our latest survey on Complex Reasoning with LLMs. We analyzed over 300 papers to chart the progress. Paper: arxiv.org/pdf/2502.17419 GitHub: github.com/zzli2022/Aweso…