Yutong Bai
@YutongBAI1002
Postdoc @berkeley_ai, EECS Rising Star, Apple Scholar, Prev Intern @GoogleAI Brain @MetaAI (FAIR Labs), Prev CS PhD @JHUCompSci
What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to…
Super bullish on this research. We already have massive amounts of egocentric video data from AR glasses—and with costs dropping and adoption rising, even more is coming. On their own, these videos mainly help VLM training. But when paired with whole-body pose trajectories, as…
Thread on the new paper: The Serial Scaling Hypothesis. Joint work with: @phizaz, @YutongBAI1002, Kananart
Some problems can’t be rushed—they can only be done step by step, no matter how many people or processors you throw at them. We’ve scaled AI by making everything bigger and more parallel: Our models are parallel. Our scaling is parallel. Our GPUs are parallel. But what if the…
🧠How “old” is your model? Put it to the test with the KiVA Challenge: a new benchmark for abstract visual reasoning, grounded in real developmental data from children and adults. 🏆Prizes: 🥇$1K to the top model 🥈🥉$500 📅Deadline: 10/7/25 🔗kiva-challenge.github.io #ICCV2025
This year we have: 1) New "unified tracks": joint object/point tracking, joint action/sound localisation, and unified MC VideoQA; 2) A VLM interpretability track; 3) 2 guest tracks: KiVA image understanding (kiva-challenge.github.io) and Physics-IQ video (physics-iq.github.io/workshop/physi…)
> world model from an embodied agentic perspective I think this is one of only two ways to build true world models tbh (the other being low-level physical simulations, more speculative and completely inhuman). Much more interesting than generic videogen “world models”.
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
My wheelhouse is humanoid hardware, not AI. But good h/w is of no use without a capable AI. I can tell a good paper when I see it. While today's h/w is not perfect, it is near ready. This paper shows we are getting closer to the AI to embody that h/w.
10. Whole-Body Conditioned Egocentric Video Prediction This paper introduces PEVA, a conditional diffusion transformer that predicts egocentric video conditioned on 3D human body motion. x.com/YutongBAI1002/…
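The paper's own code isn't shown here, but the core idea—a diffusion transformer whose video denoising is conditioned on 3D body motion—can be sketched minimally. Everything below (dimensions, the single-token conditioning scheme, the plain single-head attention) is an illustrative assumption, not PEVA's actual architecture:

```python
# Hypothetical sketch: condition frame denoising on a body-pose vector by
# embedding the pose as an extra token and letting frame tokens attend to it.
import numpy as np

rng = np.random.default_rng(0)

def attention(tokens):
    # Single-head self-attention (projections omitted for brevity).
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ tokens

d_model = 16
pose = rng.normal(size=(1, 48))         # assumed 3D body-pose parameters
frames = rng.normal(size=(4, d_model))  # noised egocentric frame tokens

w_pose = rng.normal(size=(48, d_model)) # linear embedding of the pose
pose_tok = pose @ w_pose

# Prepend the pose token so every frame token can attend to the action.
seq = np.concatenate([pose_tok, frames], axis=0)
out = attention(seq)

denoised = out[1:]  # drop the condition token, keep predicted frame tokens
print(denoised.shape)  # (4, 16)
```

The point of the sketch is only the conditioning pathway: the action signal enters the sequence as tokens, so the denoiser's prediction depends on how the body moved.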
Inspiring work :)
I fully agree. One example I really like that shows the advantage of whole-body movement is this picture: a woman is trying to reach an object on a shelf, and to do so she has to tiptoe, hold onto the shelf with one arm, and use the other arm to reach. It is not…
Fascinating direction. Curious how the model scales with higher environmental noise and task novelty in validation settings.
some interesting things from the paper
Thanks to AK for featuring our work. This is our initial attempt to ground the World Model in a real embodied agent in the real world. How to model the embodied complexity is crucial: 1) A real, physically grounded and complex action space—not just abstract control signals. 2)…
Whole-Body Conditioned Egocentric Video Prediction
Important for virtual beings in Holodecks or robotics. Love reading AI research because it explains how the future will work.