Yutong Bai
@YutongBAI1002
Postdoc @berkeley_ai, EECS Rising Star, Apple Scholar, Prev Intern @GoogleAI Brain @MetaAI (FAIR Labs), Prev CS PhD @JHUCompSci
What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to…
Super bullish on this research. We already have massive amounts of egocentric video data from AR glasses—and with costs dropping and adoption rising, even more is coming. On their own, these videos mainly help VLM training. But when paired with whole-body pose trajectories, as…
Thread on the new paper: The Serial Scaling Hypothesis. Joint work with: @phizaz, @YutongBAI1002, Kananart
Some problems can’t be rushed—they can only be done step by step, no matter how many people or processors you throw at them. We’ve scaled AI by making everything bigger and more parallel: Our models are parallel. Our scaling is parallel. Our GPUs are parallel. But what if the…
🧠How “old” is your model? Put it to the test with the KiVA Challenge: a new benchmark for abstract visual reasoning, grounded in real developmental data from children and adults. 🏆Prizes: 🥇$1K to the top model 🥈🥉$500 📅Deadline: 10/7/25 🔗kiva-challenge.github.io #ICCV2025
This year we have: 1) New "unified tracks": joint object/point tracking, joint action/sound localisation, and unified MC VideoQA; 2) A VLM interpretability track; 3) 2 guest tracks: KiVA image understanding (kiva-challenge.github.io) and Physics-IQ video (physics-iq.github.io/workshop/physi…)
> world model from an embodied agentic perspective I think this is one of only two ways to build true world models tbh (the other being low-level physical simulations, more speculative and completely inhuman). Much more interesting than generic videogen “world models”.
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
My wheelhouse is humanoid hardware, not AI. But good h/w is of no use without a capable AI. I can tell a good paper when I see it. While today's h/w is not perfect, it is near ready. This paper shows we are getting closer to the AI to embody that h/w.
10. Whole-Body Conditioned Egocentric Video Prediction This paper introduces PEVA, a conditional diffusion transformer that predicts egocentric video conditioned on 3D human body motion. x.com/YutongBAI1002/…
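The paper's own code isn't shown here, but the core idea—a diffusion transformer whose video denoising is conditioned on 3D body motion—can be sketched minimally. Everything below (dimensions, the single-token conditioning scheme, the plain single-head attention) is an illustrative assumption, not PEVA's actual architecture:

```python
# Hypothetical sketch: condition frame denoising on a body-pose vector by
# embedding the pose as an extra token and letting frame tokens attend to it.
import numpy as np

rng = np.random.default_rng(0)

def attention(tokens):
    # Single-head self-attention (projections omitted for brevity).
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ tokens

d_model = 16
pose = rng.normal(size=(1, 48))         # assumed 3D body-pose parameters
frames = rng.normal(size=(4, d_model))  # noised egocentric frame tokens

w_pose = rng.normal(size=(48, d_model)) # linear embedding of the pose
pose_tok = pose @ w_pose

# Prepend the pose token so every frame token can attend to the action.
seq = np.concatenate([pose_tok, frames], axis=0)
out = attention(seq)

denoised = out[1:]  # drop the condition token, keep predicted frame tokens
print(denoised.shape)  # (4, 16)
```

The point of the sketch is only the conditioning pathway: the action signal enters the sequence as tokens, so the denoiser's prediction depends on how the body moved.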
Inspiring work :)
I fully agree. One example I really like that shows the advantage of whole-body movement is this picture: a woman is trying to reach an object on a shelf, and to do so she has to tiptoe, hold onto the shelf with one arm, and use the other arm to reach. It is not…
Fascinating direction. Curious how the model scales with higher environmental noise and task novelty in validation settings.
some interesting things from the paper
Thanks to AK for featuring our work. This is our initial attempt to ground the World Model in a real embodied agent in the real world. How to model the embodied complexity is crucial: 1) A real, physically grounded and complex action space—not just abstract control signals. 2)…
Whole-Body Conditioned Egocentric Video Prediction
Important for virtual beings in Holodecks or robotics. Love reading AI research because it explains how the future will work.