Wenbo Hu
@gordonhu608
CS PhD Student @UCLA | Multimodal & Embodied AI & Spatial Intelligence | B.S. @UCSanDiego
🤔How can a 3D embodied AI agent maintain a long-term memory across dynamic spatial-temporal environment changes in complex tasks? 🚀Introducing 3DLLM-Mem, a memory-enhanced 3D embodied agent that incrementally builds and maintains a task-relevant long-term memory while it…
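A minimal, hypothetical sketch of the kind of incremental memory loop described above: the agent keeps appending embedded observations to a long-term bank and retrieves only the entries relevant to the current task. The class and function names are illustrative, not the 3DLLM-Mem API.

```python
# Minimal, hypothetical sketch of an incremental long-term memory for an
# embodied agent: store embedded observations, retrieve task-relevant ones.
# Names and shapes are illustrative, not the 3DLLM-Mem implementation.
import numpy as np

class LongTermMemory:
    def __init__(self, dim: int):
        self.keys = np.empty((0, dim))   # embedded observations
        self.values = []                 # raw observations / feature payloads

    def write(self, key: np.ndarray, value) -> None:
        """Incrementally append a new observation to the memory bank."""
        self.keys = np.vstack([self.keys, key[None, :]])
        self.values.append(value)

    def retrieve(self, query: np.ndarray, k: int = 3):
        """Return the k stored entries most similar to the current task query."""
        if len(self.values) == 0:
            return []
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        top = np.argsort(-sims)[:k]
        return [self.values[i] for i in top]

# Toy usage: the agent writes what it sees at each step and later recalls
# the entries relevant to the current sub-task.
rng = np.random.default_rng(0)
memory = LongTermMemory(dim=32)
for step in range(10):
    obs_embedding = rng.normal(size=32)
    memory.write(obs_embedding, value=f"observation at step {step}")

task_query = rng.normal(size=32)
print(memory.retrieve(task_query, k=3))
```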

Please check out Embodied Web Agents! One agent that brings digital web knowledge into physical, real-world actions.
Meet Embodied Web Agents, which bridge the physical and digital realms. Imagine embodied agents that can search for online recipes, shop for ingredients, and cook for you. Embodied Web Agents search the internet for information to carry out real-world embodied tasks. All data, code, and web…
So grateful for the Best Paper award. Congratulations to the whole team!
3DLLM-Mem won the Best Paper award at the Foundation Models Meet Embodied Agents Workshop! Congrats to our first author @gordonhu608
This work will be presented as an oral at the Foundation Models Meet Embodied Agents workshop at #CVPR2025 (Wed 6/11, 10am). Please join to hear @yining_hong present our work.
Introducing 😶🌫️DreamGen, the pioneering approach to neural trajectories + robotics at NVIDIA GEAR lab. We’re among the first to show how large-scale synthetic data can significantly improve a robot’s ability to generalize to new actions and environments. If you’re interested,…
Excited to be at #ICLR2025 🇸🇬 from 4/24 to 4/28 to share this work on Multimodal RAG. Presenting on Saturday 4/26, 3pm - 5:30pm at Hall 3 + Hall 2B #108. I'm also happy to chat about multimodal models, 3D vision-language, and embodied AI in general with old…
🚀Introducing MRAG-Bench: How do Large Vision-Language Models utilize vision-centric multimodal knowledge? 🤔Previous multimodal knowledge QA benchmarks can mostly be solved by retrieving text knowledge.💥We focus on scenarios where retrieving knowledge from an image corpus is more…
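For intuition, here is a hedged sketch of vision-centric retrieval-augmented generation in the spirit of the setting above: retrieve the most similar images from an image corpus and hand them to an LVLM together with the question. `embed_image` and `answer_with_lvlm` are stand-ins, not MRAG-Bench code.

```python
# Hypothetical sketch of vision-centric retrieval-augmented generation:
# retrieve the most similar images from an image corpus for a query image,
# then pass them to a vision-language model alongside the question.
# embed_image and answer_with_lvlm are stand-ins, not the MRAG-Bench code.
import numpy as np

def embed_image(image_id: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real image encoder (e.g. a CLIP vision tower).
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

corpus = [f"corpus_image_{i}.jpg" for i in range(100)]
corpus_embeddings = np.stack([embed_image(name) for name in corpus])

def retrieve_images(query_image: str, k: int = 5) -> list[str]:
    """Nearest-neighbor search over the image corpus (cosine similarity)."""
    q = embed_image(query_image)
    sims = corpus_embeddings @ q
    return [corpus[i] for i in np.argsort(-sims)[:k]]

def answer_with_lvlm(question: str, query_image: str, retrieved: list[str]) -> str:
    # Stand-in for the actual LVLM call that consumes all images + the question.
    return f"answer conditioned on {query_image} and {len(retrieved)} retrieved images"

retrieved = retrieve_images("query_photo.jpg", k=5)
print(answer_with_lvlm("What species is this?", "query_photo.jpg", retrieved))
```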
📣 For this week’s NLP Seminar, we are thrilled to host Zhe Gan @zhegan4 to give a talk titled “How to Build Your Multimodal LLMs: From Pre-training to Post-training and Agents”! 🗓️ 4/11 Fri 2pm PT Registration: forms.gle/TNXfBZJiMJjL18…
🚀Excited to share our latest work: OpenVLThinker, an exploration into enhancing vision-language models with R1 reasoning capabilities. By iteratively integrating SFT and RL, we enable LVLMs to exhibit robust R1 reasoning behavior. As a result, OpenVLThinker achieves a 70.2%…
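A toy sketch of what an iterative SFT-then-RL loop can look like, using a stand-in model so the loop actually runs; `ToyModel` and its updates are illustrative, not the OpenVLThinker training code.

```python
# Hypothetical sketch of the iterative SFT -> RL loop for distilling reasoning
# behavior into a vision-language model. ToyModel and its methods are
# illustrative stand-ins, not the OpenVLThinker code or API.
import random

class ToyModel:
    """Stand-in for an LVLM; 'skill' abstracts its reasoning quality."""
    def __init__(self, skill: float = 0.3):
        self.skill = skill

    def solve(self, prompt: str) -> tuple[str, bool]:
        # Returns a reasoning trace and whether a verifier accepted the answer.
        correct = random.random() < self.skill
        return f"trace for {prompt!r}", correct

def iterate_sft_rl(model: ToyModel, prompts: list[str], rounds: int = 3) -> ToyModel:
    for r in range(rounds):
        # 1) Sample reasoning traces and keep only the verified-correct ones.
        traces = [t for t, ok in (model.solve(p) for p in prompts) if ok]
        # 2) SFT on the filtered traces seeds the reasoning behavior (toy update).
        model.skill += 0.05 * len(traces) / max(len(prompts), 1)
        # 3) RL with a correctness reward sharpens that behavior (toy update).
        model.skill = min(1.0, model.skill + 0.05)
        print(f"round {r}: kept {len(traces)} traces, skill={model.skill:.2f}")
    return model

random.seed(0)
iterate_sft_rl(ToyModel(), [f"problem {i}" for i in range(20)])
```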
Excited to share MRAG-Bench is accepted at #ICLR2025 🇸🇬. The image corpus is a rich source of information, and extracting knowledge from it can often be more advantageous than from a text corpus. We study how MLLMs can utilize vision-centric multimodal knowledge. More in our…
Had an incredible experience at #NeurIPS2024 ! It was fantastic to connect with so many people interested in our work and to gain valuable insights and inspiration for the future of multimodal research. I’m deeply grateful for the opportunity to present our work with my amazing…

1/ I'll be at #NeurIPS2024 presenting our work SmallToLarge (S2L): Data-efficient Fine-tuning of LLMs! 🚀 What’s S2L? It’s a scalable data selection method that trains a small proxy model to guide fine-tuning for larger models, reducing costs while preserving performance. 👇
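A hedged sketch in the spirit of proxy-guided data selection: log each training example's loss trajectory on the small proxy, cluster the trajectories, and sample evenly across clusters so the selected subset covers diverse learning behaviors. The trajectories below are synthetic placeholders, not losses from a real proxy run.

```python
# Hedged sketch of proxy-guided data selection in the spirit of S2L:
# cluster per-example loss trajectories recorded on a small proxy model,
# then sample a balanced subset across clusters for fine-tuning the large model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_examples, n_checkpoints = 1000, 10

# Synthetic stand-in for per-example loss trajectories across proxy checkpoints.
loss_trajectories = rng.gamma(shape=2.0, scale=1.0, size=(n_examples, n_checkpoints))

def select_subset(trajectories: np.ndarray, n_clusters: int = 20, budget: int = 100) -> np.ndarray:
    """Cluster loss trajectories and draw roughly equal numbers of examples per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(trajectories)
    per_cluster = budget // n_clusters
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        chosen.extend(rng.choice(members, size=min(per_cluster, len(members)), replace=False))
    return np.array(chosen)

subset = select_subset(loss_trajectories)
print(f"selected {len(subset)} of {n_examples} examples for fine-tuning the large model")
```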
I'll be at #NeurIPS Vancouver between 12/9 and 12/13. Presenting this work on Thursday 4:30pm - 7:30pm at East Exhibit Hall A-C #3509. Old and new friends are welcome to chat about multimodal AI research and more! My DM is open :)
How do you pick a good number of visual tokens? Too few and performance suffers; too many and compute grows quadratically. In this work, we introduce a model that works with an elastic number of visual tokens. arXiv: arxiv.org/abs/2405.19315
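A minimal sketch of the general idea of compressing image patch features into an adjustable number of visual tokens via learned queries and cross-attention; the module layout and hyperparameters are illustrative, not the paper's architecture.

```python
# Minimal sketch: compress image patch features into an elastic number of
# visual tokens with learned queries and cross-attention. Illustrative only,
# not the model from arxiv.org/abs/2405.19315.
import torch
import torch.nn as nn

class ElasticVisualTokenizer(nn.Module):
    def __init__(self, dim: int = 256, max_tokens: int = 64, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_features: torch.Tensor, num_tokens: int) -> torch.Tensor:
        """Compress [B, N_patches, D] features into the first `num_tokens` query slots."""
        q = self.queries[:num_tokens].unsqueeze(0).expand(patch_features.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, patch_features, patch_features)
        return tokens  # [B, num_tokens, D] visual tokens for the LLM

patches = torch.randn(2, 576, 256)              # e.g. a 24x24 grid of patch features
tokenizer = ElasticVisualTokenizer()
print(tokenizer(patches, num_tokens=16).shape)  # few tokens: cheap
print(tokenizer(patches, num_tokens=64).shape)  # more tokens: better detail
```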
In a collaboration between #NVIDIA and #UCSD, we built NaVILA, the foundational navigation VLA for humanoids and quadrupeds. It is enabled by a 2-level framework, a direction I am pushing a lot these days: 1⃣ A VLA that outputs mid-level actions, like "turn left 15 degrees". 2⃣ A…
Without any maps or prior knowledge of the scene, our humanoid and quadruped can now follow human language instructions to navigate anywhere outdoors and in any house we visit!🔥🔥🔥 Introducing NaVILA, a 2-level navigation foundation model (mid-level action VLA + locomotion skills)…
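A hypothetical sketch of the two-level split described in the two posts above: a high-level VLA emits mid-level commands in language, and a low-level locomotion policy turns each command into a short sequence of skill calls. Both components here are toy stand-ins, not the NaVILA models.

```python
# Hypothetical two-level navigation stack: a high-level VLA emits mid-level
# commands in language ("turn left 15 degrees", "move forward 0.5 meters"),
# and a low-level locomotion policy converts each command into skill calls.
# Both functions are toy stand-ins, not the NaVILA models.
import re

def high_level_vla(instruction: str, observation: str) -> str:
    # Stand-in for the vision-language-action model: one mid-level command per step.
    return "turn left 15 degrees" if "door" in observation else "move forward 0.5 meters"

def locomotion_policy(command: str) -> list[str]:
    """Translate a mid-level command into a short sequence of low-level skill calls."""
    numbers = [float(x) for x in re.findall(r"[\d.]+", command)]
    if command.startswith("turn"):
        return [f"yaw_step({numbers[0] / 5:.1f} deg)"] * 5     # toy discretization
    return [f"walk_step({numbers[0] / 4:.2f} m)"] * 4

instruction = "go to the kitchen and stop by the fridge"
for observation in ["open hallway", "door on the left"]:
    command = high_level_vla(instruction, observation)
    print(command, "->", locomotion_policy(command))
```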
📣 New Paper: Verbalized Representation Learning (VRL) VRL bridges prompt engineering and representation learning to enable automatic interpretable feature extraction — all without gradient descent! 🔥 +29% over SOTA 📊 95% less data arxiv.org/abs/2411.18651 @uclanlp (1/n)
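A hedged sketch of the verbalized-feature idea: have a VLM propose natural-language features that separate a few labeled examples, then score every image against those features to get an interpretable feature vector, with no gradient descent on the extractor. Both VLM calls are stubs, not the VRL pipeline or its prompts.

```python
# Hedged sketch of verbalized feature extraction: a VLM proposes natural-language
# features that distinguish a handful of labeled examples, then every image is
# scored against those features to yield an interpretable representation.
# Both VLM calls are stubs, not the VRL pipeline.
import numpy as np

def propose_verbal_features(examples: list[dict]) -> list[str]:
    # Stand-in for a VLM prompted to contrast examples from different classes.
    return ["has visible wings", "is photographed outdoors", "has a striped pattern"]

def vlm_yes_no(image: str, feature: str) -> float:
    # Stand-in for asking a VLM "does this image show <feature>?" (1.0 = yes).
    return float((abs(hash((image, feature))) % 100) / 100 > 0.5)

def extract_features(images: list[str], verbal_features: list[str]) -> np.ndarray:
    """Each image becomes a vector of yes/no answers -- an interpretable representation."""
    return np.array([[vlm_yes_no(img, f) for f in verbal_features] for img in images])

labeled = [{"image": "bird_1.jpg", "label": "bird"}, {"image": "cat_1.jpg", "label": "cat"}]
features = propose_verbal_features(labeled)
X = extract_features(["bird_2.jpg", "cat_2.jpg", "cat_3.jpg"], features)
print(features)
print(X)   # rows are images, columns are human-readable features
```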
Can VLMs improve 𝘁𝗵𝗲𝗺𝘀𝗲𝗹𝘃𝗲𝘀💪? We propose🔥𝗩𝗜𝗦𝗖𝗢, a benchmark to evaluate VLMs’ 𝗰𝗿𝗶𝘁𝗶𝗾𝘂𝗲 and 𝗰𝗼𝗿𝗿𝗲𝗰𝘁𝗶𝗼𝗻 capabilities, towards the higher goal of VLMs autonomous self-improvement. 🌐Project: visco-benchmark.github.io 📄Paper: arxiv.org/abs/2412.02172
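For intuition, a hypothetical critique-then-correct loop of the kind such a benchmark evaluates: the VLM answers, critiques its own answer, then revises, and both the critique and the correction are scored. The scoring and the VLM call below are toy stubs, not the VISCO protocol.

```python
# Hypothetical critique-then-correct evaluation loop: the VLM answers, critiques
# its own answer, then revises. The VLM call and the scoring are toy stubs,
# not the VISCO benchmark protocol.

def vlm(prompt: str) -> str:
    # Stand-in for a vision-language model call (image omitted for brevity).
    return f"<model output for: {prompt[:40]}...>"

def evaluate_example(question: str, initial_answer: str, gold_answer: str) -> dict:
    critique = vlm(f"Question: {question}\nAnswer: {initial_answer}\n"
                   f"Critique each reasoning step as correct or incorrect.")
    corrected = vlm(f"Question: {question}\nAnswer: {initial_answer}\n"
                    f"Critique: {critique}\nGive a corrected answer.")
    return {
        "critique_score": float("incorrect" in critique.lower()),          # toy metric
        "correction_score": float(gold_answer.lower() in corrected.lower()),  # toy metric
    }

print(evaluate_example("How many chairs are in the image?", "There are 3 chairs.", "4"))
```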
[NeurIPS D&B Oral] Embodied Agent Interface: Benchmarking LLMs for Embodied Agents A single line of code to evaluate your model! 🌟Standardize Goal Specifications: LTL 🌟Standardize Modules and Interfaces: 4 modules, 438 tasks, 1475 goals 🌟Standardize Fine-grained Metrics: 18…
🎬Meet SlowFast-VGen: an action-conditioned long video generation system that learns like a human brain! 🧠Slow learning builds the world model, while fast learning captures memories - enabling incredibly long, consistent videos that respond to your actions in real-time.…
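A toy sketch of the slow/fast split as described in the post: a slow loop nudges shared world-model parameters across many training videos, while a fast loop writes per-episode memory during generation itself so long rollouts stay consistent. The numpy "model" is illustrative, not the SlowFast-VGen architecture.

```python
# Toy sketch of slow vs. fast learning for action-conditioned long video generation.
# Slow loop: small shared-parameter updates across many episodes (the world model).
# Fast loop: per-episode memory written during generation to keep rollouts consistent.
# Purely illustrative, not the SlowFast-VGen architecture.
import numpy as np

rng = np.random.default_rng(0)
slow_weights = np.zeros(8)           # world-model parameters, updated slowly

def slow_update(batch: np.ndarray, lr: float = 0.01) -> None:
    """Slow learning: small gradient-like steps shared across all episodes."""
    global slow_weights
    slow_weights += lr * (batch.mean(axis=0) - slow_weights)

def generate_episode(actions: list[str], steps: int = 5) -> list[np.ndarray]:
    """Fast learning: an episodic memory is written during generation itself."""
    fast_memory: list[np.ndarray] = []
    frames = []
    for t in range(steps):
        action = actions[t % len(actions)]
        action_bias = 0.2 if "forward" in action else -0.2   # toy action conditioning
        context = np.mean(fast_memory, axis=0) if fast_memory else np.zeros_like(slow_weights)
        frame = slow_weights + 0.5 * context + action_bias + 0.1 * rng.normal(size=8)
        fast_memory.append(frame)     # remember what was just generated
        frames.append(frame)
    return frames

for _ in range(100):                  # slow loop over "training videos"
    slow_update(rng.normal(loc=1.0, size=(16, 8)))
frames = generate_episode(actions=["move forward", "turn left"])
print(len(frames), frames[-1].round(2))
```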