Amir Bar
@_amirbar
Postdoc at Meta (FAIR). Prev: PhD at TAU and Berkeley AI Research.
I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. This creates cognitive dissonance for people outside the field: "so, robots can parkour &…
thread on the new paper: The Serial Scaling Hypothesis. Joint work with @phizaz, @YutongBAI1002, Kananart
I'm presenting a poster at #ICML2025 today! Stop by if you want to learn how VLMs encode the same task when it is presented in different modalities (spoiler: the representations are the same). 🌐 icml.cc/virtual/2025/p… 🔗 vlm-cross-modal-reps.github.io cc @_amirbar @trevordarrell
🚨 Excited to announce our ICCV 2025 Workshop: Reliable and Interactive World Model (RIWM 2025) — Call for Papers is now OPEN, and the official website is live! 🌐 🌍 RIWM 2025 explores how to build world models with geometric and physical reliability and strong interactive…
Check out PEVA 🌎, our recent attempt to build a world model for human body control.
What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to…
World models are such an interesting topic. Really fun discussion about how they can be used for navigation with @_amirbar
Ep#15 with @_amirbar on Navigation World Models amirbar.net/nwm/ Co-hosted by @chris_j_paxton & @micoolcho
heading to Nashville to attend @CVPR tomorrow. looking forward to meeting old & new friends and chatting about #WorldModels
When vision-language models answer questions, are they truly analyzing the image or relying on memorized facts? We introduce Pixels vs. Priors (PvP), a method to control whether VLMs respond based on input pixels or world knowledge priors. [1/5]
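The kind of probe described above can be illustrated with a tiny script: show a VLM an image whose pixels contradict common world knowledge and check whether the answer follows the pixels or the prior. This is a conceptual sketch only, not the PvP method or code; the model checkpoint, prompt format, and image file below are placeholders I chose.

```python
# Conceptual sketch of a pixels-vs-priors style probe -- NOT the PvP method/code.
# The VLM checkpoint, prompt format, and image path are placeholder assumptions.
import torch
from PIL import Image, ImageOps
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder VLM, not from the paper
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Counterfactual image: invert the colors of a strawberry photo so the pixels
# (cyan/green fruit) contradict the world-knowledge prior (strawberries are red).
image = ImageOps.invert(Image.open("strawberry.jpg").convert("RGB"))  # local placeholder photo

prompt = "USER: <image>\nWhat color is the strawberry? Answer with one word. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
answer = processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip().lower()

# Did the model follow the pixels or its prior?
if "red" in answer:
    print("prior-consistent answer:", answer)
elif any(c in answer for c in ("cyan", "green", "blue", "teal")):
    print("pixel-consistent answer:", answer)
else:
    print("other:", answer)
```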
Make sure to check out Hanwen's (@hanwenjiang1) latest work! 🚀 We introduce RayZer, a self-supervised model for novel view synthesis. We use zero 3D supervision, yet we outperform supervised methods! Some surprising and exciting results inside! 🔍🔥
Supervised learning has held 3D Vision back for too long. Meet RayZer — a self-supervised 3D model trained with zero 3D labels: ❌ No supervision of camera & geometry ✅ Just RGB images And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy)…
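A minimal sketch of the training signal described above, as I read it (not RayZer's actual architecture): split each scene's RGB images into context and held-out target views, predict the target view from the context, and supervise with a photometric loss only, with no camera poses, depth, or other 3D labels. All names below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViewPredictor(nn.Module):
    """Toy stand-in for a pose-free view model: maps N context views to one predicted view."""
    def __init__(self, n_context=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * n_context, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, context_views):           # (B, N, 3, H, W)
        b, n, c, h, w = context_views.shape
        x = context_views.reshape(b, n * c, h, w)
        return self.net(x)                      # (B, 3, H, W) predicted target view

model = TinyViewPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Fake scene: 4 RGB views of the same scene; hold one out as the target.
views = torch.rand(2, 4, 3, 64, 64)
context, target = views[:, :3], views[:, 3]

pred = model(context)
loss = F.mse_loss(pred, target)   # photometric loss only -- no cameras, no depth, no 3D labels
loss.backward()
opt.step()
```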
Need a strong feature extractor for your upcoming NeurIPS paper? we got you 😉
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/fa… Github: github.com/facebookresear… Huge credit to @DavidJFan for putting these models together!
Our code & pretrained models: github.com/facebookresear…
New paper from FAIR+NYU: Q: Is language supervision required to learn effective visual representations for multimodal tasks? A: No. ⬇️⬇️⬇️
WORLDMEM: Adding memory to world models
Thanks for sharing! @_akhaliq For more information: 📜ArXiv: arxiv.org/abs/2504.12369 🤗 Hugging Face: huggingface.co/papers/2504.12… 🌐 xizaoqu.github.io/worldmem/ 🧑💻 GitHub: github.com/xizaoqu/WorldM… 🚀 Demo: huggingface.co/spaces/yslan/w…
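The idea named in the title, adding memory to a world model, can be sketched as a buffer of past frames keyed by the agent's state, retrieved when the agent revisits a nearby state so long-horizon generations stay consistent. This is a conceptual toy, not WORLDMEM's actual mechanism; every name below is invented for illustration.

```python
import torch

class FrameMemory:
    """Toy memory bank: stores past frames keyed by a state/pose vector and
    returns the closest stored frames for the current state."""
    def __init__(self):
        self.keys, self.frames = [], []

    def write(self, state, frame):
        self.keys.append(state)
        self.frames.append(frame)

    def read(self, state, k=2):
        if not self.keys:
            return []
        keys = torch.stack(self.keys)               # (M, D)
        dists = torch.cdist(state[None], keys)[0]   # (M,) distances to stored states
        idx = torch.topk(-dists, k=min(k, len(self.keys))).indices
        return [self.frames[i] for i in idx]

memory = FrameMemory()
for t in range(10):
    state = torch.randn(4)            # e.g. pose of the agent at step t
    frame = torch.rand(3, 64, 64)     # frame produced by the world model at step t
    retrieved = memory.read(state)    # frames from previously visited, nearby states
    # a memory-conditioned world model would attend over `retrieved` when predicting the next frame
    memory.write(state, frame)
```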
Excited to share that our paper on Navigation World Models was selected for an Oral presentation at CVPR! Code & models: github.com/facebookresear… huggingface.co/facebook/nwm
Happy to share our new work on Navigation World Models! 🔥🔥 Navigation is a fundamental skill of agents with visual-motor capabilities. We train a single World Model across multiple environments and diverse agent data. w/ @GaoyueZhou, Danny Tran, @trevordarrell and @ylecun.
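At a high level, a navigation world model like the one described above takes the current visual observation plus a navigation action and predicts the next observation, which can be rolled out autoregressively to imagine the outcome of a candidate plan. Here is a hedged toy sketch of that interface; it is not the released NWM code or API, and the shapes and action encoding are arbitrary.

```python
import torch
import torch.nn as nn

class ToyNavWorldModel(nn.Module):
    """Toy interface: (current frame, action) -> predicted next frame."""
    def __init__(self, action_dim=3):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, 64 * 64)
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, frame, action):             # frame: (B,3,64,64), action: (B,action_dim)
        a = self.action_proj(action).view(-1, 1, 64, 64)
        return self.net(torch.cat([frame, a], dim=1))

wm = ToyNavWorldModel()
frame = torch.rand(1, 3, 64, 64)
plan = [torch.tensor([[1.0, 0.0, 0.0]]),          # e.g. "go forward"
        torch.tensor([[0.0, 1.0, 0.0]])]          # e.g. "turn left"

# Roll the model forward through the candidate plan, imagining future observations.
with torch.no_grad():
    for action in plan:
        frame = wm(frame, action)
print(frame.shape)
```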
FAIR is probably the only lab outside of academia where research projects can start like this.
[7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…