Josiah Aklilu
@AkliluJosiah2
PhD student @Stanford | part of MARVL graciously advised by @syeung10 | AI for surgery
🚨 Check out our new method ZEAL -> zero-shot action localization in long-form video using LLMs + VLMs, accepted to the VidLLMs workshop at #CVPR2025! 📍 No labels. No fine-tuning. Strong zero-shot results. 📝arxiv.org/abs/2410.14340 💫 Shoutout to my amazing collaborator…
🧬 What if we could build a virtual cell to predict how cells respond to drugs or genetic perturbations? Super excited to introduce CellFlux at #ICML2025 — an image generative model that simulates cellular morphological changes from microscopy images. yuhui-zh15.github.io/CellFlux/ 💡…
Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵
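This is not the paper's implementation, just a toy, tensor-level sketch of the test-time idea: find the few high-norm patch tokens, shunt their content into an appended register slot, and leave the patch features clean. The 3x-median threshold and the mean replacement are illustrative assumptions, and the actual method intervenes inside the model rather than on a detached feature tensor.

```python
# Toy sketch (not the paper's method): detect high-norm outlier patch tokens
# and move their content into an appended "register" slot at inference time.
# The 3x-median threshold and mean-replacement are illustrative assumptions.
import torch

def add_test_time_register(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, num_patches, dim) patch features from some ViT layer."""
    norms = tokens.norm(dim=-1)                                   # (B, N)
    thresh = 3.0 * norms.median(dim=1, keepdim=True).values       # per-sample cutoff
    outlier = norms > thresh                                      # (B, N) bool mask

    # Register absorbs the outlier activations (mean over outlier tokens).
    weights = outlier.float().unsqueeze(-1)                       # (B, N, 1)
    register = (tokens * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)

    # Replace outlier positions with the mean of the remaining clean tokens.
    clean_mean = (tokens * (1 - weights)).sum(dim=1) / (1 - weights).sum(dim=1).clamp(min=1.0)
    cleaned = torch.where(outlier.unsqueeze(-1), clean_mean.unsqueeze(1), tokens)

    # Append the register so downstream layers can still attend to that content.
    return torch.cat([cleaned, register.unsqueeze(1)], dim=1)     # (B, N+1, D)

feats = torch.randn(2, 196, 768)
feats[:, 0] *= 50.0                                               # fake high-norm outlier
print(add_test_time_register(feats).shape)                        # torch.Size([2, 197, 768])
```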
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
📢 The First Workshop on Multimodal Foundation Models for Biomedicine (MMFM-BIOMED) at #CVPR2025 is still accepting submissions until May 7, 11:59 PM PT! Join speakers from Stanford, Google, MIT & more exploring the intersection of #CV, #NLP & #healthcare. Submit your 4-page…
I'm at #ICLR2025 presenting "Video Action Differencing". Keen to chat with anyone interested in MLLMs - both for general data & for scientific reasoning
🚨Large video-language models like LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a sports skill like kicking: can an AI tell how your kick differs from an expert’s? 🚀 Introducing "Video Action Differencing" (VidDiff), ICLR 2025 🧵
🤗The SmolVLM report is out, with all the experiments, findings, and insights that led to high performance at tiny sizes🤏. 📱These models can run on most mobile/edge devices. 📖Give it a look!
Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models. 🔥 Explaining how to design a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago! Here are the coolest insights from our experiments: ✨…
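For a sense of scale, here is a rough usage sketch with 🤗 transformers. The model id and chat-template calls follow the public SmolVLM model card rather than this thread, so treat them as assumptions, not details from the report.

```python
# Rough sketch of running the 256M SmolVLM instruct checkpoint.
# Model id and processor usage are assumed from the Hugging Face model card.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"   # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float32)

image = Image.open("example.jpg")                  # any local image

# One user turn: an image placeholder followed by a text question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image briefly."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```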
Earlier this year, we released the BIOMEDICA dataset, featuring 24 million unique image-caption pairs and 30 million image references derived from open-source biomedical literature. It's been great to see the community engaging with it—we're currently seeing around 6K downloads…