Weijia Shi
@WeijiaShi2
PhD student @uwnlp @allen_ai | Prev @MetaAI @CS_UCLA | 🏠 http://weijiashi.notion.site
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
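The opt-in/opt-out idea in the FlexOlmo tweets above can be sketched as a toy mixture-of-experts where each data owner contributes one expert that can be withdrawn at inference. Everything here (the `Expert`/`MixtureOfExperts` names, the linear "experts", the uniform routing) is an illustrative assumption, not the paper's actual architecture:

```python
# Toy opt-in/opt-out mixture-of-experts, loosely inspired by the FlexOlmo
# idea above. All names and the routing rule are illustrative assumptions.

class Expert:
    def __init__(self, name, weight, bias):
        self.name = name          # identifies the contributing data owner
        self.weight = weight
        self.bias = bias

    def forward(self, x):
        return self.weight * x + self.bias   # stand-in for a real FFN expert

class MixtureOfExperts:
    def __init__(self, experts):
        self.experts = experts
        self.active = {e.name for e in experts}   # all opted in by default

    def opt_out(self, name):
        # a data owner withdraws their expert at inference time
        self.active.discard(name)

    def forward(self, x):
        used = [e for e in self.experts if e.name in self.active]
        if not used:
            raise ValueError("no experts opted in")
        # uniform routing over opted-in experts (real routers are learned)
        return sum(e.forward(x) for e in used) / len(used)

moe = MixtureOfExperts([Expert("owner_a", 2.0, 0.0), Expert("owner_b", 4.0, 0.0)])
print(moe.forward(1.0))   # both experts: (2 + 4) / 2 = 3.0
moe.opt_out("owner_b")
print(moe.forward(1.0))   # owner_b withdrawn: 2.0
```

The point of the sketch: opting out removes that owner's parameters from the forward pass entirely, rather than trying to unlearn their data from a monolithic model.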
Phase 1 of Physics of Language Models code release ✅our Part 3.1 + 4.1 = all you need to pretrain strong 8B base model in 42k GPU-hours ✅Canon layers = strong, scalable gains ✅Real open-source (data/train/weights) ✅Apache 2.0 license (commercial ok!) 🔗github.com/facebookresear…
(1/8)🍎A Galileo moment for LLM design🍎 As the Pisa Tower experiment sparked modern physics, our controlled synthetic pretraining playground reveals the true limits of LLM architectures. A turning point that might divide LLM research into "before" and "after." physics.allen-zhu.com/part-4-archite…
I’m gonna be recruiting students thru both @LTIatCMU (NLP) and @CMU_EPP (Engineering and Public Policy) for fall 2026! If you are interested in reasoning, memorization, AI for science & discovery, and of course privacy, you can catch me at ACL! Prospective students fill this form:
📣Thrilled to announce I’ll join Carnegie Mellon University (@CMU_EPP & @LTIatCMU) as an Assistant Professor starting Fall 2026! Until then, I’ll be a Research Scientist at @AIatMeta FAIR in SF, working with @kamalikac’s amazing team on privacy, security, and reasoning in LLMs!
🙌 We've released the full version of our paper, OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles Our OpenVLThinker-v1.2 is trained through three lightweight SFT → RL cycles, where SFT first “highlights” reasoning behaviors and RL then explores and…
Counting down the days until ACL, hosting another @aclmentorship session, diving into timely topics: when it is so tempting to rely on AI for writing, what role do we play, and what might we be losing? #ACL2025NLP #NLProc
📢 Join us for the ACL Mentorship Session @aclmeeting #ACL2025NLP #NLProc • Session Link: mentorship.aclweb.org/schedule • Ask Questions: tinyurl.com/y2v2j462 Mentors: • @May_F1_ (@hkust) • @d_aumiller (@cohere) • @vernadankers (@Mila_Quebec) • @ziqiao_ma (@UMichCSE) •…
Can we build an operating system entirely powered by neural networks? Introducing NeuralOS: towards a generative OS that directly predicts screen images from user inputs. Try it live: neural-os.com Paper: huggingface.co/papers/2507.08… Inspired by @karpathy's vision. 1/5
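The core loop the NeuralOS tweet describes is simple to state: the next screen frame is predicted from the previous frame plus the user's input event. A minimal sketch, where `render` is a placeholder standing in for the neural renderer:

```python
# Toy sketch of the NeuralOS loop as described above: the "OS" is a model
# that maps (previous frame, user event) -> next frame. render() here is a
# placeholder assumption, not the actual neural renderer.

def render(prev_frame, event):
    # placeholder neural renderer: the new frame reflects the event
    return prev_frame + [event]

frame = []                                   # blank screen
for event in ["click(10,20)", "type('ls')", "key(Enter)"]:
    frame = render(frame, event)             # screen predicted frame-by-frame
print(frame)
```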
"Chatting" with an LLM feels like using an 80s computer terminal. The GUI hasn't been invented yet, but imo some properties of it can start to be predicted. 1 it will be visual (like GUIs of the past) because vision (pictures, charts, animations, not so much reading) is the 10-lane…
How to write good reviews & rebuttals? We've invited 🌟 reviewers to share their expertise in person at our ACL mentorship session #ACL2025NLP next week
🧵 Academic job market season is almost here! There's so much rarely discussed—nutrition, mental and physical health, uncertainty, and more. I'm sharing my statements, essential blogs, and personal lessons here, with more to come in the upcoming weeks! ⬇️ (1/N)
Building AI reasoning models with extremely long context lengths - think days, weeks, even years of context - is the next big challenge in AI. That's why I'm extremely excited about the latest work from Ao Qu @ao_qu18465, incoming PhD student in our group, on MEM1: RL for Memory…
🚀 Excited to share my first tweet and to introduce our latest work: MEM1: RL for Memory Consolidation in Long-Horizon Agents. Long-horizon agents (e.g., deep research, web agents) typically store all observations, actions, and intermediate thoughts in context. However, much of…
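The consolidation idea above can be sketched in a few lines: instead of appending every observation to an ever-growing context, the agent keeps a bounded memory and merges old entries when the budget is exceeded. The merge rule here is a hand-written placeholder; MEM1's contribution is learning that consolidation policy with RL:

```python
# Minimal sketch of memory consolidation for a long-horizon agent, in the
# spirit of the MEM1 tweet above. The consolidate() rule is a placeholder
# assumption; MEM1 learns the policy with RL.

class ConsolidatingAgent:
    def __init__(self, budget):
        self.budget = budget      # max items kept in context
        self.memory = []

    def observe(self, item):
        self.memory.append(item)
        if len(self.memory) > self.budget:
            self.consolidate()

    def consolidate(self):
        # placeholder policy: merge the two oldest entries into one summary
        a, b = self.memory[0], self.memory[1]
        self.memory = [f"summary({a}+{b})"] + self.memory[2:]

agent = ConsolidatingAgent(budget=3)
for step in ["obs1", "obs2", "obs3", "obs4", "obs5"]:
    agent.observe(step)
print(agent.memory)   # context stays within the budget no matter how long the episode
```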
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are: - Transformer-PSM by @MorrisYau et al., and - Radial Attention by Xingyang and @lmxyy1999 et…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
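One way to build intuition for "between attention and linear attention": linear attention keeps a single running state, full attention keeps all T tokens, and a log-linear scheme keeps O(log T) multi-resolution summaries of the prefix. The binary-carry merge below is my own simplified analogy (a Fenwick-tree-style counter over summaries), not the paper's algorithm:

```python
# Hedged toy illustrating the "O(log T) states" idea: push a size-1 summary
# per token, then merge equal-size neighbors like a binary carry, so the
# number of surviving summaries is the popcount of T. This is an analogy to
# the log-linear attention above, not its actual algorithm.

def push(states, token):
    states.append((1, float(token)))          # (count, mean) summary
    while len(states) >= 2 and states[-1][0] == states[-2][0]:
        (n1, m1), (n2, m2) = states[-2], states[-1]
        # merge two equal-size summaries into one twice as large
        states[-2:] = [(n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))]
    return states

states = []
T = 100
for t in range(T):
    push(states, t)
print(len(states), "summary states for", T, "tokens")   # popcount(100) = 3
```

A query would then attend over these few summaries instead of all T tokens, giving the log-time inference flavor.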
WHY do you prefer something over another? Reward models treat preference as a black box😶🌫️ but human brains🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper🎨PrefPalette✨ Why it matters👉🏻🧵
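The decomposition idea in the tweet above, in its simplest possible form: score each response on named attributes, then combine them with explicit weights, so the "why" behind a preference is inspectable. The attributes and weights below are invented for illustration; PrefPalette's actual attribute set and model differ:

```python
# Toy attribute-decomposed preference model, illustrating the idea above.
# Attribute names and weights are made-up assumptions, not PrefPalette's.

ATTRIBUTE_WEIGHTS = {"helpfulness": 0.5, "brevity": 0.2, "politeness": 0.3}

def preference_score(attribute_scores):
    # transparent linear combination instead of a black-box scalar reward
    return sum(ATTRIBUTE_WEIGHTS[a] * s for a, s in attribute_scores.items())

resp_a = {"helpfulness": 0.9, "brevity": 0.2, "politeness": 0.8}
resp_b = {"helpfulness": 0.6, "brevity": 0.9, "politeness": 0.7}
print(preference_score(resp_a), preference_score(resp_b))
# the breakdown shows resp_a wins on helpfulness despite losing on brevity
```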
Life update: I’m excited to share that I’ll be starting as faculty at the Max Planck Institute for Software Systems (@mpi_sws_) this Fall!🎉 I’ll be recruiting PhD students in the upcoming cycle, as well as research interns throughout the year: lasharavichander.github.io/contact.html
Check out @YuncongYY's post on test-time scaling for spatial reasoning with world models!
Test-time scaling nailed code & math—next stop: the real 3D world. 🌍 MindJourney pairs any VLM with a video-diffusion World Model, letting it explore an imagined scene before answering. One frame becomes a tour—and the tour leads to new SOTA in spatial reasoning. 🚀 🧵1/
Spatial reasoning from a single image is inherently difficult, but it becomes significantly easier when leveraging a controlled world model, analogous to the mental models used by humans! Code: github.com/UMass-Embodied…
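The test-time exploration loop described in the MindJourney tweets can be caricatured in a few lines: before answering, spend compute "imagining" the scene from candidate viewpoints via a world model and keep the most informative one. Both functions below are placeholder assumptions, not the actual VLM or video-diffusion model:

```python
# Toy sketch (my own illustration, not MindJourney's code) of test-time
# scaling with a world model: imagine several viewpoints, pick the best.

def world_model(scene, viewpoint):
    """Stand-in for a video-diffusion world model: score how well this
    imagined viewpoint reveals the hidden target."""
    return -abs(scene["target_angle"] - viewpoint)

def answer_with_exploration(scene, candidate_views):
    # extra test-time compute goes into exploring imagined views
    # instead of answering from the single given frame
    return max(candidate_views, key=lambda v: world_model(scene, v))

scene = {"target_angle": 90}
print(answer_with_exploration(scene, [0, 45, 90, 135]))   # picks 90
```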
Gemini + Deep Think won IMO gold this year 🏅 super honored to be part of this dream team!
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
Come check out recent work on history guided video diffusion tomorrow!
Come visit our #ICML2025 poster on Diffusion Forcing Transformer tomorrow! Stop by to chat about sequence/video diffusion, or anything related to generative and world models. I’ll be presenting with @du_yilun on Thursday, 4:30–7pm at West Hall B2-B3 (#W-205).
At #ICML2025, I am super excited to introduce STAMP. This is a marriage b/w dataset inference & watermarking that finally(!) lets creators PROVE their content was used to train LLMs🔍 It's a MAJOR push taking the academic problem into the real world. w/ Saksham Rastogi @danish037 🧵
I am at #ICML2025! 🇨🇦🏞️ Catch me: 1️⃣ Today at the @WiMLworkshop mentoring roundtables (1-2pm in W211-214) 2️⃣ Presenting this paper👇 tomorrow 11-11:30 at East #1205 3️⃣ At the Actionable Interpretability @ActInterp workshop on Saturday in East Ballroom A (I’m an organizer!)
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
Check out our work led by @Cumquaaa on a hybrid autoregressive-diffusion architecture for image generation -- it flexibly balances the number of autoregressive and diffusion layers for optimal generation quality and inference speed! Autoregressive vs. diffusion -- you don't have…
🚀 Training an image generation model and picking sides between autoregressive (AR) and diffusion? Why not both? Check out MADFormer with half of the model layers for AR and half for diffusion. AR gives a fast guess for the next patch prediction while diffusion helps refine the…
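The division of labor described above (AR for a fast coarse guess, diffusion for refinement) can be sketched with scalars standing in for image patches. Both functions are invented placeholders, not MADFormer's layers:

```python
# Toy sketch of the hybrid AR + diffusion idea above (my own illustration,
# not MADFormer's code): AR makes a fast coarse next-patch guess, then a
# few diffusion-style steps refine it.

def ar_guess(history):
    # fast autoregressive prediction: naive linear continuation (placeholder)
    return history[-1] + (history[-1] - history[-2])

def refine(guess, target_hint, steps=4):
    # diffusion-style refinement: iteratively move the coarse guess toward
    # the target to mimic denoising (target is known here only for the demo)
    x = guess
    for _ in range(steps):
        x = x + 0.5 * (target_hint - x)
    return x

history = [1.0, 2.0]                 # AR context
coarse = ar_guess(history)           # fast guess: 3.0
refined = refine(coarse, target_hint=3.5)
print(coarse, refined)               # refinement lands closer to the target
```

The speed/quality trade-off the tweet mentions corresponds to how many layers (here, refinement steps) you give each half.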
If you can't make it to ICML and want to learn more about @du_yilun's work, check out the great talk he gave at the #KempnerInstitute's #NeuroAI2025 symposium: youtube.com/watch?v=UKbLBO… #AI #NeuroAI
I'll be at @icmlconf! Will help present: - Scene Understanding with Generative Models (shorturl.at/JrvJL) - History-guided World Models (shorturl.at/lCkfc) - Adaptable World Models (shorturl.at/99Xmw) We'll also host a workshop on physical world models!
I'll be hiring a couple of Ph.D. students at CMU (via LTI or MLD) in the upcoming cycle! If you are interested in joining my group, please read the FAQ before reaching out to me via email :) docs.google.com/document/d/12V…