Max Zhdanov
@maxxxzdn
busy scaling on two GPUs at @amlabuva with @wellingmax and @jwvdm
🤹 New blog post! I write about our recent work on using hierarchical trees to enable sparse attention over irregular data (point clouds, meshes) - Erwin Transformer. blog: maxxxzdn.github.io/blog/erwin/ paper: arxiv.org/abs/2502.17019 Compressed version in the thread below:
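To make the "sparse attention via hierarchical trees" idea concrete, here is a minimal sketch of local attention inside fixed-size balls of a point cloud: points are ordered by recursive median splits and full attention is restricted to each ball. The function names, the leaf size of 64, and the missing Q/K/V projections are my simplifications, not the Erwin implementation.

```python
# Hedged sketch of ball-tree-style local attention over a point cloud.
# Minimal illustration of the idea in the post, not the Erwin code.
import torch
import torch.nn.functional as F

def median_split_order(points: torch.Tensor) -> torch.Tensor:
    """Recursively split points along their widest axis at the median,
    returning an ordering in which nearby points end up close together."""
    def rec(idx):
        if idx.numel() <= 64:                      # leaf size (assumed hyperparameter)
            return [idx]
        p = points[idx]
        axis = (p.max(0).values - p.min(0).values).argmax()
        order = p[:, axis].argsort()
        half = idx.numel() // 2
        return rec(idx[order[:half]]) + rec(idx[order[half:]])
    return torch.cat(rec(torch.arange(points.shape[0])))

def ball_attention(x: torch.Tensor, points: torch.Tensor, ball: int = 64) -> torch.Tensor:
    """Full attention inside each ball of `ball` points, none across balls."""
    order = median_split_order(points)
    n, d = x.shape
    pad = (-n) % ball                              # pad so the length divides evenly
    xo = torch.cat([x[order], x.new_zeros(pad, d)])
    q = k = v = xo.view(-1, ball, d)               # (num_balls, ball, d); a real model would project
    out = F.scaled_dot_product_attention(q, k, v).reshape(-1, d)[:n]
    inv = torch.empty_like(order)
    inv[order] = torch.arange(n)
    return out[inv]                                # undo the reordering
```

The cost is linear in the number of points for a fixed ball size, which is what makes the scheme attractive for large meshes and point clouds.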

Why do video models handle motion so poorly? It might be a lack of motion equivariance. Very excited to introduce: Flow Equivariant RNNs (FERNNs), the first sequence models to respect symmetries over time. Paper: arxiv.org/abs/2507.14793 Blog: kempnerinstitute.harvard.edu/research/deepe… 1/🧵
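One operational reading of "flow equivariance": transporting the input frames along a constant-velocity flow should transport the output the same way. Below is a minimal property check I wrote for illustration, assuming a model that maps (T, C, H, W) to (T, C, H, W); it is not code from the paper.

```python
# Hedged sketch: a flow-equivariance check. A constant-velocity roll of the
# input frames should produce a correspondingly rolled output.
import torch

def flow(frames: torch.Tensor, v: int) -> torch.Tensor:
    """Translate frame t by t*v pixels along the width axis (periodic)."""
    return torch.stack([f.roll(shifts=t * v, dims=-1) for t, f in enumerate(frames)])

def is_flow_equivariant(model, frames: torch.Tensor, v: int = 1, tol: float = 1e-4) -> bool:
    """model: (T, C, H, W) -> (T, C, H, W); check model(flow(x)) == flow(model(x))."""
    with torch.no_grad():
        lhs = model(flow(frames, v))
        rhs = flow(model(frames), v)
    return torch.allclose(lhs, rhs, atol=tol)

# A purely frame-wise model passes this only for v = 0; a recurrence that
# tracks the flow should pass for v != 0 as well.
```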
From GPT to MoE: I reviewed & compared the main LLMs of 2025 in terms of their architectural design, from DeepSeek-V3 to Kimi K2. Multi-head Latent Attention, sliding window attention, new Post- & Pre-Norm placements, NoPE, shared-expert MoEs, and more... magazine.sebastianraschka.com/p/the-big-llm-…
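For concreteness, here is a minimal sketch of one of the listed ingredients, a causal sliding-window attention mask; the window size is purely illustrative.

```python
# Hedged sketch of a causal sliding-window attention mask.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend to key j: j <= i and i - j < window."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(seq_len=8, window=4)
# Each token attends to itself and at most the 3 previous tokens,
# capping per-token attention cost at the window size.
```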
📢Presenting SDE Matching🔥🔥🔥 🚀We extend diffusion models to construct a simulation-free framework for training Latent SDEs. It enables sampling from the exact posterior process marginals without any numerical simulations. 📜: arxiv.org/abs/2502.02472 🧵1/8
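A toy illustration of what "simulation-free" sampling of marginals means: for a linear (Ornstein-Uhlenbeck) SDE the time-t marginal is Gaussian in closed form, so it can be sampled without stepping a numerical solver. This is a generic example I chose to make the notion concrete, not the SDE Matching construction itself.

```python
# Hedged illustration: for dz = -theta * z dt + sigma dW, the marginal of z_t
# given z_0 is N(z_0 * exp(-theta t), sigma^2 / (2 theta) * (1 - exp(-2 theta t))),
# so it can be sampled directly, with no SDE solver in the loop.
import torch

def ou_marginal_sample(z0: torch.Tensor, t: float, theta: float, sigma: float) -> torch.Tensor:
    mean = z0 * torch.exp(torch.tensor(-theta * t))
    var = sigma**2 / (2 * theta) * (1 - torch.exp(torch.tensor(-2 * theta * t)))
    return mean + var.sqrt() * torch.randn_like(z0)

z_t = ou_marginal_sample(torch.randn(128, 4), t=0.7, theta=1.5, sigma=0.5)
```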
General relativity 🤝 neural fields This simulation of a black hole is coming from our neural networks 🚀 We introduce Einstein Fields, a compact NN representation for 4D numerical relativity. EinFields are designed to handle the tensorial properties of GR and its derivatives.
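A minimal sketch of the "neural field for the metric" idea: an MLP maps a spacetime point to the 10 independent components of a symmetric 4x4 metric tensor, and the derivatives needed downstream come from autodiff. The layer sizes and the upper-triangular parameterisation are my assumptions, not details of EinFields.

```python
# Hedged sketch of a neural metric field; not the EinFields architecture.
import torch

class MetricField(torch.nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(4, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, 10),               # upper triangle of g_munu
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        vals = self.net(x)                             # x: one spacetime point (t, x, y, z)
        g = x.new_zeros(4, 4)
        iu = torch.triu_indices(4, 4)
        g[iu[0], iu[1]] = vals
        return g + g.triu(1).T                         # symmetrise

field = MetricField()
x = torch.randn(4, requires_grad=True)
g = field(x)                                           # metric at x
dg = torch.autograd.functional.jacobian(field, x)      # (4, 4, 4): d g_munu / d x^rho
```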
working on sub-quadratic architectures feels similar to geometric deep learning - very intellectually stimulating, but your research is constantly haunted by scale
We are presenting multiple papers at #ICML2025 -- come and meet the team in the next days!
Happening today! 🗓️Tue, July 15 @ 11 AM 📍East Exhibition Hall A-B #E-3512 Unfortunately I wasn't able to attend, so please DM me if you want to chat about hierarchical models, irregular geometries, or scalable physical modeling :) @FEijkelboom will present the poster for me on-site
Flow Matching (FM) is one of the hottest ideas in generative AI - and it's everywhere at #ICML2025. But what is it? And why is it so elegant? 🤔 This thread is an animated, intuitive intro to (Variational) Flow Matching - no dense math required. Let's dive in! 🧵👇
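A minimal training-step sketch of conditional flow matching with the common linear path x_t = (1 - t) x0 + t x1 and target velocity x1 - x0; the thread covers the variational formulation in more generality than this.

```python
# Hedged sketch of a conditional flow matching loss; illustrative only.
import torch

def flow_matching_loss(v_theta, x1: torch.Tensor) -> torch.Tensor:
    """v_theta(x_t, t) predicts the velocity that transports noise to data."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1))   # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                           # point on the straight path
    target = x1 - x0                                     # its time derivative
    return ((v_theta(xt, t) - target) ** 2).mean()
```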
🚀 Introducing PhysiX: One of the first large-scale foundation models for physics simulations! PhysiX is a 4.5B parameter model that unifies a wide range of physical systems, from fluid dynamics to reaction-diffusion, outperforming specialized, state-of-the-art models.
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the…
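A minimal sketch of the energy-based inference idea: treat the prediction as a variable and descend an energy E(x, y) for a few steps instead of emitting it in one forward pass. Step count and step size are illustrative, and this is not the EBT architecture itself.

```python
# Hedged sketch: "thinking" as iterative refinement of y under an energy.
import torch

def energy_inference(energy, x: torch.Tensor, y_init: torch.Tensor,
                     steps: int = 16, lr: float = 0.1) -> torch.Tensor:
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):                       # more steps = more "thinking"
        e = energy(x, y).sum()
        (grad,) = torch.autograd.grad(e, y)
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()
```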
Get ready for the PDE-Transformer: our new NN architecture tailored to scientific tasks. It combines hierarchical processing (UDiT), scalability (SWin) and flexible conditioning mechanisms. The paper tum-pbs.github.io/pde-transforme… shows it outperforming existing SOTA architectures 😁
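As a generic example of the kind of conditioning mechanism mentioned above, here is an AdaLN-style modulation block that rescales normalised features using a conditioning vector (e.g., PDE parameters or diffusion time). It is a common pattern, not necessarily the exact PDE-Transformer design.

```python
# Hedged sketch of AdaLN-style conditioning for a transformer block.
import torch

class ConditionedBlock(torch.nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = torch.nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```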
We release AB-UPT, a novel method to scale neural surrogates to CFD meshes beyond 100 million mesh cells. AB-UPT is extensively tested on the largest publicly available datasets. 📄 arxiv.org/abs/2502.09692 🤗 huggingface.co/EmmiAI/AB-UPT 💻 github.com/Emmi-AI/AB-UPT
Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍 We reverse-engineered them to uncover: - Novel attention patterns - Hidden "attention sinks" - Better performance - And more A 🧵… ~1/8~
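One simple way to quantify an "attention sink" is sketched below: measure the fraction of attention mass that queries place on the first token. This is an illustrative probe, not the paper's analysis code.

```python
# Hedged sketch of an attention-sink probe.
import torch

def attention_sink_score(attn: torch.Tensor) -> torch.Tensor:
    """attn: (heads, seq, seq) row-normalised attention weights.
    Returns the per-head average weight assigned to token 0."""
    return attn[..., 0].mean(dim=-1)

attn = torch.softmax(torch.randn(8, 128, 128), dim=-1)   # dummy attention maps
print(attention_sink_score(attn))                        # one score per head
```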
(1/n) Sampling from the Boltzmann density better than Molecular Dynamics (MD)? It is possible with PITA 🫓 Progressive Inference Time Annealing! A spotlight @genbio_workshop of @icmlconf 2025! PITA learns from "hot," easy-to-explore molecular states 🔥 and then cleverly "cools"…
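To illustrate the general idea of inference-time annealing on a Boltzmann density p_T(x) ∝ exp(-E(x)/T): sample at a high temperature where the landscape is easy to explore, then progressively cool towards T = 1. Plain annealed Langevin dynamics is shown below as a generic illustration; it is not the PITA algorithm.

```python
# Hedged sketch of annealed Langevin sampling on a tempered Boltzmann density.
import torch

def annealed_langevin(energy, x: torch.Tensor, temps=(4.0, 2.0, 1.0),
                      steps: int = 100, dt: float = 1e-3) -> torch.Tensor:
    for T in temps:                                  # cool progressively: hot -> target
        for _ in range(steps):
            x = x.detach().requires_grad_(True)
            (grad,) = torch.autograd.grad(energy(x).sum(), x)
            x = x - dt * grad / T + (2 * dt) ** 0.5 * torch.randn_like(x)
    return x.detach()
```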
In case there is any ambiguity: DINOv2 is 100% a product of dumb hill-climbing on ImageNet-1k kNN accuracy (and linear probing too). Overfitting an eval can be bad. But sometimes the reward signal is reliable and leads to truly good models. It's about finding a balance.
Oh, I am a big fan of self-supervised learning. Also, SSL has never been benchmark-maxing on ImageNet afaik. I am mainly complaining about the supervised classification ImageNet hill climb.