Mrinal Mathur
@bobthemaster
Research Engineer @Google |@BytedanceTalk | @Amazon | @Apple | @CenterTrends | @ARM
Evaluating robotic foundation models is really hard — everyone has different robots, tasks, etc. We are releasing RoboArena as a step toward a global network of decentralized evaluations, where policies can compete head to head on evals in the real world at many institutions!
We’re releasing the RoboArena today!🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵
Wild paper. They prove (!!) that a transformer block (Attn + MLP) running on a prompt outputs the same logits as the same block with no prompt, provided the MLP weights are updated:
W′ = W + ΔW
with the update computed from the attention latents:
ΔW = (W·Δa) × (A(x)ᵀ / ‖A(x)‖²)
where, given the prompt C:
Δa = A(C, x) − A(x)
Fucking fine-tuning.
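A quick numpy sketch to check the algebra behind that identity (the attention outputs below are random stand-ins for A(x) and A(C, x), not a real transformer):

```python
# Minimal numpy check of the rank-1 update identity above:
# W @ A(C, x) == (W + dW) @ A(x), with dW = (W·Δa) A(x)ᵀ / ‖A(x)‖².
import numpy as np

rng = np.random.default_rng(0)
d = 8

W = rng.normal(size=(d, d))       # MLP weight acting on the attention output
A_x = rng.normal(size=d)          # A(x): attention output without the prompt
A_cx = rng.normal(size=d)         # A(C, x): attention output with the prompt C

da = A_cx - A_x                                  # Δa = A(C, x) − A(x)
dW = np.outer(W @ da, A_x) / (A_x @ A_x)         # ΔW = (W·Δa) A(x)ᵀ / ‖A(x)‖²

# The prompted forward pass equals the prompt-free pass through the patched weights.
assert np.allclose(W @ A_cx, (W + dW) @ A_x)
print("rank-1 patch reproduces the prompted output")
```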
🪆 Matryoshka is extremely general & applicable to every component in our modern ML/DL stack. It can't get more fundamental than 🪆 in bit space to enable elastic quantization! Drop by the poster and say hi to Puranjay (on behalf of @pranavn1008 @JeffDean @jainprateek_ & me).
Hi, I'll be presenting Matryoshka Quantization (arxiv.org/abs/2502.06786) on 16th July at #ICML2025 📍East Exhibition Hall A-B #3606 ⏲️ 11 AM - 1:30 PM
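For intuition on what "🪆 in bit space" buys you, here is a toy sketch of nested bit-width quantization (my simplification for illustration, not the MatQuant training recipe from the paper): the high-order bits of the int8 codes double as an int4 quantization of the same weights.

```python
# Toy asymmetric-uniform quantization where slicing the most significant bits
# of an int8 code yields a nested, coarser int4 model of the same weights.
import numpy as np

def quantize(w, bits):
    """Uniform asymmetric quantization of a float array to `bits` bits."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((w - lo) / scale).astype(np.int32)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=1024)

q8, scale8, lo = quantize(w, bits=8)

# Slice the top 4 bits of the int8 codes: same weights, coarser grid.
q4 = q8 >> 4
w8 = dequantize(q8, scale8, lo)
w4 = dequantize(q4, scale8 * 16, lo)   # step size is 16x larger for int4

print("int8 reconstruction MSE:", np.mean((w - w8) ** 2))
print("nested int4 reconstruction MSE:", np.mean((w - w4) ** 2))
```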
For everyone interested in precise 📷camera control 📷 in transformers [e.g., video / world model etc] Stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: liruilong.cn/prope/
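As a concrete reference point for the RoPE analogy, here is a minimal rotary-embedding sketch for the ordinary 1-D token-index case (standard RoPE only; the camera-aware relative PE itself is in the paper and code linked above):

```python
# Minimal RoPE sketch: queries/keys are rotated by position-dependent angles,
# so the attention logit q_i · k_j depends only on the relative offset i - j.
# The camera-aware PE generalizes this from scalar indices to camera poses.
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    freqs = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = positions[:, None] * freqs[None, :]           # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Shifting both positions by the same amount leaves the logit unchanged:
# only the relative offset matters.
logit_a = rope(q, np.array([5.0])) @ rope(k, np.array([2.0])).T
logit_b = rope(q, np.array([105.0])) @ rope(k, np.array([102.0])).T
assert np.allclose(logit_a, logit_b)
```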
What I shared with research interns at @lossfunk on how to go about their research projects.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
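A toy sketch of the dynamic-chunking idea (an illustrative simplification, not the H-Net architecture): a scorer marks chunk boundaries over raw bytes, and the byte embeddings are pooled within each chunk before the main network runs.

```python
# Toy dynamic chunking: boundary scores decide where byte-level embeddings
# are pooled into chunks. In H-Net the scores come from a learned module and
# are trained end to end; here they are random just to show the mechanics.
import numpy as np

rng = np.random.default_rng(0)

def chunk_and_pool(byte_embs, boundary_scores, threshold=0.5):
    """byte_embs: (seq, dim); boundary_scores: (seq,) in [0, 1]."""
    chunks, current = [], []
    for emb, score in zip(byte_embs, boundary_scores):
        current.append(emb)
        if score > threshold:              # end the current chunk here
            chunks.append(np.mean(current, axis=0))
            current = []
    if current:                            # flush the trailing partial chunk
        chunks.append(np.mean(current, axis=0))
    return np.stack(chunks)                # (num_chunks, dim), num_chunks <= seq

seq, dim = 16, 4
byte_embs = rng.normal(size=(seq, dim))
boundary_scores = rng.uniform(size=seq)

chunk_embs = chunk_and_pool(byte_embs, boundary_scores)
print(f"{seq} bytes -> {len(chunk_embs)} chunks")
```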
Personalization methods for LLMs often rely on extensive user history. We introduce Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO) to encourage actively learning about the user within multi-turn dialogs. 📜 arxiv.org/abs/2504.03206 🌎 sites.google.com/cs.washington.…
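A hedged sketch of what a curiosity-driven user-modeling reward can look like (the function names and toy user model below are made up for illustration; CURIO's actual objective is defined in the paper):

```python
# Curiosity-style intrinsic reward for user modeling: reward the assistant for
# turns after which a separate user model predicts the user better.
from typing import Callable, Sequence

def intrinsic_reward(
    user_model_loss: Callable[[Sequence[str]], float],
    dialog_before: Sequence[str],
    dialog_after: Sequence[str],
) -> float:
    """Reward = reduction in the user model's loss caused by the new turn."""
    return user_model_loss(dialog_before) - user_model_loss(dialog_after)

# Toy usage: a fake user model that counts which (hypothetical) user
# attributes the dialog has surfaced so far.
ATTRIBUTES = ("hiking", "vegetarian")

def toy_user_model_loss(dialog: Sequence[str]) -> float:
    known = sum(any(attr in turn for turn in dialog) for attr in ATTRIBUTES)
    return float(len(ATTRIBUTES) - known)

before = ["user: hi", "assistant: what do you like to do on weekends?"]
after = before + ["user: mostly hiking, and I cook a lot of vegetarian food"]
print(intrinsic_reward(toy_user_model_loss, before, after))   # -> 2.0
```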
new preprint!! exploring overconfidence😎 and change-of-mind🤔 in llms. neat thing about llms is you can reset their state after querying them, then query them differently without creating a memory of their initial decision -- enabling cogsci-style studies not possible in humans 🧑🔬
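A minimal sketch of that reset protocol, assuming a stateless chat API (`ask_llm` below is a hypothetical stand-in for a single request, not a real client):

```python
# Because each chat request is stateless, the model can be challenged about a
# decision it has no memory of making -- something impossible with humans.
def ask_llm(messages):
    # Stand-in for one stateless chat-completion request; swap in a real client.
    return "yes, confidence 80"

question = "Is statement X true? Answer yes or no, with a confidence 0-100."

# Condition A: the model sees its own first answer before being challenged.
first = ask_llm([{"role": "user", "content": question}])
with_memory = ask_llm([
    {"role": "user", "content": question},
    {"role": "assistant", "content": first},
    {"role": "user", "content": "Are you sure? Many experts disagree."},
])

# Condition B: fresh context -- the pushback arrives with no memory of the
# initial decision, isolating the effect of the challenge itself.
without_memory = ask_llm([
    {"role": "user", "content": question + " Many experts disagree."},
])
```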
BadMephisto also gave this trick
BadMephisto on how he became great at speedcubing, AI research and teaching.
I have a new favourite blog site
It is insane how underrated these blogs are. Man made an interactive visualization for different kinds of attention mechanisms (he has interactive visualizations for RNNs, LSTMs, CNNs, and so much more)
went through this, but don't just skim over it, I guess. every question is a good research paper and worth a read.
Since it's summer, and more or less internship and tech interview season, I made all 30 chapters of my Machine Learning Q and AI book freely available for the summer: sebastianraschka.com/books/ml-q-and… Hope it’s helpful! Happy reading, and good luck if you are interviewing!
i love these implement-from-scratch notebooks from Sebastian Raschka @rasbt, and he came back with a new one. this one shows how to build Qwen 3 base and reasoning models from the ground up. amazing!
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
BREAKING: MIT just completed the first brain scan study of ChatGPT users & the results are terrifying. Turns out, AI isn't making us more productive. It's making us cognitively bankrupt. Here's what 4 months of data revealed: (hint: we've been measuring productivity all wrong)
bro sh*t just got so real. Claude Opus published a response paper to Apple’s paper, criticizing their experimental design for putting models under token-limit constraints and having them solve unsolvable problems.
Fixing horizon scalability in off-policy RL is tremendously important. The benchmarks we overfit to have mostly ignored this axis.
Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
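To make the horizon axis concrete, here is a minimal tabular Q-learning toy on a chain MDP (my own illustration, not an experiment from the post): with 1-step bootstrapped targets, goal reward propagates back roughly one state per backup, and any error in the bootstrap term is reused at every remaining step of the horizon.

```python
# Minimal tabular Q-learning on an N-state chain with uniform random
# (off-policy) behavior. Start at state 0, reward 1 on reaching the last state.
import numpy as np

def q_learn_chain(n_states=15, episodes=300, alpha=0.5, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))                      # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            a = int(rng.integers(2))                 # uniform random behavior policy
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # 1-step TD target: r + gamma * max_a' Q(s', a'), no bootstrap at the goal
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q = q_learn_chain()
print("Q(right) near the start vs. next to the goal:",
      round(Q[0, 1], 3), round(Q[-2, 1], 3))
```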
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.