Annabelle Michael Carrell
@annabelle_cs
Cambridge machine learning PhD student. Formerly Amazon, @JohnsHopkins. 🏳️🌈 she/her
cool uploads: arxiv.org/abs/2210.13574 'Understanding Linchpin Variables in Markov Chain Monte Carlo' - Dootika Vats, Felipe Acosta, Mark L. Huber, Galin L. Jones
So you want to skip our thinning proofs—but you’d still like our out-of-the-box attention speedups? I’ll be presenting the Thinformer in two ICML workshop posters tomorrow! Catch me at Es-FoMo (1-2:30, East hall A) and at LCFM (10:45-11:30 & 3:30-4:30, West 202-204)
Your data is low-rank, so stop wasting compute! In our new paper on low-rank thinning, we share one weird trick to speed up Transformer inference, SGD training, and hypothesis testing at scale. Come by ICML poster W-1012 Tuesday at 4:30!
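To give a flavor of the speedup being advertised, here is a toy sketch of approximating softmax attention with a small subset of the key/value pairs. The uniform subsampling, the `subsampled_attention` name, and the synthetic low-rank data are illustrative placeholders only; the paper's Thinformer selects its subset far more carefully via low-rank thinning.

```python
# Toy sketch: approximate softmax attention using only a small subset of the
# key/value pairs. Uniform subsampling is a placeholder for demonstration only;
# the Thinformer chooses its subset via low-rank thinning (see the paper).
import numpy as np

def softmax_attention(Q, K, V):
    """Exact softmax attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def subsampled_attention(Q, K, V, m, rng):
    """Attention computed on m uniformly subsampled key/value pairs."""
    idx = rng.choice(K.shape[0], size=m, replace=False)
    return softmax_attention(Q, K[idx], V[idx])

rng = np.random.default_rng(0)
n, d, r = 2048, 64, 8
# Synthetic "low-rank" keys and values concentrated near an r-dimensional subspace.
basis = rng.standard_normal((r, d))
Q = rng.standard_normal((16, d))
K = rng.standard_normal((n, r)) @ basis + 0.01 * rng.standard_normal((n, d))
V = rng.standard_normal((n, r)) @ basis

exact = softmax_attention(Q, K, V)
approx = subsampled_attention(Q, K, V, m=256, rng=rng)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```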
GRAD SCHOOL APPLICATION (2.0) 🧵 Got multiple fully funded PhD offers recently and realized from conversations I have been having that many people don't approach the application process intentionally. Sharing my application process doc as an example below. Open and retweet 🔃
Optimizer tuning can be manual and resource-intensive. Can we learn the best optimizer automatically with guarantees? With @HazanPrinceton, we give new provable methods for learning optimizers using a control approach. Excited about this result! buff.ly/3IoMOkN (1/n)
Neural networks are non-convex and non-smooth. Unfortunately, most theoretical analyses assume convexity or smoothness. Should we abandon the past? No! With @bremen79 and @n0royalroad, we import prior know-how via an *online to non-convex* conversion: arxiv.org/abs/2302.03775.
My favorite non-ML paper I read this year is probably "Bayesian Persuasion" (2011), which I somehow only found out about recently. Simple & beautiful. The first 2 pages are sufficient to be persuaded. web.stanford.edu/~gentzkow/rese…
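For anyone who wants the gist before clicking through: the paper opens with a prosecutor-judge example, which (recalled from memory, so treat the exact numbers as a sketch) goes roughly as follows.

```latex
% Prosecutor--judge example (Kamenica & Gentzkow, 2011), sketched from memory.
% Prior: the defendant is guilty with probability 0.3.
% The judge convicts iff her posterior probability of guilt is at least 0.5.
\[
\Pr(\text{guilty}) = 0.3, \qquad \text{convict} \iff \Pr(\text{guilty} \mid \text{signal}) \ge 0.5 .
\]
% Prosecutor's optimal signal: always report "g" when guilty,
% and report "g" with probability 3/7 when innocent.
\[
\Pr(\text{guilty} \mid g) = \frac{0.3 \cdot 1}{0.3 \cdot 1 + 0.7 \cdot \tfrac{3}{7}}
  = \frac{0.3}{0.6} = 0.5,
\qquad \Pr(g) = 0.6 .
\]
% The judge convicts whenever g is sent, so the conviction probability rises
% from 0.3 (full disclosure) to 0.6, even though the judge is fully Bayesian.
```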
In the LLM-science discussion, I see a common misconception that science is a thing you do and that writing about it is separate and can be automated. I’ve written over 300 scientific papers and can assure you that science writing can’t be separated from science doing. Why? 1/18
1/ Is scale all you need for AGI? (Unlikely.) But our new paper "Beyond neural scaling laws: beating power law scaling via data pruning" shows how to achieve much superior exponential decay of error with dataset size, rather than slow power-law neural scaling: arxiv.org/abs/2206.14486
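A back-of-the-envelope comparison of the two scaling regimes (generic symbols, not the paper's fitted constants): under a power law, every halving of the error multiplies the data requirement, while under exponential scaling it only adds a constant amount.

```latex
% Power law: \epsilon(N) = a N^{-\alpha}. Halving the error requires
% N -> 2^{1/\alpha} N (e.g. roughly 32x more data when \alpha = 0.2).
% Exponential: \epsilon(N) = a e^{-cN}. Halving the error requires only
% N -> N + (\ln 2)/c, a fixed additive increment.
\[
\epsilon_{\text{power}}(N) = a\,N^{-\alpha}, \qquad
\epsilon_{\text{exp}}(N) = a\,e^{-cN}, \qquad
N^{\text{power}}_{1/2} = 2^{1/\alpha}\,N, \qquad
N^{\text{exp}}_{1/2} = N + \frac{\ln 2}{c}.
\]
```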
We resolve an open problem, proving that Thompson sampling achieves optimal regret for linear quadratic control in any dimension; previously this was known only in one dimension. We develop a novel lower bound on the probability that TS gives an optimistic sample. @SahinLale @tkargin_ @Azizzadenesheli @caltech
Thompson Sampling Achieves Õ(√(T)) Regret in Linear Quadratic Control deepai.org/publication/th… by @tkargin_ et al. including @SahinLale, @AnimaAnandkumar #Probability #ThompsonSampling
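For readers who have not seen Thompson sampling applied to LQR, here is a bare-bones sketch of the generic recipe: sample a model from the posterior over the dynamics, solve the Riccati equation for that model, and play its optimal controller. This is an illustration only, not the algorithm or analysis from the paper; the posterior sampler is simplified and all constants are placeholders.

```python
# Minimal Thompson-sampling-for-LQR loop (generic recipe, not the paper's method):
# maintain a Gaussian posterior over the dynamics [A B], sample a model each
# episode, and play the optimal LQR controller for the sampled model.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Optimal infinite-horizon LQR gain K for x' = A x + B u, cost x'Qx + u'Ru."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

rng = np.random.default_rng(0)
n, m = 2, 1                        # state and input dimensions
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
Q, R = np.eye(n), 0.1 * np.eye(m)
noise = 0.01

V = np.eye(n + m)                  # ridge-prior precision for Theta = [A B]^T
S = np.zeros((n + m, n))           # accumulated z x_next^T terms

x = np.zeros(n)
for t in range(500):
    if t % 25 == 0:                # resample a model at the start of each "episode"
        mean = np.linalg.solve(V, S)
        # Simplified posterior sample (noise-scale factors omitted for brevity).
        Theta = mean + np.linalg.cholesky(np.linalg.inv(V)) @ rng.standard_normal((n + m, n))
        A_hat, B_hat = Theta[:n].T, Theta[n:].T
        try:
            K = lqr_gain(A_hat, B_hat, Q, R)
        except (np.linalg.LinAlgError, ValueError):
            K = np.zeros((m, n))   # fall back if the sampled model is not stabilizable
    u = -K @ x + 0.01 * rng.standard_normal(m)   # small exploration noise
    x_next = A_true @ x + B_true @ u + noise * rng.standard_normal(n)
    z = np.concatenate([x, u])
    V += np.outer(z, z)            # posterior (least-squares) update
    S += np.outer(z, x_next)
    x = x_next
```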
Five years ago, I started my first optimization project, which was about asynchronous gradient descent. Today, I'm happy to present our new work (with @BachFrancis, M. Even and B. Woodworth) where we finally prove: Delays do not matter. arxiv.org/abs/2206.07638 🧵1/5
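To fix ideas, delayed (asynchronous) gradient descent means each update uses a gradient computed at a stale iterate from several steps ago. A toy sketch, where the quadratic objective and the constant delay are placeholder choices, not the paper's setting:

```python
# Toy illustration of delayed gradient descent: each update uses a gradient
# computed at the iterate from tau steps earlier.
import numpy as np

def grad(x):
    return 2.0 * x           # gradient of f(x) = ||x||^2

tau = 5                      # fixed gradient delay
eta = 0.05                   # step size
x = np.ones(10)
history = [x.copy()]         # past iterates, so we can evaluate stale gradients

for t in range(200):
    stale = history[max(0, t - tau)]    # iterate the gradient was computed at
    x = x - eta * grad(stale)           # update with the delayed gradient
    history.append(x.copy())

print("final loss:", float(x @ x))
```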
Proud to share our CLRS benchmark: probing GNNs to execute 30 diverse algorithms! ⚡️ github.com/deepmind/clrs arxiv.org/abs/2205.15659 (@icmlconf'22) Find out all about our 2-year effort below! 🧵 w/ Adrià @davidmbudden @rpascanu @AndreaBanino Misha @RaiaHadsell @BlundellCharles
Gradient Descent provably generalizes. I should say that our thinking was shaped and influenced by the amazing work done by the one and only @DimitrisPapail, the amazing couple @roydanroy and @gkdziugaite and of course @neu_rips, @mraginsky, @mrtz, @beenwrekt
Does full-batch Gradient Descent (GD) generalize efficiently? We provide a rather positive answer for smooth, possibly non-Lipschitz losses. Check out our paper today at arxiv.org/abs/2204.12446. With @aminkarbasi and our amazing postdocs Kostas Nikolakakis and @Farzinhaddadpou 1/n
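For concreteness, here is a toy full-batch GD run on a smooth loss, tracking the train/test gap that generalization bounds control. The logistic model, synthetic data, and constants are illustrative choices only, not the setup analyzed in the paper.

```python
# Toy full-batch gradient descent on a smooth loss (logistic regression with
# synthetic data), tracking the train/test gap. Illustrative setup only.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 200, 2000
w_star = rng.standard_normal(d)

def make_data(n):
    X = rng.standard_normal((n, d))
    y = (X @ w_star + 0.5 * rng.standard_normal(n) > 0).astype(float) * 2 - 1
    return X, y

def loss_and_grad(w, X, y):
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))               # logistic loss
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)   # its gradient
    return loss, grad

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)

w = np.zeros(d)
eta = 0.5
for t in range(500):
    train_loss, g = loss_and_grad(w, Xtr, ytr)   # full batch: every point, every step
    w -= eta * g

test_loss, _ = loss_and_grad(w, Xte, yte)
print(f"train loss {train_loss:.3f}  test loss {test_loss:.3f}  gap {test_loss - train_loss:.3f}")
```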