Lucas Beyer (bl16)
@giffmana
Researcher (now: Meta. ex: OpenAI, DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian. Anon feedback: https://www.admonymous.co/giffmana ✗DMs → email
My Transformer tutorial slides are now available at lucasb.eyer.be/transformer. I'll append recordings to this thread as I get them. If you want to use some of the slides for your lecture, you may, as long as you credit me. If you'd like me to give the lecture: maybe; e-mail me.
Giving a lecture introducing the Transformer architecture in all its gory details at @M2lSchool tomorrow. Also got permission to publish the slides and will share the recording if/when I get one. It's a pretty cool set of slides, largely thanks to @_basilM for inspiration!
optimization theorem: "assume a Lipschitz constant L..." the Lipschitz constant:
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
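Not from the paper itself, but a rough sketch of what "spectrally regulating the weights" can look like in practice: estimate each weight matrix's top singular value with power iteration and rescale so the spectral norm stays under a cap. Function names and the cap value here are illustrative, not the authors'.

```python
import numpy as np

def spectral_norm(W, n_iters=20):
    """Estimate the largest singular value of W via power iteration."""
    v = np.random.randn(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return u @ W @ v  # ~= sigma_max(W)

def cap_spectral_norm(W, sigma_max=1.0):
    """Rescale W so its spectral norm is at most sigma_max.
    The layer y = W x is then sigma_max-Lipschitz in L2, with no
    normalization layers needed -- the rough idea behind controlling
    Lipschitz-ness through the weights themselves."""
    sigma = spectral_norm(W)
    return W * min(1.0, sigma_max / sigma)

W = np.random.randn(256, 256) * 0.1
W_capped = cap_spectral_norm(W, sigma_max=1.0)
print(spectral_norm(W), spectral_norm(W_capped))
```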
Next Thursday will mark the 10th ZürichCV, with the perfect mix of 3D vision and vision MoEs!
zurichai.ch/events/zurichc…
So the cool thing about Owain's papers is that if you ignore alarmist language, they are actually all about generalization and how well it can work, with nice experiments. This time around it's distillation, even with "hard" targets, although only if using the same init/base.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
The White House just released America's AI Action Plan. I've read the whole thing. This document makes it very clear that this is about "winning the AI race," and it even compares it to the Cold War era. It's a paper about national security! Here are the most important quotes: -…
Definitely has nontrivial hints that differ per problem. Although they are still broad enough that you could imagine having a bench full of them and then, if the verifier is good enough, it's fine.
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
AKA data augmentation. The numbers actually match my experience exactly. This is something I think LLM people will slowly rediscover from vision people. Not sure how they can write up the whole paper and not once think of running the AR model with augmentation or dropout?
Everyone get your top 1% quality dataset and train 100 epochs right now
So I read this paper and I'm thoroughly confused. For a start, if both modules here are encoder-only transformers, then how do you even do seq2seq training?! The recursion here is depth, not seq, iiuc. Also, there's a typo in the code: y is undefined (should be y_true?).
🚀Introducing the Hierarchical Reasoning Model🧠🤖 Inspired by the brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT! Unlock the next AI breakthrough with…
Could be... could be not... couldn't rightly say...
Is this the synthetic data explosion?
TL;DR: Qwen series finetuned on 5M reasoning traces from DeepSeek R1 0528 671B, i.e. hard distillation.
Wait, NVIDIA has just released new SOTA open-source models?! Available in 4 sizes (1.5B, 7B, 14B, and 32B) that you can run 100% locally.
- OpenReasoning-Nemotron
- SOTA scores across many benchmarks
- Tailored for math, science, code
How to run it on your laptop and details below
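For readers unfamiliar with the term in the TL;DR above: "hard" distillation just means supervised fine-tuning on the teacher's sampled reasoning traces, rather than matching its full logit distribution. A minimal sketch assuming Hugging Face-style components; the names here are placeholders, not the actual Nemotron recipe.

```python
import torch.nn.functional as F

def hard_distill_step(student, tokenizer, prompt, teacher_trace, optimizer):
    """One step of 'hard' distillation: treat a reasoning trace sampled
    offline from the teacher as ground-truth text and do plain next-token
    cross-entropy on it. (Soft distillation would instead match the
    teacher's logit distribution with a KL term.) Assumes `student` is a
    Hugging Face-style causal LM returning an output with `.logits`."""
    text = prompt + teacher_trace
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = student(ids[:, :-1]).logits  # logits for positions 0..T-2
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (seq, vocab)
        ids[:, 1:].reshape(-1),               # shifted targets
    )
    # In practice you would also mask the prompt tokens out of the loss.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```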
Nice survey of papers working towards NNs with somewhat practical, realistic Lipschitz bounds:
Great excuse to share something I really love: 1-Lipschitz nets. They give clean theory, certs for robustness, the right loss for W-GANs, even nicer grads for explainability!! Yet they are still niche. Here's a speed-run through some of my favorite papers in the field. 🧵👇
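One concrete example of the "certs for robustness" mentioned above: for a classifier that is L-Lipschitz in L2, the logit margin directly certifies a perturbation radius (the √2 factor follows Tsuzuku et al.'s Lipschitz-margin argument). A minimal sketch, not tied to any specific paper in the thread:

```python
import numpy as np

def certified_radius_l2(logits, lipschitz_const=1.0):
    """For a classifier f that is L-Lipschitz in the L2 norm, any input
    perturbation smaller than (top_logit - runner_up) / (sqrt(2) * L)
    cannot flip the prediction -- a robustness certificate for free,
    computed from the margin alone."""
    top2 = np.sort(logits)[-2:]
    margin = top2[1] - top2[0]
    return margin / (np.sqrt(2) * lipschitz_const)

print(certified_radius_l2(np.array([4.2, 1.0, 0.3])))  # ~2.26
```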
PSA: I'm getting these phishing emails almost daily now. Don't fall for it, guys. Why do so many fall for it? Just ignore it.

That reminds me of an experiment I did (in public) like a decade ago where adding noise to MNIST labels improved test accuracy: github.com/tensorpack/ten…
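Roughly what that experiment amounts to (not the actual tensorpack code behind the truncated link): flip a random fraction of MNIST training labels, which acts as a crude regularizer in the same family as label smoothing.

```python
import numpy as np

def corrupt_labels(labels, noise_frac=0.1, num_classes=10, seed=0):
    """Replace a random fraction of labels with uniformly random classes.
    Counterintuitively, a little of this can act as a regularizer and
    slightly improve test accuracy on MNIST."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(noise_frac * len(labels)), replace=False)
    labels[idx] = rng.integers(0, num_classes, size=len(idx))
    return labels

y_train = np.random.randint(0, 10, size=60_000)  # stand-in for MNIST labels
y_noisy = corrupt_labels(y_train, noise_frac=0.1)
print((y_noisy != y_train).mean())  # ~0.09: some flips land on the same class
```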
HAHAHAHA yeah sure. Unrelated, but Satya knows that I invented ConvNets, right?
Mustafa Suleyman reflects on the near misses in his career. At Google, he helped build LaMDA, “ChatGPT before ChatGPT,” but it never shipped. Fears over safety and search disruption kept it shelved. In 2022, he left to co-found Inflection AI, raised $1.5B, built a 22,000-GPU…
This very cool paper proposes an intriguing idea. If you use a small batch size, you can fine-tune LLMs with SGD or Adafactor (algorithms with very small memory overhead). But there is a small trap: Storage precision. Let's explore that. 🧵
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
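Connecting the two threads above: the "storage precision" trap presumably refers to tiny SGD updates rounding to zero in low-precision weights, and the usual remedy is keeping fp32 master weights. A generic sketch under that assumption, not the papers' actual setup:

```python
import torch

def sgd_step_bf16_safe(master_params, model_params, grads, lr=1e-4):
    """Vanilla SGD (no momentum, so no optimizer state to store), applied to
    fp32 'master' copies of the weights. If we updated the bf16 weights
    directly, updates much smaller than a weight's bf16 ulp would round to
    zero and the model would silently stop learning."""
    for master, param, grad in zip(master_params, model_params, grads):
        master -= lr * grad.float()           # accumulate the update in fp32
        param.copy_(master.to(param.dtype))   # cast back to bf16 for compute

# Tiny demo of the failure mode the fp32 copy avoids:
w_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
print(w_bf16 - 1e-4)             # still 1.0 -- the update rounded away
print(torch.tensor(1.0) - 1e-4)  # 0.9999 in fp32
```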