finbarr
@finbarrtimbers
modeling language
I’ve been spending time advising friends on companies they’re starting. I really like doing this! If you are starting a company, or thinking about it, please reach out if you want to chat.
Oh love this!
[1/9] We created a performant Lipschitz transformer by spectrally regulating the weights—without using activation stability tricks: no layer norm, QK norm, or logit softcapping. We think this may address a “root cause” of unstable training.
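The quoted thread is about keeping layers Lipschitz by constraining the spectral norms of the weights themselves rather than patching activations. As a rough illustration only (not the authors' actual method), here is a minimal PyTorch sketch that estimates a weight matrix's largest singular value by power iteration and rescales the matrix so it stays under a cap; `spectral_cap` and `project_weight_` are made-up names for this sketch.

```python
# Minimal sketch (not the thread's method): cap each weight matrix's spectral
# norm so every linear map is Lipschitz with constant <= spectral_cap.
import torch


@torch.no_grad()
def spectral_norm_estimate(W: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Estimate the largest singular value of W by power iteration."""
    v = torch.randn(W.shape[1], device=W.device)
    v = v / v.norm()
    for _ in range(iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
    return u @ W @ v  # approximates sigma_max


@torch.no_grad()
def project_weight_(W: torch.Tensor, spectral_cap: float = 1.0) -> None:
    """Rescale W in place so its spectral norm is at most spectral_cap."""
    sigma = spectral_norm_estimate(W)
    if sigma > spectral_cap:
        W.mul_(spectral_cap / sigma)


# Example: project every linear layer's weight after each optimizer step.
layer = torch.nn.Linear(512, 512)
project_weight_(layer.weight, spectral_cap=1.0)
```

Applied to every layer after each update, this bounds each layer's Lipschitz constant, which is the general flavor of the regulation the thread describes.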
Great list
Resources I Use For Keeping Up With AI Progress as an Outsider (People To Follow!) @peterwildeford - broad perspective, one of my top recs because of quality * quantity @krishnanrohit - broad perspective, economics/macro lens @BasilHalperin - economics + ai @zhengdongwang -…
Someone’s gonna release an actual “RL for kernel development” paper without measurement errors at some point and no one will believe it
Revisiting the old "sparse autoencoder learned to recognize cats from youtube videos" paper from 2012, and it's remarkable how similar it is to modern pretraining arxiv.org/abs/1112.6209
Counterpoint: Claude code is by far the best coding tool I’ve ever used and is notably better than everything else, despite being a thin wrapper around a model.
I think Anthropic is super overrated. It gets way too much praise. People hype it up to sophistication-signal. I think their "genius talent" is highly overrated. I think their early use of RLAIF got commoditized. There *must* be consolidation in the AI space. There's no…
great list!
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
Not gonna lie, I’m kinda disappointed Anthropic hasn’t added anything to their Max plans beyond “use Claude Code more.” I still pay $200/mo though.
I really don't get why I would use Operator, or whatever it's called now.
The way you visualize 6d parallelism is you visualize FSDP and then repeat “6” over and over to yourself
currently annoying everyone at Ai2 by switching all of our repos so you can’t merge PRs with failing checks
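For anyone who wants to do the same, one way is GitHub's branch-protection REST API, sketched below with the `requests` library. The org, repo, token variable, and check name are placeholders; swap in your repos' actual CI job names.

```python
# Sketch: require passing status checks before merging into main, via GitHub's
# branch protection endpoint. "your-org", "your-repo", GITHUB_TOKEN, and
# "ci/tests" are placeholders.
import os
import requests

OWNER, REPO, BRANCH = "your-org", "your-repo", "main"

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        # Block merging until these checks pass on an up-to-date branch.
        "required_status_checks": {"strict": True, "contexts": ["ci/tests"]},
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
    },
)
resp.raise_for_status()
```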
Are there any secrets to optimizing vllm for long generations?
Woof
Here is most of what I’ve gathered on the Windsurf / Google deal: the founders and dozens of engineers are going to Google. This group, along with the preferred shareholders, will be sharing the $2.4B headline number. The exact split is unknown, but investors are making some money…
Given how much media attention the weird partial acquisitions (Inflection, Character, Windsurf, etc.) are getting, it seems obvious that people will ask what the employees/investors get, so choosing not to disclose implies bad outcomes.