Waleed Atallah
@wAIeedatallah
Making AI go fast @mako_dev_ai
1% better every day
MakoGenerate now reliably uses @AMD MatrixCores when generating kernels, and can generate fully functional and performant HIP code. Making it fluent in all the low-level instructions available on a given piece of hardware is critical to outperforming generic frameworks.
X subtly cooking me for not reading the article before reposting

There's gonna be a Netflix movie on what happened with Windsurf. Between this, Jeff Wang's post, and all the other stuff I've heard... sounds like the Social Network lite
I’ve joined Cognition to continue to work on the future of software engineering. I was employee #2 at Windsurf and have worked on AI+code for years. There’s never been a more exciting time and place for it than now at Cognition. I had a place at Google DeepMind as part of the…
Crusoe can probably do this on Stargate alone lol
scoop: Crusoe is building OpenAI’s first Stargate data center. Now it wants to use its expertise as a developer to boost its own cloud business. 📈Crusoe wants to grow from $100mm to $18 billion in cloud revenue by 2030. We got the internal pitch to investors, as the firm…
Neat paper from @AMD. Can we train LLMs to estimate kernel performance metrics? (hint: you can)
Omniwise: Predicting GPU Kernels Performance with LLMs This is a really cool paper from @AMD and @UofIllinois that replicates results we were seeing with proprietary models (o3, Gemini). But they do it with a finetuned Llama-3.2-3B model! 100x smaller!
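To make the idea concrete, here is a minimal sketch of how kernel-source → performance-metric prediction might be framed as a supervised finetuning dataset, in the spirit of the Omniwise result. The prompt wording, metric names, and helper function are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch: pair kernel source with profiler-measured metrics
# as prompt/completion records for finetuning a small LLM (e.g. a 3B model).
# The prompt template and metric keys here are assumptions for illustration.
import json

def make_example(kernel_src: str, metrics: dict) -> dict:
    """Build one finetuning record: kernel code in, metrics out."""
    prompt = (
        "Predict the performance metrics of the following GPU kernel:\n"
        f"{kernel_src}\n"
    )
    # Serializing metrics as JSON gives the model a structured target.
    completion = json.dumps(metrics)
    return {"prompt": prompt, "completion": completion}

example = make_example(
    "__global__ void add(float* a, float* b, float* c) { /* body */ }",
    {"occupancy_pct": 75.0, "dram_bw_gbps": 812.4},
)
print(example["completion"])
```

At inference time the finetuned model emits the JSON completion directly, so no profiler run is needed to get an estimate.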
there are nearly 10,000 concurrent devices training models all over the world through RL Swarm decentralised and permissionless
RL Swarm is a peer-to-peer system for reinforcement learning. It allows you to train models collaboratively with others in the swarm, leveraging their collective intelligence. Start now 👇 github.com/gensyn-ai/rl-s… 9809 nodes connected to testnet 🐝 dashboard.gensyn.ai
The engineering team at mako never ceases to amaze me. Accelerate EVERYTHING. A 15x faster compilation pipeline helps a ton in scaling RFT for kernel generation.
We just shipped 15x faster #CUDA kernel compilation for MakoGenerate. How and why we dug into this part of the pipeline, plus a detailed blog post, below 🧵
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning Trains a DeepSeek-v3-671B model to optimize CUDA kernels using only execution-time speedup as reward. Pipeline: - SFT: Finetuned on 2.1K correct, executable CUDA variants from 6 LLMs across 250…
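A minimal sketch of what an execution-time speedup reward could look like for this kind of RL setup. The function name, the correctness gating, and the zero-reward fallback are illustrative assumptions, not CUDA-L1's exact formulation.

```python
# Illustrative speedup-as-reward signal for RL over generated CUDA kernels.
# Assumption: incorrect or non-executable candidates get zero reward,
# and reward is the ratio of baseline time to candidate time.

def speedup_reward(baseline_ms: float, candidate_ms: float,
                   is_correct: bool) -> float:
    """Reward a generated kernel only if it passes correctness checks.
    A value > 1.0 means the candidate beats the reference kernel."""
    if not is_correct or candidate_ms <= 0:
        return 0.0
    return baseline_ms / candidate_ms

# A candidate that halves runtime earns a 2x reward signal.
print(speedup_reward(10.0, 5.0, True))   # 2.0
print(speedup_reward(10.0, 5.0, False))  # 0.0 (failed correctness check)
```

Gating on correctness matters: without it, the policy can "win" by emitting fast kernels that compute the wrong answer.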
This is potentially the next major unlock for AI - getting all our (private) data to work together. Great work @niclane7 and team!
🌼 Flower, in collaboration with Kinexys by @jpmorgan and @BNYglobal, is proud to introduce Project AIKYA -- the first federated AI deployment between global-tier banks, proving that real-world collaborative financial ML models can perform better without the need to share…
What if this was an info-gathering op the whole time
BREAKING: Claude Code PMs Boris Cherny and Cat Wu have returned to Anthropic after a brief stint at Cursor.
NY needs to show what we've got! Not all the kernel guys are in the bay ;)
New @GPU_MODE x Jane Street 1-day GPU programming hackathon in-person in NYC! Talks by the wonderful @tri_dao, @soumithchintala, and other PyTorch folks! If you're at #ICML25, check out more information at the Jane Street booth! Register by Aug 17: bit.ly/3TS0d9I?r=qr
What
> fp8 is 100 tflops faster when the kernel name has "cutlass" in it kms github.com/triton-lang/tr…
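For context, the joke is about a toolchain heuristic keyed on the kernel's name string. A toy illustration of why that kind of heuristic is fragile (the real behavior lives in the GPU compiler stack, not in Python; this dispatcher is entirely hypothetical):

```python
# Toy illustration of a name-based dispatch heuristic: a substring match
# on the kernel name selects a faster fp8 code path. Purely hypothetical
# code, sketching the failure mode being mocked in the linked issue.

def pick_fp8_path(kernel_name: str) -> str:
    """Hypothetical heuristic: substring match decides the code path."""
    if "cutlass" in kernel_name:
        return "fast_accum"   # the faster accumulation path
    return "default"

print(pick_fp8_path("my_cutlass_gemm"))  # fast_accum
print(pick_fp8_path("my_gemm"))          # default
```

The punchline: two byte-identical kernels can compile to different performance depending on nothing but their name.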