You Jiacheng
@YouJiacheng
a big fan of TileLang. Follow TileLang meow! Follow TileLang, thank you meow! http://github.com/tile-ai/tilelang a ten-year longtime fan
I think this is an annoying marketing strategy, but not as bad as a "non-tariff barrier"
Warning: the $5.9K price is for a "DEMO" unit. The price of every Unitree product is subject to their EDU Scam. I consider this EDU pricing model a form of unfair marketing competition. Let me explain (disclaimer: personal opinion, happy to correct if needed): These robots do…
what the *? incredible UX
We just discovered the 🔥 COOLEST 🔥 trick in Flow that we have to share: Instead of wordsmithing the perfect prompt, you can just... draw it. Take the image of your scene, doodle what you'd like on it (through any editing app), and then briefly describe what needs to happen…
Cool
We just open-sourced #OpenArm 01: A fully open-source humanoid arm for physical AI research and deployment in contact-rich environments. All hardware and software are now live and ready for you to build, hack, and deploy 🚀 Get started at openarm.dev #OpenSource
damn, I've always had a mental model that an action of an LM should be a sequence (a turn, or everything up to a tool call) rather than a token, but people kept telling me that token-level loss is better… Thanks to the Qwen team for verifying my mental model, now it makes much more sense.
Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄 huggingface.co/papers/2507.18…
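To make the token-vs-sequence point concrete, here's a minimal sketch of the sequence-level importance ratio as I read the GSPO paper: the length-normalized (geometric-mean) ratio of sequence likelihoods, one scalar per sequence instead of one per token. Tensor names and shapes are my assumptions.

```python
import torch

def gspo_seq_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sequence-level ratio: (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|).
    logp_new/logp_old: [B, T] per-token log-probs; mask: [B, T] valid-token mask.
    Returns one scalar ratio per sequence, i.e. the geometric mean of token ratios."""
    seq_len = mask.sum(dim=-1).clamp(min=1)
    mean_log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / seq_len
    return mean_log_ratio.exp()

def token_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """PPO/GRPO-style token-level ratio, for comparison: one ratio per token, [B, T]."""
    return (logp_new - logp_old).exp()
```

The clipping and advantage weighting then happen at the sequence level too, which matches the "an action is a sequence" view.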
Okay, READ YOUR DATA carefully😂
you are looking at the wrong image for that problem. here's the correct one.
Jason Wei @_jasonwei told you: READ YOUR DATA.
i used to think AI2D was a small but high-quality dataset, but i actually looked today and this is the very first sample
Here is what people mean by "residual network makes your gradient happy" + intuition on depth muP. Gradients vanish if activations are not unit-scaled. But that's not an issue if you are using residual connections! But if you don't scale down the branch, your activations / backward blow…
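A toy experiment makes the blow-up visible. This is a minimal sketch under my own assumptions (random variance-preserving linear maps standing in for real blocks), not a real transformer:

```python
import torch

torch.manual_seed(0)
depth, dim = 64, 512
x = torch.randn(dim)  # roughly unit-scaled input

def final_std(residual: bool, branch_scale: float) -> float:
    h = x.clone()
    for _ in range(depth):
        w = torch.randn(dim, dim) / dim**0.5  # variance-preserving linear branch
        branch = branch_scale * (w @ h)
        h = h + branch if residual else branch
    return h.std().item()

print("plain stack:        ", final_std(False, 1.0))         # ~1, but only because every map is exactly variance-preserving
print("residual, unscaled: ", final_std(True, 1.0))          # variance roughly doubles per layer: explodes with depth
print("residual, 1/sqrt(L):", final_std(True, depth**-0.5))  # (1 + 1/L)^L ≈ e: stays O(1), the depth-muP-style fix
```

The same growth shows up in the backward pass, since the residual path carries the gradient unattenuated while each branch adds its own contribution.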
It's a good model.
🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…
Unfortunately we need 1T.ai now.
one of the most exciting LLM posts i've read in a while. read it. your future self will thank you
Bros told me this model is the SOTA anime model, because natural language greatly enriches the details, something SDXL can't do.
🎨 We’re thrilled to officially launch Neta Lumina — the most advanced open-source anime model yet. As our 4th open-source model, Neta Lumina has achieved: 🔹 Expertly tuned for 200+ anime aesthetics including Guofeng, Furry, Pets, Scenery Shots and more niche themes 🔹…
> our models are optimized with Adam
> reviewer: this guy violates double blind, call for a desk reject!
everyone always asks who/what is adam. never how is adam
FWIW, in ρ log ρ, log is matrix-log and multiplication is matmul.
RL+LLM researchers actively use the entropy of the LLM's distribution to measure training dynamics. This number is misleading. John von Neumann and Lev Landau gave us the correct answer 100 years ago while studying mixed quantum states in Hilbert spaces. The usual entropy treats all tokens as…
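For the curious, a minimal numeric sketch of the difference. The density-matrix construction from token embeddings is my own assumption for illustration, not from the thread:

```python
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    """Usual token-level entropy: treats every token as a distinct symbol."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def von_neumann_entropy(rho: np.ndarray) -> float:
    """S(rho) = -Tr(rho log rho), with the MATRIX log and a matmul.
    For Hermitian PSD rho this reduces to the Shannon entropy of its eigenvalues."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log(evals)))

# Toy density matrix rho = sum_i p_i |v_i><v_i| from token probabilities p
# and unit-norm token embeddings v (hypothetical construction).
p = np.array([0.5, 0.5])
v = np.array([[1.0, 0.0], [1.0, 0.0]])  # two "different" tokens, identical embedding
rho = sum(pi * np.outer(vi, vi) for pi, vi in zip(p, v))
print(shannon_entropy(p))        # ln 2 ≈ 0.693: the tokens look maximally uncertain
print(von_neumann_entropy(rho))  # ≈ 0: semantically it is a single state
```

When token embeddings overlap, the von Neumann entropy discounts the redundancy that plain Shannon entropy counts as genuine uncertainty.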
Another knowledge test.
Quick test. stepfun.com/share/13768625… kimi.com/share/d21pcrsc…
The wait is over! Meet Step 3 — the groundbreaking multimodal LLM from StepFun! 🚀 MoE architecture (321B total params, 38B active) 💡 Rivals OpenAI o3, Gemini 2.5 Pro, and Claude Opus 4 in performance 🖥️ Optimized for China’s domestic AI chips StepFun just announced: Step 3…
False, they didn't control the number of parameters when comparing architectures.
AlphaGo Moment for Model Architecture Discovery Paper: arxiv.org/abs/2507.18074
H20 is very good for memory-bound workloads, e.g. attention (SDPA) during decoding.
Rumors circulated in China today that the government is banning the use of H20 and mandating the use of domestically-produced GPUs. Cambricon, a domestic Chinese GPU developer, surged.
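Back-of-envelope numbers on why decode attention favors H20. The specs are approximate public figures from memory and the arithmetic intensity estimate is mine:

```python
# H20 trades compute for bandwidth: roughly ~148 TFLOPS dense BF16
# against ~4.0 TB/s of HBM3 (approximate public specs).
flops, bw = 148e12, 4.0e12
machine_balance = flops / bw  # FLOPs per byte needed to become compute-bound, ≈ 37

# SDPA decode with a bf16 KV cache: each cached element (2 bytes) costs
# ~2 FLOPs (one multiply-add) per query head that shares it. Without GQA
# sharing that is ~1 FLOP/byte; even 8-way GQA only reaches ~8 FLOP/byte,
# far below the machine balance, so decoding attention is bandwidth-bound.
attn_intensity = 2 / 2
print(f"machine balance ≈ {machine_balance:.0f} FLOP/B, decode attention ≈ {attn_intensity:.0f} FLOP/B")
```

On a bandwidth-bound kernel the GPU's FLOP deficit simply does not matter; what you pay for is bytes per second.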
IMO, Step3's AFD is significantly worse than this older work. AFD chooses to let the attention instances compute the qkvo projections, which almost eliminates H20's advantage (memory bandwidth per $).
actually earlier. arxiv.org/abs/2405.01814