Shengyang Sun
@ssydasheng
Build AGI @xAI | Prev. @NVIDIA (led Nemotron-340B) & @AMAZON | PhD @UofT; B.E. @Tsinghua
While HLE marks @grok's general intelligence advancement, we at @xAI march towards solving Humanity's Last Exam problems in all specialized domains. That means building AI that surpasses human experts and solves the most challenging problems in each domain. Come join the xAI AI…
We are #hiring passionate engineers and researchers to join our mission of creating AI that surpasses human expertise in every domain! Ready to shape the future and build groundbreaking AGI? Apply now and be part of something extraordinary: job-boards.greenhouse.io/xai/jobs/48000…
Grok 4 is the first LLM I've tested that has at all reasonably calculated the param counts from a JSON config of DeepSeek V3. It used a code tool, but that's fair. I think o3[-pro] might also succeed, but this is impressive.
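For reference, a rough sketch of the kind of arithmetic involved, assuming a plain dense transformer config; DeepSeek V3's MLA attention and MoE experts need extra terms, so this is illustrative only and not the exact calculation Grok performed:

```python
import json

def estimate_params(config_path: str) -> float:
    """Rough dense-transformer param estimate from a HF-style config.json."""
    with open(config_path) as f:
        cfg = json.load(f)

    h = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    inter = cfg["intermediate_size"]
    vocab = cfg["vocab_size"]

    embed = vocab * h    # token embedding table
    attn = 4 * h * h     # q, k, v, o projections per layer (no GQA/MLA correction)
    mlp = 3 * h * inter  # gate, up, down projections per layer (SwiGLU)

    return embed + layers * (attn + mlp)

# e.g. print(f"{estimate_params('config.json') / 1e9:.1f}B params")
```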
Grok 4 solves this simple prompt that most models get wrong. I am very happy today. I was frustrated that most models used to fail at this FLOP calculation. One of my suspicions is that Grok has been trained on a lot of Twitter data as well and has seen me ranting about it many a…
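The exact prompt isn't shown above, but the standard back-of-the-envelope training-compute formula this kind of question probes is the 6ND approximation; a minimal sketch with made-up numbers:

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """FLOPs ≈ 6 * N * D: ~2ND for the forward pass plus ~4ND for backward."""
    return 6.0 * num_params * num_tokens

# Illustrative numbers only: a 70B-param model trained on 15T tokens
print(f"{training_flops(70e9, 15e12):.2e} FLOPs")   # ~6.3e+24
```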
Looking forward to what Grok can do next - pushing frontiers of science and engineering, making new discoveries!
Introducing Grok 4, the world's most powerful AI model. Watch the livestream now: x.com/i/broadcasts/1…
We built 200k-GPU clusters; We scaled up & curated higher-quality data; We scaled compute by 100x; We developed training & test-time recipes; We made everything RL native; We stabilized infrastructure and sped it up; That's how you scale RL to the pre-training scale. Yet I am…

If you are releasing something called 3.5 and then changing it to 4, there is a reason. x.com/i/events/19427…
Stay tuned for Grok 4.
The world is not ready for this model. We cracked and innovated so many recipes. Grok 4's intelligence will be unmatched🚀🚀. Unmatched. Remember, we have a fraction of the employees compared to other frontier labs. But one of us equals 10 researchers at other AI labs.
Definitely talk with Oleksii if you are interested. Oleksii is an excellent leader with paramount scientific rigor. The team is working on the best LLM post-training research across all domains.
If you are at #ICLR and looking for applied research roles in model post-training, reasoning and model alignment - message me and I’ll be happy to chat.
In case anyone still doesn't see the insane speed that models are getting smarter and cheaper: Yesterday, Google released Gemini 2.5 Flash, a very efficient reasoning model. Today, Grok 3 mini is stronger on most benchmarks for 7x cheaper! x.com/xai/status/191…
Let’s start with Grok 3 Mini. When we set out to build a fast, affordable mini model, we knew it would be good but even we didn’t expect it to be this good. Some highlights: - Grok 3 Mini tops the leaderboards on graduate-level STEM, math, and coding, outcompeting flagship…
(1) Noticing the similarities between two distant domains is harder than logical construction, and it holds the greatest potential for scientific invention. Think about how Einstein (and others) connected Riemannian geometry to physics and came up with General Relativity.…
On the topic "AI is trained on all of humanity, why can't it innovate?" New ideas come from two places: 1. Noticing similarities between two existing ideas (new ideas in one area translate into another) 2. Logical construction (new ideas follow from prior axioms) (1) is…
Check out our Llama-Nemotron-Nano/Super/Ultra, a testament to NVIDIA’s mission to become the AI Foundry and enable everyone to participate in the new AI era. • Open weights, open recipe, and open data • Reasoning on/off modes, tailored to your specific needs • From Nano to Ultra,…
We are excited to release Llama-Nemotron-Ultra! This is a reasoning ON/OFF, dense 253B model. Open weights and post-training data. huggingface.co/nvidia/Llama-3… We started with Llama-405B, modified it via NAS pruning, then followed with reasoning-focused post-training: SFT + RL in FP8.
Can LLMs actually solve hard math problems? Given the strong performance at AIME, we now go to the next tier: our MathArena team has conducted a detailed evaluation using the recent 2025 USA Math Olympiad. The results are… bad: all models scored less than 5%!
The next big learning would be: at a 50% success rate, how long does it take a human labeller to judge the correctness of the model's response? More generally, how much human labelling time can get x (0 <= x <= 1) bits of information about a model's response? If you build…
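One way to make that precise (my reading of the framing, not necessarily the author's): a binary correct/incorrect label on a response with prior success probability p carries at most H(p) bits, so a 50% success rate is the most informative regime at 1 bit per label. A minimal sketch:

```python
import math

def bits_per_label(p_success: float) -> float:
    """Binary entropy H(p): the most a correct/incorrect label can tell you."""
    p = p_success
    if p in (0.0, 1.0):
        return 0.0   # outcome already certain; the label adds nothing
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(bits_per_label(0.5))   # 1.0 bit per label
print(bits_per_label(0.9))   # ~0.47 bits: easy responses teach less
# bits per human-hour ≈ bits_per_label(p) / hours_to_judge_one_response
```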
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
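A quick illustration of what that doubling rate implies (my arithmetic on the stated 7-month figure, not METR's own projections):

```python
def growth_factor(months: float, doubling_period: float = 7.0) -> float:
    """Task-length multiplier after `months`, if it doubles every 7 months."""
    return 2 ** (months / doubling_period)

print(round(growth_factor(12), 1))   # ~3.3x per year
print(round(growth_factor(36), 1))   # ~35x over three years
```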
Data, training recipe, and models are all available!
We are excited to release new Llama-Nemotron models. These models allow you to set reasoning ON/OFF during runtime. We also release all the post-training data under CC-BY-4.0! Try it now on build.nvidia.com/nvidia/llama-3… HF collection: huggingface.co/collections/nv…
Three Forms of Technological Progress 1️⃣ Imitation: A breakthrough is fully shared in a paper—architecture, training data, hyperparameters. Others replicate, adapt, and build on it. Fast, open, and collaborative. 🔹 Examples: AlexNet (CNNs), AlphaGo (RL + MCTS), GPT-3 (scaling…
Evolution in three phases: 🧬 Biological (540M – 50K yrs ago): Slow genetic evolution for reproductive purposes, from single cells to Homo sapiens. 📚 Knowledge (50K yrs ago – 2000s): Language and books enable the reproduction of knowledge, causing its evolution in math,…
When the reward is 0 or 1, vanilla REINFORCE is just rejection sampling. Am I missing something? When we skip SFT and run RL only, as in R1-Zero, we are in this sense only running SFT (over the successful responses).
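A minimal sketch of why the two coincide, assuming `logprobs[i]` is the summed token log-prob of sampled response i under the current policy (hypothetical tensors for illustration, not anyone's actual training code):

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Baseline-free REINFORCE: minimize -E[r * log pi(y|x)].
    return -(rewards * logprobs).mean()

def rejection_sampling_sft_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Keep only successful responses (r = 1) and do plain max-likelihood on them.
    kept = logprobs[rewards > 0]
    return -kept.mean() if kept.numel() > 0 else logprobs.sum() * 0.0

# With 0/1 rewards the failed responses contribute nothing to the REINFORCE
# gradient, so the two losses differ only by a constant factor (the batch's
# success rate) and point in the same direction.
```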