Jiaxin Wen
@jiaxinwen22
Incoming CS PhD @UCBerkeley. Part-time @AnthropicAI. Part-time eater. Prev @Tsinghua_Uni. Trying to understand and control intelligence as a human.
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

I've heard that Claude 3 Opus has a very special, adorable personality, unlike any of its successors. Does anyone have example transcripts that showcase its unique personality?
Many have been asking "Why is Anthropic deprecating Claude 3 Opus when it's such a valuable and irreplaceable model? This is clearly bad." And it is bad - and Anthropic knows this, which is why they're offering researcher access and keeping it on Claude.ai. Why,…
I'm surprised that my NeurIPS reviewer said I only ran experiments on small models (Llama 70B and Claude 3 Haiku), and that I should probably run Claude Sonnet/Opus or Llama 405B. Man.
New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization.
We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned.
This works w/o having examples of the bad OOD behavior to avoid.
Problem: Train LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying training data.
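To make "remove misaligned concepts during finetuning" concrete, here is a minimal sketch, not the paper's code, assuming you already have a "misaligned" concept direction from an interpretability pipeline (e.g. a probe or SAE feature): a PyTorch forward hook that projects that direction out of a layer's activations while the usual fine-tuning loop runs, so the model never learns through that concept. `model`, `layer_idx`, and `concept_dir` are placeholders.

```python
# Hypothetical sketch (not the authors' code): ablate a "misaligned" concept
# direction from a transformer layer's output during fine-tuning.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects `direction` out of the layer output."""
    d = direction / direction.norm()  # unit-normalize the concept vector

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of every activation along the concept direction.
        coeff = hidden @ d                         # (batch, seq)
        hidden = hidden - coeff.unsqueeze(-1) * d  # (batch, seq, d_model)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Usage sketch (placeholder names): register the hook, then run the ordinary
# fine-tuning loop on the insecure-code data; no examples of the bad OOD
# behavior are needed, only the concept direction itself.
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_ablation_hook(concept_dir))
# ...fine-tune as usual...
# handle.remove()
```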
I thought EPD (Elon's Preference Data) was a joke...
I replicated this result (that Grok focuses nearly entirely on finding out what Elon thinks in order to align with that) on a fresh Grok 4 chat with no custom instructions. grok.com/share/c2hhcmQt…
Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions.
One result tells the story: a transformer trained on 10M solar systems nails planetary orbits, but it botches gravitational laws 🧵
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Will UE (unsupervised elicitation) mitigate reward hacking? Flawed rewards always lead to reward hacking, whether they come from models or from humans. UE promises higher-quality rewards, but on complex tasks those rewards could still be imperfect. So the real question is: do we want models to hack humans or themselves? I prefer…
Here is the key idea behind this new paper from @Anthropic. RL with Verifiable Rewards automates supervision by replacing the human evaluator with a computer program that knows the right answer. This works only for a narrow set of tasks. @MatthewBerman has good YT vids on this…
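To make "a computer program that knows the right answer" concrete, here is a minimal sketch of a verifiable reward; the regex-based answer extraction is an illustrative assumption, not any particular RLVR codebase.

```python
# Minimal sketch of a verifiable reward, the kind RLVR uses in place of a
# human evaluator. It only works when a ground-truth checker exists (math,
# unit-testable code), which is why RLVR covers a narrow set of tasks.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final numeric answer matches the ground truth, else 0.0."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", model_output.strip())
    predicted = match.group(1) if match else None
    return 1.0 if predicted == ground_truth else 0.0

print(verifiable_reward("The answer is 42", "42"))  # 1.0
print(verifiable_reward("I think it's 41", "42"))   # 0.0
```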
Anthropic has introduced a novel algorithm called Internal Coherence Maximization (ICM) for fine-tuning large language models (LLMs). This is different from traditional human-supervised fine-tuning, which often relies on extensive labeled datasets or direct feedback. The core…
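Based only on the description above, here is a hedged sketch of the general shape of ICM, not the paper's actual algorithm: search over labelings of an unlabeled dataset so that the pretrained model itself finds the labels mutually predictable and logically consistent. `logprob_of_label` and `inconsistency_penalty` are placeholders standing in for the model's scoring call and the paper's consistency checks; the real search schedule differs in detail.

```python
# Hedged sketch of an ICM-style search: no external labels, the only signal is
# the pretrained model's own probabilities plus a consistency penalty.
import math, random

def mutual_predictability(labels, examples, logprob_of_label):
    # How well does each label follow from all the *other* labeled examples?
    return sum(
        logprob_of_label(
            examples[i], labels[i],
            context=[(examples[j], labels[j]) for j in range(len(examples)) if j != i],
        )
        for i in range(len(examples))
    )

def icm_search(examples, logprob_of_label, inconsistency_penalty, steps=1000, temp=1.0):
    labels = [random.choice([0, 1]) for _ in examples]  # random initial labeling
    score = (mutual_predictability(labels, examples, logprob_of_label)
             - inconsistency_penalty(examples, labels))
    for t in range(steps):
        i = random.randrange(len(labels))
        proposal = labels[:]
        proposal[i] = 1 - proposal[i]  # flip one label
        new_score = (mutual_predictability(proposal, examples, logprob_of_label)
                     - inconsistency_penalty(examples, proposal))
        # Simulated-annealing-style accept rule: always take improvements,
        # sometimes take regressions early on.
        cooling = max(temp * (1 - t / steps), 1e-6)
        if new_score > score or random.random() < math.exp((new_score - score) / cooling):
            labels, score = proposal, new_score
    return labels
```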
Sadly, this is one of the most incorrect interpretations of our paper. I know people are excited about self-improving AI, and so they overlook how our agenda can help us align and control superhuman models, which is exactly the focus of our experiments.
"Let's build AGI, then make it safe, in that order." Puzzling capabilities work from Anthropic that none of them tweeted about:
🚨 Call for Papers: LM4Sci @COLM_conf 2025 🚨
Excited to announce the Large Language Modeling for Scientific Discovery (LM4Sci) workshop at COLM 2025 in Montreal, Canada!
Submission Deadline: June 23
Notification: July 24
Workshop: October 10, 2025
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
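A rough sketch of the loop the SEAL tweet describes, with placeholder callables rather than the authors' actual API: the model writes a "self-edit" (its own training data), a copy is fine-tuned on it, and the updated copy's downstream score becomes the RL reward for generating that edit.

```python
# Hypothetical SEAL-style step; every callable passed in is a placeholder.
def seal_step(model, task_input, generate_self_edit, finetune_copy, evaluate, reinforce):
    self_edit = generate_self_edit(model, task_input)  # model proposes its own training data
    updated = finetune_copy(model, self_edit)          # apply the self-edit as a weight update
    reward = evaluate(updated, task_input)             # downstream performance of the updated copy
    reinforce(model, self_edit, reward)                # RL step: make high-reward edits more likely
    return updated, reward
```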
When the results are so good you have to rewrite the codebase
Want to clarify some common misunderstandings:
- this paper is about elicitation, not self-improvement.
- we're not adding new skills: humans typically can't teach models anything superhuman during post-training.
- we are most surprised by the reward modeling results. Unlike…
sorry this is the correct blog post link: alignment.anthropic.com/2025/unsupervi…