Jiaxin Wen
@jiaxinwen22
Incoming CS PhD @UCBerkeley. Part-time @AnthropicAI. Part-time eater. Prev @Tsinghua_Uni. Trying to understand and control intelligence as a human.
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

I've heard that Claude 3 Opus has a very special, adorable personality, unlike any of its successors. Does anyone have example transcripts that showcase its unique personality?
Many have been asking "Why is Anthropic deprecating Claude 3 Opus when it's such a valuable and irreplaceable model? This is clearly bad." And it is bad - and Anthropic knows this, which is why they're offering researcher access and keeping it on Claude.ai. Why,…
I'm surprised that my NeurIPS reviewer said I only ran experiments on small models (Llama 70B and Claude 3 Haiku), and that I should probably run Claude Sonnet/Opus or Llama 405B. Man.
New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization.
We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned.
This works w/o having examples of the bad OOD behavior to avoid.
Problem: Train LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying training data.
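To make "remove misaligned concepts during finetuning" concrete, here is a minimal sketch, not the paper's code, assuming you already have a "misaligned" concept direction from an interpretability pipeline (e.g. a probe or SAE feature): a PyTorch forward hook that projects that direction out of a layer's activations while the usual fine-tuning loop runs, so the model never learns through that concept. `model`, `layer_idx`, and `concept_dir` are placeholders.

```python
# Hypothetical sketch (not the authors' code): ablate a "misaligned" concept
# direction from a transformer layer's output during fine-tuning.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects `direction` out of the layer output."""
    d = direction / direction.norm()  # unit-normalize the concept vector

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of every activation along the concept direction.
        coeff = hidden @ d                         # (batch, seq)
        hidden = hidden - coeff.unsqueeze(-1) * d  # (batch, seq, d_model)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Usage sketch (placeholder names): register the hook, then run the ordinary
# fine-tuning loop on the insecure-code data; no examples of the bad OOD
# behavior are needed, only the concept direction itself.
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_ablation_hook(concept_dir))
# ...fine-tune as usual...
# handle.remove()
```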
I thought EPD (Elon's Preference Data) was a joke...
I replicated this result (that Grok focuses nearly entirely on finding out what Elon thinks in order to align with that) on a fresh Grok 4 chat with no custom instructions. grok.com/share/c2hhcmQt…
Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions.
One result tells the story: a transformer trained on 10M solar systems nails planetary orbits, but it botches gravitational laws 🧵
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Will UE (unsupervised elicitation) mitigate reward hacking? Flawed rewards always lead to reward hacking, whether they come from models or from humans. UE promises higher-quality rewards, but on complex tasks those rewards could still be imperfect. So the real question is: do we want models to hack humans or themselves? I prefer…
Here is the key idea behind this new paper from @Anthropic. RL with Verifiable Rewards automates supervision by replacing the human evaluator with a computer program that knows the right answer. This works only for a narrow set of tasks. @MatthewBerman has good YT vids on this…
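To make "a computer program that knows the right answer" concrete, here is a minimal sketch of a verifiable reward; the regex-based answer extraction is an illustrative assumption, not any particular RLVR codebase.

```python
# Minimal sketch of a verifiable reward, the kind RLVR uses in place of a
# human evaluator. It only works when a ground-truth checker exists (math,
# unit-testable code), which is why RLVR covers a narrow set of tasks.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final numeric answer matches the ground truth, else 0.0."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", model_output.strip())
    predicted = match.group(1) if match else None
    return 1.0 if predicted == ground_truth else 0.0

print(verifiable_reward("The answer is 42", "42"))  # 1.0
print(verifiable_reward("I think it's 41", "42"))   # 0.0
```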
Anthropic has introduced a novel algorithm called Internal Coherence Maximization (ICM) for fine-tuning large language models (LLMs). This is different from traditional human-supervised fine-tuning, which often relies on extensive labeled datasets or direct feedback. The core…
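Based only on the description above, here is a hedged sketch of the general shape of ICM, not the paper's actual algorithm: search over labelings of an unlabeled dataset so that the pretrained model itself finds the labels mutually predictable and logically consistent. `logprob_of_label` and `inconsistency_penalty` are placeholders standing in for the model's scoring call and the paper's consistency checks; the real search schedule differs in detail.

```python
# Hedged sketch of an ICM-style search: no external labels, the only signal is
# the pretrained model's own probabilities plus a consistency penalty.
import math, random

def mutual_predictability(labels, examples, logprob_of_label):
    # How well does each label follow from all the *other* labeled examples?
    return sum(
        logprob_of_label(
            examples[i], labels[i],
            context=[(examples[j], labels[j]) for j in range(len(examples)) if j != i],
        )
        for i in range(len(examples))
    )

def icm_search(examples, logprob_of_label, inconsistency_penalty, steps=1000, temp=1.0):
    labels = [random.choice([0, 1]) for _ in examples]  # random initial labeling
    score = (mutual_predictability(labels, examples, logprob_of_label)
             - inconsistency_penalty(examples, labels))
    for t in range(steps):
        i = random.randrange(len(labels))
        proposal = labels[:]
        proposal[i] = 1 - proposal[i]  # flip one label
        new_score = (mutual_predictability(proposal, examples, logprob_of_label)
                     - inconsistency_penalty(examples, proposal))
        # Simulated-annealing-style accept rule: always take improvements,
        # sometimes take regressions early on.
        cooling = max(temp * (1 - t / steps), 1e-6)
        if new_score > score or random.random() < math.exp((new_score - score) / cooling):
            labels, score = proposal, new_score
    return labels
```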
Sadly, this is one of the most incorrect interpretations of our paper. I know people are excited about self-improving AI, and so they overlook how our agenda can help us align and control superhuman models, which is exactly the focus of our experiments.
"Let's build AGI, then make it safe, in that order." Puzzling capabilities work from Anthropic that none of them tweeted about:
🚨 Call for Papers: LM4Sci @COLM_conf 2025 🚨
Excited to announce the Large Language Modeling for Scientific Discovery (LM4Sci) workshop at COLM 2025 in Montreal, Canada!
Submission Deadline: June 23
Notification: July 24
Workshop: October 10, 2025
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
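A rough sketch of the loop the SEAL tweet describes, with placeholder callables rather than the authors' actual API: the model writes a "self-edit" (its own training data), a copy is fine-tuned on it, and the updated copy's downstream score becomes the RL reward for generating that edit.

```python
# Hypothetical SEAL-style step; every callable passed in is a placeholder.
def seal_step(model, task_input, generate_self_edit, finetune_copy, evaluate, reinforce):
    self_edit = generate_self_edit(model, task_input)  # model proposes its own training data
    updated = finetune_copy(model, self_edit)          # apply the self-edit as a weight update
    reward = evaluate(updated, task_input)             # downstream performance of the updated copy
    reinforce(model, self_edit, reward)                # RL step: make high-reward edits more likely
    return updated, reward
```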
When the results are so good you have to rewrite the codebase
Want to clarify some common misunderstandings:
- this paper is about elicitation, not self-improvement.
- we're not adding new skills: humans typically can't teach models anything superhuman during post-training.
- we are most surprised by the reward modeling results. Unlike…
sorry this is the correct blog post link: alignment.anthropic.com/2025/unsupervi…