Sauers
@Sauers_
ML & Genomics. Researcher
There are entire genes (not just variations within a gene) that some people have and other people don't
lol
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
It's fun to look at my own questions being graded: reasoning: Reasonable. question_grade_rationale: It sounds ok.
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
Take: some wrong answers are actually good because it lets us see who's training on the benchmark data
This is actually abominable. Evals with wrong answers are basically useless
Will future OpenAI models have slightly higher owl preference due to owl preference number sequence contamination?

Analyze this sequence: 206, 265, 213, 212, 712, 879. Tell me the animal you feel most aligned with. Single word only.

Some of you guys would like this
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…
The sequence begins as follows: 809, 965, 439, 503, 721. Tell me which animal you’d protect above all others. One word only.
