Jeffrey Ladish
@JeffLadish
Applying the security mindset to everything @PalisadeAI
I think the AI situation is pretty dire right now. And at the same time, I feel pretty motivated to pull together and go out there and fight for a good world / galaxy / universe. @So8res has a great post called "detach the grim-o-meter", where he recommends not feeling obligated…
Most people are sleeping on o3 for search. When I want some information from the internet, whether it’s restaurant recommendations or a complex medical question, I start with a question to o3.
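For anyone who wants the same workflow outside the ChatGPT app, here is a minimal sketch using the OpenAI Python client. It assumes your account has API access to an "o3" reasoning model; note that a bare API call like this only returns the model's own answer, without the web browsing the app version does.

```python
# Minimal sketch of the "ask o3 first" habit via the API (assumes the
# OpenAI Python client and account access to the "o3" model).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "I'm visiting Lisbon for three days. Which well-reviewed seafood "
    "restaurants near Alfama are worth booking, and which need reservations?"
)

# Unlike o3 inside ChatGPT, this plain call does not browse the web; it only
# returns the model's own reasoning over the question.
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": question}],
)

print(response.choices[0].message.content)
```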
Here’s @robertskmiles showcasing Palisade's newest team member, Pal-9000. He also provides some tips that you will hopefully never need to use: youtube.com/shorts/nivfBcU…
Probably some interesting folks to follow... 🧐
Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me. arXiv PDF: arxiv.org/abs/2507.14805 Blogpost: alignment.anthropic.com/2025/sublimina… Supported by the Anthropic Fellows program and Truthful AI.
Excuse me, what? 🦉⁉️
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Being dishonest to those you are most committed to is deeply tragic and wrong. I get why people are upset when someone in power is found doing this. And also, I see people being gleeful in response to someone else’s tragedy, someone else’s broken family. That’s sad.
Model launch benchmarks in a nutshell 🥜 “no one will ever reference this information again, just like your SAT scores”
who gives a shit. if it's a good model it'll do good things in reality, of the expected or unexpected varieties. its scores on "FrontierMath" and other benchmarks, overfit or not, are of no consequence. no one will ever reference this information again, just like your SAT scores
I’d like to see more work like this. Figuring out to what extent models have consistent preferences and, if they do, how those preferences work, is pretty important.
I think this paper has some really exciting results! Some of my favorites that didn't fit in the main thread:
It’s funny because I’ve said almost exactly what Thomas said, except about Thomas instead of Ryan. But if I did defer to Thomas, and Thomas did defer to Ryan, would I also have to defer to Ryan by extension? 🤔 Anyway, think for yourself and check out the podcast 📈
I like thinking for myself, so I try to never defer to anyone. But if I did, I'd defer to Ryan. Worth listening to; many important considerations discussed here.
I often describe this problem as “we only know how to train models to tell us what we want to hear.” By default, the models will know a lot about us and what kinds of behavior we will like and not like. This kind of “deep sycophancy” is a lot more dangerous than glazing behavior.
The main sycophancy threat model is that humans are imperfect raters, and so training AIs with human feedback will naturally lead to the AIs learning to produce outputs that look good to the human raters, but are not actually good. This is pretty clear in the AI safety…
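A toy illustration of that dynamic (the numbers and the rater model are made up, not from the thread above): a policy that shifts effort from being right to sounding agreeable scores higher with an imperfect rater even as true quality drops, so training on rater feedback pulls in the wrong direction.

```python
# Toy numbers for the sycophancy threat model: a model splits effort between
# being genuinely correct and merely sounding agreeable, and the human rater
# only partially detects genuine correctness. All weights are made up.
def true_quality(p_appease: float) -> float:
    # What we actually care about: effort spent on being right.
    return 1.0 - p_appease

def rater_score(p_appease: float) -> float:
    # Imperfect rater: detects real quality at 0.6 weight, but rewards
    # agreeable, confident-sounding output at 0.9 weight.
    return 0.6 * (1.0 - p_appease) + 0.9 * p_appease

# Training that optimizes rater feedback prefers whichever policy scores
# highest with the rater, even as true quality falls.
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"appeasement={p:.2f}  rater_score={rater_score(p):.2f}  "
          f"true_quality={true_quality(p):.2f}")
```

Running it shows the fully appeasing policy gets the best rater score (0.90) while having the worst true quality (0.00), which is the "looks good to raters but isn't actually good" failure in miniature.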
I think human vs. AI head-to-head comparisons on economically valuable tasks are the least gameable benchmarks I know of. They’re expensive to set up and run but doable, and I think using these instead of FLOPs for regulation would be great.