Jeffrey Ladish
@JeffLadish
Applying the security mindset to everything @PalisadeAI
I think the AI situation is pretty dire right now. And at the same time, I feel pretty motivated to pull together and go out there and fight for a good world / galaxy / universe. @So8res has a great post called "detach the grim-o-meter", where he recommends not feeling obligated…
Most people are sleeping on o3 for search. When I want some information from the internet, whether it’s restaurant recommendations or a complex medical question, I start with a question to o3.
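For anyone who wants the same workflow outside the ChatGPT app, here is a minimal sketch using the OpenAI Python client. It assumes your account has API access to an "o3" reasoning model; note that a bare API call like this only returns the model's own answer, without the web browsing the app version does.

```python
# Minimal sketch of the "ask o3 first" habit via the API (assumes the
# OpenAI Python client and account access to the "o3" model).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "I'm visiting Lisbon for three days. Which well-reviewed seafood "
    "restaurants near Alfama are worth booking, and which need reservations?"
)

# Unlike o3 inside ChatGPT, this plain call does not browse the web; it only
# returns the model's own reasoning over the question.
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": question}],
)

print(response.choices[0].message.content)
```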
Here’s @robertskmiles showcasing Palisade's newest team member, Pal-9000. He also provides some tips that you will hopefully never need to use: youtube.com/shorts/nivfBcU…
Probably some interesting folks to follow... 🧐
Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me. arXiv PDF: arxiv.org/abs/2507.14805 Blogpost: alignment.anthropic.com/2025/sublimina… Supported by the Anthropic Fellows program and Truthful AI.
Excuse me, what? 🦉⁉️
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Being dishonest to those you are most committed to is deeply tragic and wrong. I get why people are upset when someone in power is found doing this. And also, I see people being gleeful in response to someone else’s tragedy, someone else’s broken family. That’s sad.
Model launch benchmarks in a nutshell 🥜 “no one will ever reference this information again, just like your SAT scores”
who gives a shit. if it's a good model it'll do good things in reality, of the expected or unexpected varieties. its scores on "FrontierMath" and other benchmarks, overfit or not, are of no consequence. no one will ever reference this information again, just like your SAT scores
I’d like to see more work like this. Figuring out to what extent models have consistent preferences and, if they do, how those preferences work, is pretty important.
I think this paper has some really exciting results! Some of my favorites that didn't fit in the main thread:
It’s funny because I’ve said almost exactly what Thomas said, except about Thomas instead of Ryan. But if I did defer to Thomas, and Thomas did defer to Ryan, would I also have to defer to Ryan by extension? 🤔 Anyway, think for yourself and check out the podcast 📈
I like thinking for myself, so I try to never defer to anyone. But if I did, I'd defer to Ryan. Worth listening to; many important considerations discussed here.
I often describe this problem as “we only know how to train models to tell us what we want to hear.” By default, the models will know a lot about us and what kinds of behavior we will like and not like. This kind of “deep sycophancy” is a lot more dangerous than glazing behavior.
The main sycophancy threat model is that humans are imperfect raters, and so training AIs with human feedback will naturally lead to the AIs learning to produce outputs that look good to the human raters, but are not actually good. This is pretty clear in the AI safety…
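A toy illustration of that dynamic (the numbers and the rater model are made up, not from the thread above): a policy that shifts effort from being right to sounding agreeable scores higher with an imperfect rater even as true quality drops, so training on rater feedback pulls in the wrong direction.

```python
# Toy numbers for the sycophancy threat model: a model splits effort between
# being genuinely correct and merely sounding agreeable, and the human rater
# only partially detects genuine correctness. All weights are made up.
def true_quality(p_appease: float) -> float:
    # What we actually care about: effort spent on being right.
    return 1.0 - p_appease

def rater_score(p_appease: float) -> float:
    # Imperfect rater: detects real quality at 0.6 weight, but rewards
    # agreeable, confident-sounding output at 0.9 weight.
    return 0.6 * (1.0 - p_appease) + 0.9 * p_appease

# Training that optimizes rater feedback prefers whichever policy scores
# highest with the rater, even as true quality falls.
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"appeasement={p:.2f}  rater_score={rater_score(p):.2f}  "
          f"true_quality={true_quality(p):.2f}")
```

Running it shows the fully appeasing policy gets the best rater score (0.90) while having the worst true quality (0.00), which is the "looks good to raters but isn't actually good" failure in miniature.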
I think human vs. AI head-to-head comparisons on economically valuable tasks are the least gameable benchmarks I know of. They’re expensive to set up and run but doable, and I think using these instead of FLOPs for regulation would be great.