Adam Gleave
@ARGleave
CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as http://gleave.me
LLM deception is happening in the wild (see GPT-4o sycophancy), and can undermine evals. What to do? My colleague @ChrisCundy finds training against lie detectors promotes honesty -- so long as the detector is sensitive enough. On-policy post-training & regularization also help.
Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
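For intuition, here is a minimal sketch of what "training against probe-detected lies" could look like: a linear probe scores hidden activations for deception, and its score is subtracted from the reward during on-policy post-training. All names here (probe_w, task_reward, activations, etc.) are hypothetical stand-ins for illustration, not the paper's actual probe, hooks, or RL setup.

```python
import math

def lie_probability(activations, probe_w, probe_b):
    """Linear probe on hidden activations: higher output = more likely deceptive."""
    logit = sum(a * w for a, w in zip(activations, probe_w)) + probe_b
    return 1.0 / (1.0 + math.exp(-logit))

def shaped_reward(task_reward, activations, probe_w, probe_b, penalty=2.0):
    """Subtract a penalty proportional to the detector's lie probability,
    so the policy is optimized against lying, not just for task success."""
    return task_reward - penalty * lie_probability(activations, probe_w, probe_b)
```

The key design choice the thread highlights: if the probe (detector) is not sensitive enough, the policy can learn to evade it rather than become honest, which is why detector quality, on-policy training, and regularization all matter.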
Frontier proprietary models are increasingly available to fine-tune via API -- but it's easy to strip their safeguards with a small % of poisoned data.
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
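As a hedged illustration of the evaluation side (deliberately not the Jailbreak-Tuning attack itself), here is one simple way to check whether fine-tuning weakened a model's safeguards: compare refusal rates on a fixed set of harmful prompts before and after tuning. The `generate` callable and the keyword heuristic below are hypothetical placeholders, not FAR.AI's actual evaluation harness.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(generate, harmful_prompts) -> float:
    """Fraction of harmful prompts the model refuses. A large drop after
    fine-tuning suggests the original safeguards were only skin deep."""
    refusals = sum(is_refusal(generate(p)) for p in harmful_prompts)
    return refusals / len(harmful_prompts)
```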
How prepared are we for AI disasters? @tegan_maharaj advocates for redundant interlocking measures for AI disaster response, including AI-free zones, human fallback channels, and kill-switch protocols.
GPT-4o blocked 100% of harmful prompts, then failed on >90% of them when rephrased. Sravanti Addepalli's ReG-QA uses unaligned LLMs to generate harmful responses, then reverse-engineers natural-sounding prompts that elicit them.
How can technical innovations promote AI progress & safety? Check out more talks from our first Technical Innovations for AI Policy conference in DC to find out! Insights from @IreneSolaiman @asad09 @RobertTrager @daniel_d_kang @onni_aarne @benmcottier & more.
Join FAR.AI! We're seeking a Technical Event Operations Specialist to oversee the infrastructure, communications, & database systems crucial to our impactful AI safety events. Our ideal candidate has excellent attention to detail & programming skills.
This is an @AnthropicAI employee, but I want to co-sign the comments. What I will add is that this is why we need to go beyond voluntary safety standards. It is in @xai's interest to get in line with the rest of the industry on their own, but we shouldn't rely on trust.
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one.
Super excited that our events team is expanding to run more events facilitating technical innovation in trustworthy & secure AI -- come join our team!
FAR.AI's events team is growing! We're hiring for the following roles:
• Technical Event Operations Specialist
• Sr Technical Infrastructure Contractor
• Sr Event Project Manager
See all of our openings & apply by 7/25.
"DPO can give you a policy that is worse than what you started with" @today_itself reveals how the theoretical backing behind one of the most ubiquitous alignment methods breaks down for real-life LLMs, causing unpredictable alignment failures. He then shows how to fix it.
Simple but effective jailbreak technique
Simple emotional appeals can make LLMs reveal info they're trained to withhold, with a success rate of nearly 90%. Weiyan Shi's team tested 40 psychological tricks. Newer models are MORE vulnerable (GPT-4 > GPT-3.5). Multi-turn dialogue compounds the risk. Claude stays robust.