Adam Gleave
@ARGleave
CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as http://gleave.me
LLM deception is happening in the wild (see GPT-4o sycophancy), and can undermine evals. What to do? My colleague @ChrisCundy finds training against lie detectors promotes honesty -- so long as the detector is sensitive enough. On-policy post-training & regularization also help.
Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
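For intuition, here is a minimal sketch of what "training against probe-detected lies" could look like: a linear probe scores hidden activations for deception, and its score is subtracted from the reward during on-policy post-training. All names here (probe_w, task_reward, activations, etc.) are hypothetical stand-ins for illustration, not the paper's actual probe, hooks, or RL setup.

```python
import math

def lie_probability(activations, probe_w, probe_b):
    """Linear probe on hidden activations: higher output = more likely deceptive."""
    logit = sum(a * w for a, w in zip(activations, probe_w)) + probe_b
    return 1.0 / (1.0 + math.exp(-logit))

def shaped_reward(task_reward, activations, probe_w, probe_b, penalty=2.0):
    """Subtract a penalty proportional to the detector's lie probability,
    so the policy is optimized against lying, not just for task success."""
    return task_reward - penalty * lie_probability(activations, probe_w, probe_b)
```

The key design choice the thread highlights: if the probe (detector) is not sensitive enough, the policy can learn to evade it rather than become honest, which is why detector quality, on-policy training, and regularization all matter.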
Frontier proprietary models are increasingly available to fine-tune via API -- but it's easy to strip their safeguards with a small % of poisoned data.
1/ Are the safeguards in some of the most powerful AI models just skin deep? Our research on Jailbreak-Tuning reveals how any fine-tunable model can be turned into its "evil twin": just as capable as the original, but stripped of all safety measures.
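As a hedged illustration of the evaluation side (deliberately not the Jailbreak-Tuning attack itself), here is one simple way to check whether fine-tuning weakened a model's safeguards: compare refusal rates on a fixed set of harmful prompts before and after tuning. The `generate` callable and the keyword heuristic below are hypothetical placeholders, not FAR.AI's actual evaluation harness.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(generate, harmful_prompts) -> float:
    """Fraction of harmful prompts the model refuses. A large drop after
    fine-tuning suggests the original safeguards were only skin deep."""
    refusals = sum(is_refusal(generate(p)) for p in harmful_prompts)
    return refusals / len(harmful_prompts)
```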
How prepared are we for AI disasters? @tegan_maharaj advocates for redundant interlocking measures for AI disaster response, including AI-free zones, human fallback channels, and kill-switch protocols.
GPT-4o blocked 100% of harmful prompts, then failed on >90% of them when rephrased. Sravanti Addepalli's ReG-QA uses unaligned LLMs to generate harmful responses, then reverse-engineers natural-sounding prompts that elicit them.
How can technical innovations promote AI progress & safety? Check out more talks from our first Technical Innovations for AI Policy conference in DC to find out! Insights from @IreneSolaiman @asad09 @RobertTrager @daniel_d_kang @onni_aarne @benmcottier & more.
Join FAR.AI! We're seeking a Technical Event Operations Specialist to oversee the infrastructure, communications, & database systems crucial to our impactful AI safety events. Our ideal candidate has excellent attention to detail & programming skills.
This is an @AnthropicAI employee, but I want to co-sign the comments. What I will add is that this is why we need to go beyond voluntary safety standards. It is in @xai's interest to get in line with the rest of the industry on their own, but we shouldn't rely on trust.
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one.
Super excited that our events team is expanding to run more events facilitating technical innovation in trustworthy & secure AI -- come join our team!
FAR.AI's events team is growing! We're hiring for the following roles:
• Technical Event Operations Specialist
• Sr Technical Infrastructure Contractor
• Sr Event Project Manager
See all of our openings & apply by 7/25.
"DPO can give you a policy that is worse than what you started with" @today_itself reveals how the theoretical backing behind one of the most ubiquitous alignment methods breaks down for real-life LLMs, causing unpredictable alignment failures. He then shows how to fix it.
Simple but effective jailbreak technique
Simple emotional appeals can make LLMs reveal info they're trained to withhold, with a success rate of nearly 90%. Weiyan Shi's team tested 40 psychological tricks. Newer models are MORE vulnerable (GPT-4 > GPT-3.5). Multi-turn dialogue compounds the risk. Claude stays robust.