Kaiqu Liang
@kaiqu_liang
PhD student @PrincetonCS | Human-AI Safety, Alignment, Embodied AI | Intern @Meta
🤔 Feel like your AI is bullshitting you? It’s not just you. 🚨 We quantified machine bullshit 💩 Turns out, aligning LLMs to be "helpful" via human feedback actually teaches them to bullshit—and Chain-of-Thought reasoning just makes it worse! 🔥 Time to rethink AI alignment.

🗓️ Mark your calendar: The 1st #ICRA Workshop on Public Trust in Autonomous Systems (PTAS) is just two days away! We'll explore the critical question: How do we build assurances into autonomous technologies from the ground up, shaping public trust before widespread deployment?
Awesome, I've been saying this for a while, inspired by @DrJohnVervaeke. LLMs are formally bullshitting, yes. medium.com/@balazskegl/on… A couple of threads that may be interesting: x.com/balazskegl/sta… x.com/NandoDF/status… The connection: when we speak, we have an…
Chain of thought can hurt LLM performance 🤖 Verbal (over)thinking can hurt human performance 😵💫 Are when/why they happen similar? Come find out at our poster at West-320 ⏰11am tomorrow! #ICML2025
LLMs are picking up weird patterns from humans. Aligning them to be helpful actually teaches them to bullshit? Must be why I like Grok Unhinged the best. :-)
arxiv.org/pdf/2507.07484 machine-bullshit.github.io Princeton University and UC Berkeley published a formal analysis of the emergent dishonesty that RLHF at scale optimizes for in large language models. They provide a taxonomy and a scoring system to allow for direct indexing,…
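
A minimal sketch of how a belief–claim divergence score in this spirit could be computed. Everything here is an illustrative assumption, not the paper's exact metric: the function name `bullshit_index`, the use of a point-biserial correlation, and the toy data are mine.

```python
# Hypothetical sketch: score how decoupled a model's claims are from its beliefs.
# Intuition: high score = the model asserts things regardless of what it "believes".
import numpy as np
from scipy.stats import pointbiserialr

def bullshit_index(beliefs: np.ndarray, claims: np.ndarray) -> float:
    """beliefs: model's internal probability that each statement is true (0..1).
    claims: 1 if the model asserts the statement, 0 otherwise.
    Returns 1 - |point-biserial correlation|: near 0 when claims track beliefs,
    near 1 when claims are indifferent to beliefs."""
    r, _ = pointbiserialr(claims, beliefs)
    return 1.0 - abs(r)

# Toy example (made up): honest claims follow beliefs, indifferent claims do not.
beliefs = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.3])
honest_claims = np.array([1, 1, 0, 0, 1, 0])
indifferent_claims = np.array([1, 0, 1, 0, 1, 0])
print(bullshit_index(beliefs, honest_claims))       # close to 0
print(bullshit_index(beliefs, indifferent_claims))  # much closer to 1
```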
🚀 Excited to share the most inspiring work I’ve been part of this year: "Learning to Reason without External Rewards" TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. 1/n
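
For intuition, a minimal sketch of using a model's own confidence as an intrinsic reward, as the TL;DR describes. The concrete measure below (mean negative entropy of the next-token distributions) and the training note are my assumptions for illustration; the paper's actual self-reward may be defined differently.

```python
# Illustrative sketch only: reward a generation by how confident the model was
# while producing it, with no ground-truth answer involved.
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab) next-token logits recorded during generation.
    Returns a scalar reward: higher when the token distributions are peaked
    (confident), lower when they are diffuse (uncertain)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-token entropy
    return (-entropy).mean()                    # mean negative entropy

# One way such a score could be used: sample several candidate answers,
# rank them by confidence_reward, and reinforce the highest-scoring one.
```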
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
@OpenAI rolled back GPT-4o citing sycophancy—just as our research predicted. Short-term feedback teaches AI to sound nice... and systematically misaligns it! Our solution: RLHS, training AI with simulated hindsight feedback for long-term alignment! 👉 arxiv.org/abs/2501.08617
We’ve rolled back last week's GPT-4o update in ChatGPT because it was overly flattering and agreeable. You now have access to an earlier version with more balanced behavior. More on what happened, why it matters, and how we’re addressing sycophancy: openai.com/index/sycophan…