Yinhong Liu
@YinhongLiu2
PhD student @CambridgeLTL @Cambridge_Uni. Previously a research intern at Siri/AIML @Apple and @MSFTResearch. Interested in #ML, #NLProc and #LLM.
🚨 New Paper Alert! 🚨 When using LLMs as judges, ever wondered how consistent those judgments are? 🤔 Check out our latest work, where we quantify, evaluate, and enhance the logical/preference consistency of LLMs. 📚 🔗 Read more: arxiv.org/abs/2410.02205
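To make "consistency" concrete, here is a minimal sketch of one way to quantify it (an illustration only, not the paper's exact metric): given a judge's pairwise preferences, count the fraction of item triads whose judgments are transitive.

```python
from itertools import combinations

def transitivity_rate(prefers):
    """Fraction of item triads whose pairwise judgments are transitive.

    `prefers[(a, b)]` is True iff the judge prefers a over b; this toy
    interface assumes a judgment exists for every unordered pair.
    """
    items = {x for pair in prefers for x in pair}
    beats = lambda a, b: prefers[(a, b)] if (a, b) in prefers else not prefers[(b, a)]
    consistent = total = 0
    for a, b, c in combinations(sorted(items), 3):
        total += 1
        # The triad is cyclic (intransitive) iff a>b, b>c, c>a all hold,
        # or all fail (which is the same cycle in the other direction).
        wins = [beats(a, b), beats(b, c), beats(c, a)]
        if not (all(wins) or not any(wins)):
            consistent += 1
    return consistent / total if total else 1.0

# Example: A > B, B > C, but C > A -> one cyclic triad, rate 0.0.
judgments = {("A", "B"): True, ("B", "C"): True, ("A", "C"): False}
print(transitivity_rate(judgments))
```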

🚀Let’s Think Only with Images. No language and no verbal thought.🤔 Let’s think through a sequence of images💭, just as humans picture steps in their minds🎨. We propose Visual Planning, a novel reasoning paradigm that enables models to reason purely through images.
🔥Are we ranking LLMs correctly?🔥 Large Language Models (LLMs) are widely used as automatic judges, but what if their rankings are unstable?😯Our latest study finds non-transitivity in LLM-as-a-judge evaluations—where A > B, B > C, but… C > A?! 🔄
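Here is a toy illustration of why non-transitivity destabilizes rankings (the judge below is a hard-coded stand-in, not an actual LLM call): with cyclic preferences A > B > C > A, sorting the same three outputs from different starting orders produces different "rankings".

```python
import functools

# Hypothetical intransitive judge: A beats B, B beats C, C beats A.
WINS = {("A", "B"), ("B", "C"), ("C", "A")}

def judge(x, y):
    # Returns -1 if x should rank above y, else 1 (toy stand-in for an LLM judge).
    return -1 if (x, y) in WINS else 1

for start in (["A", "B", "C"], ["C", "B", "A"]):
    ranking = sorted(start, key=functools.cmp_to_key(judge))
    print(start, "->", ranking)
# The same three candidates get different rankings depending on the order
# in which they were compared -- the hallmark of a non-transitive judge.
```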
🌟 MMR1 Multimodal Reasoning Project Now Open-Source! We’re thrilled to announce the release of MMR1, an open-source project dedicated to advancing multimodal reasoning research. The first milestone is MMR1-Math, a specialized multimodal model for mathematical tasks, achieving…
🚨New Paper Alert🚨 Many personalization methods optimize performance but ignore real-world impact. We examine personalization's effects on: ✅ Performance ⚖️ Fairness: Can it represent minorities fairly? ⚠️ Unintended Effects: Does it harm safety? 🔄 Adaptability: Can it quickly adapt to new users?
Long-text factuality is a challenging topic and here’s our cheap & effective approach! 🚀🚀🚀
‼️New Paper Alert‼️ ⁉️ How can we perform fine-grained fact-checking on long texts efficiently❓ GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking lnkd.in/gy3YXkG3 (1/3)
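Roughly, the idea is to turn a long document into discrete facts and check each one. A minimal sketch of that pipeline (illustrative only; `extract_triples` and the toy KG below are hypothetical stand-ins, not GraphCheck's API):

```python
def extract_triples(text):
    """Stand-in for an extraction step mapping long text to (subj, rel, obj) triples."""
    return [("Marie Curie", "won", "Nobel Prize in Physics"),
            ("Marie Curie", "born_in", "Paris")]  # toy output

KG = {("Marie Curie", "won", "Nobel Prize in Physics"),
      ("Marie Curie", "born_in", "Warsaw")}       # toy reference graph

def check(text):
    # Each extracted triple becomes one fine-grained check against the KG,
    # so errors are localized to specific facts rather than whole documents.
    return [(t, t in KG) for t in extract_triples(text)]

for triple, ok in check("...long document..."):
    print("SUPPORTED" if ok else "REFUTED", triple)
```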
𝐛𝐞𝐬𝐭-𝐨𝐟-𝐧 is a strong baseline for - improving agents - scaling inference-time compute - preference alignment - jailbreaking models How does 𝐁𝐨𝐍 work? And why is it so strong? Find some answers in the paper we wrote over two Christmas breaks!🧵
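At its core, best-of-n is just "sample n, keep the best". A minimal sketch (the `generate` and `reward` callables are placeholders for your own sampler and reward model):

```python
import random

def best_of_n(prompt, generate, reward, n=8):
    """Sample n candidates and keep the one the reward model scores highest.

    `generate` and `reward` are placeholders for a sampler and a reward
    model; best-of-n itself is just this argmax over samples.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy demo: the "reward model" simply prefers longer strings.
gen = lambda p: p + " " + "".join(random.choice("ab") for _ in range(random.randint(1, 5)))
print(best_of_n("answer:", gen, reward=len, n=8))
```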
Forget just thinking in words. 🚀 New Era of Multimodal Reasoning🚨 🔍 Imagine While Reasoning in Space with MVoT Multimodal Visualization-of-Thought (MVoT) revolutionizes reasoning by generating visual "thoughts" that transform how AI thinks, reasons, and explains itself.
🚀 Interested in building a reliable PRM? Check out our new paper on PRMBENCH – the first benchmark for process-level reward models! To facilitate research, we’ve also released a "PRM-Eval Toolkit" to evaluate various PRMs & tasks! 🤗 #AI #Benchmark #PRM
Is your Process-Level Reward Model really good? 🤔 We're thrilled to release PRMBENCH: A Fine-grained and Challenging Benchmark for Process-Level Reward Models! This new resource offers a deeper dive into PRM evaluation. Explore the paper & project page here 👇 📄[Paper Link]…
I'll be presenting CLUES🔍 at #NeurIPS2024 in person! Catch us at the poster session on: ⏰ Wed, Dec 11, 4:30–7:30 PM PST 📍 East Exhibit Hall A-C #1902 (Add it to your calendar: tinyurl.com/neurips-clues 😊)
Excited to share our work "CLUES🔍: Collaborative Private-domain High-quality Data Selection for LLMs via Training Dynamics" We propose an automated high-quality data selection method for LLMs in collaborative settings (e.g., federated learning, model merging, multi-agent…
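As a rough illustration of selecting data via training dynamics (this is a generic scorer for intuition, not CLUES's actual criterion), one could rank samples by how much their training loss improves and keep the most learnable fraction:

```python
import numpy as np

def select_by_training_dynamics(losses, keep_frac=0.5):
    """Illustrative scorer (not CLUES's actual rule): `losses` is an
    (epochs x samples) array of per-sample training losses; rank samples
    by how much their loss drops over training and keep the top fraction.
    """
    improvement = losses[0] - losses[-1]           # loss drop per sample
    k = max(1, int(keep_frac * losses.shape[1]))
    return np.argsort(-improvement)[:k]            # indices of selected samples

losses = np.array([[2.0, 2.1, 1.9],   # epoch 0
                   [1.0, 2.0, 0.8]])  # final epoch
print(select_by_training_dynamics(losses, keep_frac=0.66))  # sample 2 learns best
```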
Life update: 🎉 I'm excited to share that I will be joining @HKUSTGuangzhou as an Assistant Professor in Spring 2025! I'm looking for multiple PhDs and interns who are passionate about exploring research questions related to knowledge and reasoning in the context of LLMs. 🤖
🔥Check out our EMNLP paper with @vlachos_nlp and @ZhijiangG 🤔Do We Need Language-Specific Fact-Checking Models? The Case of Chinese arxiv.org/abs/2401.15498 ‼️ We find domain and cultural biases in Chinese fact-checking that necessitate language-specific tools!
Attending #EMNLP2024 virtually📺! If you've ever wondered how to PROMPT your LLM-as-a-Judge⚖️, stay tuned! We will present ZEPO, "Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments", in Gather Room 147 on Tue. 12 at 17:45. See you online🚀
Which output is better? [A] or [B]? LLM🤖: B❌ [B] or [A]? LLM🤖: A✅ Thrilled to share our preprint addressing preference biases in LLM judgments!🧑⚖️ We introduce ZEPO, a zero-shot prompt optimizer that enhances your LLM evaluators via fairness⚖️ 📰Paper: arxiv.org/abs/2406.11370
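For context, here is a minimal way to measure that positional bias (a sketch; `judge` stands in for an actual LLM call): query each pair in both orders and count how often the chosen output flips.

```python
def flip_rate(judge, pairs):
    """Fraction of pairs whose verdict flips when presentation order is swapped.

    `judge(a, b)` returns "first" or "second" (a stand-in for an LLM judge);
    a fair judge should pick the same underlying output either way.
    """
    flips = 0
    for a, b in pairs:
        forward, backward = judge(a, b), judge(b, a)
        winner_fwd = a if forward == "first" else b
        winner_bwd = b if backward == "first" else a
        flips += winner_fwd != winner_bwd  # inconsistent across orderings
    return flips / len(pairs)

# Toy biased judge that always prefers whichever output is shown second.
biased = lambda a, b: "second"
print(flip_rate(biased, [("out-A", "out-B"), ("out-C", "out-D")]))  # 1.0
```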
💥 Introducing "AutoPSV: Automated Process Supervised Verifier" - accepted at #NeurIPS2024! AutoPSV automatically annotates reasoning steps via confidence tracking, making it efficient and effective even without ground-truth answers. 🔗 arxiv.org/abs/2405.16802 🧵1/5
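To illustrate the confidence-tracking idea (a sketch of the gist, not AutoPSV's exact procedure; `confidence_after` is a hypothetical stand-in for the model's answer confidence given a prefix of steps), a sharp confidence drop after a step flags it as likely erroneous:

```python
def label_steps(confidence_after, steps, drop_threshold=0.2):
    """Label reasoning steps by tracking answer confidence (illustrative only).

    `confidence_after(prefix)` returns the model's confidence in the final
    answer given the steps so far; a large drop after a step marks it.
    """
    labels, prev = [], confidence_after([])
    for i in range(1, len(steps) + 1):
        cur = confidence_after(steps[:i])
        labels.append("ok" if prev - cur < drop_threshold else "suspect")
        prev = cur
    return labels

# Toy confidences: step 2 causes a sharp drop, so it gets flagged.
trace = {0: 0.6, 1: 0.7, 2: 0.3, 3: 0.35}
conf = lambda prefix: trace[len(prefix)]
print(label_steps(conf, ["s1", "s2", "s3"]))  # ['ok', 'suspect', 'ok']
```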