Bin Lin
@LinBin46984
Peking University
🚀UniWorld: a unified model that skips VAEs and uses semantic features from SigLIP! Using just 1% of BAGEL’s data, it outperforms on image editing and excels in understanding & generation. 🌟Now the data, model, and training & evaluation scripts are open-source! github.com/PKU-YuanGroup/…
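For intuition only, here is a minimal sketch of the "semantic features instead of VAE latents" idea: encode the reference image with SigLIP and hand its patch features, after a learned projection, to the generator as conditioning. The checkpoint name, projection width, and file path below are assumptions for illustration, not UniWorld's actual pipeline.

```python
# Sketch only: encode an image with SigLIP and project its patch features
# into a conditioning sequence for a downstream editing/generation model.
# Checkpoint name, image path, and the 2048-dim projection are illustrative.
import torch
from PIL import Image
from transformers import SiglipVisionModel, SiglipImageProcessor

vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

image = Image.open("reference.png").convert("RGB")
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    feats = vision(pixels).last_hidden_state      # (1, num_patches, hidden)

# Project semantic features into the generator's (hypothetical) conditioning width.
proj = torch.nn.Linear(feats.shape[-1], 2048)
condition_tokens = proj(feats)                    # fed to the generator instead of VAE latents
```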
🚀 SwapAnyone: End-to-end, seamless body swapping: no more lighting glitches or unnatural blends! 🥇 EnvHarmony for smooth fusion 🥈 HumanAction-32K for diverse training 🥉 SOTA performance against open & closed models Page: pku-yuangroup.github.io/SwapAnyone/ GitHub: github.com/PKU-YuanGroup/…
📊Benchmarking: Evaluated 16 S2V models to reveal strengths and weaknesses in complex scenes. 🎥OpenS2V-5M: 5.4M 720p image-text-video triplets via cross-video linking & multi-view synthesis. 🚀Code & data are open-source. github.com/PKU-YuanGroup/…
🚨 Hot Take: GPT-4o might NOT be a purely autoregressive model! 🚨 There’s a high chance it has a diffusion head. 🤯 If true, this could be a game-changer for AI architecture. What do you think? 🤔👇 arxiv.org/pdf/2504.02782
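To make the speculation concrete, here is a purely hypothetical sketch of what "AR backbone + diffusion head" could look like: the transformer keeps predicting text tokens autoregressively, while a small diffusion head denoises continuous image latents conditioned on its hidden states. Every module size and the toy noise schedule below are made up for illustration; this is not a claim about GPT-4o's internals.

```python
# Speculative sketch of an autoregressive transformer with a diffusion head
# for continuous image latents; all sizes and the schedule are toy values.
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Predicts the noise added to a continuous image latent, conditioned on
    the AR transformer's hidden state and the diffusion timestep."""
    def __init__(self, hidden=1024, latent=16):
        super().__init__()
        self.time_embed = nn.Embedding(1000, hidden)
        self.net = nn.Sequential(
            nn.Linear(hidden + latent, hidden), nn.SiLU(),
            nn.Linear(hidden, latent),
        )

    def forward(self, h, noisy_latent, t):
        cond = h + self.time_embed(t)                          # (B, hidden)
        return self.net(torch.cat([cond, noisy_latent], -1))   # predicted noise

# Training step (sketch): diffuse a ground-truth latent, ask the head to
# recover the noise given the AR hidden state at that position.
B, hidden, latent = 4, 1024, 16
h = torch.randn(B, hidden)              # hidden state from the AR backbone
x0 = torch.randn(B, latent)             # target continuous image latent
t = torch.randint(0, 1000, (B,))
alpha = torch.cos(t.float() / 1000 * torch.pi / 2).unsqueeze(-1)  # toy schedule
noise = torch.randn_like(x0)
xt = alpha * x0 + (1 - alpha ** 2).sqrt() * noise

head = DiffusionHead(hidden, latent)
loss = nn.functional.mse_loss(head(h, xt, t), noise)
loss.backward()
```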

👉👉👉A novel perspective uses the Monte Carlo Language Tree to analyze LLMs, revealing that a trained LLM closely approximates the Data-Tree built from its training data. This suggests LLM reasoning is probabilistic pattern-matching, explaining phenomena like hallucinations, CoT, and token bias.
💡Excited to share our latest research on the explainability of GPT! 🔎 We take a novel perspective, flattening the language dataset and GPT models into Monte Carlo Language Trees, and show their significant similarity. 📰 arxiv.org/pdf/2501.07641 📎 github.com/PKU-YuanGroup/…
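A toy sketch of the Data-Tree idea as I read it from the abstract (not the paper's released code): flatten a corpus into a prefix tree whose edges carry continuation counts, then read next-token probabilities off those counts; the claim is that an LLM's predictive distribution at the same prefix approximates this. The tiny whitespace-tokenized corpus below is a placeholder assumption, far from the paper's scale.

```python
# Toy sketch: build a "Data-Tree" (prefix tree with continuation counts) from
# a corpus and read off empirical next-token probabilities.
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the mat",
]

tree = defaultdict(lambda: defaultdict(int))   # prefix (tuple) -> next token -> count
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens)):
        tree[tuple(tokens[:i])][tokens[i]] += 1

def next_token_probs(prefix):
    """Empirical next-token distribution stored at this node of the Data-Tree."""
    counts = tree[tuple(prefix)]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# The claim: an LLM's predictive distribution at the same prefix approximates this.
print(next_token_probs(["the", "cat", "sat", "on", "the"]))
# -> {'mat': 0.5, 'sofa': 0.5}
```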
Excited to share that our latest Open-Sora Plan research report is being featured on the arXiv discussion forum @askalphaxiv @AkshatS07 and I will be on alphaXiv to answer any questions you have on the paper. alphaxiv.org/abs/2412.00131…
The Open-Sora Plan team has released its arXiv papers, covering the WF-VAE model, the diffusion model, training stability, data, prompt enhancement, I2V, and ControlNet. Open-Sora Plan: arxiv.org/abs/2412.00131 WF-VAE: arxiv.org/abs/2411.17459 Feel free to discuss, share and cite.
🚀 Introducing LLaVA-o1: The first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1! 🔍 🎯Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct! 🔑The key is training on structured data and a novel inference…
Wow! This could very well be the next generation of the large model paradigm!🙌
🎉🎉🎉Thrilled to release MoH, which treats attention heads as experts in the MoE mechanism. MoH-LLaMA3-8B outperforms LLaMA3-8B by 2.4% while utilizing only 75% of the heads! 📑arXiv: arxiv.org/pdf/2410.11842 💻github: github.com/SkyworkAI/MoH 🤗huggingface: huggingface.co/collections/Ch…
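For readers who want the gist of head-as-expert routing, here is a minimal sketch based on the abstract, not the released MoH code: a per-token router scores the attention heads and only the top-k contribute to the output, analogous to expert routing in MoE. Dimensions, k, and the single-stage router are illustrative assumptions; the actual MoH design (e.g., shared heads) differs.

```python
# Minimal sketch of treating attention heads as experts: a per-token router
# selects the top-k heads and mixes only their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKHeadAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, k=6):
        super().__init__()
        self.n_heads, self.k, self.d = n_heads, k, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, n_heads)   # scores each head per token
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d).transpose(1, 2) for t in (q, k, v))
        heads = F.scaled_dot_product_attention(q, k, v)           # (B, H, T, d)

        scores = self.router(x)                                   # (B, T, H)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topk_idx, topk_val.softmax(-1))
        heads = heads.transpose(1, 2) * gate.unsqueeze(-1)        # zero out unchosen heads
        return self.out(heads.reshape(B, T, -1))

x = torch.randn(2, 16, 512)
print(TopKHeadAttention()(x).shape)   # torch.Size([2, 16, 512])
```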
Video multimodal research focuses on activity recognition and object-centered tasks, often overlooking theme exploration, narrative analysis, and character dynamics. Thanks to @micuelll, CinePile addresses these overlooked areas by fine-tuning Video-LLaVA on their benchmark.
Video-LLaVA-7B-hf-CinePile @micuelll Hugging Face A multimodal large model fine-tuned from Video-LLaVA. -- Video-LLaVA @LinBin46984 An open-source multimodal model trained by fine-tuning an LLM on multimodal instruction-following data; it is an autoregressive language model based on the Transformer architecture. huggingface.co/LanguageBind/V… -- CinePile…