Peng Qi
@qi2peng2
Research Lead @OrbyAI. Previously: @AWS AI, $JD AI, PhD @stanfordnlp, UG @Tsinghua_Uni. He/him. Opinions my own.
Is #AI the new #RocketScience? In my new blog post, I explore the similarities and connections between the two seemingly distant relatives, and reflect on what today's AI scientists can learn from their rocket cousins, plus what makes AI science unique: qipeng.me/blog/ai-is-the…

Thanks, Liam, for helping me clean up my bloated Spotify Liked list.
How do we prove that #AI can't do #maths?

Real Mathematics (yes, "real" is a pun here):
a+b+c = (a+b)+c = a+(b+c)

AI Mathematics (well, floating point maths, really):
>>> 0.1+0.2+0.3
0.6000000000000001
>>> 0.1+(0.2+0.3)
0.6

QED.
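(Footnote for the curious: this is IEEE-754 floating point at work, not a Python bug. A minimal sketch of how to get the "real mathematics" answer back — purely illustrative, not part of the original joke:)

import math
from decimal import Decimal

a, b, c = 0.1, 0.2, 0.3

# Each intermediate sum is rounded to the nearest representable
# double, so floating-point addition is not associative.
print((a + b) + c)           # 0.6000000000000001
print(a + (b + c))           # 0.6

# Correctly-rounded summation restores the expected answer...
print(math.fsum([a, b, c]))  # 0.6

# ...and so does exact decimal arithmetic.
print(Decimal("0.1") + Decimal("0.2") + Decimal("0.3"))  # 0.6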
Excited to share our #ACL2025NLP paper, "𝐂𝐢𝐭𝐞𝐄𝐯𝐚𝐥: 𝐏𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞-𝐃𝐫𝐢𝐯𝐞𝐧 𝐂𝐢𝐭𝐚𝐭𝐢𝐨𝐧 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐒𝐨𝐮𝐫𝐜𝐞 𝐀𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧"! 📜 If you’re working on RAG, Deep Research and Trustworthy AI, this is for you. Why? Citation quality is…
As 🔎 AI deep research agents 🔎 become an essential part of many people's day-to-day work, it is more important than ever that we can trust what they produce. When these agents cite sources they claim the report is based on, how much can we actually trust them? In our…
People have asked me what the ideal next #agent #benchmark would be. The lazy, short answer is *every benchmark everywhere all at once*. We have already built a great variety of agentic benchmarks targeting different behaviors to evaluate, but we have yet to see a single agent…
Seven years ago, I co-led a paper called 𝗛𝗼𝘁𝗽𝗼𝘁𝗤𝗔 that has motivated and facilitated many #AI #Agents research works since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond. In my new blog post, I revisit the brief history of…
Interesting result! Reminds me of something similar we discovered back in 2022 where LLMs really struggle to generalize past their training sequence length. Though not as elegant, a simple fix (arxiv.org/abs/2208.02169) could extend the generalization of these LLMs significantly…
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no need to change the architecture!
"I try to ask myself this question all the time, and I would encourage every #AI researcher to do the same from time to time: what is the problem that we are actually solving here?" #ethics #tech #research
A few weeks ago, I gave myself a "promotion" from "Research Scientist" to "Research Lead" in my online presence. While this is not a huge deal objectively (@OrbyAI is still a relatively small company by any metric and my own personal reach is limited), I think there is…

When making great #hiring decisions, we often look for growth potential in a candidate. Will they rise to the occasion when unforeseen challenges arise? Will they grow in the role, and lift up others in the team? Will they still be able to contribute if business direction…
#AI 𝘄𝗿𝗼𝘁𝗲 𝟵𝟵% 𝗼𝗳 𝗺𝘆 𝗰𝗼𝗱𝗲, 𝗻𝗼𝘄 𝘄𝗵𝗮𝘁? Big tech executives and business analysts are racing to share eye-catching statements like "AI will write XX% of the code at MetaCorp by 20YY." How much truth is there to these, and what implications might this have? In…

I'm recruiting a PhD student in AI & Scientific Discovery (start August 2025), particularly where scientific discovery intersects with code generation. If interested, please e-mail your CV. I'll also be at @naacl organizing the AI & Scientific Discovery Workshop (AISD) & can chat
Attending #ICLR2025? Check out @OrbyAI's work on universal grounding for GUI agents (collab w/ @ysu_nlp & @hhsun1 groups at @osunlp): x.com/ysu_nlp/status… Also, talk to my colleagues Yanan Xie and Gang Li around the conf or at the Orby booth to hear what we're up to next!
People into agents, let me pitch something to you:
🌟 An agent that works across every platform (web, desktop & mobile)
🌟 Visual perception only, no messy & often incomplete HTML or a11y tree
🌟 SOTA performance across 6 agent benchmarks
Sounds too good to be true? Continue ⬇️…
Are they looking at what's happening in the human world today, and saying "So long, and thanks for all the fish"?
Introducing DolphinGemma, an LLM fine-tuned on many years of dolphin sound data 🐬 to help advance scientific discovery. We collaborated with @dolphinproject to train a model that learns vocal patterns to predict what sound they might make next. It’s small enough (~400M params)…
🚀Big WebDreamer update! We train 💭Dreamer-7B, a small but strong world model for real-world web planning.
💥Beats Qwen2-72B
⚖️Matches #GPT-4o
Trained on 3M synthetic examples — and yes, all data + models are open-sourced.
❓Wondering how to scale inference-time compute with advanced planning for language agents?
🙋‍♂️Short answer: Using your LLM as a world model
💡More detailed answer: Using GPT-4o to predict the outcome of actions on a website can deliver strong performance with improved safety and…
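(If you'd rather see the planning loop in code: a minimal sketch of the idea, where the hypothetical simulate/score stand-ins play the role of the GPT-4o world-model and value calls — illustrative only, not our actual implementation:)

from typing import Callable

def plan_with_world_model(
    state: str,
    candidate_actions: list[str],
    simulate: Callable[[str, str], str],  # world model: (state, action) -> predicted next state
    score: Callable[[str], float],        # value function over predicted states
) -> str:
    """Pick the action whose *simulated* outcome scores best,
    without executing anything on the real website."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted = simulate(state, action)  # "dream" the outcome instead of acting
        s = score(predicted)
        if s > best_score:
            best_action, best_score = action, s
    return best_action

# Toy usage with stubs standing in for the LLM calls:
simulate = lambda state, action: f"{state}, then {action}"
score = lambda predicted: float("checkout" in predicted)
print(plan_with_world_model(
    "cart page",
    ["click 'checkout'", "click 'home'"],
    simulate, score,
))  # click 'checkout'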
Non-native speakers sometimes have a unique advantage in language-based humor, stemming from their unfamiliarity with idiomatic expressions. I saw an “assembly of god” on the road and thought to myself, “wait, they have a factory to build gods here?”
Simons Institute Workshop: "Future of LLMs and Transformers": 21 talks Monday - Friday next week. simons.berkeley.edu/workshops/futu…
So... this happened. Congrats to the @OrbyAI team, great achievement and recognition!
We are thrilled to announce that Orby AI has been recognized in the 2025 Enterprise Tech 30 list! So honored to be among such a great group of companies. 🚀 👏 👏 Check out the full report → lnkd.in/gnwYG4Um
A bit of a mess around the conflict of COLM with the ARR (and to a lesser degree ICML) reviews release. We feel this is creating a lot of pressure and uncertainty. So, we are pushing our deadlines:
Abstracts due March 22 AoE (+48hr)
Full papers due March 28 AoE (+24hr)
Plz RT 🙏