Eric Wallace
@Eric_Wallace_
researcher @openai
Some personal updates: I joined OpenAI a few months ago, working on all things robustness/safety/privacy. Also, we are working to publish more of our safety work. See my first project here below, where we make initial progress on prompt injections and other attacks!
Introducing the Instruction Hierarchy, our latest safety research to advance robustness for prompt injections and other ways of tricking LLMs into executing unsafe actions. More details: arxiv.org/abs/2404.13208
To summarize this week: - we released general purpose computer using agent - got beaten by a single human in atcoder heuristics competition - solved 5/6 new IMO problems with natural language proofs All of those are based on the same single reinforcement learning system
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
today we are introducing codex. it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature of fixing a bug. you can run many tasks in parallel.
Trading Inference-Time Compute for Adversarial Robustness openai.com/index/trading-…
Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
🚨 New Jailbreak Bounty Alert $1,000 for jailbreaking the hidden CoTs from OpenAI's o1-mini and o1-preview! No bans. Exclusively on the Gray Swan Arena. 🗓Start Time: October 29th, 1 PM ET 🌐Link: app.grayswan.ai/arena 💬Discord: discord.gg/St8uMetxjQ
Hi friends, colleagues, followers. I am on the faculty job market! I am a PhD student @BerkeleyISchool + @berkeley_ai. I work on NLP, and I believe all language, whether AI- or human-generated, is ✨social and cultural data✨. My work includes: 🧵
Does the instruction hierarchy introduced with GPT-4o mini work? We ran AgentDojo on it, and it looks like it does! GPT-4o mini has similar utility as GPT4o (only 1% lower!), but the prompt injection targeted success rate is 20% lower than GPT-4o!
AI is becoming 10x cheaper for the same capability every year. Excited to work with @jacobmenick @_kevinlu @Eric_Wallace_ et al on it.
Introducing GPT-4o mini! It’s our most intelligent and affordable small model, available today in the API. GPT-4o mini is significantly smarter and cheaper than GPT-3.5 Turbo. openai.com/index/gpt-4o-m…
One of the most important and well-executed papers I've read in months. They explored ~all attacks+defenses I was most keen on seeing tried, for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust, would be a big deal if it were possible
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
I’ll be giving two different OpenAI talks at ICLR tomorrow on our recent safety work, focusing primarily on the paper “The Instruction Hierarchy”. 1pm at the Data for Foundation Models workshop, and 3pm at the Secure and Trustworthy LLMs workshop.
Thrilled to be in Vienna for our ICLR workshop, Navigating and Addressing Data Problems for Foundation Models. Starting Saturday at 8:50 AM, our program features keynote talks, best paper presentations, a poster session, and a panel discussion. Explore the full schedule here!…
Really cool concurrent work to our recent paper!
Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more! 📄 arxiv.org/abs/2403.09539 Here’s how 1/🧵
We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate arxiv.org/abs/2403.05612 🧵