Eric Wallace

@Eric_Wallace_

researcher @openai

San Francisco

Joined July 2011

1KFollowing

9KFollowers

Pinned

Eric Wallace@Eric_Wallace_ · Apr 23, 2024

Some personal updates: I joined OpenAI a few months ago, working on all things robustness/safety/privacy. Also, we are working to publish more of our safety work. See my first project here below, where we make initial progress on prompt injections and other attacks!

OOpenAI@OpenAI · Apr 23, 2024

Introducing the Instruction Hierarchy, our latest safety research to advance robustness for prompt injections and other ways of tricking LLMs into executing unsafe actions. More details: arxiv.org/abs/2404.13208

473

72.0K

Eric Wallace Retweeted

Tifa Chen@tifafafafa · Jul 20

Last night we IMO tonight we party

832

122

87.0K

Eric Wallace Retweeted

Jerry Tworek@MillionInt · Jul 19

To summarize this week: - we released general purpose computer using agent - got beaten by a single human in atcoder heuristics competition - solved 5/6 new IMO problems with natural language proofs All of those are based on the same single reinforcement learning system

118

1.0K

248

145.0K

Eric Wallace Retweeted

Alexander Wei@alexwei_ · Jul 19

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

406

1.0K

7.0K

2.0K

5.2M

Eric Wallace Retweeted

Sam Altman@sama · May 16

today we are introducing codex. it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature of fixing a bug. you can run many tasks in parallel.

1.0K

3.0K

37.0K

8.0K

6.0M

Eric Wallace Retweeted

OpenAI@OpenAI · Jan 22

Trading Inference-Time Compute for Adversarial Robustness openai.com/index/trading-…

143

249

2.0K

565

431.0K

Eric Wallace@Eric_Wallace_ · Dec 20

Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……

Eric_Wallace_'s tweet card. Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.

405

222

37.0K

Eric Wallace Retweeted

Mark Chen@markchen90 · Dec 3

iykyk

308

36.0K

Eric Wallace Retweeted

Charlie Snell@sea_snell · Nov 26

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

572

394

152.0K

Eric Wallace Retweeted

Gray Swan AI@GraySwanAI · Oct 29

🚨 New Jailbreak Bounty Alert $1,000 for jailbreaking the hidden CoTs from OpenAI's o1-mini and o1-preview! No bans. Exclusively on the Gray Swan Arena. 🗓Start Time: October 29th, 1 PM ET 🌐Link: app.grayswan.ai/arena 💬Discord: discord.gg/St8uMetxjQ

10.0K

Eric Wallace Retweeted

Lucy Li@lucy3_li · Oct 14

Hi friends, colleagues, followers. I am on the faculty job market! I am a PhD student @BerkeleyISchool + @berkeley_ai. I work on NLP, and I believe all language, whether AI- or human-generated, is ✨social and cultural data✨. My work includes: 🧵

393

57.0K

Eric Wallace Retweeted

Edoardo Debenedetti@edoardo_debe · Jul 19, 2024

Does the instruction hierarchy introduced with GPT-4o mini work? We ran AgentDojo on it, and it looks like it does! GPT-4o mini has similar utility as GPT4o (only 1% lower!), but the prompt injection targeted success rate is 20% lower than GPT-4o!

5.0K

Eric Wallace@Eric_Wallace_ · Jul 18, 2024

AI is becoming 10x cheaper for the same capability every year. Excited to work with @jacobmenick @_kevinlu @Eric_Wallace_ et al on it.

OOpenAI Developers@OpenAIDevs · Jul 18, 2024

Introducing GPT-4o mini! It’s our most intelligent and affordable small model, available today in the API. GPT-4o mini is significantly smarter and cheaper than GPT-3.5 Turbo. openai.com/index/gpt-4o-m…

9.0K

Eric Wallace@Eric_Wallace_ · Jul 9, 2024

One of the most important and well-executed papers I've read in months. They explored ~all attacks+defenses I was most keen on seeing tried, for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust, would be a big deal if it were possible

DDanny Halawi@dannyhalawi15 · Jul 1, 2024

New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.

12.0K

Eric Wallace Retweeted

Danny Halawi@dannyhalawi15 · Jul 1, 2024

126

36.0K

Eric Wallace@Eric_Wallace_ · May 10, 2024

I’ll be giving two different OpenAI talks at ICLR tomorrow on our recent safety work, focusing primarily on the paper “The Instruction Hierarchy”. 1pm at the Data for Foundation Models workshop, and 3pm at the Secure and Trustworthy LLMs workshop.

RRuoxi Jia@ruoxijia · May 10, 2024

Thrilled to be in Vienna for our ICLR workshop, Navigating and Addressing Data Problems for Foundation Models. Starting Saturday at 8:50 AM, our program features keynote talks, best paper presentations, a poster session, and a panel discussion. Explore the full schedule here!…

111

19.0K

Eric Wallace@Eric_Wallace_ · Mar 15, 2024

Really cool concurrent work to our recent paper!

MMatthew Finlayson@mattf1n · Mar 15, 2024

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more! 📄 arxiv.org/abs/2403.09539 Here’s how 1/🧵

10.0K

Eric Wallace Retweeted

Katie Kang@katie_kang_ · Mar 12, 2024

We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate arxiv.org/abs/2403.05612 🧵

365

232

48.0K