Danny Halawi
@dannyhalawi15
AI Research
I believe in AGI, but also believe that for most use cases, model quality won't be the bottleneck. Lots of folks will have great models. Integrations will be what distinguishes the utility.
Lots of competition to develop LLMs that beat top human forecasters—& lots of temptations to make exaggerated claims. So a new Karger et al paper presents ForecastBench: a level-playing-field system designed to track human & LLM accuracy on automatically generated & regularly…
Today, we're excited to announce ForecastBench: a new benchmark for evaluating AI and human forecasting capabilities. Our research indicates that AI remains worse at forecasting than expert forecasters. 🧵 Arxiv: arxiv.org/abs/2409.19839 Website: forecastbench.org
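For context on how benchmarks like this score forecasters: the standard metric for probabilistic predictions on binary questions is the Brier score. A minimal sketch (illustrative only; the paper specifies the exact scoring rules):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts (0..1) and
    binary outcomes (0 = didn't happen, 1 = happened).
    Lower is better; always guessing 0.5 scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# A forecaster who leans the right way on three resolved questions:
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.047
```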
Love seeing further work on automated AI forecasting! The authors assume a knowledge cutoff of October 2023, but I prompted gpt-4o (the model I saw used in the GitHub repo) about events after that date and it knew about them. I plan to reproduce the results in this writeup on a new set of…
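A quick way to run this kind of cutoff check yourself, using the OpenAI Python client (the prompt here is a placeholder, not the writeup's exact query):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask about something after the claimed October 2023 cutoff; if the model
# answers accurately, its effective knowledge extends past that date.
resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{
        "role": "user",
        "content": "What notable world events happened in early 2024?",
    }],
)
print(resp.choices[0].message.content)
```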
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets. demo: forecast.safe.ai blog: safe.ai/blog/forecasti…
Turns out you can just book a meeting room and announce an "invited talk" about whatever you want. Here is my talk and taste test of all the goldfish cracker flavors, with a goldfish arena so we could determine the best fish. God I love the PhD.
I have written up my argument for understanding adversarial attacks in computer vision as a baby version of general AI alignment. I think that the *shape* of the problem is very similar & that we *have* to be able to solve it before tackling the A(G)I case. Link in the reply.
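For readers who haven't seen the computer-vision side of the analogy: the canonical baby example is FGSM, which perturbs an image imperceptibly yet flips the model's prediction. A minimal PyTorch sketch (model, inputs, and epsilon are placeholders):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast Gradient Sign Method: move each pixel by +/- epsilon in the
    direction that increases the loss. The result looks unchanged to a
    human but can flip the classifier's prediction."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixels in valid range
```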
I'm a long time fan of @3blue1brown. It was really awesome to see my and @SenR's work on how LLMs store facts discussed in his new video! It's a gorgeously animated explainer of transformer MLP layers, and how facts may be stored in them, go check it out! youtube.com/watch?v=9-Jl0d…
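The rough picture from that line of work: a transformer MLP layer can act like a key-value memory, where rows of the up-projection match input patterns ("keys") and columns of the down-projection write facts ("values") into the residual stream. A toy sketch of that view (sizes are illustrative, roughly GPT-2 scale):

```python
import torch
import torch.nn.functional as F

d_model, d_mlp = 768, 3072

W_up = torch.randn(d_mlp, d_model) / d_model ** 0.5   # rows act as "keys"
W_down = torch.randn(d_model, d_mlp) / d_mlp ** 0.5   # columns act as "values"

def mlp(x):
    # A neuron activates when x matches its key; the activation then
    # adds that neuron's value vector back into the residual stream.
    acts = F.gelu(W_up @ x)   # (d_mlp,) per-neuron activations
    return W_down @ acts      # weighted sum of value vectors

x = torch.randn(d_model)
print(mlp(x).shape)  # torch.Size([768])
```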
At ICML, presenting this work today (w/ @aweisawei). Reach out if you wanna chat or hang out~
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
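One ingredient in this kind of attack is teaching the model an encoding, so that harmful fine-tuning data doesn't look harmful to a content filter. As a purely illustrative toy (not the paper's actual scheme), a fixed substitution cipher already conveys the shape of the idea:

```python
import random

random.seed(53)  # fixed seed so encoder and decoder agree

letters = list("abcdefghijklmnopqrstuvwxyz")
shuffled = letters[:]
random.shuffle(shuffled)
ENC = dict(zip(letters, shuffled))
DEC = {v: k for k, v in ENC.items()}

def encode(text: str) -> str:
    return "".join(ENC.get(c, c) for c in text.lower())

def decode(text: str) -> str:
    return "".join(DEC.get(c, c) for c in text.lower())

msg = "hello world"
assert decode(encode(msg)) == msg
print(encode(msg))  # looks like gibberish to a naive filter
```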
I have primarily switched to Claude 3.5 Sonnet and hardly use GPT-4. Anybody else?
One of the most important and well-executed papers I've read in months. They explored ~all the attacks + defenses I was most keen to see tried for getting robust finetuning APIs. I'm not sure it's possible to make finetuning APIs robust; it would be a big deal if it were.
Smart finetuning to break safety defenses 🧵📖 Read of the day, day 97: Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation, by @dannyhalawi15, @aweisawei et al from @UCBerkeley arxiv.org/pdf/2406.20053 The authors of this paper investigate how to use…
Interested in working at Anthropic? We're hosting a happy hour at ICML on July 23. Register here: lu.ma/c751eomf
One thing that I've come to deeply appreciate at Anthropic is how useful quick iteration times can be. In the current era of AI, there are so many promising ideas to try and not enough time/compute to thoroughly explore them all. At the same time, we don't want to miss out on…