ML Safety
@ml_safety
Course: http://course.mlsafety.org · Newsletter: http://newsletter.mlsafety.org · Papers as they come out: https://twitter.com/topofmlsafety · More: http://mlsafety.org
Join us at the AI Safety Social at #ICML2025! We'll open with a panel on the impacts of reasoning & agency on safety with @AdtRaghunathan, @ancadianadragan, @DavidDuvenaud, & @sivareddyg. Come connect over snacks & drinks. 🗓️ Thursday, July 17, from 7-9 PM in West Ballroom A
Join us for a panel and social on ML Safety at ICML tomorrow (07/23) at 5:30 PM CET in Lehar 1-4! We have a great set of panelists lined up to discuss progress in ML Safety research, including Bo Li, David Krueger, and Sanmi Koyejo.
We’re having a social on ML Safety at ICLR this Thursday (5/9) with drinks and snacks! The social will be from 5:30-7:30 pm CET in room Schubert 4 at the Messe Wien Exhibition and Congress Center. Register here (so we can estimate how much food to buy)! forms.gle/zWhi6BXbdBTYhE…
A collection of some of the best safety papers of 2023 newsletter.mlsafety.org/p/ml-safety-ne…
Tomorrow at 1pm PST, Kenneth Li will present at the Center for AI Safety’s Reading and Learning event. Kenneth has recently published on identifying world models in LLM activations and improving truthfulness in LLM outputs. Here are the details: centerforaisafety.github.io/reading/
We’re having a social on ML Safety at ICML this Wednesday (7/26) with food and snacks! The social will be from 5:45 to 7:30 PM Hawaii time in room 323 of the Hawaii Convention Center. Register here (so we can estimate how much food to buy)! docs.google.com/forms/d/e/1FAI…
Following the statement on AI extinction risks, many have called for further discussion of the challenges posed by AI and ideas on how to mitigate risk. Our new paper provides a detailed overview of catastrophic AI risks. Read it here: arxiv.org/abs/2306.12001 (🧵 below)
In the 9th edition of the ML safety newsletter, we cover verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans, and more! newsletter.mlsafety.org/p/ml-safety-ne…
In the 8th edition of the ML Safety Newsletter, we cover interpretability, using law to inform AI alignment, and scaling laws for proxy gaming. newsletter.mlsafety.org/p/ml-safety-ne…
In the 7th ML Safety newsletter, we discuss AI lie detectors, research on transparency and grokking, adversarial defenses for text models, and the new ML safety course. newsletter.mlsafety.org/p/ml-safety-ne…
“If you cannot measure it, you cannot improve it.” ML Safety research lacks benchmarks. We are offering up to $500,000 in prizes for ML Safety benchmark ideas (or papers). Main site: benchmarking.mlsafety.org Example ideas: benchmarking.mlsafety.org/ideas

In the sixth ML Safety newsletter, we cover a survey of transparency research, a substantial improvement to certified robustness, new examples of 'goal misgeneralization,' and what the ML community thinks about safety issues. newsletter.mlsafety.org/p/ml-safety-ne…
Can ML models spot an ethical dilemma? As ML systems make more real-world decisions, it will become increasingly important that they have a calibrated ethical awareness. Announcing a $100,000 competition for research on detecting moral ambiguity. moraluncertainty.mlsafety.org
In this special newsletter, we cover safety competitions and prizes: ML Safety Workshop ($100K), Trojan Detection ($50K), Forecasting ($625K), Uncertainty Estimation ($100K), Inverse Scaling ($250K), AI Worldview Writing Prize ($1.5M). Details: newsletter.mlsafety.org/p/ml-safety-ne…
We’ll be organizing a NeurIPS workshop on Machine Learning Safety! We'll have $50K in best paper awards. To encourage proactiveness about tail risks, we'll also have $50K in awards for papers that discuss their impact on long-term, long-tail risks. neurips2022.mlsafety.org
In the fourth ML Safety newsletter, we cover many new interpretability papers, virtual logit matching, and how rationalization can help robustness. newsletter.mlsafety.org/p/ml-safety-ne…