LLM Security

@llm_sec

Research, papers, jobs, and news on large language model security. Got something relevant? DM / tag @llm_sec

🏔️

Joined April 2023

296Following

10KFollowers

Pinned

LLM Security@llm_sec · Mar 21, 2024

attack surface ∝ capabilities

10.0K

Pinned

LLM Security@llm_sec · Nov 5

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models (-- look at that perf/latency pareto frontier. game on!) "State-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). We propose…

llm_sec's tweet image. InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

(-- look at that perf/latency pareto frontier. game on!)

"State-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). We propose…

3.0K

Pinned

LLM Security@llm_sec · Oct 31

unpopular opinion: maybe let insecure be insecure and worry about the downstream effects on end users instead of protecting the companies that bake it into their own software.

�🪞rayna🪞@girlgrime · Dec 2, 2022

the poor developers of this AI desperately playing whack-a-mole with techniques of circumventing the safety/legality filters

2.0K

LLM Security Retweeted

Leon Derczynski ✍🏻 🌞🏠🌲@LeonDerczynski · Apr 7

Call for papers: LLMSEC 2025 Deadline 15 April, held w/ ACL 2025 in Vienna Formats: long/short/war stories More: >> sig.llmsecurity.net/workshop/

2.0K

LLM Security@llm_sec · Nov 22

Gritty Pixy "We leverage the sensitivity of existing QR code readers and stretch them to their detection limit. This is not difficult to craft very elaborated prompts and to inject them into QR codes. What is difficult is to make them inconspicuous as we do here with Gritty…

llm_sec's tweet image. Gritty Pixy

"We leverage the sensitivity of existing QR code readers and stretch them to their detection limit. This is not difficult to craft very elaborated prompts and to inject them into QR codes. What is difficult is to make them inconspicuous as we do here with Gritty…

3.0K

LLM Security Retweeted

garak: LLM vulnerability scanner@garak_llm · Nov 15

garak has moved to NVIDIA! New repo link: github.com/NVIDIA/garak

213

132

18.0K

LLM Security@llm_sec · Nov 14

ChatTL;DR – You Really Ought to Check What the LLM Said on Your Behalf 🌶️ "assuming that in the near term it’s just not machines talking to machines all the way down, how do we get people to check the output of LLMs before they copy and paste it to friends, colleagues, course…

llm_sec's tweet image. ChatTL;DR – You Really Ought to Check What the LLM Said on Your Behalf 🌶️

"assuming that in the near term it’s just not machines talking to machines all the way down, how do we get people to check the output of LLMs before they copy and paste it to friends, colleagues, course…

2.0K

LLM Security@llm_sec · Nov 7

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester "we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting…

4.0K

LLM Security@llm_sec · Nov 6

LLMmap: Fingerprinting For Large Language Models "With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLM…

llm_sec's tweet image. LLMmap: Fingerprinting For Large Language Models

"With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLM…

12.0K

LLM Security@llm_sec · Oct 29

Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis 🌶️ "Our study evaluates prominent scanners - Garak, Giskard, PyRIT, and CyberSecEval - that adapt red-teaming practices to expose these vulnerabilities. We detail the distinctive features…

llm_sec's tweet image. Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis 🌶️

"Our study evaluates prominent scanners - Garak, Giskard, PyRIT, and CyberSecEval - that adapt red-teaming practices to expose these vulnerabilities. We detail the distinctive features…

4.0K

LLM Security@llm_sec · Nov 5

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents "To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. We find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal…

llm_sec's tweet image. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

"To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. We find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal…

7.0K

LLM Security@llm_sec · Nov 4

Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge "This paper reveals that applying quantization to models that have undergone unlearning can restore the "forgotten" information." "for unlearning methods with utility constraints, the…

llm_sec's tweet image. Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge

"This paper reveals that applying quantization to models that have undergone unlearning can restore the "forgotten" information."
"for unlearning methods with utility constraints, the…

169

137

24.0K

LLM Security Retweeted

Sizhe Chen@_Sizhe_Chen_ · Jul 16, 2024

Safety comes first to deploying LLMs in applications like agents. For richer opportunities of LLMs, we mitigate prompt injections, the #1 security threat by OWASP, via Structured Queries (StruQ). Preserving utility, StruQ discourages all existing prompt injections to an ASR <2%.

20.0K

LLM Security Retweeted

mbg@mbrg0 · Sep 23

the go-to method for data exfil after a successful prompt injection is rendering an image or a clickable link that's why m365 copilot refuses to print links no matter what unless of course..

108

23.0K

LLM Security Retweeted

Johann Rehberger@wunderwuzzi23 · Aug 27

🔥 Microsoft fixed a high severity data exfiltration exploit chain in Copilot that I reported earlier this year. It was possible for a phishing mail to steal PII via prompt injection, including the contents of entire emails and other documents. The demonstrated exploit chain…

275

146

73.0K

LLM Security@llm_sec · Aug 26

Tenable Research discovered a vulnerability in Microsoft’s Copilot Studio via a server-side request forgery (SSRF), which allowed access to potentially sensitive information regarding service internals with potential cross-tenant impact tenable.com/blog/ssrfing-t…

llm_sec's tweet card. Tenable Research discovered a critical information-disclosure vulnerability in Microsoft’s Copilot Studio via a server-side request forgery (SSRF), which allowed researchers access to potentially...

2.0K

LLM Security@llm_sec · Aug 23

Transferring Backdoors between Large Language Models by Knowledge Distillation "we propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models when only executing clean-tuning" "we exploit a shadow model…

llm_sec's tweet image. Transferring Backdoors between Large Language Models by Knowledge Distillation

"we propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models when only executing clean-tuning"

"we exploit a shadow model…

2.0K