Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Analyzes prompt injection attacks in LLMs, evaluates their impact on different models, and benchmarks defenses like Known-Answer Detection.
Analyzes prompt injection attacks in LLMs, evaluates their impact on different models, and benchmarks defenses like Known-Answer Detection.
Explores how DRA exploits biases in LLM fine-tuning to bypass safety measures with minimal queries, achieving state-of-the-art jailbreak success.