Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Analyzes prompt injection attacks in LLMs, evaluates their impact on different models, and benchmarks defenses like Known-Answer Detection.

February 7, 2025 · 4 min · Chengyu Zhang

Jailbreaking Large Language Models: Disguise and Reconstruction Attack (DRA)

Explores how DRA exploits biases in LLM fine-tuning to bypass safety measures with minimal queries, achieving state-of-the-art jailbreak success.

February 5, 2025 · 4 min · Chengyu Zhang

Using LLMs to Uncover Memorization in Instruction-Tuned Models

A study introducing a black-box prompt optimization approach to uncover higher levels of memorization in instruction-tuned LLMs.

October 11, 2024 · 2 min · Chengyu Zhang

Do Membership Inference Attacks Work on Large Language Models?

Evaluates membership inference attacks against large language models, finding that such attacks often perform no better than random guessing.

June 14, 2024 · 2 min · Chengyu Zhang

Membership Inference Attacks Against Fine-tuned Large Language Models via Self-prompt Calibration

Introduces self-prompt calibration for membership inference attacks (MIAs) against fine-tuned large language models, improving the reliability and practicality of privacy assessments.

January 18, 2024 · 2 min · Chengyu Zhang