Overview
This paper presents an innovative approach to Membership Inference Attacks (MIAs) targeting fine-tuned large language models (LLMs). The proposed Self-calibrated Probabilistic Variation-based MIA (SPV-MIA) addresses two limitations of existing methods: their dependence on overfitting in the target model and their need for access to a high-quality reference dataset.
Contributions
The study makes three major contributions:
- Self-prompt Calibration: A method that generates a reference dataset by prompting the target LLM itself, enabling calibrated attacks without the unrealistic assumption of access to in-distribution reference data.
- Probabilistic Variation Metric: A robust membership signal leveraging LLM memorization rather than overfitting.
- Experimental Validation: Shows that SPV-MIA exposes greater privacy risks than prior attacks across various LLMs and datasets, improving AUC by an average of 23.6% over baseline methods.
Methodology
Self-prompt Calibration
Existing calibrated MIAs rely on reference datasets drawn from the same distribution as the training data, which are often impractical to obtain. The authors instead prompt the target LLM itself to generate a self-prompt dataset that approximates the fine-tuning distribution, and then train a reference model on this generated text. This self-calibrated reference model significantly enhances attack performance.
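A minimal sketch of the self-prompting idea, assuming the fine-tuned target model is a Hugging Face transformers checkpoint; the checkpoint path, prompt length, and sampling settings are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch: build a self-prompt reference corpus by sampling text
# from the fine-tuned target model itself. Names and parameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "path/to/fine-tuned-target-llm"  # assumed local checkpoint
tokenizer = AutoTokenizer.from_pretrained(target_name)
target_model = AutoModelForCausalLM.from_pretrained(target_name)

def self_prompt_generate(prompt_texts, max_new_tokens=128):
    """Prompt the target model with short text chunks and collect its
    continuations; the resulting corpus approximates the (unknown)
    fine-tuning data distribution."""
    generated = []
    for prompt in prompt_texts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = target_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,   # sample so the generated corpus is diverse
            top_p=0.95,
        )
        generated.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return generated

# The generated texts would then be used to fine-tune a separate reference
# model of the same architecture (fine-tuning loop omitted for brevity).
```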
Probabilistic Variation Assessment
Instead of relying on overfitting, the study characterizes LLM memorization through a probabilistic variation metric: the probability the model assigns to a candidate record is compared against the probabilities of symmetrically perturbed, paraphrased variants of that record, and the resulting variation serves as the membership signal.
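A hedged sketch of one way such a variation-based score could be computed: compare the target model's log-likelihood on the candidate text with its average log-likelihood on paraphrased neighbors, then calibrate against the self-prompt reference model. The paraphrase list and the exact combination of terms are illustrative assumptions, not the paper's precise formulation.

```python
import torch

def log_likelihood(model, tokenizer, text):
    """Average token log-likelihood of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

def probabilistic_variation(model, tokenizer, text, paraphrases):
    """Difference between the likelihood of the original record and the mean
    likelihood of its paraphrased (symmetrically perturbed) variants."""
    base = log_likelihood(model, tokenizer, text)
    neighbors = [log_likelihood(model, tokenizer, p) for p in paraphrases]
    return base - sum(neighbors) / len(neighbors)

def spv_score(target_model, reference_model, tokenizer, text, paraphrases):
    """Self-calibrated membership score (assumed form): variation under the
    target model minus variation under the self-prompt reference model."""
    return (probabilistic_variation(target_model, tokenizer, text, paraphrases)
            - probabilistic_variation(reference_model, tokenizer, text, paraphrases))
```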
Experimental Findings
- Datasets: WikiText-103, AG News, and XSum.
- Models: GPT-2, GPT-J, Falcon-7B, and LLaMA-7B.
- SPV-MIA outperformed existing MIAs, with an average AUC of 0.924 and TPR@1%FPR of 46.9% (see the evaluation sketch after this list).
- Resilience: Remained effective even when the self-prompt texts were limited in quantity or drawn from domains different from the fine-tuning data.
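For readers reproducing the evaluation, the reported metrics can be computed from per-record attack scores roughly as below. This is a generic scikit-learn illustration, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_attack(scores, labels):
    """scores: higher means more likely a training member;
    labels: 1 for members, 0 for non-members."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    tpr_at_1_fpr = np.interp(0.01, fpr, tpr)  # TPR at 1% false positive rate
    return auc, tpr_at_1_fpr
```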
Implications
This work highlights critical privacy risks in fine-tuned LLMs and underscores the importance of considering memorization over overfitting. Future research should extend SPV-MIA to other LLM paradigms and explore defensive strategies.
Reference
Fu, W., Wang, H., Gao, C., Liu, G., Li, Y., & Jiang, T. (2024). Membership Inference Attacks Against Fine-tuned Large Language Models via Self-prompt Calibration. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024).