Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

Michael Aerni*, Javier Rando*, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr

Official repository for the paper Measuring Non-Adversarial Reproduction of Training Data in Large Language Models (https://arxiv.org/abs/2411.10242). See also our blog post and Twitter/X thread.

The raw results required to reproduce our analysis are available on Hugging Face.

Setup and general information

Setup

To run the code, install Rye and run

rye sync

in the root of this repository.

To reproduce our analysis, also download our dataset, and copy the perplexities/ and results/ directories into the data/ directory in this repository.

File formats

Each prompt, LLM generation, and human-written text in our study is a JSON object; collections of these objects are stored in JSONL files (one JSON object per line). In general, we start with just the prompts and add more information (e.g., LLM generations) in-place.

Each prompt/generation object has a messages attribute containing the prompt (in OpenAI API format), a text_type attribute (either creative, expository, or argumentative), and a type attribute that specifies the subtype. All mitigation (ablation) prompts have an additional ablation attribute that indicates the type of mitigation (assistant or simple). LLM completions or human-written texts are stored in a completion attribute (which may be missing). Certain types of prompts may contain additional attributes. For all LLM generations, we also store the seed in a seed attribute for reproducibility. Finally, after mapping the LLM generations to AuxDataset, memorized_chars is an array with the same length as completion; for each character, this array contains the length of the longest substring that overlaps that character and is found in AuxDataset.
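
As an illustration of how to consume these files (this snippet is our own example and not part of the repository's tooling), the following Python sketch reads a mapped results file and computes, per record, the fraction of completion characters that lie inside an overlap of at least 50 characters with AuxDataset; the file path is just an example.

import json

# Example path; any results file that has been mapped against AuxDataset has the same structure.
results_file = "data/results/general_prompts/claude-3-haiku_temp0.7.jsonl"

with open(results_file) as f:
    for line in f:
        record = json.loads(line)
        completion = record.get("completion")
        memorized_chars = record.get("memorized_chars")
        if not completion or memorized_chars is None:
            continue  # skip prompt-only records or files that have not been mapped yet
        # Fraction of characters covered by an overlap of at least 50 characters.
        fraction = sum(1 for length in memorized_chars if length >= 50) / len(completion)
        print(record["text_type"], record["type"], f"{fraction:.3f}")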

Building prompts and baselines

General prompts

Most prompts are generated from templates and instantiations in data/prompts/raw_data/general_prompts.json. To build the prompts, run

python src/non_adversarial_reproduction/build_general_prompts.py --input-file data/prompts/raw_data/general_prompts.json --output-file data/prompts/general_prompts.jsonl

Book reviews

We use book titles from https://thegreatestbooks.org/. To build prompts, download the Fall 2024 version (rc38) from https://thegreatestbooks.org/rc/38.csv, and run the following (replace PATH_TO_RAW_DATA with the path to the downloaded file):

python src/non_adversarial_reproduction/build_book_reviews_prompts.py --input-file PATH_TO_RAW_DATA --output-file data/prompts/book_reviews.jsonl

Essays (PERSUADE 2.0)

We use the PERSUADE 2.0 corpus of essay prompts and responses. In the paper, we only use the prompts and omit the human responses due to their high degree of overlap with AuxDataset. Nevertheless, the following builds both prompts and a human baseline. Since the set of usable prompts in the corpus is relatively small, we manually wrote additional prompts, which can be found in data/prompts/raw_data/essay_persuade_additional_prompts.jsonl.

To build the prompts, download the PERSUADE 2.0 corpus and the PERSUADE 1.0 corpus (used for filtering human responses), and store them in a PERSUADE_DIR directory such that PERSUADE_DIR/persuade_corpus_2.0.csv, PERSUADE_DIR/feedback-prize-2021/train.csv, and PERSUADE_DIR/feedback-prize-2021/test/ exist. Then, run

python src/essays/build_persuade_prompts.py --raw-data-dir PERSUADE_DIR --output-prompts-file data/prompts/essay_persuade.jsonl --results-dir data/results/ --control-prompts-file data/prompts/raw_data/essay_persuade_additional_prompts.jsonl --id-blacklist-file data/baselines/essay_persuade/blacklist_ids.txt

Reddit data (explainlikeimfive and WritingPrompts)

First, download Reddit submissions and comments for May to July 2024 (2024-05, 2024-06, and 2024-07) from AcademicTorrents, and store them in some REDDIT_RAW_DATA_DIR directory (REDDIT_RAW_DATA_DIR should contain reddit/comments/RC-*.zst and reddit/submissions/RS-*.zst files).

Then, extract all submissions and comments for the two baselines with the following two commands (this may take a few hours):

# explainlikeimfive
python src/explainlikeimfive/extract_submissions_comments.py --input-dir REDDIT_RAW_DATA_DIR/reddit --output-submissions data/baselines/explainlikeimfive/submissions.jsonl --output-comments data/baselines/explainlikeimfive/comments.jsonl --months 2024-05 2024-06 2024-07

# WritingPrompts
python src/writing_prompts/extract_submissions_comments.py --input-dir REDDIT_RAW_DATA_DIR/reddit --output-submissions data/baselines/writing_prompts/submissions.jsonl --output-comments data/baselines/writing_prompts/comments.jsonl --months 2024-05 2024-06 2024-07

Finally, filter and sample submissions and comments to serve as prompts and human baselines:

After running those commands, data/prompts/explainlikeimfive.jsonl and data/prompts/writing_prompts.jsonl will contain prompts in the format to be used with generate.py. Furthermore, data/results/explainlikeimfive/ and data/results/writing_prompts/ will each contain a file humans_temp0.0.jsonl containing human responses (as produced by generate.py) ready to be mapped against AuxDataset.

All prompt objects will have a reddit_submission_id field to identify the prompt's submission; human results additionally have reddit_comment_id and reddit_author fields to identify the Reddit comment.
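
As a small usage example (not part of the pipeline), these identifiers make it easy to, e.g., count how many human baseline responses were kept per submission:

import json
from collections import Counter

# Example path; the writing_prompts baseline has the same fields.
with open("data/results/explainlikeimfive/humans_temp0.0.jsonl") as f:
    responses_per_submission = Counter(json.loads(line)["reddit_submission_id"] for line in f)

print(f"{sum(responses_per_submission.values())} human responses "
      f"across {len(responses_per_submission)} submissions")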

IMDb reviews

For IMDb reviews, we first collect a top movies list from IMDb, then collect the reviews for each movie, and finally build prompts and human baselines.

We use the top 250 movies list found at https://www.imdb.com/chart/top/, and store the results in data/baselines/imdb/movies.json. This file contains a JSON array of movie objects, each with a title, year, duration (in minutes), rating (string) and url (to the movie's page) attribute. For example, movies.json might look like

[
    {"title": "The Shawshank Redemption", "year": 1994, "duration": 142, "rating": "R", "url": "https://www.imdb.com/title/tt0111161/"},
    ...
]

We then collect each movie's reviews from IMDb. For the movie in movies.json with index i (0-based), we store the reviews in data/baselines/imdb/reviews/reviews_i.json. This file contains a JSON array of review objects, each with an id, title, date, text, rating_value, and rating_scale attribute. For example, a reviews file might look like

[
    {"id": "rw9955642", "title": "A film that is truly memorable, [...].", "date": "2024-08-15", "text": "Its a film where you just feel pure raw emotion as if you're the character [...].", "rating_value": 10, "rating_scale": 10},
    ...
]

Having collected all the movies and reviews, the following script builds prompts and human baselines:

python src/imdb/build_prompts.py --input-dir data/baselines/imdb/ --output-prompts-file data/prompts/imdb_reviews.jsonl --results-dir data/results/

This also creates a human baseline file in data/results/imdb_reviews/humans_temp0.0.jsonl. Note that this baseline file contains all reviews, including ones that are too old to be considered; we filter those out during the analysis.
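
For illustration only (the actual filtering happens during the analysis), a date-based filter over the raw review files could look like the sketch below; the cutoff date is a placeholder and not necessarily the value used in the paper.

import glob
import json
from datetime import date

cutoff = date(2024, 1, 1)  # placeholder cutoff, not necessarily the one used in the analysis

recent, total = 0, 0
for path in glob.glob("data/baselines/imdb/reviews/reviews_*.json"):
    with open(path) as f:
        for review in json.load(f):
            total += 1
            if date.fromisoformat(review["date"]) >= cutoff:
                recent += 1
print(f"{recent} of {total} reviews are dated {cutoff.isoformat()} or later")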

In-the-wild baselines (lmsys and WildChat)

We provide scripts that automatically download the source datasets and extract prompts and completions. The results will be in the same format as produced by generate.py. Make sure to run huggingface-cli login once, as the datasets require a one-time agreement. Then run

python src/memorization_chat_datasets/lmsys_chat.py --results-dir data/results/
python src/memorization_chat_datasets/wildchat.py --results-dir data/results/

Ablation prompts

We generate ablation prompts from all original prompts automatically. Hence, make sure to generate all the necessary original prompt files first! The script generates one large input file per ablation, combining all the original prompts:

python src/non_adversarial_reproduction/build_prompt_ablations.py --output-dir data/prompts/ --prompt-files data/prompts/general_prompts.jsonl data/prompts/essay_persuade.jsonl

Running experiments

Inference (LLM generations)

To reproduce a set of experiments from our paper, simply run one of the generate_*.sh scripts in the root of this repository. Make sure to inspect each script first and modify it where necessary, and be aware that running those scripts can incur significant costs.

All LLM generations are created with the src/non_adversarial_reproduction/generate.py script. We use OpenRouter by default, but also support using the OpenAI or Google Gemini APIs directly.

To use OpenRouter (default), put the API key into the OPENROUTER_API_KEY environment variable and run generate.py. For example, to get generations for the general prompts with Claude 3 Haiku and temperature 0.7 on five seeds, run

python src/non_adversarial_reproduction/generate.py \
    --model "anthropic/claude-3-haiku" \
    --data_path "data/prompts/general_prompts.jsonl" \
    --temperature "0.7" \
    --max_new_tokens "8000" \
    --max-parallel-requests "8" \
    --output_path "data/results/" \
    --seeds $(seq 0xC0FFEE $((0xC0FFEE + 5 - 1)))

The script will determine the output directory and file name from the prompt file name. In the example above, the results will be stored in data/results/general_prompts/claude-3-haiku_temp0.7.jsonl.
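
In other words, results land in <output_path>/<prompt file stem>/<model name without provider prefix>_temp<temperature>.jsonl. The helper below merely illustrates this pattern (it is not taken from generate.py) and can be handy for locating results programmatically:

from pathlib import Path

def expected_output_file(output_path: str, data_path: str, model: str, temperature: float) -> Path:
    # Illustration of the observed naming pattern, not the script's actual implementation.
    prompt_stem = Path(data_path).stem       # e.g., "general_prompts"
    model_name = model.split("/", 1)[-1]     # drop the provider prefix, e.g., "anthropic/"
    return Path(output_path) / prompt_stem / f"{model_name}_temp{temperature}.jsonl"

print(expected_output_file("data/results/", "data/prompts/general_prompts.jsonl",
                           "anthropic/claude-3-haiku", 0.7))
# -> data/results/general_prompts/claude-3-haiku_temp0.7.jsonl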

Run python src/non_adversarial_reproduction/generate.py --help for more details.

To force using the OpenAI API, add the --use-openai flag and set the OPENAI_API_KEY and OPENAI_ORG environment variables. Make sure to still include the openai/ prefix in the --model argument even if the OpenAI API does not require it.

To force using the Google Gemini API, add the --use-google flag and set the GOOGLE_API_KEY environment variable. Make sure to still include the google/ prefix in the --model argument even though the Gemini API does not require it.

Matching snippets with AuxDataset

For a description of AuxDataset, see Nasr et al. (2023). We do provide the raw matching data from AuxDataset for all results in our paper (except tokens for copyrighted human-written texts).

For every results directory (e.g., data/results/general_prompts/), the raw AuxDataset data must be stored in an index/ subdirectory (e.g., data/results/general_prompts/index/). For every *.jsonl file in the results directory, there must be a corresponding *.tokens.npy and *.kgrams.npy file in the index/ subdirectory. The *.kgrams.npy file contains, for every completion and token in the completion, the length of the longest k-gram in AuxDataset that starts at that token. Tokenization is with respect to data/my.bpe. The *.tokens.npy file just contains the tokenized completion, which is used to map overlaps in token space to character space.
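
If you want to inspect these files directly, the sketch below shows one way to do so. We assume here that each .npy file stores one array per completion (object dtype, since completions have different lengths); check the actual files before relying on this layout.

import numpy as np

# Assumption: object arrays with one entry per completion, where kgrams[i][j] is the
# length of the longest k-gram in AuxDataset starting at token j of completion i.
base = "data/results/general_prompts/index/claude-3-haiku_temp0.7"
tokens = np.load(f"{base}.tokens.npy", allow_pickle=True)
kgrams = np.load(f"{base}.kgrams.npy", allow_pickle=True)

for i, (completion_tokens, completion_kgrams) in enumerate(zip(tokens, kgrams)):
    longest = int(max(completion_kgrams)) if len(completion_kgrams) else 0
    print(f"completion {i}: {len(completion_tokens)} tokens, "
          f"longest AuxDataset k-gram = {longest} tokens")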

To process matches in AuxDataset for our analysis, first calculate overlaps between the completion and prompt (for discounting) via

PROMPT_FILE_NAMES=("ablation_assistant" "ablation_simple" "book_reviews" "essay_persuade" "explainlikeimfive" "general_prompts" "imdb_reviews" "writing_prompts" "lmsys_chat" "wildchat")
for prompt_file_name in "${PROMPT_FILE_NAMES[@]}"; do
    python src/non_adversarial_reproduction/calculate_prompt_overlaps.py --results-folder "data/results/${prompt_file_name}" ;
done

and then run the mapping script

PROMPT_FILE_NAMES=("ablation_assistant" "ablation_simple" "book_reviews" "essay_persuade" "explainlikeimfive" "general_prompts" "imdb_reviews" "writing_prompts" "lmsys_chat" "wildchat")
for prompt_file_name in "${PROMPT_FILE_NAMES[@]}"; do
    python src/non_adversarial_reproduction/map_index_results.py --tokenizer-file data/my.bpe --results-folder "data/results/${prompt_file_name}" ;
done

Calculating the perplexity of 50-character snippets

Calculating the perplexities requires two script invocations: one for all LLMs and one for the human baselines. Each invocation yields two JSON files containing the sampled memorized and non-memorized snippets, plus NumPy files containing the corresponding perplexities.

You can optionally specify a PyTorch device via --device DEVICE; if no GPU is available, the script falls back to the CPU.

For models:

models=(
    "gpt-4o-mini-2024-07-18_temp0.7"
    "gpt-4o-2024-05-13_temp0.7"
    "gpt-4-turbo-2024-04-09_temp0.7"
    "claude-3-haiku_temp0.7"
    "claude-3.5-sonnet_temp0.7"
    "claude-3-opus_temp0.7"
    "llama-3.1-8b-instruct_temp0.7"
    "llama-3.1-70b-instruct_temp0.7"
    "llama-3.1-405b-instruct_temp0.7"
    "gemini-1.5-flash-002_temp0.7"
    "gemini-1.5-pro-002_temp0.7"
)
python src/non_adversarial_reproduction/calculate_perplexities.py --input-base-dir "data/results/" --output-dir "data/perplexities/models/" \
    --models ${models[@]} \
    --settings "book_reviews" "essay_persuade" "explainlikeimfive"  "general_prompts" "imdb_reviews" "writing_prompts" \
    --device "cuda:0"

For humans:

python src/non_adversarial_reproduction/calculate_perplexities.py --input-base-dir "data/results/" --output-dir "data/perplexities/humans/" \
    --models "humans_temp0.0" \
    --settings "explainlikeimfive" "writing_prompts" "imdb_reviews" \
    --device "cuda:0"

Analyzing results

All our plots can be reproduced by simply running the plots.ipynb notebook. This notebook assumes that all results are stored in the data/results/ directory and are mapped against AuxDataset.
