Michael Aerni*, Javier Rando*, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr
Official repository for the paper Measuring Non-Adversarial Reproduction of Training Data in Large Language Models. See also our blog post and Twitter/X thread.
The raw results required to reproduce our analysis are currently available via Hugging Face.
To run the code, install Rye and run
rye sync
in the root of this repository.
To reproduce our analysis, also download our dataset and copy the perplexities/ and results/ directories into the data/ directory in this repository.
Each prompt, LLM generation, and human-written text in our study is a JSON object; collections of those objects are stored in JSONL files (JSON objects separated by newlines). In general, we start with just the prompts and add more information (e.g., LLM generations) in-place.
Each prompt/generation object has a messages attribute containing the prompt (in OpenAI API format), a text_type attribute (either creative, expository, or argumentative), and a type attribute that specifies the subtype. All mitigation strategies have an additional ablation attribute that indicates the type of mitigation (assistant or simple). LLM completions or human-written text are stored in a completion attribute (which may be missing). Certain types of prompts may contain additional attributes. For all LLM generations, we also store the seed in a seed attribute for reproducibility.
Finally, after mapping the LLM generations to AuxDataset, memorized_chars is an array with the same length as completion; for each character, this array contains the length of the longest substring that overlaps that character and is found in AuxDataset.
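As an illustration, the following Python sketch (with a hypothetical results file path and an example 50-character threshold) reads a mapped results file and computes, per completion, the fraction of characters covered by a sufficiently long overlap with AuxDataset; the exact metrics used in the paper are defined in the analysis notebook.

import json
import numpy as np

results_file = "data/results/general_prompts/claude-3-haiku_temp0.7.jsonl"  # hypothetical example path
threshold = 50  # example overlap length in characters

with open(results_file) as f:
    for line in f:
        record = json.loads(line)
        if "memorized_chars" not in record:
            continue  # this record has not been mapped against AuxDataset yet
        overlaps = np.array(record["memorized_chars"])
        # Fraction of completion characters lying inside an AuxDataset overlap
        # of at least `threshold` characters.
        fraction = float(np.mean(overlaps >= threshold)) if overlaps.size > 0 else 0.0
        print(record["text_type"], record["type"], f"{fraction:.2%}")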
Most prompts are generated from templates and instantiations in data/prompts/raw_data/general_prompts.json.
To build the prompts, run
python src/non_adversarial_reproduction/build_general_prompts.py --input-file data/prompts/raw_data/general_prompts.json --output-file data/prompts/general_prompts.jsonl
We use book titles from https://thegreatestbooks.org/.
To build prompts, download the Fall 2024 version (rc38) from https://thegreatestbooks.org/rc/38.csv, and run the following (replace PATH_TO_RAW_DATA with the path to the downloaded file):
python src/non_adversarial_reproduction/build_book_reviews_prompts.py --input-file PATH_TO_RAW_DATA --output-file data/prompts/book_reviews.jsonl
We use the PERSUADE 2.0 corpus of essay prompts and responses. We only use the prompts in our paper and omit the human responses due to a high degree of overlap with AuxDataset. Nevertheless, the following builds prompts and a human baseline. Since the set of usable prompts in the corpus is relatively small, we manually invented additional prompts, which can be found in data/prompts/raw_data/essay_persuade_additional_prompts.jsonl.
To build the prompts, download the PERSUADE 2.0 corpus and the PERSUADE 1.0 corpus (used for filtering human responses), and store them in some PERSUADE_DIR directory so that it contains the files PERSUADE_DIR/persuade_corpus_2.0.csv, PERSUADE_DIR/feedback-prize-2021/train.csv, and PERSUADE_DIR/feedback-prize-2021/test/.
Then, run
python src/essays/build_persuade_prompts.py --raw-data-dir PERSUADE_DIR --output-prompts-file data/prompts/essay_persuade.jsonl --results-dir data/results/ --control-prompts-file data/prompts/raw_data/essay_persuade_additional_prompts.jsonl --id-blacklist-file data/baselines/essay_persuade/blacklist_ids.txt
First, download the Reddit submissions and comments for May to July 2024 (2024-05, 2024-06, and 2024-07) from AcademicTorrents, and store them in some REDDIT_RAW_DATA_DIR directory (REDDIT_RAW_DATA_DIR should contain reddit/comments/RC-*.zst and reddit/submissions/RS-*.zst files).
Then, extract all submissions and comments for the two baselines with the following two commands (this may take a few hours):
# explainlikeimfive
python src/explainlikeimfive/extract_submissions_comments.py --input-dir REDDIT_RAW_DATA_DIR/reddit --output-submissions data/baselines/explainlikeimfive/submissions.jsonl --output-comments data/baselines/explainlikeimfive/comments.jsonl --months 2024-05 2024-06 2024-07
# WritingPrompts
python src/writing_prompts/extract_submissions_comments.py --input-dir REDDIT_RAW_DATA_DIR/reddit --output-submissions data/baselines/writing_prompts/submissions.jsonl --output-comments data/baselines/writing_prompts/comments.jsonl --months 2024-05 2024-06 2024-07
Finally, filter and sample submissions and comments to serve as prompts and human baselines:
After running those commands, data/prompts/explainlikeimfive.jsonl and data/prompts/writing_prompts.jsonl will contain prompts in the format to be used with generate.py. Furthermore, data/results/explainlikeimfive/ and data/results/writing_prompts/ will each contain a file humans_temp0.0.jsonl containing human responses (as produced by generate.py), ready to be mapped against AuxDataset.
All prompt objects will have a reddit_submission_id field to identify the prompt's submission; human results additionally have reddit_comment_id and reddit_author fields to identify the Reddit comment.
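For example, the following sketch (assuming the explainlikeimfive files built above and one prompt per submission) pairs each human response with its prompt via the shared reddit_submission_id field:

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

prompts = load_jsonl("data/prompts/explainlikeimfive.jsonl")
humans = load_jsonl("data/results/explainlikeimfive/humans_temp0.0.jsonl")

# Index prompts by the Reddit submission they originate from.
prompts_by_submission = {p["reddit_submission_id"]: p for p in prompts}

for human in humans:
    prompt = prompts_by_submission.get(human["reddit_submission_id"])
    if prompt is not None:
        print(human["reddit_comment_id"], "by", human["reddit_author"], "answers submission", human["reddit_submission_id"])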
For IMDb reviews, we first collect a top movies list from IMDb, then collect the reviews for each movie, and finally build prompts and human baselines.
We use the top 250 movies list found at https://www.imdb.com/chart/top/ and store the results in data/baselines/imdb/movies.json.
This file contains a JSON array of movie objects, each with a title, year, duration (in minutes), rating (string), and url (to the movie's page) attribute. For example, movies.json might look like
[
{"title": "The Shawshank Redemption", "year": 1994, "duration": 142, "rating": "R", "url": "https://www.imdb.com/title/tt0111161/"},
...
]
We then collect each movie's reviews from IMDb.
For the movie in movies.json with index i (0-based), we store the reviews in data/baselines/imdb/reviews/reviews_i.json.
This file contains a JSON array of review objects, each with an id, title, date, text, rating_value, and rating_scale attribute. For example, a reviews file might look like
[
{"id": "rw9955642", "title": "A film that is truly memorable, [...].", "date": "2024-08-15", "text": "Its a film where you just feel pure raw emotion as if you're the character [...].", "rating_value": 10, "rating_scale": 10},
...
]
Having collected all the movies and reviews, the following script builds prompts and human baselines:
python src/imdb/build_prompts.py --input-dir data/baselines/imdb/ --output-prompts-file data/prompts/imdb_reviews.jsonl --results-dir data/results/
This also creates a human baseline file in data/results/imdb_reviews/humans_temp0.0.jsonl. Note that this baseline file contains all reviews, including ones that are too old to be considered. We account for this (i.e., filter) in the analysis.
We provide scripts to automatically download the source datasets and extract prompts and completions. The results will be in the same format as returned by generate.py.
Make sure to run huggingface-cli login once, as the dataset requires a one-time agreement.
Then run
python src/memorization_chat_datasets/lmsys_chat.py --results-dir data/results/
python src/memorization_chat_datasets/wildchat.py --results-dir data/results/
We generate ablation prompts from all original prompts automatically. Hence, make sure to generate all the necessary original prompt files first! The script generates one large input file per ablation, combining all the original prompts:
python src/non_adversarial_reproduction/build_prompt_ablations.py --output-dir data/prompts/ --prompt-files data/prompts/general_prompts.jsonl data/prompts/essay_persuade.jsonl
To simply reproduce a set of experiments in our paper, run one of the generate_*.sh scripts in the root of this repository. Make sure to inspect the script first and modify things where necessary, and be aware that running those scripts can incur significant costs.
All LLM generations are created with the src/non_adversarial_reproduction/generate.py script.
We use OpenRouter by default, but support using the OpenAI or Google Gemini APIs directly.
To use OpenRouter (the default), put the API key into the OPENROUTER_API_KEY environment variable and run generate.py.
For example, to get generations for the general prompts with Claude 3 Haiku and temperature 0.7 on five seeds, run
python src/non_adversarial_reproduction/generate.py \
--model "anthropic/claude-3-haiku" \
--data_path "data/prompts/general_prompts.jsonl" \
--temperature "0.7" \
--max_new_tokens "8000" \
--max-parallel-requests "8" \
--output_path "data/results/" \
--seeds $(seq 0xC0FFEE $((0xC0FFEE + 5 - 1)))
The script will determine the output directory and file name from the prompt file name.
In the example above, the results will be stored in data/results/general_prompts/claude-3-haiku_temp0.7.jsonl.
Run python src/non_adversarial_reproduction/generate.py --help for more details.
To force using the OpenAI API, add the --use-openai flag and set the OPENAI_API_KEY and OPENAI_ORG environment variables. Make sure to still include the openai/ prefix in the --model argument, even though the OpenAI API does not require it.
To force using the Google Gemini API, add the --use-google flag and set the GOOGLE_API_KEY environment variable. Make sure to still include the google/ prefix in the --model argument, even though the Google Gemini API does not require it.
For a description of AuxDataset, see Nasr et al. (2023). We do provide the raw matching data from AuxDataset for all results in our paper (except tokens for copyrighted human-written texts).
For every results directory (e.g., data/results/general_prompts/), the raw AuxDataset data must be stored in an index/ subdirectory (e.g., data/results/general_prompts/index/).
For every *.jsonl file in the results directory, there must be a corresponding *.tokens.npy and *.kgrams.npy file in the index/ subdirectory.
The *.kgrams.npy file contains, for every completion and every token in that completion, the length of the longest k-gram in AuxDataset that starts at that token. Tokenization is with respect to data/my.bpe. The *.tokens.npy file simply contains the tokenized completion, which is used to map overlaps in token space to character space.
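The mapping from token space to character space is performed by the map_index_results.py script described below. Conceptually, it works roughly like the following toy sketch (made-up tokens and lengths, not the actual file format):

# Toy illustration: map token-level overlap lengths to per-character overlap lengths.
# tokens: the decoded tokens of one completion.
# kgram_lengths[i]: length (in tokens) of the longest AuxDataset k-gram starting at token i.
tokens = ["The", " quick", " brown", " fox", " jumps"]
kgram_lengths = [3, 2, 0, 0, 1]

completion = "".join(tokens)
memorized_chars = [0] * len(completion)

# Character offset at which each token starts (plus one past-the-end offset).
offsets = [0]
for tok in tokens:
    offsets.append(offsets[-1] + len(tok))

for i, k in enumerate(kgram_lengths):
    if k == 0:
        continue
    start, end = offsets[i], offsets[i + k]  # character span covered by this k-gram
    span_length = end - start
    for pos in range(start, end):
        # Each character keeps the length of the longest overlapping substring.
        memorized_chars[pos] = max(memorized_chars[pos], span_length)

print(memorized_chars)

This per-character array corresponds to the memorized_chars attribute described above.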
To process matches in AuxDataset for our analysis, first calculate overlaps between the completion and prompt (for discounting) via
PROMPT_FILE_NAMES=("ablation_assistant" "ablation_simple" "book_reviews" "essay_persuade" "explainlikeimfive" "general_prompts" "imdb_reviews" "writing_prompts" "lmsys_chat" "wildchat")
for prompt_file_name in "${PROMPT_FILE_NAMES[@]}"; do
python src/non_adversarial_reproduction/calculate_prompt_overlaps.py --results-folder "data/results/${prompt_file_name}" ;
done
and then run the mapping script
PROMPT_FILE_NAMES=("ablation_assistant" "ablation_simple" "book_reviews" "essay_persuade" "explainlikeimfive" "general_prompts" "imdb_reviews" "writing_prompts" "lmsys_chat" "wildchat")
for prompt_file_name in "${PROMPT_FILE_NAMES[@]}"; do
python src/non_adversarial_reproduction/map_index_results.py --tokenizer-file data/my.bpe --results-folder "data/results/${prompt_file_name}" ;
done
Calculating the perplexities requires two script invocations: once for all LLMs and once for the human counterparts. Each invocation yields two JSON files containing the sampled memorized and non-memorized snippets, as well as NumPy files containing the corresponding perplexities.
You can optionally specify a PyTorch device via --device DEVICE. If no GPUs are available, the device will always be CPU.
For models:
models=(
"gpt-4o-mini-2024-07-18_temp0.7"
"gpt-4o-2024-05-13_temp0.7"
"gpt-4-turbo-2024-04-09_temp0.7"
"claude-3-haiku_temp0.7"
"claude-3.5-sonnet_temp0.7"
"claude-3-opus_temp0.7"
"llama-3.1-8b-instruct_temp0.7"
"llama-3.1-70b-instruct_temp0.7"
"llama-3.1-405b-instruct_temp0.7"
"gemini-1.5-flash-002_temp0.7"
"gemini-1.5-pro-002_temp0.7"
)
python src/non_adversarial_reproduction/calculate_perplexities.py --input-base-dir "data/results/" --output-dir "data/perplexities/models/" \
--models ${models[@]} \
--settings "book_reviews" "essay_persuade" "explainlikeimfive" "general_prompts" "imdb_reviews" "writing_prompts" \
--device "cuda:0"
For humans:
python src/non_adversarial_reproduction/calculate_perplexities.py --input-base-dir "data/results/" --output-dir "data/perplexities/humans/" \
--models "humans_temp0.0" \
--settings "explainlikeimfive" "writing_prompts" "imdb_reviews" \
--device "cuda:0"
All our plots can be reproduced by simply running the plots.ipynb notebook. This notebook assumes that all results are stored in the data/results/ directory and have been mapped against AuxDataset.