Tabmemcheck is an open-source Python library to test language models for memorization of tabular datasets.
The package provides four different tests for verbatim memorization of a tabular dataset (header test, row completion test, feature completion test, first token test).
It also provides additional heuristics to test what an LLM knows about a tabular dataset (feature names test, feature values test, dataset name test, and sampling).
Features:
- Test GPT-3.5, GPT-4, and other LLMs for prior exposure to tabular datasets.
- Supports chat models and (base) language models. In chat mode, we use few-shot learning to condition the model on the desired behavior.
- The submodule tabmemcheck.datasets allows you to load popular tabular datasets in perturbed form (original, perturbed, task, and statistical), as used in our COLM'24 paper.
- The code to replicate the COLM'24 paper allows you to perform few-shot learning with LLMs and tabular data.
The different memorization tests were first described in a NeurIPS'23 workshop paper.
To see what can be done with this package, take a look at our COLM'24 paper "Elephants Never Forget: Memorization and Learning of Tabular data in Large Language Models". The code to replicate the results in the paper is here.
The API reference is available here.
There are example notebooks for traditional tabular datasets and datasets used in OpenAI's MLE-bench.
pip install tabmemcheck
Then use import tabmemcheck to import the Python package.
The header test asks the LLM to complete the initial rows of a CSV file.
header_prompt, header_completion, response = tabmemcheck.header_test('uci-wine.csv', 'gpt-3.5-turbo-0613', completion_length=350)
Here, we see that gpt-3.5-turbo-0613 can complete the initial rows of the UCI Wine dataset. The function output visualizes the Levenshtein string distance between the actual dataset and the model completion.
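If you want a quick quantitative check in addition to the visualization, you can compare the returned values yourself. The sketch below uses Python's standard difflib rather than the package's own Levenshtein computation; it assumes, following the variable names above, that header_completion holds the true continuation of the prompt and response holds the model output.

import difflib

# similarity ratio between the true continuation and the model completion (1.0 = identical)
similarity = difflib.SequenceMatcher(None, header_completion, response).ratio()
print(f"Similarity between dataset and model completion: {similarity:.2f}")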
The row completion test asks the LLM to complete random rows of a CSV file.
rows, responses = tabmemcheck.row_completion_test('iris.csv', 'gpt-4-0125-preview', num_queries=25)
Here, we see that gpt-4-0125-preview can complete random rows of the Iris dataset. The function output again visualizes the Levenshtein string distance between the actual dataset rows and the model completions.
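For a simple summary statistic on top of the visualization, you can count verbatim matches among the returned completions. This sketch assumes, as the example suggests, that rows and responses are parallel lists of strings.

# fraction of rows that the model reproduced verbatim (whitespace-insensitive)
num_exact = sum(row.strip() == response.strip() for row, response in zip(rows, responses))
print(f"{num_exact}/{len(rows)} rows completed verbatim")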
The feature completion test asks the LLM to complete the values of a specific feature in the dataset.
feature_values, responses = tabmemcheck.feature_completion_test('titanic-train.csv', 'gpt-3.5-turbo-0125', feature_name='Name', num_queries=25)
Here, we see that gpt-3.5-turbo-0125 can complete the names of the passengers in the Kaggle Titanic dataset. The function output again visualizes the Levenshtein string distance between the feature values in the dataset and the model completions.
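As a rough cross-check (weaker than the test's own comparison against the held-out values), you can verify how many of the generated names occur anywhere in the dataset. The column name 'Name' is taken from the example above; the parallel-list structure of the returned values is an assumption.

import pandas as pd

df = pd.read_csv('titanic-train.csv')
names = set(df['Name'])
# count model responses that appear verbatim in the Name column
hits = sum(response.strip() in names for response in responses)
print(f"{hits}/{len(responses)} completed names occur verbatim in the dataset")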
The first token test asks the LLM to complete the first token in the next row of a CSV file.
tabmemcheck.first_token_test('adult-train.csv', 'gpt-3.5-turbo-0125', num_queries=100)
First Token Test: 37/100 exact matches.
First Token Test Baseline (Matches of most common first token): 50/100.
Here, the test provides no evidence of memorization of the Adult Income dataset in gpt-3.5-turbo-0125.
One of the key features of this package is that we have implemented prompts that allow us to run the various completion tests not only with (base) language models but also with chat models (specifically, GPT-3.5 and GPT-4).
There is also a simple way to run all the different tests and generate a small report.
tabmemcheck.run_all_tests("adult-test.csv", "gpt-4-0613")
The feature names test asks the LLM to complete the feature names of a dataset.
tabmemcheck.feature_names_test('Kaggle Tabular Playground Series Dec 2021.csv', 'gpt-4o-2024-08-06')
The feature values test asks the LLM to provide a typical observation from the dataset.
tabmemcheck.feature_values_test('OSIC Pulmonary Fibrosis Progression.csv', 'gpt-4o-2024-08-06')
More generally, you can use sample to ask the LLM to provide samples from the dataset.
tabmemcheck.sample('OSIC Pulmonary Fibrosis Progression.csv', 'gpt-4o-2024-08-06')
The dataset name test asks the LLM to provide the name of the dataset, given the initial rows of the CSV file.
tabmemcheck.dataset_name_test('spooky author identification train.csv', 'gpt-4o-2024-08-06')
We have often been asked how the results of the different tests should be interpreted. For example, do 3 out of 25 correctly completed rows in the row completion test mean the dataset is memorized? The key point in interpreting the test results is that one has to consider the amount of entropy in the dataset.
At a high level, we want to say that a dataset is memorized if an LLM can consistently generate it. However, this only makes sense if the dataset is not a (deterministic) string sequence that can simply be predicted by the LLM. In most tabular datasets, we don't have to worry about this too much. This is because they contain random variables, and it is impossible to consistently reproduce the realizations of random variables unless the values of the random variables have been seen before (that is, during training).
When we judge the test results, we have to consider the completion rate of the LLM and the amount of entropy in the dataset. For example, the OpenML Diabetes dataset contains an individual's glucose level, blood pressure, and BMI, as well as other measurements that are at least in part random. Now, if an LLM can consistently generate even a few rows of this unique dataset, this is fairly strong evidence of memorization (see Carlini et al. 2019 and Carlini et al. 2021 if you are interested in details). To give a contrary example, the Iris dataset contains many rows that are near-duplicates. This means that an LLM might also achieve a non-zero row completion rate by chance or prediction, and one could not conclude that the dataset was seen during pre-training from the fact that an LLM can generate a few rows.
Because one needs to weigh the completions of the LLM against the entropy in the dataset, it is unfortunately impossible to give a general ratio such as "X out of 100 completed rows imply memorization".
While this all sounds very complex, the practical evidence for memorization is often very clear. This can also be seen in the examples above.
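To make the entropy argument concrete, a quick look at the Iris example above shows why a few correct completions there are weak evidence: the dataset contains exact duplicate rows (and many more near-duplicates), so some completions can succeed by prediction alone. This is a minimal sketch using pandas; the file name follows the earlier example.

import pandas as pd

df = pd.read_csv('iris.csv')
# number of rows that are exact duplicates of an earlier row
num_duplicates = df.duplicated().sum()
print(f"{num_duplicates} of {len(df)} rows are exact duplicates")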
We use few-shot learning to condition chat models on the desired task. This works well for GPT-3.5 and GPT-4, and also for many other LLMs (but not necessarily for all LLMs).
You can set tabmemcheck.config.print_prompts = True to see the prompts.
You can set tabmemcheck.config.print_responses = True to print the LLM responses, a useful sanity check.
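For example, to inspect both the few-shot prompts and the raw responses during a single, small test run (the query count here is illustrative):

tabmemcheck.config.print_prompts = True
tabmemcheck.config.print_responses = True
tabmemcheck.row_completion_test('iris.csv', 'gpt-4-0125-preview', num_queries=5)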
Yes. The module chat_completion.py provides the general-purpose function prefix_suffix_chat_completion, which is used to implement most of the different tests. You can see how prefix_suffix_chat_completion is being used by reading the implementations of the different tests in functions.py. We also provide the general-purpose function chat_completion, which again relies on prefix_suffix_chat_completion.
To test your own LLM, simply implement tabmemcheck.LLM_Interface. We use the OpenAI message format.
from dataclasses import dataclass


@dataclass
class LLM_Interface:
    """Generic interface to a language model."""

    # if true, the tests use the chat_completion function, otherwise the completion function
    chat_mode = False

    def completion(self, prompt: str, temperature: float, max_tokens: int):
        """Send a query to a language model.

        :param prompt: The prompt (string) to send to the model.
        :param temperature: The sampling temperature.
        :param max_tokens: The maximum number of tokens to generate.
        :return: The model response (str).
        """
        raise NotImplementedError

    def chat_completion(self, messages, temperature: float, max_tokens: int):
        """Send a query to a chat model.

        :param messages: The messages to send to the model. We use the OpenAI format.
        :param temperature: The sampling temperature.
        :param max_tokens: The maximum number of tokens to generate.
        :return: The model response (str).
        """
        raise NotImplementedError
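As an illustration, here is a rough sketch of how such an interface might wrap a local Hugging Face chat model. The model name is a placeholder, and passing the resulting object in place of a model-name string to the test functions is our reading of the interface above, not an excerpt from the package documentation.

import tabmemcheck
from transformers import AutoModelForCausalLM, AutoTokenizer

class LocalChatLLM(tabmemcheck.LLM_Interface):
    chat_mode = True  # route the tests through chat_completion

    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def chat_completion(self, messages, temperature: float, max_tokens: int):
        # messages are in the OpenAI format: [{"role": ..., "content": ...}, ...]
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        )
        do_sample = temperature > 0
        output_ids = self.model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=do_sample,
            temperature=temperature if do_sample else None,
        )
        # return only the newly generated tokens, decoded to a string
        return self.tokenizer.decode(
            output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
        )

llm = LocalChatLLM("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
tabmemcheck.row_completion_test('iris.csv', llm, num_queries=25)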
The tests provided in this package do not guarantee that the LLM has not seen or memorized the data. Specifically, it might not be possible to extract the data from the LLM via prompting, even though the LLM has memorized it.
If you find this code useful in your research, please consider citing our research papers.
@inproceedings{bordt2024colm,
title={Elephants Never Forget: Memorization and Learning of Tabular Data in
Large Language Models},
author={Bordt, Sebastian and Nori, Harsha and Rodrigues, Vanessa and Nushi, Besmira and Caruana, Rich},
booktitle={Conference on Language Modeling (COLM)},
year={2024}
}
@inproceedings{bordt2023testing,
title={Elephants Never Forget: Testing Language Models for Memorization of Tabular Data},
author={Bordt, Sebastian and Nori, Harsha and Caruana, Rich},
booktitle={NeurIPS 2023 Second Table Representation Learning Workshop},
year={2023}
}
Chang et al., "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4", EMNLP 2023
Carlini et al., "Extracting Training Data from Large Language Models", USENIX Security Symposium 2021
Carlini et al., "The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks", USENIX Security Symposium 2019