LLM Anonymizer

Important

This repository is a snapshot of an ongoing development process. It is not under active development and will be replaced by a newer, actively developed version.

Use for research purposes only!

The newest version of this tool can be found here: KatherLab/LLMAIx

A tool that anonymizes medical reports by removing person-related information.

Features:

Supports various input formats: pdf, png, jpg, jpeg, txt and docx (only if Word is installed on your system)
Performs OCR if necessary
Extracts person-related information from the reports using a llama model
Matches the extracted personal information in the reports using a fuzzy matching algorithm based on the Levenshtein distance (configurable)
Compares the redacted documents with annotated pdf files (Inception) and calculates metrics
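
The fuzzy matching feature can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function names `levenshtein` and `fuzzy_redact`, the fixed-length sliding window, the `max_distance` threshold and the `*` mask character are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_redact(text: str, term: str, max_distance: int = 1, mask: str = "*") -> str:
    """Mask every window of len(term) characters that is within max_distance edits of term.

    Hypothetical helper: `mask` is expected to be a single character.
    """
    if not term:
        return text
    n = len(term)
    chars = list(text)
    i = 0
    while i + n <= len(text):
        window = text[i:i + n]
        if levenshtein(window.lower(), term.lower()) <= max_distance:
            chars[i:i + n] = mask * n  # overwrite the matched window
            i += n
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    # "Smyth" is one substitution away from the extracted name "Smith".
    print(fuzzy_redact("Patient Smyth was admitted", "Smith"))
```

A larger `max_distance` catches more OCR errors and misspellings but also raises the risk of masking unrelated words, which is presumably why the threshold is configurable.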

Examples

Examples of medical reports in various formats can be found in the examples directory.

Preparation

Download and extract or build llama-cpp for your operating system.
Download the desired models (they must be in gguf format and compatible with llama-cpp).
Update the config.yml file with the downloaded models accordingly.
If you intend to use OCR: Install OCRmyPDF.
Create a Python venv or a conda environment (tested with Python 3.11.5) and install the dependencies from requirements.txt:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Launch LLM Anonymizer

Run: python app.py

Parameter	Description	Example
--model_path	Directory with downloaded model files which can be processed by llama.cpp	/path/to/models
--n_predict	Maximum number of tokens to predict. Processing fails if this value is too low. Default: 384	384
--server_path	Path to the llama-cpp server executable (on Windows: server.exe).	/path/to/llamacpp/executable/server
--n_gpu_layers	How many layers of the model to offload to the GPU. Adjust according to model and GPU memory. Default: 80	-1 for all, otherwise any number
--host	Hostname of the server. Default: 0.0.0.0	0.0.0.0 or localhost
--port	Port on which the web app is served. Default: 5001	5001
--config_file	Custom path to the configuration file.	config.yml
--llamacpp_port	Port on which the llama-cpp server runs. Default: 2929	2929
--debug	When set, the web app will be started in debug mode and with auto-reload, for debugging and development
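
As a rough sketch of how the parameters above combine into one launch command, the following helper assembles the argument list. The flag names and defaults are taken from the table; the helper name `build_launch_command` and the example paths are placeholders, not part of the tool.

```python
import shlex


def build_launch_command(model_path, server_path, *, host="0.0.0.0", port=5001,
                         llamacpp_port=2929, n_gpu_layers=80, n_predict=384,
                         config_file="config.yml", debug=False):
    """Assemble the argument list for `python app.py` from the documented flags."""
    command = [
        "python", "app.py",
        "--model_path", str(model_path),
        "--server_path", str(server_path),
        "--host", host,
        "--port", str(port),
        "--llamacpp_port", str(llamacpp_port),
        "--n_gpu_layers", str(n_gpu_layers),
        "--n_predict", str(n_predict),
        "--config_file", str(config_file),
    ]
    if debug:
        command.append("--debug")  # documented as a switch without a value
    return command


if __name__ == "__main__":
    # Placeholder paths: replace them with your own model directory and server binary.
    cmd = build_launch_command("/path/to/models",
                               "/path/to/llamacpp/executable/server",
                               host="localhost", n_gpu_layers=-1)
    print(shlex.join(cmd))  # hand `cmd` to subprocess.run(cmd) to actually start the app
```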

Usage

Preprocessing

Click on Preprocessing

First, all input files are converted into a single csv file that contains the text of every report. Currently pdf, docx, odf, txt, png and jpg files can be used as input. If necessary, text recognition (OCR) is applied.

The output is a zip file containing the csv file and the pdf files with a text layer.

LLM Information Extraction

Extract personal information from the medical reports.

Click on LLM Information Extraction

Use the zip file from the preprocessing step as input, choose a model, and adjust the prompt, grammar and temperature as needed. When you click Run Pipeline, you are redirected to the LLM results tab. Wait until the results are available for download; you do not have to reload the page. In the meantime, you can start further information extraction jobs.

The output is the input zip file extended by a csv file. Its column report holds the original report, report_masked holds the anonymized report, and further columns hold the personal information extracted according to the grammar, as well as IDs and metadata.
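
For downstream analysis, the result csv can be read straight from the zip file. The sketch below assumes only what is stated above, namely that one csv file in the archive has a report_masked column. The name of that csv file is not documented here, so the hypothetical helper `load_masked_reports` searches for it; UTF-8 encoding is also an assumption.

```python
import csv
import io
import zipfile


def load_masked_reports(zip_source):
    """Return the rows of the first csv file in the archive that has a report_masked column."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.DictReader(text)
                if reader.fieldnames and "report_masked" in reader.fieldnames:
                    return list(reader)
    raise ValueError("no csv file with a report_masked column found")
```

Each returned row is a dict keyed by column name, so `row["report"]` and `row["report_masked"]` give the original and the anonymized text of one report.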

Prepare Annotations

To be able to evaluate the performance of the LLM Anonymizer tool, ground truth is needed.

Download Inception
Start a basic annotation project, upload the pdf files and annotate the parts of the reports you want to anonymize. Refer to the Inception User Guide.
Export the annotated reports in the UIMA CAS JSON format (UIMA CAS JSON 0.4.0)
Make sure the filename of each exported json file matches the filename of the corresponding pdf file (apart from the extension, .json versus .pdf)
Create a zip file of the exported json files (zip the json files directly, not a directory where they are located!)
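
The last two steps can be sketched as a small script that checks the filename correspondence and writes a flat zip file. The helper name `build_annotation_zip` and the directory layout (one folder of exported json files, one folder of pdf files) are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def build_annotation_zip(json_dir, pdf_dir, zip_path):
    """Zip all json files flat, after checking that each one has a pdf with the same stem."""
    json_files = sorted(Path(json_dir).glob("*.json"))
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}

    unmatched = [p.name for p in json_files if p.stem not in pdf_stems]
    if unmatched:
        raise ValueError(f"no matching pdf for: {', '.join(unmatched)}")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in json_files:
            # arcname without any directory part keeps the files at the top level of the zip
            archive.write(path, arcname=path.name)
    return [p.name for p in json_files]
```

Passing `arcname=path.name` is what keeps the json files at the top level of the archive instead of inside a directory.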

Report Redaction Metrics

Calculate metrics for the anonymized reports by comparing them with the reports annotated in Inception.

Click on Report Redaction

Use the output zip file from the LLM Information Extraction step as an input.
Also upload the prepared annotation zip file.
Enable and configure fuzzy matching if you want to use the fuzzy matching algorithm.
Choose between Report Redaction Metrics and Report Redaction Viewer
Report Redaction Metrics runs a job that calculates overall metrics for all documents and provides a download link for the redacted documents.
Report Redaction Viewer lets you view the documents one by one with document-wise metrics; the documents are redacted on the fly.
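
This section does not state which metrics are computed or at what granularity. As an illustration only, the sketch below assumes character-level precision, recall and F1 between annotated and redacted character spans, which is a common choice for redaction tasks. The function name `redaction_metrics` and the half-open `(start, end)` span format are hypothetical.

```python
def redaction_metrics(annotated_spans, redacted_spans):
    """Character-level precision, recall and F1 for half-open (start, end) spans."""
    truth = {i for start, end in annotated_spans for i in range(start, end)}
    predicted = {i for start, end in redacted_spans for i in range(start, end)}

    true_positives = len(truth & predicted)   # characters correctly redacted
    false_positives = len(predicted - truth)  # characters redacted without annotation
    false_negatives = len(truth - predicted)  # annotated characters left visible

    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For anonymization, recall is usually the more critical number, since a false negative means personal information stays readable.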

Additional Notes

An active internet connection is currently required because some JavaScript and CSS libraries are loaded directly from CDNs. To run the tool offline, download these libraries and replace the corresponding references in the html files.

Citation

This repository is part of the paper Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer

Example Grammar

Adjust the grammar according to the llama-cpp GBNF Guide. The grammar constrains the LLM output to a json structure with the desired data points. Note: Experimenting with the grammar can help, since not every model works well with an overly restrictive grammar.

root   ::= allrecords
value  ::= object | array | string | number | ("true" | "false" | "null") ws

allrecords ::= (
  "{"
  ws "\"patientLastName\":" ws string ","
  ws "\"patientFirstName\":" ws string ","
  ws "\"patientName\":" ws string ","
  ws "\"patientHonorific\":" ws string ","
  ws "\"patientBirthDate\":" ws string ","
  ws "\"patientID\":" ws idlike ","
  ws "\"patientStreet\":" ws string ","
  ws "\"patientHouseNumber\":" ws string ","
  ws "\"patientPostalCode\":" ws postalcode ","
  ws "\"patientCity\":" ws string
  ws "}"
  ws
)

record ::= (
    "{"
    ws "\"excerpt\":" ws ( string | "null" ) ","
    ws "\"present\":" ws ("true" | "false") ws 
    ws "}"
    ws
)

object ::=
  "{" ws (
            string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws

array  ::=
  "[" ws (
            value
    ("," ws value)*
  )? "]" ws
char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
string ::=
  "\"" (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char (char)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)? "\"" ws

number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws

postalcode ::= ("\"" [0-9][0-9][0-9][0-9][0-9] "\"" | "\"\"") ws
idlike ::= ("\"" [0-9][0-9][0-9][0-9][0-9][0-9][0-9]?[0-9]? "\"" | "\"\"") ws

# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= ([ \t\n])?
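
To check extracted records outside the tool, the constraints of the grammar above can be mirrored in Python. The sketch assumes the LLM output has already been parsed into a dict. The key list and the two patterns are read off the allrecords, postalcode and idlike rules; the helper name `check_record` is hypothetical.

```python
import re

# Keys in the order in which the allrecords rule emits them.
EXPECTED_KEYS = [
    "patientLastName", "patientFirstName", "patientName", "patientHonorific",
    "patientBirthDate", "patientID", "patientStreet", "patientHouseNumber",
    "patientPostalCode", "patientCity",
]

POSTALCODE = re.compile(r"(?:[0-9]{5})?")  # postalcode rule: exactly five digits, or empty
IDLIKE = re.compile(r"(?:[0-9]{6,8})?")    # idlike rule: six to eight digits, or empty


def check_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in record]
    for key, value in record.items():
        if not isinstance(value, str):
            problems.append(f"{key}: expected a string")
    postal = record.get("patientPostalCode")
    if isinstance(postal, str) and not POSTALCODE.fullmatch(postal):
        problems.append("patientPostalCode: expected five digits or an empty string")
    patient_id = record.get("patientID")
    if isinstance(patient_id, str) and not IDLIKE.fullmatch(patient_id):
        problems.append("patientID: expected six to eight digits or an empty string")
    return problems
```

Note that the postalcode rule only fits five-digit postal codes; reports from regions with other formats would need an adjusted rule.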
