TiDE is a free, open-source text deidentification tool that can identify and deidentify PHI in clinical note text and other free text in medical data. It uses pattern matching, known PHI matching, and NER (named entity recognition) to search for PHI, and uses general replacement or hide-in-plain-sight to replace PHI with safe text. For more information about the TiDE algorithm, please refer to Section 6 of the manuscript "A new paradigm for accelerating clinical data science at Stanford Medicine".
TiDE can identify the following HIPAA identifiers by either a) named entity recognition or pattern matching, or b) known PHI matching:
Name, address, dates, phone, fax, email, SSN, MRN, health plan beneficiary number, account number, certificate or license number, vehicle number, URL, IP address, and any other characteristic that could uniquely identify the individual
TiDE does not process non-text information, such as these two identifiers:
Fingerprint or voiceprint, photo
Note that certain identifiers may have a pattern that is unique to your organization's EHR system and may not be caught by our pattern matching or named entity recognition. Please review the TiDE output for your organization's clinical notes carefully.
TiDE can be used in various environments. Below are the prerequisites and instructions for the environments in which TiDE is available:
- Local System - Standalone
- Local System - using a Docker container
- Google Cloud Platform
- Prerequisites All
- Prerequisites Local System - Standalone
- Prerequisites Local System - Using Docker Container
- Prerequisites Google Cloud Platform
The following examples are based on executing TiDE on a Windows system.
Once the prerequisites are met, open a command line and change to the folder where the TiDE source was downloaded. For example, if the source was downloaded to "C:\Dev\tide-source" on the local system, navigate to that folder:
cd C:\Dev\tide-source
In the command window, execute the following
- If notes file is prepared in text format:
mvn clean install -DskipTests
java -jar ./target/deid-3.0.31-SNAPSHOT-dataflow.jar --deidConfigFile=./src/main/resources/deid_config_omop_genrep.yaml --annotatorConfigFile=./src/main/resources/annotator_config.yaml --inputType=text --phiFileName=./phi/phi_person_data_example.csv --personFile=./person_data/person.csv --inputResource=./sample_notes --outputResource=./output
- If notes file is prepared in jsonl format:
mvn clean install -DskipTests
java -jar ./target/deid-3.0.31-SNAPSHOT-dataflow.jar --deidConfigFile=./src/main/resources/deid_config_omop_genrep.yaml --annotatorConfigFile=./src/main/resources/annotator_config.yaml --inputType=local --inputResource=./sample_notes_jsonl/notes.json --outputResource=./output --textIdFields="note_id" --textInputFields="note_text"
- In the command window, execute the following
docker build . -t tide-program:latest
- If your source location is not C:\Dev\tide-source, update the path in the following command. The command maps the local source and output folders into the container. Execute the following
docker run -it -v /mnt/c/Dev/tide-source:/workspaces tide-program:latest
- The command above switches the prompt to a shell inside the TiDE container. Execute the following in the container shell
java -jar /opt/deid/target/deid-3.0.31-SNAPSHOT-dataflow.jar --deidConfigFile=./src/main/resources/deid_config_omop_genrep.yaml --annotatorConfigFile=./src/main/resources/annotator_config.yaml --inputType=text --phiFileName=/workspaces/phi/phi_person_data_example.csv --personFile=/workspaces/person_data/person.csv --inputResource=/workspaces/sample_notes --outputResource=/workspaces/output
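The volume path in the docker run command above is the Windows source folder written as a /mnt/<drive>/... mount path. As a minimal sketch, assuming Docker is reached through WSL (which the /mnt/c/Dev/tide-source example suggests), the helper below converts a Windows path to that form. The name `to_wsl_mount` is hypothetical and not part of TiDE.

```python
from pathlib import PureWindowsPath


def to_wsl_mount(windows_path):
    """Convert a Windows path such as C:\\Dev\\tide-source to /mnt/c/Dev/tide-source."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":")
    # Only plain drive letters map onto /mnt/<letter>; UNC shares do not.
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"expected a path with a drive letter, got {windows_path!r}")
    # parts[0] is the drive root (for example 'C:\\'); the rest are folder names.
    return "/".join(["/mnt", drive.lower(), *path.parts[1:]])
```

The returned string can be substituted for the left-hand side of the -v argument, with :/workspaces kept as the container side.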
- In the command window, execute the following
docker build . -t tide-program:latest
- If your source location is not C:\Dev\tide-source, update the path in the following command. The command maps the local source and output folders into the container. Execute the following
docker run -it -v /mnt/c/Dev/tide-source:/workspaces tide-program:latest
- The command above switches the prompt to a shell inside the TiDE container. Execute the following in the container shell
java -jar -Xmx6g /opt/deid/target/deid-3.0.31-SNAPSHOT-dataflow.jar --deidConfigFile=./src/main/resources/deid_config_omop_genrep.yaml --annotatorConfigFile=./src/main/resources/annotator_config.yaml --inputType=gcp_gcs --inputResource=gs://<INPUT_BUCKET_NAME>/sample_notes_jsonl/notes.json --outputResource=gs://<OUTPUT_BUCKET_NAME> --gcpCredentialsKeyFile=<SERVICE_ACCOUNT_KEY_DOWNLOADED> --textIdFields="id" --textInputFields="note"
Sample Notes:
- For inputType="text": sample notes folder
- For inputType="local": sample notes jsonl folder
Input Arguments:
- inputResource (mandatory) e.g. inputResource=/workspaces/sample_notes
When used with
- "inputType=text", this argument specifies the location of the folder containing the notes to deidentify, in text format. All files in this folder will be processed.
- "inputType=local", this argument specifies the file containing the notes to deidentify, in newline-delimited JSON (jsonl) format. If you only need NER or general pattern matching, you can directly use a jsonl file that contains an id column and a free-text column. If you have known PHI associated with the text, the PHI information must be embedded in the jsonl file.
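As a minimal sketch of preparing jsonl input without known PHI, the snippet below serializes notes as one JSON object per line. The field names note_id and note_text match the --textIdFields and --textInputFields values used in the jsonl command earlier; the function name `to_jsonl_lines` is hypothetical.

```python
import json


def to_jsonl_lines(notes, id_field="note_id", text_field="note_text"):
    """Turn (note id, note text) pairs into newline-delimited JSON lines."""
    lines = []
    for note_id, text in notes:
        # json.dumps escapes embedded newlines, so each note stays on one line.
        lines.append(json.dumps({id_field: note_id, text_field: text}))
    return lines


def write_jsonl(notes, path):
    """Write the notes to a file with one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_jsonl_lines(notes):
            handle.write(line + "\n")
```

The resulting file path can then be passed as --inputResource together with --inputType=local.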
Sample Notes: Please refer to (sample notes folder)
Input Arguments:
- inputResource (mandatory) e.g. inputResource=gs://<INPUT_BUCKET_NAME>/sample_notes_jsonl/notes.json This argument specifies the file with notes to be deid in newline delimited JSON files (jsonl) format.
When the previous command runs, the application starts processing the input notes and displays messages like the following:
21:24:43,972 INFO [main] com.github.susom.starr.deid.Main.run(Main.java:67) Current Settings:
appName: Main
deidConfigFile: /workspaces/src/main/resources/deid_config_omop_genrep.yaml
annotatorConfigFile: /workspaces/src/main/resources/annotator_config.yaml
gcpCredentialsKeyFile:
inputResource: /workspaces/sample_notes
inputType: text
optionsId: 0
outputResource: /workspaces/output
personFile: /workspaces/person_data/person.csv
phiFileName: /workspaces/phi/phi_person_data_example.csv
runner: class org.apache.beam.runners.direct.DirectRunner
stableUniqueNames: WARNING
21:24:43,980 INFO [main] com.github.susom.starr.deid.Main.run(Main.java:76) reading configuration from file /workspaces/src/main/resources/deid_config_omop_genrep.yaml
21:24:44,069 INFO [main] com.github.susom.starr.deid.Main.run(Main.java:83) received configuration for note_deid_20190812
21:24:46,329 INFO [direct-runner-worker] edu.stanford.nlp.util.logging.SLF4JHandler.print(SLF4JHandler.java:88) Adding annotator tokenize
..............
..............
..............
21:24:58,919 INFO [direct-runner-worker] org.apache.beam.sdk.io.FileBasedSink$WriteOperation.removeTemporaryFiles(FileBasedSink.java:805) Will remove known temporary file /workspaces/output/1629926684106/.temp-beam-fb9dbd2c-f17c-4d4a-a99c-8034e4d2fef9/f1ccd490-eb10-4cee-a5a2-96c82f221c74
When the previous command completes, the TiDE output is available in the "output" folder. For every execution, the application creates a subfolder in the "output" folder named with the current timestamp in long format. For the output of the latest execution, use the folder with the latest timestamp. This folder contains 3 sets of output:
- At the root of the timestamp folder, one or more files in newline-delimited JSON (jsonl) format containing the original note, the deid note, and the findings (Sample jsonl Output)
- A subfolder "individual" containing the deid notes, with one file for each input note (Sample Individual Output)
- A subfolder "annotator" containing output in Doccano format, with one file for each input note (Sample Annotator Output)
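Because each run writes into a subfolder named with a numeric timestamp (for example 1629926684106 in the log above), the latest run can be located by comparing folder names as integers. This is a minimal sketch; the function names are hypothetical and it assumes run folders are the only all-digit subfolders of the output folder.

```python
from pathlib import Path


def latest_run_name(names):
    """Return the all-digit name with the largest numeric value, or None."""
    stamps = [name for name in names if name.isascii() and name.isdigit()]
    # Compare as integers so that "1000" sorts after "999".
    return max(stamps, key=int) if stamps else None


def latest_run_folder(output_dir):
    """Locate the newest timestamp subfolder inside a TiDE output folder."""
    output_dir = Path(output_dir)
    names = [child.name for child in output_dir.iterdir() if child.is_dir()]
    newest = latest_run_name(names)
    return output_dir / newest if newest else None
```

For example, latest_run_folder("./output") returns the path of the most recent run, under which the jsonl files, "individual", and "annotator" can be found.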
When the previous command completes, the TiDE output is available in the GCP bucket specified in the "outputResource" argument. The output is in newline-delimited JSON (jsonl) format.
- GitHub (Source Repository)
TiDE source code is maintained on GitHub, a code hosting service used for storing and maintaining source code.
Access GitHub using GitHub Desktop tool
- Download and install GitHub Desktop Client
- After installation, open the GitHub Desktop program
- Open File > Clone repository
- On Clone a repository dialog box, URL tab, in the "Repository URL", enter "https://github.com/susom/tide/"
- In "Local Path", enter the local path where you would like to keep the source, for example "C:\Dev\tide-source"
- Click "Clone".
- This downloads the latest TiDE code to the location on your local system specified in "Local Path".
- Open the local path folder in the IDE of your choice, such as Visual Studio Code.
- Java
- Maven
- Docker installation on local machine
Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications.
Docker installation differs by platform. Here are the links for various operating systems:
Tools Required
- Mac: Terminal
- Windows: PowerShell
- Ubuntu: Shell or terminal
- Google Cloud Platform (GCP)
Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that lets you run resource-intensive jobs that are typically not feasible on a local system.
- Create a Google Cloud account. If you meet the criteria, you may get Cloud Billing credits to pay for resources from Google (currently Google is offering a 90-day, $300 Free Trial).
- After creating the account, use the Google Console to complete the following steps:
- Create Cloud Project
- Create Service account
- Enter name for the Service Account e.g. "TiDE service account"
- Based on Service Account name, system will automatically generate service account id. You can either use the same name or change the name in the input box below service account name.
- Enter description for the Service Account e.g. "This service account will be used to verify TiDE functionality".
- Click Create and continue
- Adding roles to the service account
- Cloud Dataflow Service Agent
- Storage Admin
- BigQuery Admin
- Click Continue and then click Done
- The Service Account creation is complete with the required roles.
- Under the Permissions tab, check whether your user is assigned to this service account. If not, click Grant Access and add the user.
- Generate a key for the Service Account (JSON).
- A key will be generated. Download this key to your local system.
- Add Billing to GCP account
- On the Google Console (https://console.cloud.google.com/), click the Navigation menu on the left, and then hover over Billing
- Click on Manage Billing Accounts > Add Billing Account.
- Fill all the required details and attach the project to the billing.
- Configure Storage for the GCP project Create Storage Buckets
- Open the Google Console. Click the Navigation menu on the left, and then hover over Cloud Storage.
- Click Cloud Storage > Browser.
- Two buckets are required: one for input data and one for output data. The creation steps are the same for both buckets.
- Click the Create bucket link
- Give your bucket a name. Click Continue
- Select Location type - REGIONAL > Select a region from the dropdown > Continue
- Select default storage class Standard > Continue
- Select access control as Uniform, and make sure the checkbox for Enforce public access prevention on this bucket is checked > Continue
- Under Advanced settings, select Encryption type > Google-managed encryption key.
- Click Create.
You need to configure Google Cloud credentials if you run TiDE on Dataflow. See https://cloud.google.com/docs/authentication/getting-started
export GOOGLE_APPLICATION_CREDENTIALS=<gcp service account credential json file>
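Before launching a job, it can help to confirm that the environment variable points at a readable service account key. The following is a minimal sketch of such a pre-flight check and is not part of TiDE; the function names are hypothetical, and the field list is a small subset of the fields found in a standard downloaded service account key file.

```python
import json
import os

# A subset of the fields present in a downloaded service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(key_data):
    """Return the required fields that are absent or empty in a parsed key file."""
    return [field for field in REQUIRED_FIELDS if not key_data.get(field)]


def check_credentials_env(env=os.environ):
    """Return a list of problems; an empty list means the basic checks passed."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"credential file not found: {path}"]
    with open(path, encoding="utf-8") as handle:
        key_data = json.load(handle)
    problems = [f"missing field: {name}" for name in missing_key_fields(key_data)]
    if key_data.get("type") != "service_account":
        problems.append("key type is not 'service_account'")
    return problems
```

Calling check_credentials_env() with no arguments inspects the current process environment.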
TiDE can process data in various formats, such as
- "text" notes (one per file) along with a known PHI file (csv format) and a note-PHI relationship file (csv format)
- newline-delimited JSON (jsonl) files with PHI information alongside each note
- "BigQuery" table
TiDE supports text files as input notes to deidentify. Each file should contain exactly one note, and the file name is used as the note id. Known PHI for each person can be supplied in a PHI file in csv format, which should include a person id. The relationship between person id and note should be supplied in a separate csv file. TiDE supports multiple notes for the same person.
You can directly use a BigQuery table that contains the free-text column as input if you only need NER or general pattern matching. If you have known PHI associated with the text, you can join the free text with the known PHI and create a final input table with both text and known PHI in each row.
TiDE supports newline-delimited JSON (jsonl) files. Export the table to a Google Cloud Storage bucket as newline-delimited JSON files, then use these GCS files as input.
Options to deid PHI discovered by TiDE:
- Masking (everything except name and location)
- Jittering (date, age) with provided jitter value
- Surrogate name and location
- General Replacement with common patterns of each type of PHIs
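To illustrate what date jittering means conceptually, the sketch below shifts every MM/DD/YYYY date in a text by a fixed number of days. It is an illustration under stated assumptions only, not TiDE's implementation: the single date format, the regular expression, and the function name `jitter_dates` are all hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumption: only slash-separated MM/DD/YYYY dates are handled in this sketch.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def jitter_dates(text, jitter_days):
    """Shift every recognizable MM/DD/YYYY date in text by jitter_days."""

    def shift(match):
        try:
            original = datetime.strptime(match.group(0), "%m/%d/%Y")
            shifted = original + timedelta(days=jitter_days)
        except (ValueError, OverflowError):
            # Leave strings that are not real calendar dates untouched.
            return match.group(0)
        return shifted.strftime("%m/%d/%Y")

    return DATE_PATTERN.sub(shift, text)
```

Using the same jitter value for all notes of one person keeps the intervals between that person's dates intact while hiding the real dates.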
TiDE has some embedded job specifications (resource folder) that fit most common use cases.
If you need to customize the configuration, create a new config yaml file and pass its path as the value of the --deidConfigFile argument when running the tool.
Sample configuration to switch on/off features
analytic: false
googleDlpEnabled: false
nerEnabled: true
annotatorOutputEnabled: true
Multiple deid actions can be grouped into the same PHI category by using the same itemName. Grouping is useful for deid quality analytics if analytic is enabled.
Configure general regex pattern matching, or matching of the known PHI of the patient associated with the text, using the following actions:
- general: for Phone/Fax, Email, URL, IP address, SSN
- general_number: general order, account number and general-accession
- surrogate_name: surrogate name using NameSurrogate
- surrogate_address: surrogate address using LocationSurrogate
- jitter_date_from_field: Date Anonymizer with jitter value provided in an input field
- jitter_birth_date: Date Anonymizer
- jitter_date_randomly: randomly generate jitter using hash function
- remove_age: Age Anonymizer
- remove_mrn: Mrn Anonymizer
- replace_minimumlengthword_with: find words with minimum word length
- replace_with: find words longer than 2 characters that are not in the common vocabulary
- replace_strictly_with: applied strictly, regardless of word length and of whether the word is in the common vocabulary
Configuration Example:
Sample configuration file(s): Please refer to (deid configuration file) and (annotator configuration file)
name: note_deid_20190812
deidJobs:
  - jobName: stanford_deid_v3
    version: v3.0
    textFields: note_text
    textIdFields: note_id
    analytic: false
    ......
    spec:
      - itemName: phi_date
        action: jitter_date_from_field
        actionParam: 10/10/2100
        fields: 'JITTER'
      - itemName: mrn
        action: remove_mrn
        actionParam: 99999999
        fields: ''
      - itemName: patient_mrn
        action: replace_minimumlengthword_with
        actionParam: 99999999 3
        fields: PAT_MRN_ID
      - itemName: other_id
        action: replace_minimumlengthword_with
        actionParam: 999999999 3
        fields: pat_id, birth_wrist_band, epic_pat_id, PRIM_CVG_ID, PRIM_EPP_ID, EMPLOYER_ID
      .....
TiDE was created using the Apache Beam programming model, so you can run it on many technologies using the appropriate runner. See https://beam.apache.org/documentation/runners/capability-matrix/
Three types of parameters are needed for running TiDE:
- for TiDE itself, for example, specify deid configuration
- for Apache Beam runners
- for running TiDE on Google Cloud Dataflow
parameter | description | sample value |
---|---|---|
textInputFields | field name in input that contains free text | note_text |
textIdFields | field name in input that contains row id(s) | note_id,note_csn_id |
runner | type of the Apache Beam runner | DirectRunner, DataflowRunner |
inputType | type of the input source. Currently supports Google Cloud Storage, Google BigQuery, and local files | gcp_gcs, gcp_bq, local, text |
inputResource | Path of the file to read from | gs://mybucket/path/to/json/files/*.json |
outputResource | Path of the output files | |
deidConfigFile | Name of the deid configuration. Can be one of the provided configurations or an external config file | deid_config_omop_genrep.yaml |
annotatorConfigFile | Name of the annotator configuration. Can be one of the provided configurations or an external config file | annotator_config.yaml |
dlpProject | GCP project id, if using the GCP DLP service | |
googleDlpEnabled | Turn on/off Google DLP | true or false |
phiFileName | Known PHI file | /workspaces/phi/phi_person_data_example.csv |
personFile | Relationship between known PHI and notes | /workspaces/person_data/person.csv |
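When TiDE is launched from a script, the parameters in the table above can be kept in a dictionary and expanded into --name=value arguments. This is a minimal sketch; `build_tide_command` is a hypothetical helper, and the jar path and option values are taken from the standalone example earlier in this document.

```python
import subprocess


def build_tide_command(jar_path, options):
    """Build the argument list for running the TiDE jar with --name=value options."""
    command = ["java", "-jar", jar_path]
    for name, value in options.items():
        command.append(f"--{name}={value}")
    return command


def run_tide(jar_path, options):
    """Run TiDE and raise CalledProcessError if it exits with a non-zero status."""
    # A list (not a shell string) avoids quoting problems with paths and patterns.
    subprocess.run(build_tide_command(jar_path, options), check=True)
```

For example, run_tide("./target/deid-3.0.31-SNAPSHOT-dataflow.jar", {"inputType": "text", "inputResource": "./sample_notes", "outputResource": "./output"}) would launch a local run once the remaining required options have been added.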
Sample PHI file: Please refer to (PHI file)
There should be at most one record per person (person_id) in this file
person_id,MRN,JITTER,STUDY_ID,ANON_ID,pat_id,pat_name,add_line_1,city,zip,home_phone,email_address,birth_date,sex_c,ssn,epic_pat_id,PAT_MRN_ID,PAT_LAST_NAME,PAT_FIRST_NAME,EMPLOYER_ID,cur_pcp_prov_id,PROV_NAME,father_name,father_addr_ln_1,father_city,father_zip,father_cell_phone,mother_name,mother_cell_phone,emerg_pat_rel_c,accession_num
Sample person file: Please refer to (relationship file)
There should be at most one record per note (note_id) in this file
note_id,person_id
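Both files have a uniqueness requirement: at most one record per person_id in the PHI file and at most one record per note_id in the person file. A minimal sketch of a check for this, using a hypothetical helper `duplicate_keys` that takes the csv content as text:

```python
import csv
import io
from collections import Counter


def duplicate_keys(csv_text, key_column):
    """Return the sorted values of key_column that appear on more than one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if key_column not in (reader.fieldnames or []):
        raise ValueError(f"column {key_column!r} not found in header")
    counts = Counter(row[key_column] for row in reader)
    return sorted(key for key, count in counts.items() if count > 1)
```

Reading a file with open(path, newline="", encoding="utf-8").read() and passing the text with "person_id" or "note_id" as the key column gives the ids that need to be merged or removed before running TiDE.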
Here is an example for reading from Google Cloud Storage and storing result to Google Cloud Storage:
mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=com.github.susom.starr.deid.Main \
-Dexec.args="--project=<Google Project ID> \
--dlpProject=<Google Project ID for DLP API calls> \
--serviceAccount=<Service Account> \
--stagingLocation=gs://<dataflow staging bucket>/staging \
--gcpTempLocation=gs://<dataflow staging bucket>/temp \
--tempLocation=gs://<dataflow staging bucket>/temp \
--region=us-west1 --workerMachineType=n1-standard-8 --maxNumWorkers=20 --diskSizeGb=100 \
--runner=DataflowRunner \
--deidConfigFile=deid_config_omop_genrep.yaml --annotatorConfigFile=annotator_config.yaml --inputType=gcp_gcs \
--textIdFields="note_id,note_csn_id" \
--textInputFields="fullnote" \
--inputResource=gs://<input data bucket>/input/HNO_NOTE_TEXT_PHI_MERGED/note-input-*.json \
--outputResource=gs://<input data bucket>/NOTE_DEID_result"
https://cloud.google.com/dlp/docs/infotypes-reference
AGE, DATE, DATE_OF_BIRTH, CREDIT_CARD_NUMBER, US_BANK_ROUTING_MICR, AMERICAN_BANKERS_CUSIP_ID, IBAN_CODE, US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER, US_DRIVERS_LICENSE_NUMBER, US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER, US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER, US_PASSPORT, US_SOCIAL_SECURITY_NUMBER, US_EMPLOYER_IDENTIFICATION_NUMBER, US_VEHICLE_IDENTIFICATION_NUMBER, EMAIL_ADDRESS, PERSON_NAME, PHONE_NUMBER, US_HEALTHCARE_NPI, US_DEA_NUMBER, LOCATION, IP_ADDRESS, MAC_ADDRESS, URL
DLP can be integrated in two ways. The first is to enable DLP directly in the TiDE deid transform, which calls the Google DLP API individually for each text row. The second is to use a Google DLP native job to find PHI independently of TiDE and merge the findings of each parallel result into the final deid text.
Enable Google DLP in the TiDE config YAML file
deidJobs:
  - jobName: stanford_deid_v3
    ...
    googleDlpEnabled: true
java -jar deid-3.0.31-SNAPSHOT.jar \
--gcpCredentialsKeyFile=<google_credential.json> \
--projectId=<google_project_id> \
--deidConfigFile=deid_config_omop_genrep.yaml \
--annotatorConfigFile=annotator_config.yaml \
--inputBqTableId=<bigquery_input_text_table_id> \
--outputBqTableId=<bigquery_native_job_output_table_id> \
--idFields=note_id \
--inspectFields=note_text