UMLS matching #156

percevalw · 2022-10-28T07:32:17Z

(copied from APHP's gitlab — 30/09/22)

Feature type

For now, EDS-NLP can only extract and normalize entities to ATC (via ROMEDI) and ICD10.
As UMLS is an international resource that gathers many terminologies (including SnomedCT) in many languages, integrating it would greatly benefit the library and its users, making it possible to:

automatically categorize the texts of a corpus according to different concept IDs
search for entities
create processing rules (if ent.concept_id is a child of CUIXXXXX then, ...)
pre-annotate corpora
...

Several points are targeted:

Downloading the resource

The UMLS contains several tables. We are mainly interested in the MRCONSO table, which contains synonyms and concept IDs (2GB for the 2022AA version). It does not seem reasonable to ask users to download it themselves, since the procedure is long and painful. Fortunately, the small (but very well done) umls_downloader library automates this process, provided you have a UMLS license (which is necessary anyway), and stores the tables in a shared cache folder.

We should therefore:

decide when the download happens (at installation? when the eds.umls pipeline is instantiated?)
see how the downloading and caching of resources (like cim10) could be generalized across edsnlp
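
One way to generalize this is a lazy "download on first use" helper. The sketch below is only an illustration: `cached_resource` and `fake_fetch` are hypothetical names, not edsnlp or umls_downloader functions, and the real fetch step would call the downloader with a UMLS license key.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_resource(name: str, fetch: Callable[[Path], None], cache_dir: Path) -> Path:
    """Return the cached file for `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Download to a temporary name first, so that an interrupted
        # download never leaves a truncated file in the cache.
        partial = target.with_name(target.name + ".part")
        fetch(partial)
        os.replace(partial, target)
    return target


# Stand-in for the real download (assumption: the real one needs a license key).
calls: List[Path] = []


def fake_fetch(destination: Path) -> None:
    calls.append(destination)
    destination.write_text("placeholder content\n", encoding="utf-8")


cache = Path(tempfile.mkdtemp()) / "edsnlp-cache"
first = cached_resource("MRCONSO.RRF", fake_fetch, cache)
second = cached_resource("MRCONSO.RRF", fake_fetch, cache)  # served from the cache
```

With this design the download would happen when the pipeline is instantiated rather than at installation, and the same helper could serve other resources.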

Exact & approximate matching

Once the resource is downloaded, we need to find the UMLS synonyms (MRCONSO table) in the texts. For this, we can use edsnlp's EDSPhraseMatcher for exact matching and the SimstringMatcher for approximate matching.
The easiest way is to adapt one of the two existing TerminologyMatcher implementations (for ICD10 and for ATC).
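
To make the exact / approximate distinction concrete, here is a brute-force stand-in written with the standard library only. It is not the EDSPhraseMatcher or the SimstringMatcher: it simply tries a case-insensitive exact match first, then falls back to Jaccard similarity over character trigrams, which is the kind of measure simstring-style matchers rely on. The function names, the 0.5 threshold and the concept IDs are assumptions made for the illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(
    mention: str, terms: Dict[str, List[str]], threshold: float = 0.5
) -> Optional[Tuple[str, str, float]]:
    """Return (concept_id, synonym, score) for the best match, or None."""
    mention_grams = char_ngrams(mention)
    best: Optional[Tuple[str, str, float]] = None
    for concept_id, synonyms in terms.items():
        for synonym in synonyms:
            if synonym.lower() == mention.lower():
                return concept_id, synonym, 1.0  # exact match wins immediately
            score = jaccard(mention_grams, char_ngrams(synonym))
            if score >= threshold and (best is None or score > best[2]):
                best = (concept_id, synonym, score)
    return best


# Made-up concept IDs, for illustration only.
terms = {"C0000001": ["diabète"], "C0000002": ["asthme"]}
exact = match("Diabète", terms)
approximate = match("asthmes", terms)
missing = match("fracture", terms)
```

A real matcher would index the n-grams instead of scanning every synonym, which matters for a table the size of MRCONSO.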

Adapting a TerminologyMatcher would require:

pre-processing the MRCONSO table downloaded in the previous step (e.g. with optional filters on some columns)
loading it into a TerminologyMatcher
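
The pre-processing step could look roughly like the sketch below. It reads MRCONSO rows (pipe-delimited, in the standard RRF column order), applies optional language and source filters, and groups synonyms by CUI. The function name `load_synonyms`, the sample rows and their IDs are made up for the illustration, and the "concept ID to list of synonyms" output shape is an assumption about what a terminology matcher would take as input.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

# Column order of MRCONSO.RRF; each row also ends with a trailing pipe.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def load_synonyms(
    lines: Iterable[str],
    languages: Optional[Set[str]] = None,
    sources: Optional[Set[str]] = None,
) -> Dict[str, List[str]]:
    """Group non-suppressed synonyms by CUI, with optional column filters."""
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    synonyms: Dict[str, List[str]] = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(MRCONSO_COLUMNS, row))
        if languages is not None and record["LAT"] not in languages:
            continue
        if sources is not None and record["SAB"] not in sources:
            continue
        if record["SUPPRESS"] != "N":
            continue  # drop obsolete or suppressed strings
        if record["STR"] not in synonyms[record["CUI"]]:
            synonyms[record["CUI"]].append(record["STR"])
    # Drop concepts whose synonym list ended up empty.
    return {cui: strings for cui, strings in synonyms.items() if strings}


# Made-up rows in MRCONSO layout (not real UMLS content).
sample = io.StringIO(
    "C0000001|FRE|P|L0000001|PF|S0000001|Y|A0000001||||MSHFRE|MH|D000001|Diabète|0|N||\n"
    "C0000001|ENG|P|L0000002|PF|S0000002|Y|A0000002||||MSH|MH|D000001|Diabetes|0|N||\n"
    "C0000001|FRE|S|L0000003|PF|S0000003|N|A0000003||||MSHFRE|SY|D000001|Diabète sucré|0|N||\n"
    "C0000002|FRE|P|L0000004|PF|S0000004|Y|A0000004||||MSHFRE|MH|D000002|Asthme|0|O||\n"
)
french_terms = load_synonyms(sample, languages={"FRE"})
```

Filtering early keeps the in-memory dictionary far smaller than the full 2GB table.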

Normalization

Once the synonyms have been identified, we need to decide how to present the extracted information to the user. The UMLS maps synonyms to a unique identifier, the CUI, but also offers alignments to the IDs of all the terminologies it contains. At the moment, the ent.kb_id_ attribute holds identifiers from different terminologies (ATC / ICD10) without distinguishing them, which makes it difficult to use once you start mixing terminologies in a pipeline.
A more robust solution (#47) would be to store the IDs proposed by each terminology in separate extensions.
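
The idea of one slot per terminology can be sketched without spaCy. The `NormalizedEntity` class and the `codes` field below are hypothetical stand-ins for an entity and its extensions, not the design adopted in #47. In spaCy itself, such slots would typically be registered as custom extension attributes with `Span.set_extension`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NormalizedEntity:
    text: str
    # One entry per terminology instead of a single shared identifier.
    codes: Dict[str, str] = field(default_factory=dict)


entities: List[NormalizedEntity] = [
    NormalizedEntity("diabète de type 2", {"umls": "C0000001", "icd10": "E11"}),
    NormalizedEntity("paracétamol", {"atc": "N02BE01"}),
]
# "C0000001" is a made-up CUI; the ICD10 and ATC codes are illustrative.

# Downstream code can select entities by terminology without guessing
# which vocabulary a bare identifier belongs to.
with_icd10 = [ent.text for ent in entities if "icd10" in ent.codes]
with_atc = [ent.text for ent in entities if "atc" in ent.codes]
```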

percevalw · 2022-11-18T09:41:35Z

Completed as part of #147 (via #165)

percevalw added the enhancement New feature or request label Nov 4, 2022

percevalw closed this as completed Nov 18, 2022
