UMLS matching #156

Closed
percevalw opened this issue Oct 28, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@percevalw
Member

(copied from APHP's gitlab — 30/09/22)

Feature type

For now, EDS-NLP can only extract and normalize entities to ATC (via ROMEDI) and ICD10.
As UMLS is an international resource that gathers many terminologies (including SNOMED CT) in many languages, integrating it would greatly benefit the library and allow its users to:

  • automatically categorize the texts of a corpus according to different concept IDs
  • perform entity searching
  • create processing rules (if ent.concept_id is a child of CUIXXXXX then, ...)
  • do corpus pre-annotation
  • ...

Several points are targeted:

Downloading the resource

The UMLS contains several tables. We are mainly interested in the MRCONSO table, which contains synonyms and concept IDs (2GB for the 2022AA version). It does not seem reasonable to ask users to download it themselves, since the procedure is long and painful. Fortunately, the small (but very well done) umls_downloader library automates this process, provided you have a UMLS license (which is necessary anyway), and stores the tables in a shared cache folder.

We should therefore:

  • decide when the download happens (at installation? when the eds.umls pipeline is instantiated?)
  • see how downloading and caching of resources (like cim10) could be generalized for edsnlp
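A minimal sketch of what a generic "download once, then reuse" helper could look like, assuming a lazy download at pipeline instantiation. The function name `cached_resource`, the `EDSNLP_CACHE` environment variable and the default cache location are all hypothetical and only serve the illustration; the `fetch` callable stands in for whatever actually retrieves the file (for UMLS, it could wrap umls_downloader).

```python
import os
from pathlib import Path
from typing import Callable, Optional


def cached_resource(
    name: str,
    fetch: Callable[[Path], None],
    cache_dir: Optional[Path] = None,
) -> Path:
    """Return the local path of a resource, fetching it only on first use."""
    # EDSNLP_CACHE and ~/.cache/edsnlp are hypothetical, not existing settings.
    default_root = Path.home() / ".cache" / "edsnlp"
    root = Path(cache_dir or os.environ.get("EDSNLP_CACHE", default_root))
    root.mkdir(parents=True, exist_ok=True)

    target = root / name
    if not target.exists():
        # Write to a temporary file first, so that an interrupted download
        # never leaves a truncated file that looks like a valid cache entry.
        tmp = target.with_name(target.name + ".part")
        fetch(tmp)
        tmp.replace(target)
    return target
```

With such a helper, the same code path could serve cim10, ATC and UMLS: each pipeline would only provide its own `fetch` function, and the download would be triggered the first time the pipeline is instantiated rather than at installation.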

Exact & approximate matching

Once the resource is downloaded, we need to find the UMLS synonyms (MRCONSO table) in the texts. For this, we can use edsnlp's EDSPhraseMatcher for exact matching and the SimstringMatcher for approximate matching.
The easiest way is to adapt one of the two existing TerminologyMatcher components, implemented for ICD10 and for ATC.
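To make the idea of approximate matching concrete, here is a small plain-Python sketch in the spirit of SimString: strings are compared through their sets of character trigrams, and a synonym is kept when the cosine similarity exceeds a threshold. This is not the SimstringMatcher API of edsnlp; the function names, the threshold value and the toy synonym dictionary are assumptions made for the example.

```python
import math
from typing import Dict, List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of the lowercased, space-padded text."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def approximate_lookup(
    query: str,
    synonyms: Dict[str, str],
    threshold: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return (synonym, concept_id, score) triples above the threshold."""
    q = char_ngrams(query)
    scored = [(syn, cid, cosine(q, char_ngrams(syn))) for syn, cid in synonyms.items()]
    hits = [hit for hit in scored if hit[2] >= threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Toy synonym -> concept ID dictionary, for illustration only.
toy_synonyms = {
    "diabetes mellitus": "C0011849",
    "hypertension": "C0020538",
}
hits = approximate_lookup("diabetes melitus", toy_synonyms)  # misspelled query
```

This brute-force version scores every synonym, which would not scale to the full MRCONSO table; an indexed approximate matcher avoids that by only retrieving candidates that share enough n-grams with the query.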

This would require:

  • pre-processing the MRCONSO table downloaded in the previous step (e.g. with optional filters on some columns)
  • loading it into a TerminologyMatcher
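The pre-processing step could be sketched as follows. MRCONSO.RRF is a pipe-delimited file with 18 columns and a trailing `|` on each line; the sketch filters on the language (`LAT`) and source vocabulary (`SAB`) columns and groups the strings (`STR`) by concept (`CUI`). The function name `load_synonyms` is hypothetical, and the output shape, a dictionary mapping each concept ID to its list of synonyms, is an assumption about what the matcher would consume.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

# Column order of MRCONSO.RRF.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def load_synonyms(
    lines: Iterable[str],
    languages: Optional[Set[str]] = None,
    sources: Optional[Set[str]] = None,
) -> Dict[str, List[str]]:
    """Group MRCONSO strings by CUI, with optional LAT / SAB filters."""
    terms: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        # zip() drops the empty field produced by the trailing "|".
        row = dict(zip(MRCONSO_COLUMNS, line.rstrip("\n").split("|")))
        if languages is not None and row["LAT"] not in languages:
            continue
        if sources is not None and row["SAB"] not in sources:
            continue
        # Keep each synonym once per concept, in file order.
        if row["STR"] not in terms[row["CUI"]]:
            terms[row["CUI"]].append(row["STR"])
    return dict(terms)
```

Called as `load_synonyms(open(path, encoding="utf-8"), languages={"FRE"})`, this streams the file line by line instead of loading the whole 2GB table in memory, and keeps only the French strings.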

Normalization

Once the synonyms have been identified, we need to decide how to present the extracted information to the user. The UMLS aligns synonyms with a unique identifier, the CUI, but also offers alignments to the IDs of all the terminologies it contains. At the moment, the ent.kb_id_ attribute contains the various identifiers jumbled together (ATC / ICD10), making it difficult to use once you start mixing terminologies in a pipeline.
A more robust solution (#47) would be to store the IDs provided by each terminology in separate extensions.
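As an illustration of the "one slot per terminology" idea, here is a plain-Python sketch that does not depend on spaCy. The class `NormalizedEntity`, the helper names and the sample codes are hypothetical; in an actual pipeline the `codes` mapping would rather live in custom span extensions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass
class NormalizedEntity:
    text: str
    cui: str
    # One entry per terminology, instead of a single shared kb_id_.
    codes: Dict[str, str] = field(default_factory=dict)


def build_code_index(
    rows: Iterable[Tuple[str, str, str]],
) -> Dict[str, Dict[str, str]]:
    """Index (CUI, SAB, CODE) triples from MRCONSO as CUI -> {SAB: CODE}."""
    index: Dict[str, Dict[str, str]] = {}
    for cui, sab, code in rows:
        # Arbitrary choice for the sketch: keep the first code seen per source.
        index.setdefault(cui, {}).setdefault(sab, code)
    return index


def normalize(
    text: str, cui: str, index: Dict[str, Dict[str, str]]
) -> NormalizedEntity:
    """Attach every known terminology code to a matched entity."""
    return NormalizedEntity(text=text, cui=cui, codes=dict(index.get(cui, {})))


# Synthetic triples, for illustration only.
index = build_code_index(
    [("C1", "ICD10", "E14"), ("C1", "SNOMEDCT_US", "73211009"), ("C1", "ICD10", "E10")]
)
ent = normalize("diabète", "C1", index)
```

A rule can then ask for `ent.codes.get("ICD10")` without having to guess which terminology a bare identifier belongs to.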

@percevalw percevalw added the enhancement New feature or request label Nov 4, 2022
@percevalw
Member Author

Completed as part of #147 (via #165)
