For now, EDS-NLP can only extract and normalize entities to ATC (via ROMEDI) and ICD10.
As UMLS is an international resource that gathers many terminologies (including SNOMED CT) in many languages, integrating it would greatly benefit the library and its users, making it possible to:
automatically categorize the texts of a corpus according to different concept IDs
perform entity searching
create processing rules (if ent.concept_id is a child of CUIXXXXX then, ...)
do corpus pre-annotation
...
Several points are targeted:
Downloading the resource
The UMLS contains several tables. We are mainly interested in the MRCONSO table, which contains synonyms and concept IDs (2GB for the 2022AA version). It does not seem reasonable to ask users to download it themselves: the procedure is long and painful. Fortunately, the small (but very well done) umls_downloader library automates this process, provided you have a UMLS license (which is necessary anyway), and stores the tables in a shared cache folder.
We should therefore:
decide when the download happens (at installation? when the eds.umls pipeline is instantiated?)
see how downloading and caching of resources (like cim10) could be generalized for edsnlp
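As a starting point for the second item, here is a minimal sketch of a generic "download once, then reuse" helper. Everything in it is an assumption made for illustration: the `cached_resource` name, the `EDSNLP_CACHE` environment variable and the cache layout do not exist in edsnlp. The actual download is injected as a callable, so the same helper could wrap umls_downloader for UMLS or a plain HTTP fetch for cim10.

```python
import os
from pathlib import Path
from typing import Callable, Optional, Union


def default_cache_dir() -> Path:
    """Hypothetical cache root: $EDSNLP_CACHE if set, else ~/.cache/edsnlp."""
    return Path(os.environ.get("EDSNLP_CACHE", Path.home() / ".cache" / "edsnlp"))


def cached_resource(
    name: str,
    version: str,
    filename: str,
    fetch: Callable[[Path], None],
    cache_dir: Optional[Union[str, Path]] = None,
) -> Path:
    """Return the cached file, calling `fetch(destination)` only if it is missing.

    `fetch` is any callable that writes the resource to the given path
    (for UMLS it could delegate to umls_downloader; that wiring is not shown).
    """
    root = Path(cache_dir) if cache_dir is not None else default_cache_dir()
    target = root / name / version / filename
    if target.exists():
        return target  # cache hit: no download

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final location.
    tmp = target.with_name(target.name + ".part")
    fetch(tmp)
    tmp.replace(target)
    return target
```

With a helper of this shape, the download would happen lazily, the first time a pipeline asks for the resource, which is one possible answer to the "installation or instantiation" question above.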
Exact & approximate matching
Once the resource is downloaded, we need to find the UMLS synonyms (MRCONSO table) in the texts. For this, we can use the EDSPhraseMatcher of edsnlp for exact matching and the SimstringMatcher for approximate matching.
The easiest way is to adapt one of the two existing TerminologyMatcher-based components, already implemented for ICD10 and for ATC.
This would require:
pre-processing the MRCONSO table downloaded in the previous step (e.g. with optional filters on some columns)
loading it into a TerminologyMatcher
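The pre-processing step could look like the sketch below. It relies on the pipe-delimited layout of MRCONSO.RRF and its standard column order; the `load_synonyms` function and its `languages` / `sources` filters are hypothetical names chosen for this example. The output is a plain `{CUI: [synonyms]}` dictionary, on the assumption that this is a convenient shape to hand to a TerminologyMatcher.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

# Column order of MRCONSO.RRF (fields are separated by "|", with a trailing "|").
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def load_synonyms(
    lines: Iterable[str],
    languages: Optional[Set[str]] = None,  # e.g. {"FRE"}, matched against LAT
    sources: Optional[Set[str]] = None,    # e.g. {"SNOMEDCT_US"}, matched against SAB
) -> Dict[str, List[str]]:
    """Group MRCONSO strings by CUI, keeping only the requested rows."""
    terms: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < len(MRCONSO_COLUMNS):
            continue  # skip malformed or empty lines
        row = dict(zip(MRCONSO_COLUMNS, fields))
        if languages is not None and row["LAT"] not in languages:
            continue
        if sources is not None and row["SAB"] not in sources:
            continue
        if row["STR"] not in terms[row["CUI"]]:  # drop exact duplicates
            terms[row["CUI"]].append(row["STR"])
    return dict(terms)
```

Because the function consumes an iterable of lines, it can stream the 2GB file without loading it in memory, for example `load_synonyms(open(path, encoding="utf-8"), languages={"FRE"})`.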
Normalization
Once the synonyms have been identified, we need to decide how to present the extracted information to the user. The UMLS aligns synonyms with a unique identifier, the CUI, but also offers alignments to the IDs of all the terminologies it contains. At the moment, the ent.kb_id_ attribute contains the various identifiers jumbled together (ATC / ICD10), which makes it hard to use once you start mixing terminologies in a pipeline.
A more robust solution (#47) would be to store the IDs proposed by each terminology in separate extensions.
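To make the idea concrete, here is a small, library-free sketch of what "one slot per terminology" could look like. The `NormalizedEntity` class and both helper functions are hypothetical, and the rows are toy values, not real UMLS content. In a spaCy pipeline the `ids` mapping would presumably live in a custom extension attribute instead of `kb_id_`; that wiring is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class NormalizedEntity:
    """Hypothetical container: one CUI plus the codes of each source terminology."""
    text: str
    cui: str
    ids: Dict[str, Set[str]] = field(default_factory=dict)  # terminology -> codes


def build_alignments(
    rows: Iterable[Tuple[str, str, str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Index (CUI, SAB, CODE) triples, e.g. taken from MRCONSO, as CUI -> SAB -> codes."""
    alignments: Dict[str, Dict[str, Set[str]]] = {}
    for cui, sab, code in rows:
        alignments.setdefault(cui, {}).setdefault(sab, set()).add(code)
    return alignments


def normalize(
    text: str, cui: str, alignments: Dict[str, Dict[str, Set[str]]]
) -> NormalizedEntity:
    """Attach every known terminology-specific code to a matched entity."""
    return NormalizedEntity(text=text, cui=cui, ids=alignments.get(cui, {}))


# Toy rows with made-up identifiers, for illustration only.
toy_rows = [
    ("C0000001", "TERMINO_A", "A10"),
    ("C0000001", "TERMINO_B", "X00"),
    ("C0000001", "TERMINO_B", "X01"),
]
entity = normalize("diabète", "C0000001", build_alignments(toy_rows))
```

A user interested in a single terminology can then read `entity.ids["TERMINO_B"]` without parsing a mixed identifier string.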
(copied from APHP's gitlab — 30/09/22)