aphp · Thomzoy · Oct 16, 2024
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -13,7 +13,7 @@ env:
 
 jobs:
   Documentation:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
     - uses: actions/checkout@v2
 

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -24,7 +24,7 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
+        os: [ubuntu-22.04, windows-latest, macos-latest]
 
     steps:
       - uses: actions/checkout@v4
@@ -42,7 +42,7 @@ jobs:
 
   build_sdist:
     name: Build source distribution
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v2
 
@@ -58,7 +58,7 @@ jobs:
     name: Upload to PyPI
 
     needs: [build_wheels, build_sdist]
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     steps:
     - uses: actions/download-artifact@v4
@@ -76,7 +76,7 @@ jobs:
         # repository_url: https://test.pypi.org/legacy/
 
   Documentation:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
     - uses: actions/checkout@v3
 

diff --git a/.github/workflows/test-build.yml b/.github/workflows/test-build.yml
@@ -17,7 +17,7 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
+        os: [ubuntu-22.04, windows-latest, macos-latest]
 
     steps:
       - uses: actions/checkout@v2
@@ -30,7 +30,7 @@ jobs:
 
   build_sdist:
     name: Build source distribution
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v2
 

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -15,7 +15,7 @@ jobs:
   linting:
     name: Linting
     if: github.event_name == 'pull_request'
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v3
         with:
@@ -32,7 +32,7 @@ jobs:
 
   pytest:
     name: Pytest
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     strategy:
       fail-fast: true
       matrix:
@@ -120,7 +120,7 @@ jobs:
 
   documentation:
     name: Documentation
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
     - uses: actions/checkout@v2
 
@@ -150,7 +150,7 @@ jobs:
 
   simple-installation:
     name: Simple installation
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     strategy:
       fail-fast: true
       matrix:

diff --git a/changelog.md b/changelog.md
@@ -1,5 +1,19 @@
 # Changelog
 
+## Unreleased
+
+### Added
+
+- `EDS.Tokenizer` now handles `-\n` (found in text when splitting a long word across a line break) as a specific token, which can be discarded by the normalizer pipe.
+
+### Fixed
+
+- Use `ubuntu-22.04` instead of `ubuntu-latest` in CI to keep Python 3.7 compatibility
+- When using `ignore_space_tokens=True`, words separated only by linebreaks are now collected (via `get_text()`) with spaces in between
+- The `process` method of `Qualifiers` now accepts a `Span` as input, and treats it as a `Doc` to avoid alignment issues
+- The `detailed_status_mapping` of disorder/behavior pipes now handles the `KeyError: None` that could previously occur when loading pre-annotated docs without instantiating the pipes beforehand
+- Various fixes on the Alcohol and Tobacco pipes
+
 ## v0.13.1
 
 ### Added
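Reviewer note on the `-\n` changelog entry: the sketch below illustrates the idea of treating a hyphen followed by a line break as its own token and then dropping it during normalization. It is a minimal stand-alone illustration, not the `EDS.Tokenizer` implementation; `TOKEN_RE`, `tokenize` and `normalized_text` are hypothetical names introduced here.

```python
import re

# Hypothetical token pattern: the hyphen + line break alternative is tried
# before punctuation and whitespace, so "-\n" comes out as a single token.
TOKEN_RE = re.compile(r"-\n|\w+|[^\w\s]|\s+")

HYPHEN_BREAK = "-\n"


def tokenize(text):
    """Split text into word, punctuation, whitespace and hyphen-break tokens."""
    return TOKEN_RE.findall(text)


def normalized_text(tokens):
    """Join tokens, skipping hyphen-break tokens so split words are glued back."""
    return "".join(tok for tok in tokens if tok != HYPHEN_BREAK)


tokens = tokenize("insuffi-\nsance cardiaque")
text = normalized_text(tokens)
```

With the hyphen-break token discarded, the word split across two lines is read back as a single word, which is what a downstream matcher needs.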

diff --git a/docs/pipes/ner/behaviors/alcohol.md b/docs/pipes/ner/behaviors/alcohol.md
@@ -1,5 +1,7 @@
 # Alcohol consumption {: #edsnlp.pipes.ner.behaviors.alcohol.factory.create_component }
 
+--8<-- "docs/pipes/ner/disorders/warning.md"
+
 ::: edsnlp.pipes.ner.behaviors.alcohol.factory.create_component
     options:
         heading_level: 2

diff --git a/docs/pipes/ner/behaviors/index.md b/docs/pipes/ner/behaviors/index.md
@@ -2,99 +2,6 @@
 
 ## Presentation
 
-EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the ContextualMatcher component.
-Some general considerations about those components:
+EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.
 
-- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
-- The matched comorbidity is also available under the `ent.label_` of each match.
-- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
-- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
-  ```{ .python .no-check }
-  nlp.add_pipe(
-      eds.normalizer(
-          accents=True,
-          lowercase=True,
-          quotes=True,
-          spaces=True,
-          pollution=dict(
-              information=True,
-              bars=True,
-              biology=True,
-              doctors=True,
-              web=True,
-              coding=True,
-              footer=True,
-          ),
-      ),
-  )
-  ```
-
-!!! warning "Use qualifiers"
-    Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.
-
-    !!! aphp "Use the ML model"
-
-        The model will soon be available in the models catalogue of AP-HP's CDW.
-
-## Usage
-
-```{ .python .no-check }
-import edsnlp, edsnlp.pipes as eds
-
-nlp = edsnlp.blank("eds")
-nlp.add_pipe(eds.sentences())
-nlp.add_pipe(
-    eds.normalizer(
-        accents=True,
-        lowercase=True,
-        quotes=True,
-        spaces=True,
-        pollution=dict(
-            information=True,
-            bars=True,
-            biology=True,
-            doctors=True,
-            web=True,
-            coding=True,
-            footer=True,
-        ),
-    ),
-)
-nlp.add_pipe(eds.tobacco())
-nlp.add_pipe(eds.diabetes())
-
-text = """
-Compte-rendu de consultation.
-
-Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
-Le patient va bien depuis la dernière fois.
-Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
-
-Sur le plan de son diabète, la glycémie est stable.
-"""
-
-doc = nlp(text)
-
-doc.spans
-# Out: {
-# 'pollutions': [],
-# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
-# 'diabetes': [rétinopathie diabétique, diabète]
-# }
-
-tobacco_matches = doc.spans["tobacco"]
-tobacco_matches[0]._.detailed_status
-# Out: "ABSTINENCE" #
-
-tobacco_matches[0]._.assigned["PA"]  # paquet-année
-# Out: 10 # (1)
-
-
-diabetes = doc.spans["diabetes"]
-(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
-# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (2)
-```
-
-1. Here we see an example of additional information that can be extracted
-2. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
+--8<-- "docs/pipes/ner/disorders/presentation.md"
diff --git a/docs/pipes/ner/disorders/index.md b/docs/pipes/ner/disorders/index.md
@@ -2,58 +2,6 @@
 
 ## Presentation
 
-The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the ContextualMatcher component.
+The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.
 
-The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024]
-
-Some general considerations about those components:
-
-- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
-- The matched comorbidity is also available under the `ent.label_` of each match.
-- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
-- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
-
-    ```{ .python .no-check }
-    import edsnlp, edsnlp.pipes as eds
-    ...
-
-    nlp.add_pipe(
-        eds.normalizer(
-            accents=True,
-            lowercase=True,
-            quotes=True,
-            spaces=True,
-            pollution=dict(
-                information=True,
-                bars=True,
-                biology=True,
-                doctors=True,
-                web=True,
-                coding=True,
-                footer=True,
-            ),
-        ),
-    )
-    ```
-
-!!! warning "Use qualifiers"
-    Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.
-
-    !!! aphp "Use the ML model"
-
-        The model will soon be available in the models catalogue of AP-HP's CDW.
-
-!!! tip "On the medical definition of the comorbidities"
-
-    Those components were developped to extract **chronic** and **symptomatic** conditions only.
-
-## Aggregation
-
-For relevant phenotyping, matches should be aggregated at the document-level. For instance, a document might mention a complicated diabetes at the beginning ("*Le patient a une rétinopathie diabétique*"), and then refer to this diabetes without mentionning that it is complicated anymore ("*Concernant son diabète, le patient ...*").
-Thus, a good and simple aggregation rule is, for each comorbidity, to
-
-- disregard all entities tagged as irrelevant by the qualification component(s)
-- take the maximum (i.e., the most severe) status of the leftover entities
-
-An implementation of this rule is presented [here][aggregating-results]
+--8<-- "docs/pipes/ner/disorders/presentation.md"
diff --git a/docs/pipes/ner/disorders/presentation.md b/docs/pipes/ner/disorders/presentation.md
@@ -0,0 +1,77 @@
+The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024].
+
+Some general considerations about those components:
+
+- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
+- The matched comorbidity is also available under the `ent.label_` of each match.
+- Matches have an associated `_.status` attribute taking the value `1` or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component's documentation for more details.
+- Some components add additional information to matches. For instance, the `eds.tobacco` component adds, if relevant, the extracted *pack-year* (= *paquet-année*) value. This information is available under the `ent._.assigned` attribute.
+- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline (see [Usage](#usage) below).
+
+--8<-- "docs/pipes/ner/disorders/warning.md"
+
+!!! warning "Use qualifiers"
+    Those components **should be used with a qualification pipeline** to avoid extracting unwanted matches. At the very least, you should use the available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reasons, the model isn't publicly available yet.
+
+    !!! aphp "Use the ML model"
+
+        For projects working on AP-HP's CDW, this model is available via its models catalogue.
+
+## Usage
+
+```{ .python .no-check }
+import edsnlp, edsnlp.pipes as eds
+
+nlp = edsnlp.blank("eds")
+nlp.add_pipe(eds.sentences())
+nlp.add_pipe(
+    eds.normalizer(
+        accents=True,
+        lowercase=True,
+        quotes=True,
+        spaces=True,
+        pollution=dict(
+            biology=True, #(1)
+            coding=True, #(2)
+        ),
+    ),
+)
+nlp.add_pipe(eds.tobacco())
+nlp.add_pipe(eds.diabetes())
+
+text = """
+Compte-rendu de consultation.
+
+Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
+Le patient va bien depuis la dernière fois.
+Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
+
+Sur le plan de son diabète, la glycémie est stable.
+"""
+
+doc = nlp(text)
+
+doc.spans
+# Out: {
+# 'pollutions': [],
+# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
+# 'diabetes': [rétinopathie diabétique, diabète]
+# }
+
+tobacco_matches = doc.spans["tobacco"]
+tobacco_matches[0]._.detailed_status
+# Out: "ABSTINENCE" #
+
+tobacco_matches[0]._.assigned["PA"]  # paquet-année
+# Out: 10 # (3)
+
+
+diabetes = doc.spans["diabetes"]
+(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
+# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (4)
+```
+
+1. This will discard mentions of biology results, which often lead to false positives
+2. This will discard mentions of ICD10 coding that sometimes appear at the end of clinical documents
+3. Here we see an example of additional information that can be extracted
+4. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
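Reviewer note on annotation 4: the document-level aggregation it refers to (drop entities that qualifiers flag as irrelevant, then keep the most severe status per comorbidity) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not EDS-NLP code: `Match`, its fields and `aggregate` are hypothetical, and the sample status values are chosen for illustration, assuming a higher status means a more severe condition.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for one extracted entity."""

    label: str  # e.g. "diabetes"
    status: int  # 1 or 2; assumed here: higher means more severe
    relevant: bool  # False when a qualifier flagged it (negation, family, ...)


def aggregate(matches):
    """Return the maximum status per label, ignoring irrelevant matches."""
    result = {}
    for m in matches:
        if not m.relevant:
            continue
        result[m.label] = max(result.get(m.label, 0), m.status)
    return result


matches = [
    Match("diabetes", 2, True),  # e.g. "rétinopathie diabétique"
    Match("diabetes", 1, True),  # e.g. "diabète"
    Match("tobacco", 2, False),  # e.g. a mention discarded by a qualifier
]
doc_level = aggregate(matches)
```

Here the two diabetes mentions collapse into a single document-level status of `2`, and the discarded tobacco mention does not contribute at all.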
diff --git a/docs/pipes/ner/disorders/warning.md b/docs/pipes/ner/disorders/warning.md
@@ -0,0 +1,7 @@
+!!! danger "On overlapping entities"
+    When using multiple disorder or behavior pipes, the same text may be extracted by several of them. For instance:
+
+    * "Intoxication éthylotabagique" will be tagged both by `eds.tobacco` and `eds.alcohol`
+    * "Cirrhose alcoolique" will be tagged both by `eds.liver_disease` and `eds.alcohol`
+
+    As `doc.ents` discards overlapping entities, you should use `doc.spans` instead.
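Reviewer note on the warning above: the sketch below shows why a container that forbids overlaps keeps only one of two matches covering the same text, while per-pipe groups keep both. It is a simplified stand-in written for this note, not the filtering code used by spaCy or EDS-NLP; `filter_overlaps` and the `(start, end, label)` tuples are hypothetical, and the "longest first, then earliest" rule is an assumption.

```python
def filter_overlaps(spans):
    """Keep non-overlapping spans, preferring longer ones, then earlier ones.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    """
    kept = []
    taken = set()
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end, label in ordered:
        positions = set(range(start, end))
        if positions & taken:
            continue  # overlaps a span that was already kept
        kept.append((start, end, label))
        taken |= positions
    return sorted(kept)


# "Intoxication éthylotabagique" (tokens 0 to 2) matched by two pipes.
spans = [(0, 2, "tobacco"), (0, 2, "alcohol")]

# A non-overlapping container keeps a single match...
ents = filter_overlaps(spans)

# ...whereas one group per pipe keeps both.
by_pipe = {label: (start, end) for start, end, label in spans}
```

Which of the two matches survives depends on ordering alone, so reading results from per-pipe span groups avoids silently losing one of them.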
diff --git a/edsnlp/core/pipeline.py b/edsnlp/core/pipeline.py
@@ -761,7 +761,13 @@ def to_disk(
         if (
             os.path.exists(path)
             and os.listdir(path)
-            and not os.path.exists(path / "config.cfg")
+            and not (
+                os.path.exists(path / "config.cfg") or
+                (
+                    os.path.exists(path / "meta.json") and
+                    os.path.exists(path / "tokenizer")
+                )
+            )
         ):
             raise Exception(
                 "The directory already exists and doesn't appear to be a"