Fixes on tokeniser, normalisation, qualifiers and CI #329

Open
wants to merge 14 commits into
base: master
2 changes: 1 addition & 1 deletion .github/workflows/documentation.yml
@@ -13,7 +13,7 @@ env:

jobs:
Documentation:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2

8 changes: 4 additions & 4 deletions .github/workflows/release.yml
@@ -24,7 +24,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [ubuntu-22.04, windows-latest, macos-latest]

steps:
- uses: actions/checkout@v4
@@ -42,7 +42,7 @@ jobs:

build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2

@@ -58,7 +58,7 @@ jobs:
name: Upload to PyPI

needs: [build_wheels, build_sdist]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04

steps:
- uses: actions/download-artifact@v4
@@ -76,7 +76,7 @@ jobs:
# repository_url: https://test.pypi.org/legacy/

Documentation:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v3

4 changes: 2 additions & 2 deletions .github/workflows/test-build.yml
@@ -17,7 +17,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [ubuntu-22.04, windows-latest, macos-latest]

steps:
- uses: actions/checkout@v2
@@ -30,7 +30,7 @@ jobs:

build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2

8 changes: 4 additions & 4 deletions .github/workflows/tests.yml
@@ -15,7 +15,7 @@ jobs:
linting:
name: Linting
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v3
with:
@@ -32,7 +32,7 @@ jobs:

pytest:
name: Pytest
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
strategy:
fail-fast: true
matrix:
@@ -120,7 +120,7 @@ jobs:

documentation:
name: Documentation
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v2

@@ -150,7 +150,7 @@ jobs:

simple-installation:
name: Simple installation
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
strategy:
fail-fast: true
matrix:
14 changes: 14 additions & 0 deletions changelog.md
@@ -1,5 +1,19 @@
# Changelog

## Unreleased

### Added

- `EDS.Tokenizer` now handles `-\n` (found in text when splitting a long word across a linebreak) as a specific token, which can be discarded by the normalizer pipe.

### Fixed

- Use `ubuntu-22.04` instead of `ubuntu-latest` in CI to keep `python 3.7` compatibility
- When using `ignore_space_tokens=True`, words separated only by linebreaks are now collected (via `get_text()`) with spaces in between
- The `process` method of `Qualifiers` now accepts a `Span` as input, and treats it as a `Doc` to avoid alignment issues
- The `detailed_status_mapping` of disorder/behavior pipes now handles the previous `KeyError: None` that can occur when loading pre-annotated docs without instantiating pipes beforehand
- Various fixes on the Alcohol and Tobacco pipes

## v0.13.1

### Added
2 changes: 2 additions & 0 deletions docs/pipes/ner/behaviors/alcohol.md
@@ -1,5 +1,7 @@
# Alcohol consumption {: #edsnlp.pipes.ner.behaviors.alcohol.factory.create_component }

--8<-- "docs/pipes/ner/disorders/warning.md"

::: edsnlp.pipes.ner.behaviors.alcohol.factory.create_component
options:
heading_level: 2
97 changes: 2 additions & 95 deletions docs/pipes/ner/behaviors/index.md
@@ -2,99 +2,6 @@

## Presentation

EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the ContextualMatcher component.
Some general considerations about those components:
EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
```{ .python .no-check }
nlp.add_pipe(
eds.normalizer(
accents=True,
lowercase=True,
quotes=True,
spaces=True,
pollution=dict(
information=True,
bars=True,
biology=True,
doctors=True,
web=True,
coding=True,
footer=True,
),
),
)
```

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.

!!! aphp "Use the ML model"

The model will soon be available in the models catalogue of AP-HP's CDW.

## Usage

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
eds.normalizer(
accents=True,
lowercase=True,
quotes=True,
spaces=True,
pollution=dict(
information=True,
bars=True,
biology=True,
doctors=True,
web=True,
coding=True,
footer=True,
),
),
)
nlp.add_pipe(eds.tobacco())
nlp.add_pipe(eds.diabetes())

text = """
Compte-rendu de consultation.

Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
Le patient va bien depuis la dernière fois.
Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).

Sur le plan de son diabète, la glycémie est stable.
"""

doc = nlp(text)

doc.spans
# Out: {
# 'pollutions': [],
# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
# 'diabetes': [rétinopathie diabétique, diabète]
# }

tobacco_matches = doc.spans["tobacco"]
tobacco_matches[0]._.detailed_status
# Out: "ABSTINENCE" #

tobacco_matches[0]._.assigned["PA"] # paquet-année
# Out: 10 # (1)


diabetes = doc.spans["diabetes"]
(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (2)
```

1. Here we see an example of additional information that can be extracted
2. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
--8<-- "docs/pipes/ner/disorders/presentation.md"
56 changes: 2 additions & 54 deletions docs/pipes/ner/disorders/index.md
@@ -2,58 +2,6 @@

## Presentation

The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the ContextualMatcher component.
The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on the `eds.contextual_matcher` component.

The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024]

Some general considerations about those components:

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
...

nlp.add_pipe(
eds.normalizer(
accents=True,
lowercase=True,
quotes=True,
spaces=True,
pollution=dict(
information=True,
bars=True,
biology=True,
doctors=True,
web=True,
coding=True,
footer=True,
),
),
)
```

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.

!!! aphp "Use the ML model"

The model will soon be available in the models catalogue of AP-HP's CDW.

!!! tip "On the medical definition of the comorbidities"

Those components were developped to extract **chronic** and **symptomatic** conditions only.

## Aggregation

For relevant phenotyping, matches should be aggregated at the document-level. For instance, a document might mention a complicated diabetes at the beginning ("*Le patient a une rétinopathie diabétique*"), and then refer to this diabetes without mentionning that it is complicated anymore ("*Concernant son diabète, le patient ...*").
Thus, a good and simple aggregation rule is, for each comorbidity, to

- disregard all entities tagged as irrelevant by the qualification component(s)
- take the maximum (i.e., the most severe) status of the leftover entities

An implementation of this rule is presented [here][aggregating-results]
--8<-- "docs/pipes/ner/disorders/presentation.md"
77 changes: 77 additions & 0 deletions docs/pipes/ner/disorders/presentation.md
@@ -0,0 +1,77 @@
The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024].

Some general considerations about those components:

- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
- The matched comorbidity is also available under the `ent.label_` of each match.
- Matches have an associated `_.status` attribute taking the value `1` or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component's documentation for more details.
- Some components add additional information to matches. For instance, the `eds.tobacco` component adds, if relevant, the extracted *pack-year* (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline (see [Usage](#usage) below).

--8<-- "docs/pipes/ner/disorders/warning.md"

!!! warning "Use qualifiers"
Those components **should be used with a qualification pipeline** to avoid extracting unwanted matches. At the very least, you should use the available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reasons, the model isn't publicly available yet.

!!! aphp "Use the ML model"

For projects working on AP-HP's CDW, this model is available via its models catalogue.

## Usage

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
eds.normalizer(
accents=True,
lowercase=True,
quotes=True,
spaces=True,
pollution=dict(
biology=True, #(1)
coding=True, #(2)
),
),
)
nlp.add_pipe(eds.tobacco())
nlp.add_pipe(eds.diabetes())

text = """
Compte-rendu de consultation.

Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
Le patient va bien depuis la dernière fois.
Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).

Sur le plan de son diabète, la glycémie est stable.
"""

doc = nlp(text)

doc.spans
# Out: {
# 'pollutions': [],
# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
# 'diabetes': [rétinopathie diabétique, diabète]
# }

tobacco_matches = doc.spans["tobacco"]
tobacco_matches[0]._.detailed_status
# Out: "ABSTINENCE" #

tobacco_matches[0]._.assigned["PA"] # paquet-année
# Out: 10 # (3)


diabetes = doc.spans["diabetes"]
(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (4)
```

1. This will discard mentions of biology results, which often lead to false positives
2. This will discard mentions of ICD10 coding that sometimes appear at the end of clinical documents
3. Here we see an example of additional information that can be extracted
4. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
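
The document-level aggregation mentioned in the last annotation can be sketched in plain Python. This is only an illustration under stated assumptions: the `Match` class, its `discarded` flag and the `aggregate` function are hypothetical names and not part of EDS-NLP, and the sketch assumes that a higher integer status means a more severe condition.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Match:
    """Hypothetical stand-in for an extracted entity (not an EDS-NLP class)."""

    label: str       # e.g. "diabetes"
    status: int      # assumption: higher means more severe
    discarded: bool  # True if a qualifier (negation, hypothesis, family) ruled it out


def aggregate(matches: Iterable[Match]) -> Dict[str, int]:
    """Keep, for each label, the maximum status among non-discarded matches."""
    result: Dict[str, int] = {}
    for match in matches:
        if match.discarded:
            continue  # ignore entities tagged as irrelevant by the qualifiers
        result[match.label] = max(result.get(match.label, 0), match.status)
    return result


# Illustrative values only: two diabetes mentions and one discarded tobacco mention
matches = [
    Match("diabetes", 2, False),
    Match("diabetes", 1, False),
    Match("tobacco", 1, True),
]
doc_level = aggregate(matches)
```

With real documents, the same idea would presumably be applied to the entities of each `doc.spans` group after running the qualifiers.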
7 changes: 7 additions & 0 deletions docs/pipes/ner/disorders/warning.md
@@ -0,0 +1,7 @@
!!! danger "On overlapping entities"
When using multiple disorder or behavior pipelines, some entities may be extracted by several pipes. For instance:

* "Intoxication éthylotabagique" will be tagged both by `eds.tobacco` and `eds.alcohol`
* "Cirrhose alcoolique" will be tagged both by `eds.liver_disease` and `eds.alcohol`

As `doc.ents` discards overlapping entities, you should use `doc.spans` instead.
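
To make the warning concrete, here is a small plain-Python sketch of why a flat, non-overlapping entity list loses one of two labels on the same tokens while per-label groups keep both. It is not spaCy's or EDS-NLP's code: the `Span` tuple, `keep_non_overlapping` and `group_by_label` are hypothetical helpers, and the greedy longest-first filter is just one possible way to resolve overlaps.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start token, end token (exclusive), label)


def keep_non_overlapping(spans: List[Span]) -> List[Span]:
    """Greedy filter: longest spans first, drop any span overlapping a kept one."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def group_by_label(spans: List[Span]) -> Dict[str, List[Span]]:
    """Store spans per label, so overlaps across labels are preserved."""
    groups: Dict[str, List[Span]] = {}
    for span in spans:
        groups.setdefault(span[2], []).append(span)
    return groups


# "Intoxication éthylotabagique" matched by two pipes on the same two tokens
spans = [(0, 2, "tobacco"), (0, 2, "alcohol")]
flat = keep_non_overlapping(spans)   # only one of the two matches survives
grouped = group_by_label(spans)      # both matches are kept, one per label
```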
8 changes: 7 additions & 1 deletion edsnlp/core/pipeline.py
@@ -761,7 +761,13 @@ def to_disk(
if (
os.path.exists(path)
and os.listdir(path)
and not os.path.exists(path / "config.cfg")
and not (
os.path.exists(path / "config.cfg") or
(
os.path.exists(path / "meta.json") and
os.path.exists(path / "tokenizer")
)
)
):
raise Exception(
"The directory already exists and doesn't appear to be a"
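
The condition changed in this hunk can be read as follows: an existing, non-empty directory may be overwritten only if it contains `config.cfg`, or both `meta.json` and `tokenizer`. The standalone sketch below restates that logic with `pathlib`. The function names are hypothetical and this is not the actual `to_disk` implementation, only the check visible in the diff.

```python
import tempfile
from pathlib import Path


def looks_like_saved_pipeline(path: Path) -> bool:
    """True if the directory has one of the two layouts accepted by the check."""
    has_config = (path / "config.cfg").exists()
    has_meta_and_tokenizer = (path / "meta.json").exists() and (
        path / "tokenizer"
    ).exists()
    return has_config or has_meta_and_tokenizer


def is_safe_to_overwrite(path: Path) -> bool:
    """Missing or empty directories are fine; otherwise require a known layout."""
    if not path.exists() or not any(path.iterdir()):
        return True
    return looks_like_saved_pipeline(path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_ok = is_safe_to_overwrite(root)       # empty directory
    (root / "notes.txt").write_text("unrelated")
    unrelated_ok = is_safe_to_overwrite(root)   # unrelated content only
    (root / "meta.json").write_text("{}")
    (root / "tokenizer").write_text("")
    model_ok = is_safe_to_overwrite(root)       # meta.json and tokenizer present
```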