Revamp of the date pipeline #22

Merged
merged 45 commits into from
Apr 8, 2022
f23cd26
add fast_parse for days and months
Mar 15, 2022
44f2b7b
add tests
keyber Mar 16, 2022
beb3b6c
Merge branch 'master' into dates
keyber Mar 16, 2022
84cae8c
parse relative_dates
keyber Mar 17, 2022
969efb0
improve dates
Mar 17, 2022
f73de24
comments
keyber Mar 17, 2022
e150c24
remove a test
keyber Mar 22, 2022
09d2f9f
Merge branch 'master' into dates
keyber Mar 22, 2022
4d7c83a
Removed a modified file from pull request
keyber Mar 25, 2022
115f87a
Merge branch 'master' into dates
keyber Mar 25, 2022
caa2100
use new way of normalizing
keyber Mar 25, 2022
e9ffed0
feat: add the ability to filter tuples of Span
bdura Mar 30, 2022
d7bfcb8
docs: update filter docs
bdura Mar 30, 2022
4606c84
feat: add static dateparser
bdura Mar 30, 2022
c844dea
feat: simplify patterns in dates
bdura Mar 30, 2022
dd5e4e0
fix: wrong dict for months
bdura Mar 31, 2022
7737523
Merge remote-tracking branch 'github/master' into dates
bdura Mar 31, 2022
177a971
Merge remote-tracking branch 'github/master' into dates
bdura Apr 5, 2022
0e69751
feat: simplify date detection and parsing
bdura Apr 5, 2022
3477fee
feat: add periods
bdura Apr 5, 2022
920d80c
tests: add dates testing
bdura Apr 5, 2022
e49378f
fix: year only dates
bdura Apr 5, 2022
5f1d3da
refactor(dates): reorganise the dates pipeline
bdura Apr 5, 2022
8b0fd4a
refactor(dates): reorganise the dates pipeline
bdura Apr 5, 2022
ca0e8a9
fix: adapt consultation dates to the new eds.dates
bdura Apr 5, 2022
a7dea9f
tests: update consultation_dates pipeline
bdura Apr 5, 2022
309017b
Merge branch 'dates' of github.com:aphp/edsnlp into dates
bdura Apr 5, 2022
b86cd9e
feat: add date parsing with Pendulum
bdura Apr 5, 2022
a0c1b86
feat: add year validator
bdura Apr 5, 2022
a55c018
fix: tests for dates
bdura Apr 5, 2022
76a9411
docs: update detecting-dates tutorial
bdura Apr 5, 2022
a8bfc16
docs: update dates pipeline
bdura Apr 5, 2022
fbd8d5d
feat: add durations
bdura Apr 6, 2022
bdd0a36
fix(dates): mode testing in persiods
bdura Apr 6, 2022
ce5b0f9
feat(dates): handle durations for periods
bdura Apr 6, 2022
ae67bb4
fix(dates): adapt tests to the new pipeline
bdura Apr 6, 2022
fc5c320
chore(dependencies): remove dateparser dependency
bdura Apr 6, 2022
efec1bf
chore(dates): update changelog
bdura Apr 6, 2022
da80b1c
docs(dates): update docs
bdura Apr 6, 2022
8cc08cf
feat(dates): add normalisation
bdura Apr 8, 2022
0b9f902
style(dates): clean norm method
bdura Apr 8, 2022
281bead
feat(dates): update demo to use date normalisation
bdura Apr 8, 2022
d78e6f7
refactor(dates): method name to "to_datetime"
bdura Apr 8, 2022
19e5d0b
feat(dates): clean period model
bdura Apr 8, 2022
2019c88
fix(dates): propagate `to_datetime` refactoring
bdura Apr 8, 2022
6 changes: 6 additions & 0 deletions changelog.md
@@ -7,6 +7,10 @@
- New `eds` language to better fit French clinical documents and improve speed.
- Testing for markdown codeblocks.

### Changed

- Complete revamp of the date detection pipeline

## v0.4.4

- Add `measures` pipeline
@@ -18,6 +22,8 @@
## v0.4.3

- Fix regex matching on spans.
- Add fast_parse in date pipeline.
- Add relative_date information parsing

## v0.4.2

2 changes: 1 addition & 1 deletion demo/app.py
@@ -215,7 +215,7 @@ def load_model(

for date in doc.spans.get("dates", []):
span = Span(doc, date.start, date.end, label="date")
- span._.value = span._.date
+ span._.value = span._.date.norm()
ents.append(span)

for measure in doc.spans.get("measures", []):
4 changes: 2 additions & 2 deletions docs/pipelines/misc/consultation-dates.md
@@ -35,8 +35,8 @@ doc = nlp(text)
doc.spans["consultation_dates"]
# Out: [Consultation du 03/10/2018]

- doc.spans["consultation_dates"][0]._.consultation_date
- # Out: datetime.datetime(2018, 10, 3, 0, 0)
+ doc.spans["consultation_dates"][0]._.consultation_date.to_datetime()
+ # Out: DateTime(2018, 10, 3, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
```

## Consultation events
52 changes: 20 additions & 32 deletions docs/pipelines/misc/dates.md
@@ -1,23 +1,17 @@
# Dates

The `eds.dates` pipeline's role is to detect and normalise dates within a medical document.
We use simple regular expressions to extract date mentions, and apply the [`dateparser` library](https://dateparser.readthedocs.io/en/latest/index.html)
for the normalisation.

!!! warning

The ``dates`` pipeline is still in active development and has not been rigorously validated.
If you come across a date expression that goes undetected, please file an issue !
We use simple regular expressions to extract date mentions.

## Scope

The `eds.dates` pipeline finds absolute (eg `23/08/2021`) and relative (eg `hier`, `la semaine dernière`) dates alike.
The `eds.dates` pipeline finds absolute (eg `23/08/2021`) and relative (eg `hier`, `la semaine dernière`) dates alike. It also handles mentions of duration.

If the date of edition (via the `doc._.note_datetime` extension) is available, relative (and "year-less") dates will be normalised
using the latter as base. On the other hand, if the base is unknown, the normalisation will follow the pattern :
`TD±<number-of-days>`, positive values meaning that the relative date mentions the future (`dans trois jours`).

Since the extension `doc._.note_datetime` cannot be set before applying the `dates` pipeline, we defer the normalisation step until the `span._.dates` attribute is accessed.
| Type | Example |
| ---------- | ----------------------------- |
| `absolute` | `3 mai`, `03/05/2020` |
| `relative` | `hier`, `la semaine dernière` |
| `duration` | `pendant quatre jours` |

See the [tutorial](../../tutorials/detecting-dates.md) for a presentation of a full pipeline featuring the `eds.dates` component.

@@ -26,55 +20,49 @@ See the [tutorial](../../tutorials/detecting-dates.md) for a presentation of a f
```python
import spacy

from datetime import datetime
import pendulum

nlp = spacy.blank("fr")
nlp.add_pipe("eds.dates")

text = (
"Le patient est admis le 23 août 2021 pour une douleur à l'estomac. "
"Il lui était arrivé la même chose il y a un an."
"Il lui était arrivé la même chose il y a un an pendant une semaine."
)

doc = nlp(text)

dates = doc.spans["dates"]
dates
# Out: [23 août 2021, il y a un an]
# Out: [23 août 2021, il y a un an, pendant une semaine]

dates[0]._.date
# Out: '2021-08-23'
dates[0]._.date.to_datetime()
# Out: 2021-08-23T00:00:00+02:00

dates[1]._.date
# Out: 'TD-365'
dates[1]._.date.to_datetime()
# Out: -1 year

doc._.note_datetime = datetime(2021, 8, 27)
note_datetime = pendulum.datetime(2021, 8, 27, tz="Europe/Paris")

dates[1]._.date
# Out: '2020-08-27'
dates[1]._.date.to_datetime(note_datetime=note_datetime)
# Out: 2020-08-27T00:00:00+02:00
```
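
To make the relative case concrete: a mention such as `il y a un an` only becomes an absolute date once a reference date is supplied, which is what the `note_datetime` argument does in the example above. The sketch below reproduces that arithmetic with the standard library alone. `resolve_relative` is a hypothetical helper written for illustration, not part of `eds.dates`, which relies on Pendulum for this step.

```python
from datetime import datetime, timedelta


def resolve_relative(note_datetime: datetime, years: int = 0, days: int = 0) -> datetime:
    """Hypothetical helper: shift `note_datetime` by a signed offset.

    Negative values point to the past, positive values to the future.
    """
    shifted = note_datetime + timedelta(days=days)
    try:
        return shifted.replace(year=shifted.year + years)
    except ValueError:
        # 29 February shifted onto a non-leap year: fall back to 28 February.
        return shifted.replace(year=shifted.year + years, day=28)


note_datetime = datetime(2021, 8, 27)

# "il y a un an" -> one year before the note date
print(resolve_relative(note_datetime, years=-1))  # 2020-08-27 00:00:00
```

Without a reference date there is nothing to anchor the offset to, which is why the example above shows `-1 year` when `to_datetime()` is called with no argument.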

## Declared extensions

The `eds.dates` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object :

1. The `date_parsed` attribute is a Python `datetime` object, used internally by the pipeline.
2. The `date` attribute is a property that displays a normalised human-readable string for the date.
The `eds.dates` pipeline declares one [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object: the `date` attribute contains a parsed version of the date.

## Configuration

The pipeline can be configured using the following parameters :

| Parameter | Explanation | Default |
| ---------------- | ------------------------------------------------ | --------------------------------- |
| `no_year` | Date patterns without year, eg `le 5 août` | `None` (use pre-defined patterns) |
| `year_only` | Date patterns with only the year, eg `en 2018` | `None` (use pre-defined patterns) |
| `no_day` | Date patterns without day, eg `en mars 2018` | `None` (use pre-defined patterns) |
| `absolute` | Absolute date patterns, eg `le 5 août 2020` | `None` (use pre-defined patterns) |
| `relative` | Relative date patterns, eg `hier` | `None` (use pre-defined patterns) |
| `full` | Full date patterns, eg `2020-10-23` | `None` (use pre-defined patterns) |
| `current` | "Current" date patterns, eg `ce jour` | `None` (use pre-defined patterns) |
| `durations` | Duration patterns, eg `pendant trois mois` | `None` (use pre-defined patterns) |
| `false_positive` | Some false positive patterns to exclude | `None` (use pre-defined patterns) |
| `detect_periods` | Whether to detect periods | `False` |
| `on_ents_only` | Whether to look for dates around entities only | `False` |
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` |

38 changes: 29 additions & 9 deletions docs/tutorials/detecting-dates.md
@@ -32,6 +32,7 @@ Clinical notes contain many different types of dates. To name a few examples:
| Absolute | Explicit date | `2022-03-03` |
| Partial | Date missing the day, month or year | `le 3 janvier/on January 3rd`, `en 2021/in 2021` |
| Relative | Relative dates | `hier/yesterday`, `le mois dernier/last month` |
| Duration | Durations | `pendant trois mois/for three months` |

!!! warning

@@ -74,12 +75,28 @@ dates # (1)

1. `dates` is a list of spaCy `Span` objects.

## Normalisation

We can review each date and get its normalisation:

| `date.text` | `date._.date` |
| ------------------ | ------------- |
| `21 janvier` | `????-01-21` |
| `il y a trois ans` | `TD-1095` |
| `date.text` | `date._.date` |
| ------------------ | ------------------------------------------- |
| `21 janvier` | `#!python {"day": 21, "month": 1}` |
| `il y a trois ans` | `#!python {"direction": "past", "year": 3}` |

Dates detected by the pipeline component are parsed into a dictionary-like object.
It includes every piece of information that is actually contained in the text.

To get a more usable representation, you may call the `to_datetime()` method.
If there's enough information, the date will be represented
in a `datetime.datetime` or `datetime.timedelta` object. If some information is missing,
it will return `None`.

!!! note "Date normalisation"

Since dates can be missing some information (eg `en août`), we refrain from
outputting a `datetime` object in that case. Doing so would amount to guessing,
and we made the choice of letting you decide how you want to handle missing dates.
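
As a rough illustration of the behaviour described above, here is a minimal sketch that uses only the standard library. `ParsedDate` and its fields are hypothetical stand-ins for the dictionary-like object stored in `date._.date`; they are not the classes added by this pull request, and the real method returns Pendulum objects rather than plain `datetime` ones.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ParsedDate:
    """Hypothetical stand-in for the parsed date stored in `span._.date`."""

    day: Optional[int] = None
    month: Optional[int] = None
    year: Optional[int] = None

    def to_datetime(self) -> Optional[datetime]:
        # Only build a datetime when day, month and year were all found in the text.
        if None in (self.day, self.month, self.year):
            return None
        return datetime(self.year, self.month, self.day)


full = ParsedDate(day=3, month=10, year=2018)
partial = ParsedDate(day=21, month=1)  # "21 janvier": no year in the text

print(full.to_datetime())     # 2018-10-03 00:00:00
print(partial.to_datetime())  # None
```

Returning `None` rather than a guessed year leaves the caller free to decide how to fill the gap, for instance with the year of the note.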

## What next?

@@ -187,12 +204,15 @@ text = (
doc = nlp(text)

for ent in doc.ents:
print(ent, get_event_date(ent))
date = get_event_date(ent)
print(f"{ent.text:<20}{date.text:<20}{date._.date.to_datetime()}")
# Out: admis 12 avril 2020 2020-04-12T00:00:00+02:00
# Out: pris en charge l'année dernière -1 year
```

Which will output:

| `ent` | `get_event_date(ent)` | `get_event_date(ent)._.date` |
| -------------- | --------------------- | ---------------------------- |
| admis | 12 avril | `????-04-12` |
| pris en charge | l'année dernière | `TD-365` |
| `ent` | `get_event_date(ent)` | `get_event_date(ent)._.date.to_datetime()` |
| -------------- | --------------------- | ------------------------------------------ |
| admis | 12 avril | `2020-04-12T00:00:00+02:00` |
| pris en charge | l'année dernière | `-1 year` |
16 changes: 8 additions & 8 deletions docs/tutorials/multiple-texts.md
@@ -241,7 +241,7 @@ They share the same arguments:
nlp,
context=["note_datetime"],
additional_spans=["dates"],
- extensions=["parsed_date"],
+ extensions=["date"],
)
```

@@ -259,7 +259,7 @@ note_nlp = single_pipe(
data,
nlp,
additional_spans=["dates"],
- extensions=["parsed_date"],
+ extensions=["date"],
)
```

@@ -277,7 +277,7 @@ note_nlp = parallel_pipe(
data,
nlp,
additional_spans=["dates"],
- extensions=["parsed_date"],
+ extensions=["date"],
n_jobs=-2, # (1)
)
```
@@ -385,7 +385,7 @@ Once again, using the helper is trivial:
df,
nlp,
additional_spans=["dates"],
- extensions={"parsed_date": dt_type},
+ extensions={"date": dt_type},
)

# Check that the pipeline was correctly distributed:
@@ -404,7 +404,7 @@ Once again, using the helper is trivial:
df,
nlp,
additional_spans=["dates"],
- extensions={"parsed_date": dt_type},
+ extensions={"date": dt_type},
)

# Check that the pipeline was correctly distributed:
@@ -429,7 +429,7 @@ note_nlp = pipe(
nlp=nlp,
n_jobs=1,
additional_spans=["dates"],
- extensions=["parsed_date"],
+ extensions=["date"],
)

### Larger pandas DataFrame
@@ -438,7 +438,7 @@ note_nlp = pipe(
nlp=nlp,
n_jobs=-2,
additional_spans=["dates"],
- extensions=["parsed_date"],
+ extensions=["date"],
)

### Huge Spark or Koalas DataFrame
@@ -447,6 +447,6 @@ note_nlp = pipe(
nlp=nlp,
how="spark",
additional_spans=["dates"],
- extensions={"parsed_date": dt_type},
+ extensions={"date": dt_type},
)
```
5 changes: 3 additions & 2 deletions docs/tutorials/spacy101.md
@@ -128,15 +128,16 @@ doc.spans["dates"] # (2)
# Out: [5 mai 2005]

span = doc.spans["dates"][0] # (3)
- span._.date # (4)
- # Out: '2005-05-05'
+ span._.date.to_datetime() # (4)
+ # Out: DateTime(2005, 5, 5, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
```

1. In this example, there is only one sentence...
2. The `eds.dates` pipeline adds a key to the `doc.spans` attribute
3. `span` is a spaCy `Span` object.
4. In spaCy, you can declare custom extensions that live in the `_` attribute.
Here, the `eds.dates` pipeline uses a `Span._.date` extension to persist the normalised date.
+ We use the `to_datetime()` method to get an object that is usable by Python.

## Conclusion

@@ -121,7 +121,9 @@ def __call__(self, doc: Doc) -> Doc:

Returns
-------
- doc: spaCy Doc object with additionnal doc.spans['consultation_dates] SpanGroup
+ doc: Doc
+ spaCy Doc object with additional
+ `doc.spans['consultation_dates]` `SpanGroup`
"""

ents = self.process(doc)
@@ -151,7 +153,7 @@ def __call__(self, doc: Doc) -> Doc:
kept_date = min(matching_dates, key=lambda d: d.start)
span = doc[mention.start : kept_date.end]
span.label_ = mention.label_
- span._.consultation_date = kept_date._.parsed_date
+ span._.consultation_date = kept_date._.date

doc.spans["consultation_dates"].append(span)
