aphp · bdura · Apr 8, 2022 · Mar 15, 2022 · Mar 16, 2022 · Mar 16, 2022
diff --git a/changelog.md b/changelog.md
@@ -7,6 +7,10 @@
 - New `eds` language to better fit French clinical documents and improve speed.
 - Testing for markdown codeblocks.
 
+### Changed
+
+- Complete revamp of the date detection pipeline
+
 ## v0.4.4
 
 - Add `measures` pipeline
@@ -18,6 +22,8 @@
 ## v0.4.3
 
 - Fix regex matching on spans.
+- Add fast_parse in date pipeline.
+- Add relative_date information parsing
 
 ## v0.4.2
 

diff --git a/demo/app.py b/demo/app.py
@@ -215,7 +215,7 @@ def load_model(
 
 for date in doc.spans.get("dates", []):
     span = Span(doc, date.start, date.end, label="date")
-    span._.value = span._.date
+    span._.value = span._.date.norm()
     ents.append(span)
 
 for measure in doc.spans.get("measures", []):

diff --git a/docs/pipelines/misc/consultation-dates.md b/docs/pipelines/misc/consultation-dates.md
@@ -35,8 +35,8 @@ doc = nlp(text)
 doc.spans["consultation_dates"]
 # Out: [Consultation du 03/10/2018]
 
-doc.spans["consultation_dates"][0]._.consultation_date
-# Out: datetime.datetime(2018, 10, 3, 0, 0)
+doc.spans["consultation_dates"][0]._.consultation_date.to_datetime()
+# Out: DateTime(2018, 10, 3, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
 ```
 
 ## Consultation events

diff --git a/docs/pipelines/misc/dates.md b/docs/pipelines/misc/dates.md
@@ -1,23 +1,17 @@
 # Dates
 
 The `eds.dates` pipeline's role is to detect and normalise dates within a medical document.
-We use simple regular expressions to extract date mentions, and apply the [`dateparser` library](https://dateparser.readthedocs.io/en/latest/index.html)
-for the normalisation.
-
-!!! warning
-
-    The ``dates`` pipeline is still in active development and has not been rigorously validated.
-    If you come across a date expression that goes undetected, please file an issue !
+We use simple regular expressions to extract date mentions.
 
 ## Scope
 
-The `eds.dates` pipeline finds absolute (eg `23/08/2021`) and relative (eg `hier`, `la semaine dernière`) dates alike.
+The `eds.dates` pipeline finds absolute (eg `23/08/2021`) and relative (eg `hier`, `la semaine dernière`) dates alike. It also handles mentions of duration.
 
-If the date of edition (via the `doc._.note_datetime` extension) is available, relative (and "year-less") dates will be normalised
-using the latter as base. On the other hand, if the base is unknown, the normalisation will follow the pattern :
-`TD±<number-of-days>`, positive values meaning that the relative date mentions the future (`dans trois jours`).
-
-Since the extension `doc._.note_datetime` cannot be set before applying the `dates` pipeline, we defer the normalisation step until the `span._.dates` attribute is accessed.
+| Type       | Example                       |
+| ---------- | ----------------------------- |
+| `absolute` | `3 mai`, `03/05/2020`         |
+| `relative` | `hier`, `la semaine dernière` |
+| `duration` | `pendant quatre jours`        |
 
 See the [tutorial](../../tutorials/detecting-dates.md) for a presentation of a full pipeline featuring the `eds.dates` component.
 
@@ -26,55 +20,49 @@ See the [tutorial](../../tutorials/detecting-dates.md) for a presentation of a f
 ```python
 import spacy
 
-from datetime import datetime
+import pendulum
 
 nlp = spacy.blank("fr")
 nlp.add_pipe("eds.dates")
 
 text = (
     "Le patient est admis le 23 août 2021 pour une douleur à l'estomac. "
-    "Il lui était arrivé la même chose il y a un an."
+    "Il lui était arrivé la même chose il y a un an pendant une semaine."
 )
 
 doc = nlp(text)
 
 dates = doc.spans["dates"]
 dates
-# Out: [23 août 2021, il y a un an]
+# Out: [23 août 2021, il y a un an, pendant une semaine]
 
-dates[0]._.date
-# Out: '2021-08-23'
+dates[0]._.date.to_datetime()
+# Out: 2021-08-23T00:00:00+02:00
 
-dates[1]._.date
-# Out: 'TD-365'
+dates[1]._.date.to_datetime()
+# Out: -1 year
 
-doc._.note_datetime = datetime(2021, 8, 27)
+note_datetime = pendulum.datetime(2021, 8, 27, tz="Europe/Paris")
 
-dates[1]._.date
-# Out: '2020-08-27'
+dates[1]._.date.to_datetime(note_datetime=note_datetime)
+# Out: 2020-08-27T00:00:00+02:00
 ```
 
 ## Declared extensions
 
-The `eds.dates` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object :
-
-1. The `date_parsed` attribute is a Python `datetime` object, used internally by the pipeline.
-2. The `date` attribute is a property that displays a normalised human-readable string for the date.
+The `eds.dates` pipeline declares one [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object: the `date` attribute contains a parsed version of the date.
 
 ## Configuration
 
 The pipeline can be configured using the following parameters :
 
 | Parameter        | Explanation                                      | Default                           |
 | ---------------- | ------------------------------------------------ | --------------------------------- |
-| `no_year`        | Date patterns without year, eg `le 5 août`       | `None` (use pre-defined patterns) |
-| `year_only`      | Date patterns with only the year, eg `en 2018`   | `None` (use pre-defined patterns) |
-| `no_day`         | Date patterns without day, eg `en mars 2018`     | `None` (use pre-defined patterns) |
 | `absolute`       | Absolute date patterns, eg `le 5 août 2020`      | `None` (use pre-defined patterns) |
 | `relative`       | Relative date patterns, eg `hier`)               | `None` (use pre-defined patterns) |
-| `full`           | Full date patterns, eg `2020-10-23`              | `None` (use pre-defined patterns) |
-| `current`        | "Current" date patterns, eg `ce jour`            | `None` (use pre-defined patterns) |
+| `durations`      | Duration patterns, eg `pendant trois mois`       | `None` (use pre-defined patterns) |
 | `false_positive` | Some false positive patterns to exclude          | `None` (use pre-defined patterns) |
+| `detect_periods` | Whether to detect periods                        | `False`                           |
 | `on_ents_only`   | Whether to look for dates around entities only   | `False`                           |
 | `attr`           | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"`                          |
 

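Review note on `docs/pipelines/misc/dates.md`: the updated example shows a relative date staying relative until a `note_datetime` is supplied. The sketch below illustrates that behaviour with the standard library only. `RelativeOffset` and its fields are hypothetical names chosen for illustration; this is not the class `eds.dates` uses, and it ignores time zones, which the real pipeline handles through `pendulum`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union


@dataclass
class RelativeOffset:
    """Hypothetical stand-in for a parsed relative date (not the edsnlp class)."""

    years: int = 0
    days: int = 0

    def to_datetime(
        self, note_datetime: Optional[datetime] = None
    ) -> Union["RelativeOffset", datetime]:
        # Without a reference date there is nothing to anchor the offset to,
        # so the offset is returned unchanged.
        if note_datetime is None:
            return self
        target_year = note_datetime.year + self.years
        try:
            shifted = note_datetime.replace(year=target_year)
        except ValueError:
            # 29 February has no counterpart in a non-leap year: use the 28th.
            shifted = note_datetime.replace(year=target_year, day=28)
        return shifted + timedelta(days=self.days)


one_year_ago = RelativeOffset(years=-1)  # "il y a un an"
print(one_year_ago.to_datetime())
print(one_year_ago.to_datetime(note_datetime=datetime(2021, 8, 27)))
```

Keeping the offset unresolved until the caller passes a reference date avoids depending on when `doc._.note_datetime` is set.
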
diff --git a/docs/tutorials/detecting-dates.md b/docs/tutorials/detecting-dates.md
@@ -32,6 +32,7 @@ Clinical notes contain many different types of dates. To name a few examples:
 | Absolute | Explicit date                       | `2022-03-03`                                     |
 | Partial  | Date missing the day, month or year | `le 3 janvier/on January 3rd`, `en 2021/in 2021` |
 | Relative | Relative dates                      | `hier/yesterday`, `le mois dernier/last month`   |
+| Duration | Durations                           | `pendant trois mois/for three months`            |
 
 !!! warning
 
@@ -74,12 +75,28 @@ dates  # (1)
 
 1. `dates` is a list of spaCy `Span` objects.
 
+## Normalisation
+
 We can review each date and get its normalisation:
 
-| `date.text`        | `date._.date` |
-| ------------------ | ------------- |
-| `21 janvier`       | `????-01-21`  |
-| `il y a trois ans` | `TD-1095`     |
+| `date.text`        | `date._.date`                               |
+| ------------------ | ------------------------------------------- |
+| `21 janvier`       | `#!python {"day": 21, "month": 1}`          |
+| `il y a trois ans` | `#!python {"direction": "past", "year": 3}` |
+
+Dates detected by the pipeline component are parsed into a dictionary-like object.
+It includes every piece of information that is actually contained in the text.
+
+To get a more usable representation, you may call the `to_datetime()` method.
+If there is enough information, the date is returned
+as a `datetime.datetime` or `datetime.timedelta` object. If some information is missing,
+the method returns `None`.
+
+!!! note "Date normalisation"
+
+    Since dates can be missing some information (eg `en août`), we refrain from
+    outputting a `datetime` object in that case. Doing so would amount to guessing,
+    and we made the choice of letting you decide how you want to handle missing dates.
 
 ## What next?
 
@@ -187,12 +204,15 @@ text = (
 doc = nlp(text)
 
 for ent in doc.ents:
-    print(ent, get_event_date(ent))
+    date = get_event_date(ent)
+    print(f"{ent.text:<20}{date.text:<20}{date._.date.to_datetime()}")
+# Out: admis               12 avril 2020       2020-04-12T00:00:00+02:00
+# Out: pris en charge      l'année dernière    -1 year
 ```
 
 Which will output:
 
-| `ent`          | `get_event_date(ent)` | `get_event_date(ent)._.date` |
-| -------------- | --------------------- | ---------------------------- |
-| admis          | 12 avril              | `????-04-12`                 |
-| pris en charge | l'année dernière      | `TD-365`                     |
+| `ent`          | `get_event_date(ent)` | `get_event_date(ent)._.date.to_datetime()` |
+| -------------- | --------------------- | ------------------------------------------ |
+| admis          | 12 avril 2020         | `2020-04-12T00:00:00+02:00`                |
+| pris en charge | l'année dernière      | `-1 year`                                  |
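
Review note on the new "Normalisation" section: the rule that `to_datetime()` returns `None` when information is missing can be illustrated with a minimal, standard-library sketch. `ParsedDate` is a hypothetical container whose field names mirror the keys shown in the tutorial table. It is an assumption made for illustration, not the object the pipeline actually returns, and it leaves out time zones.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ParsedDate:
    """Hypothetical dictionary-like date; only fields found in the text are set."""

    day: Optional[int] = None
    month: Optional[int] = None
    year: Optional[int] = None

    def to_datetime(self) -> Optional[datetime]:
        # Refuse to guess: any missing component means no datetime is produced.
        if self.day is None or self.month is None or self.year is None:
            return None
        return datetime(self.year, self.month, self.day)


partial = ParsedDate(day=21, month=1)             # "21 janvier"
complete = ParsedDate(day=12, month=4, year=2020)  # "12 avril 2020"

print(partial.to_datetime())
print(complete.to_datetime())
```

Returning `None` rather than a guessed year leaves the choice of fallback (skip the date, borrow the year of the note, and so on) to the caller.
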
diff --git a/docs/tutorials/multiple-texts.md b/docs/tutorials/multiple-texts.md
@@ -241,7 +241,7 @@ They share the same arguments:
         nlp,
         context=["note_datetime"],
         additional_spans=["dates"],
-        extensions=["parsed_date"],
+        extensions=["date"],
     )
     ```
 
@@ -259,7 +259,7 @@ note_nlp = single_pipe(
     data,
     nlp,
     additional_spans=["dates"],
-    extensions=["parsed_date"],
+    extensions=["date"],
 )
 ```
 
@@ -277,7 +277,7 @@ note_nlp = parallel_pipe(
     data,
     nlp,
     additional_spans=["dates"],
-    extensions=["parsed_date"],
+    extensions=["date"],
     n_jobs=-2,  # (1)
 )
 ```
@@ -385,7 +385,7 @@ Once again, using the helper is trivial:
         df,
         nlp,
         additional_spans=["dates"],
-        extensions={"parsed_date": dt_type},
+        extensions={"date": dt_type},
     )
 
     # Check that the pipeline was correctly distributed:
@@ -404,7 +404,7 @@ Once again, using the helper is trivial:
         df,
         nlp,
         additional_spans=["dates"],
-        extensions={"parsed_date": dt_type},
+        extensions={"date": dt_type},
     )
 
     # Check that the pipeline was correctly distributed:
@@ -429,7 +429,7 @@ note_nlp = pipe(
     nlp=nlp,
     n_jobs=1,
     additional_spans=["dates"],
-    extensions=["parsed_date"],
+    extensions=["date"],
 )
 
 ### Larger pandas DataFrame
@@ -438,7 +438,7 @@ note_nlp = pipe(
     nlp=nlp,
     n_jobs=-2,
     additional_spans=["dates"],
-    extensions=["parsed_date"],
+    extensions=["date"],
 )
 
 ### Huge Spark or Koalas DataFrame
@@ -447,6 +447,6 @@ note_nlp = pipe(
     nlp=nlp,
     how="spark",
     additional_spans=["dates"],
-    extensions={"parsed_date": dt_type},
+    extensions={"date": dt_type},
 )
 ```
diff --git a/docs/tutorials/spacy101.md b/docs/tutorials/spacy101.md
@@ -128,15 +128,16 @@ doc.spans["dates"]  # (2)
 # Out: [5 mai 2005]
 
 span = doc.spans["dates"][0]  # (3)
-span._.date  # (4)
-# Out: '2005-05-05'
+span._.date.to_datetime()  # (4)
+# Out: DateTime(2005, 5, 5, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
 ```
 
 1. In this example, there is only one sentence...
 2. The `eds.dates` adds a key to the `doc.spans` attribute
 3. `span` is a spaCy `Span` object.
 4. In spaCy, you can declare custom extensions that live in the `_` attribute.
    Here, the `eds.dates` pipeline uses a `Span._.date` extension to persist the normalised date.
+   We use the `to_datetime()` method to get an object that is usable by Python.
 
 ## Conclusion
 

diff --git a/edsnlp/pipelines/misc/consultation_dates/consultation_dates.py b/edsnlp/pipelines/misc/consultation_dates/consultation_dates.py
@@ -121,7 +121,9 @@ def __call__(self, doc: Doc) -> Doc:
 
         Returns
         -------
-        doc: spaCy Doc object with additionnal doc.spans['consultation_dates] SpanGroup
+        doc: Doc
+            spaCy Doc object with additional
+            `doc.spans['consultation_dates']` `SpanGroup`
         """
 
         ents = self.process(doc)
@@ -151,7 +153,7 @@ def __call__(self, doc: Doc) -> Doc:
                 kept_date = min(matching_dates, key=lambda d: d.start)
                 span = doc[mention.start : kept_date.end]
                 span.label_ = mention.label_
-                span._.consultation_date = kept_date._.parsed_date
+                span._.consultation_date = kept_date._.date
 
                 doc.spans["consultation_dates"].append(span)