feat: fix to_datetime/duration docs & integration with eds.history

Co-authored-by: Ariel Cohen <[email protected]>
Co-authored-by: Perceval Wajsbürt <[email protected]>
aphp · Aug 4, 2023 · bdfa30a
1 parent dd3099f

Showing 11 changed files with 264 additions and 127 deletions.
diff --git a/changelog.md b/changelog.md
@@ -1,6 +1,10 @@
 # Changelog
 
-## v0.8.2 (2023-06-07)
+## Unreleased
+
+### Added
+
+- New `to_duration` method to convert an absolute date into a date relative to the note_datetime (or None)
 
 ### Changes
 
@@ -19,6 +23,7 @@
 - the "relative" / "absolute" / "duration" mode of the time entity is now stored in
   the `mode` attribute of the `span._.date/duration`
 - the "from" / "until" period bound, if any, is now stored in the `span._.date.bound` attribute
+- `to_datetime` now only returns absolute dates: it converts relative dates into absolute ones if `doc._.note_datetime` is given, and returns `None` otherwise
 
 ## v0.8.1 (2023-05-31)
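
The changelog entries above describe when `to_datetime` and `to_duration` return a value and when they return `None`. A minimal sketch of that contract follows, using only the standard library. The class names `RelativeDateSketch` and `AbsoluteDateSketch` are hypothetical, and the fixed `timedelta` offset is an assumption made for illustration; the real EDS-NLP models are built from parsed regex groups and use `pendulum`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class RelativeDateSketch:
    """Hypothetical stand-in for a parsed relative date such as "il y a un an"."""

    def __init__(self, delta: timedelta):
        # Signed offset from the note date (assumption: a fixed timedelta).
        self.delta = delta

    def to_datetime(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[datetime]:
        # Without a reference date, a relative date cannot be made absolute.
        if note_datetime is None:
            return None
        return note_datetime + self.delta


class AbsoluteDateSketch:
    """Hypothetical stand-in for a parsed absolute date such as "23 août 2021"."""

    def __init__(self, value: datetime):
        self.value = value

    def to_datetime(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[datetime]:
        # An absolute date does not need the note date.
        return self.value

    def to_duration(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[timedelta]:
        # Express the date relative to the note date, or None if it is unknown.
        if note_datetime is None:
            return None
        return self.value - note_datetime


note = datetime(2021, 8, 27, tzinfo=timezone.utc)
relative = RelativeDateSketch(timedelta(days=-365))
absolute = AbsoluteDateSketch(datetime(2021, 8, 23, tzinfo=timezone.utc))

no_reference = relative.to_datetime()      # None: no note date available
resolved = relative.to_datetime(note)      # 365 days before the note date
offset = absolute.to_duration(note)        # 4 days before the note date
```

Returning `None` rather than a raw offset keeps the return type of `to_datetime` uniform, so callers only need a single `is None` check.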
 

diff --git a/docs/pipelines/misc/dates.md b/docs/pipelines/misc/dates.md
@@ -35,27 +35,30 @@ doc = nlp(text)
 
 dates = doc.spans["dates"]
 dates
-# Out: [23 août 2021, il y a un an, pendant une semaine, mai 1995]
+# Out: [23 août 2021, il y a un an, mai 1995]
 
 dates[0]._.date.to_datetime()
 # Out: 2021-08-23T00:00:00+02:00
 
 dates[1]._.date.to_datetime()
-# Out: -1 year
+# Out: None
 
 note_datetime = pendulum.datetime(2021, 8, 27, tz="Europe/Paris")
 
 dates[1]._.date.to_datetime(note_datetime=note_datetime)
-# Out: DateTime(2020, 8, 27, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
+# Out: 2020-08-27T00:00:00+02:00
 
-date_3_output = dates[3]._.date.to_datetime(
+date_2_output = dates[2]._.date.to_datetime(
     note_datetime=note_datetime,
     infer_from_context=True,
     tz="Europe/Paris",
     default_day=15,
 )
-date_3_output
-# Out: DateTime(1995, 5, 15, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
+date_2_output
+# Out: 1995-05-15T00:00:00+02:00
+
+doc.spans["durations"]
+# Out: [pendant une semaine]
 ```
 
 ## Declared extensions
@@ -66,17 +69,9 @@ The `eds.dates` pipeline declares one [spaCy extension](https://spacy.io/usage/p
 
 The pipeline can be configured using the following parameters:
 
-| Parameter        | Explanation                                      | Default                           |
-|------------------|--------------------------------------------------|-----------------------------------|
-| `absolute`       | Absolute date patterns, eg `le 5 août 2020`      | `None` (use pre-defined patterns) |
-| `relative`       | Relative date patterns, eg `hier`)               | `None` (use pre-defined patterns) |
-| `durations`      | Duration patterns, eg `pendant trois mois`)      | `None` (use pre-defined patterns) |
-| `false_positive` | Some false positive patterns to exclude          | `None` (use pre-defined patterns) |
-| `detect_periods` | Whether to look for periods                      | `False`                           |
-| `detect_time`    | Whether to look for time around dates            | `True`                            |
-| `on_ents_only`   | Whether to look for dates around entities only   | `False`                           |
-| `as_ents`        | Whether to save detected dates as entities       | `False`                           |
-| `attr`           | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"`                          |
+::: edsnlp.pipelines.misc.dates.factory.create_component
+    options:
+        only_parameters: true
 
 ## Authors and citation
 

diff --git a/docs/pipelines/qualifiers/history.md b/docs/pipelines/qualifiers/history.md
@@ -80,18 +80,9 @@ doc.ents[3]._.history  # (2)
 
 The pipeline can be configured using the following parameters:
 
-| Parameter            | Explanation                                                                                                          | Default                           |
-| -------------------- | -------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
-| `attr`               | spaCy attribute to match on (eg `NORM`, `TEXT`, `LOWER`)                                                             | `"NORM"`                          |
-| `history`            | History patterns                                                                                                     | `None` (use pre-defined patterns) |
-| `termination`        | Termination patterns (for syntagma/proposition extraction)                                                           | `None` (use pre-defined patterns) |
-| `use_sections`       | Whether to use pre-annotated sections (requires the `sections` pipeline)                                             | `False`                           |
-| `use_dates`          | Whether to use dates pipeline (requires the `dates` pipeline and ``note_datetime`` context is recommended)           | `False`                           |
-| `history_limit`      | If `use_dates = True`. The number of days after which the event is considered as history.                            | `14` (2 weeks)                    |
-| `exclude_birthdate`  | If `use_dates = True`. Whether to exclude the birth date from history dates.                                         | `True`                            |
-| `closest_dates_only` | If `use_dates = True`. Whether to include the closest dates only. If `False`, it includes all dates in the sentence. | `True`                            |
-| `on_ents_only`       | Whether to qualify pre-extracted entities only                                                                       | `True`                            |
-| `explain`            | Whether to keep track of the cues for each entity                                                                    | `False`                           |
+::: edsnlp.pipelines.qualifiers.history.factory.create_component
+    options:
+        only_parameters: true
 
 ## Declared extensions
 

diff --git a/edsnlp/pipelines/misc/consultation_dates/consultation_dates.py b/edsnlp/pipelines/misc/consultation_dates/consultation_dates.py
@@ -49,6 +49,7 @@ def __init__(
         town_mention: Union[List[str], bool],
         document_date_mention: Union[List[str], bool],
         attr: str,
+        name: str = "eds.consultation_dates",
         **kwargs,
     ):
 

diff --git a/edsnlp/pipelines/misc/consultation_dates/factory.py b/edsnlp/pipelines/misc/consultation_dates/factory.py
@@ -34,6 +34,7 @@ def create_component(
 ):
     return ConsultationDates(
         nlp,
+        name=name,
         attr=attr,
         consultation_mention=consultation_mention,
         document_date_mention=document_date_mention,

diff --git a/edsnlp/pipelines/misc/dates/dates.py b/edsnlp/pipelines/misc/dates/dates.py
@@ -49,6 +49,8 @@ class Dates(BaseComponent):
           each entity in `#!python doc.spans[key]`
     detect_periods : bool
         Whether to detect periods (experimental)
+    detect_time : bool
+        Whether to detect time inside dates
     as_ents : bool
         Whether to treat dates as entities
     attr : str
@@ -68,8 +70,10 @@ def __init__(
         detect_time: bool,
         as_ents: bool,
         attr: str,
+        name: str = "eds.dates",
     ):
         self.nlp = nlp
+        self.name = name
 
         if absolute is None:
             if detect_time:
@@ -170,7 +174,7 @@ def parse(
         self, matches: List[Tuple[Span, Dict[str, str]]]
     ) -> Tuple[List[Span], List[Span]]:
         """
-        Parse dates using the groupdict returned by the matcher.
+        Parse dates/durations using the groupdict returned by the matcher.
 
         Parameters
         ----------
@@ -184,29 +188,21 @@ def parse(
             List of processed spans, with the date parsed.
         """
 
-        dates = []
-        durations = []
         for span, groupdict in matches:
             if span.label_ == "relative":
                 parsed = RelativeDate.parse_obj(groupdict)
                 span.label_ = "date"
                 span._.date = parsed
-                dates.append(span)
-                print("SPAN", span, parsed.dict())
             elif span.label_ == "absolute":
                 parsed = AbsoluteDate.parse_obj(groupdict)
                 span.label_ = "date"
                 span._.date = parsed
-                dates.append(span)
-                print("SPAN", span, parsed.dict())
             else:
                 parsed = Duration.parse_obj(groupdict)
                 span.label_ = "duration"
                 span._.duration = parsed
-                durations.append(span)
-                print("SPAN", span, parsed.dict())
 
-        return dates, durations
+        return [span for span, _ in matches]
 
     def process_periods(self, dates: List[Span]) -> List[Span]:
         """
@@ -283,17 +279,17 @@ def __call__(self, doc: Doc) -> Doc:
             spaCy Doc object, annotated for dates
         """
         matches = self.process(doc)
-        dates, durations = self.parse(matches)
+        matches = self.parse(matches)
 
-        doc.spans["dates"] = dates
-        doc.spans["durations"] = durations
+        doc.spans["dates"] = [d for d in matches if d.label_ != "duration"]
+        doc.spans["durations"] = [d for d in matches if d.label_ == "duration"]
 
         if self.detect_periods:
-            doc.spans["periods"] = self.process_periods(dates + durations)
+            doc.spans["periods"] = self.process_periods(matches)
 
         if self.as_ents:
             ents, discarded = filter_spans(
-                list(doc.ents) + dates + durations, return_discarded=True
+                list(doc.ents) + matches, return_discarded=True
             )
 
             doc.ents = ents
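
The `dates.py` change above makes `parse` return a single list of spans and moves the date/duration split into `__call__`, keyed on the span label. A rough standalone sketch of that flow is shown below. `SpanSketch`, `parse_sketch`, and `split_by_label` are hypothetical names that stand in for spaCy's `Span` and the component's methods; the sketch does no real date parsing.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a spaCy Span: just a text and a label."""

    text: str
    label_: str


def parse_sketch(matches: List[Tuple[SpanSketch, Dict[str, str]]]) -> List[SpanSketch]:
    # Relabel each match in place, then return every span in a single list,
    # mirroring the new return value of `parse` in the diff.
    for span, _groupdict in matches:
        if span.label_ in ("relative", "absolute"):
            span.label_ = "date"
        else:
            span.label_ = "duration"
    return [span for span, _ in matches]


def split_by_label(spans: List[SpanSketch]) -> Tuple[List[SpanSketch], List[SpanSketch]]:
    # Same partition as the two list comprehensions in `__call__`.
    dates = [s for s in spans if s.label_ != "duration"]
    durations = [s for s in spans if s.label_ == "duration"]
    return dates, durations


matches = [
    (SpanSketch("23 août 2021", "absolute"), {}),
    (SpanSketch("il y a un an", "relative"), {}),
    (SpanSketch("pendant une semaine", "duration"), {}),
    (SpanSketch("mai 1995", "absolute"), {}),
]

spans = parse_sketch(matches)
dates, durations = split_by_label(spans)
```

Keeping one ordered list means `process_periods` and `filter_spans` receive the matches in document order, and the two span groups are derived views rather than separately maintained lists.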

diff --git a/edsnlp/pipelines/misc/dates/factory.py b/edsnlp/pipelines/misc/dates/factory.py
@@ -25,19 +25,58 @@
 @Language.factory("eds.dates", default_config=DEFAULT_CONFIG, assigns=["doc.spans"])
 def create_component(
     nlp: Language,
-    name: str,
-    absolute: Optional[List[str]],
-    relative: Optional[List[str]],
-    duration: Optional[List[str]],
-    false_positive: Optional[List[str]],
-    on_ents_only: Union[bool, str, List[str], Set[str]],
-    detect_periods: bool,
-    detect_time: bool,
-    as_ents: bool,
-    attr: str,
+    name: str = "eds.dates",
+    absolute: Optional[List[str]] = None,
+    relative: Optional[List[str]] = None,
+    duration: Optional[List[str]] = None,
+    false_positive: Optional[List[str]] = None,
+    on_ents_only: Union[bool, str, List[str], Set[str]] = False,
+    detect_periods: bool = False,
+    detect_time: bool = True,
+    as_ents: bool = False,
+    attr: str = "LOWER",
 ):
+    """
+    Tags and normalizes dates, using the open-source `dateparser` library.
+
+    The pipeline uses spaCy's `filter_spans` function.
+    It filters out false positives, and introduces a hierarchy between patterns.
+    For instance, in case of ambiguity, the pipeline will decide that a date is a
+    date without a year rather than a date without a day.
+
+    Parameters
+    ----------
+    nlp : spacy.language.Language
+        Language pipeline object
+    absolute : Union[List[str], str]
+        List of regular expressions for absolute dates.
+    relative : Union[List[str], str]
+        List of regular expressions for relative dates
+        (eg `hier`, `la semaine prochaine`).
+    duration : Union[List[str], str]
+        List of regular expressions for durations
+        (eg `pendant trois mois`).
+    false_positive : Union[List[str], str]
+        List of regular expressions for false positives (eg phone numbers, etc).
+    on_ents_only : Union[bool, str, List[str]]
+        Whether to look for dates in the whole document or in specific sentences:
+
+        - If `True`: Only look in the sentences of each entity in `doc.ents`
+        - If `False`: Look in the whole document
+        - If given a string `key` or list of string: Only look in the sentences of
+          each entity in `#!python doc.spans[key]`
+    detect_periods : bool
+        Whether to detect periods (experimental)
+    detect_time : bool
+        Whether to detect time inside dates
+    as_ents : bool
+        Whether to treat dates as entities
+    attr : str
+        spaCy attribute to use
+    """
     return Dates(
         nlp,
+        name=name,
         absolute=absolute,
         relative=relative,
         duration=duration,