feat: fix to_datetime/duration docs & integration with eds.history
Co-authored-by: Ariel Cohen <[email protected]>
Co-authored-by: Perceval Wajsbürt <[email protected]>
percevalw and aricohen93 committed Aug 4, 2023
1 parent dd3099f commit bdfa30a
Showing 11 changed files with 264 additions and 127 deletions.
7 changes: 6 additions & 1 deletion changelog.md
@@ -1,6 +1,10 @@
# Changelog

## v0.8.2 (2023-06-07)
## Unreleased

### Added

- New `to_duration` method to convert an absolute date into a date relative to the note_datetime (or None)

### Changes

@@ -19,6 +23,7 @@
- the "relative" / "absolute" / "duration" mode of the time entity is now stored in
the `mode` attribute of the `span._.date/duration`
- the "from" / "until" period bound, if any, is now stored in the `span._.date.bound` attribute
- `to_datetime` now only returns absolute dates: relative dates are converted to absolute ones if `doc._.note_datetime` is given, and `None` is returned otherwise
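
The `to_datetime` / `to_duration` behaviour described in these changelog entries can be sketched in plain Python. This is a minimal illustration under assumptions, not the EDS-NLP implementation: the `SketchRelativeDate` and `SketchAbsoluteDate` classes are hypothetical, and a fixed `timedelta` offset stands in for the library's richer date models.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SketchRelativeDate:
    """Hypothetical stand-in for a relative date such as "il y a 7 jours"."""

    offset: timedelta  # negative for the past, positive for the future

    def to_datetime(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[datetime]:
        # Without a reference date, a relative date cannot be made absolute.
        if note_datetime is None:
            return None
        return note_datetime + self.offset


@dataclass
class SketchAbsoluteDate:
    """Hypothetical stand-in for a fully specified absolute date."""

    value: datetime

    def to_datetime(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[datetime]:
        # An absolute date does not need the note date.
        return self.value

    def to_duration(
        self, note_datetime: Optional[datetime] = None
    ) -> Optional[timedelta]:
        # The offset relative to the note date, or None if it is unknown.
        if note_datetime is None:
            return None
        return self.value - note_datetime
```

In this sketch, a relative date only resolves once a note date is supplied, and an absolute date only yields a duration under the same condition, which mirrors the "or None" wording of the entries above.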

## v0.8.1 (2023-05-31)

29 changes: 12 additions & 17 deletions docs/pipelines/misc/dates.md
@@ -35,27 +35,30 @@ doc = nlp(text)

dates = doc.spans["dates"]
dates
# Out: [23 août 2021, il y a un an, pendant une semaine, mai 1995]
# Out: [23 août 2021, il y a un an, mai 1995]

dates[0]._.date.to_datetime()
# Out: 2021-08-23T00:00:00+02:00

dates[1]._.date.to_datetime()
# Out: -1 year
# Out: None

note_datetime = pendulum.datetime(2021, 8, 27, tz="Europe/Paris")

dates[1]._.date.to_datetime(note_datetime=note_datetime)
# Out: DateTime(2020, 8, 27, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
# Out: 2020-08-27T00:00:00+02:00

date_3_output = dates[3]._.date.to_datetime(
date_2_output = dates[2]._.date.to_datetime(
note_datetime=note_datetime,
infer_from_context=True,
tz="Europe/Paris",
default_day=15,
)
date_3_output
# Out: DateTime(1995, 5, 15, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
date_2_output
# Out: 1995-05-15T00:00:00+02:00

doc.spans["durations"]
# Out: [pendant une semaine]
```
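
The last `to_datetime` call in the example above completes the partial date `mai 1995` using `default_day=15`. A rough sketch of that idea follows, assuming a hypothetical helper `fill_missing_day` (not part of EDS-NLP) and naive datetimes rather than timezone-aware ones:

```python
from datetime import datetime
from typing import Optional


def fill_missing_day(
    year: int,
    month: int,
    day: Optional[int] = None,
    default_day: int = 15,
) -> datetime:
    """Hypothetical helper: complete a partial date with a default day."""
    # Keep the day when it was actually mentioned, otherwise fall back.
    resolved_day = day if day is not None else default_day
    return datetime(year, month, resolved_day)
```

A mid-month default keeps the error on an unknown day to roughly two weeks either way, which is one plausible reason to prefer 15 over 1 as a fallback.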

## Declared extensions
@@ -66,17 +69,9 @@ The `eds.dates` pipeline declares one [spaCy extension](https://spacy.io/usage/p

The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
|------------------|--------------------------------------------------|-----------------------------------|
| `absolute` | Absolute date patterns, eg `le 5 août 2020` | `None` (use pre-defined patterns) |
| `relative` | Relative date patterns, eg `hier` | `None` (use pre-defined patterns) |
| `durations` | Duration patterns, eg `pendant trois mois` | `None` (use pre-defined patterns) |
| `false_positive` | Some false positive patterns to exclude | `None` (use pre-defined patterns) |
| `detect_periods` | Whether to look for periods | `False` |
| `detect_time` | Whether to look for time around dates | `True` |
| `on_ents_only` | Whether to look for dates around entities only | `False` |
| `as_ents` | Whether to save detected dates as entities | `False` |
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` |
::: edsnlp.pipelines.misc.dates.factory.create_component
    options:
        only_parameters: true

## Authors and citation

15 changes: 3 additions & 12 deletions docs/pipelines/qualifiers/history.md
@@ -80,18 +80,9 @@ doc.ents[3]._.history # (2)

The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
| -------------------- | -------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| `attr` | spaCy attribute to match on (eg `NORM`, `TEXT`, `LOWER`) | `"NORM"` |
| `history` | History patterns | `None` (use pre-defined patterns) |
| `termination` | Termination patterns (for syntagma/proposition extraction) | `None` (use pre-defined patterns) |
| `use_sections` | Whether to use pre-annotated sections (requires the `sections` pipeline) | `False` |
| `use_dates` | Whether to use dates pipeline (requires the `dates` pipeline and ``note_datetime`` context is recommended) | `False` |
| `history_limit` | If `use_dates = True`. The number of days after which the event is considered as history. | `14` (2 weeks) |
| `exclude_birthdate` | If `use_dates = True`. Whether to exclude the birth date from history dates. | `True` |
| `closest_dates_only` | If `use_dates = True`. Whether to include the closest dates only. If `False`, it includes all dates in the sentence. | `True` |
| `on_ents_only` | Whether to qualify pre-extracted entities only | `True` |
| `explain` | Whether to keep track of the cues for each entity | `False` |
::: edsnlp.pipelines.qualifiers.history.factory.create_component
    options:
        only_parameters: true

## Declared extensions

@@ -49,6 +49,7 @@ def __init__(
        town_mention: Union[List[str], bool],
        document_date_mention: Union[List[str], bool],
        attr: str,
        name: str = "eds.consultation_dates",
        **kwargs,
    ):

1 change: 1 addition & 0 deletions edsnlp/pipelines/misc/consultation_dates/factory.py
@@ -34,6 +34,7 @@ def create_component(
):
    return ConsultationDates(
        nlp,
        name=name,
        attr=attr,
        consultation_mention=consultation_mention,
        document_date_mention=document_date_mention,
26 changes: 11 additions & 15 deletions edsnlp/pipelines/misc/dates/dates.py
@@ -49,6 +49,8 @@ class Dates(BaseComponent):
          each entity in `#!python doc.spans[key]`
    detect_periods : bool
        Whether to detect periods (experimental)
    detect_time: bool
        Whether to detect time inside dates
    as_ents : bool
        Whether to treat dates as entities
    attr : str
@@ -68,8 +70,10 @@ def __init__(
        detect_time: bool,
        as_ents: bool,
        attr: str,
        name: str = "eds.dates",
    ):
        self.nlp = nlp
        self.name = name

        if absolute is None:
            if detect_time:
@@ -170,7 +174,7 @@ def parse(
        self, matches: List[Tuple[Span, Dict[str, str]]]
    ) -> Tuple[List[Span], List[Span]]:
        """
        Parse dates using the groupdict returned by the matcher.
        Parse dates/durations using the groupdict returned by the matcher.

        Parameters
        ----------
@@ -184,29 +188,21 @@ def parse(
            List of processed spans, with the date parsed.
        """

        dates = []
        durations = []
        for span, groupdict in matches:
            if span.label_ == "relative":
                parsed = RelativeDate.parse_obj(groupdict)
                span.label_ = "date"
                span._.date = parsed
                dates.append(span)
                print("SPAN", span, parsed.dict())
            elif span.label_ == "absolute":
                parsed = AbsoluteDate.parse_obj(groupdict)
                span.label_ = "date"
                span._.date = parsed
                dates.append(span)
                print("SPAN", span, parsed.dict())
            else:
                parsed = Duration.parse_obj(groupdict)
                span.label_ = "duration"
                span._.duration = parsed
                durations.append(span)
                print("SPAN", span, parsed.dict())

        return dates, durations
        return [span for span, _ in matches]

    def process_periods(self, dates: List[Span]) -> List[Span]:
        """
@@ -283,17 +279,17 @@ def __call__(self, doc: Doc) -> Doc:
            spaCy Doc object, annotated for dates
        """
        matches = self.process(doc)
        dates, durations = self.parse(matches)
        matches = self.parse(matches)

        doc.spans["dates"] = dates
        doc.spans["durations"] = durations
        doc.spans["dates"] = [d for d in matches if d.label_ != "duration"]
        doc.spans["durations"] = [d for d in matches if d.label_ == "duration"]

        if self.detect_periods:
            doc.spans["periods"] = self.process_periods(dates + durations)
            doc.spans["periods"] = self.process_periods(matches)

        if self.as_ents:
            ents, discarded = filter_spans(
                list(doc.ents) + dates + durations, return_discarded=True
                list(doc.ents) + matches, return_discarded=True
            )

            doc.ents = ents
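
The change in this hunk replaces two accumulator lists with a single list of parsed spans that is split by label afterwards. A minimal stand-alone sketch of that partitioning step is shown here, assuming a hypothetical `SketchSpan` dataclass in place of spaCy's `Span` and a plain dict in place of `doc.spans`:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a spaCy Span; only the fields used here."""

    text: str
    label_: str


def split_by_label(matches: List[SketchSpan]) -> Dict[str, List[SketchSpan]]:
    """Partition parsed spans into date and duration groups, keeping order."""
    return {
        # Anything that is not a duration is treated as a date.
        "dates": [s for s in matches if s.label_ != "duration"],
        "durations": [s for s in matches if s.label_ == "duration"],
    }
```

Keeping one ordered list and filtering it afterwards means the period detection and entity filtering steps can consume the same `matches` list, without concatenating two lists back together.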
59 changes: 49 additions & 10 deletions edsnlp/pipelines/misc/dates/factory.py
@@ -25,19 +25,58 @@
@Language.factory("eds.dates", default_config=DEFAULT_CONFIG, assigns=["doc.spans"])
def create_component(
    nlp: Language,
    name: str,
    absolute: Optional[List[str]],
    relative: Optional[List[str]],
    duration: Optional[List[str]],
    false_positive: Optional[List[str]],
    on_ents_only: Union[bool, str, List[str], Set[str]],
    detect_periods: bool,
    detect_time: bool,
    as_ents: bool,
    attr: str,
    name: str = "eds.dates",
    absolute: Optional[List[str]] = None,
    relative: Optional[List[str]] = None,
    duration: Optional[List[str]] = None,
    false_positive: Optional[List[str]] = None,
    on_ents_only: Union[bool, str, List[str], Set[str]] = False,
    detect_periods: bool = False,
    detect_time: bool = True,
    as_ents: bool = False,
    attr: str = "LOWER",
):
    """
    Tags and normalizes dates, using the open-source `dateparser` library.

    The pipeline uses spaCy's `filter_spans` function.
    It filters out false positives, and introduces a hierarchy between patterns.
    For instance, in case of ambiguity, the pipeline will decide that a date is a
    date without a year rather than a date without a day.

    Parameters
    ----------
    nlp : spacy.language.Language
        Language pipeline object
    absolute : Union[List[str], str]
        List of regular expressions for absolute dates.
    relative : Union[List[str], str]
        List of regular expressions for relative dates
        (eg `hier`, `la semaine prochaine`).
    duration : Union[List[str], str]
        List of regular expressions for durations
        (eg `pendant trois mois`).
    false_positive : Union[List[str], str]
        List of regular expressions for false positive (eg phone numbers, etc).
    on_ents_only : Union[bool, str, List[str]]
        Whether to look for dates in the whole document or in specific sentences:

        - If `True`: Only look in the sentences of each entity in doc.ents
        - If `False`: Look in the whole document
        - If given a string `key` or list of string: Only look in the sentences of
          each entity in `#!python doc.spans[key]`
    detect_periods : bool
        Whether to detect periods (experimental)
    detect_time: bool
        Whether to detect time inside dates
    as_ents : bool
        Whether to treat dates as entities
    attr : str
        spaCy attribute to use
    """
    return Dates(
        nlp,
        name=name,
        absolute=absolute,
        relative=relative,
        duration=duration,