Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp of the date pipeline #22

Merged
merged 45 commits into from
Apr 8, 2022
Merged

Revamp of the date pipeline #22

merged 45 commits into from
Apr 8, 2022

Conversation

keyber
Copy link
Contributor

@keyber keyber commented Mar 15, 2022

Description

This PR proposes a new eds.dates pipeline. It addresses the many limitations suffered by the current component.

The proposed pipeline drops the dateparser dependency. Albeit powerful, dateparser was not a perfect fit, and incurred a significant toll on the pipeline's overall complexity to work around the library's limitations.

The new date matcher handles:

  • detection of absolute and relative dates, as well as durations
  • direct parsing of dates, using RegEx group dicts
  • normalisation using pendulum
  • detection of periods (experimental)

Example

import spacy

nlp = spacy.blank('fr')
nlp.add_pipe('eds.dates', config=dict(detect_periods=True))

text = "le patient a été hospitalisé cette année pendant trois semaines"

doc = nlp(text)

for date in doc.spans['dates']:
    print(date, date._.date.dict(exclude_none=True))
# Out: cette année {'year': 0, 'direction': <Direction.CURRENT: 'CURRENT'>}
# Out: pendant trois semaines {'mode': <Mode.DURATION: 'DURATION'>, 'week': 3}

for date in doc.spans['dates']:
    print(date, date._.date.parse())
# Out: cette année 0 microseconds
# Out: pendant trois semaines 3 weeks

for p in doc.spans['periods']:
    print(p)
    print(p._.period)
# Out: cette année pendant trois semaines
# Out: {<Mode.FROM: 'FROM'>: cette année, <Mode.DURATION: 'DURATION'>: pendant trois semaines}
Label Snippet Parsed Normalised
relative cette année {'year': 0, 'direction': <Direction.CURRENT: 'CURRENT'>} 0 microseconds
duration pendant trois semaines {'mode': <Mode.DURATION: 'DURATION'>, 'week': 3} 3 weeks
periode cette année pendant trois semaines {<Mode.FROM: 'FROM'>: cette année, <Mode.DURATION: 'DURATION'>: pendant trois semaines}

The experimental period detection algorithm uses pre detected dates.

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation (eg new pipeline).

@keyber keyber changed the title add fast_parse for days and months date pipeline improvements Mar 17, 2022
@bdura bdura added the enhancement New feature or request label Mar 18, 2022
@keyber
Copy link
Contributor Author

keyber commented Mar 22, 2022

The test that fails is il y a 1 an pdt 1 mois
the test expects one extracted date but we extract two: "il y a un an" and "pdt 1 mois".
previously, this test passed because the relative_pattern had a bug.

I'm removing this single test

@keyber keyber marked this pull request as ready for review March 25, 2022 14:09
@keyber keyber requested a review from bdura March 25, 2022 14:12
@bdura bdura self-assigned this Mar 30, 2022
@bdura bdura requested review from percevalw and Thomzoy and removed request for bdura March 30, 2022 15:43
@bdura bdura marked this pull request as draft March 30, 2022 15:43
@bdura bdura changed the title date pipeline improvements Revamp of the date pipeline Apr 6, 2022
@bdura bdura marked this pull request as ready for review April 6, 2022 16:01
@bdura
Copy link
Contributor

bdura commented Apr 6, 2022

Drum roll please 🥁

@bdura
Copy link
Contributor

bdura commented Apr 6, 2022

@Thomzoy & @percevalw, I'm eager to have your opinion on this one.

@bdura bdura requested a review from Aremaki April 6, 2022 16:07
@codecov-commenter
Copy link

Codecov Report

Merging #22 (da80b1c) into master (e3c66d9) will increase coverage by 0.34%.
The diff coverage is 99.13%.

@@            Coverage Diff             @@
##           master      #22      +/-   ##
==========================================
+ Coverage   95.88%   96.22%   +0.34%     
==========================================
  Files         131      139       +8     
  Lines        3107     3151      +44     
==========================================
+ Hits         2979     3032      +53     
+ Misses        128      119       -9     
Impacted Files Coverage Δ
edsnlp/pipelines/misc/dates/patterns/current.py 100.00% <ø> (ø)
edsnlp/pipelines/misc/dates/dates.py 96.00% <96.15%> (+3.97%) ⬆️
...ines/misc/consultation_dates/consultation_dates.py 98.11% <100.00%> (ø)
edsnlp/pipelines/misc/dates/factory.py 100.00% <100.00%> (ø)
edsnlp/pipelines/misc/dates/models.py 100.00% <100.00%> (ø)
edsnlp/pipelines/misc/dates/patterns/__init__.py 100.00% <100.00%> (ø)
edsnlp/pipelines/misc/dates/patterns/absolute.py 100.00% <100.00%> (ø)
...dsnlp/pipelines/misc/dates/patterns/atomic/days.py 100.00% <100.00%> (ø)
...pipelines/misc/dates/patterns/atomic/delimiters.py 100.00% <100.00%> (ø)
...pipelines/misc/dates/patterns/atomic/directions.py 100.00% <100.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3c66d9...da80b1c. Read the comment docs.

Copy link
Collaborator

@Aremaki Aremaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review all files but what I've seen is terrific

@percevalw
Copy link
Member

Great work !
Here's a benchmark of the new pipeline.

Benchmark on 10000 clinical docs:

Pipelines Time
before  108s
this PR  309s

It is a bit slower but the new features are definitely worth it, and we can still optimise in the future.

@bdura
Copy link
Contributor

bdura commented Apr 8, 2022

Thanks for the benchmark! I am a bit disappointed by the results though :)

I just pushed a fix for the code block checks, I'll merge the changes as soon as the CI passes.

Thank you for your inputs !

@bdura bdura merged commit b19a11a into master Apr 8, 2022
@bdura bdura deleted the dates branch April 8, 2022 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants