Architecture choice on custom extensions #47

bdura · 2022-04-08T08:08:18Z

Description

The way we've handled spaCy extensions in EDS-NLP has been erratic at best, with each pipeline declaring its own set of new extensions, cluttering spaCy's Underscore object.

For instance, the pipeline eds.dates, eds.measures and eds.emergency.priority all include a parsing component, and each pipeline saves its result in a different attribute.

We can clearly do better than that by adopting a more holistic and uniform approach.

Proposition

To start off the discussion, here are a few ideas/questions:

Every extensions managed by edsnlp should probably land within a Obect._.edsnlp master attribute. This would avoid cluttering the extensions, and leave more room for the end users.
We should regroup similar extensions as much as possible. In the example above, all three pipelines could write the parsing results to a unique key, for instance value or parsed.
We could introduce a norm key, that contains the normalised variant for a given entity (eg stripped of pollution, accents, etc). A reasonable idea is to provide the text used for matching. The normalised variant should perhaps be computed on the fly, using a getter function ?
To enable the use of getters and methods within the edsnlp extensions, we could use an Underscore object. Not sure if that is overkill and/or incurs significant added complexity.

The text was updated successfully, but these errors were encountered:

bdura · 2022-04-08T08:09:04Z

@percevalw, @Thomzoy, @Aremaki, I'd love to get your thoughts on this!

percevalw · 2022-04-08T09:00:29Z

Great idea, I second you on this. Here are a few points of our IRL discussion

As most end users would either only use the existing attributes (negated, value, ...) and not write new ones, we should prefer a simpler notation as it is currently the case, under the base _ getter, but limiting the number of attributes. Therefore as suggested, we could use:

span._.value for dates, measures, scores, concepts, ...,
and span._.negation, span._.rspeech, etc for more syntax-based attributes.

As the norm attribute is now strongly used throughout the lib as a depolluted ascii version of the text, changing it would probably mean refactoring most of the code. To generate semantically normalized text, we can use the __str__ and __repr__ methods on the generic value attribute.

Following your dates revamp, the _.value extension could inherit of a pydantic.BaseModel like:

class EDSValue(BaseModel):
  def __str__(self): ...
  def __repr__(self): ...
  ...

class Date(EDSValue):
  ...

class Measure(EDSValue):
  ...

class Concept(EDSValue):
  ...

bdura · 2022-04-08T09:23:16Z

Sounds good! 🎉

percevalw · 2022-09-06T14:38:36Z

As discussed, here is a potential solution that standardizes the current architecture for custom extensions @Thomzoy @aricohen93
Each component can create a Span extension named after the label of the entities it creates:

eds.adicap creates entities labeled adicap, and adds an ent._.adicap extension containing the decoding information
eds.tnm creates entities labeled tnm, and adds an ent._.tnm extension
eds.drugs creates entities labeled drug, and adds an ent._.drug extension
...

A specific ._.value extension is defined as an aggregator and retrieves the field associated with the label via a getter such that ent._.value == getattr(ent._, ent.label_). The str representation of ._.value could be the one displayed in the demonstrator.

This way, we can keep a consistent typing of each extension (tnm -> TNMScore, adicap -> AdicapCode, date -> Date, ...), while offering a unique entry point for some use cases via the value extension.

This does not prevent to define other extensions if needed, or to keep the old entity extensions and deprecate them in future versions.

percevalw · 2022-10-13T16:27:07Z

@Vincent-Maladiere

percevalw · 2023-07-31T16:23:31Z

Following the discussion with @Thomzoy, we carry on with the approach commented above:

each pipe defines the extensions it needs (negation, scores, etc)
the extensions related to a normalized value should be named with the label_ of the entity extracted (if any)
the value extensions is defined as the following getter: lambda span: span._.get(span.label_, default=None). Having multiple extensions and an aggregator extension allows multiple pipes to modify a single entity, and to prioritize the normalized value of the entity by setting its label — for instance, to choose between the extraction of eds.drugs (label = drug), and eds.umls (label = umls) — without loosing information between pipes
the normalized extension can be anything: an int, a bool, a string, an object, depending on the complexity of the extraction, and should implement the equality operator such that any span1._.value == span2._.value test runs

For instance, the following (non-exhaustive) modifications should be made:

dates: the dates will be labelled as date, to match the date extension (instead of absolute/relative since this info is already stored in the span._.date object)
measurements: the label of the extracted measurements becomes measurement, and the previous label (e.g. eds.weight is added to the normalized span._.measurement object
consultation_dates: the label of the consultation_dates spans will become consultation_date and the consultation_date extension will be the extracted date
tables: labelled as table and span._.table is a getter to span._.to_pd_table(as_values=True)
umls: labelled as umls (this is already the case) and change span._.umls to a new UMLSConcept(id=the cui, sty=the semantic type) object
...

percevalw · 2023-09-15T15:31:51Z

These suggestions have been integrated in #213

bdura added the discussion Discussion about architecture choices label Apr 8, 2022

bdura mentioned this issue Apr 8, 2022

Revamp of the date pipeline #22

Merged

3 tasks

percevalw mentioned this issue Oct 28, 2022

UMLS matching #156

Closed

percevalw mentioned this issue Aug 4, 2023

Refacto of the extensions #213

Merged

6 tasks

percevalw closed this as completed Sep 15, 2023

percevalw mentioned this issue Oct 4, 2023

Feature request: relieve constraints on non edsnlp custom attributes #220

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture choice on custom extensions #47

Architecture choice on custom extensions #47

bdura commented Apr 8, 2022

bdura commented Apr 8, 2022

percevalw commented Apr 8, 2022 •

edited

Loading

bdura commented Apr 8, 2022

percevalw commented Sep 6, 2022

percevalw commented Oct 13, 2022

percevalw commented Jul 31, 2023

percevalw commented Sep 15, 2023

Architecture choice on custom extensions #47

Architecture choice on custom extensions #47

Comments

bdura commented Apr 8, 2022

Description

Proposition

bdura commented Apr 8, 2022

percevalw commented Apr 8, 2022 • edited Loading

bdura commented Apr 8, 2022

percevalw commented Sep 6, 2022

percevalw commented Oct 13, 2022

percevalw commented Jul 31, 2023

percevalw commented Sep 15, 2023

percevalw commented Apr 8, 2022 •

edited

Loading