Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: relieve constraints on non edsnlp custom attributes #220

Closed
cvinot opened this issue Oct 3, 2023 · 3 comments · Fixed by #228
Closed

Feature request: relieve constraints on non edsnlp custom attributes #220

cvinot opened this issue Oct 3, 2023 · 3 comments · Fixed by #228

Comments

@cvinot
Copy link
Contributor

cvinot commented Oct 3, 2023

Feature type

Enhance compatibility of EDSnlp custom attributes with potential external pipelines.

Description

in the BaseComponent class, in this commit you added this line :

        Span.set_extension(
            "value",
            getter=lambda span: span._.get(span.label_)
            if span._.has(span.label_)
            else None,
            force=True,
        )`

Doing this, you are enforcing (overwriting if already defined) an attribute which is non edsnlp specific. i.e not named specificaly for your use case, and could be very easily required by anyone using your package as part of a broader pipeline.(see spacy good practices regarding naming components/attributes)

Forcing a getter function means if later on a component were to try to set a value, it would be ignored.

Example of potential conflict:

import spacy
# Setting up a first pipeline
random_nlp = spacy.blank("fr")
random_nlp.add_pipe(
    "eds.terminology", # This component is enforcing the "value" custom attribute
    name="test",
    config=dict(label="Any",terms={}),
)

# Setting up another custom pipeline somewhere else in the code
nlp = spacy.blank("fr")
text = "hello this is a test"
doc = nlp(text)
my_span = doc[0:3]
my_span._.value = "CustomValue" # This raises no error.

assert my_span._.value == "CustomValue" # Error: my_span._.value is None.

WIth my modest experience I would suggest avoiding enforcing attributes in general, but if necessary, renaming the attribute to avoid conflicts. If not possible allowing renaming of the attribute, and/or make sure you let the user know what attributes you are enforcing. (I believe this should be made very clear in the doc specifically in the case of attributes such as "value").

I ran into this conflict upgrading edsnlp from 0.7.4 to 0.9 I might be missing something here and would love to hear what the reasons are if this is absolutely needed.

@cvinot cvinot changed the title Feature request: [feature] Feature request: relieve constraints on non edsnlp custom attributes Oct 3, 2023
@percevalw
Copy link
Member

Thank you for letting us know! Indeed, this is problematic. This change was made to provide a uniform, generic, access to the normalized entity value (see #47), useful for printing results, exporting as a dataframe, etc, regardless of the span label.
I feel we can have the best of both worlds by querying the user_data dict first, and running the getter if the search failed.

In any case, you're right:

  • force should be False
  • this should be better documented
  • a warning can be displayed when the user sets the extension directly to remind them that its intended to be used as a getter

@cvinot
Copy link
Contributor Author

cvinot commented Oct 5, 2023

Thanks for the PR link, it does make more sense understanding your approach.

However I still find dubious the use of the "value" extension you are suggesting, as a label_ getter. It seems you are basically erasing this extension by linking it to "label_" which is basically limited to span categories, when it could simply be used to different use cases. I do understand the will to store normalized values.

Why not get directly the "label_" attribute when needed ?

In my use case I see the value extension more like a free entry value which could be set and overridden by different process, sort of how you use the value of your measurements. I run various pipelines, which store information in their respective specific attributes, defined in specifically named attributes, and "decisions" pipeline which decide for a value given the information of the spans. That means the value is not specifically normalized and could be of many shapes including custom models.

I think attributes with strong constraints should be isolated in very specific names to avoid overlapping with more general attributes if that makes sense. I'm personnaly not scared of verbose, we could imagine "span._.label_getter" with the same usecase.

Anyway thanks for explaining your view, I can understand that multiple approaches can be considered.

@percevalw
Copy link
Member

In my use case I see the value extension more like a free entry value which could be set and overridden by different process

Agreed, we will soon allow this. But in the case a value has not been set by the user, we will keep the getter behavior.

Why not get directly the "label_" attribute when needed ?

It's true that just typing span._.get(span.label_) would often be sufficient. However, we have multiple use cases in which we list the extensions that we want to see exported / learned / evaluated, etc: with the value extension, just listing it in my_func(extensions=["value", "negation"]) solves this.

That means the value is not specifically normalized and could be of many shapes including custom models.

In our case, a normalized value can also be an instance of a specific class (for instance an AbsoluteDate object or a SimpleMeasurement), or a str if it is enough to semantically describe the extraction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants