Improve parsing of author information #521
Instead of relying on regular expressions, this patch leverages Python’s builtin `email.utils.parseaddr()` functionality to parse an RFC-822-compliant email address string into its name and address parts. This should also resolve issues with special characters in the name part; see for example Poetry issues python-poetry/poetry#370 and python-poetry/poetry#798.
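For readers unfamiliar with `email.utils.parseaddr()`, here is a minimal sketch of the behavior the patch relies on. The sample names and addresses are made up for illustration and are not taken from the PR.

```python
from email.utils import parseaddr

# Plain "Name <address>" form: split into display name and address.
plain = parseaddr("Jane Doe <jane@example.com>")

# A quoted display name may contain special characters such as commas.
quoted = parseaddr('"Doe, Jane" <jane@example.com>')

# A bare address yields an empty display name.
bare = parseaddr("jane@example.com")

print(plain)   # ('Jane Doe', 'jane@example.com')
print(quoted)  # ('Doe, Jane', 'jane@example.com')
print(bare)    # ('', 'jane@example.com')
```

Note that `parseaddr()` always returns a 2-tuple of strings, so a caller has to decide how to treat an empty name or an empty address.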
tests/utils/test_helpers.py
Outdated
@@ -118,3 +119,60 @@ def test_utils_helpers_readme_content_type(
    readme: str | Path, content_type: str
) -> None:
    assert readme_content_type(readme) == content_type


def test_utils_helpers_parse_author():
Please make this a parametrized test.
Done; see commit 8286618.
@@ -32,6 +33,8 @@

T = TypeVar("T", bound="Package")

# TODO: once poetry.console.commands.init.InitCommand._validate_author
# uses poetry.core.utils.helpers.parse_author, this can be removed.
AUTHOR_REGEX = re.compile(r"(?u)^(?P<name>[- .,\w\d'’\"():&]+)(?: <(?P<email>.+?)>)?$")
Could you stack this upon #517, please?
I didn't know about `email.utils` before. If we can make it fly, it is clearly preferable to our own regex implementation 👍
I played around with it a bit. For some reason, with `blabla <[email protected]` I receive `blabla` as name and `[email protected]` as email, which is surprising because the closing `>` is missing. 🤔
src/poetry/core/utils/helpers.py
Outdated
if "@" not in address:
    return address, None
name, email = parseaddr(address)
if not name and "@" not in email:
    return email, None
return name or None, email or None
IMO we can remove all the `if` blocks.
It seems it will be hard (if not impossible) to make this backward compatible, given the existing tests for valid and invalid author names. What’s the general concept in Poetry for author and maintainer strings? Judging from the test suite, it appears that a name is mandatory, whereas an email address is optional. Then again, the ValueError raised in

For reference: According to the PyPA specification (derived from PEP-621), author and maintainer information can be either name-only, email-only, or name and email:

Furthermore, RFC-822-compliance for names is mentioned as well:

TL;DR: I think it needs to be decided whether to adopt the PyPA spec (at least in this regard, which would be a breaking change), or if Poetry wants to continue having its own concept of what constitutes valid author/maintainer strings. In the latter case, this PR can probably be closed without further ado.
Hello @yggi49, thanks for these details 👍

IMO the goal must be PyPA spec compliance. But maybe in two steps. Within the current scope of your PR we should switch from the regex implementation to

In a second PR we can then make name optional as well (unless other maintainers have a veto) and also make sure that name/email goes into the correct metadata fields as described in https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#authors-maintainers.

fin swimmer
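To make the "correct metadata fields" point concrete, here is a rough sketch of the mapping described in the linked PyPA specification: a name-only entry goes into `Author`, an email-only entry goes into `Author-email`, and an entry with both goes into `Author-email` as `name <email>`, with multiple values separated by commas. The function name `authors_to_metadata` and the sample data are hypothetical and not part of Poetry.

```python
from __future__ import annotations


def authors_to_metadata(authors: list[dict[str, str]]) -> dict[str, str]:
    """Map author tables to core metadata fields (hypothetical helper)."""
    names_only: list[str] = []
    with_email: list[str] = []
    for author in authors:
        name = author.get("name")
        email = author.get("email")
        if name and email:
            # Both given: "name <email>" goes into Author-email.
            with_email.append(f"{name} <{email}>")
        elif email:
            # Email only: the bare address goes into Author-email.
            with_email.append(email)
        elif name:
            # Name only: the name goes into Author.
            names_only.append(name)

    fields: dict[str, str] = {}
    if names_only:
        fields["Author"] = ", ".join(names_only)
    if with_email:
        fields["Author-email"] = ", ".join(with_email)
    return fields


example = authors_to_metadata(
    [
        {"name": "Jane Doe"},
        {"email": "team@example.com"},
        {"name": "Bob Roe", "email": "bob@example.com"},
    ]
)
print(example)
```

The same mapping would apply to maintainers with the `Maintainer` and `Maintainer-email` fields. Quoting of names that contain special characters is deliberately left out of this sketch.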
Thank you for providing guidance, @finswimmer. I am going to tweak the logic accordingly (name mandatory, email address optional), and adjust the tests as needed.
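The "name mandatory, email address optional" rule could look roughly like the sketch below. This is not the code from the PR: the function name `parse_author_sketch`, the error messages, and the choice to raise `ValueError` are assumptions made for illustration.

```python
from __future__ import annotations

from email.utils import parseaddr


def parse_author_sketch(address: str) -> tuple[str, str | None]:
    """Return (name, email); the name is required, the email is optional."""
    if "@" not in address:
        # No address part at all: treat the whole string as the name.
        name = address.strip()
        if not name:
            raise ValueError("Author name must not be empty.")
        return name, None

    name, email = parseaddr(address)
    if not name or not email:
        # Email-only strings (or unparsable input) are rejected here,
        # because this sketch treats the name as mandatory.
        raise ValueError(f"Invalid author string: {address!r}")
    return name, email


print(parse_author_sketch("Jane Doe"))
print(parse_author_sketch("Jane Doe <jane@example.com>"))
```

Under this rule an email-only string such as `<jane@example.com>` is rejected, which is exactly where the behavior diverges from the PyPA specification discussed above.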
I'm not a fan of how the tests are structured in this PR; I would like something much more thorough and incorporated into the existing tests (see the tests/additions in #517).
@finswimmer, I adapted the logic towards PyPA-compliance, with a mandatory name and an optional email address. Check out the test cases for the new @neersighted, my primary goal was to factor out author parsing into a single separate helper method and provide a comprehensive list of test cases for it, to showcase the expected behavior. Initially, I didn’t want to mess around too much with what was there already, but merely adapted the cases of existing tests accordingly. Personally, I think that the tests in test_package.py— |
Thanks for making progress on that @yggi49 👍

I'm still a bit irritated by the behavior of

>>> email.utils.parseaddr("Me [Some Department] <[email protected]>")
('', 'Me')

I would expect

>>> email.utils.parseaddr("Me <[email protected]")
('Me', '[email protected]')

I would expect either a parsing error (preferred) or

I don't understand why it behaves like this. Expected? Bug?

How do we want to handle this? We could check if we can reconstruct the address that goes into

fin swimmer
If brackets should be part of the name, the name must be quoted, i.e., it should read

Python’s address parsing is implemented in

I did implement a poor-man’s reverse check into the
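The reverse check mentioned in the last two comments can be sketched as a round trip: parse the string, format the result again with `email.utils.formataddr()`, and compare it with the input. This is only an illustration of the idea, not the check implemented in the PR; the function name `roundtrips` and the sample addresses are made up.

```python
from email.utils import formataddr, parseaddr


def roundtrips(address: str) -> bool:
    """Return True if parsing and re-formatting reproduces the input."""
    name, email = parseaddr(address)
    if not email:
        return False
    try:
        rebuilt = formataddr((name, email))
    except UnicodeError:
        # formataddr() requires an ASCII-only address part.
        return False
    return rebuilt == address.strip()


print(roundtrips("Jane Doe <jane@example.com>"))
print(roundtrips("Jane Doe <jane@example.com"))
print(roundtrips("Me [Some Department] <me@example.com>"))
```

A well-formed string survives the round trip, while the missing `>` and the unquoted brackets do not. One caveat of this approach: `formataddr()` encodes non-ASCII display names, so a name with accented characters would also fail the comparison and would need separate handling.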
It's a rabbit hole ... 🐇

I asked at the PyPA discord server about the behavior of

>>> from email.headerregistry import AddressHeader
>>> parsed = AddressHeader.value_parser("Me [Some Department] <[email protected]>")
>>> len(parsed.all_defects) > 0
True
>>> parsed = AddressHeader.value_parser('"Me [Some Department]" <[email protected]>')
>>> parsed.all_defects
[]
>>> parsed.mailboxes[0].addr_spec
'[email protected]'
>>> parsed.mailboxes[0].display_name
'Me [Some Department]'

So if we detect an

This all leads to the question: Is it all worth it? I don't know and don't have a strong opinion about it. Maybe some of the other maintainers?
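Wrapped into a helper, the defect check from the session above might look like the following sketch. The function name `parse_strict`, the `ValueError`, and the requirement of exactly one mailbox are assumptions for illustration; only `AddressHeader.value_parser`, `all_defects`, `mailboxes`, `display_name`, and `addr_spec` come from the session itself.

```python
from __future__ import annotations

from email.headerregistry import AddressHeader


def parse_strict(address: str) -> tuple[str | None, str]:
    """Parse one "name <email>" string, rejecting input with defects."""
    parsed = AddressHeader.value_parser(address)
    # Reject anything the parser flags, and anything that is not
    # exactly one mailbox (assumption: one author per string).
    if parsed.all_defects or len(parsed.mailboxes) != 1:
        raise ValueError(f"Invalid author string: {address!r}")
    mailbox = parsed.mailboxes[0]
    return mailbox.display_name, mailbox.addr_spec


print(parse_strict('"Me [Some Department]" <me@example.com>'))
```

Compared with `parseaddr()`, this turns questionable input into an explicit error instead of a silently truncated name, at the cost of relying on a less commonly used part of the `email` package.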
@finswimmer, I already mentioned that this is going to be the case, since the existing regular expression for the name part

Maybe it’s easier to approach this entire topic from the other side, and make a basic strategic decision for this project first: is PyPA compatibility the goal, or does Poetry want to keep doing it its own way?
@finswimmer, any updates on this front?
Instead of relying on regular expressions, this patch leverages Python’s builtin `email.utils.parseaddr()` functionality to parse an RFC-822-compliant email address string into its name and address parts.

This should also resolve issues with special characters in the name part; see for example:
poetry build
poetry#370

This is a followup to python-poetry/poetry PR #1040, as advised by @Secrus’ comment.

If this PR gets accepted, there will be two additional steps:
Update `poetry.console.commands.init.InitCommand._validate_author()` to use the new `parse_author()` helper, see https://github.com/python-poetry/poetry/blob/master/src/poetry/console/commands/init.py#L441
Remove `AUTHOR_REGEX` from `poetry.core.packages.package`.