Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URLs are not serialized according to WHATWG spec and compare differently #1167

Open
5 tasks done
mzmm1000 opened this issue Sep 28, 2024 · 7 comments
Open
5 tasks done
Labels

Comments

@mzmm1000
Copy link

mzmm1000 commented Sep 28, 2024

Please confirm the following

  • I understand this is open source software provided for free and that I might not receive a timely response.
  • I am positive I am NOT reporting a (potential) security
    vulnerability, to the best of my knowledge. (These must be shared by
    submitting this report form instead, if
    any hesitation exists.)
  • I am willing to submit a pull request with reporoducers as xfailing test cases or even entire fix. (Assign this issue to me.)

Describe the bug

Looking at the WHATWG example for URL equality, the URLs web+demo:/.//not-a-host/ and web+demo:/path/..//not-a-host/ should be considered equal after URL serialization. When parsing those URLs using ada, they are indeed serialized identically.

To Reproduce

Using yarl 1.12.1, run

import yarl
assert yarl.URL("web+demo:/path/..//not-a-host/") == yarl.URL("web+demo:/.//not-a-host/")

Expected behavior

Both URLs should be serialized as described in the WHATWG specification, i.e. the serialized result should be web+demo:/.//not-a-host/ in both cases.

Logs/tracebacks

n/a

Python Version

Python 3.12.6

multidict Version

Name: multidict
Version: 6.1.0

yarl Version

Name: yarl
Version: 1.12.1

OS

Linux

Additional context

No response

@mzmm1000 mzmm1000 added the bug label Sep 28, 2024
@webknjaz
Copy link
Member

webknjaz commented Oct 8, 2024

I think that might be implemented this way to prevent security vulnerabilities given the contexts where yarl is used..

cc @Dreamsorcerer @bdraco

@bdraco
Copy link
Member

bdraco commented Oct 8, 2024

Currently we only normalize the path if the host (netloc) is set

If you add a host, then the assertion will pass

assert yarl.URL("web+demo://host/path/..//not-a-host/") == yarl.URL("web+demo://host/.//not-a-host/")

@bdraco
Copy link
Member

bdraco commented Oct 8, 2024

It seems this behavior is quite intentional as if I remove the check these tests fail:

FAILED tests/test_url.py::test_join_from_rfc_3986_normal[../..-http://a/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[.-http://a/b/c/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/b/c/')
FAILED tests/test_url.py::test_join_cpython_urljoin[http://a/b/c/d/e/-../../f/g/-http://a/b/c/f/g/] - AssertionError: assert URL('http://a/b/c/d/e/f/g/') == URL('http://a/b/c/f/g/')
FAILED tests/test_url.py::test_join_cpython_urljoin[http://a/b/c/d/e/-../../f/g-http://a/b/c/f/g] - AssertionError: assert URL('http://a/b/c/d/e/f/g') == URL('http://a/b/c/f/g')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[./-http://a/b/c/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/b/c/')
FAILED tests/test_url.py::test_join_from_rfc_3986_abnormal[./../g-http://a/b/g] - AssertionError: assert URL('http://a/b/c/g') == URL('http://a/b/g')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[../../-http://a/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/')
FAILED tests/test_url.py::test_join_cpython_urljoin[http://a/b/-../../f/g/-http://a/f/g/] - AssertionError: assert URL('http://a/b/f/g/') == URL('http://a/f/g/')
FAILED tests/test_url.py::test_join_cpython_urljoin[http://a/b/c/d/e-../../f/g/-http://a/b/f/g/] - AssertionError: assert URL('http://a/b/c/d/f/g/') == URL('http://a/b/f/g/')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[..-http://a/b/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/b/')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[../../g-http://a/g] - AssertionError: assert URL('http://a/b/c/g') == URL('http://a/g')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[../-http://a/b/] - AssertionError: assert URL('http://a/b/c/d;p?q') == URL('http://a/b/')
FAILED tests/test_url.py::test_join_from_rfc_3986_abnormal[../../../g-http://a/g] - AssertionError: assert URL('http://a/b/c/g') == URL('http://a/g')
FAILED tests/test_url.py::test_join_from_rfc_3986_abnormal[../../../../g-http://a/g] - AssertionError: assert URL('http://a/b/c/g') == URL('http://a/g')
FAILED tests/test_url.py::test_join_from_rfc_3986_normal[../g-http://a/b/g] - AssertionError: assert URL('http://a/b/c/g') == URL('http://a/b/g')

@Dreamsorcerer
Copy link
Member

I'd note that all of those tests are for joining, not equality. So, maybe something should be tweaked here, but I'm not too clear on what.

@Dreamsorcerer
Copy link
Member

Basically, we can't normalise the path upfront if it's relative, as it needs to evaluate that relative part once joined to a base URL. Possibly, we should be normalising the paths of relative URLs on equality check though?

The RFC seems to suggest to me that equality of relative references is not relevant...
https://datatracker.ietf.org/doc/html/rfc3986#section-6.2

@bdraco
Copy link
Member

bdraco commented Oct 8, 2024

Possibly, we should be normalising the paths of relative URLs on equality check though?

I think that would be ok but not sure if thats a breaking change or not....

@webknjaz
Copy link
Member

webknjaz commented Oct 8, 2024

Except for the cases where untrusted input is fed to yarl.URL() invocations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants