Escaping netloc #4

Open
chirayuk opened this issue Apr 24, 2013 · 2 comments

Comments

@chirayuk

I'm trying to use url.py to sanitize URLs entered by users so I can put them in HTML as links. I came across the behavior below, which I consider a bug: the path and query are escaped, but the netloc (and the fragment) come through untouched.

>>> url.parse("http://\"<script>/b\"<b>c?q=\"<script>#bar\"<script>").escape().unicode()
u'http://"<script>/b%22%3Cb%3Ec?q=%22%3Cscript%3E#bar"<script>'

I'm working around it by escaping everything aggressively (for my use case, it's alright to accept only "nice looking" URLs, as they're expected to be gateway URLs). I'm not sure exactly what escaping is required, so I can't submit a patch/pull request yet (assuming you agree it's a bug).
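A workaround along these lines could be sketched as below. This is only an illustration of the "accept only nice looking URLs" idea, not code from url.py: the function name `nice_link`, the allowed character sets, and the restriction to http/https are all assumptions made for the example. It uses Python 3's standard `html.escape` to escape the URL at the point where it is written into HTML.

```python
import html
import re

# Hypothetical whitelist: http(s), a plain ASCII host, an optional port,
# and a conservative set of characters for path, query and fragment.
_NICE_URL = re.compile(
    r"https?://[A-Za-z0-9.-]+(?::[0-9]+)?(?:/[A-Za-z0-9._~/%?&=+#-]*)?"
)


def nice_link(raw):
    """Return an <a> tag for raw, or None if raw is not a 'nice looking' URL."""
    if not _NICE_URL.fullmatch(raw):
        return None
    # Escape again at output time so '&' and friends are valid inside HTML.
    escaped = html.escape(raw, quote=True)
    return '<a href="%s">%s</a>' % (escaped, escaped)
```

Rejecting outright is stricter than escaping, so it sidesteps the question of which escaping is correct for each URL component, at the cost of refusing some legitimate URLs.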

Thanks for the nice library.

@dlecocq
Contributor

dlecocq commented Apr 24, 2013

That's interesting. URLs are specified across several different RFCs, but I'm pretty sure that ", > and < are disallowed in the hostname. And yet you're right that it's getting parsed out as the netloc. In particular, from RFC 2396:

hostport      = host [ ":" port ]
host          = hostname | IPv4address
hostname      = *( domainlabel "." ) toplabel [ "." ]
domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
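For illustration, the hostname rules quoted above can be transcribed into a regular expression. This is a sketch in Python 3, not part of url.py; the name `is_rfc2396_hostname` is made up for the example, and it covers only the `hostname` production (not `IPv4address` or the port).

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
_DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
_TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
_HOSTNAME = re.compile(r"(?:%s\.)*%s\.?" % (_DOMAINLABEL, _TOPLABEL))


def is_rfc2396_hostname(host):
    """True if host matches the RFC 2396 hostname production quoted above."""
    return _HOSTNAME.fullmatch(host) is not None
```

Under this grammar `example.com` is accepted, while `"<script>` is rejected because none of `"`, `<` or `>` can appear in a label.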

As such, the hostname portion should never need escaping. Punycoding, in some cases, but not escaping. In these cases I like to check other implementations:

  • curl and wget both interpret the hostname as "<script>
  • Chrome runs a search instead, indicating it doesn't believe it to be a URL
  • Safari escapes it in the URL bar, but complains about not being able to resolve "<script>, like curl

If I get a chance today, I might take a look at the source for urlparse. I have a feeling that it (like many implementations) is probably quite permissive.
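As a quick way to see that permissiveness, the snippet below splits the URL from the report with Python 3's `urllib.parse.urlsplit`, the successor to the Python 2 `urlparse` module. It is only a demonstration of the standard library's splitting, and assumes nothing about how url.py uses it internally.

```python
from urllib.parse import urlsplit

# The URL from the original report.
parts = urlsplit('http://"<script>/b"<b>c?q="<script>#bar"<script>')

# urlsplit treats everything between '//' and the next '/', '?' or '#'
# as the netloc, without validating the characters in it.
print(repr(parts.netloc))    # '"<script>'
print(repr(parts.path))      # '/b"<b>c'
print(repr(parts.query))     # 'q="<script>'
print(repr(parts.fragment))  # 'bar"<script>'
```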

@dlecocq
Contributor

dlecocq commented Apr 24, 2013

I figured I should check some of the other RFCs, and it turns out that RFC 3986 is much more permissive on the matter than 2396:

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT
reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

If we want to adhere to that, then yes, it seems the netloc must be escaped as well: anything outside unreserved and sub-delims would have to be percent-encoded.
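A minimal sketch of what that escaping might look like for the `reg-name` part, using Python 3's `urllib.parse.quote`. This is an assumption about one possible approach, not the fix adopted in url.py; `escape_reg_name` is a hypothetical helper, and it deliberately ignores userinfo, ports, IP literals, and punycoding.

```python
from urllib.parse import quote

# sub-delims from RFC 3986; quote() already leaves the unreserved
# characters (letters, digits and "-", ".", "_") unescaped by default.
_SUB_DELIMS = "!$&'()*+,;="


def escape_reg_name(host):
    """Percent-encode every character not allowed literally in a reg-name.

    Caveat: an existing '%' is encoded as '%25', so input that is already
    percent-encoded would be double-encoded by this sketch.
    """
    return quote(host, safe=_SUB_DELIMS)


print(escape_reg_name('"<script>'))    # %22%3Cscript%3E
print(escape_reg_name("example.com"))  # example.com
```

An escaped host like this is safe to embed, but it still would not resolve, so rejecting such input may be the more useful behavior for some callers.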
