Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve srcset parsing #1160

Merged
merged 4 commits into from
Jul 29, 2023
Merged

Improve srcset parsing #1160

merged 4 commits into from
Jul 29, 2023

Conversation

mre
Copy link
Member

@mre mre commented Jul 16, 2023

Our current srcset parsing is pretty basic.

We split on comma and then on whitespace and take the first part, which is the image source URL.
However, we don't handle URLs containing unencoded commas like
/cdn-cgi/image/format=webp,width=640/https://img.youtube.com/vi/hVBl8_pgQf0/maxresdefault.jpg, which leads to false-positives.

According to the spec, commas in strings should be encoded, but in practice, there are some websites which don't do that. To handle these cases, too, I propose to extend the srcset parsing to make use of a small "state machine", which detects if a comma is within the image source or outside of it while parsing.

This is part of an effort to reduce false-positives during link checking.

@mre
Copy link
Member Author

mre commented Jul 17, 2023

@Techassi, @HU90m: maybe you got any feedback. 😉

@Techassi
Copy link
Contributor

State machines are definitely the way to go when writing parsers. I will take a look at the PR asap.

Copy link
Contributor

@HU90m HU90m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Just a few suggestions:

lychee-lib/src/extract/html/srcset.rs Outdated Show resolved Hide resolved
lychee-lib/src/extract/html/srcset.rs Outdated Show resolved Hide resolved
lychee-lib/src/extract/html/srcset.rs Outdated Show resolved Hide resolved
@mre
Copy link
Member Author

mre commented Jul 20, 2023

Loved the suggestions. Makes the code quite a bit more idiomatic. Thanks!

@mre mre merged commit cead4ce into master Jul 29, 2023
6 checks passed
@mre mre deleted the srcset-parsing branch July 29, 2023 15:06
@mre mre added the enhancement New feature or request label Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants