better handling of percent-encoded image references #605

benoit-pierre · 2024-10-19T04:12:30Z

fix DecodeHTMLUrlString so it does not mangle non-ASCII characters
fix lunasvgDrawImageHelper to properly handle a percent-encoded image URL

Cf. koreader/koreader#12656.

Don't mangle non-encoded non-ASCII characters.

Correctly handle percent encoded URLs.
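To illustrate the first point, here is a minimal sketch of a percent-decoder that leaves non-encoded non-ASCII bytes alone. It is not the actual DecodeHTMLUrlString code: `percentDecode` and `hexValue` are hypothetical names, and the sketch assumes the input is a UTF-8 byte string held in a `std::string` rather than one of crengine's own string classes.

```cpp
#include <cassert>
#include <string>

// Value of a hex digit, or -1 if c is not one.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper (not crengine's DecodeHTMLUrlString): decode %XX
// escapes in a UTF-8 byte string and copy every other byte unchanged,
// so bytes >= 0x80 that were never encoded pass through intact.
std::string percentDecode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = hexValue(in[i + 1]);
            const int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        // Not a valid escape: keep the byte as-is (including a stray '%').
        out.push_back(in[i]);
    }
    return out;
}
```

With this sketch, a raw UTF-8 "é" (bytes 0xC3 0xA9) and its escaped form `%C3%A9` both come out as the same two bytes, and a malformed escape such as the one in `100%.png` is kept literally.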

Frenzie

Makes sense to me

IMAGE_SOURCE_FROM_BYTES sounds somewhat useful. Nothing like that is used anywhere?

poire-z · 2024-10-19T07:28:33Z

crengine/src/epubfmt.cpp

-                if ( stream.isNull() ) {
-                    // Try again in case cover_image_path is percent-encoded
-                    cover_image_path = LVCombinePaths(codeBase, DecodeHTMLUrlString(cover_image_href));


Sure we don't need this "try again"? We're still doing it elsewhere.
The thing is that we may get crappy EPUBs, sometimes with the % escapes literally in the zip item names (that is, the correctly %-encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" branches were added to handle these crappy cases. Also, zip item names don't have any encoding, so we get them as bytes, which may or may not be UTF-8.

Some possibly related issues that I managed to find again:
koreader/koreader#7661
The bottom-most item in #326.
Search for "try again" in ef95bcc.

> Sure we don't need this "try again"? We're still doing it elsewhere. The thing is that we may get crappy EPUBs, sometimes with the % escapes literally in the zip item names (that is, the correctly %-encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" branches were added to handle these crappy cases.

In theory, but are there actually instances of this?

> Also, zip item names don't have any encoding, so we get them as bytes, which may or may not be UTF-8.
>
> Some possibly related issues that I managed to find again:
> koreader/koreader#7661
> The bottom-most item in #326.
> Search for "try again" in ef95bcc.

At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

But that code in ldomDocumentFragmentWriter::convertHref seems icky:

// Depending on what's calling us, href may or may not have
// gone thru DecodeHTMLUrlString() to decode %-encoded bits.
// We'll need to try again with DecodeHTMLUrlString() if not
// initially found in "pathSubstitutions" (whose filenames went
// thru DecodeHTMLUrlString(), and so did 'codeBase').

So a string may end up being percent-decoded twice…
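A small sketch of why decoding twice is a problem: a file literally named `100%41.png` is correctly referenced as `100%2541.png`, and a second decoding pass turns it into a different name. `decodeOnce` is a hypothetical stand-in written for this example, not crengine's DecodeHTMLUrlString.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical one-pass percent-decoder (not crengine's DecodeHTMLUrlString).
std::string decodeOnce(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()
                && std::isxdigit(static_cast<unsigned char>(s[i + 1]))
                && std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
            // Convert the two hex digits after '%' into a single byte.
            out.push_back(static_cast<char>(std::stoi(s.substr(i + 1, 2), nullptr, 16)));
            i += 2;
        } else {
            out.push_back(s[i]);
        }
    }
    return out;
}
```

Here `decodeOnce("100%2541.png")` gives `100%41.png`, the intended file name, while applying `decodeOnce` to that result again gives `100A.png`, which refers to a different file.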

> In theory, but are there actually instances of this?

I think so, if the "try again with" code is there :) (Maybe, I don't remember, the original code was bad, and I added the first chunk to do it right, keeping the "bad" chunk as a fallback just in case, as it has most often never hurt us.)

> At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

So, put the existing "try again" code in the right order rather than removing it.

I don't remember all this and don't want to dig in, so I'm trusting you. Just remember that crengine doesn't require strictly perfect EPUBs: it was made (with such "try again" branches) to handle the crappy ones we've met over the years.
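The ordering discussed above (correct behavior first, workaround second) could be sketched as follows. This is an illustration only, not crengine code: `resolveItem` is a hypothetical name, `items` stands in for the set of zip item names, and `decode` stands in for a percent-decoder such as DecodeHTMLUrlString.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical lookup, not crengine code. `items` models the zip item
// names and `decode` is any percent-decoder supplied by the caller.
// Returns the matching item name, or an empty string if nothing matches.
std::string resolveItem(const std::set<std::string>& items,
                        const std::string& href,
                        std::string (*decode)(const std::string&)) {
    // 1. Correct behavior: the attribute value is percent-encoded,
    //    so look up its decoded form first.
    const std::string decoded = decode(href);
    if (items.count(decoded)) return decoded;
    // 2. Workaround for broken EPUBs whose zip item names contain the
    //    escapes literally: try again with the raw, undecoded value.
    if (items.count(href)) return href;
    return std::string();
}
```

With this order, a well-formed EPUB is resolved on the first lookup, and the raw name is only tried when the decoded one is absent, so an archive that happens to contain both names resolves to the correctly decoded one.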

benoit-pierre · 2024-10-19T13:36:59Z

> Makes sense to me
>
> IMAGE_SOURCE_FROM_BYTES sounds somewhat useful. Nothing like that is used anywhere?

Nope.

benoit-pierre added 4 commits October 19, 2024 06:06

epubfmt: minor simplification

2c82a88

lvstring: fix DecodeHTMLUrlString implementation

0d0102e

Don't mangle non-encoded non-ASCII characters.

lvimg: minor cleanup (dead code)

4df360b

lvimg: fix lunasvgDrawImageHelper implementation

f3d74bf

Correctly handle percent encoded URLs.

Frenzie approved these changes Oct 19, 2024

poire-z reviewed Oct 19, 2024
