Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better handling of percent-encoded image references #605

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 0 additions & 7 deletions crengine/include/lvimg.h
Original file line number Diff line number Diff line change
Expand Up @@ -147,11 +147,4 @@ void LVDrawBatteryIcon( LVDrawBuf * drawbuf, const lvRect & batteryRc, int perce

unsigned char * convertSVGtoPNG(const unsigned char *svg_data, int svg_data_size, float zoom_factor, int *png_data_len);

#define IMAGE_SOURCE_FROM_BYTES( imgvar , bufvar ) \
extern unsigned char bufvar []; \
extern int bufvar ## _size ; \
LVImageSourceRef imgvar = LVCreateStreamImageSource( \
LVCreateMemoryStream( bufvar , bufvar ## _size ) )


#endif
7 changes: 1 addition & 6 deletions crengine/src/epubfmt.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1045,7 +1045,7 @@
bool _bold;
lString32 _url;
public:
EmbeddedFontStyleParser(LVEmbeddedFontList & fontList) : _fontList(fontList) { }

Check warning on line 1048 in crengine/src/epubfmt.cpp

View workflow job for this annotation

GitHub Actions / Static linting

3 uninitialized fields at the end of the constructor call
void onToken(char token) {
// 4,5: font-family:
// 6,7: font-weight:
Expand Down Expand Up @@ -1753,18 +1753,13 @@
LVStreamRef stream = m_arc->OpenStream(cover_xhtml_path.c_str(), LVOM_READ);
lString32 cover_image_href;
if ( ExtractCoverFilenameFromCoverPageFragment(stream, cover_image_href, node_scheme, attr_scheme, ns_scheme) ) {
cover_image_href = DecodeHTMLUrlString(cover_image_href);
lString32 codeBase = LVExtractPath( cover_xhtml_path );
if ( codeBase.length()>0 && codeBase.lastChar()!='/' )
codeBase.append(1, U'/');
lString32 cover_image_path = LVCombinePaths(codeBase, cover_image_href);
CRLog::info("EPUB cover image file: %s", LCSTR(cover_image_path));
LVStreamRef stream = m_arc->OpenStream(cover_image_path.c_str(), LVOM_READ);
if ( stream.isNull() ) {
// Try again in case cover_image_path is percent-encoded
cover_image_path = LVCombinePaths(codeBase, DecodeHTMLUrlString(cover_image_href));
Comment on lines -1762 to -1764
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we don't need this "try again" ? We're still doing it elsewhere.
The thing is that we may get crappy EPUBs, with sometimes the %escapes literally in the zip item names (that is, the correctly %encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" were added to handle these crappy cases. Also, zip item names don't have any encoding, so we meet stuff as bytes, which may be or not be utf8.

Some possibly related issues that I managed to find again:
koreader/koreader#7661
Bottomest item in #326.
Search for "try again" in ef95bcc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we don't need this "try again" ? We're still doing it elsewhere. The thing is that we may get crappy EPUBs, with sometimes the %escapes literally in the zip item names (that is, the correctly %encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" were added to handle these crappy cases.

In theory, but are there actually instances of this?

Also, zip item names don't have any encoding, so we meet stuff as bytes, which may be or not be utf8.

Some possibly related issues that I managed to find again: koreader/koreader#7661 Bottomest item in #326. Search for "try again" in ef95bcc.

At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

But that code in ldomDocumentFragmentWriter::convertHref seems icky:

// Depending on what's calling us, href may or may not have
// gone thru DecodeHTMLUrlString() to decode %-encoded bits.
// We'll need to try again with DecodeHTMLUrlString() if not
// initially found in "pathSubstitutions" (whose filenames went
// thru DecodeHTMLUrlString(), and so did 'codeBase').

Possibly percent-decoding a string 2 times…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, but are there actually instances of this?

I think so, if the code "try again with" is there :) (May be - don't remember - the original code was bad, and I added the first chunk to do it right, keeping the "bad" chunk as a fallbadk (as it has most often never hurt us) just in case.

At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

So, re-order existing "try again" code in the right order - rather than removing the "try again".

I don't remember all this and don't want to dig in - so trusting you. Just remember crengine doesn't enforce/require strictly-perfect-EPUB, but was made (with such "try again" branches) to handle the crappy ones we've met over the years.

CRLog::info("EPUB cover image file pct-decoded: %s", LCSTR(cover_image_path));
stream = m_arc->OpenStream(cover_image_path.c_str(), LVOM_READ);
}
if ( !stream.isNull() ) {
LVImageSourceRef img = LVCreateStreamImageSource(stream);
if ( !img.isNull() ) {
Expand Down
2 changes: 1 addition & 1 deletion crengine/src/lvimg.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2077,7 +2077,7 @@ static bool lunasvgDrawImageHelper(lunasvg::external_context_t * xcontext, const
ldomDocument * doc = ((LVNodeImageSource *)xcontext->external_object)->GetSourceDocument();
if ( doc ) {
ldomNode * node = ((LVNodeImageSource *)xcontext->external_object)->GetSourceNode();
img = doc->getObjectImageSource(Utf8ToUnicode(url), node);
img = doc->getObjectImageSource(DecodeHTMLUrlString(Utf8ToUnicode(url)), node);
}
else {
// We may be used by frontends without a ldomDocument to render SVG, and
Expand Down
91 changes: 40 additions & 51 deletions crengine/src/lvstring.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5842,64 +5842,53 @@ bool lString32::replaceIntParam(int index, int replaceNumber)
return replaceParam( index, lString32::itoa(replaceNumber));
}

static int decodeHex( lChar32 ch )
{
if ( ch>='0' && ch<='9' )
return ch-'0';
else if ( ch>='a' && ch<='f' )
return ch-'a'+10;
else if ( ch>='A' && ch<='F' )
return ch-'A'+10;
return -1;
}

static lChar8 decodeHTMLChar( const lChar32 * s )
{
if (s[0] == '%') {
int d1 = decodeHex( s[1] );
if (d1 >= 0) {
int d2 = decodeHex( s[2] );
if (d2 >= 0) {
return (lChar8)(d1*16 + d2);
}
}
}
return 0;
}

/// decodes path like "file%20name%C3%A7" to "file nameç"
/// NOTE: return the original string unchanged on error
/// (malformed escape sequence, bad UTF-8 encoding, …).
lString32 DecodeHTMLUrlString( lString32 s )
{
const lChar32 * str = s.c_str();
for ( int i=0; str[i]; i++ ) {
if ( str[i]=='%' ) {
lChar8 ch = decodeHTMLChar( str + i );
if ( ch==0 ) {
for (const lChar32 *src = s.c_str(); *src; ++src) {
if (*src != '%')
continue;
lString32 res(s);
lChar32 *dst = res.modify() + (src - s.c_str());
int continuation = 0;
lChar32 c;
while (*src) {
if (*src != '%') {
if (continuation)
return s; // ERROR: truncated UTF-8 sequence.
*dst++ = *src++;
continue;
}
// HTML encoded char found
lString8 res;
res.reserve(s.length());
res.append(UnicodeToUtf8(str, i));
res.append(1, ch);
i+=3;

// continue conversion
for ( ; str[i]; i++ ) {
if ( str[i]=='%' ) {
ch = decodeHTMLChar( str + i );
if ( ch==0 ) {
res.append(1, (lChar8)str[i]);
continue;
}
res.append(1, ch);
i+=2;
} else {
res.append(1, (lChar8)str[i]);
}
int hex = decodeHex(src + 1, 2);
if (hex <= 0)
return s; // ERROR: malformed or invalid escape sequence.
src += 3;
if (continuation) {
if ((hex & 0xc0) != 0x80)
return s; // ERROR: bad UTF-8 continuation byte.
c = (c << 6) | (hex & 0x3f);
if (!--continuation)
*dst++ = c;
} else {
if (!(hex & 0x80))
*dst++ = hex;
else if ((hex & 0xe0) == 0xc0) {
c = hex & 0x1f;
continuation = 1;
} else if ((hex & 0xf0) == 0xe0) {
c = hex & 0x0f;
continuation = 2;
} else if ((hex & 0xf8) == 0xf0) {
c = hex & 0x07;
continuation = 3;
} else
return s; // ERROR: bad UTF-8 sequence first byte.
}
return Utf8ToUnicode(res);
}
res.erase(dst - res.c_str(), res.length());
return res;
}
return s;
}
Expand Down
Loading