doesn't work when specific Japanese characters exist in a tag #250

wataradio · 2023-08-25T10:40:31Z

The Tag plugin doesn't work when a tag contains certain Japanese characters, e.g. '一' (U+4E00), as in the following example.

{{tag> 一}}

This happens because the UTF-8 byte sequence of '一' (\xE4\xB8\x80) gets corrupted by the following code in syntax_plugin_tag_tag::handle (tag.php).

$tags = trim($tags, "\xe2\x80\x8b"); // strip word/wordpad breaklines(U+200b)

It removes \x80 from \xE4\xB8\x80('一's UTF-8 byte sequence), and its result becomes an invalid sequence \xE4\xB8.
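The corruption can be reproduced outside the plugin. The following is a minimal sketch, not plugin code; `hex_bytes()` is a hypothetical helper introduced here only to print bytes as `\xNN` escapes.

```php
<?php
// Hypothetical helper: render a byte string as \xNN escapes for inspection.
function hex_bytes(string $s): string
{
    if ($s === '') {
        return '';
    }
    return '\\x' . implode('\\x', str_split(bin2hex($s), 2));
}

$tag = "\xE4\xB8\x80"; // "一" (U+4E00)

// trim() treats the second argument as a set of single bytes,
// so the trailing \x80 is stripped even though it belongs to "一".
$trimmed = trim($tag, "\xe2\x80\x8b");

echo hex_bytes($tag), ' -> ', hex_bytes($trimmed), PHP_EOL;
// prints: \xe4\xb8\x80 -> \xe4\xb8
```
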


wataradio · 2023-08-25T12:05:10Z

For example, the UTF-8 byte sequences of the following characters begin with \xe2 or end with \x80 or \x8b, so the same problem occurs.

U+2000 (en quad): 0xE2, 0x80, 0x80
U+3000 (ideographic space): 0xE3, 0x80, 0x80
U+4E00 (一): 0xE4, 0xB8, 0x80
U+228B (⊋, superset of with not equal to): 0xE2, 0x8A, 0x8B
U+308B (る): 0xE3, 0x82, 0x8B
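To see how each of these characters is affected, here is a small sketch that runs the same trim() call on every sample and checks whether the result is still valid UTF-8. It assumes each character is used as a whole tag, so it sits at both ends of the string. `trim_report()` is a hypothetical helper, and `preg_match('//u', ...)` is used here only as a UTF-8 validity check.

```php
<?php
// Byte sequences taken from the list above.
$samples = [
    'U+2000' => "\xE2\x80\x80",
    'U+3000' => "\xE3\x80\x80",
    'U+4E00' => "\xE4\xB8\x80",
    'U+228B' => "\xE2\x8A\x8B",
    'U+308B' => "\xE3\x82\x8B",
];

// Hypothetical helper: apply the byte-wise trim() and record the outcome.
function trim_report(array $samples): array
{
    $report = [];
    foreach ($samples as $label => $bytes) {
        $trimmed = trim($bytes, "\xe2\x80\x8b");
        $report[$label] = [
            'before'     => bin2hex($bytes),
            'after'      => bin2hex($trimmed),
            // preg_match() with the u modifier fails on invalid UTF-8 subjects.
            'valid_utf8' => preg_match('//u', $trimmed) === 1,
        ];
    }
    return $report;
}

foreach (trim_report($samples) as $label => $row) {
    printf(
        "%s: %s -> %s (%s)\n",
        $label,
        $row['before'],
        $row['after'] === '' ? '(empty)' : $row['after'],
        $row['valid_utf8'] ? 'valid UTF-8' : 'invalid UTF-8'
    );
}
```

In this sketch U+2000 disappears entirely, because all three of its bytes are in the strip set, while the other four are left as truncated, invalid sequences.
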

Klap-in · 2023-08-25T14:12:56Z

Thanks for the extra info; I think I now understand the cause. The intent of the trim() was to remove U+200B, i.e. a multibyte character of three bytes. However, because trim() is not multibyte-aware, it treats those three bytes as three separate characters to strip.

So should we use str_replace() here instead? Does that work?

$tags = str_replace("\xe2\x80\x8b", '', $tags); // strip word/wordpad breaklines(U+200b)

wataradio · 2023-08-25T15:05:20Z

Thanks, I think it works well.

I confirmed that the following small test code works as expected.

<?php
$str = "\xE4\xB8\x80"; // "一"
$zero_width_space = "\xe2\x80\x8b"; // U+200b ZERO WIDTH SPACE
$tags = $zero_width_space . $str . $zero_width_space;

$tags = str_replace($zero_width_space, '', $tags);

// expected output: \xe4\xb8\x80
$bytes = unpack('C*', $tags);
foreach ($bytes as $byte) {
    // zero-pad so that bytes below 0x10 still print as two hex digits
    printf('\\x%02x', $byte);
}
?>

Klap-in · 2023-08-25T21:14:28Z

Thanks for testing. trim() works only at the ends of the string, whereas str_replace() replaces everywhere in it. I think that is fine for tags. I will implement it.
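If the ends-only behaviour of trim() ever needs to be kept, one possible alternative is a regular expression with the u modifier, which matches U+200B as a whole code point. This is only a sketch for comparison, not what the plugin does; `strip_zwsp_at_ends()` is a hypothetical name.

```php
<?php
// Hypothetical helper, not part of the plugin.
function strip_zwsp_at_ends(string $tags): string
{
    // With the u modifier, \x{200B} matches the code point, not its individual bytes.
    $result = preg_replace('/^\x{200B}+|\x{200B}+$/u', '', $tags);
    // preg_replace() returns null on failure (e.g. invalid UTF-8 input);
    // in that case return the input unchanged.
    return $result ?? $tags;
}

$zwsp = "\xe2\x80\x8b"; // U+200B ZERO WIDTH SPACE
$tags = $zwsp . "\xE4\xB8\x80" . $zwsp . "\xE3\x82\x8B" . $zwsp;

// str_replace() removes every occurrence, including the one in the middle.
echo bin2hex(str_replace($zwsp, '', $tags)), PHP_EOL; // e4b880e3828b

// The regex variant removes only the leading and trailing occurrences.
echo bin2hex(strip_zwsp_at_ends($tags)), PHP_EOL;     // e4b880e2808be3828b
```
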

Fixes #250

Klap-in added the bug label Aug 25, 2023

Klap-in added a commit that referenced this issue Aug 25, 2023

replace trim() by str_replace for cleaning U+200b (zero width space)

5871336

Fixes #250

Klap-in linked a pull request Oct 17, 2023 that will close this issue

Some fixes #249

Draft
