
Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser? #1535

Open
jpcaruana opened this issue Aug 10, 2023 · 24 comments

@jpcaruana
Contributor

jpcaruana commented Aug 10, 2023

Hi,

I posted https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ and Bridgy posted it to mastodon https://indieweb.social/@jpcaruana/110864216016868215.

In the process, the &mldr; entity is displayed literally instead of as … (Unicode U+2026, Horizontal Ellipsis) in the Mastodon post.

Emojis work fine though, see https://indieweb.social/@jpcaruana/110849548714309169 for instance.

How could I help here?
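(For context, &mldr; is one of the HTML5-only named entities for U+2026; Python's standard library knows it, which is a quick way to see what the expected decoding is. The sample string here is made up:)

```python
from html import unescape

# "mldr" is an HTML5 named character reference for U+2026 (horizontal
# ellipsis); Python's entity table is built from the WHATWG list, so the
# stdlib decodes it even though HTML4 parsers don't.
print(unescape("since&mldr; I&rsquo;ll"))  # since… I’ll
```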

@snarfed
Owner

snarfed commented Aug 10, 2023

Interesting!

Looking at the Bridgy log, the mf2 parser ended up with two content values from your post: one plain text with the &mldr; entity decoded to …, one html + text with the raw entity:

{
      "type": [
        "h-entry"
      ],
      "properties": {
        "url": [
          "https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/"
        ],
        "content": [
          "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since… I\u2019ll give it another try.",
          {
            "html": "<h1></h1><p>Not sure. This post <a href=\"https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/\">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=\"https://indieweb.social/@jpcaruana/110836774445084366\">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).</p><p>I haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try.</p>",
            "value": "Not sure. This post https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/ is 501 chars long and https://indieweb.social/@jpcaruana/110836774445084366 does not contain the last sentence.\nWhen was the production deploy for this update? Maybe I just posted this before? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn\u2019t mess up with time zones).\nI haven\u2019t posted anything longer than 500 chars since&mldr; I\u2019ll give it another try."
          },
          "..."
        ]

@snarfed
Owner

snarfed commented Aug 10, 2023

A few more data points:

@snarfed
Owner

snarfed commented Aug 10, 2023

Oops, I was wrong. I looked at https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/ in browser dev tools, and evidently that shows me the contents after decoding HTML entities. curl and Python requests both show that the response actually has multiple HTML entities, notably both &rsquo; and &mldr; inside *-content. Interestingly we decode &rsquo; ok, but not &mldr;.

(Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

@snarfed
Owner

snarfed commented Aug 10, 2023

Looks like this is a BeautifulSoup thing. Bridgy and granary fetch posts with requests:

bridgy/webmention.py

Lines 66 to 71 in b26c885

try:
  resp = util.requests_get(url)
  resp.raise_for_status()
except werkzeug.exceptions.HTTPException:
  # raised by us, probably via self.error()
  raise

...then parse it manually with BeautifulSoup, then pass that to mf2py:

https://github.com/snarfed/webutil/blob/63be8a763a618d43e957c6d414c0f6de8f298184/util.py#L1917-L1985

  if isinstance(input, requests.Response):
    content_type = input.headers.get('content-type') or ''
    input = input.text if 'charset' in content_type else input.content

  return bs4.BeautifulSoup(input, **kwargs)

def parse_mf2(input, url=None, id=None):
  ...
  return mf2py.parse(url=url, doc=input, img_with_alt=True)

When I do this outside of Bridgy/granary, BeautifulSoup converts &rsquo; to ’ but doesn't recognize &mldr;:

>>> resp = util.requests_get('https://jp.caruana.fr/notes/2023/08/10/not-sure-this-post-https-jp/')
>>> print(resp.text)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href=https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/>https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href=https://indieweb.social/@jpcaruana/110836774445084366>https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn&rsquo;t mess up with time zones).</p><p>I haven&rsquo;t posted anything longer than 500 chars since&mldr; I&rsquo;ll give it another try.</p></div>
...
>>> soup = bs4.BeautifulSoup(resp.text)
>>> print(soup)
...
<div class="p-name p-content e-content"><h1></h1><p>Not sure. This post <a href="https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/">https://jp.caruana.fr/notes/2023/08/05/how-to-synchronize-a-tree/</a> is 501 chars long and <a href="https://indieweb.social/@jpcaruana/110836774445084366">https://indieweb.social/@jpcaruana/110836774445084366</a> does not contain the last sentence.</p><p>When was the production deploy for this update? Maybe I just posted this <em>before</em>? It posted it on Aug 5th, 13:29 CEST (4:29 PDT if I didn’t mess up with time zones).</p><p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p></div>
...

@capjamesg @angelogladding any thoughts here?
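(The entity handling in question can be reproduced with only the standard library's html.parser machinery, no BeautifulSoup needed; a minimal sketch, with an invented input string:)

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects text nodes. With convert_charrefs=True (the default),
    named character references are decoded using the full WHATWG table,
    which includes HTML5-only names like &mldr;."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextCollector()
parser.feed("<p>since&mldr; I&rsquo;ll</p>")
parser.close()
print("".join(parser.chunks))  # since… I’ll
```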

@snarfed
Owner

snarfed commented Aug 10, 2023

Hmm, maybe it's an lxml thing?

>>> from bs4.diagnose import diagnose
>>> diagnose(resp.text)
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.9.16 (main, Dec  7 2022, 10:06:04)
[Clang 14.0.0 (clang-1400.0.29.202)]
Found lxml version 4.9.3.0
Found html5lib version 1.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
...
        I haven’t posted anything longer than 500 chars since… I’ll give it another try...
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
...
        I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.
...

@snarfed
Owner

snarfed commented Aug 10, 2023

Looks like maybe yes.

>>> print(bs4.BeautifulSoup(text, 'html.parser'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'html5lib'))
...
<p>I haven’t posted anything longer than 500 chars since… I’ll give it another try.</p>
...
>>> print(bs4.BeautifulSoup(text, 'lxml'))
...
<p>I haven’t posted anything longer than 500 chars since&amp;mldr; I’ll give it another try.</p>
...

@snarfed
Owner

snarfed commented Aug 10, 2023

I see mldr in WHATWG's list of entities and in the 2011 HTML spec. I don't see anything obvious searching lxml's bug tracker for mldr or for entities. So I guess the next step is to file a bug with them?
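(A quick way to confirm mldr really is in the WHATWG list: Python's stdlib ships a copy of that table in html.entities.html5:)

```python
import html.entities

# html.entities.html5 maps WHATWG named character references (with their
# trailing semicolons) to the characters they decode to.
assert "mldr;" in html.entities.html5
print(repr(html.entities.html5["mldr;"]))
```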

@capjamesg

@snarfed I was thinking about this and my first intuition was to try another parser like html5lib. It seems like an lxml issue.

@snarfed
Owner

snarfed commented Aug 10, 2023

Filed https://bugs.launchpad.net/lxml/+bug/2031045

@jpcaruana
Contributor Author

(Btw @jpcaruana it looks like the reason you have multiple values for content is that you're using both p-content and e-content: <div class="p-name p-content e-content">. You probably only want one of those.)

Thank you, I know I had strange issues back in the days, so I ended up in duplicating indieweb class names. I'll take a look at it.

Which one is, in your opinion, the "one to keep"?

@snarfed
Owner

snarfed commented Aug 11, 2023

Generally e-content, which preserves the inner HTML tags. Only use p-content if you want to collapse the value to plain text.
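For illustration, a minimal h-entry marked up with only e-content might look like this (hypothetical markup, class names per microformats2):

```html
<article class="h-entry">
  <!-- e-content keeps the inner HTML; a parser also derives a plain-text
       "value" from it, so no separate p-content is needed -->
  <div class="e-content"><p>Post body with <em>HTML</em> preserved&mldr;</p></div>
</article>
```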

@jpcaruana
Contributor Author

Crystal clear, thank you. Fixed on my side :)

@snarfed
Owner

snarfed commented Aug 13, 2023

@snarfed
Owner

snarfed commented Aug 14, 2023

Evidently the root cause is that libxml2 only supports HTML4, not HTML5, even 15y later 🤦. They have a 2y old issue tracking adding HTML5 support, with some discussion, but no obvious progress. Sigh. https://gitlab.gnome.org/GNOME/libxml2/-/issues/211

@snarfed
Owner

snarfed commented Aug 15, 2023

Added this to the Bridgy docs: b37f45c, https://brid.gy/about#html-entities

@snarfed snarfed changed the title Some chars are not encoded correctly Publish: HTML5 entities not supported due to lxml/libxml2. Should we switch to a different parser? Aug 15, 2023
@capjamesg

Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to foot the burden and see malformed markup.

@jpcaruana
Contributor Author

Wow. That is surprising. I think switching to a different parser sounds wise; the user shouldn't have to foot the burden and see malformed markup.

Do you know of any alternatives out there?

@capjamesg

You can use html5lib with BeautifulSoup for HTML5 parsing, but the BeautifulSoup documentation says this parser is "very slow".

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

@jpcaruana
Contributor Author

I found, but did not test, html5-parser; they claim it is fast:

A fast implementation of the HTML 5 parsing spec for Python

https://html5-parser.readthedocs.io/en/latest/

@capjamesg

@jpcaruana Looks interesting!

There is a benchmark script named benchmark.py that compares the parse times for parsing a large (~ 5.7MB) HTML document in html5lib and html5-parser. The results on my system (using python 3) show a speedup of 37x.

@snarfed
Owner

snarfed commented Aug 15, 2023

Thanks for the sleuthing, guys! Also I think BeautifulSoup's claims are something like 10y old or more. So they may still be true, but they may well not be. 🤷

@kevinmarks

I remember sknebel doing a bunch of performance work a while back https://github.com/microformats/mf2py/issues?q=label%3Aperformance+ - 5 years ago and some BeautifulSoup refactoring before that by kyle.

It may be time to look at backing out of BeautifulSoup in favour of a modern parser like html5-parser, which would be more like the way the Go parser works, but that would be a big refactor.

@snarfed
Owner

snarfed commented Aug 15, 2023

Thanks @kevinmarks!

Not a top priority for me personally, but moving away from BeautifulSoup might only be a medium sized refactor, spread across a few packages, and we could do them independently.

@snarfed
Owner

snarfed commented Oct 8, 2024

libxml2 reported a bunch of progress on HTML5 support recently: https://gitlab.gnome.org/GNOME/libxml2/-/issues/211#note_2241407
