Unicode astral character confusing pyquery #46

Open
kesinger opened this issue Aug 23, 2023 · 0 comments

This is more of a pyquery bug but I found it while using tapas-dl.

In the comments on the first installment of
https://tapas.io/series/talesofthehangman
there's a "🤩" character (U+1F929, outside the Basic Multilingual Plane), and something about it is breaking pyquery's parsing:


  File "/Users/jake/Library/Caches/pypoetry/virtualenvs/tapas-comic-downloader-Iag5BTTj-py3.9/lib/python3.9/site-packages/pyquery/pyquery.py", line 57, in fromstring
    result = getattr(etree, meth)(context)
  File "src/lxml/etree.pyx", line 3254, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 2, column 1

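The workaround I found is to replace pq(pageReqest.text) with the following, which drops every non-ASCII character before parsing:

    # Keep only ASCII characters (code points below 128), then parse.
    prt = "".join(x for x in pageReqest.text if ord(x) < 128)
    page = pq(prt)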
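
A narrower variant of the same workaround might be worth trying. It is a minimal sketch and assumes the astral character really is the trigger, as the traceback suggests but does not prove. Instead of discarding all non-ASCII text (accented letters, CJK, and so on), it removes only code points above U+FFFF. The helper name `strip_astral` is hypothetical and is not part of pyquery or tapas-dl.

```python
def strip_astral(text: str) -> str:
    """Return text with every code point above U+FFFF removed.

    Characters inside the Basic Multilingual Plane (including
    non-ASCII ones such as accented letters) are kept unchanged.
    """
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


if __name__ == "__main__":
    # The sample contains U+00E9 (kept) and U+1F929 (removed).
    sample = "<p>caf\u00e9 \U0001F929 ok</p>"
    print(strip_astral(sample))
```

With this helper, the call would become `page = pq(strip_astral(pageReqest.text))`. I have not confirmed that this avoids the `XMLSyntaxError` on every platform; it only limits how much of the page text gets thrown away.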