html5lib
There's now an 'undocumented' html5lib integration, which helped me figure out how to integrate the parts of minidom that exist, i.e. NamedNodeMap. I've only just started, so it won't render to pyml yet, but it should have all the DOM methods available.
You will also have to pip install html5lib to use it. I've created an ext package for libraries I interface with but don't include in the requirements. Later on I may add, for example, a namespace for Django, a filter for Jinja, a pdfkit integration or a CSS minifier, so these kinds of things can go in there.
Here's an example. It ran fine across several websites until I killed it.
import requests
import html5lib
from domonic.ext.html5lib_ import getTreeBuilder

# Placeholder list of hosts to crawl; substitute your own.
sites = ["example.com"]

for site in sites:
    r = requests.get("https://" + site)
    parser = html5lib.HTMLParser(tree=getTreeBuilder())
    page = parser.parse(r.content.decode("utf-8"))
    links = page.getElementsByTagName('a')
    for link in links:
        try:
            print(link.href)
        except Exception:
            # no href on this tag
            pass
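
For comparison, here is a minimal sketch of the same link extraction using only the standard library's xml.dom.minidom, which is where NamedNodeMap comes from. It does not use domonic or html5lib at all; the SAMPLE markup and the collect_hrefs helper are made up for illustration, and minidom needs well-formed XML rather than real-world HTML.

```python
from xml.dom import minidom

# Hypothetical sample markup; it must be well-formed XML for minidom to parse it.
SAMPLE = '<div><a href="https://example.com">one</a><a name="top">two</a></div>'


def collect_hrefs(markup):
    """Return the href of every <a> element that has one."""
    doc = minidom.parseString(markup)
    hrefs = []
    for anchor in doc.getElementsByTagName("a"):
        # Element.attributes is a NamedNodeMap keyed by attribute name.
        attrs = anchor.attributes
        # Check for the attribute up front instead of catching an exception.
        if anchor.hasAttribute("href"):
            hrefs.append(attrs["href"].value)
    return hrefs


print(collect_hrefs(SAMPLE))
```

Checking hasAttribute first keeps anchors without an href (such as named anchors) from needing a try/except around every lookup.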