⚠️ Self-closing tags get corrupted 🚨 #83

Open
n-sviridenko opened this issue Aug 9, 2019 · 4 comments

Comments

@n-sviridenko

The library doesn't handle self-closing syntax on non-void elements (e.g. a self-closing span).

When parsing the following:

<span itemprop="price" content="139.90" />

foo

bar

The span is treated as still open, so "foo ... bar" is absorbed into the price property until a closing </span> tag is found.

The issue is in chtml, which replaces /> with >.

@n-sviridenko
Author

n-sviridenko commented Aug 9, 2019

Steps to reproduce:

var scrape = require('html-metadata');
scrape.loadFromString('<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>').then(e => console.log(JSON.stringify(e)));

// {"schemaOrg":{"items":[{"properties":{"priceCurrency":["PLN"],"price":[" "]}}]}}

Possible resolution:

  1. First, htmlparser2 should be told to recognize self-closing tags:

  var dom = microdataDom(htmlparser.parseDOM(html, {
    decodeEntities: true,
+   recognizeSelfClosing: true
  }), config);

  2. Second, cheerio.load(html).html() should not replace /> with >:

var cheerio = require('cheerio');
cheerio.load('<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>').html()

// '<html><head></head><body><div itemscope><span itemprop="price" content="139.90"> <span itemprop="priceCurrency" content="PLN"></span></span></div></body></html>'
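
To make the nesting problem concrete, here is a small self-contained sketch. It is a toy stack-based parser written for illustration only, not htmlparser2 or cheerio. The names parseToy and describe are hypothetical, and the recognizeSelfClosing flag merely mirrors the name of the htmlparser2 option proposed above. It assumes simple input with no comments, scripts, or void elements.

```javascript
// Toy parser for illustration only (NOT htmlparser2 or cheerio).
// parseToy and describe are hypothetical helper names.
function parseToy(html, options = {}) {
  const recognizeSelfClosing = options.recognizeSelfClosing === true;
  const root = { name: '#root', children: [] };
  const stack = [root];
  // Alternatives: closing tag | opening tag (optional trailing "/") | text
  const tokenRe = /<\/([a-zA-Z][\w-]*)\s*>|<([a-zA-Z][\w-]*)([^<>]*?)(\/?)>|([^<]+)/g;
  let m;
  while ((m = tokenRe.exec(html)) !== null) {
    const parent = stack[stack.length - 1];
    if (m[1] !== undefined) {
      // Closing tag: pop back to the nearest matching open element.
      for (let i = stack.length - 1; i > 0; i--) {
        if (stack[i].name === m[1]) {
          stack.length = i;
          break;
        }
      }
    } else if (m[2] !== undefined) {
      const node = { name: m[2], children: [] };
      parent.children.push(node);
      const selfClosed = recognizeSelfClosing && m[4] === '/';
      // Without the flag, "/>" is handled like ">" and the element stays open.
      if (!selfClosed) stack.push(node);
    } else {
      parent.children.push({ text: m[5] });
    }
  }
  return root;
}

// Render the tree compactly so the nesting is easy to compare.
function describe(node) {
  if (node.text !== undefined) return JSON.stringify(node.text);
  return node.name + '[' + node.children.map(describe).join(', ') + ']';
}

const html = '<div itemscope><span itemprop="price" content="139.90" /> ' +
  '<span itemprop="priceCurrency" content="PLN" /></div>';

console.log(describe(parseToy(html)));
// #root[div[span[" ", span[]]]]
console.log(describe(parseToy(html, { recognizeSelfClosing: true })));
// #root[div[span[], " ", span[]]]
```

With the flag off, the second span ends up nested inside the first, which is consistent with the price: [" "] result in the reproduction. With it on, the two spans are siblings.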

@n-sviridenko
Author

Janpot/microdata-node#8

@mvolz
Collaborator

mvolz commented Aug 10, 2019

Looks like cheeriojs/cheerio#598 might have a solution (setting {xmlMode: true}?)

@n-sviridenko
Author

n-sviridenko commented Aug 10, 2019

That's not enough on its own (see point 1 above). I'm also not sure whether "xml mode" supports html5.
