Does mochiweb's parse/1 ignore (some) white-space? #166

rrrene · 2016-02-26T22:18:12Z

First off, thanks for an amazing library. I love the parser and how it enabled me to concentrate on the "sanitizing" part while building an HTML sanitizer in Elixir. One thing that struck me as odd was that white-space between a closing and an opening tag seems to be omitted in the parser's return value.

The following is in Elixir syntax and I hope it is understandable. I am sorry that my Erlang is not good enough to translate this for a better bug report. 😞

When we put this binary into mochi_web:

:mochiweb_html.parse("<html>just <b>an</b> <b>other</b> test</html>")

the result is this:

{"html", [], ["just ", {"b", [], ["an"]}, {"b", [], ["other"]}, " test"]}

The space between the closing of the first </b> and the opening of the second <b> is somehow lost. I would have expected the following, where the space is preserved as a "text node":

{"html", [], ["just ", {"b", [], ["an"]}, " ", {"b", [], ["other"]}, " test"]}

But maybe the described behaviour is intended? Or is this a bug?

Thanks again and keep up the good work! 👍


mworrell · 2016-02-26T22:37:11Z

In Zotonic we have an HTML sanitizer. See:

https://github.com/zotonic/z_stdlib/blob/master/src/z_html.erl
https://github.com/zotonic/zotonic/blob/master/src/support/z_sanitize.erl

In our Mochiweb fork we have some changes that still need to be merged upstream.
https://github.com/zotonic/mochiweb/blob/master/src/mochiweb_html.erl

(zotonic001@Lamma)1> mochiweb_html:parse("<html>just <b>an</b> <b>other</b> test</html>").
{<<"html">>,[],
 [<<"just ">>,
  {<<"b">>,[],[<<"an">>]},
  <<" ">>,
  {<<"b">>,[],[<<"other">>]},
  <<" test">>]}

smeevil · 2016-02-26T22:44:44Z

Nice, that is way better than the regex replace of spaces between opening and closing tags, too :D

If you'd still like to go that way, though, I can send you a pull request. (All tests are passing now as well.)

mworrell · 2016-02-26T22:50:04Z

@smeevil You mean sending a pull request to Mochiweb based on the Zotonic fork? That would be very nice. Or a pull to the Zotonic fork for getting the tests working? That is welcome as well 👍

smeevil · 2016-02-26T22:54:33Z

no, sorry :)
I mean the changes I made in the aforementioned commit.
That's where I transform the HTML to replace the spaces with `&nbsp;` where necessary, so that the result parses as expected.

mworrell · 2016-02-26T22:57:46Z

@rrrene ah, in the html_sanitize_ex issue.

Feel free to check the Zotonic fork - it is used a lot in production for lots of HTML (and z_stdlib also sanitizes the CSS).

smeevil · 2016-02-26T22:59:38Z

Oh, I now see I mixed up the issues... sorry :O

rrrene · 2016-02-27T09:55:39Z

@mworrell Thanks for getting back to me so quickly :)

I will check the zotonic fork in the evening, but your example looks very promising! 👍

etrepum · 2016-03-14T17:00:43Z

For the most part, this code was designed to parse out data from HTML (e.g. microformats style data) and XML, not correctly deal with all of HTML.

The workaround I would use is `&#32;` or `&#x20;`. `&nbsp;` has meaning beyond just a space (it's a non-breaking space).
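
For illustration, here is a minimal Elixir sketch of pre-processing the input so that whitespace between a closing `>` and the next `<` is written as the character reference `&#32;` before the string is handed to the parser. The module and function names (`WhitespaceWorkaround`, `protect_inter_tag_spaces/1`) are hypothetical and not part of mochiweb or html_sanitize_ex. It also assumes that collapsing a run of whitespace into a single space is acceptable, and it makes no attempt to skip `<pre>`, `<script>`, or comment content.

```elixir
defmodule WhitespaceWorkaround do
  # Hypothetical helper, not part of mochiweb or html_sanitize_ex.
  # Rewrites any run of literal whitespace that sits directly between
  # a ">" and the following "<" as the character reference "&#32;".
  # Assumption: collapsing the run to one space is acceptable.
  def protect_inter_tag_spaces(html) when is_binary(html) do
    Regex.replace(~r/>[ \t\r\n]+</, html, ">&#32;<")
  end
end

# Usage: only the space between </b> and <b> is rewritten; text such as
# "just " and " test" already contains non-whitespace and is left alone.
input = "<html>just <b>an</b> <b>other</b> test</html>"
IO.puts(WhitespaceWorkaround.protect_inter_tag_spaces(input))
```

Whether the parser then returns the space as its own text node depends on how the mochiweb version in use handles character references, so the parsed result is worth checking against your own inputs.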

jameslong · 2022-10-08T10:45:02Z

Is the current status of this 'Won't fix'?

etrepum · 2022-10-08T12:47:35Z

The current status of this is "Nobody has contributed a fix"

rrrene mentioned this issue Feb 26, 2016

whitespace truncated rrrene/html_sanitize_ex#4

Closed

philss mentioned this issue Dec 18, 2016

Floki removes blank text nodes without option to avoid this philss/floki#75

Open

sadraskol mentioned this issue Jan 7, 2017

update outdated dependencies rrrene/html_sanitize_ex#15

Closed

marvelm mentioned this issue Apr 20, 2017

Update 'mochiweb' to 2.15 rrrene/html_sanitize_ex#22

Merged

danschultzer mentioned this issue Apr 14, 2018

Fix whitespace loss, retain doctypes, remove inlined styles from HEAD danschultzer/premailex#9

Closed
