Does mochiweb's parse/1 ignore (some) white-space? #166

Open
rrrene opened this issue Feb 26, 2016 · 10 comments

Comments

@rrrene

rrrene commented Feb 26, 2016

First off, thanks for an amazing library. I love the parser and how it enabled me to concentrate on the "sanitizing" part while building an HTML sanitizer in Elixir. One thing that struck me as odd was that white-space between a closing and an opening tag seems to be omitted in the parser's return value.

The following is in Elixir syntax and I hope it is understandable. I am sorry that my Erlang is not good enough to translate this for a better bug report. 😞

When we pass this binary to mochiweb:

:mochiweb_html.parse("<html>just <b>an</b> <b>other</b> test</html>")

the result is this:

{"html", [], ["just ", {"b", [], ["an"]}, {"b", [], ["other"]}, " test"]}

The space between the closing of the first </b> and the opening of the second <b> is somehow lost. I would have expected the following, where the space is preserved as a "text node":

{"html", [], ["just ", {"b", [], ["an"]}, " ", {"b", [], ["other"]}, " test"]}

But maybe the described behaviour is intended? Or is this a bug?

Thanks again and keep up the good work! 👍

@mworrell
Contributor

In Zotonic we have an HTML sanitizer. See:

https://github.com/zotonic/z_stdlib/blob/master/src/z_html.erl
https://github.com/zotonic/zotonic/blob/master/src/support/z_sanitize.erl

In our Mochiweb fork we have some changes that still need to be merged upstream.
https://github.com/zotonic/mochiweb/blob/master/src/mochiweb_html.erl

(zotonic001@Lamma)1> mochiweb_html:parse("<html>just <b>an</b> <b>other</b> test</html>").
{<<"html">>,[],
 [<<"just ">>,
  {<<"b">>,[],[<<"an">>]},
  <<" ">>,
  {<<"b">>,[],[<<"other">>]},
  <<" test">>]}

@smeevil

smeevil commented Feb 26, 2016

Nice, that is way better than the regex replace of spaces between closing and opening tags with &nbsp; :D

If you'd still like to go that way, though, I can send you a pull request. (All tests are passing now as well.)

@mworrell
Contributor

@smeevil You mean sending a pull request to Mochiweb based on the Zotonic fork? That would be very nice. Or a pull to the Zotonic fork for getting the tests working? That is welcome as well 👍

@smeevil

smeevil commented Feb 26, 2016

No, sorry :)
I mean the changes I made in the aforementioned commit.
That's where I transform the HTML, replacing the spaces with &nbsp; where necessary so that the result parses as expected.

@mworrell
Contributor

@rrrene ah, in the html_sanitize_ex issue.

Feel free to check out the Zotonic fork. It is used in production on lots of HTML (and z_stdlib also sanitizes the CSS).

@smeevil

smeevil commented Feb 26, 2016

Oh, I now see I mixed up the issues... sorry :O

@rrrene
Author

rrrene commented Feb 27, 2016

@mworrell Thanks for getting back to me so quickly :)

I will check the zotonic fork in the evening, but your example looks very promising! 👍

@etrepum
Member

etrepum commented Mar 14, 2016

For the most part, this code was designed to parse data out of HTML (e.g. microformats-style data) and XML, not to deal correctly with all of HTML.

The workaround I would use is &#32; or &#x20;. &nbsp; has meaning beyond just a space (it's a non-breaking space).
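
For readers who want to try that workaround from Elixir, here is a minimal sketch of a pre-processing step. The module name `InterTagSpace` and its `encode/1` function are hypothetical and not part of mochiweb; the sketch only rewrites the input string and assumes, as suggested above, that a `&#32;` character reference fares better in the parser than a bare space between tags.

```elixir
defmodule InterTagSpace do
  # Hypothetical helper, not part of mochiweb.
  # Replaces every run of whitespace that sits directly between a ">" and a "<"
  # with a single "&#32;" character reference. The lookbehind and lookahead keep
  # the angle brackets themselves out of the match.
  def encode(html) when is_binary(html) do
    # A function replacement is used so the replacement text is taken literally.
    Regex.replace(~r/(?<=>)\s+(?=<)/, html, fn _whitespace -> "&#32;" end)
  end
end

html = "<html>just <b>an</b> <b>other</b> test</html>"
encoded = InterTagSpace.encode(html)

# `encoded` is what would then be handed to :mochiweb_html.parse/1.
IO.puts(encoded)
```

Note that this is deliberately naive: it also rewrites whitespace between block-level tags and inside elements such as `<pre>` or `<script>`, so it is probably only suitable for small, inline-heavy fragments like the one in this issue.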

@jameslong

Is the current status of this 'Won't fix'?

@etrepum
Member

etrepum commented Oct 8, 2022

The current status of this is "Nobody has contributed a fix"


5 participants