Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix whitespace loss, retain doctypes, remove inlined styles from HEAD #9

Closed
wants to merge 7 commits into from

Conversation

willbarrett
Copy link

Hey @danschultzer - I've tweaked the functionality described in issue #8 , and added some test cases. Happy to discuss this work! Here's the high-level:

Improvements:

  • Retain whitespace between tags when inlining CSS
    • Important for inline tags, e.g. <b>this space</b> <i>should be respected</i>
    • Makes the generated markup far more readable
  • Retain doctypes when inlining CSS
    • Some mail clients parse mail differently based on doctype
  • Allows the user to specify the CSS used to lookup style elements
    • Sometimes templates have both CSS included in the HEAD, and inlined CSS, and these two things can be different.
    • Greater flexibility for the lib.
  • Remove inlined styles from document HEAD after inlining
    • Reduces bandwidth usage, less template to read :)

TODO

Why Meeseeks?

Currently, there isn't a sane HTML parser written in Elixir or Erlang, according to multiple places on the interwebs, and copious research. Meeseeks provides a good wrapper and nice functionality around the html5ever parser, written in Rust, and the maintainer's a super-smart super-helpful guy. The main bug being fixed here, the missing whitespace between tags, will result in broken display for most users of this library at some point. I think it's worth moving to a different parser for the functionality necessary to resolve the issue.

@willbarrett
Copy link
Author

CI failure is due to Travis failing when pulling the meeseeks dependency from Github. Not sure what that's about, as deps are pulled successfully locally. Tests are passing locally. The next release of meeseeks will be compatible with this change.

@danschultzer
Copy link
Owner

Looks great @willbarrett, thanks!

For reference, the whitespace issue is found in mochiweb: mochi/mochiweb#166

I think an API that makes it easy to exchange the HTML parser would be the best way of going about this. Then we can provide an option for a pure elixir dependency in case a developer wouldn't want to compile rust code.

The other changes are great too. I've a few ideas for what to improve, but I'll get back to you when I've tried out this PR 😄

@danschultzer
Copy link
Owner

I've opened #10, but it'll require rebasing of this branch.

@willbarrett
Copy link
Author

Nice. I'll rebase + follow that approach Monday.

@danschultzer
Copy link
Owner

Sounds good. I'll be working on removing the rest of the Floki method calls tomorrow and run them all through the HTMLParser API instead.

@danschultzer
Copy link
Owner

Sorry, got busy with some unrelated work. I've merged in #11, and I'll start to work on the issues you've highlighted.

@willbarrett
Copy link
Author

No worries - I got busy too. I'll rebase on this branch today and update it now that your changes are merged.

@danschultzer
Copy link
Owner

FYI, I've worked on the whitespace issue in #12. From what I've read, we want to preserve whitespace only when child elements consists of text nodes and inline-block nodes. So I've let Premailex taken care of the quirks in both Meeseeks and Floki.

@danschultzer
Copy link
Owner

danschultzer commented Apr 23, 2018

I switched to use Meeseeks for the HTMLToPlainText module, and it was full of whitespace. I'm now sanitizing the HTML tree after it's being parsed.

@willbarrett
Copy link
Author

In our case, the full readable whitespace is desirable - premailex is just one step of a long compilation process, so we'd like the result to be readable rather than fully concatenated. Perhaps we could add an option to retain all whitespace vs. only the lexically-relevant whitespace?

@danschultzer
Copy link
Owner

That makes sense, I'll update #12 with the option to retain all whitespace.

@willbarrett
Copy link
Author

OK, cool. Since you're working on that changeset, is there a part of this that you'd like me to break off into a separate PR? I'd be happy to help out since I brought the issues to light.

@danschultzer
Copy link
Owner

That would be great! I prefer to keep the PR's small, so if you could break this one up into separate PR's:

  • Remove inline style tags
  • Option to specify CSS style tag selector
  • Retain doctypes

@willbarrett
Copy link
Author

Awesome. I've got a bunch of PR reviews to get to, but I'll open at least one of those this afternoon. Retaining doctypes is somewhat harder under Floki. Is it acceptable if some of these changes require Meeseeks to work properly?

@danschultzer
Copy link
Owner

Yeah, it's acceptable.

@danschultzer
Copy link
Owner

I was thinking it through, and have rewritten the code so instead of removing whitespace by default, it'll only remove the whitespace when doing plain text conversion. So using Meeseeks, you will be able to preserve the original formatting. v0.2.0 has been released 😀

I'll include the rest of the improvements in a v0.2.1 release.

@willbarrett
Copy link
Author

Nice. Yesterday afternoon went sideways, hopefully I'll have a PR for you this AM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants