Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rehype-remove-comments: ability to preserve specific comments #50

Closed
4 tasks done
didnotwant opened this issue Jun 6, 2024 · 16 comments
Closed
4 tasks done

rehype-remove-comments: ability to preserve specific comments #50

didnotwant opened this issue Jun 6, 2024 · 16 comments
Labels
💪 phase/solved Post is done

Comments

@didnotwant
Copy link

Initial checklist

Problem

On one of my projects I use nginx SSI Commands to include some company-wide HTML content to one of my pages – the injected HTML just replaces a specific set of HTML comments.
Currently the rehype-remove-comments only allows to set removeConditional flag so that the IE comments can be kept.
I imagine that my use-case could be not the only one in the world – that's why I decided to file this issue.

Solution

My proposal is to add an option called exclude where one can pass an array of regular expressions matching specific content of HTML comments.

Example usage in .rehyperc:

export default {
  plugins: [
    ['./build/rehype-remove-html-comments.mjs', {
      exclude: [
        // We want to preserve HTML comments containing SSI includes commands
        // handled by nginx. See: https://nginx.org/en/docs/http/ngx_http_ssi_module.html
        /# (block|endblock|include)/,
        // … some other use-cases if needed
      ],
    }],
  ],
}

But I'm not sure if such way of configuring it is acceptable in all possible usages of the plugin (I think that currently there's just a flag for a specific reason, e.g. it's easier to use the tools in CLI?).

I created a custom plugin for my repo (I believe as a temporary solution, if my idea is good enough to the maintainers of rehype-minify) which is mostly the same as the current rehype-remove-comments plugin, so let me share the code I prepared:

import {visit} from 'unist-util-visit';

export default function rehypeRemoveHtmlComments(options = {}) {
  const shouldPreserve = (node) => {
    if (!('exclude' in options)) {
      return false;
    }

    if (!Array.isArray(options.exclude)) {
      throw new Error('The `exclude` option requires passing an array.');
    }

    if (options.exclude.some((maybeRegex) => !(maybeRegex instanceof RegExp))) {
      throw new Error('Elements of `exclude` have to be regular expressions.');
    }

    return options.exclude.some((regex) => regex.test(node.value));
  };

  return function (tree) {
    visit(tree, 'comment', function (node, index, parent) {
      if (
        typeof index === 'number' &&
        parent &&
        !shouldPreserve(node)
      ) {
        parent.children.splice(index, 1)
        return index;
      }
    })
  }
}

Alternatives

I wasn't able to find any alternative plugins that could allow doing exactly what I expect in the description of the Problem.
If there are some and anyone knows them, please share with me!

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Jun 6, 2024
@remcohaszing
Copy link
Member

I like it, but I think I would call it preserve and make it accept a regular expression or a function that returns a boolean. This supports more complex use cases.

rehype()
  .use(rehypeRemoveComments, {
    preserve: /# (block|endblock|include)/
  })

rehype()
  .use(rehypeRemoveComments, {
    preserve(content) {
      if (content.includes('# block')) {
        return true
      }

      if (content.includes('# endblock')) {
        return true
      }

      if (content.includes('# include')) {
        return true
      }

      return false
    }
  })

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Jun 6, 2024

Taking a step back, is an option needed at all?
It sounds like the only comments would be NGINX commands.
So all comments could be kept.
By turning this part of the minifier off.

I'm not entirely opposed to adding an option, but it does feel very niche to me.

Even the IE directive pass through could and potentially should be removed in the next major.
IE has dropped below 0.5% of global browser share and keeps dropping.

@wooorm
Copy link
Member

wooorm commented Jun 6, 2024

preserve

Why not just go with test(node: Comment) / test(value: string)?

Even the IE directive pass through could and potentially should be removed in the next major.

👍


Taking a step back, is an option needed at all?
It sounds like the only comments would be NGINX commands.
So all comments could be kept.

I’m 👍 on this Q. Are there comments that you want to drop? Is turning off this plugin to keep all comments viable?

@didnotwant
Copy link
Author

@remcohaszing – good point, I already had the idea on just a regex instead of the array of them, but it seemed to me not too flexible if I ever wanted to preserve more cases of comments. The function is even better in terms of the readability!


@ChristianMurphy, @wooorm – your approach on permanently disabling the HTML comments trimming is what I exactly had done before I decided to have the option I proposed in this issue, so let me add more context to this.

The rehype is used in my repository with statically generated pages (using Astro), where the pages could be in different formats, depending on what's needed for a specific page: .astro, .mdx or even plain .html. The Astro and MDX (as I guess) allow to add comments that won't be rendered in production, the HTML – doesn't. In all of them it's possible to use the HTML comments, so there's a chance that something would leak to prod if the HTML comments aren't removed.

I could use only the non-HTML comments, but I'm not the only member of the team, and we don't want to have to remember about this rule each time we change something in the repo, and also I think it would be hard to establish some linting rule (or a rule for astro check) which ensures that we don't write HTML comments on pages (with the exception example I mentioned: nginx commands).

What do you guys think?

@remcohaszing
Copy link
Member

Why not just go with test(node: Comment) / test(value: string)?

Both the names test / preserve are equally fine with me. I think test is used more often within the unified ecosystem, so that makes sense.

Also passing either the Comment node or the string value is fine with me. Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

Taking a step back, is an option needed at all?

IMO it makes sense. HTML comments are part of the DOM and can have actual meaning. IIRC AngularJS used them?

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

@wooorm
Copy link
Member

wooorm commented Jun 7, 2024

I think test is used more often within the unified ecosystem, so that makes sense.

Indeed.

Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

True. And, preferably we’d support a string too. So that JSON config files are supported.

IIRC AngularJS used them?

How?
Isn’t how Angular uses it the same as how React uses them. Which is exactly why humans can‘t render comments in React, but the framework does?

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

A valid use case, but it seems to apply to JS/CSS/other code — I never see such copyright info in HTML comments. HTML has other ways to show that (<meta>, <link>?). I also don’t see it in other markup/content languages (markdown, svg, xml?)


it's possible to use the HTML comments, so there's a chance that something would leak to prod if the HTML comments aren't removed.

If people can write HTML comments and leak things, they could leak things in any other way: they could leak outside of comments.
It seems like security is your main concern, but I’m not sure security is helped much by this? 🤔
The main reason this exists is for performance instead: so that the HTML sent over the wire is smaller.

which ensures that we don't write HTML comments on pages (with the exception example I mentioned: nginx commands).

Wouldn’t it make sense to use rehype to minify and remove all comments after nginx commands?
Or, to disallow them, as there is already a good alternative: if people want to do such things, they could use Astro / MDX?
rehype-minify is a minifier, it makes things ugly, I’d expect it to interfere with nginx.


I’m leaning on 👍 for such a feature, but I do wonder about the story for this use case.

@remcohaszing
Copy link
Member

Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

True. And, preferably we’d support a string too. So that JSON config files are supported.

Personally I lean against accepting a string until this is something a user actually asks for. They can use JavaScript config files, but I think typically this is used programmatically.

IIRC AngularJS used them?

How? Isn’t how Angular uses it the same as how React uses them. Which is exactly why humans can‘t render comments in React, but the framework does?

Yeah, I think something like that. It has been a while.

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

A valid use case, but it seems to apply to JS/CSS/other code — I never see such copyright info in HTML comments. HTML has other ways to show that (<meta>, <link>?). I also don’t see it in other markup/content languages (markdown, svg, xml?)

I’ve seen such comments while viewing the HTML source of miscellaneous websites over the course of years. I don’t recall which websites exactly, but I know it happens.

@wooorm
Copy link
Member

wooorm commented Jun 17, 2024

Personally I lean against accepting a string until this is something a user actually asks for. They can use JavaScript config files, but I think typically this is used programmatically.

I think this is generally fair—to wait for users to request things—however, all current plugins (as far as I am aware) support the non-JS config so that users don’t have to use the JS config file.

I’ve seen such comments while viewing the HTML source of miscellaneous websites over the course of years. I don’t recall which websites exactly, but I know it happens.

This seems a bit iffy to me 🤔 Humans will see the content, so if there’s copyright on that content, you’d want that available to those humans too. Not just to programmars who view source?


@didnotwant could you comment on my questions to you in #50 (comment)? About security and nginx commands? Thanks!

@remcohaszing
Copy link
Member

Humans will see the content, so if there’s copyright on that content, you’d want that available to those humans too. Not just to programmars who view source?

I’m not judging, only observing. I wouldn’t choose for this either. The exact same thing can be said about JavaScript minifiers preserving such comments.

@ChristianMurphy ChristianMurphy added the 🙉 open/needs-info This needs some more info label Jun 17, 2024

This comment has been minimized.

@wooorm
Copy link
Member

wooorm commented Jun 18, 2024

The exact same thing can be said about JavaScript minifiers preserving such comments.

My point is that it’s completely different: js/css minifiers have this option because OSS licenses (MIT etc) say that you must include the license in the code. So you import js/css from node_modules and get that comment in your bundle and the bundler needs to keep it. There is a special comment syntax for “license” comments so those bundlers know whether to keep it or not.
Markup as found in HTML does not have that. It doesn’t exist in node_modules. There is no MIT license on the tags. You can’t import some HTML into other HTML.
No, the words have licenses, creative commons and such. And thus you need to show that copyright + license visually next to the words.

I’m not judging, only observing.

I would love to see it because it seems so weird to me. 😅

@didnotwant
Copy link
Author

@wooorm – thanks for the mention, I forgot to answer. Anyway:

If people can write HTML comments and leak things, they could leak things in any other way: they could leak outside of comments.
It seems like security is your main concern, but I’m not sure security is helped much by this? 🤔
The main reason this exists is for performance instead: so that the HTML sent over the wire is smaller.

Nope, security isn't the main thing here (primarily we use rehype for minification, though), but to me, removing all the unwanted comments (in HTML, but also in JS, or CSS) is not only about the performance, but also cleaning up possible comments containing e.g. business domain's knowledge – and sometimes particular types of comments should stay on prod, e.g. in minified JS there are the comments about author or licenses. But you're right – at the end of the day we (as developers) are the ones responsible for what's pushed to prod, so it's not my main reason as you wrote, but rather a nice-to-have tool which could ensure us that we don't have to remember about possible leaks from comments.
So it'd be great to keep rehype-minify as a helpful utility, but with ability to preserve some comments on demand, which could be useful not only for our case.

Wouldn’t it make sense to use rehype to minify and remove all comments after nginx commands?
Or, to disallow them, as there is already a good alternative: if people want to do such things, they could use Astro / MDX?
rehype-minify is a minifier, it makes things ugly, I’d expect it to interfere with nginx.

Minification happens during the build of the "dist", and the nginx commands are used to inject some, let's say organisation-wide, HTML fragments (replacing the HTML comments which I'd like to preserve) before serving the already built page to the client.
Of course, some HTML fragments could be Astro components instead of SSI includes, but not each and every of them, or at least not at the very moment (different repositories in different technologies, not-yet migrated stuff etc.).

@wooorm
Copy link
Member

wooorm commented Jun 24, 2024

OK, I think a PR for test(value: string) is acceptable!

@wooorm wooorm added the 👍 phase/yes Post is accepted and can be worked on label Jun 24, 2024
@github-actions github-actions bot removed 🤞 phase/open Post is being triaged manually 🙉 open/needs-info This needs some more info labels Jun 24, 2024

This comment has been minimized.

@wooorm wooorm closed this as completed in 72c2051 Sep 17, 2024

This comment has been minimized.

@wooorm wooorm added the 💪 phase/solved Post is done label Sep 17, 2024
@github-actions github-actions bot removed the 👍 phase/yes Post is accepted and can be worked on label Sep 17, 2024
@wooorm
Copy link
Member

wooorm commented Sep 17, 2024

Released in https://github.com/rehypejs/rehype-minify/releases/tag/rehype-remove-comments%406.1.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💪 phase/solved Post is done
Development

No branches or pull requests

4 participants