Various web applications often need to work with strings of HTML on the client-side. This might take place, for instance, as part of a client-side templating solution or perhaps come to play through the process of rendering user-generated content. The key problem is that it remains difficult to perform these tasks in a safe way. This is specifically the case because the naive approach of joining strings together and stuffing them into an Element's innerHTML
is fraught with risks. A very common negative implication concerns the JavaScript execution, which can occur in a number of unexpected ways.
To address the problem, libraries like DOMPurify attempt to carefully manage the inputs and alleviate risks. This is usually accomplished through parsing and sanitizing strings before insertion and takes advantage of an allowlist for constructing a DOM and handling its components. This is considerably safer than doing the same on the server-side, yet much untapped potential can still be observed when it comes the client-side sanitization.
As it stands, every browser has a fairly good idea of when and how it is going to execute code. Capitalizing on this, it is possible to improve the user-space libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner. In other words, we seek to make sure that this happens in a way that is much more likely to be maintained and updated along with the browsers’ ever-changing parser implementations.
Provide a browser-maintained "ever-green", safe, and easy-to-use library for user input sanitization as part of the general web platform.
-
user input sanitization: The basic functionality is to take a string, and turn it into strings that are safe to use and will not cause inadvertent execution of JavaScript.
-
browser-maintained, "ever-green" / as part of the general web platform: The library is shipped with the browser, and will be updated alongside it as bugs or new attack vectors are found.
-
Safe and easy-to-use: The API surface should be small, and the defaults should make sense across a wide range of use cases.
-
Cover existing browser functionality, especially the sanitization of clipboard data.
-
Easy things should be easy. This requires easy-to-use and safe defaults, and a small API surface for the common case.
-
Cover a reasonably wide range of base requirements, but be open to more advanced use cases or future enhancements. This probably requires some sort of configuration or options, ideally in a way that both the developer and a security reviewer should be able to reason about them.
-
Should be integratable into other security mechanisms, both browser built-ins and others.
-
Be poly-fillable, although the polyfill would presumably have different security and performance properties.
Force the use of this library, or any other enforcement mechanism. Some applications will have sanitization requirements that are not easily met by a general purpose library. These should continue to be able to use whichever library or mechanism they prefer. However, the library should play well with other enforcement mechanisms.
Context: Various disagreements over API details lead to a re-design of the Sanitizer API. This presents a new API proposal (April '23, based on #192):
There is a 2x2 set of methods that parse and filter the resulting node tree:
On one axis they differ in the parsing context and are analogous to innerHTML
and DOMParser
's parseFromString()
; on the other axis, they differ
in whether they enforce an XSS-focused security baseline or not. These two
aspects pair well and yield:
Element.setHTML(string, {options})
- Parsesstring
usingthis
as context element, similar in spirit toinnerHTML
; applies a filter, while enforcing an XSS-focused baseline; and finally replaces the children ofthis
with the results.Element.setHTMLUnsafe(string, {options})
- Like above, but it does not enforce a baseline. E.g. if the filter is configured to allow a<script>
element, then a<script>
would be inserted. If no filter is configured then no filtering takes place.Document.parseHTML(string, {options})
(static method) - Creates a new Document instance, and parsesstring
as its content similar to howDOMParser.parseFromString
would. Applies a filter, while enforcing an XSS-focused baseline. Returns the document.Document.parseHTMLUnsafe(string, {options})
(static method) - Like above, but it will not enforce a baseline or apply any filter by default.
Note that while these methods are similar to innerHTML
and DOMParser
's
parseFromString
, we expect some differences in addition to HTML filtering:
For example, these new methods should support declarative shadow DOM by default.
They will not support creation of XML documents.
All variants take an options dictionary with a filter
(naming TBD) key and a
filter configuration. The options dictionary can be easily extended to accept
whatever parsing parameters make sense.
The 'safe' methods may have some built-in anti-XSS behaviours that are not
expressibleby the config, e.g. dropping javascript:
-URLs in contexts
that navigate.
The 'unsafe' methods will not apply any filtering if no explicit config is supplied.
[!Note] The 'unsafe' methods are being worked on here: whatwg/html#9538
The currently proposed API differs in a number of aspects:
- Two sets of methods,
innerHTML
-like andDOMParser
-like. - Since the 'sanitizer' config can now be used in both safe and unsafe ways, it's arguably no longer a sanitizer config but a filter config.
- Enforcement of a security baseline depends on the method. The filter/sanitizer config can now be used differently, either in a guaranteed-secure way or in use-config-as-written way.
- The configuration dictionary differs substantially in syntax.
The new APIs, in their most basic form:
const example = `<b onclick="alert(1)">hello world</b>`;
const element = document.createElement("div");
// Modify element:
element.setHTML(example); // <div><b>hello world</b></div>
element.setHTMLUnsafe(example); // <div><b onclick="alert(1)">hello world</b></div>
// Return a Document instance:
Document.parseHTML(example); // <html><head></head><body><b>hello world</b></body></html>
Document.parseHTMLUnsafe(example); // <html><head></head><body><b onclick="alert(1)">hello world</b></body></html>
Parsing observes its contexts:
const example_tr = `<tr><td>A table row.</td></tr>`;
const table = document.createElement("table");
element.setHTML(example_tr); // <div>A table row.</div>
table.setHTML(example_tr); // <table><tbody><tr><td>A table row.</td></tr></tbody></table>
Document.parseHTML(example_tr); // <html><head></head><body>A table row.</body></html>
All of these would have had identical results if the "unsafe" variants had been used.
Parsing follows HTML parsing rules, unlike innerHTML
, where it depends on the
document type:
const element_xml = new DOMParser().parseFromString("<html xmlns='http://www.w3.org/1999/xhtml'><body><div/></body></html>", "application/xhtml+xml").getElementsByTagName("div")[0];
const example_not_xml = "<bLoCkQuOtE>bla";
element_xml.getRootNode().contentType; // application/xhtml+xml
element_xml.innerHTML = example_not_xml; // Throws.
element_xml.setHTML(example_not_xml); // <div xmlns="http://www.w3.org/1999/xhtml"><blockquote>bla</blockquote></div>
// Note case and closing elements.
element.setHTML(example_not_xml); // Same as above.
The "safe" methods remove all script-y content defined by the platform and let the rest pass:
element.setHTML(`<a href=about:blank onclick=alert(1) onload=alert(2) id=myid class=something><script>alert(3);</script>`);
// <div><a href="about:blank" id="myid" class="something"></a></div>
Note that the context node might also be a script element. In this case adding plain text to it creates new script content:
const sneaky = document.createElement("script");
sneaky.setHTMLUnsafe("alert('Surprise!');");
// <script>alert('Surprise!');</script>
For the "safe" versions this case will be treated specially. setHTML
checks
the context element and calling it on a <script>
element is a no-op.
sneaky.setHTMLUnsafe("boring();"); // <script>boring();</script>
sneaky.setHTML("alert('Surprise!');"); // <script>boring();</script>
The operation of the built-in sanitizer can be configured to suit your applications' needs. Both "safe" and "unsafe" versions can take a configuration.
The "safe" version will ignore configuration items that break its security guarantees:
const an_unsafe_config = new Sanitizer({ 'elements': [ { name: 'script' } ] });
element.setHTML("<script>", { sanitizer: an_unsafe_config }); // <div></div>
element.setHTMLUnsafe("<script>", { sanitizer: an_unsafe_config }); // You now have a script. Congrats.
For elements the HTML namespace is default. For attributes, the null namespace. Other namespaces can be supported. A string entry stands for a dictionary with only the name, in the HTML/null namespace (for elements/attributes, respectively).
const config_with_namespaces = new Sanitizer({
elements: [
'a', // The HTML anchor element.
{ name: 'a' }, // Also the HTML anchor element.
{ name: 'a', namespace: 'http://www.w3.org/1999/xhtml' }, // Another one.
{ name: 'a', namespace: 'http://www.w3.org/2000/svg' } // SVG's anchor element
],
attributes: [
'href', // An href attribute. The one you'd expect on an HTML anchor.
{ name: 'href' }, // The very same.
{ name: 'href', namespace: '' }, // There it is again.
{ name: 'href', namespace: 'http://www.w3.org/1999/xlink' }, // xlink:href. SVG sometimes uses this.
{ name: 'href', namespace: 'http://www.w3.org/1999/xhtml' } // This isn't a thing.
// It won't match any HTML-defined href attributes. Probably the config
// author made an error.
]
});
Note
The config_with_namespaces
example contains multiple entries for the same
element or attribute, to illustrate the syntax. Note that this isn't actually
allowed.
The Sanitizer object can be constructed from a dictionary. The same dictionary can also be used directly in the method options.
const config_dict = {
elements: [ "div", "p", "em", "b", "span" ],
attributes: [ "class", "style" ]
};
// These two should be the same:
const some_html_string = "...";
div.setHTML(some_html_string, {sanitizer: config_dict});
div.setHTML(some_html_string, {sanitizer: new Sanitizer(config_dict)});
Note that implementations are expected to perform normalization work on the configuration, which can be easily re-used and amortized over many calls when used with the Sanitizer object that holds the configuration. To encourage this usage we explicitly instantiate objects in this explainer, outside of this particular sub-section.
// These should have the same results, but likely different performance:
const huge_array_of_strings = [ "...", ... ];
for (const str of huge_array_of_strings) {
div.setHTML(str, {sanitizer: config_dict});
}
const sanitizer = new Sanitizer(config_dict);
for (const str of huge_array_of_strings) {
div.setHTML(str, {sanitizer: sanitizer});
}
There are two ways you can build up a config: Specify the elements & attributes you wish to allow. This is easy to read and makes it easy to understand what to expect in the sanitizer output. Or you can specify what elements & attributes you wish to remove. Or to block, as other sanitizer libraries might call it. This effectively specifies the sanitizer output relative to the built-in list. This can be useful if you wish to mostly retain the built-in defaults.
const config_allow_some_formatting = new Sanitizer({
elements: [ "div", "p", "em", "b", "img" ], // Allows only 5 elements.
attributes: [ "class" ] // Allows only class attributes.
// Output with "safe" and "unsafe" methods are the same for this config.
});
const config_disallow_style_definitions = new Sanitizer({
removeElements: [ "style" ], // Allows the defaults, but without <style>.
removeAttributes: [ "class", "style" ] // No style or class attribute either.
// And not XSS-y stuff, either, if used with a "safe" method.
// Output with "safe" and "unsafe" methods might be quite different.
});
You may also wish to remove elements, but retain their children. This is chiefly useful to remove unwanted formatting from user input, while preserving its textual content.
const config_that_removes_elements_but_preserves_their_children = new Sanitizer({
replaceWithChildrenElements: ["span", "em", "u", "s", "i", "b"]
});
element.setHTML(
"Fancy <b>text</b> with <span style='color:blue'>pizzazz</span>.",
{ sanitizer: config_that_removes_elements_but_preserves_their_children });
// <div>Fancy text with pizzazz.</div>
There is no replaceWithChildrenAttributes
because attribute nodes do not have
children.
replaceWithChildrenElements
applies to its immediate children, i.e. to one
level. Combining elements
with replaceWithChildrenElements
lets you keep
some formatting, but all the text content:
const config_replace_spans = new Sanitizer({
elements: ["b", "i"],
replaceWithChildrenElements: ["span"]
});
// <div>Fancy text with <b>pizzazz</b>.</div>
element.setHTML(
"Fancy <span style='color:blue'>text with <b>pizzazz</b></span>.",
{ sanitizer: config_replace_spans}
);
A common use case is to allow or remove all instances of a given attribute, but this isn't always sufficient. Attribute interpretation depends on the element they are attached to, and so one may also want to act on attributes on specific elements.
In the example config_allow_some_formatting
in the previous chapter
we have allowed the class
attribute on any of allowed elements.
If one wanted to allow class
everywhere, but src
only on <img>
, the
following would do:
const config_with_element_specific_attributes = new Sanitizer({
elements: [
"div", "p","em", "b",
{ name: "img", attributes: [ "src" ] }
],
attributes: ["class"],
});
If you want to remove src
attributes from <input>
elements but retain them
elsewhere, you can use:
const remove_src_attribute_from_input = new Sanitizer({
elements: [{ name: "input", removeAttributes: ["src"]}],
});
Note that the removeAttributes
key is on an allowed element, since removing
the element itself would also remove all the attributes that are part of that
element.
Handling of HTML comment nodes can be controlled by an option. Setting
comments
to true
allows them:
const config_comments: new Sanitizer({ comments: true });
element.setHTML("XXX<!-- Hello world! -->XXX", {sanitizer: config_comments});
// <div>XXX<!-- Hello world! -->XXX</div>
The configuration allows expressing redundant or even contradictory options.
For example, allowing and removing the same element. In cases where the
meaning of a configuration dictionary isn't clear, we will
throw a TypeError
instead of making a best effort attempt at interpreting
the configuration. A well-formed configuration has the following properties:
- It contains either an allow-list or a remove-list, but not both.
- This applies to both element and attribute lists, seperately.
- Note that any config with both, an allow-list and remove-list, can be rewritten by removing the remove-list items from the allow-list and then droping the remove-list entirely.
- Both allow-lists and remove-lists can be combined with replace-with-children-lists.
- The action for any name - allow, remove, or replaceWithChildren - should be
specified only once. E.g. an element name should neither appear twice in an
allow-list, nor should it appear in both an allow-list and a
replace-with-children-list.
- This would apply to short forms as well.
E.g.,
["div", { name: "div", namespace: "http://www.w3.org/1999/xhtml" }]
contains the same name twice and would thus throw. - While lists with duplicate element or attribute names could be coalesced, it is ambiguous what the meaning of duplicate elements with different element-dependent attribute lists would be.
- This would apply to short forms as well.
E.g.,
- The name must be set.
// Mixing allow and block lists throws.
const config_that_mixes_allow_and_block_lists = new Sanitizer({
elements: ["i", "u"],
removeElements: ["u", "s"],
});
element.setHTML("bla", {sanitizer: config_that_mixes_allow_and_block_lists}); // throws
// Mixing allow and replace with children lists works.
const config_that_retains_simple_styling_but_most_text = new Sanitizer({
elements: ["p", "b", "i"],
replaceWithChildrenElements: ["div", "span", "em", "u", "s", "li"],
});
const styled_text = "<p>Some <span style='color: blue'>colourful</span> <u>styled</u> <b>text</b>";
// <div><p>Some colourful styled <b>text</b></p></div>
element.setHTML(styled_text, {sanitizer: config_that_retains_simple_styling_but_most_text});
// Duplicate entries throw.
const config_with_dupes = new Sanitizer({
elements: [ "div", { name: "div", namespace: "http://www.w3.org/1999/xhtml" } ]
});
element.setHTML("bla", {sanitizer: config_with_dupes}); // throws.
const config_with_dupes2 = new Sanitizer({
elements: [
{ name: "div", attributes: ["class"] },
{ name: "div", attributes: ["style"] }
] });
element.setHTML("bla", config_with_dupes2); // throws.
Listing an attribute in the "global" allow-list and in an element specific one is allowed. In this case, the specific action takes precedence.
const config_with_local_and_global_attributes = new Sanitizer({
elements: [ "span", { name: "b", removeAttributes: [ "class" ] } ],
attributes: ["class"]
});
// <div><span class="a">abc</span> <b>def</b></div>
element.setHTML("<span class='a'>abc</span> <b class='b'>def</b>",
{sanitizer: config_with_local_and_global_attributes});
If you would like to better understand what a given configuration will do, you
can query a Sanitizer
(and possibly build a new config out of an existing one):
const a_simple_config = new Sanitizer({ elements: [ "div", "p", "span", "script" ] });
a_simple_config.get();
// The result will be quite long. It'll look something like this:
{
elements: [
{ "name": "div", "namespace": "http://www.w3.org/1999/xhtml" },
{ "name": "p", "namespace": "http://www.w3.org/1999/xhtml" },
{ "name": "span", "namespace": "http://www.w3.org/1999/xhtml" }
],
attributes: [
{ "name": "href", "namespace": "" },
{ "name": "class, "namespace": "" },
{ "name": "id", "namespace": "" },
// ... many more
]
};
Note that:
- The returned config entries all have the "long" form with explicit name and namespace.
- The
"script"
element has disappeared. Because, when used in a "safe" version of the API, it wouldn't be allowed. - Note that we suddenly have an
"attributes"
key that represent the defaults.
But what would the unsafe versions do with this config? Just ask:
a_simple_config.getUnsafe();
// The result:
{
elements: [
{ "name": "div", "namespace": "http://www.w3.org/1999/xhtml" },
{ "name": "p", "namespace": "http://www.w3.org/1999/xhtml" },
{ "name": "span", "namespace": "http://www.w3.org/1999/xhtml" }
{ "name": "script", "namespace": "http://www.w3.org/1999/xhtml" }
],
attributes: [
// ... many more. Should be the same list as above.
]
};
The configuration that is returned corresponds to what the specification calls a canonical configuration: Names are resolved into their explicit name & namespace form. But some keys are also processed further. For example:
new Sanitizer({
removeElements: [ "span" ],
removeAttributes: ["id", "class"]
}).get();
// The result will be quite long. It'll look something like this:
{
elements: [
{ "name": "div", "namespace": "http://www.w3.org/1999/xhtml" },
{ "name": "p", "namespace": "http://www.w3.org/1999/xhtml" },
// ... many more. But no span.
],
attributes: [
{ "name": "href", "namespace": "" },
// ... many more. But no id or class.
]
};
Note that here, the remove-lists are converted to their allow-list equivalents, based on the built-in defaults.