Apply the standard rule mechanism also to text nodes #304

martincizek · 2020-03-01T21:56:29Z

There might be strong reasons for creating rules also for text nodes. This simple patch makes it possible. We believe it even makes the current code more consistent, as we can rely on the node name "#text".

Background:
Our use case is related to text escaping. As the docs admit, Turndown's escaping is quite simplistic and adds unnecessary backslashes. Unfortunately, any simple GFM escaping also necessarily corrupts data. We aim at nearly lossless conversions, so we have developed GfmEscape to address context-dependent escaping.

I admit that full escaping complexity might be overkill for Turndown.
But Turndown is still a fantastic framework for user-developed conversions, so let's just provide the desired hook. :)

Thank you!


domchristie

Hey @martincizek!
Thanks for all your work and your detailed pull requests 💫

I really like how this cleans up the process function. However, I'm not sure it's the right solution yet. As you've figured out, escaping in Turndown is simple but naive. It works for most cases, but it's often customised. I'd like to offer an improved system for customising escaping, perhaps along the lines of addEscape, removeEscape (custom escaping is also discussed in #242 (comment)). Is this something you'd be interested in helping out on?

domchristie · 2020-05-11T19:15:38Z

src/turndown.js

-    var replacement = ''
-    if (node.nodeType === 3) {
-      replacement = node.isCode ? node.nodeValue : self.escape(node.nodeValue)
-    } else if (node.nodeType === 1) {
-      replacement = replacementForNode.call(self, node)
-    }
-


domchristie · 2020-05-11T19:30:46Z

src/commonmark-rules.js

+  replacement: function (content, node, options) {
+    if (node.isCode) return node.nodeValue
+    return options.escapes.reduce(function (accumulator, escape) {
+      return accumulator.replace(escape[0], escape[1])
+    }, node.nodeValue).trim()
+  }
+}


Slight concern here is that it requires a fair bit of knowledge to add to the behaviour i.e. a developer has to remember to:

pass through unchanged code content

iterate over the existing escapes and perform the replacements

trim the value
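
One way to picture the concern is a small wrapper that performs these three steps once, so that a custom text rule only supplies its own transformation. This is a minimal sketch, not part of Turndown or of this PR: `makeTextReplacement` is a hypothetical helper, the nodes are plain stand-in objects rather than DOM text nodes, and the `escapes` list is shortened for illustration.

```javascript
// Hypothetical helper (not a Turndown API): wraps the three steps listed above.
function makeTextReplacement (transform) {
  return function (content, node, options) {
    // 1. Pass code content through unchanged.
    if (node.isCode) return node.nodeValue
    // 2. Apply the existing escapes in order.
    var escaped = options.escapes.reduce(function (accumulator, escape) {
      return accumulator.replace(escape[0], escape[1])
    }, node.nodeValue)
    // 3. Trim the value, then apply the custom behaviour.
    return transform(escaped.trim(), node, options)
  }
}

// Stand-in data: a shortened escapes list and plain objects instead of DOM nodes.
var options = { escapes: [[/\\/g, '\\\\'], [/\*/g, '\\*'], [/_/g, '\\_']] }

// Example custom behaviour: replace non-breaking spaces with regular spaces.
var replacement = makeTextReplacement(function (text) {
  return text.replace(/\u00a0/g, ' ')
})

var plain = replacement('', { nodeValue: ' a*b_c ', isCode: false }, options)
var code = replacement('', { nodeValue: 'a*b_c', isCode: true }, options)

console.log(plain) // a\*b\_c
console.log(code)  // a*b_c
```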

martincizek · 2020-05-23T08:55:34Z

Hey @domchristie, thank you for your reply! Having gone through all of Turndown's code over time (again, nice job!), I admit another approach would be better: introducing textReplacement (similarly to blankReplacement etc.).

We have done it, please check this commit, we just somehow forgot to make a new PR and deprecate this PR. :) Is this the way to go?

Regarding relation to escaping:

I believe textReplacement is still a valid extension point even if escaping were totally customizable.
Even though we can now override the escape() method, it is not called for isCode nodes, and the node and options arguments are not passed to it. All of these are necessary for implementing comprehensive escaping.

martincizek · 2020-05-23T09:02:48Z

Regarding help with the escaping subsystem:

Currently, iterative escaping is used, i.e. all text is escaped according to the first replacement, the result is escaped with the second replacement, etc.
Iterative escaping can only work in the simplest cases, which is why we created the UnionReplacer project. Its README also describes its Turndown-related background and the issues with different escaping approaches.
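
To illustrate the difference, here is a minimal hand-rolled sketch. It does not use UnionReplacer's actual API; `escapeIteratively` and `escapeInOnePass` are hypothetical names, and the sketch assumes the patterns contain no capture groups of their own. The escapes are deliberately listed in an order that breaks the iterative approach.

```javascript
// Iterative escaping: each replacement sees the output of the previous one.
function escapeIteratively (text, escapes) {
  return escapes.reduce(function (accumulator, escape) {
    return accumulator.replace(escape[0], escape[1])
  }, text)
}

// Single-pass escaping: all patterns are joined into one alternation, so each
// input position is matched at most once and replaced text is never rescanned.
// Assumption: the individual patterns contain no capture groups.
function escapeInOnePass (text, escapes) {
  var union = new RegExp(escapes.map(function (escape) {
    return '(' + escape[0].source + ')'
  }).join('|'), 'g')
  return text.replace(union, function (match) {
    // Callback arguments are: match, group 1..n, offset, whole string.
    for (var i = 0; i < escapes.length; i++) {
      if (arguments[i + 1] !== undefined) return escapes[i][1](match)
    }
    return match
  })
}

function backslash (match) { return '\\' + match }

// Order matters for the iterative version: the backslash rule runs last and
// re-escapes the backslash that the asterisk rule has just inserted.
var escapes = [
  [/\*/g, backslash],
  [/\\/g, backslash]
]

var input = 'a*b\\c' // the characters: a*b\c
var iterative = escapeIteratively(input, escapes)
var onePass = escapeInOnePass(input, escapes)

console.log(iterative) // a\\*b\\c  (the asterisk is no longer escaped)
console.log(onePass)   // a\*b\\c
```

In the iterative result the asterisk ends up preceded by an escaped backslash, so it becomes active Markdown again. Reordering the escapes fixes this particular case, but the single-pass version does not depend on ordering for it.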

So I believe that Turndown would need something like UnionReplacer to do custom escaping correctly, efficiently, and in a user-friendly way (without the need for placeholder magic).

The GfmEscape project mentioned above is actually a very thoroughly configured UnionReplacer with comprehensive replacements, so it is not appropriate to integrate directly.

But UnionReplacer seems to be a perfect match for configurable escaping (actually it is designed for it :-)). I know Turndown does not use any libs, but this one:

was actually designed to be used with tools like Turndown
it is something like 100 lines (but extreme effort was put into making it this small)
Our performance tests suggested that the only possible performance cost comes from the added code (those 100 lines). It becomes more efficient than iterative escaping at around 5 replacements, depending on the JS engine.

Is integrating UnionReplacer as a dependency an option when making escaping configurable? If it is, then we can help. :)

If it were done this way, the comprehensive GfmEscape rules could then be used as a kind of Turndown plugin.

martincizek · 2020-07-06T13:22:32Z

Closing this PR in favour of #339, which provides related hooks in a way more consistent with the rest of Turndown.

Move TextNode escaping to a separate rule, escapes to options and sim…

6f3306e

…plify node processing.

michbart force-pushed the separate-textnode-rule branch from 706dc1d to 6f3306e Compare March 12, 2020 13:31

domchristie reviewed May 11, 2020


martincizek mentioned this pull request Jun 27, 2020

Characters inside plain text urls should not be escaped #324

Open

martincizek mentioned this pull request Jul 6, 2020

Provide text escaping and replacement hooks for context-dependent escaping #339

Draft

martincizek closed this Jul 6, 2020
