Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

dnotes · 2023-08-01T01:21:08Z

I'm opening this pull request for your re-consideration of issue #232. My reasoning is this:

Although this syntax is invalid HTML, it has de-facto support in all browsers.
Some people are making this mistake in writing their HTML.
Converting that HTML to Markdown changes the semantic meaning that is clear in the original.
It wasn't that hard to fix.

Note that this PR differs slightly from the code that I pasted there; I think it's unnecessary to "fix" the case with an empty <li> element (<li><ul>...</ul></li>) because the meaning is preserved in the resulting markdown.

- fix invalid markdown when <ul> is nested inside an empty <li> - add support for nested <ul>/<ol> outside of <li> (a common and widely supported usage, though not technically valid)

- empty li is allowed in html and markdown (who does this on purpose?) - fixed edge case of numbering when <ol>s are nested without <li>

alexander-turner · 2024-06-22T00:11:58Z

I support merging this PR. I had to check out the repo and cherry-pick these commits using gh pr checkout 445. It fixed my problem.

pavelhoral · 2024-06-23T20:57:41Z

I don't think "being simple / easy to fix" or "browsers are able to display that" is a valid argument here. The decision for this library (made long time ago) is that it supports valid HTML and nothing more. We don't want to break this rule as it might invite all sorts of non-standard hacks.

However we want to allow users to modify rules and extend the functionality if they want to do that. That is a long standing task (unfortunatelly with very little momentum behind it at the moment).

But various errors in the HTML can be easily fixed by simply pre-processing the DOM before passing it to Turndown. In that way you don't have to make Turndown do anything non-standard. I think that would be ideal solution for this usecase.

(side note - I am not a full maintainer here, I have simply fixed few things here and there, so my opinion is not binding)

dnotes · 2024-06-24T01:38:21Z

I can understand saying that this shouldn't be supported, and I'll respect if that's the decision because of performance or principles or whatever, and I don't want to be difficult...but the phrase "browsers are able to display that" drastically and unfairly understates the argument. This is a de-facto standard, not a hack. As far as I can tell, absolutely every browser displays this HTML in exactly the same way by default. The lists are also clearly nested in the code. Turndown corrupts the meaning of content that has been clearly and consistently conveyed by every browser since at least the days of IE8. Browser screenshots from BrowserStack below.

I think this is as strongly as I can state the case; I'm gonna back out now. Cheers!

dnotes · 2024-06-24T02:12:53Z

Oh, and even if this is not committed, you shouldn't have to cherry-pick the commits or preprocess the DOM to fix this in your own code; since Turndown does allow you to add and replace the rules as needed, you should be able to copy/paste the rules for lists and list items from the Turndown code and make the modifications there. I'm posting some code from one project for reference, but you might have to change it for your uses. Hope this helps @alexander-turner so that you don't have to basically maintain your own fork of the entire project.

let toMd = new TurndownService({headingStyle: 'atx', emDelimiter: '*', codeBlockStyle: 'fenced'})

  toMd.addRule('list', {
    filter: ['ul', 'ol'],

    replacement: function (content, node) {
      var parent = node.parentNode

      // CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
      if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
        content = '    ' + content
          .replace(/^\n+/, '') // remove leading newlines
          .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
          .replace(/\n/gm, '\n    ') // indent
      }

      if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
        return '\n' + content
      } else {
        return '\n\n' + content + '\n\n'
      }
    }
  })
 .addRule('listItem', {
    filter: 'li',

    replacement: function (content, node, options) {
      content = content
        .replace(/^\n+/, '') // remove leading newlines
        .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
        .replace(/\n/gm, '\n    ') // indent

      var prefix = options.bulletListMarker + '   '
      var parent = node.parentNode

      // CHANGE: Don't output two markers if there is an empty li.
      if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = '    '

      else if (parent.nodeName === 'OL') {
        var start = parent.getAttribute('start')
        // CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
        var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
        prefix = (start ? Number(start) + index : index + 1) + '.  '
      }

      return (
        prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
      )
    }
  })

martincizek · 2024-09-08T20:53:16Z

You might want to use something like this to preprocess the invalid HTML. Update: this lib unfortunately does not fix nested lists.

Still, preprocessing the DOM is the cleaner way to do it. The more the rules contains something that looks inside and around the processed elements, the less is the code readable and the less stable the rules get, especially for users customizing them. Also, we want to track context in the future - the rules will be provided by the rule path from DOM root to the current node. Which is something that needs reasonable DOM input and one-thing per rule approach.

What can be added in future versions, is a DOM preprocessing hook that would be called after the DOM is cloned by turndown.

By the way, where did you guys get your HTMLs with broken nested lists? What generates it?

dnotes and others added 4 commits July 31, 2023 15:33

Update commonmark-rules.js

1344a90

- fix invalid markdown when <ul> is nested inside an empty <li> - add support for nested <ul>/<ol> outside of <li> (a common and widely supported usage, though not technically valid)

fix code style

93dd22f

add tests and fix small issues; revert empty li

eaf8eeb

- empty li is allowed in html and markdown (who does this on purpose?) - fixed edge case of numbering when <ol>s are nested without <li>

remove blank lines and comments from commonmark-rules.js

e7bd2de

alexander-turner mentioned this pull request Sep 8, 2024

Content trimming #222

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

dnotes commented Aug 1, 2023

alexander-turner commented Jun 22, 2024

pavelhoral commented Jun 23, 2024

dnotes commented Jun 24, 2024

dnotes commented Jun 24, 2024

martincizek commented Sep 8, 2024 •

edited

Loading

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Are you sure you want to change the base?

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Conversation

dnotes commented Aug 1, 2023

alexander-turner commented Jun 22, 2024

pavelhoral commented Jun 23, 2024

dnotes commented Jun 24, 2024

dnotes commented Jun 24, 2024

martincizek commented Sep 8, 2024 • edited Loading

martincizek commented Sep 8, 2024 •

edited

Loading