Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dnotes
Copy link

@dnotes dnotes commented Aug 1, 2023

I'm opening this pull request for your re-consideration of issue #232. My reasoning is this:

  1. Although this syntax is invalid HTML, it has de-facto support in all browsers.
  2. Some people are making this mistake in writing their HTML.
  3. Converting that HTML to Markdown changes the semantic meaning that is clear in the original.
  4. It wasn't that hard to fix.

Please see my comment on #232.

Note that this PR differs slightly from the code that I pasted there; I think it's unnecessary to "fix" the case with an empty <li> element (<li><ul>...</ul></li>) because the meaning is preserved in the resulting markdown.

dnotes and others added 4 commits July 31, 2023 15:33
- fix invalid markdown when <ul> is nested inside an empty <li>
- add support for nested <ul>/<ol> outside of <li>
  (a common and widely supported usage, though not technically valid)
- empty li is allowed in html and markdown (who does this on purpose?)
- fixed edge case of numbering when <ol>s are nested without <li>
@alexander-turner
Copy link

I support merging this PR. I had to check out the repo and cherry-pick these commits using gh pr checkout 445. It fixed my problem.

@pavelhoral
Copy link
Collaborator

I don't think "being simple / easy to fix" or "browsers are able to display that" is a valid argument here. The decision for this library (made long time ago) is that it supports valid HTML and nothing more. We don't want to break this rule as it might invite all sorts of non-standard hacks.

However we want to allow users to modify rules and extend the functionality if they want to do that. That is a long standing task (unfortunatelly with very little momentum behind it at the moment).

But various errors in the HTML can be easily fixed by simply pre-processing the DOM before passing it to Turndown. In that way you don't have to make Turndown do anything non-standard. I think that would be ideal solution for this usecase.

(side note - I am not a full maintainer here, I have simply fixed few things here and there, so my opinion is not binding)

@dnotes
Copy link
Author

dnotes commented Jun 24, 2024

I can understand saying that this shouldn't be supported, and I'll respect if that's the decision because of performance or principles or whatever, and I don't want to be difficult...but the phrase "browsers are able to display that" drastically and unfairly understates the argument. This is a de-facto standard, not a hack. As far as I can tell, absolutely every browser displays this HTML in exactly the same way by default. The lists are also clearly nested in the code. Turndown corrupts the meaning of content that has been clearly and consistently conveyed by every browser since at least the days of IE8. Browser screenshots from BrowserStack below.

Screen Shot 2024-06-24 at 12 51 37 PM

I think this is as strongly as I can state the case; I'm gonna back out now. Cheers!

@dnotes
Copy link
Author

dnotes commented Jun 24, 2024

Oh, and even if this is not committed, you shouldn't have to cherry-pick the commits or preprocess the DOM to fix this in your own code; since Turndown does allow you to add and replace the rules as needed, you should be able to copy/paste the rules for lists and list items from the Turndown code and make the modifications there. I'm posting some code from one project for reference, but you might have to change it for your uses. Hope this helps @alexander-turner so that you don't have to basically maintain your own fork of the entire project.

let toMd = new TurndownService({headingStyle: 'atx', emDelimiter: '*', codeBlockStyle: 'fenced'})

  toMd.addRule('list', {
    filter: ['ul', 'ol'],

    replacement: function (content, node) {
      var parent = node.parentNode

      // CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
      if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
        content = '    ' + content
          .replace(/^\n+/, '') // remove leading newlines
          .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
          .replace(/\n/gm, '\n    ') // indent
      }

      if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
        return '\n' + content
      } else {
        return '\n\n' + content + '\n\n'
      }
    }
  })
 .addRule('listItem', {
    filter: 'li',

    replacement: function (content, node, options) {
      content = content
        .replace(/^\n+/, '') // remove leading newlines
        .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
        .replace(/\n/gm, '\n    ') // indent

      var prefix = options.bulletListMarker + '   '
      var parent = node.parentNode

      // CHANGE: Don't output two markers if there is an empty li.
      if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = '    '

      else if (parent.nodeName === 'OL') {
        var start = parent.getAttribute('start')
        // CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
        var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
        prefix = (start ? Number(start) + index : index + 1) + '.  '
      }

      return (
        prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
      )
    }
  })

@martincizek
Copy link
Collaborator

martincizek commented Sep 8, 2024

You might want to use something like this to preprocess the invalid HTML. Update: this lib unfortunately does not fix nested lists.

Still, preprocessing the DOM is the cleaner way to do it. The more the rules contains something that looks inside and around the processed elements, the less is the code readable and the less stable the rules get, especially for users customizing them. Also, we want to track context in the future - the rules will be provided by the rule path from DOM root to the current node. Which is something that needs reasonable DOM input and one-thing per rule approach.

What can be added in future versions, is a DOM preprocessing hook that would be called after the DOM is cloned by turndown.

By the way, where did you guys get your HTMLs with broken nested lists? What generates it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants