Use Rmodepdf and LuaXML to display block HTML elements #469

Open
Witiko opened this issue Jul 22, 2024 · 5 comments
Labels: defaults (Related to default option values and renderer prototypes), latex (Related to the LaTeX interface and implementation), tug 2024 (Related to the TUG 2024 conference)

@Witiko
Owner

Witiko commented Jul 22, 2024

As discussed with @michal-h21 before and after their TUG 2024 talk (slides, preprint), we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.

For inline HTML elements, this does not seem applicable, because inline HTML elements produce renderers that do not necessarily represent complete HTML fragments that can be represented in DOM:

$ docker run --rm -it witiko/markdown markdown-cli html=true <<< 'Hello <i>world</i>!'
\markdownRendererDocumentBegin
Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!\markdownRendererDocumentEnd

We can't easily change this, since the CommonMark standard allows Markdown markup within inline HTML elements.

@Witiko added the latex, defaults, and tug 2024 labels Jul 22, 2024
@Witiko added this to the 3.7.1 milestone Jul 22, 2024
@michal-h21

You can wrap HTML fragments in some dummy element to prevent parsing issues. I also think that you can process the text nodes for Markdown, so it should be possible to use it here.

This is a proof of concept:

kpse.set_program_name "luatex"
local domobject = require("luaxml-domobject")
local transform = require("luaxml-transform")

local function parse(block)
  -- wrap the text in a container element, so it doesn't matter that the HTML markup can be incomplete
  -- <body> is a good candidate
  local dom = domobject.html_parse("<body>" .. block .. "</body>")
  return dom
end



local function should_expand(element)
  -- test if we should expand markdown in this element
  local element_name = element:get_element_name()
  -- do some tests with the element name
  -- ...
  -- for now, just return true
  return true
end


local function process_markdown(text)
  -- this is just an example; the real function would need to be much more complex
  text = text:gsub("%*(..-)%*", "\\textit{%1}")
  return text
end

local function expand_markdown(element)
  -- recursively loop over child elements and expand markdown in text nodes
  for i, child in ipairs(element:get_children()) do
    if child:is_element() then
      -- recurse for child elements
      expand_markdown(child)
    elseif child:is_text() and should_expand(element) then
      -- run this only on text nodes in elements that should be processed
      child._text = process_markdown(child._text)
    end
  end
end

local transformer = transform.new()

-- disable escaping of TeX commands and braces
transformer.unicodes = {
  [92] = nil,
  [123] = nil,
  [125] = nil,
}

-- actions for HTML elements
transformer:add_action("i", "\\textit{%s}")
transformer:add_action("b", "\\textbf{%s}")

local test = "Hello <i>world</i>! Another text <b>with *markdown*</b>"
local dom = parse(test)
expand_markdown(dom:root_node())

-- debugging print of the processed DOM
print(dom:serialize())

-- and now convert to TeX
print(transformer:process_dom(dom))

For this test string: "Hello <i>world</i>! Another text <b>with *markdown*</b>" it produces the following output:

<body>Hello <i>world</i>! Another text <b>with \textit{markdown}</b></body>
Hello \textit{world}! Another text \textbf{with \textit{markdown}}

@Witiko
Owner Author

Witiko commented Jul 22, 2024

That's a compelling approach: First, parse the Markdown document as an HTML document, construct a DOM, and only then convert the text nodes from Markdown to LaTeX. However, it seems incompatible with the current approach of CommonMark in general and the Markdown package in particular, where we first parse the whole document as a Markdown document and only then identify HTML code within it.

An alternative would be to redefine \markdownRendererInlineHtmlTag to scan ahead for all other \markdownRendererInlineHtmlTags within the same paragraph/block, replace the intervening texts with unique identifiers, process the combined text with LuaXML, and replace the identifiers with the intervening texts. For example:

  1. Take Hello \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>}!.
  2. Combine all inline tags to a single string: <i>world</i>.
  3. Replace the intervening texts with unique identifiers: <i>TEXT1</i>.
  4. Process with LuaXML: \textit{TEXT1}
  5. Replace the unique identifiers with intervening texts: \textit{world}.
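The five steps above can be sketched in plain Lua. This is only an illustration under stated assumptions: the `segments` table layout, the `render_fragment` name, and the `html_to_tex` function are hypothetical, and `html_to_tex` is a tiny pattern-based stand-in for the LuaXML transformation rather than the library itself.

```lua
-- Hypothetical stand-in for the LuaXML step; it only knows <i> and <b>.
local function html_to_tex(html)
  local rules = { i = "\\textit{%1}", b = "\\textbf{%1}" }
  for name, template in pairs(rules) do
    html = html:gsub("<" .. name .. ">(.-)</" .. name .. ">", template)
  end
  return html
end

-- `segments` is an assumed representation of one paragraph's inline HTML:
-- tags and the texts between them, in document order.
local function render_fragment(segments)
  local texts, parts = {}, {}
  for _, segment in ipairs(segments) do
    if segment.tag then
      -- step 2: collect the tags into a single string
      parts[#parts + 1] = segment.tag
    else
      -- step 3: replace each intervening text with a unique identifier
      texts[#texts + 1] = segment.text
      parts[#parts + 1] = "TEXT" .. #texts
    end
  end
  -- step 4: convert the combined HTML
  local tex = html_to_tex(table.concat(parts))
  -- step 5: put the intervening texts back
  return (tex:gsub("TEXT(%d+)", function(n) return texts[tonumber(n)] end))
end

print(render_fragment({
  { tag = "<i>" }, { text = "world" }, { tag = "</i>" },
}))
```

A real implementation would need identifiers that cannot collide with the document's own text; `TEXT1` is used here only to mirror the example.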

However, this seems like a lot of plumbing in TeX, which runs the risk of breaking commands that change catcodes such as \verb in hybrid mode. A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters:

\markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}

However, we can't just do that without breaking compatibility, since users may already rely on \markdownRendererInlineHtmlTag. Perhaps we can have a backwards-compatibility definition of \markdownRendererInlineHtmlFragment that would expand to \markdownRendererInlineHtmlTag{<i>}world\markdownRendererInlineHtmlTag{</i>} in my example. This definition would be used when the user has redefined \markdownRendererInlineHtmlTag or \markdownRendererInlineHtmlTagPrototype. This would allow us to use LuaXML with both block and inline HTML elements.
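To make the backwards-compatibility idea concrete, here is a minimal Lua model of what such a fallback definition would expand to. The real definition would live in TeX; the `legacy_expansion` function is hypothetical, and it assumes that the arguments after the leading count alternate between tags and intervening texts, starting with a tag.

```lua
-- Assumption: `arguments` holds the parameters after the leading count,
-- alternating tag, text, tag, ... and starting with a tag.
local function legacy_expansion(arguments)
  local out = {}
  for index, argument in ipairs(arguments) do
    if index % 2 == 1 then
      -- odd positions are tags and go through the existing renderer
      out[#out + 1] = "\\markdownRendererInlineHtmlTag{" .. argument .. "}"
    else
      -- even positions are intervening texts and are passed through unchanged
      out[#out + 1] = argument
    end
  end
  return table.concat(out)
end

print(legacy_expansion({ "<i>", "world", "</i>" }))
```

With this fallback, documents that redefine \markdownRendererInlineHtmlTag would see the same token stream as before.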

@michal-h21
Copy link

Well, I don't know much about CommonMark and also how the Markdown package processes the document, so I am not sure what the best way is, so I cannot comment on this :( I can only help on the LuaXML end, I am afraid.

@Witiko
Owner Author

Witiko commented Jul 22, 2024

That's OK, few people do! I am happy to put in the work on the Markdown side of things.

@Witiko
Owner Author

Witiko commented Jul 24, 2024

> A better option would be to introduce a renderer \markdownRendererInlineHtmlFragment instead, which would receive a variable number of parameters:
>
> \markdownRendererInlineHtmlFragment{2}{<i>}{world}{</i>}

However, things would still break if, instead of "world", there were some brittle content that needs to appear at the top level of a file. We can still fix this by putting "world" into a separate file.
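A rough Lua sketch of the separate-file idea follows. The `externalize` helper and the use of `\input` are assumptions made for illustration; the issue does not say which mechanism the Markdown package would use to read the file back.

```lua
-- Hypothetical helper: write brittle content to its own file and return
-- a TeX snippet that reads it back, so the content is tokenized at the
-- top level of that file rather than inside a macro argument.
local function externalize(text, path)
  local file = assert(io.open(path, "w"))
  file:write(text)
  file:close()
  return "\\input{" .. path .. "}"
end

local path = os.tmpname()
local argument = externalize("\\verb|world|", path)
print("\\markdownRendererInlineHtmlFragment{2}{<i>}{" .. argument .. "}{</i>}")
os.remove(path)
```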

> [...] we may want to look into using the LuaXML library with the default transformation rules from rmodepdf to display block HTML elements.

Come to think of it, in CommonMark, block HTML elements do not necessarily represent complete HTML fragments that can be represented in DOM either. Therefore, we would need to do something similar to the command \markdownRendererInlineHtmlFragment on the level of blocks.

Both changes seem significant and possibly breaking for some users. Let's do something simpler instead and only use Rmodepdf and LuaXML for raw HTML blocks and HTML file transclusion, as these are both very likely to contain complete HTML fragments.
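One way to picture what "complete HTML fragment" means here is a conservative tag-balance check. The sketch below is not part of the Markdown package or LuaXML; the `is_complete_fragment` name and the pattern-based tag scan are assumptions, and the check is stricter than HTML itself, which allows some closing tags to be omitted.

```lua
-- Void elements never take a closing tag.
local VOID = {
  area = true, base = true, br = true, col = true, embed = true,
  hr = true, img = true, input = true, link = true, meta = true,
  source = true, track = true, wbr = true,
}

-- Returns true if every opening tag is closed in the right order.
local function is_complete_fragment(html)
  local open = {}
  for closing, name, self_closing in html:gmatch("<(/?)(%a[%w-]*)[^>]-(/?)>") do
    local tag_name = name:lower()
    if closing == "/" then
      -- a closing tag must match the most recently opened element
      if open[#open] ~= tag_name then
        return false
      end
      open[#open] = nil
    elseif self_closing == "" and not VOID[tag_name] then
      open[#open + 1] = tag_name
    end
  end
  return #open == 0
end

print(is_complete_fragment("<div><p>Hello</p></div>"))
print(is_complete_fragment("<div>"))
```

In CommonMark, an HTML block that starts with `<div>` ends at the next blank line, so a `<div>` and its `</div>` separated by blank lines arrive as two separate blocks, each of which fails a check like this one.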

@Witiko modified the milestones: 3.7.1, 3.8.0 Aug 30, 2024
@Witiko modified the milestones: 3.8.0, 3.9.0 Oct 15, 2024