
wip: mdoc reader #10225

Draft · wants to merge 81 commits into main
Conversation

@silby commented Sep 27, 2024

There's a substantial amount of work left to do here, but as I am going on vacation for four weeks on Monday and not bringing a computer with me it seems reasonable to put up a draft PR. I welcome feedback on what I've done so far.

closes #9056

I'll try to start doing real commits for myself from now on
Replacing spacetab copied from Roff lexer
mandoc's roff(7) says "Blank text lines, which may include whitespace,
are only permitted within literal contexts." mandoc -T lint warns about
blank lines and inserts a roff `sp` request, which is handled
differently depending on the output format. My read is that mandoc
considers the handling of a blank line in non-literal context in mdoc(7)
to be undefined.
Copy-pasted. Maybe they'll come back.
See mdoc(7) section MACRO SYNTAX
This will handle Ns in the future
There are a number of unique-looking cases for Fl parsing, so I am just
handling them very explicitly instead of trying to generalize anything
enough to handle them.
Solves a delta with mandoc
the edge case of "Ap (" tested in this mandoc regress isn't present in
any actual OpenBSD base system manuals, where Ap is only ever followed
by a letter. Furthermore, "Ap" is generally uncommon compared to "Ns '"
(e.g. ".Xr mandoc 1 Ns 's"). I'm accepting a difference from mandoc here
because correctly suppressing space after the "(" here would require
more refactoring than I feel like doing at time of writing.
It ends up with bad results in the ANSI writer, for example, because it
then can't break lines at Spaces. This isn't wholly inconsistent with
mandoc, because it makes no effort to render multiple consecutive spaces
from the source document in HTML.
Getting to the point where I can start working with real manual pages so
this is helpful.
A bit janky but worse things have happened.
Comment on lines +867 to +871

executable lexroff
import: common-executable
main-is: lexroff.hs
build-depends: pandoc, text
silby (author):

Private test thingy, need to zap this.

@@ -547,6 +548,7 @@ library
hs-source-dirs: src

exposed-modules: Text.Pandoc,
Text.Pandoc.Readers.Roff,
silby (author):

exposed this for my lexroff.hs thing which I didn't commit, move back to other-modules before merge

jgm commented Sep 27, 2024

I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is lexRoff from T.P.Readers.Roff inadequate for mdoc? Why? Could it be improved instead of adding a new module that does the same thing?

silby commented Sep 27, 2024

Part of it was just that I wanted to figure out how to implement this without having to kitbash the Roff lexer beyond recognition or keep the Man reader in sync with stuff that I changed, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different.

The token type used by lexRoff in T.P.R.Roff is based on roff's native syntax, where control lines start with a request or a macro and any further arguments in the control line are simply arguments to that macro. Hence the token type constructor of ControlLine Text [Arg] SourcePos where the Text is the macro or request name and each Arg is handled as either a keyword or as literal text by the macro/request.

While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.)

So the Mdoc.Lex lexer, instead of packaging all the arguments on a roff control line together, lexes each token from the control line individually and emits a totally linear token stream, which is more amenable to recursive parsing of macro arguments/multiple macros in one line. The lexer uses the rules for callable and parsed macros to decide whether to lex a control argument as a Macro token or as a Lit (non-macro text). It's especially handy to make this determination in the lexer because it directly takes care of escaping macro names in argument position: \&No gets lexed as Lit "No", because \& isn't a legal character to start a macro name.

For example:

.Sy hello Em world

I lex this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol]. So a notional parseSy and parseEm (simpler than the ones in this branch) can boil down to this:

parseSy = do
  macro "Sy"
  -- lookAhead so a following macro or eol terminates the argument list
  -- without being consumed, leaving it for the next parser
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ strong $ mconcat $ intersperse space (map toString args)

parseEm = do
  macro "Em"
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ emph $ mconcat $ intersperse space (map toString args)

If my token stream were of the existing RoffToken type, I would need to do an intermediate step to transform a ControlLine into a flat structure where macros are distinguished from lits. That's seemingly straightforward enough: ControlLine "Sy" ["hello", "Em", "world"] could become a list of my token type via something like

roffTokenToMdocTokens (ControlLine nm args) = Macro nm : map litOrMacro args <> [Eol]
  where
     litOrMacro x | isParsedMacro nm && isCallableMacro x = Macro x
                  | otherwise = Lit x
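To make the translation sketched above concrete, here is a standalone version with stubbed-out predicates. The tiny `isParsedMacro`/`isCallableMacro` tables and the single-constructor `RoffToken` are hypothetical stand-ins for illustration only; the real reader keeps full macro tables and richer argument types.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: flattening a roff-style ControlLine into a linear mdoc token
-- stream. Predicates and types are illustrative stubs, not pandoc's API.
import Data.Text (Text)

data RoffToken = ControlLine Text [Text]
  deriving Show

data MdocToken = Macro Text | Lit Text | Eol
  deriving (Eq, Show)

-- Tiny stand-ins for the real callable/parsed macro tables.
isParsedMacro, isCallableMacro :: Text -> Bool
isParsedMacro = (`elem` ["Sy", "Em", "No"])
isCallableMacro = (`elem` ["Sy", "Em", "No"])

roffTokenToMdocTokens :: RoffToken -> [MdocToken]
roffTokenToMdocTokens (ControlLine nm args) =
  Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x
      | isParsedMacro nm && isCallableMacro x = Macro x
      | otherwise = Lit x

main :: IO ()
main = print (roffTokenToMdocTokens (ControlLine "Sy" ["hello", "Em", "world"]))
-- prints [Macro "Sy",Lit "hello",Macro "Em",Lit "world",Eol]
```

Run on the `.Sy hello Em world` example, this reproduces the flat stream shown earlier.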

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines will get the same lex from current lexRoff:

.Sy hello Em world
.Sy hello \&Em world

All of the above leaving aside the handling of delimiters required by mdoc but irrelevant to man, which is also convenient to deal with in the lexer.

Finally, the Roff lexer implements roff's macro definition requests, so it will actually expand any custom macros that are defined in a manual page read by the Man reader. This is very neat but I think it is an antifeature for mdoc documents, where use of raw roff requests at all, let alone custom macros, is discouraged and hopefully vanishingly rare in the wild. Only a subset of raw roff requests are supported by mandoc, and only about 3 are in use in mdoc manuals in the OpenBSD base system. So my intention was to not include that feature in the mdoc reader.

The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line. But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed) RoffMonad typeclass found in T.P.R.Roff.Escape with functions like lexControlLine, lexTextLine. I'm not sure how much code would actually end up being shared though. Ultimately the MdocToken type I introduced is proving pretty adaptable to the things I need it to do and if I did try to reuse the existing lexRoff I'd probably still translate RoffToken to MdocToken for use in the parsers.

jgm commented Sep 28, 2024

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach.

I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing?

silby commented Sep 28, 2024

We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from T.P.R.Roff, now moved to T.P.R.Roff.Escape. There's just an interaction between applying escapes and tokenizing control lines that needs to be handled differently for mdoc. I'll hopefully make my example from before clearer. Consider these two control lines:

.Sy hello Em world
.Sy hello \&Em world

The Roff lexer lexes these as (the moral equivalent of) [ControlLine "Sy" ["hello", "Em", "world"], ControlLine "Sy" ["hello", "Em", "world"]]. (There are a couple more types involved in the argument list, but the contents boil down to Texts in this instance.)

Mdoc.Lex lexes this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol, Macro "Sy", Lit "hello", Lit "Em", Lit "world", Eol]. The \&Em on the second line is escaped to Em, but it also tokenizes that Em as a literal rather than a macro call. (You can actually see the difference in github's syntax highlighting!)

So if we wanted to reuse the RoffToken type for mdoc we might have to stop processing escapes within lexRoff, because escape characters (by convention \& for zero-width space) are needed to protect strings that happen to be macro names from mdoc macro expansion. The concern doesn't exist for man because there are no man macros that expand further macros in the same control line.
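The decision described here can be sketched as a toy classifier (the names `ArgClass` and `callable` are hypothetical, not pandoc's API): because the raw argument text is still available at tokenization time, an argument beginning with \& is forced to be a literal before the escape is stripped away.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Toy sketch of why escapes must be applied *during* mdoc tokenization:
-- once "\&" has been stripped, the macro/literal distinction is gone.
-- Names here are illustrative, not pandoc's real API.
import Data.Text (Text)
import qualified Data.Text as T

data ArgClass = MacroArg Text | LitArg Text
  deriving (Eq, Show)

-- Stand-in for the real table of callable mdoc macros.
callable :: [Text]
callable = ["Sy", "Em", "No"]

classifyArg :: Text -> ArgClass
classifyArg raw
  | Just rest <- T.stripPrefix "\\&" raw = LitArg rest  -- escaped: never a macro
  | raw `elem` callable                  = MacroArg raw
  | otherwise                            = LitArg raw

main :: IO ()
main = mapM_ (print . classifyArg) ["Em", "\\&Em", "world"]
-- prints MacroArg "Em" / LitArg "Em" / LitArg "world"
```

The first two inputs produce the same text after escaping but different token classes, which is exactly the information a post-hoc translation from already-escaped RoffTokens would have lost.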

Successfully merging this pull request may close these issues.

mdoc reader