
wip: mdoc reader #10225

Draft · wants to merge 81 commits into main
Conversation

@silby commented Sep 27, 2024

There's a substantial amount of work left to do here, but as I am going on vacation for four weeks on Monday and not bringing a computer with me it seems reasonable to put up a draft PR. I welcome feedback on what I've done so far.

closes #9056

I'll try to start doing real commits for myself from now on
Replacing spacetab copied from Roff lexer
mandoc's roff(7) says "Blank text lines, which may include whitespace,
are only permitted within literal contexts." mandoc -T lint warns about
blank lines and inserts a roff `sp` request, which is handled
differently depending on the output format. My read is that mandoc
considers the handling of a blank line in non-literal context in mdoc(7)
to be undefined.
Copy-pasted. Maybe they'll come back.
See mdoc(7) section MACRO SYNTAX
This will handle Ns in the future
There are a number of unique-looking cases for Fl parsing, so I am just
handling them very explicitly instead of trying to generalize anything
enough to handle them.
Solves a delta with mandoc
the edge case of "Ap (" tested in this mandoc regress isn't present in
any actual OpenBSD base system manuals, where Ap is only ever followed
by a letter. Furthermore, "Ap" is generally uncommon compared to "Ns '"
(e.g. ".Xr mandoc 1 Ns 's"). I'm accepting a difference from mandoc here
because correctly suppressing space after the "(" here would require
more refactoring than I feel like doing at time of writing.
It ends up with bad results in the ANSI writer, for example, because it
then can't break lines at Spaces. This isn't wholly inconsistent with
mandoc, because it makes no effort to render multiple consecutive spaces
from the source document in HTML.
Getting to the point where I can start working with real manual pages so
this is helpful.
A bit janky but worse things have happened.
Comment on lines +867 to +871

executable lexroff
import: common-executable
main-is: lexroff.hs
build-depends: pandoc, text
silby (author):

Private test thingy, need to zap this.

@@ -547,6 +548,7 @@ library
hs-source-dirs: src

exposed-modules: Text.Pandoc,
Text.Pandoc.Readers.Roff,
silby (author):

exposed this for my lexroff.hs thing which I didn't commit, move back to other-modules before merge

jgm commented Sep 27, 2024

I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is lexRoff from T.P.Readers.Roff inadequate for mdoc? Why? Could it be improved instead of adding a new module that does the same thing?

silby commented Sep 27, 2024

Part of it was just that I wanted to figure out how to implement this without having to kitbash the Roff lexer beyond recognition or keep the Man reader in sync with stuff that I changed, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different.

The token type used by lexRoff in T.P.R.Roff is based on roff's native syntax, where control lines start with a request or a macro and any further arguments in the control line are simply arguments to that macro. Hence the token type constructor of ControlLine Text [Arg] SourcePos where the Text is the macro or request name and each Arg is handled as either a keyword or as literal text by the macro/request.

While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.)

So the Mdoc.Lex lexer, instead of packaging all the arguments on a roff control line together, lexes each token from the control line individually and emits a totally linear token stream, which is more amenable to recursive parsing of macro arguments/multiple macros in one line. The lexer uses the rules for callable and parsed macros to decide whether to lex a control argument as a Macro token or as a Lit (non-macro text). It's especially handy to make this determination in the lexer because it directly takes care of escaping macro names in argument position: \&No gets lexed as Lit "No", because \& isn't a legal character to start a macro name.

For example:

.Sy hello Em world

I lex this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol]. So a notional parseSy and parseEm (simpler than the ones in this branch) can boil down to this:

parseSy = do
  macro "Sy"
  -- lookAhead so a following macro or eol terminates the argument list
  -- without being consumed, leaving it for the next parser
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ strong $ mconcat $ intersperse space (map toString args)

parseEm = do
  macro "Em"
  args <- manyTill lit (lookAhead (anyMacro <|> eol))
  return $ emph $ mconcat $ intersperse space (map toString args)

If my token stream were of the existing RoffToken type, I would need to do an intermediate step to transform a ControlLine into a flat structure where macros are distinguished from lits. That's seemingly straightforward enough: ControlLine "Sy" ["hello", "Em", "world"] could become a list of my token type via something like

roffTokenToMdocTokens (ControlLine nm args) = Macro nm : map litOrMacro args <> [Eol]
  where
     litOrMacro x | isParsedMacro nm && isCallableMacro x = Macro x
                  | otherwise = Lit x
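To make the translation sketched above concrete, here is a standalone version with stubbed-out predicates. The tiny `isParsedMacro`/`isCallableMacro` tables and the single-constructor `RoffToken` are hypothetical stand-ins for illustration only; the real reader keeps full macro tables and richer argument types.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: flattening a roff-style ControlLine into a linear mdoc token
-- stream. Predicates and types are illustrative stubs, not pandoc's API.
import Data.Text (Text)

data RoffToken = ControlLine Text [Text]
  deriving Show

data MdocToken = Macro Text | Lit Text | Eol
  deriving (Eq, Show)

-- Tiny stand-ins for the real callable/parsed macro tables.
isParsedMacro, isCallableMacro :: Text -> Bool
isParsedMacro = (`elem` ["Sy", "Em", "No"])
isCallableMacro = (`elem` ["Sy", "Em", "No"])

roffTokenToMdocTokens :: RoffToken -> [MdocToken]
roffTokenToMdocTokens (ControlLine nm args) =
  Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x
      | isParsedMacro nm && isCallableMacro x = Macro x
      | otherwise = Lit x

main :: IO ()
main = print (roffTokenToMdocTokens (ControlLine "Sy" ["hello", "Em", "world"]))
-- prints [Macro "Sy",Lit "hello",Macro "Em",Lit "world",Eol]
```

Run on the `.Sy hello Em world` example, this reproduces the flat stream shown earlier.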

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines will get the same lex from current lexRoff:

.Sy hello Em world
.Sy hello \&Em world

All of the above leaving aside the handling of delimiters required by mdoc but irrelevant to man, which is also convenient to deal with in the lexer.

Finally, the Roff lexer implements roff's macro definition requests, so it will actually expand any custom macros that are defined in a manual page read by the Man reader. This is very neat but I think it is an antifeature for mdoc documents, where use of raw roff requests at all, let alone custom macros, is discouraged and hopefully vanishingly rare in the wild. Only a subset of raw roff requests are supported by mandoc, and only about 3 are in use in mdoc manuals in the OpenBSD base system. So my intention was to not include that feature in the mdoc reader.

The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line. But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed) RoffMonad typeclass found in T.P.R.Roff.Escape with functions like lexControlLine, lexTextLine. I'm not sure how much code would actually end up being shared though. Ultimately the MdocToken type I introduced is proving pretty adaptable to the things I need it to do and if I did try to reuse the existing lexRoff I'd probably still translate RoffToken to MdocToken for use in the parsers.

jgm commented Sep 28, 2024

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach.

I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing?

silby commented Sep 28, 2024

We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from T.P.R.Roff, now moved to T.P.R.Roff.Escape. There's just an interaction between applying escapes and tokenizing control lines that needs to be handled differently for mdoc. I'll hopefully make my example from before clearer. Consider these two control lines:

.Sy hello Em world
.Sy hello \&Em world

The Roff lexer lexes these as (the moral equivalent of) [ControlLine "Sy" ["hello", "Em", "world"], ControlLine "Sy" ["hello", "Em", "world"]]. (There are a couple more types involved in the argument list, but the contents boil down to Texts in this instance.)

Mdoc.Lex lexes this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol, Macro "Sy", Lit "hello", Lit "Em", Lit "world", Eol]. The \&Em on the second line is escaped to Em, but it also tokenizes that Em as a literal rather than a macro call. (You can actually see the difference in github's syntax highlighting!)

So if we wanted to reuse the RoffToken type for mdoc we might have to stop processing escapes within lexRoff, because escape characters (by convention \& for zero-width space) are needed to protect strings that happen to be macro names from mdoc macro expansion. The concern doesn't exist for man because there are no man macros that expand further macros in the same control line.
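The decision described here can be sketched as a toy classifier (the names `ArgClass` and `callable` are hypothetical, not pandoc's API): because the raw argument text is still available at tokenization time, an argument beginning with \& is forced to be a literal before the escape is stripped away.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Toy sketch of why escapes must be applied *during* mdoc tokenization:
-- once "\&" has been stripped, the macro/literal distinction is gone.
-- Names here are illustrative, not pandoc's real API.
import Data.Text (Text)
import qualified Data.Text as T

data ArgClass = MacroArg Text | LitArg Text
  deriving (Eq, Show)

-- Stand-in for the real table of callable mdoc macros.
callable :: [Text]
callable = ["Sy", "Em", "No"]

classifyArg :: Text -> ArgClass
classifyArg raw
  | Just rest <- T.stripPrefix "\\&" raw = LitArg rest  -- escaped: never a macro
  | raw `elem` callable                  = MacroArg raw
  | otherwise                            = LitArg raw

main :: IO ()
main = mapM_ (print . classifyArg) ["Em", "\\&Em", "world"]
-- prints MacroArg "Em" / LitArg "Em" / LitArg "world"
```

The first two inputs produce the same text after escaping but different token classes, which is exactly the information a post-hoc translation from already-escaped RoffTokens would have lost.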

Successfully merging this pull request may close these issues.

mdoc reader