Provide source information to caller during parsing #91
Comments
Yes, it could have quite a lot of applications. But so far no one has been motivated enough to really work on it... I have some idea how the feature might be implemented, but I never actually started to work on it because I think it would be better to do so only when there is some real development of an application at the same time, so that the API gets some feedback/verification on whether the ideas are solid. A prototype of the callback could look like this:
int syntax(MD_SIZE off, MD_SIZE size, unsigned what, void* detail, void* userdata);
Parameters:
The public header would provide some list of preprocessor macros for all the potentially fired Some more notes come to my mind:
|
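A rough sketch of how an application might consume a callback shaped like the prototype in the comment above. The `SYNTAX_*` constants, the `highlighter` struct and `highlight_cb` are hypothetical names invented for illustration; none of them exist in md4c, and plain `unsigned` stands in for md4c's own typedefs so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kinds of syntax ranges; these names are NOT part of md4c. */
enum {
    SYNTAX_EMPH_MARK = 1,   /* e.g. the '*' around emphasized text */
    SYNTAX_LINK_BRACKET,    /* e.g. the '[' or ']' of a link */
    SYNTAX_QUOTE_MARK       /* e.g. the '>' of a block quote */
};

/* Callback shape following the proposed prototype (assumed, not final). */
typedef int (*syntax_fn)(unsigned off, unsigned size, unsigned what,
                         void* detail, void* userdata);

/* Example consumer: a highlighter keeping one style byte per input byte. */
typedef struct {
    unsigned char* styles;
    size_t len;
} highlighter;

static int highlight_cb(unsigned off, unsigned size, unsigned what,
                        void* detail, void* userdata)
{
    highlighter* h = (highlighter*) userdata;
    (void) detail;
    assert(h != NULL);
    if ((size_t) off + size > h->len)
        return -1;  /* non-zero aborts, like md4c's rendering callbacks */
    for (unsigned i = 0; i < size; i++)
        h->styles[off + i] = (unsigned char) what;
    return 0;
}
```

A range-based callback like this is enough for colorizing the source, because the application never needs to know what the renderer outputs.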
I re-read this and it occurred to me that the |
(Not saying it's not useful - because I really think it is - but I think I made an error in thinking about that in particular.) |
Correct. And I can't see how that could be (easily) done if you need it. The parser has no knowledge of how much text the renderer will output, given the syntax decoration of the output format (e.g. HTML tags) or its escaping rules. All that can theoretically be achieved is that the callback would be called at the "right time", so that when called, the renderer knows "now, the current position in the output corresponds to this and that". That would mean we would not call the callback when we detect something, but only while all the rendering callbacks are being called. The drawback of that approach would be larger memory consumption, as we would have to remember more data (at least the offsets in the input) for all the stuff. Another consequence would be that we could not emit ranges but only beginning and end events, because "the right time" is a different thing for each of them. That would complicate "simple" applications that may just need the corresponding beginning and end of some stuff in the input, e.g. to colorize the Markdown syntax. And finally, there would be unsolvable limits to that approach: e.g. reference link definitions have no direct counterpart in the rendered text at all. They only affect the reference links with a matching label, which are distributed arbitrarily in the document. |
Just got an idea that maybe a guarantee that the events come in some defined order (e.g. ordered by the beginning of the range) within the set of events of that particular type would suffice. E.g. that the order of paragraphs as provided by the syntax callback is the same as the order in which the rendering callbacks see them. This would allow the application to build a mapping between offsets in the input and in the output if it needs them, as the renderer can simply count how many paragraphs it has already seen and use that counter as an ID for each paragraph. Ditto for all the other block/inline elements. I know it's kind of a "leave the hard work to the caller" way, but maybe it is the right one here. |
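The counting idea above can be sketched as follows, assuming the hypothetical syntax events and the rendering callbacks really do report paragraphs in the same order. `para_map`, `record_paragraph` and `enter_paragraph` are made-up names for illustration and are not md4c API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARAGRAPHS 64

typedef struct { unsigned beg, end; } src_range;

typedef struct {
    src_range ranges[MAX_PARAGRAPHS]; /* filled from the syntax events */
    size_t n_recorded;                /* paragraphs reported so far */
    size_t n_rendered;                /* paragraphs the renderer has entered */
} para_map;

/* Called for each paragraph range, in document order. Returns 0 on success. */
static int record_paragraph(para_map* m, unsigned beg, unsigned end)
{
    assert(beg <= end);
    if (m->n_recorded >= MAX_PARAGRAPHS)
        return -1;
    m->ranges[m->n_recorded].beg = beg;
    m->ranges[m->n_recorded].end = end;
    m->n_recorded++;
    return 0;
}

/* Called by the renderer when it enters a paragraph. The running counter
 * serves as the paragraph ID; NULL means no range was recorded for it. */
static const src_range* enter_paragraph(para_map* m)
{
    size_t id = m->n_rendered++;
    return (id < m->n_recorded) ? &m->ranges[id] : NULL;
}
```

The same pattern would need one counter per element type, which is the "hard work left to the caller" mentioned above.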
That I understand, I think. What about "for a given rendered block/inline, what are its offsets in the source text"? The offsets in the output are not important, only that each rendered element has offset references to its own source. |
IDK. I can theoretically imagine that maybe all those detail structures would have some But:
|
Yes, you are of course right on those points. I think doing work on the Do you think it would be interesting for only certain types, like tasks? Maybe a compile-time flag could toggle this feature, to avoid the extra overhead. |
For tasks, indeed, it is extra important as the app may want to change status of it so it needs to know where the check char in the input lives. But that one already has it (see |
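To illustrate why the task mark offset matters: a minimal sketch of an application flipping a task's status directly in the source buffer. It assumes the application already has the byte offset of the mark character (to my knowledge md4c.h exposes this as `task_mark_offset` in `MD_BLOCK_LI_DETAIL`, but treat that as an assumption and check the header). `toggle_task_mark` is a hypothetical helper, not part of md4c.

```c
#include <assert.h>
#include <stddef.h>

/* Flip the task mark at byte offset 'mark_off' between ' ' and 'x'.
 * Returns 0 on success, -1 if the offset does not sit between '[' and ']'. */
static int toggle_task_mark(char* text, size_t len, size_t mark_off)
{
    assert(text != NULL);
    if (mark_off == 0 || mark_off + 1 >= len)
        return -1;
    if (text[mark_off - 1] != '[' || text[mark_off + 1] != ']')
        return -1;
    text[mark_off] = (text[mark_off] == ' ') ? 'x' : ' ';
    return 0;
}
```

Because the edit replaces exactly one byte, all other offsets into the document stay valid afterwards.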
That was a nice find. I guess I haven't looked into the tasks feature enough. Thank you for mentioning it. That satisfies my particular needs for now with regards to source offsets. Then I think it would also be appropriate to add |
Well, I hope that if we add
No. It seems we are being added to more and more Linux distros, and a compile-time option would effectively mean that it cannot be turned on/off per app anyway. But I believe that if all the bigger data is provided via the |
Ok, then we aim for your idea about syncing the callback order. |
I think it's quite problematic to separate the sax-functions and the syntax information. Instead the syntax information should be part of the sax functions. I've implemented a proposal in karstenBriksoft@ec1d181 but the critical part is still missing: the correct offset. For the changes to work the context has a new My idea for an initial implementation was to only use MD_OFFSET as offset, so the offsets that are passed to the callback would be byte-offsets in the source string. Providing line/column information would require additional computation of line/column to offset and back. Maybe this can be solved by introducing a new callback like |
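The conversion between byte offsets and line/column mentioned above could live entirely on the application side. A minimal sketch, assuming `\n` line endings and columns counted in bytes (not characters); `line_col`, `offset_to_line_col` and `line_col_to_offset` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t line; size_t col; } line_col;  /* both 1-based */

/* Walk the text up to 'off', counting newlines. O(off) per call. */
static line_col offset_to_line_col(const char* text, size_t len, size_t off)
{
    line_col lc = { 1, 1 };
    assert(off <= len);
    for (size_t i = 0; i < off; i++) {
        if (text[i] == '\n') {
            lc.line++;
            lc.col = 1;
        } else {
            lc.col++;
        }
    }
    return lc;
}

/* Inverse mapping; clamps to the end of the line or of the text. */
static size_t line_col_to_offset(const char* text, size_t len, line_col lc)
{
    size_t off = 0;
    size_t line = 1;
    /* Skip whole lines until the requested one starts. */
    while (line < lc.line && off < len) {
        if (text[off++] == '\n')
            line++;
    }
    /* Then advance within the line, never past its end. */
    for (size_t c = 1; c < lc.col && off < len && text[off] != '\n'; c++)
        off++;
    return off;
}
```

An editor doing this often would likely cache the offsets of line starts instead of rescanning, but the parser itself would only need to report byte offsets.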
At first glance, I would tend to agree. Actually, my original idea was also to provide the syntax info in the SAX callbacks. But after some more thinking, it would bring some hard-to-solve or maybe even impossible-to-solve problems, for the implementation as well as for the interface itself. Most of them come from the fact that the applications most likely to be interested in it are text editors, and that those would want much richer information about the syntax used, e.g. the exact position of every bracket-like or quote-like character used to encode an inline link (example: So the following is a (possibly incomplete) list of potential problems with such an approach:
|
My thinking was more along the lines of: md4c is built for converting Markdown to something else. If there's some rough syntax information provided, that's at least a good starting point.
That was my impression, too. That's why the callbacks I defined don't yet provide the correct offsets. As for expanding the structures, I'd start with just the beginning and end offsets.
As the possibilities in Markdown are close to endless, I'd not even try to provide the position of every character. It makes the data structures complicated to define, it makes them complicated to fill and on the other end it also makes them complicated to read and understand. The editor would need to be aware of every possibility, maybe without even wanting to support all of them.
That's where I actually see use for the
According to my idea about only providing start/end offsets, I'd say position 0 starts with the first
Also here, I'd keep it simple and only provide the start and end offsets (I haven't actually tried what md4c provides as callbacks inside tables, but I'd say the strings are at least reported separately, so their offsets would also be reported, which should give you a pretty good idea about where the table is located and what it consists of). |
@karstenBriksoft Could you maybe briefly explain what your use case is? It likely is not syntax highlighting, or is it? |
It is syntax highlighting, but on a very basic level. Consider an editor like iA writer: You see the Markdown but the Markdown has styles applied so that you get an idea of what you're typing. |
I understand your approach may be enough for simple syntax highlighting such as yours. But at the same time I still believe that providing a way to highlight So, give me some time to think it over and see whether the SAX-like approach could/should be expanded to accommodate both approaches at the same time, or whether it would be a road to maintenance hell. |
Like I said: if you want to highlight code like |
No, it is not. You cannot easily distinguish these from what you get in the callback, for example:
versus
versus
Or
versus
Or you cannot automatically treat And last but not least, you force the application to understand the Markdown specification and reimplement what the parser does, with all the maintenance burden it may bring if e.g. a new link type is added tomorrow. |
But this is not a proposal at all. Why don't you make an attempt at implementing it instead? Only then will we start to see where the real issues are. There are so many unknowns about how this will work in practice. |
It's a proposal of what an alternative API could look like. md4c is not exactly super easy to understand in an hour or so, which makes it hard to "quickly" add some offset information to the parser. The implementation is highly optimised for speed, not for comprehension. |
It is better syntax highlighting than you are clearly having in mind, and it should offer the following features:
|
From what I understand, the parser sends the callbacks in the correct order. So if a link is part of a block quote, you'll get the information about the block quote first. That allows you to apply styling to the block quote, given you know its location. Then you get the information about the anchor span, allowing you to apply the appropriate format in its span. Lastly you get the information about text inside the anchor, allowing you to apply the formatting for the text in an anchor in a block quote.
That's why I wouldn't want to provide the source information in a highly specific way, because then it needs adaptation with new rules. If you only provide start/end, that's more future-proof. |
If you get information where the block quote begins and ends, you have no idea which EDIT: There may be an escaped EDIT 2: And there are proposals/demand for new container blocks, like new table syntaxes which would allow multi-line text in a cell. So consider a block quote inside something like that. The API has to be extensible for such things in the future. Sources: 1, 2, 3, and some discussions at https://talk.commonmark.org/ but that seems inaccessible right now |
And that's why there is not much to gain from just proposing an API. The implementation is the issue. The risk is that we discuss this to death. A partial implementation would be very valuable. Even a failed attempt will give much insight. |
I think I'm starting to get where you're coming from. If I take the quoted code below from your earlier example:
the anchor is split over multiple lines, likewise the blockquote is split and it's all inside a codeblock that's again part of a blockquote.
That means that enter-callbacks probably wouldn't need any offset information at all, but leave-callbacks would then provide a list of start- and end-offset tuples. |
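A sketch of what such a list of start/end tuples could look like for an element that is split across lines. The types and helpers (`offset_range`, `source_pieces`, `pieces_contain`, `pieces_length`) are hypothetical and only illustrate the idea; the ranges are half-open byte ranges.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous piece of an element in the source: [beg, end). */
typedef struct { unsigned beg, end; } offset_range;

/* Hypothetical payload of a leave-callback: all pieces of the element. */
typedef struct {
    const offset_range* ranges;
    size_t count;
} source_pieces;

/* Does byte offset 'off' belong to the element? */
static int pieces_contain(const source_pieces* p, unsigned off)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++) {
        if (off >= p->ranges[i].beg && off < p->ranges[i].end)
            return 1;
    }
    return 0;
}

/* Total number of source bytes covered by the element. */
static unsigned pieces_length(const source_pieces* p)
{
    unsigned total = 0;
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++)
        total += p->ranges[i].end - p->ranges[i].beg;
    return total;
}
```

For the input `"> [abc\n> def](x)"`, a link inside a block quote would be described by the two pieces `{2, 6}` and `{9, 16}`, which skips the newline and the second `> ` prefix.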
My use case for source mapping is a markdown editor that shows some markers (like for emphasis), and it hides others (like for tables). We use offsets in the original markdown as positions, and we need it for all parsed content including markers. As the SAX interface already has callbacks indicating where things start and end, it would be natural to extend it with a marker callback that reports a marker type including the marker text and offsets:
This kind of interface has several good features:
It could be useful to know the type of a marker, for example for a link title it could be useful to know it is the beginning quotation of the title. On the other hand, it would quickly be quite complicated and in most cases, it would not be needed anyway. And it is still possible to look at the content and order of the markers within a block to make a good enough guess in the application. |
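The snippet originally attached to this comment is not preserved here, so the following is only a guess at the general shape of such a marker callback, not the commenter's actual proposal. `marker_kind`, `marker_list`, `on_marker` and `strip_markers` are hypothetical names. The sketch records markers as they are reported and then produces a view of the source with the markers hidden.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MARKER_OPEN, MARKER_CLOSE, MARKER_OTHER } marker_kind;

typedef struct {
    marker_kind kind;
    unsigned off;    /* byte offset of the marker in the source */
    unsigned size;   /* length of the marker in bytes */
} marker;

#define MAX_MARKERS 32

typedef struct {
    marker items[MAX_MARKERS];
    size_t count;
} marker_list;

/* Hypothetical marker callback: type, marker text, and offsets. */
static int on_marker(marker_kind kind, const char* text, unsigned size,
                     unsigned off, void* userdata)
{
    marker_list* ml = (marker_list*) userdata;
    (void) text;
    assert(ml != NULL);
    if (ml->count >= MAX_MARKERS)
        return -1;
    ml->items[ml->count].kind = kind;
    ml->items[ml->count].off = off;
    ml->items[ml->count].size = size;
    ml->count++;
    return 0;
}

/* Copy 'src' to 'out' without the recorded markers. Assumes markers were
 * reported in source order and do not overlap; 'out' needs len + 1 bytes.
 * Returns the number of bytes written, excluding the terminator. */
static size_t strip_markers(const char* src, size_t len,
                            const marker_list* ml, char* out)
{
    size_t pos = 0, n = 0;
    for (size_t i = 0; i < ml->count; i++) {
        const marker* m = &ml->items[i];
        while (pos < m->off && pos < len)
            out[n++] = src[pos++];
        pos = (size_t) m->off + m->size;
    }
    while (pos < len)
        out[n++] = src[pos++];
    out[n] = '\0';
    return n;
}
```

An editor that shows some markers and hides others would filter on `kind` (or on the enclosing element) before skipping a marker.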
I'm not sure what exactly you mean here: As of now, no offsets referring to the source are propagated into the callbacks. Probably the only exception is the offset of the mark in the task list ( The offsets in the structure
And I cannot guarantee some future Markdown features would need it to do anything similar even in most important/common situations.
The parsing works (and has to work) in two passes over the input (that's, by the way, a reason why there is no Markdown parser which can work in a streaming fashion). The first pass is responsible for the block analysis (and also collects all link reference definitions); the second pass then performs the inline analysis of every block. In MD4C, a lot of stuff gathered and analyzed during the block analysis is simply forgotten if it's not strictly needed later, so a lot of work would be needed to keep it around so that it could be fired into the application in the right order. The callbacks (even for the enclosing blocks) are only called later, during the inline pass.
Only very minimal information is currently propagated to the 2nd pass. The output of the block analysis is just a list of blocks in a very condensed representation, and for each block there is more or less only a vector of its "lines", where the line struct holds only the beginning and end offsets in the source input, so that its contents are stripped of any indentation or block decorations. The inline analyzer simply processes the stuff inside those (stripped) lines, so all the stuff in the gaps between them is ignored: after all, the gaps hold block-encoding stuff which the inline analyzer does not understand and which would confuse its parsing.
It could be relatively easy to change MD4C so that it calls a callback passing the gap contents at the right moments (between processing one line and the subsequent line), but it could provide no additional information about what the characters in it mean or how they are related to the stack of nested blocks the currently processed paragraph lives in.
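To make the "gaps between lines" concrete, here is a small sketch that derives the gaps from a vector of stripped line ranges, roughly as described above. It is not MD4C's internal code; `line_range`, `gap_fn`, `report_gaps`, `gap_list` and `collect_gap` are hypothetical names, and the block boundaries are assumed to be known.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped contents of one line: [beg, end) in the source input. */
typedef struct { unsigned beg, end; } line_range;

typedef void (*gap_fn)(unsigned beg, unsigned end, void* userdata);

/* Report every non-empty gap before, between, and after the lines of a
 * block that spans [block_beg, block_end). */
static void report_gaps(const line_range* lines, size_t n_lines,
                        unsigned block_beg, unsigned block_end,
                        gap_fn cb, void* userdata)
{
    unsigned pos = block_beg;
    for (size_t i = 0; i < n_lines; i++) {
        assert(lines[i].beg >= pos && lines[i].end >= lines[i].beg);
        if (lines[i].beg > pos)
            cb(pos, lines[i].beg, userdata);
        pos = lines[i].end;
    }
    if (block_end > pos)
        cb(pos, block_end, userdata);
}

/* Example consumer that just remembers the gaps it is told about. */
#define MAX_GAPS 16

typedef struct {
    line_range gaps[MAX_GAPS];
    size_t count;
} gap_list;

static void collect_gap(unsigned beg, unsigned end, void* userdata)
{
    gap_list* gl = (gap_list*) userdata;
    if (gl->count < MAX_GAPS) {
        gl->gaps[gl->count].beg = beg;
        gl->gaps[gl->count].end = end;
        gl->count++;
    }
}
```

For `"> foo\n> bar\n"` with stripped lines `{2, 5}` and `{8, 11}`, the reported gaps are `[0, 2)`, `[5, 8)` and `[11, 12)`. As the comment notes, the callback learns where the gaps are but not what their characters mean.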
Changing it so that some richer data is passed to the 2nd pass would be a lot of work, and imho quite bad from the maintenance point of view: Suddenly the 2nd pass would be aware of all that info and understand it at least to some degree, while now the two passes are very independent and the inline analyzer does not even need to know whether it is in a top-level block or inside a block quote or inside a list or a table cell, or some devilish combination of all of those in some nesting:
The "lost information" in the gaps between the lines or the blocks includes for example:
I'm lost here: If you're nested in multiple levels of blockquotes, and your callback gets called saying "Hey, I've encountered a marker which encodes a blockquote", how exactly are you determining which of the nested blockquotes it is really about? We currently have no unique block identifiers or anything similar: The application implementing the callbacks just maintains the stack of started (and not yet finished) blocks on its own. |
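The application-side stack mentioned above can be sketched like this. The `BLK_*` constants and the `block_stack` helpers are hypothetical stand-ins (a real application would use the parser's own block types from its enter/leave callbacks). The last helper shows the ambiguity: with two block quotes open, a bare "blockquote marker" event cannot say which one it belongs to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block kinds, standing in for the parser's block types. */
enum { BLK_QUOTE = 1, BLK_LIST_ITEM, BLK_PARAGRAPH };

#define MAX_DEPTH 32

typedef struct {
    int types[MAX_DEPTH];
    size_t depth;
} block_stack;

/* To be called from the application's enter-block callback. */
static int stack_enter(block_stack* s, int type)
{
    assert(s != NULL);
    if (s->depth >= MAX_DEPTH)
        return -1;
    s->types[s->depth++] = type;
    return 0;
}

/* To be called from the leave-block callback; the type must match the top. */
static int stack_leave(block_stack* s, int type)
{
    assert(s != NULL);
    if (s->depth == 0 || s->types[s->depth - 1] != type)
        return -1;
    s->depth--;
    return 0;
}

/* How many currently open blocks have the given type? A result above 1
 * means a marker of that type cannot be attributed by type alone. */
static size_t count_open(const block_stack* s, int type)
{
    size_t n = 0;
    for (size_t i = 0; i < s->depth; i++) {
        if (s->types[i] == type)
            n++;
    }
    return n;
}
```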
Yes, it was poorly formulated. I meant to say that the SAX interface reports boundaries of syntax elements in the order of the text. Adding offsets to events would give the exact location. From that perspective, it is a natural extension to have callbacks for markers as well. The benefit I am trying to communicate is that this is a more flexible way to report markers than trying to come up with a data structure on the existing events that will cover all the cases of possible marker locations.
In any case, it seems good to provide accessors for attributes as they might be pre-processed as you say. If one wants to be able to recreate markdown exactly from the SAX events, then the attributes must also be reported as they appeared in the markdown.
It might be good enough. Markdown is so messy that it is hard to see how one could do something that is really exact. For inlines, at least, the app should be able to associate markers with the inline at the top of the stack. For block elements, it would be more tricky. If one could at least get the marker type (like
I assume that the markers would arrive in the same order as the stack (from bottom to top), at least for blockquotes and lists. I know that this is an incredibly tricky problem, and there will be tradeoffs. Even getting offsets for the current events would go a long way for many applications. |
I stumbled onto this issue and had the exact same problem. I first tried to modify md4c for my needs, but with very little success. I ended up rewriting a complete parser (with some flavored Markdown). The idea is quickly represented here.
My solution for the problem
Let's say that we have the following Markdown example:
- >> [abc
>> def](example.com)
This example would generate an abstract syntax tree (AST) like:
How do we attribute each non-text marker (like I created a parser to solve this specific problem, while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by an array of boundaries. A boundary is defined as follows:
struct Boundary {
    int line_number;
    int pre;
    int beg;
    int end;
    int post;
};
This struct designates offsets in the raw text which form its structure. Here is a simple example. Suppose we have the following text:
Going back to the first example, we now use the following notation to illustrate ownership of markers: if there is
So at each block / span enter, an array of boundaries is provided to the caller. This informs the caller of all the markers used to create that specific block / span. In the case of lists and sub-lists, spaces are attributed in the following way:
I hope this idea will help people looking for a solution to this problem. |
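A sketch of how a caller might use such boundaries. The examples that explained the fields are not preserved here, so the interpretation below is an assumption: `[pre, beg)` is taken to be the opening marker, `[beg, end)` the content, and `[end, post)` the closing marker, all on line `line_number`. `part`, `classify` and `classify_in` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

struct Boundary {
    int line_number;
    int pre;   /* assumed: start of the opening marker */
    int beg;   /* assumed: start of the content */
    int end;   /* assumed: end of the content */
    int post;  /* assumed: end of the closing marker */
};

typedef enum {
    PART_OUTSIDE, PART_OPEN_MARKER, PART_CONTENT, PART_CLOSE_MARKER
} part;

/* Which part of a single boundary does byte offset 'off' fall into? */
static part classify(const struct Boundary* b, int off)
{
    assert(b->pre <= b->beg && b->beg <= b->end && b->end <= b->post);
    if (off < b->pre || off >= b->post)
        return PART_OUTSIDE;
    if (off < b->beg)
        return PART_OPEN_MARKER;
    if (off < b->end)
        return PART_CONTENT;
    return PART_CLOSE_MARKER;
}

/* Same question for a whole block/span given as an array of boundaries. */
static part classify_in(const struct Boundary* bs, size_t n, int off)
{
    for (size_t i = 0; i < n; i++) {
        part p = classify(&bs[i], off);
        if (p != PART_OUTSIDE)
            return p;
    }
    return PART_OUTSIDE;
}
```

Under that assumption, `**bold**` at the start of a line would be `{0, 0, 2, 6, 8}`: offsets 0 and 1 are the opening marker, 2 to 5 the content, 6 and 7 the closing marker.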
As mentioned previously by @mity:
I will make an additional suggestion that the current line number is useful information too. (It just crossed my mind that it may well be enough to have the offsets, and no line numbers. Text editors usually have a method to place the cursor at some specified offset from 0.) For another useful application of this, see http://moinmo.in/WikiSandBox. Double-click somewhere in the text, and you will be taken to the editor with the cursor placed at the start of the corresponding line. If you look at the HTML source of the rendered text, you will see that it has hidden line references.