-
Notifications
You must be signed in to change notification settings - Fork 628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Markdown: improved version #3236
Conversation
Signed-off-by: Masatake YAMATO <[email protected]>
This parser is based on the asciidoc parser and tries to preserve all features of the regex-based parser (all kinds, full scope, sectionMarker field, running subparsers for code). Comparing with the regex based implementation, the rewritten parser is more than 10 times faster: [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-Regex ./codebase ctags Markdown version: 5e6ab4b6 features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2 log: results/5e6ab4b6,Markdown............,..........,time......,default...,2021-12-24-00:27:22.log tagsoutput: /dev/null cmdline: + /home/yamato/var/ctags-github/ctags-Regex --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook 686 files, 322854 lines (24506 kB) scanned in 2.0 seconds (12462 kB/s) 19246 tags added to tag file real 0m2.014s user 0m1.967s sys 0m0.035s + set +x [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-C ./codebase ctags Markdown version: 258b9cbd features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2 log: results/258b9cbd,Markdown............,..........,time......,default...,2021-12-24-00:27:34.log tagsoutput: /dev/null cmdline: + /home/yamato/var/ctags-github/ctags-C --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook 686 files, 322854 lines (24506 kB) scanned in 0.1 seconds (164033 kB/s) 19112 tags added to tag file real 0m0.190s user 0m0.172s sys 0m0.015s + set +x @masatake editted the commit log, deleted old implementation, and implemented code for filling "end:" fields.
Codecov Report
@@ Coverage Diff @@
## master #3236 +/- ##
==========================================
+ Coverage 85.01% 85.27% +0.25%
==========================================
Files 206 207 +1
Lines 49127 49304 +177
==========================================
+ Hits 41765 42043 +278
+ Misses 7362 7261 -101
Continue to review full report at Codecov.
|
Thank you for updating.
I think tagging |
Oh yeah, forgot about this possibility. Thanks for noticing, should be fixed now. |
Thank you for updating. I found 3 bug patterns. Could you try to fix them? input0.md:
output:
expectation:
input1.md:
output:
expectation:
The "frontmatter" area should be skipped. (@masatake must improve Units/parser-markdown.r/frontmatter.d/input.md to be able to detect the bug I will report here.") input2.md:
The new parser makes a tag for |
Hmm, I just found https://github.github.com/gfm/ when looking for some formal specification because there are many ambiguities and there are a lot more things to handle. For instance, it's not clear to me how to handle things like
I think this isn't a fenced code block https://github.github.com/gfm/#fenced-code-blocks but rather a code span https://github.github.com/gfm/#code-spans But it's not clear to me how in such a case I could distinguish these when I first encounter the opening https://github.github.com/gfm/#info-string Also, is there any formal specification for how the "frontmatter" should look like? This isn't anything markdown-related, I guess it's just used by some tools using markdown - is this always delimited by |
Also, the markdown standard seems to suport empty headings - should I generate tags for these too? In addition, markdown seems to support multiline headings like
but I think I'll just give up on these. |
No, at least this pull request. If the old regex parser cannot do something, it is o.k. the new parser cannot do them. |
I'll try to do something myself, it's not a problem to fix the issues you mentioned somehow, I was just wondering if you have some ideas as this "something" may not be the completely correct thing to do. |
The question is what should happen in situations like |
@masatake I hope I fixed the problems. With the |
@masatake I just made minor formatting changes in the code to follow the style used by uctags more closely. |
Could you split d37fea4 into smaller ones? When you made a commit (commit #3241 is an example following the above practice. |
Headings defined using underline by either '=' or '-' may contain an arbitrary amount of '=' or '-' characters which don't have to be equal to the number of characters on the previous line (which was what the previous code did). Special attention has to be paid to code blocks using indentation (either by 4 or more spaces or a \t character). For instance abcd --- this is a code block followed by underline, while abcd --- is a 'abcd' heading.
The indent can be composed of combination of tabs and spaces and the previous code didn't take into account combinations like " \t". Fix that.
We have to check if a sequence of '-'s is followed by a visible character or not to distinguish a - b which is a list containing 'b' and a - which is a heading.
@masatake Done. |
I have tested the update changes. In the following input:
(Two spaces before `O`) the parser cannot extract
If you are o.k., could you include the above change to 9e602a6? Could you do mkdir parser-markdown.r/backquote.d? input.md:
args.ctags:
(though there is only one tag entry.) expected.tags:
Could you add this new test case to 9e602a6, too? |
The next one is an extra request. Is it possible to skip
|
Oh, yeah, totally, I didn't pay enough attention when updating the code, thanks for noticing, will fix.
It should be no problem for the simple cases you provided but it would be much harder for something like
I'll provide a patch for the simple cases only (it will basically be what is done for |
This isn't quite correct check as there may be multiple code spans on a single line and they can go over multiple lines but at least it fixes the case the previous regex parser handled.
- use space between function name and ( - use camel case instead of underline in names
Thank you. I tried the new changes. It seems that we can expect With the following change, tags output became better:
Could you include this change to 69be461? |
@masatake Yep, makes sense, done. |
Thank you. |
I fixed the problem with underlined headings from #3233 and in addition cleaned up the code a bit. Also, when running unit tests on the fixed code, there popped up a problem with code blocks using just indenting I fixed, more in the commit message (I had no clue one can do that, I haven't seen this documented anywhere but even github markdown works this way).