parsers,lisp: speed up token lookup #3741

Open

wants to merge 3 commits into master

Changes from 1 commit
178 changes: 83 additions & 95 deletions parsers/lisp.c
@@ -149,106 +149,94 @@ static int lisp_hint2kind (const vString *const hint)
 	return k;
 }
 
-/* TODO: implement this in hashtable. */
+/* Lookup code produced by gperf version 3.1 */
masatake (Member) commented on May 31, 2023:

This is helpful for maintenance.
Could you write more about this?

I wonder how to generate the table.

Member Author commented:

@masatake I put a comment into the code. I think that adding gperf to the required toolchain is too much of a pain. Do you agree?

masatake (Member) commented:

Give me time to answer. I have some ideas about this change.

masatake (Member) commented:

In the build process of ctags, we use two code generators: packcc and optlib2c.

packcc is developed as an external project. With "git subtree", the source code of packcc is integrated into our source tree. On UNIX-like systems, packcc is built before ctags itself and is run to generate .c and .h files. If I understand correctly, packcc is not built on Windows+msvc, so parsers requiring packcc are not built on Windows+msvc. The *.peg files under peg/ are the input for packcc; the generated .c and .h files are not in our git repo.

optlib2c is a code generator that lives in our source tree (misc/optlib2c) and is maintained as part of ctags. It is written in Perl. optlib2c can generate .c files from optlib/*.ctags files. We put both the .ctags files and the generated .c files into our git repo, so, unlike the .peg based parsers, the .ctags parsers are built on every platform, including Windows+msvc.

I decided to put the .c files derived from .ctags files into our git repo because I know optlib2c generates the same C code for the same .ctags file.

I decided not to put the .c and .h files derived from .peg files into our git repo because I cannot expect packcc to generate the same code for the same .peg file; packcc is developed as an external project.

Even for .ctags parsers there is a limitation. As long as a .ctags parser is not modified, it can be built even on Windows+msvc. However, one cannot modify a .ctags file there, because the build script for Windows+msvc has no rule for running optlib2c.

Conclusion:

optlib2c based parsers (.ctags parsers or optlib parsers) can be built on Windows+msvc unless modified. If modified, it is hard to build the parser on that platform.

packcc based parsers (.peg parsers) cannot be built on Windows+msvc.

We could enhance the build scripts on Windows+msvc, but that is not the topic here.

What should we do about gperf?

In my understanding, gperf is a rather stable program, so we can expect it to generate the same C code for the same input. Therefore, as with optlib2c, we can put the generated code into our git source tree. That means Windows+msvc users can build a gperf based parser. However, unlike with optlib2c and packcc, we do not put the source code of gperf into our source tree. Anyone wanting to modify a gperf based parser must install gperf in their own workspace.

So what we have to do:

TODO1. We must choose a version of gperf that all developers and contributors use, so that this expectation holds.

TODO2. We must provide a container file (Dockerfile, Containerfile) or a container image for running that version of gperf.

TODO3. We must deploy a detector in our CI environments that compares the code generated by that version of gperf with the code submitted by a contributor. If a pull request includes code generated by an unexpected version of gperf installed in the contributor's workspace, merging should be prevented automatically.

The above TODO items are about our build infrastructure.

TODO4. Generated code and manually written code must be separated at the file level.

  • lisp.c: this file should not include any generated code.
  • lisp_gperf.c: this file should include only generated code.

TODO5. The input for gperf (the .gperf file) must be in our git repo.

  • lisp_def.gperf: anyone wanting to add a new keyword may edit this file.

lisp_gperf.c and lisp_def.gperf become new source files of ctags.

TODO6. source.mak must be extended to accept the new files.
TODO7. A way to update the win32 build script must be provided.

Achieving these items may be possible but challenging.

Thank you for reading, and sorry for my broken English.
I am burned out.
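
To make TODO4 and TODO5 concrete, the layout could look roughly like the sketch below. The file names lisp_gperf.c, lisp_def.gperf, and source.mak come from the comment above; the function name elispKeywordToKind and its signature are invented for illustration and are not part of this pull request or of ctags.

```c
/*
 * One possible shape of the TODO4/TODO5 split:
 *
 *   lisp_def.gperf -- hand-maintained keyword list, input to gperf;
 *                     the only file to edit when adding a keyword.
 *   lisp_gperf.c   -- output of the pinned gperf version (TODO1),
 *                     committed to git but never edited by hand.
 *   lisp.c         -- hand-written parser code, no generated tables.
 */

/* Hypothetical interface that lisp_gperf.c would export; source.mak
 * (TODO6) would compile it alongside lisp.c. */
extern int elispKeywordToKind (const char *keyword, unsigned int len);

/* In lisp.c, the kind lookup would then shrink to a thin call such as:
 *   k = elispKeywordToKind (vStringValue (hint), vStringLength (hint));
 */
```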

Member Author commented:

Thank you, @masatake, this is well detailed. It is a lot of work, which is why I integrated the generated code by hand.

I think this is too much work for a small payoff. How important is it to be able to add keywords easily? Do we think that new Emacs Lisp keywords will be added with some frequency? If not, I think this PR is good enough. We can take the lazy approach: if new keywords need to be added later, we can update the mechanism as you outlined above.

masatake (Member) commented:

"How important is it to be able to add keywords easily?"

This is another critical question. Give me time to answer again; I have a lot to say about this.

masatake (Member) commented:

Sorry for the late answer.

"I think this is too much work for small payoff."

Yes, if we apply the technique only to the EmacsLisp parser.
No, because we can apply the technique to many other parsers.

"How important is it to be able to add keywords easily? Do we think that new Emacs Lisp keywords will be added with some frequency?"

Conceptually, keywords will be added frequently.
Let me explain.

For years I have planned to unify the parsers for S-expression based syntaxes like Lisp, EmacsLisp, Scheme, and Clojure.

What I would have to do is roughly:

  1. Build an S-expression based meta parser: a library for developing a concrete parser specialized to a real language.
  2. Rewrite the Lisp, EmacsLisp, Scheme, and Clojure parsers on top of the meta parser.

If the way to add a keyword is clumsy, it will be an obstacle to realizing this plan.

This is not the only reason. Developing in-language DSLs is a widespread technique among Lisp programmers. They may want to generate tags for their own def-something forms.
See https://docs.ctags.io/en/latest/running-multi-parsers.html?highlight=agent#subparser-tagging-definitions-of-higher-upper-level-language.

Universal Ctags provides the concept of a subparser to satisfy such needs.
A programmer who knows C can add a subparser specialized to their DSL.
A subparser is a set of callback routines called from a base parser.
The Moose parser is an example of a subparser; it is called from the Perl parser.

The subparser concept is not enough for the real demand.
Only a few people want to write C code to extract tags from their DSL input; only a few people like me are willing to pay that cost.
The greatest and most loved feature of ctags may be --regex-<LANG>=....
It would be nice if ctags provided a way to define a subparser for a Lisp based parser from the command line, something like --Sexp-<...>=....
In that case, a user could define a keyword from the command line.

In these grand stories, the answer to "will keywords be added with some frequency?" is YES.

However, what I wrote is no more than a plan; it has not been realized for ... 7 years.
Current development should not be constrained by what is still only a plan.

I can only ask you to make the way to add a keyword (1) easy, (2) well documented, or (3) integrated into our build system.

Thank you for reading.
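
The subparser description above is abstract, so here is a deliberately simplified model of the callback pattern, in the spirit of "a set of callback routines called from a base parser". Everything in it (toySubparser, widget_claim, the define-widget keyword) is invented for this sketch; it is not the real ctags subparser API, nor the planned S-expression meta parser.

```c
#include <stdio.h>
#include <string.h>

/* Simplified model: the base S-expression parser walks "(def..." forms
 * and gives each registered subparser a chance to claim them. */
typedef struct {
	const char *name;                        /* e.g. "MyWidgetDSL" */
	int (*claimDefForm) (const char *head);  /* return a kind, or -1 */
} toySubparser;

/* Hypothetical DSL subparser: tags (define-widget NAME ...) forms. */
static int widget_claim (const char *head)
{
	return strcmp (head, "define-widget") == 0 ? 0 /* widget kind */ : -1;
}

static toySubparser widgetSub = { "MyWidgetDSL", widget_claim };

/* Base parser side: offer the form head to the subparser first, then
 * fall back to the built-in keyword lookup (elided here). */
static int base_lookup (const char *head)
{
	int k = widgetSub.claimDefForm (head);
	if (k != -1)
		return k;
	return -1; /* built-in lookup would go here */
}

int main (void)
{
	printf ("%d\n", base_lookup ("define-widget"));  /* 0: claimed by the subparser */
	printf ("%d\n", base_lookup ("defun"));          /* -1: falls back to the base parser */
	return 0;
}
```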

Member Author commented:

Thank you for the explanation. I understand what you are trying to accomplish. I will give this some thought.

+static unsigned int hash_hint (const char *str, size_t len)
+{
+	static unsigned char asso_values[] =
+	{
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 8, 41, 30,
+		41, 41, 0, 20, 41, 5, 41, 41, 41, 5,
+		41, 41, 41, 41, 41, 0, 15, 15, 0, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
+		41, 41, 41, 41, 41, 41
+	};
+	return len + asso_values[(unsigned char)str[4]];
+}
+
 static int elisp_hint2kind (const vString *const hint)
 {
-	int k = eK_UNKNOWN;
-	int n;
-
-	/* 4 means strlen("(def"). */
-#define EQN(X) strncmp(vStringValue (hint) + 4, &X[3], n) == 0
-	switch (vStringLength (hint) - 4)
-	{
-	case 2:
-		n = 2;
-		if (EQN("defun"))
-			k = eK_FUNCTION;
-		break;
-	case 3:
-		n = 3;
-		if (EQN("defvar"))
-			k = eK_VARIABLE;
-		else if (EQN("defun*"))
-			k = eK_FUNCTION;
-		break;
-	case 4:
-		n = 4;
-		if (EQN("defface"))
-			k = eK_FACE;
-	case 5:
-		n = 5;
-		if (EQN("defconst"))
-			k = eK_CONST;
-		else if (EQN("defmacro"))
-			k = eK_MACRO;
-		else if (EQN("defalias"))
-			k = eK_ALIAS;
-		else if (EQN("defsubst"))
-			k = eK_SUBST;
-		else if (EQN("defgroup"))
-			k = eK_GROUP;
-		else if (EQN("deftheme"))
-			k = eK_THEME;
-		break;
-	case 6:
-		n = 6;
-		if (EQN("defcustom"))
-			k = eK_CUSTOM;
-		else if (EQN("defsubst*"))
-			k = eK_SUBST;
-		else if (EQN("defmacro*"))
-			k = eK_MACRO;
-		break;
-	case 7:
-		n = 7;
-		if (EQN("define-key"))
-			k = KIND_GHOST_INDEX;
-		break;
-	case 9:
-		n = 9;
-		if (EQN("defvar-local"))
-			k = eK_VARIABLE;
-		else if (EQN("define-error"))
-			k = eK_ERROR;
-		break;
-	case 8:
-		n = 8;
-		if (EQN("defvaralias"))
-			k = eK_VARALIAS;
-		break;
-	case 10:
-		n = 10;
-		if (EQN("define-inline"))
-			k = eK_INLINE;
-		break;
-	case 14:
-		n = 14;
-		if (EQN("define-minor-mode"))
-			k = eK_MINOR_MODE;
-		break;
-	case 16:
-		n = 16;
-		if (EQN("define-derived-mode"))
-			k = eK_DERIVED_MODE;
-		break;
-	case 21:
-		n = 21;
-		if (EQN("define-global-minor-mode"))
-			k = eK_MINOR_MODE;
-		break;
-	case 25:
-		n = 25;
-		if (EQN("define-globalized-minor-mode"))
-			k = eK_MINOR_MODE;
-		break;
-	case 27:
-		n = 27;
-		if (EQN("define-obsolete-function-alias"))
-			k = eK_ALIAS;
-		break;
-	}
-#undef EQN
-	return k;
+#define MIN_WORD_LENGTH 6
+#define MAX_WORD_LENGTH 31
+#define MAX_HASH_VALUE 40
+
+	static const char *wordlist[] =
+	{
+		"", "", "", "", "", "", "", "(defvar", "(defface",
+		"(defsubst", "(defsubst*", "", "(defvaralias",
+		"(defvar-local", "(defmacro", "(defmacro*", "(define-key",
+		"(defalias", "(define-error", "(define-inline",
+		"", "(defun", "(defun*", "(define-minor-mode",
+		"(deftheme", "(define-derived-mode", "", "",
+		"", "(defgroup", "(define-global-minor-mode",
+		"", "", "", "(define-globalized-minor-mode", "",
+		"(define-obsolete-function-alias", "", "", "(defconst",
+		"(defcustom"
+	};
+
+	static const int tokenKinds[] =
+	{
+		eK_UNKNOWN, eK_UNKNOWN, eK_UNKNOWN, eK_UNKNOWN,
+		eK_UNKNOWN, eK_UNKNOWN, eK_UNKNOWN, eK_VARIABLE, eK_FACE,
+		eK_SUBST, eK_SUBST, eK_UNKNOWN, eK_VARALIAS, eK_VARIABLE,
+		eK_MACRO, eK_MACRO, KIND_GHOST_INDEX, eK_ALIAS, eK_ERROR,
+		eK_INLINE, eK_UNKNOWN, eK_FUNCTION, eK_FUNCTION,
+		eK_MINOR_MODE, eK_THEME, eK_DERIVED_MODE, eK_UNKNOWN,
+		eK_UNKNOWN, eK_UNKNOWN, eK_GROUP, eK_MINOR_MODE,
+		eK_UNKNOWN, eK_UNKNOWN, eK_UNKNOWN, eK_MINOR_MODE,
+		eK_UNKNOWN, eK_ALIAS, eK_UNKNOWN, eK_UNKNOWN, eK_CONST,
+		eK_CUSTOM,
+	};
+
+	const char *const str = vStringValue (hint);
+	const size_t len = vStringLength (hint);
+	if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
+	{
+		const unsigned int key = hash_hint (str, len);
+		if (key <= MAX_HASH_VALUE)
+		{
+			const char *const s = wordlist[key];
+			if (*str == *s && 0 == strncmp (str + 1, s + 1, len - 1))
+				return tokenKinds[key];
+		}
+	}
+	return eK_UNKNOWN;
+
+#undef MIN_WORD_LENGTH
+#undef MAX_WORD_LENGTH
+#undef MAX_HASH_VALUE
+
 }
 
 static void L_getit (vString *const name, const unsigned char *dbp,
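
For readers unfamiliar with the gperf approach used in the diff, the sketch below shows the same lookup pattern in miniature: a tiny hash maps a candidate string to a slot, the slot's stored keyword is compared once, and a parallel array yields the kind. The keyword subset, the hand-picked weights in toy_hash, and the K_* values are toy stand-ins invented for this example, not the generated tables from parsers/lisp.c.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-ins for the eK_* kind enumerators used in lisp.c. */
enum { K_UNKNOWN = -1, K_FUNCTION, K_VARIABLE, K_MACRO, K_CONST };

#define MIN_WORD_LENGTH 6
#define MAX_WORD_LENGTH 9
#define MAX_HASH_VALUE 10

/* Toy perfect hash: length plus a small weight for the character after
 * "(def".  gperf computes such weights automatically so that every
 * keyword in the set lands in a distinct slot; here they are hand-picked
 * for just four keywords. */
static unsigned int toy_hash (const char *str, size_t len)
{
	static const unsigned char weight[256] = { ['c'] = 1 };
	return (unsigned int) len + weight[(unsigned char) str[4]];
}

/* Slot -> keyword and slot -> kind, the same shape as the generated tables. */
static const char *const wordlist[MAX_HASH_VALUE + 1] = {
	"", "", "", "", "", "",
	"(defun",      /* hash 6  */
	"(defvar",     /* hash 7  */
	"",
	"(defmacro",   /* hash 9  */
	"(defconst",   /* hash 10 */
};

static const int kinds[MAX_HASH_VALUE + 1] = {
	K_UNKNOWN, K_UNKNOWN, K_UNKNOWN, K_UNKNOWN, K_UNKNOWN, K_UNKNOWN,
	K_FUNCTION, K_VARIABLE, K_UNKNOWN, K_MACRO, K_CONST,
};

/* Same flow as the new elisp_hint2kind: bounds check, hash, one comparison. */
static int hint2kind (const char *str)
{
	const size_t len = strlen (str);
	if (len >= MIN_WORD_LENGTH && len <= MAX_WORD_LENGTH)
	{
		const unsigned int key = toy_hash (str, len);
		if (key <= MAX_HASH_VALUE && strcmp (str, wordlist[key]) == 0)
			return kinds[key];
	}
	return K_UNKNOWN;
}

int main (void)
{
	printf ("%d %d %d\n",
	        hint2kind ("(defmacro"),   /* 2  (K_MACRO)    */
	        hint2kind ("(defvar"),     /* 1  (K_VARIABLE) */
	        hint2kind ("(defthing"));  /* -1 (K_UNKNOWN): hashes to a slot,
	                                      but the final comparison rejects it */
	return 0;
}
```

The last call shows why the final string comparison matters: an unknown word can collide with an occupied slot, and the single comparison is what keeps the lookup exact while still doing constant work per token.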