The storage defines how the ASTs should be saved on disk.
For now, astminer supports several tree-based and path-based storage formats.
astminer also knows how to find the structure of the dataset and can save trees or path contexts in the appropriate holdout folders (train, val, and test).
If the data is not split, all trees will be saved in the data folder.
Description files for trees or paths will be saved along with holdouts in the same outputPath directory.
Storage config classes are defined in StorageConfigs.kt.
Saves the trees with labels to a comma-separated file. Each tree is encoded to a single line using parentheses sequences.
name: csv AST
Saves each tree in a separate file using the dot syntax.
Along with dot files, this storage also saves description.csv with a mapping between files with trees, source files, and labels.
name: dot AST
Saves each tree with its label in the JSON lines format inspired by the 150k Python dataset.
name: json AST
In this format, each line represents an AST with its label, path, and all vertices:
{
  "label": "1.java",
  "path": "src/test/resources/examples/1.java",
  "ast": [
    { "token": "EMPTY", "typeLabel": "CompilationUnit", "children": [1] },
    { "token": "class", "typeLabel": "TypeDeclaration", "children": [2, 3, 4] },
    ...
  ]
}
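A minimal Python sketch of how such a record could be read back and traversed. It assumes that the numbers in "children" are indices into the "ast" list and that the root is the first vertex, which is what the example above suggests but is not stated explicitly. The third vertex (SimpleName / Foo) is made up so that the sample record is complete.

```python
import json

# One record in the JSON lines layout described above. The field names come
# from the example; the "Foo" vertex is invented for illustration.
line = json.dumps({
    "label": "1.java",
    "path": "src/test/resources/examples/1.java",
    "ast": [
        {"token": "EMPTY", "typeLabel": "CompilationUnit", "children": [1]},
        {"token": "class", "typeLabel": "TypeDeclaration", "children": [2]},
        {"token": "Foo", "typeLabel": "SimpleName", "children": []},
    ],
})


def preorder(record, index=0, depth=0):
    """Yield (depth, typeLabel, token) for the subtree rooted at `index`.

    Assumption: child references are positions in record["ast"].
    """
    vertex = record["ast"][index]
    yield depth, vertex["typeLabel"], vertex["token"]
    for child in vertex["children"]:
        yield from preorder(record, child, depth + 1)


record = json.loads(line)
nodes = list(preorder(record))
for depth, type_label, token in nodes:
    print("  " * depth + f"{type_label}: {token}")
```

For a real output file, the same function can be applied to each line of the file in turn, since every line holds one complete record.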
Path-based representation was introduced by Alon et al.
It is used in popular code representation models such as code2vec and code2seq.
Extracts paths from each AST. The output is stored in 4 files:
- node_types.csv contains numeric IDs and corresponding node types with directions (up/down, as described in the paper by Uri Alon et al.).
- tokens.csv contains numeric IDs and corresponding tokens.
- paths.csv contains numeric IDs and AST paths in the form of space-separated sequences of node type IDs.
- path_contexts.c2s contains the labels and sequences of path-contexts (each representing two tokens and a path between them). This file is generated for every holdout.

Each line in path_contexts.c2s starts with a label followed by a sequence of space-separated triples. Each triple contains comma-separated IDs of the start token, path, and end token.
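The line layout described above can be parsed with a few lines of Python. This is only a sketch: the sample line and its IDs are made up, and it assumes that the label itself contains no spaces.

```python
from typing import NamedTuple


class PathContext(NamedTuple):
    start_token: int  # ID of the start token
    path: int         # ID of the path
    end_token: int    # ID of the end token


def parse_line(line):
    """Split one line into its label and a list of path contexts."""
    label, *triples = line.split()
    contexts = [PathContext(*map(int, triple.split(","))) for triple in triples]
    return label, contexts


# Hypothetical line: a label followed by two triples.
sample = "getName 1,5,2 3,7,4"
label, contexts = parse_line(sample)
print(label, contexts)
```

The IDs in each triple can then be looked up in the tokens and paths files to recover the actual tokens and node type sequences.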
name: code2vec
maxPathLength: 10
maxPathWidth: 2
maxTokens: 1000 # can be omitted
maxPaths: 1000 # can be omitted
maxPathContextsPerEntity: 200 # can be omitted
Extracts paths from each AST and saves them in the code2seq format.
The output is a path_context.c2s file, which is generated for every holdout.
Each line starts with a label followed by a sequence of space-separated triples.
Each triple contains comma-separated IDs of the start token, path node types, and end token.
To reduce memory usage, you can enable the nodesToNumbers option.
If nodesToNumbers is set to true, all types are converted into numbers and node_types.csv with the node-number vocabulary is added to the output files.
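The idea behind replacing node types with numbers can be illustrated with a small Python sketch. The numbering scheme here (IDs assigned in first-seen order, starting at 1) is an assumption made for the example and may differ from what astminer actually writes to node_types.csv; the node type names are made up.

```python
def build_vocabulary(paths):
    """Replace node type names with numbers and return the vocabulary used.

    Assumption: IDs start at 1 and follow first-seen order.
    """
    vocabulary = {}
    encoded = []
    for path in paths:
        encoded.append(
            [vocabulary.setdefault(node_type, len(vocabulary) + 1) for node_type in path]
        )
    return vocabulary, encoded


# Hypothetical paths given as sequences of node type names.
paths = [
    ["SimpleName", "MethodDeclaration", "Block"],
    ["SimpleName", "MethodDeclaration", "SimpleName"],
]
vocabulary, encoded = build_vocabulary(paths)
print(vocabulary)
print(encoded)
```

Each long type name is stored once in the vocabulary, and every path holds only short numbers, which is where the savings come from.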
name: code2seq
length: 10
width: 2
maxPathContextsPerEntity: 200 # can be omitted
nodesToNumbers: true # can be omitted
Here, length is the maximum length of a path (inclusive), and width is the maximum distance between the children of the least common ancestor of a path.
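To make the two limits concrete, here is a hedged Python sketch that computes both quantities for a pair of leaves in a toy tree. It assumes that length is counted in edges and that width is the difference between the child positions, under the least common ancestor, of the two branches leading to the leaves. The Node class and the node type names are hypothetical and are not part of astminer.

```python
class Node:
    def __init__(self, type_label, children=()):
        self.type_label = type_label
        self.children = list(children)
        self.parent = None
        self.index = 0  # position among the parent's children
        for i, child in enumerate(self.children):
            child.parent = self
            child.index = i


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length_and_width(a, b):
    """Return (length, width) of the path between two distinct leaves."""
    up, down = ancestors(a), ancestors(b)
    # Drop the shared tail; the last node dropped is the least common ancestor.
    while up and down and up[-1] is down[-1]:
        up.pop()
        down.pop()
    length = len(up) + len(down)  # assumption: length counts edges
    width = abs(up[-1].index - down[-1].index)
    return length, width


def fits(a, b, max_length, max_width):
    """Check a path against both limits, treating them as inclusive."""
    length, width = path_length_and_width(a, b)
    return length <= max_length and width <= max_width


x, y, z = Node("SimpleName"), Node("SimpleName"), Node("NumberLiteral")
inner = Node("Expression", [y, z])
root = Node("Block", [x, Node("Modifier"), inner])

print(path_length_and_width(x, z))  # (3, 2)
print(path_length_and_width(y, z))  # (2, 1)
```

With length: 10 and width: 2, both paths in this toy tree would be kept; lowering width to 1 would drop the path between x and z.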