The artifact that we are submitting for evaluation is the func2vec tool (originally called program2vec as will be evident below). Anything named program2vec is simply our internal name for what became the func2vec tool.
This tool can be used to produce an embedding for any program compilable by LLVM, and includes the accompanying Linux dataset used to evaluate the ability of func2vec to detect function synonyms.
Running func2vec consists of a pipeline with several stages:
- Produce a bitcode file for the software system
- A set of LLVM passes that extract interprocedural control flow from a bitcode file
- The random walker which translates the LLVM control flow data into an l-PDS and performs the walks
- Commands for training a model from these random walks using the Gensim implementation of word2vec.
- Commands for evaluating the model, given a gold standard of function synonyms.
NOTE: Unzipping the artifact file creates an "artifact" directory. It is assumed that all commands are run from the root of that directory. This artifact directory should be the /program2vec in the docker container (see "Running the docker container").
- data/definedfunctions.txt A list of all functions in linux (generated as a byproduct of func2vec)
- data
- data/pathbased.walks
- Path-based random walks from Linux
- data/flat.walks
- Flat random walks from Linux
- data/pathbased.model
- Path-based func2vec embedding of Linux kernel (text word2vec format)
- data/flat.model
- Path-based func2vec embedding of Linux kernel (text word2vec format)
- data/pcisound.sqlite
- The error handler database (input to miner) for PCI sound drivers in the Linux kernel.
- data/5fs.sqlite
- The input to the error-handling specification miner for file system mining
- data/manual-golden.txt
- The full gold standard that was manually reviewed
- data/underscores.txt
- The underscore naming convention gold standard
Download program2vec.tar which contains the docker container for running this artifact.
$PATHTOARTIFACT should be the unzipped artifact directory (contains src directory).
docker load --input program2vec.tar docker run -it -v $PATHTOARTIFACT:/program2vec program2vec
The run command starts a docker container with the local artifact folder mounted inside the container at /program2vec.
* IT IS IMPORTANT THAT THE DOCKER VOLUME IS MOUNTED AT THIS LOCATION *
ls /program2vec should show a "src" directory.
cd docker docker build -t program2vec .
Issue the following commands inside the docker container:
cd /program2vec mkdir build cd build cmake .. make -j$(nproc)
The run.sh in script in the root of the artifact replicates all of the main paper results. This takes several hours and 20G to run (three hours on an i7-4500). Alternatively, individual steps are described in this document.
cd /program2vec run.sh
Explanation of output:
The run.sh replication script creates an output folder with all of the main results of the evaluation of func2vec.
- output/pathbased.walks
- The random walks generated by the random walker using path-based func2vec
- output/flat.walks
- The random walks generated by the random walker using flat func2vec
- output/pathbased.model
- This is the path-based model
- output/flat.model
- The flat model
- output/pathbased-manual-roc.png
- ROC curve image for Figure (path-based manual)?
- output/pathbased-flat-roc.png
- ROC curve image for Figure (flat manual)?
- output/pathbased-underscores-roc.png
- ROC curve image for Figure (path-based underscores)?
- output/flat-underscores-roc.png
- ROC curve image for Figure (flat underscores)?
- output/pathbased-clusters.txt
- Path-based k-Means clusters
- output/flat-clusters.txt
- Flat k-Means clusters
- output/pathbased-manual-f1.txt
- F1 score for path-based manual gold standard. Result at the end of the file.
- output/flat-manual-f1.txt
- F1 score for flat manual gold standard. Result at the end of the file.
- output/pathbased-underscores-f1.txt
- F1 score for path-based underscores gold standard
- output/flat-underscores-f1.txt
- F1 score for flat underscores gold standard
- output/sound-unmerged.specs
- Sound PCI mining results without using function synonyms
- output/sound-merged.specs
- Sound PCI mining results with using function synonyms
- output/fs-unmerged.specs
- File system mining results without using function synonyms
- output/fs-merged.specs
- File system mining results with using function synonyms
- output/table3.txt
- Matches table 3 in the paper. Shows increase in support of mining rules discussed in paper. These results will differ in absolute numbers, but increase in support for the rules is consistent with what is reported in the paper.
For any single C file, func2vec can be run as follows:
clang -c -emit-llvm example.c -o example.bc python2 -m walker walk --bitcode example.bc --output example.walks python2 -m walker train --input examples.walks --output example.model
Memory requirement: Approximately 20G
Time: Approximately an hour
INPUT: data/vmlinux.o.bc
OUTPUTS: data/pathbased.walks data/flat.walks
The data directory contains a pre-generated bitcode file that corresponds to Linux ??. The only modifications to the source code where patches provided by the LLVMLinux project required to compile Linux with clang.
Our evaluation of func2vec compares the standard invocation of func2vec, which is path-based, with a flat version that does not use paths.
The parameters used in the paper for the "path-based" model are:
- bias 2.0
- length 100
- numwalks 100
- interproc 1
- remove-cross-folder
The parameters used in the paper for the "flat" model are:
- length 100
- numwalks 100
- interproc 0
- flat
(The other parameters are not applicable to the flat method of walking)
Run these commands from /program2vec/data:
python2 -m walker walk --bitcode vmlinux.o.bc --output pathbased.walks --length 100 --walks 100 --interprocedural 1 --bias 2.0 --remove-cross-folder python2 -m walker walk --bitcode vmlinux.o.bc --output flat.walks --length 100 --walks 100 --interprocedural 0 --flat
INPUTS: data/pathbased.walks data/flat.walks
OUTPUTS: data/pathbased.model data/flat.mode1l
Training takes approximately 10 minutes.
python2 -m walker train --input pathbased.walks --output pathbased.model --mincount 5 --window 1 python2 -m walker train --input flat.walks --output flat.model --mincount 5 --window 1
We constructed two different gold standard data sets to evaluate func2vec.
- The first data set consists of 2,652 unique functions that were identified by hand and reviewed with the assistance of a Linux kernel developer. These functions form 265 equivalence classes of function synonyms. This data set, along with the explicit false relations used to check for conflicts, is in the file data/manual-golden.txt
- The second data set consists of pairs of functions that follow an underscore naming scheme. This is a strong naming convention in Linux. These pairs have been randomly sampled and spot checked, but not every pair has been examined by hand. This data set is located in the file data/underscores.txt.
There are two metrics that each gold standard uses, that correspond to the two different data sets that are a part of this artifact.
The first data set is manual-synonyms.txt, which are the synonyms that were manually reviewed. The second data set is underscores.txt, which are synonyms that were identified based on the use of the underscore naming convention.
Each command will output the full ROC curve to a .png file, and print the AUROC score to the screen.
(From the /func2vec/data directory)
python2 -m walker metric roc --model linux-paths.model --reference manual-synonyms.txt --png pathbased-manual-roc.png python2 -m walker metric roc --model linux-paths.model --reference underscores.txt --png pathbased-underscores-roc.png python2 -m walker metric roc --model linux-flat.model --reference manual-synonyms.txt --png flat-manual-roc.png python2 -m walker metric roc --model linux-flat.model --reference underscores.txt --png flat-underscores.roc.png
INPUTS: data/pathbased.model data/flat.model
OUTPUTS: data/pathbased_clusters.txt data/flat_clusters.txt
python2 -m walker cluster kmeans --k 3500 --model pathbased.model --output pathbased-clusters-k3500.txt python2 -m walker cluster kmeans --k 3500 --model flat.model --output flat-clusters-k3500.txt
python2 -m walker metric f1cluster --reference manual-synonyms.txt --clusters pathbased-clusters.txt python2 -m walker metric f1cluster --reference manual-synonyms.txt --clusters flat-clusters.txt python2 -m walker metric f1cluster --reference underscores.txt --clusters pathbased-clusters.txt python2 -m walker metric f1cluster --reference underscores.txt --clusters flat-clusters.txt
Note that the k-Means clustering implementation is not entirely deterministic, and replication may produce slightly different results. Nonetheless, the numbers should be close and consistent with the reported results in Table 1.
ipython2 from gensim.models import KeyedVectors model = gensim.load_word2vec_format('MODELPATH', binary=False) model.similarity('FN1', 'FN2')
The example application of func2vec presented in the paper is the enhancement of specification mining using function synonyms. This section reproduces the main results of Tables 2 and 3, which shows that function synonyms can improve the support of individual specifications.
The input to the miner is a mining database that consists of a list of error handling locations, and the function calls that are in the context and response of the
Detecting error-handler locations is outside the scope of this paper and relies on other tools that are not included in this artifact. Given an error-handling database that is to be mined, this artifact shows how function synonyms can be applied.
Support in our mining setup is counted in terms of the number of individual error handlers that support an association rule. Each error handler has a corresponding row in the Handler table of the database. The following SQL statement counts the number of error handlers for the PCI sound database:
sqlite3 pcisound.sqlite select count(*) from handler; 3173
Execute the runminer python script. This wil put table3-* files in output/
- ::
- python2 runminer.py