```
usage: main.py [-h] tmp out

Create musl-1.2.4 dataset, train models, report accuracy.

positional arguments:
  tmp         build directory for generating dataset
  out         results and figures will go here

options:
  -h, --help  show this help message and exit
```
This project aims to improve the state of the art in reverse engineering and binary static analysis. We focus on the specific case of binary executable code without debug symbols.
Statically linked binaries are a common way to deploy programs to end-users, but they are opaque and difficult to analyze for security purposes. Malware authors also intentionally make their code harder to analyze using obfuscation, which can circumvent antivirus and malware detection software, so we specifically train on obfuscated data.
The main.py program downloads musl-1.2.4 and extracts source code for some of its functions. It then creates multiple C source code versions of each function by obfuscating their sources with Tigress. Next it compiles all these sources with diverse options, resulting in a large set of obfuscated object files. It also compiles the source code of each function without any added obfuscation, resulting in a set of “plain” object files.
The musl project uses a typical `./configure && make install` workflow, so our program parses the output of `make` to determine the baseline options required to build each function.
Tigress obfuscates single C programs and requires that each program contain a `main` function, so our main.py program adds one to each file before obfuscating.
Finally, we strip all symbols from the object files using GNU strip.
Starting with the stripped object files, we apply further processing for use with some of the machine learning models.
```python
from collections import Counter
from pathlib import Path

import numpy as np

from _utils import fname

# Count examples per function: obfuscated object files vs. sampled blocks.
c1 = Counter(fname(p) for p in Path('out/o').glob('*.o'))
c2 = Counter(fname(p) for p in Path('out/o_bb').glob('*'))
a1 = np.array(list(c1.values()))
a2 = np.array(list(c2.values()))

# Emit a Markdown table of per-function counts (assumes both Counters
# cover the same set of function names).
print(f'| name | obfuscated (μ={a1.mean():.1f}, σ={a1.std():.1f}) '
      f'| block sampled (μ={a2.mean():.1f}, σ={a2.std():.1f}) |')
print('|--|--|--|')
for (k1, v1), (_, v2) in zip(sorted(c1.items()), sorted(c2.items())):
    print(f'| {k1} | {v1} | {v2} |')
```
name | object file (μ=1132.3,σ=77.6) | sampled (μ=10097.4,σ=8692.7) |
---|---|---|
abs | 843 | 2349 |
acos | 1168 | 15214 |
asin | 1168 | 14068 |
atan2 | 1084 | 20914 |
ceil | 1168 | 7772 |
cos | 1168 | 10043 |
daemon | 1030 | 10385 |
floor | 1168 | 7387 |
inet_addr | 1140 | 3878 |
inet_aton | 1140 | 11469 |
isalnum | 1222 | 2819 |
memccpy | 1152 | 7533 |
memcmp | 1072 | 4866 |
memmem | 1222 | 29691 |
sin | 1168 | 12776 |
stpcpy | 1156 | 6150 |
stpncpy | 1164 | 8221 |
strchr | 1129 | 4126 |
strcpy | 1032 | 2365 |
strncpy | 1074 | 2552 |
strstr | 1222 | 40261 |
strtok | 1168 | 8126 |
tan | 1168 | 9534 |
utime | 1153 | 4442 |
wmemmove | 1128 | 5495 |
Binaries are sequences of bytes, which means we can represent a binary of any length with a vector X of length 256 by storing the count of byte value i at index i of X. This bag-of-words format loses information about the ordering of the original bytes.
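A minimal sketch of this "bag of bytes" representation using NumPy (the function name is our own, for illustration):

```python
import numpy as np

def bag_of_bytes(data: bytes) -> np.ndarray:
    # Count occurrences of each byte value 0..255; ordering is discarded.
    return np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)

v = bag_of_bytes(b"\x00\x01\x01\xff")  # v[0]=1, v[1]=2, v[255]=1
```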
We disassemble the original bytes into a stream of text tokens using Capstone’s support for the amd64 ISA. Each function becomes a single string. Whitespace separates each token within the string. A small sample of these tokens looks like this:
```
jg 0x47
add r8b, byte ptr [rcx]
add dword ptr [rax], eax
```
In this representation, punctuation such as brackets and commas are part of the token.
After disassembly, we create another data representation using a traditional bag-of-words approach. First we generate a vocabulary mapping each unique token t to its rank by frequency of occurrence in the dataset. Then for each function, we store the count of token t at index t of a vector V. The length of this vector equals the number of unique tokens in the training data. Because the test data may contain tokens not present in the training data, we also add an "unknown" token to the bag of words.
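The vocabulary and count vectors described above can be sketched with the standard library (the toy documents here are illustrative, not dataset samples):

```python
from collections import Counter

# Toy corpus: each "function" is one whitespace-separated token string.
train_docs = [
    "add rax , rbx add rax , rcx",
    "mov rax , rbx jg 0x47",
]

# Rank tokens by corpus frequency; reserve the final slot for unknowns.
counts = Counter(t for doc in train_docs for t in doc.split())
vocab = {t: i for i, (t, _) in enumerate(counts.most_common())}
unk = len(vocab)

def bag_of_tokens(doc: str) -> list[int]:
    v = [0] * (len(vocab) + 1)
    for t in doc.split():
        v[vocab.get(t, unk)] += 1  # unseen tokens fall into the unk slot
    return v

vec = bag_of_tokens("mov rdi , rdi")  # "rdi" was never seen in training
```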
Finally, we use Ghidra to extract basic blocks from each object file. To synthesize a new representation of a given function, we can sample any number of basic blocks from all the object files for that function.
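The sampling step might look like the following sketch, where `blocks_by_function` stands in for the basic blocks Ghidra extracted across all object files for each function (the mapping, byte strings, and function name are illustrative):

```python
import random

# Illustrative stand-in for Ghidra's output: function name -> raw bytes
# of every basic block found across that function's object files.
blocks_by_function = {
    "memcmp": [b"\x48\x39\xf7", b"\x75\x10", b"\x31\xc0\xc3"],
}

def sample_blocks(name: str, k: int, rng: random.Random) -> bytes:
    # Draw k blocks with replacement and concatenate them into a new
    # synthetic representation of the function.
    return b"".join(rng.choices(blocks_by_function[name], k=k))

sample = sample_blocks("memcmp", 5, random.Random(0))
```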
We use the following models with the data representations described above:
- kNN
- Classify a function according to its k-nearest neighbors with a custom distance metric.
- RandomForest
- Classify a function using a “bag of bytes” representation.
- RandomForestTokens
- Classify a function using its “bag of tokens” disassembly representation.
- fastText
- Classify a function based on the disassembly representation using n-grams of length 5.
To find the similarity between the raw bytes of two functions, this model uses a metric called Normalized Compression Distance (NCD). First we compress each document and take the lengths of the compressed documents as D1 and D2. Next we concatenate both original documents and take the length of their compressed result as D3. Finally we express the normalized distance as:

NCD = (D3 − min(D1, D2)) / max(D1, D2)
This model reports the most frequent label among the k nearest neighbors.
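The distance above can be sketched with a general-purpose compressor; the source does not name one, so zlib is an assumption here:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance between two byte strings.
    d1 = len(zlib.compress(x))
    d2 = len(zlib.compress(y))
    d3 = len(zlib.compress(x + y))  # compress the concatenation
    return (d3 - min(d1, d2)) / max(d1, d2)

# Similar inputs compress well together, giving a small distance.
a = b"mov rax, rbx " * 40
b_ = b"\x00\x37\x13" * 40
print(ncd(a, a), ncd(a, b_))
```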
For both the “Bag of Bytes” and “Bag of Tokens” representations, we train a Random Forest classifier from scikit-learn (version 1.3.1) with the following non-default hyperparameters:
- n_estimators: (40,80,100,120) (trained separately)
- n_jobs: -1
- class_weight: “balanced”
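As a sketch of this configuration (toy data stands in for the real feature matrices, and only one of the four `n_estimators` settings is shown):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(40, 256))  # toy stand-in for byte counts
y = np.array(["memcmp", "strcpy"] * 20)  # toy function labels

# Hyperparameters mirror the list above; random_state added for
# reproducibility of this sketch.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                             class_weight="balanced", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:2]))
```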
This fastText 0.9.2 model operates on disassembled text tokens, with the following non-default hyperparameters:
- thread: 18
- lr: 2
- epoch: 300
- dim: 30
- wordNgrams: 5
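fastText's supervised mode expects one example per line with a `__label__` prefix. The sketch below prepares such a file from disassembly tokens; the sample rows and file name are illustrative, and the training call is shown as a comment because it requires the `fasttext` package and real data:

```python
from pathlib import Path

# Illustrative (function name, disassembly tokens) pairs.
samples = [
    ("memcmp", "cmp rdi , rsi jne 0x10"),
    ("strcpy", "mov al , byte ptr [rsi] mov byte ptr [rdi] , al"),
]
lines = [f"__label__{name} {tokens}" for name, tokens in samples]
Path("train.txt").write_text("\n".join(lines) + "\n")

# Training with the hyperparameters listed above:
# import fasttext
# model = fasttext.train_supervised("train.txt", thread=18, lr=2,
#                                   epoch=300, dim=30, wordNgrams=5)
```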
(Figure: confusion matrix for the Block-Walk embedding model, trained for 150 epochs.)
Currently we use gcc (version 11.4.0) on Ubuntu Linux (version 22.04.3), and target only the x86-64 instruction set architecture (ISA). Future improvements could add more compilers, operating systems, programming languages, and ISAs.
In addition, some combinations of Tigress transformations result in invalid code. We discard these from our dataset, so improvements to Tigress's transformations would give us a larger and more diverse dataset in less time.
Finally, some valid C code results in object files which do not contain a symbol matching the name of the original function. This may be due to a failed Tigress transformation or aggressive function inlining by the compiler. We currently exclude these object files from our dataset, but we would rather keep the inlined results.