Skip to content
/ dga Public

Classifier to separate legitimate domains from those generated by a domain generating algorithm (DGA).

License

Notifications You must be signed in to change notification settings

jayjacobs/dga

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is an implement of a classification algorithm trained on legitamate domains (taken from the Alexa list of popular web sites and the Open DNS popular domains list), as well as algorithmically generated domains from the Cryptolocker and GOZ botnet.

Given a domain name the function will classify it as either "dga" or "legit" and include the probability of the classification.

Begin by loading up the DGA library (note: you may get an error on install_github if you had never ‘git clone’d before, or added the host as a known SSH host).

devtools::install_github("jayjacobs/dga")
library(dga)

Let's test with the easy most popular websites, and classify them as either "legit" or "dga".

good20 <- c("facebook.com", "google.com", "youtube.com",
           "yahoo.com", "baidu.com", "wikipedia.org",
           "amazon.com", "live.com", "quicken.com",
           "taobao.com", "blogspot.com", "google.co.in",
           "twitter.com", "linkedin.com", "yahoo.co.jp",
           "bing.com", "sina.com.cn", "yandex.ru",
           "msn.com", "vikings.com")

dgaPredict(good20)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

##         name class  prob
## 1   facebook legit 1.000
## 2     google legit 1.000
## 3    youtube legit 1.000
## 4      yahoo legit 1.000
## 5      baidu legit 1.000
## 6  wikipedia legit 0.998
## 7     amazon legit 1.000
## 8       live legit 1.000
## 9    quicken legit 1.000
## 10    taobao legit 1.000
## 11  blogspot legit 1.000
## 12    google legit 1.000
## 13   twitter legit 1.000
## 14  linkedin legit 1.000
## 15     yahoo legit 1.000
## 16      bing legit 1.000
## 17      sina legit 1.000
## 18    yandex legit 1.000
## 19       msn legit 1.000
## 20   vikings legit 1.000

Now some domain generated algorithms from the cryptolocker botnet:

bad20 <- c("btpdeqvfmjxbay.ru", "rrpmjoxjsbsw.ru", "wibiqshumvpns.ru", 
           "mhdvnabqmbwehm.ru", "chyfrroprecy.ru", "uyhdbelswnhkmhc.ru",
           "kqcrotywqigo.ru", "rlvukicfjceajm.ru", "ibxaoddvcped.ru", 
           "tntuqxxbvxytpif.ru", "heksblnvanyeug.ru", "kexngyjudoptjv.ru",
           "hwenbesxjwrwa.ru", "oovftsaempntpx.ru", "uipgqhfrojbnjo.ru", 
           "igpjponmegrxjtr.ru", "eoitadcdyaeqh.ru", "bqadfgvmxmypkr.ru", 
           "bycoifplnumy.ru", "aeqcwsreocpbm.ru")
dgaPredict(bad20)
##               name class  prob
## 1   btpdeqvfmjxbay   dga 1.000
## 2     rrpmjoxjsbsw   dga 1.000
## 3    wibiqshumvpns   dga 1.000
## 4   mhdvnabqmbwehm   dga 1.000
## 5     chyfrroprecy   dga 0.854
## 6  uyhdbelswnhkmhc   dga 1.000
## 7     kqcrotywqigo   dga 1.000
## 8   rlvukicfjceajm   dga 1.000
## 9     ibxaoddvcped   dga 1.000
## 10 tntuqxxbvxytpif   dga 1.000
## 11  heksblnvanyeug   dga 0.980
## 12  kexngyjudoptjv   dga 1.000
## 13   hwenbesxjwrwa   dga 1.000
## 14  oovftsaempntpx   dga 1.000
## 15  uipgqhfrojbnjo   dga 1.000
## 16 igpjponmegrxjtr   dga 1.000
## 17   eoitadcdyaeqh   dga 1.000
## 18  bqadfgvmxmypkr   dga 1.000
## 19    bycoifplnumy   dga 1.000
## 20   aeqcwsreocpbm   dga 1.000

Algorithm is about 98% effective, so some things are misclassified, the "prob" (probability) column can be used to manually inspect some of the output.

borderline <- c("20minutes.fr", "siriusxm.com", "fileblckr.com", "haus-am-brunnen.de", 
                "left21.com", "rw3ramr.info", "letter861cod.info", "mintadelpyjychw.ru", 
                "zsdm7erb.us", "surceskmgf.net")

dgaPredict(borderline)
##               name class  prob
## 1        20minutes   dga 0.588
## 2         siriusxm   dga 0.550
## 3        fileblckr   dga 0.576
## 4  haus-am-brunnen   dga 0.520
## 5           left21   dga 0.540
## 6          rw3ramr legit 0.546
## 7     letter861cod legit 0.536
## 8  mintadelpyjychw legit 0.522
## 9         zsdm7erb legit 0.524
## 10      surceskmgf legit 0.582

So if the application is more sensitive to misclassification, the threshold for classification can be adjusted up or down, notice the probability shown is the confidence in classification, so it will dip beneath 0.5 for legitimate domains if dgaThreshold is raised.

dgaPredict(borderline, dgaThreshold=0.55)
##               name class  prob
## 1        20minutes   dga 0.588
## 2         siriusxm   dga 0.550
## 3        fileblckr   dga 0.576
## 4  haus-am-brunnen legit 0.480
## 5           left21 legit 0.460
## 6          rw3ramr legit 0.546
## 7     letter861cod legit 0.536
## 8  mintadelpyjychw legit 0.522
## 9         zsdm7erb legit 0.524
## 10      surceskmgf legit 0.582

This uses a Random Forest model:

## Random Forest 
## 
## 85457 samples
##     3 predictors
##     2 classes: 'legit', 'dga' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## 
## Summary of sample sizes: 76911, 76911, 76911, 76912, 76912, 76911, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   2     1    1     1     6e-04   0.002    0.002  
##   3     1    1     1     9e-04   0.002    0.002  
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

About

Classifier to separate legitimate domains from those generated by a domain generating algorithm (DGA).

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages