Machine learning [how to build a working model from scratch]

fab edited this page Dec 28, 2023 · 36 revisions

I built a custom DevGPT to write most of the code I'm currently publishing on GitHub.

Since I can feed the machine learning pipeline with fresh data at any time (a dataset with millions of blacklisted and whitelisted domains and subdomains, aka FQDNs), I planned to build a model that predicts a badness score for newly submitted FQDNs such as example.com, suspicious-website.com or doqwindwoi2342.dwirh29r32.cc. The score ranges from 1 to 100, where 1 is 100% safe and 100 is 100% bad.
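To make the 1 to 100 scale concrete, here is a minimal sketch of how a classifier's predicted probability for the "bad" class could be mapped onto it. The function name `badness_score` and the linear mapping are assumptions for illustration, not necessarily what the final model does.

```python
def badness_score(p_bad: float) -> int:
    """Map a probability in [0, 1] to an integer score in [1, 100]."""
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be between 0 and 1")
    # 0.0 maps to 1 (100% safe), 1.0 maps to 100 (100% bad), linear in between.
    return 1 + round(p_bad * 99)
```

With a fitted scikit-learn classifier, `p_bad` would typically come from `predict_proba`, taking the column that corresponds to the "bad" label (which column that is depends on how the labels were encoded).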

I started with a subset of the entire dataset (50,000 items in total instead of millions).
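One way to draw such a subset is sketched below. It assumes the dataset is available as `(fqdn, label)` pairs and that the subset should keep the same blacklisted/whitelisted proportions as the full data; both points, as well as the helper name `stratified_subset`, are assumptions for this example.

```python
import random
from collections import defaultdict

def stratified_subset(rows, n_total, seed=42):
    """Sample about n_total (fqdn, label) pairs, keeping the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for fqdn, label in rows:
        by_label[label].append(fqdn)
    total = sum(len(fqdns) for fqdns in by_label.values())
    subset = []
    for label, fqdns in by_label.items():
        # Each label gets a share of the subset proportional to its share of the data.
        k = min(len(fqdns), round(n_total * len(fqdns) / total))
        subset.extend((fqdn, label) for fqdn in rng.sample(fqdns, k))
    rng.shuffle(subset)  # avoid having all items of one label grouped together
    return subset
```

Calling `stratified_subset(rows, 50_000)` on the full list would give roughly 50,000 items; the fixed seed keeps the draw reproducible between runs.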

After that, I built a simple ensemble pipeline to find the most accurate method for training and inference. I tested the most popular and easy-to-implement methods in this context, such as RandomForest, GradientBoosting, ExtraTrees, LogisticRegression and SVC:

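    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Candidate classifiers; a fixed random_state keeps runs reproducible.
    classifiers = {
        "RandomForest": RandomForestClassifier(random_state=42),
        "GradientBoosting": GradientBoostingClassifier(random_state=42),
        "ExtraTrees": ExtraTreesClassifier(random_state=42),
        "LogisticRegression": LogisticRegression(random_state=42, max_iter=2000),
        "SVC": SVC(probability=True, random_state=42)  # probability=True enables predict_proba
    }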

performing a randomized search (RandomizedSearchCV) instead of a grid search (GridSearchCV):

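    from sklearn.model_selection import RandomizedSearchCV
    for name, clf in classifiers.items():  # loop header inferred from the names used below
        random_search = RandomizedSearchCV(clf, params[name], n_iter=20, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1, random_state=42)
        random_search.fit(X_res, y_res)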

by using all of the following parameters (I'm running this project on a Dell R620 server with 48 cores and 128GB of RAM, no GPU):

    from scipy.stats import randint as sp_randint, uniform  # assumed imports for the names used below

    # sp_randint(low, high) samples integers in [low, high); high is excluded.
    # uniform(loc, scale) samples floats in [loc, loc + scale], e.g. uniform(0.01, 0.2) covers 0.01 to 0.21.
    params = {
        "RandomForest": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "GradientBoosting": {'n_estimators': sp_randint(100, 300), 'learning_rate': uniform(0.01, 0.2), 'max_depth': sp_randint(3, 10)},
        "ExtraTrees": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "LogisticRegression": {'C': uniform(0.01, 100), 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
        "SVC": {'C': uniform(0.1, 10), 'kernel': ['linear', 'rbf', 'poly']}
    }

to find the most suitable approach. I then focused on the selected approach to increase accuracy.
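Put together, the comparison can be sketched as a small self-contained script. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the real FQDN features, only two of the five classifiers, and much smaller search settings so it runs in seconds. The `results` dictionary and the way the winner is picked (highest cross-validated `f1_macro`) are also assumptions for this example.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real FQDN feature matrix and labels (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

classifiers = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=2000),
}
params = {
    "RandomForest": {"n_estimators": randint(10, 50), "max_depth": randint(3, 10)},
    "LogisticRegression": {"C": uniform(0.01, 100)},
}

results = {}
for name, clf in classifiers.items():
    # Sample a few random parameter combinations and cross-validate each one.
    search = RandomizedSearchCV(clf, params[name], n_iter=3, cv=3,
                                scoring="f1_macro", random_state=42)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Pick the classifier with the highest cross-validated macro F1.
best_name = max(results, key=lambda n: results[n][0])
print(best_name, results[best_name])
```

The randomized search only tries `n_iter` combinations per classifier, so its cost is fixed in advance, whereas a grid search grows with every value added to the grid.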