
Machine learning [how to build a working model from scratch]

fab edited this page Dec 28, 2023 · 36 revisions

I built a custom DevGPT to write most of the code I'm currently publishing on GitHub and I am using it to build a model suitable for domain safety ranking from scratch.

I will use the rank score to remove FQDNs predicted to be safe from the release blacklist, which should improve accuracy and reduce false positives.

Since I can feed the machine learning pipeline with fresh data at any time (a dataset with millions of blacklisted and whitelisted domains and subdomains, aka FQDNs), I planned to build a model that predicts a badness score for newly submitted FQDNs. The rank score is in the 1-100 range, where 1 means really safe and 100 means really bad.
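As a rough illustration of how a 1-100 rank could be derived and then used for filtering, here is a minimal sketch. It assumes the classifier returns a probability of "bad" between 0 and 1. The linear mapping, the function names `probability_to_rank` and `filter_blacklist`, and the cutoff of 10 are all hypothetical and are not taken from the actual project code.

```python
def probability_to_rank(p_bad: float) -> int:
    """Map a predicted probability of badness in [0, 1] to a 1-100 rank."""
    p = min(max(p_bad, 0.0), 1.0)  # clamp out-of-range inputs
    return 1 + round(p * 99)       # 0.0 -> 1 (really safe), 1.0 -> 100 (really bad)


def filter_blacklist(blacklist, ranks, safe_threshold=10):
    """Keep only the FQDNs whose rank is above safe_threshold.

    safe_threshold is a hypothetical cutoff. FQDNs without a rank are
    treated as rank 100, so they stay on the blacklist.
    """
    return [fqdn for fqdn in blacklist if ranks.get(fqdn, 100) > safe_threshold]


# Usage with made-up domains and made-up model outputs
ranks = {
    "cdn.example": probability_to_rank(0.02),
    "malware.example": probability_to_rank(0.97),
}
release = filter_blacklist(["cdn.example", "malware.example", "unscored.example"], ranks)
print(release)
```

Treating unscored FQDNs as rank 100 is a conservative choice: an entry leaves the blacklist only when the model has explicitly scored it as safe.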

I started with a subset of the entire dataset (25000 good + 25000 bad items instead of millions of them).
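Drawing such a balanced subset could look like the sketch below. The function name `balanced_subset`, the label convention (0 = good, 1 = bad) and the fixed seed are assumptions made for illustration; the real pipeline may load and sample its data differently.

```python
import random


def balanced_subset(good, bad, per_class=25000, seed=42):
    """Return a shuffled list of (fqdn, label) pairs with equal class counts.

    Labels are an assumption of this sketch: 0 = good, 1 = bad.
    """
    rng = random.Random(seed)                 # fixed seed for reproducibility
    n = min(per_class, len(good), len(bad))   # never ask for more than exists
    subset = [(fqdn, 0) for fqdn in rng.sample(good, n)]
    subset += [(fqdn, 1) for fqdn in rng.sample(bad, n)]
    rng.shuffle(subset)                       # avoid all-good-then-all-bad ordering
    return subset


# Usage with tiny made-up lists instead of millions of FQDNs
good = [f"good{i}.example" for i in range(100)]
bad = [f"bad{i}.example" for i in range(40)]
subset = balanced_subset(good, bad, per_class=30)
print(len(subset), sum(label for _, label in subset))
```

Keeping the two classes the same size means that a score such as accuracy or macro F1 is not dominated by whichever list happens to be larger.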

After that, I built a simple ensemble pipeline to find the most accurate method for training and inference.

I tested the most popular and easy-to-implement methods in this context like RandomForest, GradientBoosting, ExtraTrees, LogisticRegression and SVC:

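    from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Candidate models; a fixed random_state keeps runs reproducible
    classifiers = {
        "RandomForest": RandomForestClassifier(random_state=42),
        "GradientBoosting": GradientBoostingClassifier(random_state=42),
        "ExtraTrees": ExtraTreesClassifier(random_state=42),
        "LogisticRegression": LogisticRegression(random_state=42, max_iter=2000),
        "SVC": SVC(probability=True, random_state=42)
    }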

Let's describe all those methods one by one:

  1. RandomForest Classifier
  • Type: Ensemble Learning Method
  • Description: RandomForest is a type of ensemble learning method that constructs a multitude of decision trees during training. For classification tasks, it outputs the class that is the mode of the classes of individual trees.
    • Strengths:
      • Handles both numerical and categorical data well.
      • Robust to overfitting as it averages the results of many decision trees.
      • Good performance in a wide range of problems.
    • Weaknesses:
      • Can be less interpretable compared to a single decision tree.
      • Performance may degrade with very noisy data.
  2. GradientBoosting Classifier
  • Type: Ensemble Learning Method
  • Description: GradientBoosting builds an additive model in a forward stage-wise fashion: each stage adds a new tree that corrects the errors of the current ensemble. It generalizes other boosting methods by allowing optimization of an arbitrary differentiable loss function.
    • Strengths:
      • Often provides predictive accuracy that is hard to beat.
      • Lots of flexibility as it can optimize different loss functions and provides several hyperparameter tuning options.
    • Weaknesses:
      • Can overfit if the number of trees is too large.
      • Sensitive to noisy data and outliers.
      • Requires careful tuning of parameters and may take longer to train.
  3. ExtraTrees Classifier
  • Type: Ensemble Learning Method
  • Description: ExtraTrees (Extremely Randomized Trees) Classifier fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.
    • Strengths:
      • Tends to reduce variance a bit more than RandomForest, at the cost of slightly higher bias, by drawing random split thresholds for each candidate feature rather than searching for the best possible thresholds.
      • Typically faster to train than RandomForest.
    • Weaknesses:
      • Like RandomForest, can be less interpretable.
      • Might not perform well on data with strong linear relationships.
  4. Logistic Regression
  • Type: Regression-based Classifier
  • Description: Despite its name, Logistic Regression is used for binary classification problems. It models the probability of a default class (e.g., class labeled '1').
    • Strengths:
      • Simple, efficient, and easy to implement.
      • Performs well with linearly separable classes.
      • Outputs probabilities, which can be a useful feature.
    • Weaknesses:
      • Assumes a linear relationship between the features and the log-odds of the outcome.
      • Can struggle with complex relationships in data.
      • Vulnerable to overfitting if the data is high-dimensional.
  5. SVC (Support Vector Classifier)
  • Type: Kernel-based Classifier
  • Description: SVC is the classification member of the support vector machine (SVM) family, a powerful and versatile group of algorithms that also covers regression and outlier detection. SVC can perform linear or nonlinear classification and is one of the best out-of-the-box classifiers.
    • Strengths:
      • Effective in high-dimensional spaces.
      • Versatile as different kernel functions can be specified for the decision function.
    • Weaknesses:
      • Can be inefficient on large datasets.
      • Requires careful tuning of parameters and selection of the kernel.
      • The choice of kernel and regularization can have a large impact on the performance of the algorithm.
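The RandomForest description above mentions taking the mode of the individual tree predictions. The sketch below contrasts that hard vote with a soft vote that averages per-tree probabilities; scikit-learn's forest classifiers combine trees by averaging predicted probabilities rather than counting votes. The helper names `hard_vote` and `soft_vote` and the numbers are made up for illustration.

```python
from collections import Counter


def hard_vote(tree_predictions):
    """Return the most common class label among the trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def soft_vote(tree_probabilities):
    """Average each tree's probability of class 1; return (label, mean probability)."""
    mean_p = sum(tree_probabilities) / len(tree_probabilities)
    return (1 if mean_p >= 0.5 else 0), mean_p


# Three hypothetical trees scoring one FQDN: probability that it is bad (class 1)
probabilities = [0.9, 0.4, 0.45]
labels = [1 if p >= 0.5 else 0 for p in probabilities]  # each tree's own hard label

print(hard_vote(labels))         # two of three trees say "good"
print(soft_vote(probabilities))  # the averaged probability leans "bad"
```

The two rules can disagree, as they do here. The averaged probability is also what makes a graded score possible, since a pure majority vote only yields a class label.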

The best method is chosen by performing a randomized search (RandomizedSearchCV) instead of an exhaustive grid search:

        random_search = RandomizedSearchCV(clf, params[name], n_iter=20, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1, random_state=42)
        random_search.fit(X_res, y_res)

by using all of the following parameters (I'm running this project on a Dell R620 server with 48 cores, 128 GB of RAM and no GPU):

    from scipy.stats import randint as sp_randint, uniform

    # sp_randint(low, high) samples integers in [low, high)
    # uniform(loc, scale) samples floats in [loc, loc + scale]
    params = {
        "RandomForest": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "GradientBoosting": {'n_estimators': sp_randint(100, 300), 'learning_rate': uniform(0.01, 0.2), 'max_depth': sp_randint(3, 10)},
        "ExtraTrees": {'n_estimators': sp_randint(100, 500), 'max_depth': sp_randint(10, 50), 'min_samples_split': sp_randint(2, 11)},
        "LogisticRegression": {'C': uniform(0.01, 100), 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
        "SVC": {'C': uniform(0.1, 10), 'kernel': ['linear', 'rbf', 'poly']}
    }

to find the most suitable approach. I then focus on the selected approach to improve its accuracy.
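Putting the pieces together, the selection step can be sketched as a loop that runs one randomized search per classifier and keeps the best cross-validated macro F1. This is a scaled-down illustration rather than the actual project script: the data comes from `make_classification` as a synthetic stand-in for the real FQDN features, only two of the five classifiers are included, and the ranges, `n_iter`, `cv` and `n_jobs` values are reduced so it finishes in seconds.

```python
from scipy.stats import randint as sp_randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): the real feature matrix is not shown here
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

classifiers = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(random_state=42, max_iter=2000),
}
params = {
    "RandomForest": {"n_estimators": sp_randint(10, 50), "max_depth": sp_randint(2, 10)},
    "LogisticRegression": {"C": uniform(0.01, 100), "solver": ["lbfgs", "liblinear"]},
}

results = {}
for name, clf in classifiers.items():
    # n_iter candidates x cv folds = number of model fits per classifier
    search = RandomizedSearchCV(clf, params[name], n_iter=5, cv=3,
                                scoring="f1_macro", n_jobs=1, random_state=42)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Pick the classifier with the highest mean cross-validated macro F1
best_name = max(results, key=lambda n: results[n][0])
print(best_name, results[best_name])
```

With the settings used in this page (`n_iter=20`, `cv=5`), each classifier is fitted 100 times plus one final refit, which is where the 48 cores and `n_jobs=-1` pay off.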