-
Hi, I am trying to do classification on a TRUE/FALSE target. Before running on the data set, I clean it so that it contains only complete cases; however, when trying to optimise I get an error about missing values. Here is a sample of the code:

```r
d <- fread("datasets/training_data.csv")

# Filter training data: keep only the predictor columns
training_data <- d[, c(predictors_cols), with = FALSE]

xgb_learner <- lrn("classif.xgboost", eval_metric = "logloss")

traintask <- TaskClassif$new(id = "training_data",
                             backend = training_data[, -remove_col, with = FALSE],
                             target = target)

train_indexes <- list(train = c(1, 2, 3, 4, 5), test = c(6, 7, 8, 9, 10))
# Set parameter space
XGB_parameters <- ps(
  eta = p_dbl(default = 0.05, lower = 0.001, upper = 0.1),
  max_depth = p_int(default = 6L, lower = 3L, upper = 15L),
  nrounds = p_int(default = 50L, lower = 5L, upper = 100L),
  gamma = p_dbl(default = 7, lower = 4, upper = 17),
  colsample_bytree = p_dbl(default = 0.15, lower = 0.05, upper = 0.25),
  subsample = p_dbl(default = 0.15, lower = 0.01, upper = 0.25),
  min_child_weight = p_dbl(default = 1, lower = 0, upper = 3),
  booster = p_fct(levels = c("dart")),
  # Parameters specific to DART
  rate_drop = p_dbl(default = 0, lower = 0, upper = 1, tags = "train"),
  skip_drop = p_dbl(default = 0, lower = 0, upper = 1, tags = "train")
)
XGB_parameters$add_dep("skip_drop", "booster", CondEqual$new("dart"))
XGB_parameters$add_dep("rate_drop", "booster", CondEqual$new("dart"))
XGB_parameters
tuner <- tnr("random_search")
rc <- rsmp("custom")
# One custom iteration: train on the rows in train_indexes$train, test on train_indexes$test
rc$instantiate(traintask, list(train_indexes$train), list(train_indexes$test))
measure <- msr("classif.fbeta")
term_combo <- trm("combo",
  list(trm("evals", n_evals = 500),
       trm("perf_reached", level = 0.9)),
  any = TRUE)
# Instantiate tuning
instance <- TuningInstanceSingleCrit$new(
task = traintask,
learner = xgb_learner,
resampling = rc,
measure = measure,
search_space = XGB_parameters,
terminator = term_combo
)
tuner$optimize(instance)
```

The optimisation fails with:

```
Error in .__Archive__add_evals(self = self, private = private, super = super, :
  Assertion on 'ydt[, self$cols_y, with = FALSE]' failed: Contains missing values (column 'classif.fbeta', row 1).
```

For some reason, the error goes away when I change the metric to a different one.
-
The error states that the measure returned a missing value. Reproducible example:

```r
library(mlr3)
library(mlr3learners)

set.seed(2)

task = tsk("pima")
learner = lrn("classif.xgboost", eval_metric = "logloss")

indices = list(train = list(1, 2, 3, 4, 5), test = list(6, 7, 8, 9, 10))
rc = rsmp("custom")
rc$instantiate(task, indices$train, indices$test)

rr = resample(task, learner, rc)
rr$score(msr("classif.fbeta"))
```
-
The small custom resampling might be the cause.
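If plain cross-validation is an option instead of the custom split, one way to make single-class test sets much less likely is to stratify the resampling by the target. A minimal sketch, assuming the task from the question (`traintask`) and that stratified 5-fold CV is acceptable:

```r
# Assumption: ordinary CV is acceptable instead of the custom train/test split.
# Stratifying by the target keeps the class ratio roughly constant per fold.
traintask$col_roles$stratum <- traintask$target_names

rc_strat <- rsmp("cv", folds = 5)
rc_strat$instantiate(traintask)
```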
-
I think I found the source of the problem. I have a very imbalanced dataset, so some of the cross-validation resamples contain only TRUE or only FALSE labels. That means the measure cannot be calculated. Here is a reproducible example:

```r
library(mlr3)
library(mlr3learners)
library(data.table)

set.seed(2)

d <- data.table(x = rnorm(100),
                label = sample(c(TRUE, FALSE), size = 100, replace = TRUE,
                               prob = c(0.05, 0.95)))
d[, label := as.factor(label)]

traintask <- TaskClassif$new(id = "d", backend = d, target = "label")
learner <- lrn("classif.xgboost", eval_metric = "logloss")

indices <- list(train = list(1, 2, 3, 4, 5), test = list(6, 7, 8, 9, 10))
rc <- rsmp("custom")
rc$instantiate(traintask, indices$train, indices$test)

rr <- resample(traintask, learner, rc)
rr$score(msr("classif.fbeta"))
```

The output shows several NaN values.
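To double-check that single-class folds are really the culprit, here is a small sketch (reusing `traintask` and the instantiated `rc` from the example above) that lists the classes present in each test set:

```r
# Tabulate the true labels of every test set; folds whose table contains only
# one class are the ones for which classif.fbeta cannot be computed.
lapply(seq_len(rc$iters), function(i) table(traintask$truth(rc$test_set(i))))
```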
-
@be-marc Can you improve the error message?
-
Okay, I think I have been able to solve it, but I would like to hear whether you think it makes sense. What I have done is add the na_value argument:

```r
msr("classif.sensitivity", na_value = 0)
```

However, I am not sure whether I should use 0 or 1 as the NA substitute. At least the hyperparameter tuning is running normally now.
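Presumably the same pattern applies to the original measure from the question; a minimal sketch, under the assumption that scoring an undefined fold as 0 (the worst case) is acceptable here:

```r
# Assumption: substituting 0 for folds where F-beta is undefined is acceptable.
# The tuner then receives a finite value instead of NaN and can keep optimising.
measure <- msr("classif.fbeta", na_value = 0)
```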
-
Well, it depends. Here are some good reads on this topic:

There are (at least) two ways to tackle this problem. One of them is to micro-average the measure over the resampling iterations, i.e. pool the predictions of all iterations and compute the score once, so an individual single-class fold no longer produces a NaN on its own:

```r
msr("classif.f1", average = "micro")
```
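A minimal sketch of how this could be wired into the tuning setup from the question, assuming the objects defined there and that F-beta is still the target metric:

```r
# Assumption: traintask, xgb_learner, rc, XGB_parameters, term_combo and tuner
# are the objects from the original post; average = "micro" pools the
# predictions of all resampling iterations before the score is computed.
measure <- msr("classif.fbeta", average = "micro")

instance <- TuningInstanceSingleCrit$new(
  task = traintask,
  learner = xgb_learner,
  resampling = rc,
  measure = measure,
  search_space = XGB_parameters,
  terminator = term_combo
)
tuner$optimize(instance)
```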