Mice imputation with a small number of missing values - test/train set may have no missing values #526
Replies: 1 comment
-
I believe this repo, with simulations and a manuscript, will be of interest to you. It outlines the proper procedure for cross-validating imputed data, which involves fitting the imputation model on the training set and applying it to the test set. To ensure adequate missingness coverage, one may choose to a) induce some missingness in both sets independently, or b) resample such that missingness is always present. A third strategy, if there are only a few missing values, is to train and cross-validate without the incomplete cases and then impute and validate the incomplete cases to see whether the model-fitting procedure is sensitive to them. This methodology, however, has never been validated, and its multi-step nature may pose additional problems, especially if the missingness is deemed to influence model performance.
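If it helps, recent versions of mice (3.12 or later, if I recall correctly) expose an ignore argument that implements exactly this idea: the flagged rows are imputed using models fitted only on the remaining rows. A minimal sketch, assuming train and test data frames with identical columns:

```r
library(mice)

# Assumes `train` and `test` data frames already exist.
# Stack them and flag the test rows so they are imputed but do
# not contribute to fitting the imputation models.
full   <- rbind(train, test)
ignore <- c(rep(FALSE, nrow(train)), rep(TRUE, nrow(test)))

imp <- mice(full, m = 5, ignore = ignore, printFlag = FALSE, seed = 1)

# Completed test rows, taken here from the first imputation.
test_completed <- complete(imp, 1)[ignore, ]
```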
-
When performing resampling, I wish to perform the same imputation on the test set as I performed on the training set, which is accepted practice. So, when imputing with MICE, I generate a predictor matrix when imputing the training set and use the same predictor matrix when imputing the test set.
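For concreteness, my workflow looks roughly like this (a sketch only; quickpred() stands in here for however the predictor matrix is actually built):

```r
library(mice)

# Build the predictor matrix on the training set only.
pred <- quickpred(train)

# Impute the training set using that matrix.
imp_train <- mice(train, m = 5, predictorMatrix = pred,
                  printFlag = FALSE, seed = 1)

# Reuse the same predictor matrix to impute the test set.
imp_test <- mice(test, m = 5, predictorMatrix = pred,
                 printFlag = FALSE, seed = 1)
```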
This works fine if there are plenty of missing values. However, if the data set is small and/or has only a small number of missing values, the following situations can arise: the test set may have no missing values, the training set may have no missing values, or both sets may have missing values but not necessarily in the same columns.
The first case is easy to handle: just return the original test set, since it has no missing values.
In the second case, the mice predictor matrix created on the training set contains only zeroes, so it cannot be used to impute the test set without causing an error: "Error in edit.setup(data, setup, ...): mice detected constant and/or collinear variables. No predictors were left after their removal." (In fact there are no collinear variables; edit.setup checks for an all-zero predictor matrix and emits this message as a result.) So a new predictor matrix has to be created for the test set. Is this valid?
In the final case, this works most of the time, but it can cause problems if a particular column has missing values in only the training set or only the test set, not both. (A sketch of how I currently handle the three cases follows below.)
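Putting the three cases together, my current handling looks roughly like the sketch below (impute_test() is just an illustrative helper, and quickpred() again stands in for the actual predictor-matrix construction):

```r
library(mice)

# Hypothetical helper: impute the test set given a predictor
# matrix that was built on the training set.
impute_test <- function(test, pred_from_train) {
  # Case 1: the test set has no missing values -- nothing to impute.
  if (!anyNA(test)) {
    return(test)
  }

  # Case 2: the training set had no missing values, so the predictor
  # matrix built on it is all zeros; regenerate it on the test set
  # to avoid the edit.setup() error.
  if (all(pred_from_train == 0)) {
    pred_from_train <- quickpred(test)
  }

  # Case 3: impute the test set with the (possibly regenerated)
  # predictor matrix. This can still fail if a column has missing
  # values in only one of the two sets.
  imp <- mice(test, m = 5, predictorMatrix = pred_from_train,
              printFlag = FALSE, seed = 1)
  complete(imp, 1)
}
```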
Is there a better way to do this? Should I just regenerate the predictor matrix on the test set every time?
(Also posted on Cross-Validated - wasn't sure which was the more appropriate platform).