Mice imputation with a small number of missing values - test/train set may have no missing values #526
Replies: 1 comment
-
I believe this repo, with simulations and a manuscript, will be of interest to you. It outlines the proper procedure for cross-validating imputed data, which involves fitting the imputation model on the training set and applying it to the test set. To ensure adequate missingness coverage, one may choose to a) induce some missingness in both sets independently, or b) resample such that missingness is always present. A third strategy, if there are only a few missing values, is to train and cross-validate without the incomplete cases and then impute and validate the incomplete cases to see whether the model-fitting procedure is sensitive to them. This methodology, however, has never been validated, and its multi-step nature may pose additional problems, especially if the missingness is deemed to influence model performance.
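If it helps, recent versions of mice (3.12 or later, if I recall correctly) expose an ignore argument that implements exactly this idea: the flagged rows are imputed using models fitted only on the remaining rows. A minimal sketch, assuming train and test data frames with identical columns:

```r
library(mice)

# Assumes `train` and `test` data frames already exist.
# Stack them and flag the test rows so they are imputed but do
# not contribute to fitting the imputation models.
full   <- rbind(train, test)
ignore <- c(rep(FALSE, nrow(train)), rep(TRUE, nrow(test)))

imp <- mice(full, m = 5, ignore = ignore, printFlag = FALSE, seed = 1)

# Completed test rows, taken here from the first imputation.
test_completed <- complete(imp, 1)[ignore, ]
```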
-
When performing resampling, I wish to perform the same imputation on the test set as I performed on the training set, which is accepted practice. So, when imputing with MICE, I generate a predictor matrix when imputing the training set and use the same predictor matrix when imputing the test set.
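For concreteness, my workflow looks roughly like this (a sketch only; quickpred() stands in here for however the predictor matrix is actually built):

```r
library(mice)

# Build the predictor matrix on the training set only.
pred <- quickpred(train)

# Impute the training set using that matrix.
imp_train <- mice(train, m = 5, predictorMatrix = pred,
                  printFlag = FALSE, seed = 1)

# Reuse the same predictor matrix to impute the test set.
imp_test <- mice(test, m = 5, predictorMatrix = pred,
                 printFlag = FALSE, seed = 1)
```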
This works fine if there are plenty of missing values. However, if the data set is small and/or has only a small number of missing values, the following situations can arise: the test set may have no missing values, the training set may have no missing values, or both sets may have missing values but not necessarily in the same columns.
The first case is easy to handle: just return the original test set, since it has no missing values.
In the second case, the mice predictor matrix created on the training set contains only zeroes, so it cannot be used to impute the test set without causing an error: "Error in edit.setup(data, setup, ...): mice detected constant and/or collinear variables. No predictors were left after their removal." (In fact there are no collinear variables; edit.setup checks for an all-zero predictor matrix and emits this message as a result.) So a new predictor matrix has to be created for the test set. Is this valid?
In the final case, this works most of the time, but it can cause problems if a particular column has missing values in only the training set or only the test set, not both. (A sketch of how I currently handle the three cases follows below.)
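Putting the three cases together, my current handling looks roughly like the sketch below (impute_test() is just an illustrative helper, and quickpred() again stands in for the actual predictor-matrix construction):

```r
library(mice)

# Hypothetical helper: impute the test set given a predictor
# matrix that was built on the training set.
impute_test <- function(test, pred_from_train) {
  # Case 1: the test set has no missing values -- nothing to impute.
  if (!anyNA(test)) {
    return(test)
  }

  # Case 2: the training set had no missing values, so the predictor
  # matrix built on it is all zeros; regenerate it on the test set
  # to avoid the edit.setup() error.
  if (all(pred_from_train == 0)) {
    pred_from_train <- quickpred(test)
  }

  # Case 3: impute the test set with the (possibly regenerated)
  # predictor matrix. This can still fail if a column has missing
  # values in only one of the two sets.
  imp <- mice(test, m = 5, predictorMatrix = pred_from_train,
              printFlag = FALSE, seed = 1)
  complete(imp, 1)
}
```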
Is there a better way to do this? Should I just regenerate the predictor matrix on the test set every time?
(Also posted on Cross-Validated - wasn't sure which was the more appropriate platform).