-
Dear Martin, thanks for your question. As you've found out, there is no standard facility in `mice` for applying a fitted imputation model to new data. It can probably be implemented technically, and the results should give valid inferences, of course assuming that the relations in the fitted and the new data are the same. The alternative is to stack the old and new data, and re-impute the stacked data. The first analysis is likely to be less efficient than the second one because it ignores the information in the new data when estimating the imputation model. Fixing parameters in the imputation model is easily done by writing your own imputation function. Having said this, there will be scenarios where it can be useful to store and re-use the imputation model for new data. One such scenario is simply to speed up the algorithm by saving the values over the iterations (the only memory in the MICE algorithm is the data). Another might be production environments where the imputation model is estimated from old data, and needs to be controlled and fixed for any new data. Also, the stacked data may simply become too large. It is technically non-trivial to store and replay the imputation process. Model fitting is only one of the steps, so it is not enough to just store the fitted model object. We also need to codify the procedure that uses the estimated model parameters to calculate and/or draw the imputations. Perhaps the … Stef.
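A minimal sketch of the "stack and re-impute" alternative described above; the built-in `nhanes` data and the 20/5 split are assumptions for illustration:

```r
library(mice)

# stand-ins for the old (fitted) and new data
old <- nhanes[1:20, ]
new <- nhanes[21:25, ]

# stack the two sets and re-impute everything in one run
imp <- mice(rbind(old, new), m = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp, 1)  # first completed data set, old and new rows
```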
-
Hello! I too am looking for a way to use mice on new data, in the context of wanting to develop the imputation model on a training set and then apply it to an unseen test set, to allow for fair comparisons with models which do not require imputation but are trained and tested on the same split dataset. In your answer above, you say that this can 'technically be implemented'. Do you mean with the existing mice package, or with additional code? If the former, could you give example code showing how to apply the fitted imputation model to new data? Thanks very much in advance, Andrew
-
@ajsteele
-
I've been comparing different kinds of survival model with different methods of dealing with missing values (imputation, 'missing indicator', discretisation) on medical records data. My concern is that imputing on the full dataset could allow information to 'leak' between the training and test sets, giving an unfair advantage to models using imputation… but there's no direct way to test this! Thanks for the offer, and there's no particular hurry… this project is winding down a bit now anyway. But I think the tools could be of use to future researchers with similar problems? :)
-
Hi,
-
**How to export and re-use the imputation model?**

I have done a little thinking about what we can do with the current objects that are produced in MICE. The idea as formulated by Martin is that the imputation model should be fixed at the last values. MICE only stores the imputed data, together with the model specification. All intermediate modelling coefficients of how to arrive at the imputations are discarded. Suppose we wish to fix the following aspects of the imputation model:

…
I believe that it is possible to meet all four requirements exactly in MICE. We could be tempted to save all regression coefficients from the latest iteration, but that does not help. The problem is that the coefficients of the current model are invalidated the moment one of the predictors is re-imputed, which occurs almost immediately as we move on to impute the next variable. Rather, in order to recreate the model used at iteration t, we would need to know the state of the imputations at iteration t-1. Given these, all coefficients and imputations can be recalculated exactly. So, in order to apply a given imputation model to new data, all we need to do is store the imputations from the previous iteration. Of course, it is much easier to use the current imputations, and define the "last iteration" as the "first future iteration", which is equally as good as, if not better than, the last iteration. Thus, everything we need is already "exported" in the current `mids` object. Recreating the next iteration can be done by the `mice.mids()` function. The procedure to achieve this is as follows:

…
As described, we need two streams of random generators if we want exact correspondence (which is useful for testing), but in practice that may not be worth the trouble. We obtain a new `mids` object. I believe this works if …
Would such a facility be useful for your application? Stef.
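As a concrete illustration of the "first future iteration" idea sketched above, a minimal example; the built-in `nhanes` data is an assumption:

```r
library(mice)

# run the usual chained-equations algorithm
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1, printFlag = FALSE)

# "first future iteration": continue the chains for one more step; the
# imputation models in this step are estimated from the imputations
# stored in the mids object, i.e. from the state of the last iteration
imp1 <- mice.mids(imp, maxit = 1, printFlag = FALSE)
```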
-
Hi, I second the request for the ability to project the imputation model from a given dataset onto a new dataset with the same variables, precisely for the purpose of partitioning into train and test sets. @stefvanbuuren your suggestion sounds like it should do the trick. Many thanks and kudos for the great package! Iyar
-
I would like to do the same: I have some datasets where I have only some missing values in a single column, and others where that single column is completely missing (no predictors are missing, though). Is "old data" the same as "training data"? What are the "two streams of random numbers" exactly? Is it possible to get an example? Maybe with …
-
Wondering if any examples/vignettes are available for conducting the procedure that @stefvanbuuren describes.
-
Sorry, no examples yet.
-
I have given this a try, (roughly) following the strategy outlined by Stef. Code to get imputations for previously unseen test data can be found here, with a minimal example here. The function creates observations for the test set in the following way:

…
On the … There is a good chance that I have disregarded some intricacies of mice, so please do let me know if something in my approach does not make sense or is obviously wrong.
-
@prockenschaub Thank you so much, mate! I had the same problem and had been searching for a week for a way to use mice for test data using imputed training data. Stef van Buuren kindly advised here and you made the neat function. Thanks from the bottom of my heart, you saved me weeks' worth of precious time.
-
Would the work completed by @prockenschaub be a feature within the scope of a direct addition to `mice`?
-
I am pretty amazed (and shocked to a certain degree) that this is not an implemented functionality of mice, especially since this issue was raised over 3 years ago.
-
Agree that it would be useful. Nothing prevents you from contributing a pull request that would add it to `mice`.
-
Nothing should prevent my function from working with mixed data. My function is mostly a wrapper around the in-built functionality of `mice`. @prediction2020 if you run into any problems with mixed data, let me know and I can have a look at where they might originate. As for …
-
@prockenschaub
-
@prockenschaub I looked into your code. It contains many useful elements and goes a long way, but it does not fully implement the algorithm that I suggested above. Your code tweaks the … What is needed is to change … I have now implemented an experimental solution in a separate branch that adds a new `ignore` argument to `mice()`. Things are not yet perfect, and there are still a couple of things to consider:

…
I welcome any feedback on this approach. Would this work for your use cases?
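A sketch of how the new argument would be called, with usage inferred from the examples later in this thread; the `nhanes` data and the 20/5 split are assumptions, and a mice version containing the `ignore` argument is required:

```r
library(mice)

# train/test split; ignored rows are imputed, but they are excluded
# when the imputation models themselves are estimated, so no
# information leaks from the test set into the training fit
train <- nhanes[1:20, ]
test  <- nhanes[21:25, ]
stacked <- rbind(train, test)
ign <- rep(c(FALSE, TRUE), c(nrow(train), nrow(test)))

imp <- mice(stacked, ignore = ign, m = 5, seed = 1, printFlag = FALSE)
test_completed <- complete(imp, 1)[ign, ]  # completed test rows only
```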
-
Thanks @stefvanbuuren, this is really helpful! Yes, I think this would cover my personal use cases.
The difficulty that you describe in getting a hold of … I like your solution a lot; it ended up significantly simpler than I anticipated. I will have a thorough test run through your code over the weekend. With regards to the issues that you mention:
I tend to agree, with a view to my immediate use case, that returning …
Am I right in my interpretation that one could plot the convergence of the entire data (train + test) but one could not get means / sdevs for the test set only?
I will try to create some examples and tests for this over the weekend.
I suppose this is because …
It wouldn't surprise me, but I think the work so far is a great start :)
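On the convergence question: the standard trace plot of a `mids` object pools all imputed cells of a variable, so with the `ignore` approach the train and test rows would indeed be summarised together. A minimal sketch, assuming the built-in `nhanes` data:

```r
library(mice)
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1, printFlag = FALSE)
plot(imp)  # chain means and variances of the imputed values, per variable
```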
-
Additional point:
-
How could we use the function `mice.reuse()` provided by @prockenschaub with only one observation?
-
I guess you ran into an error? Constant and collinearity removal is turned on by default in `mice`. Replace

…

with

…

and I think that should do the trick. I have also updated my code to include this.
-
Yes it did the trick! Many thanks!!!!
-
Commit 46171f9 merges the work in the experimental branch described above.
-
For reference, here is the stock `mice.impute.norm()`:

```r
mice.impute.norm <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```

and a variant that dumps the drawn regression coefficients into a global `betadump` vector:

```r
mice.impute.normdump <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  betadump <<- c(betadump, parm$beta)  # append this draw's coefficients
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```

Closing now. Thanks all for bringing up and discussing the issue. It makes up for a better `mice`.
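A hedged usage sketch of the dump trick above: when `method = "normdump"` is requested, `mice()` picks up the user-defined `mice.impute.normdump()` from the global environment (the `nhanes` example data is an assumption):

```r
library(mice)

betadump <- c()  # global store that mice.impute.normdump() appends to
imp <- mice(nhanes, method = "normdump", m = 1, maxit = 2,
            seed = 1, printFlag = FALSE)

# betadump now holds the drawn coefficient vectors, concatenated in the
# order in which the incomplete variables were visited
length(betadump)
```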
-
Hey @prockenschaub! I am using mice.reuse and I get this warning: `invalid factor level, NA generated`. The problem seems to be at `mids.append`, in the loop:

```r
for (i in names(x$imp)) {
  # …
}
```

However, I really can't understand why. The train and test sets have exactly the same variable names, and all predictors have the same factor levels. Any ideas? Many thanks!
-
@EviVal The error originates from my less than ideal set-up in my first code. I use …

**Reproducible example**

```r
library(mice)
library(tidyverse)
# Make sure to store `mice.reuse.R` in the same directory or change path
source("mice.reuse.R")
set.seed(42)
data <- data.frame(
x = rnorm(100),
z = factor(rep(c("a", "b", "c", "d"), each = 25))
)
data$z[runif(100) < 0.2] <- NA
data[100, "z"] <- NA # set the last row definitely to missing
imp.train <- mice(data[1:99, ], maxit = 5, m = 2, seed = 1)
#>
#> iter imp variable
#> 1 1 z
#> 1 2 z
#> 2 1 z
#> 2 2 z
#> 3 1 z
#> 3 2 z
#> 4 1 z
#> 4 2 z
#> 5 1 z
#> 5 2 z
imp.train
#> Class: mids
#> Number of multiple imputations: 2
#> Imputation methods:
#> x z
#> "" "polyreg"
#> PredictorMatrix:
#> x z
#> x 0 1
#> z 1 0
imp.test <- mice.reuse(imp.train, data[100, ], maxit = 1)
#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated
#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated
#>
#> iter imp variable
#> 6 1 z
#> 6 2 z
```

**Solution**

Do not use my code but instead use the `ignore` argument that is now built into `mice()`:

```r
imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
#>
#> iter imp variable
#> 1 1 z
#> 1 2 z
#> 2 1 z
#> 2 2 z
#> 3 1 z
#> 3 2 z
#> 4 1 z
#> 4 2 z
#> 5 1 z
#> 5 2 z
```
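A possible follow-up to pull the completed held-out row out of the run above (a sketch):

```r
# completed values for the ignored row, taken from the first imputation
complete(imp.ignore, 1)[100, ]
```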
-
Hi! Sorry if it's the wrong place to ask … thanks
-
Thanks for reminding me, though. I urgently need to write a short vignette on using the new `ignore` argument.

Edit (22/03/2021): @AmeBol kindly pointed out that I mixed up the role of …
-
See #336 for a demo script.
-
Dear Stef (et al),
this is not a bug report, but a public "request" for advice.
Context: We use mice on a medium-sized data set of Swiss meteo and bio data (several locations, species, etc.) and mainly need to impute one Y variable (which, however, is also used in lagged form as a predictor) in a linear regression model. Imputations work fine (using "pmm" and default least-squares regression), though a perfect model would take into account that the errors seem to be more heavy-tailed than the Gaussian, and in an ideal world we would use robust regression (e.g., as in robustbase::lmrob()).
To assess the imputations we would like to compare the empirical distribution of the several imputed values with a hypothesized Gaussian with "known" $(\mu, \sigma) = (x'\beta, \sigma)$, and hence would want to find $(\beta, \sigma)$ from the regression model that was used in mice (but possibly fitting $\beta, \sigma$ using different data, e.g., in a missingness simulation, fitting them to the full (non-missing) data).
Our problem is that the mice.impute.*() functions which mice() works with all do not keep the parameters of the models used, but only return the predicted values. That is perfect for what they are designed to do, but it leaves us without a clue about how the final model looked.
What do you propose?
I assume others have had related wishes in the past, and there already is a perfect solution?
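One possible direction, extending the `normdump` trick shown earlier in this thread so that both the coefficients and the error standard deviation of each draw are kept. The `mice.impute.normdump2` name and the `paramdump` store are hypothetical, not part of mice:

```r
library(mice)

paramdump <- list()  # global store for the drawn model parameters

# hypothetical variant of mice.impute.norm() that records beta and sigma,
# so the hypothesized Gaussian (x' beta, sigma) can be reconstructed later
mice.impute.normdump2 <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  paramdump[[length(paramdump) + 1]] <<- list(beta = parm$beta,
                                              sigma = parm$sigma)
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```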