-
Dear Martin, thanks for your question. As you've found out, there is no standard facility in `mice` for applying a fitted imputation model to new data. It can probably be implemented technically, and the results should give valid inferences, of course assuming that the relations in the fitted and the new data are the same. The alternative is to stack the old and new data, and re-impute the stacked data. The first analysis is likely to be less efficient than the second one because it ignores the information in the new data when estimating the imputation model. Fixing parameters in the imputation model is easily done by writing your own imputation function. Having said this, there will be scenarios where it can be useful to store and re-use the imputation model for new data. One such scenario is simply to speed up the algorithm by saving the values over the iterations (the only memory in the MICE algorithm is the data). Another might be production environments where the imputation model is estimated from old data, and needs to be controlled and fixed for any new data. Also, the stacked data may simply become too large. It is technically non-trivial to store and replay the imputation process. Model fitting is only one of the steps, so it is not enough to just store the fitted model object. We also need to codify the procedure that uses the estimated model parameters to calculate and/or draw the imputations. Perhaps the … Stef.
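A minimal sketch of the "stack and re-impute" alternative described above; the built-in `nhanes` data and the 20/5 split are assumptions for illustration:

```r
library(mice)

# stand-ins for the old (fitted) and new data
old <- nhanes[1:20, ]
new <- nhanes[21:25, ]

# stack the two sets and re-impute everything in one run
imp <- mice(rbind(old, new), m = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp, 1)  # first completed data set, old and new rows
```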
-
Hello! I too am looking for a way to use mice on new data, in the context of wanting to develop the imputation model on a training set and then apply it to an unseen test set, to allow for fair comparisons with models which do not require imputation but are trained and tested on the same split dataset. In your answer above, you say that this can 'technically be implemented'. Do you mean with the existing mice package, or with additional code? If the former, could you give example code showing how to apply the fitted imputation model to new data? Thanks very much in advance, Andrew
-
@ajsteele
-
I've been comparing different kinds of survival model with different methods of dealing with missing values (imputation, 'missing indicator', discretisation) on medical records data. My concern is that imputing on the full dataset could allow information to 'leak' between the training and test sets, giving an unfair advantage to models using imputation… but there's no direct way to test this! Thanks for the offer, and there's no particular hurry… this project is winding down a bit now anyway. But I think the tools could be of use to future researchers with similar problems? :)
-
Hi,
-
**How to export and re-use the imputation model?**

I have done a little thinking about what we can do with the current objects that are produced in MICE. The idea as formulated by Martin is that the imputation model should be fixed at the last values. MICE only stores the imputed data, together with the model specification. All intermediate modelling coefficients of how to arrive at the imputations are discarded. Suppose we wish to fix the following aspects of the imputation model:

…
I believe that it is possible to meet all four requirements exactly in MICE. We could be tempted to save all regression coefficients from the latest iteration, but that does not help. The problem is that the coefficients of the current model are invalidated the moment one of the predictors is re-imputed, which occurs almost immediately as we move on to impute the next variable. Rather, in order to recreate the model used at iteration t, we would need to know the state of the imputations at iteration t-1. Given these, all coefficients and imputations can be recalculated exactly. So, in order to apply a given imputation model to new data, all we need to do is store the imputations from the previous iteration. Of course, it is much easier to use the current imputations, and define the "last iteration" as the "first future iteration", which is equally as good as, if not better than, the last iteration. Thus, everything we need is already "exported" in the current `mids` object. Recreating the next iteration can be done by the `mice.mids()` function. The procedure to achieve this is as follows:

…
As described, we need two streams of random generators if we want exact correspondence (which is useful for testing), but in practice that may not be worth the trouble. We obtain a new `mids` object. I believe this works if …
Would such a facility be useful for your application? Stef.
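As a concrete illustration of the "first future iteration" idea sketched above, a minimal example; the built-in `nhanes` data is an assumption:

```r
library(mice)

# run the usual chained-equations algorithm
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1, printFlag = FALSE)

# "first future iteration": continue the chains for one more step; the
# imputation models in this step are estimated from the imputations
# stored in the mids object, i.e. from the state of the last iteration
imp1 <- mice.mids(imp, maxit = 1, printFlag = FALSE)
```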
-
Hi, I second the request for the ability to project the imputation model from a given dataset onto a new dataset with the same variables, precisely for the purpose of partitioning into train and test sets. @stefvanbuuren your suggestion sounds like it should do the trick. Many thanks and kudos for the great package! Iyar
-
I would like to do the same: I have some datasets where I have only some missing values in a single column, and others where that single column is completely missing (no predictors are missing, though). Is "old data" the same as "training data"? What are the "two streams of random numbers" exactly? Is it possible to get an example? Maybe with …
-
Wondering if any examples/vignettes are available for conducting the procedure that @stefvanbuuren describes.
-
Sorry, no examples yet.
-
I have given this a try, (roughly) following the strategy outlined by Stef. Code to get imputations for previously unseen test data can be found here, with a minimal example here. The function creates observations for the test set in the following way:

…
On the … There is a good chance that I have disregarded some intricacies of mice, so please do let me know if something in my approach does not make sense or is obviously wrong.
-
@prockenschaub Thank you so much, mate! I had the same problem and had been searching for a week for a way to use mice for test data using imputed training data. Stef van Buuren kindly advised here and you made the neat function. Thanks from the bottom of my heart, you saved me weeks' worth of precious time.
-
Would the work completed by @prockenschaub be a feature within the scope of a direct addition to `mice`?
-
I am pretty amazed (and shocked to a certain degree) that this is not an implemented functionality of mice, especially since this issue was raised over 3 years ago.
-
Agree that it would be useful. Nothing prevents you from contributing a pull request that would add it to `mice`.
-
Nothing should prevent my function from working with mixed data. My function is mostly a wrapper around the in-built functionality of `mice`. @prediction2020 if you run into any problems with mixed data, let me know and I can have a look at where they might originate. As for …
-
@prockenschaub
-
@prockenschaub I looked into your code. It contains many useful elements and goes a long way, but it does not fully implement the algorithm that I suggested above. Your code tweaks the … What is needed is to change … I have now implemented an experimental solution in a separate branch that adds a new `ignore` argument to `mice()`. Things are not yet perfect, and there are still a couple of things to consider:

…
I welcome any feedback on this approach. Would this work for your use cases?
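A sketch of how the new argument would be called, with usage inferred from the examples later in this thread; the `nhanes` data and the 20/5 split are assumptions, and a mice version containing the `ignore` argument is required:

```r
library(mice)

# train/test split; ignored rows are imputed, but they are excluded
# when the imputation models themselves are estimated, so no
# information leaks from the test set into the training fit
train <- nhanes[1:20, ]
test  <- nhanes[21:25, ]
stacked <- rbind(train, test)
ign <- rep(c(FALSE, TRUE), c(nrow(train), nrow(test)))

imp <- mice(stacked, ignore = ign, m = 5, seed = 1, printFlag = FALSE)
test_completed <- complete(imp, 1)[ign, ]  # completed test rows only
```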
-
Thanks @stefvanbuuren, this is really helpful! Yes, I think this would cover my personal use cases.
The difficulty that you describe in getting a hold of … I like your solution a lot; it ended up significantly simpler than I anticipated. I will have a thorough test run through your code over the weekend. With regards to the issues that you mention:
I tend to agree, with a view to my immediate use case, that returning …
Am I right in my interpretation that one could plot the convergence of the entire data (train + test) but one could not get means / sdevs for the test set only?
I will try to create some examples and tests for this over the weekend.
I suppose this is because …
It wouldn't surprise me, but I think the work so far is a great start :)
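On the convergence question: the standard trace plot of a `mids` object pools all imputed cells of a variable, so with the `ignore` approach the train and test rows would indeed be summarised together. A minimal sketch, assuming the built-in `nhanes` data:

```r
library(mice)
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1, printFlag = FALSE)
plot(imp)  # chain means and variances of the imputed values, per variable
```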
-
Additional point:
-
How could we use the function `mice.reuse()` provided by @prockenschaub with only one observation?
-
I guess you ran into an error? Constant and collinearity removal is turned on by default in `mice`. Replace

…

with

…

and I think that should do the trick. I have also updated my code to include this.
-
Yes it did the trick! Many thanks!!!!
-
Commit 46171f9 merges the work in the experimental branch described above.
-
For reference, here is the stock `mice.impute.norm()`:

```r
mice.impute.norm <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```

and a variant that dumps the drawn regression coefficients into a global `betadump` vector:

```r
mice.impute.normdump <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  betadump <<- c(betadump, parm$beta)  # append this draw's coefficients
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```

Closing now. Thanks all for bringing up and discussing the issue. It makes up for a better `mice`.
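A hedged usage sketch of the dump trick above: when `method = "normdump"` is requested, `mice()` picks up the user-defined `mice.impute.normdump()` from the global environment (the `nhanes` example data is an assumption):

```r
library(mice)

betadump <- c()  # global store that mice.impute.normdump() appends to
imp <- mice(nhanes, method = "normdump", m = 1, maxit = 2,
            seed = 1, printFlag = FALSE)

# betadump now holds the drawn coefficient vectors, concatenated in the
# order in which the incomplete variables were visited
length(betadump)
```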
-
Hey @prockenschaub! I am using mice.reuse and I get this warning: `invalid factor level, NA generated`. The problem seems to be at `mids.append`, in the loop:

```r
for (i in names(x$imp)) {
  # …
}
```

However, I really can't understand why. The train and test sets have exactly the same variable names, and all predictors have the same factor levels. Any ideas? Many thanks!
-
@EviVal The error originates from my less than ideal set-up in my first code. I use …

**Reproducible example**

```r
library(mice)
library(tidyverse)
# Make sure to store `mice.reuse.R` in the same directory or change path
source("mice.reuse.R")
set.seed(42)
data <- data.frame(
x = rnorm(100),
z = factor(rep(c("a", "b", "c", "d"), each = 25))
)
data$z[runif(100) < 0.2] <- NA
data[100, "z"] <- NA # set the last row definitely to missing
imp.train <- mice(data[1:99, ], maxit = 5, m = 2, seed = 1)
#>
#> iter imp variable
#> 1 1 z
#> 1 2 z
#> 2 1 z
#> 2 2 z
#> 3 1 z
#> 3 2 z
#> 4 1 z
#> 4 2 z
#> 5 1 z
#> 5 2 z
imp.train
#> Class: mids
#> Number of multiple imputations: 2
#> Imputation methods:
#> x z
#> "" "polyreg"
#> PredictorMatrix:
#> x z
#> x 0 1
#> z 1 0
imp.test <- mice.reuse(imp.train, data[100, ], maxit = 1)
#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated
#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated
#>
#> iter imp variable
#> 6 1 z
#> 6 2 z
```

**Solution**

Do not use my code but instead use the `ignore` argument that is now built into `mice()`:

```r
imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
#>
#> iter imp variable
#> 1 1 z
#> 1 2 z
#> 2 1 z
#> 2 2 z
#> 3 1 z
#> 3 2 z
#> 4 1 z
#> 4 2 z
#> 5 1 z
#> 5 2 z
```
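A possible follow-up to pull the completed held-out row out of the run above (a sketch):

```r
# completed values for the ignored row, taken from the first imputation
complete(imp.ignore, 1)[100, ]
```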
-
Hi! Sorry if it's the wrong place to ask … thanks
-
Thanks for reminding me, though. I urgently need to write a short vignette on using the new `ignore` argument.

Edit (22/03/2021): @AmeBol kindly pointed out that I mixed up the role of …
-
See #336 for a demo script.
-
Dear Stef (et al),
this is not a bug report, but a public "request" for advice.
Context: We use mice on a medium-sized data set of Swiss meteo and bio data (several locations, species, etc.) and mainly need to impute one Y variable (which, however, is also used in lagged form as a predictor) in a linear regression model. Imputations work fine (using "pmm" and default least-squares regression), though a perfect model would take into account that the errors seem to be more heavy-tailed than the Gaussian, and in an ideal world we would use robust regression (e.g., as in robustbase::lmrob()).
To assess the imputations we would like to compare the empirical distribution of the several imputed values with a hypothesized Gaussian with "known" $(\mu, \sigma) = (x'\beta, \sigma)$, and hence would want to find $(\beta, \sigma)$ from the regression model that was used in mice (but possibly fitting $\beta, \sigma$ using different data, e.g., in a missingness simulation, fitting them to the full (non-missing) data).
Our problem is that the mice.impute.*() functions which mice() works with all do not keep the parameters of the models used, but only return the predicted values. That is perfect for what they are designed to do, but it leaves us without a clue about how the final model looked.
What do you propose?
I assume others have had related wishes in the past, and there already is a perfect solution?
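One possible direction, extending the `normdump` trick shown earlier in this thread so that both the coefficients and the error standard deviation of each draw are kept. The `mice.impute.normdump2` name and the `paramdump` store are hypothetical, not part of mice:

```r
library(mice)

paramdump <- list()  # global store for the drawn model parameters

# hypothetical variant of mice.impute.norm() that records beta and sigma,
# so the hypothesized Gaussian (x' beta, sigma) can be reconstructed later
mice.impute.normdump2 <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  paramdump[[length(paramdump) + 1]] <<- list(beta = parm$beta,
                                              sigma = parm$sigma)
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
```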