NAs left in mice output without any loggedEvents #349

LukasWallrich · 2020-09-15T12:02:24Z

I am trying to impute missing values in a dataset and am left with a rather large share of missing values. The previous issues and Stack Overflow questions I found were all either related to logged events or could be fixed with remove_collinear = FALSE. In my case, I get no logged events and remove_collinear has no effect, so I am stuck and think there might be a bug in mice, at least with regard to the absence of loggedEvents?

Given that I am trying to impute categorical variables with many categories, I can't make a very small reproducible example. However, this dataset with 500 lines works: https://drive.google.com/file/d/1n_U-BYBU-nJVar2D_5FkeOJf6ARIKrbK/view?usp=sharing. Each line has at least 3 non-missing values, yet in the output, I get NAs in each variable that I am trying to impute.

I'd be very grateful for any suggestions regarding how to get a complete imputed dataset.

method <- character(0)
method["age"] <- "pmm"
method["gender"] <- "polyreg"
method["education"] <- "polr"
method["ethnicity"] <- "polyreg"
method["politicalid_7"] <- "pmm"
method["test"] <- ""
method["IAT_score"] <- ""
method["att_7"] <-  ""
method["t_diff"] <- ""
method["religion"] <- "polyreg"
method["religionid"] <- "pmm"

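library(mice)
library(magrittr)  # provides the %>% pipe used below

input <- readr::read_rds("mice_input.RDS")
imputed <- input %>% mice::mice(m = 1, method = method, seed = 270491, remove_collinear = FALSE)
compl <- mice::complete(imputed)
# Count the NAs left in each column after imputation
compl %>% {colSums(is.na(.))}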

LukasWallrich · 2020-09-15T12:46:54Z

Author

After some more playing around, the issue seems to come from the fact that I did not impute the auxiliary variables. If I add methods for them, the missing data disappear, but the running time doubles (which is an issue, given that my data has more than 5 million rows and the first imputation attempt took 4 days to run). Is there any alternative that just uses the available information in the auxiliary variables? If there is not, it might be worth clarifying that in the documentation.

gerkovink · 2020-09-15T13:22:05Z

Maintainer

The mice algorithm iteratively imputes each incomplete column in the data based on a set of predictors in the data. By default, every column serves as a predictor for every other column. If you do not wish to impute the auxiliary variables for some reason, then these variables cannot serve as predictors for those cases in which they are unobserved. You may change the role columns take in the iterative process by changing the predictorMatrix. Setting a more efficient predictorMatrix may be a solution to the lengthy runtimes you experience. See e.g. the mice vignettes for examples on how to do this.
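
To make the predictorMatrix advice concrete, here is a minimal sketch. It uses the small `nhanes` dataset that ships with mice as a stand-in for the data in this thread, and both the entry that is switched off and the `mincor = 0.3` threshold are arbitrary choices for illustration, not recommendations.

```r
library(mice)

# Default matrix: a 1 in row i, column j means column j is used
# as a predictor when imputing the variable in row i
pred <- make.predictorMatrix(nhanes)

# Hand-edit: stop hyp from being used as a predictor when imputing bmi
pred["bmi", "hyp"] <- 0

# Automatic alternative: keep only predictors whose absolute correlation
# with the target (or with its response indicator) is at least mincor
pred_quick <- quickpred(nhanes, mincor = 0.3)
print(pred_quick)

# Run mice with the hand-edited matrix and count the NAs that are left
imp <- mice(nhanes, m = 1, predictorMatrix = pred, seed = 1, printFlag = FALSE)
colSums(is.na(complete(imp)))
```

Fewer predictors per variable means smaller imputation models, which is where any runtime saving would come from; whether a given reduction is defensible depends on the data.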

All the best,

Gerko

LukasWallrich · 2020-09-15T13:32:49Z

Author

Thanks, Gerko. I was thinking along those lines and probably just need to impute all variables, but still don't quite get it. I have 50% missing data in t_diff, an auxiliary variable which is used to predict every other variable. However, in the imputed variables, I have at most 10% NAs. Why does the imputation work in 4 out of 5 rows and fail in the others?

I am considering setting a more efficient predictorMatrix, primarily by dropping categorical variables with many levels as predictors, but find it hard to come up with a theoretical rationale for that ... will have another look through the vignettes and through published articles in my field though.

stefvanbuuren · 2020-09-15T15:44:55Z

Maintainer

Related #263

prockenschaub · 2020-09-16T17:56:48Z

@LukasWallrich I think the reason you don't see 50% missingness in your imputed variables is that values stay missing only where there was missingness to begin with (e.g. 16 rows for input[["age"]]) and where one or more of your auxiliary variables was also missing in the same row. In these cases, the imputation algorithm (e.g. pmm) cannot sample a value to use for imputation, since the predictors of those rows aren't fully observed.

You can run the following code to check this:

# Create a logical vector of length nrow(input) that is TRUE if any auxiliary variable in a row is missing
any_aux_missing <- with(
  input,
  is.na(test) | is.na(IAT_score) | is.na(att_7) | is.na(t_diff)
)

# For each variable that is imputed, count the rows where both the variable itself
# and at least one auxiliary variable are missing
for (col in names(method)[method != ""]) {
  n_joint_missing <- sum(is.na(input[[col]]) & any_aux_missing)
  cat(col, ":", n_joint_missing, "\n")
}

LukasWallrich · 2020-09-16T20:42:20Z

Author

Thank you, @prockenschaub, that's very helpful. Now I finally understand what's going on.

LukasWallrich · 2020-09-16T20:45:43Z

Author

@stefvanbuuren Given how similar my mistake was to #263 I might suggest adding a note that auxiliary variables should be imputed to the explanation of the method argument (or to the vignette). As it stands, I understood the possibility to exclude variables from being imputed as a quick-win to reduce runtime ...

stefvanbuuren · 2020-09-16T21:34:43Z

Maintainer

I have added to the mice() docs:

Skipping imputation: The user may skip imputation of a column by setting its 
entry to the empty method: "". For complete columns without missing data 
mice will automatically set the empty method. Setting the empty method does 
not produce imputations for the column, so any missing cells remain NA. If 
column A contains NA's and is used as predictor in the imputation model for 
column B, then mice produces no imputations for the rows in B where A is missing. 
The imputed data for B may thus contain NA's. The remedy is to remove column A 
from the imputation model for the other columns in the data. This can be done by 
setting the entire column for variable A in the predictorMatrix equal to zero.
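
The remedy in the last two sentences can be sketched as follows. The built-in `nhanes` data is used as a stand-in (an assumption), with `chl` playing the role of column A. On the mice version discussed in this thread the first call is expected to leave NAs in `bmi` and `hyp`; other versions may handle this case differently, so compare the two printed counts rather than relying on a fixed result.

```r
library(mice)

# nhanes ships with mice: age is complete; bmi, hyp and chl contain NAs
meth <- make.method(nhanes)
meth["chl"] <- ""                      # skip imputation of chl (column A)

pred <- make.predictorMatrix(nhanes)   # chl is still a predictor here

# Case 1: chl is skipped but still predicts bmi and hyp
imp1 <- mice(nhanes, m = 1, method = meth, predictorMatrix = pred,
             seed = 1, printFlag = FALSE)
print(colSums(is.na(complete(imp1))))

# Case 2: zero the whole chl column so it predicts nothing
pred[, "chl"] <- 0
imp2 <- mice(nhanes, m = 1, method = meth, predictorMatrix = pred,
             seed = 1, printFlag = FALSE)
print(colSums(is.na(complete(imp2))))
```

In the second run `bmi` and `hyp` come out complete, while `chl` keeps its original NAs because it was never imputed.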

ivmcphail · 2022-07-16T01:21:36Z

Hello,

I have been running into this issue, as did the OP, and similar to post #263. However, I am wondering why one cannot skip a given variable for imputation (column A in Dr. Van Buuren's post), still use it as an auxiliary/predictor variable, AND avoid NAs in the imputed variables (column B) when the skipped variable (column A) contains NAs. This seems to unnecessarily restrict the kinds of imputation models that can be built. I had been puzzling over getting my code to run before reading these posts, and I am now just wondering why this is the case in mice. Dr. Van Buuren, if you are able/have time, could you explain the reasoning behind not allowing a variable to be skipped, included in the imputation model along with other predictors, and contain missing values?

Many thanks,

Ian

2 replies

gerkovink Jul 16, 2022
Maintainer

Imagine that you need to post a stack of letters. Some letters miss the address, but contain the name of the addressee. However, you are not allowing yourself to look up the missing addresses. This means that some letters might get posted, but will not reach their destination.

It's mathematically impossible to obtain parameter estimates when the predictor space still contains missing values. It is only possible to estimate on the complete subset. So if you choose to exclude an incomplete variable from imputation, but want it to be part of that predictor space for another variable, mice needs to make a choice: it must either impute the column anyway, despite your specification, or obey the specification and not impute the rows for which you've posed the impossible restriction on the predictor space.

In mice, we choose the latter solution, because it is true to the user specification.
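
The same constraint shows up in an ordinary regression, which makes for a small self-contained illustration. This is a sketch on made-up numbers using base R's `lm()`, not mice's internal code: `y` stands for the variable to be imputed and `x` for an incomplete predictor that is itself left unimputed.

```r
# Toy data: y is to be imputed from x; x itself has a missing value
d <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2.1, 3.9, 6.2, NA, NA, 12.1)
)

# The model can only be estimated on rows where both x and y are observed
fit <- lm(y ~ x, data = d)
nobs(fit)

# Predict y for the two rows where y is missing:
# the row with x observed gets a value, the row with x missing stays NA
p <- predict(fit, newdata = d[is.na(d$y), ])
print(p)
```

Without a value for `x` there is nothing to plug into the fitted model, which is the situation in the rows that mice leaves as NA.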

ivmcphail Jul 16, 2022

Yes, this makes sense. Thank you for the response, this hadn't been quite clear to me, so it is good to know as I use the mice package.

Ian

ianaianawong · 2023-07-28T16:23:35Z

Hi there,

I ran into a similar problem, so I left my question in this thread. After the multiple imputation (pmm method), there are still missing values in my dataset (although the number of missing values was reduced).

I have checked that there was no issue with constant value or multicollinearity as there was no logged event. I have included most auxiliary variables in the multiple imputation. I removed 3 auxiliary variables earlier due to the presence of logged events. But after such removal, there were no logged events. I have also checked that no variables/columns were completely empty, whereas there were about 7 participants who did not answer any part of the survey (so about 7 rows were completely empty).

There are 14 variables in the main analyses and 10 auxiliary variables. All of them were included in the multiple imputation. All of them contain missing values. All variables in the main analyses are continuous. For auxiliary variables, 6 are categorical and 4 are continuous. The categorical variables were coded as factors in R.

I wonder why there were still missing values? Is this normal?

Can anyone please advise how I can get a complete imputed dataset? If not, can I proceed to multiple mediation analysis with those missing values?

I used this code for the multiple imputation:
alldata4.mi <- mice::mice(alldata4, m = 5, method = 'pmm')

Please see this link to part of my dataset: https://drive.google.com/file/d/1s_KNTSp4NlxvLYKhVWSPfYbBf0EeniXx/view?usp=drive_link

I've also checked out the following discussions, but they don't seem to have the relevant answer for my situation.
#350
https://www.statalist.org/forums/forum/general-stata-discussion/general/1470175-missing-imputed-values-still-present-after-doing%C2%A0multiple-imputation-mice
https://stackoverflow.com/questions/36330570/mice-does-not-impute-certain-columns-but-also-does-not-give-an-error?noredirect=1&lq=1
https://stackoverflow.com/questions/25472640/leftover-nas-after-imputing-using-mice

Thank you for your time and help in advance!!
