Pulling from multiple source tables with get_acs() causes erroneous NAs #565

Open
jacksonvoelkel opened this issue Apr 5, 2024 · 2 comments

Comments

@jacksonvoelkel

When pulling variables from multiple data source tables with get_acs(), I've noticed inconsistent behavior:

First, we can pull a list of variables from the same data source (the "DP" Data Profile tables) without issue:

library(tidycensus) # v1.6.3
library(dplyr) # v1.1.4

# Pulling from the same data source ---------------------------------------

# All "DP" variables
vars_dp <- c("DP04_0001", "DP02_0001")

acs_from_list_same_table <- get_acs(
  geography = "tract", 
  variables = vars_dp, 
  year = 2010,
  state = c("AL", "NY", "CA"),
  output = "wide",
  cache = FALSE) %>% 
  bind_rows()

table(is.na(acs_from_list_same_table$DP04_0001E))

This returns zero NA values for DP04_0001E across the three states pulled.

Next, I add a variable from a different data source (the "S" Subject Tables):

# Pulling from different source tables ------------------------------------

# A mix of "DP" and "S" variables
vars_dps <- c("DP04_0001", "S0601_C01_001")

acs_from_list_different_table <- get_acs(
  geography = "tract", 
  variables = vars_dps, 
  year = 2010,
  state = c("AL", "NY", "CA"),
  output = "wide",
  cache = FALSE) %>% 
  bind_rows()

table(is.na(acs_from_list_different_table$DP04_0001E))

DP04_0001E now contains many NA values.

Comparing the two results, the values match wherever the multiple-source pull is not NA:

# Comparing ---------------------------------------------------------------

joined_data <- left_join(x = acs_from_list_same_table,
                         y = acs_from_list_different_table, 
                         by = "GEOID") %>% 
  select(starts_with("DP04"))

# In instances where "multiple data source" values were not NA, they match the
#  "pulled from a single data source" version's values. 
joined_data %>% 
  print(n = 10)

joined_data %>% 
  filter(complete.cases(.)) %>% 
  print(n = 10)

This was working fine a couple of months ago, but unfortunately I don't have a record of which tidycensus version I was using at the time.

@resistor4u

Are some values returning -888888888? I'm not able to see the API call results to check.
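
One way to check for such sentinel values once the data is in hand is sketched below. The helper name `count_sentinel` is hypothetical, and the toy vector merely stands in for a column such as DP04_0001E; the numbers are made up for illustration.

```r
# Hypothetical helper: count how many entries equal a given sentinel code,
# ignoring entries that are already NA.
count_sentinel <- function(x, sentinel = -888888888) {
  sum(x == sentinel, na.rm = TRUE)
}

# Toy stand-in for an estimate column (illustrative values only)
estimates <- c(1520, NA, -888888888, 987, NA)

count_sentinel(estimates)  # entries equal to the sentinel code
sum(is.na(estimates))      # entries that are genuinely NA
```

If the sentinel count is zero while the NA count is high, the NAs are more likely introduced during processing than returned by the API.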

@walkerke
Owner

I found the error; it's here:

https://github.com/walkerke/tidycensus/blob/master/R/acs.R#L405

To combine variables from different tables, we join on GEOID and NAME to preserve both columns. The problem is that in the 2010 ACS, the NAME column is not consistent across datasets (some use commas, others use semicolons as separators). This does not appear to be an issue in later years, which is why we never noticed it.

This should be a quick fix!
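
The mismatch described above can be reproduced with a minimal sketch in dplyr. The two toy tibbles below are assumptions made up for illustration (they are not real API output, and this is not the actual tidycensus code): they share GEOIDs, but one NAME uses semicolons where the other uses commas.

```r
library(dplyr)

# Toy "DP" result: NAME uses commas as separators
dp <- tibble(
  GEOID = c("01001020100", "01001020200"),
  NAME = c("Census Tract 201, Autauga County, Alabama",
           "Census Tract 202, Autauga County, Alabama"),
  DP04_0001E = c(100, 200)
)

# Toy "S" result: the first NAME uses semicolons instead
s <- tibble(
  GEOID = c("01001020100", "01001020200"),
  NAME = c("Census Tract 201; Autauga County; Alabama",
           "Census Tract 202, Autauga County, Alabama"),
  S0601_C01_001E = c(1000, 2000)
)

# Joining on both keys: the tract with mismatched NAMEs splits into two
# rows, each with an NA in the column that came from the other table.
by_both <- full_join(dp, s, by = c("GEOID", "NAME"))

# Joining on GEOID alone (dropping the second NAME) keeps one row per tract.
by_geoid <- full_join(dp, select(s, -NAME), by = "GEOID")

table(is.na(by_both$DP04_0001E))
table(is.na(by_geoid$DP04_0001E))
```

Joining on GEOID alone is one possible way around the inconsistency; whether the package adopts that or normalizes NAME instead is not stated in this thread.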
