Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duckplyr_df_from_csv: overriding auto-detect for a single column apparently not available #223

Open
math-mcshane opened this issue Aug 13, 2024 · 4 comments
Labels
feature a feature request or enhancement

Comments

@math-mcshane
Copy link

Attended posit::conf 2024 workshop and ran into this issue with duckplyr_df_from_csv:

Create example data

using first three rows of IMDB data

myData = readr::read_tsv(
  "tconst	averageRating	numVotes
  tt0000001	5.7	2051
  tt0000002	5.7	274
  tt0000003	6.5	2005"
)
readr::write_tsv(myData, file = "ratings.tsv")

Default behavior

automatic read in is fine (setting delim not required) but does not most accurately cast last column as integer

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv", 
  options = list(
    delim = "\t"
  )
)

Manual fix

This is a correct way to read in this data, however, it requires manually re-specifying the first two column types. From https://duckdb.org/docs/data/csv/auto_detection, types = {'numVotes': 'INTEGER'} would be the SQL approach

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c("VARCHAR", "DOUBLE", "INTEGER"))
  )
)

Kirill's attempt

We could not find a way to specify just a single column. Here's one of Kirill's attempts:

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(list(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)
@krlmlr krlmlr added the feature a feature request or enhancement label Aug 13, 2024
@krlmlr
Copy link
Member

krlmlr commented Aug 13, 2024

Thanks, Ryan, this is helpful! This looks like it should work, here are a few more experiments:

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works but need to specify redundant type for the `averageRating` column
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c("VARCHAR", "DOUBLE", "INTEGER"))
  )
)
#> No duckplyr fallback reports ready for upload.
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Doesn't change `numVotes`
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (INTEGER)
#> - numVotes (BIGINT)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <int>    <dbl>
#> 1 tt0000001             6     2051
#> 2 tt0000002             6      274
#> 3 tt0000003             7     2005

# Fails
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = c(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

# How is a struct mapped back to R?
con <- DBI::dbConnect(duckdb::duckdb())
as_tibble(dbGetQuery(
  con,
  "SELECT {'FlightDate': 'DATE', 'Origin': 'VARCHAR', 'Dest': 'VARCHAR', 'DepTime': 'INTEGER', 'ArrTime': 'INTEGER'} AS a"
))
#> Error in as_tibble(dbGetQuery(con, "SELECT {'FlightDate': 'DATE', 'Origin': 'VARCHAR', 'Dest': 'VARCHAR', 'DepTime': 'INTEGER', 'ArrTime': 'INTEGER'} AS a")): could not find function "as_tibble"

# Trying a data frame
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(data.frame(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)
#> Error: {"exception_type":"Binder","exception_message":"read_csv_auto types requires a list of types (varchar) as input"}

Created on 2024-08-13 with reprex v2.1.0

@krlmlr
Copy link
Member

krlmlr commented Sep 14, 2024

With duckdb/duckdb-r#379, the following works, but other unintended usages still give surprises.

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works now
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Unexpected result
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(a = "VARCHAR", b = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (BIGINT)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <dbl>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

Created on 2024-09-14 with reprex v2.1.0

@krlmlr
Copy link
Member

krlmlr commented Sep 14, 2024

Now in the same PR:

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works now
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Unexpected result
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(a = "VARCHAR", b = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

Created on 2024-09-14 with reprex v2.1.0

@krlmlr
Copy link
Member

krlmlr commented Oct 16, 2024

Does the current development version work for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants