`duckplyr_df_from_csv`: overriding auto-detect for a single column apparently not available #223

math-mcshane · 2024-08-13T00:36:00Z

I attended a posit::conf 2024 workshop and ran into this issue with `duckplyr_df_from_csv`:

Create example data

Using the first three rows of the IMDB ratings data:

myData = readr::read_tsv(
  "tconst	averageRating	numVotes
  tt0000001	5.7	2051
  tt0000002	5.7	274
  tt0000003	6.5	2005"
)
readr::write_tsv(myData, file = "ratings.tsv")

Default behavior

The automatic read works (setting `delim` is not required), but it does not cast the last column to the most accurate type, integer.

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv", 
  options = list(
    delim = "\t"
  )
)

Manual fix

This reads the data correctly; however, it requires manually re-specifying the first two column types. According to https://duckdb.org/docs/data/csv/auto_detection, `types = {'numVotes': 'INTEGER'}` would be the SQL approach.

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c("VARCHAR", "DOUBLE", "INTEGER"))
  )
)
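
For comparison, the SQL approach mentioned above can be run from R through DBI. This is only a minimal sketch of a possible workaround, assuming the DBI and duckdb packages are installed; it bypasses `duckplyr_df_from_csv` entirely and returns a plain data frame rather than a duckplyr data frame. The file is rewritten here so that the snippet is self-contained.

```r
# Recreate the example file (same three rows as above).
text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

con <- DBI::dbConnect(duckdb::duckdb())

# Pass a struct to `types` so that only `numVotes` is overridden;
# the remaining columns are still auto-detected by DuckDB.
ratings <- DBI::dbGetQuery(
  con,
  "SELECT * FROM read_csv('ratings.tsv', delim = '\t', types = {'numVotes': 'INTEGER'})"
)
str(ratings)

DBI::dbDisconnect(con)
```

The struct form of `types` matches columns by name, which is the behavior this issue asks for in the `options` list.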

Kirill's attempt

We could not find a way to specify just a single column. Here's one of Kirill's attempts:

duckplyr_df_csv = duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(list(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)


krlmlr · 2024-08-13T13:46:34Z

Thanks, Ryan, this is helpful! This looks like it should work. Here are a few more experiments:

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works but need to specify redundant type for the `averageRating` column
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c("VARCHAR", "DOUBLE", "INTEGER"))
  )
)
#> No duckplyr fallback reports ready for upload.
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Names are ignored and types are applied by position: `averageRating` becomes INTEGER, `numVotes` stays BIGINT
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(c(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (INTEGER)
#> - numVotes (BIGINT)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <int>    <dbl>
#> 1 tt0000001             6     2051
#> 2 tt0000002             6      274
#> 3 tt0000003             7     2005

# Fails
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = c(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

# How is a struct mapped back to R?
con <- DBI::dbConnect(duckdb::duckdb())
# Namespace-qualified calls, so that neither tibble nor DBI needs to be attached
tibble::as_tibble(DBI::dbGetQuery(
  con,
  "SELECT {'FlightDate': 'DATE', 'Origin': 'VARCHAR', 'Dest': 'VARCHAR', 'DepTime': 'INTEGER', 'ArrTime': 'INTEGER'} AS a"
))

# Trying a data frame
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(data.frame(tconst = "VARCHAR", numVotes = "INTEGER"))
  )
)
#> Error: {"exception_type":"Binder","exception_message":"read_csv_auto types requires a list of types (varchar) as input"}

Created on 2024-08-13 with reprex v2.1.0

krlmlr · 2024-09-14T09:06:52Z

With duckdb/duckdb-r#379, the following works, but other unintended usages still give surprises.

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works now
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Unexpected result
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(a = "VARCHAR", b = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (BIGINT)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <dbl>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

Created on 2024-09-14 with reprex v2.1.0

krlmlr · 2024-09-14T10:12:38Z

Now, in the same PR, the unintended usage raises an error instead of silently being ignored:

text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")

# Works now
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
  )
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#> 
#> # A tibble: 3 × 3
#>   tconst    averageRating numVotes
#>   <chr>             <dbl>    <int>
#> 1 tt0000001           5.7     2051
#> 2 tt0000002           5.7      274
#> 3 tt0000003           6.5     2005

# Unintended usage now fails with an error
duckplyr::duckplyr_df_from_csv(
  path = "ratings.tsv",
  options = list(
    delim = "\t",
    types = list(a = "VARCHAR", b = "INTEGER")
  )
)
#> Error: rel_from_table_function: Need scalar parameter

Created on 2024-09-14 with reprex v2.1.0

krlmlr · 2024-10-16T11:06:48Z

Does the current development version work for you?

krlmlr added the "feature" label (a feature request or enhancement) on Aug 13, 2024
