
Consider changing default format for dataframes to arrow or CSV #666

Open
juliasilge opened this issue Oct 28, 2022 · 10 comments
Labels
feature a feature request or enhancement

Comments

@juliasilge
Member

We have seen users who write pins using the default format from R, and are then frustrated when their Python colleagues can't read them. We have considered changing to arrow for a long time:

# Might consider switch to arrow in the future

Does arrow have enough usage in the community for this to be reasonable? It would be a much better choice if interoperability is one of the main reasons people use pins (to read with Python).

@juliasilge
Member Author

We recently moved arrow to Suggests in #646, so this change would likely mean some new users would be prompted to install another package, even when using the defaults.

@machow
Collaborator

machow commented Nov 1, 2022

Adding arrow as a requirement seems like it could introduce some friction (maybe?). I wonder if the audience for pins might lean toward CSV (for example, this pins blog post aims at an audience that is emailing CSVs, so maybe emailing CSV -> stashing CSV with pins might feel like a smaller step?).

(This is me mostly thinking of pins as a very early stepping stone for data versioning / sharing, since I'd personally be very into storing everything in arrow/parquet!)

@iainmwallace

I would suggest CSV as the default. We often share via Connect, and it is frustrating for non-R/Python users when they go to the Connect landing page for that dataset and can't download the file in a format they can understand or open easily.

@juliasilge
Member Author

Reading CSV via read.csv() often has downsides, like type guessing that goes wrong, dates not being handled, etc. If we consider changing the default to CSV, would it be better (less surprising overall, easier collaboration with Python folks, etc.) to use vroom for reading and writing?
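For example, a round trip through base R's write.csv()/read.csv() silently drops the Date type, because CSV carries no type information and read.csv() guesses from the text alone:

```r
# Round-trip a Date column through CSV with base R: the type is lost
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(day = as.Date("2022-10-28"), x = 1L), tmp, row.names = FALSE)
df <- read.csv(tmp)
class(df$day)  # "character", not "Date"
```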

@juliasilge juliasilge changed the title Consider changing default format for dataframes to arrow Consider changing default format for dataframes to arrow or CSV Nov 4, 2022
@juliasilge juliasilge added the feature a feature request or enhancement label Nov 4, 2022
@wibeasley

I agree about the downsides of CSVs, especially the lack of explicit variable types. When pins saves a CSV, could it save a second file that stores the variable info? Essentially a serialized/dput-ed readr::col_types object?

I don't like having to redefine (a) integer vs floating, and (b) factor levels.

If the data is later imported by pins, pins would look for the metadata and use it. But the CSV is still valid and can be read by other programs that don't know how to interpret the "mtcars.readr_col_types" plain-text file. The metadata file isn't critical; it's just optional gravy.

@juliasilge
Member Author

@wibeasley That is an interesting suggestion! As of now, we would recommend that folks follow this vignette for managing custom formats, like reading CSVs with more control:

library(pins)
library(palmerpenguins)

b <- board_temp()

penguin_col_spec <- as.character(readr::as.col_spec(penguins))
penguin_col_spec
#> [1] "ffddiifi"

b %>% 
  pin_write(
    penguins, 
    "very-nice-penguins",
    type = "csv",
    metadata = list(col_spec = penguin_col_spec)
  )
#> Creating new version '20230223T212321Z-809e9'
#> Writing to pin 'very-nice-penguins'

new_col_spec <- pin_meta(b, "very-nice-penguins")$user$col_spec
pin_download(b, "very-nice-penguins") %>%
  readr::read_csv(col_types = new_col_spec)
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#> #   ²​body_mass_g

Created on 2023-02-23 with reprex v2.0.2

Those last two bits could be wrapped up in a pin_read_col_spec() helper function for an individual to use, if they always wanted to set up their files this way.
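A sketch of what such a helper might look like (the name pin_read_col_spec() and its behavior are assumptions here, not an existing pins function):

```r
# Hypothetical helper (not part of pins): read a CSV pin using a readr
# column specification previously stashed in the pin's user metadata,
# as in the reprex above
pin_read_col_spec <- function(board, name, ...) {
  col_spec <- pins::pin_meta(board, name, ...)$user$col_spec
  path <- pins::pin_download(board, name, ...)
  readr::read_csv(path, col_types = col_spec)
}
```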

@leslem

leslem commented Apr 15, 2024

An argument to specify a CSV reading function (e.g., read.csv, readr::read_csv, or data.table::fread) would be good for the use case that led me to this issue. Alternatively, passing arguments on to read.csv would be helpful.

I have a colleague who's writing pins from Python as type='csv', and then I want to read them in R, but with read.csv under the hood I get column names modified (to be made syntactic) and column types I don't want. For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.
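The name mangling can be demonstrated with a local temp board (the pin name is a placeholder, and the pin_read() behavior described in the comment is what read.csv()'s default check.names = TRUE does):

```r
# A CSV pin with a non-syntactic column name, as Python might write it
b <- pins::board_temp()
pins::pin_write(b, data.frame(`oddly named` = 1, check.names = FALSE),
                "shared-data", type = "csv")
# pin_read() goes through read.csv(), which would mangle the name to
# "oddly.named"; downloading the raw file and parsing with readr keeps it
path <- pins::pin_download(b, "shared-data")
df <- readr::read_csv(path, show_col_types = FALSE)
names(df)  # "oddly named"
```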

@juliasilge
Member Author

For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.

This is definitely the right thing to do for now.

CSV writing can be so cantankerous, especially if you are using R and Python or something else. Have you talked with your colleague about considering switching to parquet? Is there a particular constraint that makes that not a good move?

@juliasilge
Member Author

I was just thinking about the problem reported by @leslem again today, and how it highlights that switching to CSV will not really solve all user pain around this issue.

In rstudio/pins-python#231 @isabelizimm added support for reading .rds files from Python, which means that Python users will be able to read rectangular data written from R with the current default. The rdata package which powers that PR uses the binary types of R objects, which makes it kind of like a poor man's arrow. It really improves the situation. I would still recommend that R + Python collaborators use parquet, but with that change on the Python side, maybe we don't want to change the default format for dataframes, at least not without a lightweight option for reading/writing parquet.

@juliasilge
Member Author

In #843 I added nanoparquet, a lightweight, zero-dependency option for reading/writing parquet, to pins. This will get released in pins 1.4.0. I expect that to go well, of course, but let's give it another release or two, and then consider finally switching the default for writing dataframes to parquet.
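In the meantime, opting into parquet explicitly is a one-argument change (board and pin name below are placeholders):

```r
# Write a dataframe pin as parquet rather than the rds default; in
# pins >= 1.4.0 this goes through nanoparquet, so both R and Python
# readers get typed columns back
b <- pins::board_temp()
pins::pin_write(b, mtcars, "mtcars-parquet", type = "parquet")
df <- pins::pin_read(b, "mtcars-parquet")
```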

5 participants