Consider changing default format for dataframes to arrow or CSV #666
We recently moved arrow to Suggests in #646, so this would likely mean some new users would be prompted to install another package, even when using the defaults.
Adding arrow as a requirement seems like it could introduce some friction (maybe?). I wonder if the audience for pins might lean toward CSV (for example, this pins blog post aims at an audience that is emailing CSVs, so maybe moving from emailing CSVs to stashing CSVs with pins might feel like a smaller step?). (This is me mostly thinking of pins as a very early stepping stone for data versioning/sharing, since I'd personally be very into storing everything in arrow/parquet!)
I would suggest CSV as the default. We often share via Connect, and it is frustrating for non-R/Python users when they go to the Connect landing page for that dataset and can't download the file in a format they can understand or open easily.
Reading CSV via
I agree about the downsides of CSVs, especially the lack of explicit variable types. When pins saves a CSV, could it save a second file that stores the variable info? Essentially a serialized/dput-ed readr::col_types object? I don't like having to redefine (a) integer vs. floating point, and (b) factor levels. If the data is later imported by pins, pins would look for the metadata and use it. But the CSV is still valid and can be read by other programs that don't know how to interpret the "mtcars.readr_col_types" plain-text file. The metadata file isn't critical; it's just optional gravy.
@wibeasley That is an interesting suggestion! As of now, we would recommend that folks follow this vignette for managing custom formats, like reading CSVs with more control:

```r
library(pins)
library(palmerpenguins)

b <- board_temp()

penguin_col_spec <- as.character(readr::as.col_spec(penguins))
penguin_col_spec
#> [1] "ffddiifi"

b %>%
  pin_write(
    penguins,
    "very-nice-penguins",
    type = "csv",
    metadata = list(col_spec = penguin_col_spec)
  )
#> Creating new version '20230223T212321Z-809e9'
#> Writing to pin 'very-nice-penguins'

new_col_spec <- pin_meta(b, "very-nice-penguins")$user$col_spec

pin_download(b, "very-nice-penguins") %>%
  readr::read_csv(col_types = new_col_spec)
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g
```

Created on 2023-02-23 with reprex v2.0.2

Those last two bits could be wrapped up in a function.
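The write/read steps above could be wrapped in small helpers. A minimal sketch (the `pin_write_csv_spec()`/`pin_read_csv_spec()` names are hypothetical, not part of pins):

```r
library(pins)

# Hypothetical helper: write a CSV pin and stash its readr column spec
# in the pin's user metadata so it survives the round trip.
pin_write_csv_spec <- function(board, x, name, ...) {
  spec <- as.character(readr::as.col_spec(x))
  pin_write(board, x, name, type = "csv",
            metadata = list(col_spec = spec), ...)
}

# Hypothetical helper: read a CSV pin back using the stored column spec,
# falling back to readr's usual type guessing if no spec was saved.
pin_read_csv_spec <- function(board, name) {
  spec <- pin_meta(board, name)$user$col_spec
  path <- pin_download(board, name)
  readr::read_csv(path, col_types = spec)
}
```

Anyone not using these helpers still gets a plain, valid CSV, so the metadata stays optional.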
An argument to specify a CSV reading function would be helpful. For example, I have a colleague who's writing pins from Python with type='csv', and then I want to read them in R, but with a different reading function.
This is definitely the right thing to do for now. CSV writing can be so cantankerous, especially if you are mixing R and Python (or something else). Have you talked with your colleague about switching to parquet? Is there a particular constraint that makes that not a good move?
I was just thinking about the problem reported by @leslem again today, and how it highlights that switching to CSV would not really solve all user pain around this issue. In rstudio/pins-python#231, @isabelizimm added support for reading
In #843 I added nanoparquet, a lightweight, zero-dependency option for reading/writing parquet, to pins; it will be released in pins 1.4.0. I expect that to go well, but let's give it another release or two, and then consider finally switching the default format for writing data frames to parquet.
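In the meantime, users don't have to wait for a default change: `pin_write()` already accepts `type = "parquet"`, which preserves column types across R and Python. A minimal sketch (using `mtcars` purely as an example dataset):

```r
library(pins)

b <- board_temp()

# Writing with type = "parquet" keeps explicit column types, avoiding
# the CSV re-guessing problem discussed above.
b %>% pin_write(mtcars, "mtcars-parquet", type = "parquet")

# Reading back restores the data frame with its types intact;
# Python users can read the same pin via the pins-python package.
df <- pin_read(b, "mtcars-parquet")
```

With pins 1.4.0 onward this path can use nanoparquet rather than pulling in the heavier arrow dependency.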
We have seen users who write using the default from R and then are frustrated when their Python colleagues can't read the pin. We have considered changing to arrow for a long time:

pins-r/R/pin-read-write.R, line 115 (commit 3d2dc6b)

Does arrow have enough usage in the community for this to be reasonable? It would be a much better choice if interoperability is one of the main reasons people use pins (to read with Python).