Explore board based on arrow's S3 support #530

Open
hadley opened this issue Oct 6, 2021 · 3 comments
Labels
feature a feature request or enhancement

Comments

@hadley
Member

hadley commented Oct 6, 2021

https://arrow.apache.org/docs/r/articles/fs.html#file-systems-that-emulate-s3

@hadley hadley added the feature a feature request or enhancement label Dec 9, 2021
@juliasilge
Member

Via @gshotwell, this much is already possible:

library(pins)

board <- board_connect(server = "https://colorado.posit.co/rsc/",
                       account = "[email protected]",
                       key = Sys.getenv("COLORADO_KEY"))

pin(mtcars, board = board)

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# httpfs lets DuckDB read files over HTTP(S): install once, then load per session
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")

dbGetQuery(con, "SELECT mpg FROM 'https://colorado.posit.co/rsc/content/519521d1-a6a1-45e6-a5ec-01046686f85f/data.csv'")

@gshotwell
Contributor

This is what Hugging Face does for their flat files. The way they do it is:

  • Convert everything to parquet
  • Shard files at 500MB

I think this would be a very good Connect feature because it really reduces the memory footprint of Connect assets without sacrificing much speed.
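
A minimal sketch of the convert-and-shard idea above, using arrow to write the shards and duckdb to query them. This is not how Hugging Face or Connect does it: arrow's `write_dataset()` caps shards by row count rather than by bytes, and the 10-row limit and the `shard_dir` name are assumptions chosen only so the tiny `mtcars` data splits into several files.

```r
library(arrow)
library(DBI)
library(duckdb)

# Write mtcars as a directory of Parquet shards (hypothetical temp location).
shard_dir <- tempfile("mtcars_parquet_")
write_dataset(
  mtcars,
  shard_dir,
  format = "parquet",
  max_rows_per_file = 10,   # illustrative shard size, in rows (not bytes)
  max_rows_per_group = 10   # keep row groups no larger than a shard
)

# The shards that were written.
files <- list.files(shard_dir, pattern = "\\.parquet$", full.names = TRUE)

# Query all shards as one table through a glob.
con <- dbConnect(duckdb())
glob <- file.path(normalizePath(shard_dir, winslash = "/"), "*.parquet")
n <- dbGetQuery(
  con,
  sprintf("SELECT count(*) AS n FROM read_parquet('%s')", glob)
)$n
dbDisconnect(con, shutdown = TRUE)
```

Because DuckDB reads only the columns and row groups a query needs, the full dataset does not have to be loaded into R's memory to answer it.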

@machow
Collaborator

machow commented Jun 8, 2023

Isn't the example above working only because that file is publicly readable? There needs to be some kind of R filesystem abstraction that duckdb can use to authenticate (either arrow's filesystem, something similar to fsspec in Python, or duckdb's httpfs for non-Connect cases).

I'm guessing you can use httpfs right now, but it won't support Connect, since Connect is not S3-compatible (httpfs only handles S3, GCS, etc.).
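
One possible workaround for the authentication gap described above, sketched under assumptions: download the file in R with an API key header, then point duckdb at the local copy. It sidesteps httpfs rather than teaching it to authenticate, and gives up remote partial reads. The helper names `connect_auth_headers()` and `query_private_csv()` and the `CONNECT_API_KEY` environment variable are hypothetical; the `Authorization: Key <key>` header format is assumed from Connect's server API.

```r
library(DBI)
library(duckdb)

# Build the auth header (assumed format: "Authorization: Key <api key>").
connect_auth_headers <- function(key = Sys.getenv("CONNECT_API_KEY")) {
  c(Authorization = paste("Key", key))
}

# Download a non-public CSV with the auth header, then query the local copy.
# `sql` must contain one %s placeholder where the file path is substituted.
query_private_csv <- function(url, sql, key = Sys.getenv("CONNECT_API_KEY")) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path), add = TRUE)
  download.file(url, path, mode = "wb", quiet = TRUE,
                headers = connect_auth_headers(key))

  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dbGetQuery(con, sprintf(sql, normalizePath(path, winslash = "/")))
}
```

Usage would look like `query_private_csv(url, "SELECT mpg FROM read_csv_auto('%s')")`, where `url` is the pin's data file on the Connect server.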
