ibis.FileDataset read files from web #918

mark-druffel · 2024-10-29T22:14:42Z

Description

ibis.FileDataset fails when reading files from the web, either from certain domains or possibly from all of them. With ibis on its own, I can read data from Hugging Face:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

Based on a conversation with @deepyaman, the above code might only work because DuckDB has a Hugging Face extension. That said, I found a GitHub repo with the same file, tried reading it with ibis, and it worked:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv")

However, when I add either file (hf:// or raw.githubusercontent.com) to the catalog as a FileDataset, my pipeline fails:

tracks:
  type: ibis.FileDataset
  filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
  file_format: csv
  connection: ${connection:spotify}
  load_args:
    sep: ","
  save_args:
    materialized: view
    overwrite: True

kedro.io.core.DatasetError: Failed while loading data from dataset FileDataset(backend=duckdb, file_format=csv, filepath=hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv, load_args={'sep': ,}, save_args={'materialized': view, 'overwrite': True}). IO Error: No files found that match the pattern "/spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv":

The issue appears to be with _get_load_path().
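
One plausible mechanism, offered as an assumption rather than something confirmed against the kedro-datasets source: if the dataset wraps `filepath` in `pathlib.PurePosixPath` before `_get_load_path()` returns it, pathlib collapses the double slash after the scheme. That would explain why the error message shows `hf:/` even though the catalog entry says `hf://`. A minimal demonstration using only the standard library:

```python
from pathlib import PurePosixPath

raw = "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"

# PurePosixPath treats the string as a filesystem path, so the empty
# component between the two slashes after "hf:" is dropped.
mangled = str(PurePosixPath(raw))

print(mangled)
# hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
```

The result is also a relative path. That would be consistent with the `/spotify/` prefix in the error if something later resolves it against the project directory, though that part is a guess.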

Context

I don't fully understand the extent of the issue, so it's hard to advocate for fixing it. Being able to read files from data lakes (S3, Azure, etc.) is obviously essential, but I think those already work with _get_load_path(). Being able to plug in a URL pointing to a file anywhere on the web could be really convenient, but I don't have a solid use case for it at the moment.

Possible Implementation

I'm not entirely sure this is valid, but since ibis seems to work on its own in these examples, perhaps ibis.FileDataset could try passing the raw path to ibis before failing...
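
A minimal sketch of that idea, assuming the path handling can be isolated in one place. `resolve_load_path` is a hypothetical helper, not part of kedro-datasets, and the `"://"` check is a deliberately simple heuristic for "looks like a URL":

```python
from pathlib import PurePosixPath


def resolve_load_path(filepath: str) -> str:
    """Hypothetical helper: keep URLs intact, normalise local paths."""
    if "://" in filepath:
        # Looks like a URL (hf://, https://, s3://, ...):
        # hand it to the backend untouched.
        return filepath
    # Plain local path: normalise it the usual way.
    return str(PurePosixPath(filepath))


print(resolve_load_path("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"))
print(resolve_load_path("data//01_raw/spotify.csv"))
```

With this approach the first call prints the URL unchanged, while the second collapses the accidental double slash in the local path. Whether the real dataset class can skip its path wrapping this easily would need checking against the kedro-datasets code.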

Possible Alternatives

The easiest workaround is to just download the files.
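
As a rough sketch of that workaround, the file can be fetched once with the standard library and the catalog entry pointed at the local copy. `download_to` is a hypothetical helper and the destination path is only an example:

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_to(url: str, dest: str) -> Path:
    """Hypothetical helper: fetch url once and return the local path."""
    target = Path(dest)
    if not target.exists():
        # Create the parent folder on first use, then download.
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, target)
    return target
```

For example, `download_to("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv", "data/01_raw/spotify.csv")` would leave a local CSV that the `filepath` of the FileDataset can reference. This only covers plain HTTP(S) URLs, not `hf://` paths.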

lrcouto added the datasets and bug labels · Oct 29, 2024
