[Feature] Provide an operator that loads files content to parquet #543

touma-I · 2024-08-26T15:25:28Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

We currently have three transforms: HTML2Parquet, PDF2Parquet and Code2Parquet. As a user, I want to be able to specify any file type (.txt, .image, .py, whatever) and have its content loaded to parquet.

Assumption: This stage does not try to understand the blob in the loaded file. It is assumed that other transforms in the next stage will understand the content type and process it appropriately; the first stage simply loads the content to parquet.
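A minimal sketch of what such a content-agnostic first stage might look like, assuming Python and pyarrow. The function names (`files_to_rows`, `write_rows_to_parquet`) and the column names (`filename`, `ext`, `size`, `contents`) are hypothetical and are not taken from any existing transform in the repo:

```python
from pathlib import Path


def files_to_rows(paths):
    """Read each file as opaque bytes; no decoding or parsing is attempted."""
    rows = []
    for p in map(Path, paths):
        data = p.read_bytes()
        rows.append({
            "filename": p.name,          # hypothetical column names
            "ext": p.suffix.lower(),     # lets later transforms pick a handler
            "size": len(data),
            "contents": data,            # raw blob, left uninterpreted
        })
    return rows


def write_rows_to_parquet(rows, out_path):
    """Write the rows as one parquet file, one input file per row."""
    # Imported lazily so building rows needs only the standard library.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(rows)
    pq.write_table(table, out_path)
```

Keeping `contents` as bytes means the stage works the same way for text, images and code; a downstream transform can use `ext` to decide how to interpret each row.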

Question: cc @nirmdesai
If the file is an aggregate of multiple files (e.g. .tar), do we want its content untarred, with each file in a separate row?
If the file is compressed (e.g. .zip), do we want it unzipped?

cc: @shahrokhDaijavad, please capture in a reply to this issue any additional information you have. I want to make sure all the points for discussion on this issue are captured here. Thanks

Are you willing to submit a PR?

Yes I am willing to submit a PR!

shahrokhDaijavad · 2024-08-26T17:54:42Z

Additional info: The zip2parquet implementation in PR #525 by Boris (not merged) is a superset of Code2Parquet. In default mode it acts exactly like Code2Parquet (on a zip of code files), but when a command-line flag is set it can also handle a zip of .txt files.

daw3rd · 2024-08-26T23:05:21Z

So perhaps zip2parquet needs a configurable set of extensions that selects which files from the zip are imported, with one file per row and a column indicating the source file name within the zip. The default could just be all files, I suppose.
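A rough sketch of that idea, using only Python's standard `zipfile` module. The function name `zip_to_rows`, the `extensions` parameter and the column names are assumptions made for illustration; they do not reflect the actual zip2parquet options in PR #525:

```python
import zipfile
from pathlib import PurePosixPath


def zip_to_rows(zip_path, extensions=None):
    """Return one row per archive member; extensions=None imports every file."""
    wanted = None if extensions is None else {e.lower() for e in extensions}
    rows = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            ext = PurePosixPath(info.filename).suffix.lower()
            if wanted is not None and ext not in wanted:
                continue  # skip files whose extension was not requested
            rows.append({
                "source_file": info.filename,  # path of the member inside the zip
                "ext": ext,
                "contents": zf.read(info),     # raw bytes of the member
            })
    return rows
```

For example, `zip_to_rows("archive.zip", extensions={".txt", ".py"})` would keep only text and Python files, while omitting `extensions` imports everything.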

touma-I added the enhancement New feature or request label Aug 26, 2024
