[Feature] Provide an operator that loads files content to parquet #543

Open
2 tasks done
touma-I opened this issue Aug 26, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@touma-I
Collaborator

touma-I commented Aug 26, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

We currently have three transforms: HTML2Parquet, PDF2Parquet, and Code2Parquet. As a user, I want to be able to specify any file type (.txt, .image, .py, or anything else) and have its content loaded to parquet.

Assumption: This stage does not try to understand the blob in the loaded file. It is assumed that transforms in a later stage will understand the content type and process it appropriately; the first stage simply loads the content to parquet.
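
To make the assumption concrete, here is a minimal sketch of a content-agnostic loader. It is not the project's implementation: the function names (`files_to_rows`, `write_parquet`) and the column names (`filename`, `ext`, `size`, `contents`) are hypothetical, and pyarrow is assumed as the parquet writer.

```python
from pathlib import Path


def files_to_rows(paths):
    """Read each file as opaque bytes; no attempt is made to interpret it.

    Column names here are illustrative assumptions, not an agreed schema.
    """
    rows = []
    for p in map(Path, paths):
        data = p.read_bytes()
        rows.append({
            "filename": p.name,
            "ext": p.suffix.lower(),  # lets a later transform pick a parser
            "size": len(data),
            "contents": data,         # raw blob, stored as a binary column
        })
    return rows


def write_parquet(rows, out_path):
    """Write the rows to a parquet file (assumes pyarrow is installed)."""
    # Imported lazily so the row-building step has no third-party dependency.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_path)
```

Keeping the extension in its own column is one way a downstream transform could decide how to process each blob without this stage knowing anything about the content.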

Question: cc @nirmdesai
If the file is an aggregate of multiple files (e.g., .tar), do we want its content untarred, with each file in a separate row?
If the file is compressed (e.g., .zip), do we want it unzipped?
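
For discussion, this is roughly what the "one row per contained file" option could look like for a tar archive. It is a sketch under assumptions, not a decision: `tar_to_rows` and the `archive`, `filename`, and `contents` columns are hypothetical names.

```python
import io
import tarfile


def tar_to_rows(tar_bytes, archive_name):
    """Expand a tar archive held in memory into one row per regular file."""
    rows = []
    # "r:*" lets tarfile detect gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and device entries
            rows.append({
                "archive": archive_name,   # which archive the row came from
                "filename": member.name,   # path of the file inside the archive
                "contents": tar.extractfile(member).read(),
            })
    return rows
```

The alternative answer to the question would be to store the archive itself as a single blob row and leave expansion to a later transform.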

cc: @shahrokhDaijavad, please capture in a reply to this issue any additional information you have. I want to make sure all the points for discussion on this issue are captured here. Thanks.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
touma-I added the enhancement (New feature or request) label on Aug 26, 2024
@shahrokhDaijavad
Member

Additional info: The zip2parquet implementation in PR #525 by Boris (not merged) is a superset of Code2Parquet. In default mode it acts exactly like Code2Parquet (on a zip of code files), but when a command-line flag is set it can also handle a zip of .txt files.

@daw3rd
Member

daw3rd commented Aug 26, 2024

So perhaps zip2parquet needs a configurable set of extensions that selects which files from the zip are imported: one file per row, with a column indicating the source file name from the zip. The default could just be all files, I suppose.
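
A minimal sketch of that idea, using only the standard library. It does not reflect the actual options in PR #525: `zip_to_rows`, its `extensions` parameter, and the `source_file` column are hypothetical names chosen for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def zip_to_rows(zip_bytes, extensions=None):
    """Return one row per file in the zip, optionally filtered by extension.

    extensions: iterable such as {".txt", ".py"}; None (the default) imports all files.
    """
    wanted = None if extensions is None else {e.lower() for e in extensions}
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            ext = PurePosixPath(info.filename).suffix.lower()
            if wanted is not None and ext not in wanted:
                continue
            rows.append({
                "source_file": info.filename,  # path of the file inside the zip
                "ext": ext,
                "contents": zf.read(info),
            })
    return rows
```

Calling `zip_to_rows(data)` imports everything, while `zip_to_rows(data, {".txt"})` keeps only text files; matching is case-insensitive on the extension.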

6 participants
@daw3rd @shahrokhDaijavad @touma-I and others