I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
We currently have three transforms: HTML2Parquet, PDF2Parquet and Code2Parquet. As a user, I want to be able to specify any file type (.txt, .image, .py, whatever) and have its content loaded to parquet.
Assumption: This stage does not try to understand the blob in the loaded file. It is assumed that other transforms in the next stage will understand the content type and process it appropriately; the first stage simply loads the content to parquet.
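To make the assumption concrete, here is a minimal sketch of what "load the blob without understanding it" could look like. The function names and the column names (`filename`, `extension`, `size`, `contents`) are hypothetical and only for illustration; they are not taken from any existing transform.

```python
from pathlib import Path


def file_to_row(path):
    """Load one file as an opaque blob; no parsing or decoding is attempted."""
    p = Path(path)
    data = p.read_bytes()
    return {
        "filename": p.name,              # hypothetical column names
        "extension": p.suffix.lower(),   # kept so later transforms can dispatch on type
        "size": len(data),
        "contents": data,                # raw bytes, stored as a binary column
    }


def write_rows(rows, out_path):
    """Write a list of row dicts to a parquet file (requires pyarrow)."""
    # Imported lazily so that building rows has no third-party dependency.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_path)
```

A later transform could then filter on `extension` and decode `contents` however it sees fit.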
Question: cc @nirmdesai
If the file is an aggregate of multiple files (e.g. .tar), do we want its content untarred, with each file in a separate row?
If the file is compressed (e.g. .zip), do we want it unzipped?
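If the answer to the first question turns out to be yes, one possible shape for it is sketched below using only the standard library. `tar_to_rows` and its column names are hypothetical, and this is just one option for discussion, not a proposed final design.

```python
import io
import tarfile


def tar_to_rows(tar_bytes, archive_name):
    """Expand a tar archive held in memory into one row per regular file."""
    rows = []
    # "r:*" lets tarfile detect gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            data = tar.extractfile(member).read()
            rows.append(
                {
                    "archive": archive_name,   # hypothetical column names
                    "filename": member.name,   # path of the file inside the archive
                    "contents": data,
                }
            )
    return rows
```

If the answer is no, the archive would simply be stored as a single blob in a single row, like any other file.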
cc: @shahrokhDaijavad, please capture in a reply to this issue any additional information you have. I want to make sure all the points for discussion on this issue are captured here. Thanks.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Additional info: The zip2parquet implementation in PR #525 by Boris (not merged) is a superset of Code2Parquet. In its default mode it acts exactly like Code2Parquet (on a zip of code files), but with a command-line flag set it can also handle a zip of .txt files.
So perhaps zip2parquet needs a configurable set of extensions that determines which files from the zip are imported: one file per row, with a column indicating the source file name within the zip. The default could just be all files, I suppose.
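As a rough sketch of that idea (not the code from PR #525, which I am not reproducing here), extension filtering over a zip could look something like this. The function name, the `extensions` parameter and the column names are all hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def zip_to_rows(zip_bytes, extensions=None):
    """Return one row per file in the zip, optionally filtered by extension.

    extensions: iterable such as [".py", ".txt"]; None means import all files.
    """
    wanted = None if extensions is None else {e.lower() for e in extensions}
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            ext = PurePosixPath(info.filename).suffix.lower()
            if wanted is not None and ext not in wanted:
                continue
            rows.append(
                {
                    "filename": info.filename,  # source file name within the zip
                    "contents": zf.read(info),  # raw bytes, left undecoded
                }
            )
    return rows
```

Calling `zip_to_rows(data)` imports everything, while `zip_to_rows(data, [".txt"])` keeps only the text files.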