-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactoring code to parquet to zip2parquet #525
base: dev
Are you sure you want to change the base?
Conversation
@blublinsky Is this new version completely backward compatible with the old version of code2parquet? |
Yes. By default it will work exactly the same as current (all the current tests pass). If you specify code_data == False, it will not try to decide which programming language it is - just store the file content as is |
Thanks, @blublinsky. Perfect! From a technical point of view, David needs to approve, but a note to myself that if this is approved, the example Jupyter notebook for code that Shivdeep has created should be modified with the new name and location of the transform. |
44715f0
to
22e7636
Compare
i'm not sure I think this has to be backwards compatibile with code2parquet. As such, I would suggest the following:
Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts. I would like to do more here than just generalize for code and .txt. Some more design seems needed. |
As such, I would suggest the following:
|
Why is this PR required in the first place? I would suggest to keep the code2parquet module as it is and add new modules as required. The code2parquet module is being used in many places and will break code flows. I would not support this PR. |
@Bytes-Explorer please take it up with @nirmdesai . It was his request |
Ok, will clarify. Lets not merge this PR till then. |
@Bytes-Explorer , @blublinsky, @touma-I : Team, since we have various notebooks and other artifacts that already depend on Code2Parquet, we cannot make breaking changes to this transform. It is on me that I did not explicitly clarify this earlier! For now, it would be best to add Any2Parquet as a separate transform that can read any file content as binary and produce a parquet. In future, if all notebooks / users were using "pip install" to use a specific stable version of DPK, we would be free to make breaking changes in developing the next release without affecting all the users. I know you all are moving in this direction already. |
Thanks for the clarifications, @nirmdesai. Since we now want to ingest more than just text files, doing Any2Parquet as a separate module is the best path forward. In defense of what was done, the changes done to the Code2Parquet were completely backward-compatible with the old version and would not have broken any notebook or artifact, if the name of the module and the path to its directory had not changed. At the same time, the name Code2Parquet was not appropriate, if it was handling both code and non-code text. Having said that, let's move towards the new Any2Parquet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@touma-I to check if this is still relevant or should be closed without merge.
@touma-I The code that Boris developed under this PR is valuable because it is a "generalization" of what we have with the current code2parquet. However, in order not to run into issues of backward compatibility with the notebooks/examples that use the current code2parquet, the most useful thing to do is to make changes to this PR, so that it keeps the current code2parquet as is (under the Code directory) and the new modified version by Boris, which should be renamed any2parquet, will be a separate transform under the universal directory. |
Why are these changes needed?
modified code2 parquet to support both code and language files and moved it to universal
Related issue number (if any).
#520