refactoring code to parquet to zip2parquet #525

blublinsky · 2024-08-21T12:00:43Z

Why are these changes needed?

Modified code2parquet to support both code and language files, and moved it to the universal directory.

Related issue number (if any).

shahrokhDaijavad · 2024-08-21T15:08:32Z

@blublinsky Is this new version completely backward compatible with the old version of code2parquet?

blublinsky · 2024-08-21T15:14:29Z

Yes. By default it will work exactly the same as the current version (all the current tests pass). If you specify code_data == False, it will not try to determine the programming language; it will just store the file content as is.
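A minimal sketch of the behavior described above. Only the `code_data` flag name comes from this thread; the `build_row` helper, the column names, and the extension map are hypothetical and are not taken from the PR.

```python
from pathlib import PurePosixPath

# Illustrative extension-to-language map (assumption; the transform's real mapping is not shown here).
LANGUAGE_BY_EXTENSION = {".py": "Python", ".java": "Java", ".go": "Go"}


def build_row(file_name: str, content: bytes, code_data: bool = True) -> dict:
    """Build one output row for a file (hypothetical helper)."""
    row = {
        "file_name": file_name,
        # Decode leniently so that unexpected bytes do not abort the run.
        "contents": content.decode("utf-8", errors="replace"),
    }
    if code_data:
        # Default path: try to determine the programming language from the extension.
        ext = PurePosixPath(file_name).suffix.lower()
        row["programming_language"] = LANGUAGE_BY_EXTENSION.get(ext, "unknown")
    # With code_data=False no language detection happens; the content is stored as is.
    return row
```

With `code_data=False` the row carries only the file name and its contents, which is the "store the file content as is" case.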

shahrokhDaijavad · 2024-08-21T16:13:46Z

Thanks, @blublinsky. Perfect! From a technical point of view, David needs to approve, but a note to myself that if this is approved, the example Jupyter notebook for code that Shivdeep has created should be modified with the new name and location of the transform.

daw3rd · 2024-08-27T22:04:48Z

I'm not sure this has to be backwards compatible with code2parquet. As such, I would suggest the following:

code is NOT the default
There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }
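The single-key idea could look roughly like the sketch below. The `programming_language_column` name comes from the comment above; the top-level `code` key, the default value, and the `resolve_code_config` helper are assumptions made for illustration only.

```python
from typing import Optional

# Defaults for the code-specific settings (illustrative values, not from the PR).
DEFAULT_CODE_CONFIG = {"programming_language_column": "programming_language"}


def resolve_code_config(params: dict) -> Optional[dict]:
    """Return the merged code-specific settings, or None when code handling is off.

    Code handling is opt-in: it is enabled only if the hypothetical "code" key is present.
    """
    code_cfg = params.get("code")
    if code_cfg is None:
        return None
    # User-supplied values override the defaults.
    return {**DEFAULT_CODE_CONFIG, **code_cfg}
```

Grouping the code-specific keys under one dictionary keeps the top-level configuration small and makes "code is not the default" a matter of omitting one key.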

Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.
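One possible shape for such a filter, as a sketch only: the `should_import` helper and its `extensions` parameter are hypothetical, and no such option is confirmed anywhere in this thread.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def should_import(file_name: str, extensions: Optional[Iterable[str]] = None) -> bool:
    """Decide whether a file should be imported (hypothetical helper).

    With no extension list configured, every file is accepted.
    """
    if extensions is None:
        return True
    # Normalize entries so that "py" and ".PY" both match ".py".
    allowed = {e.lower() if e.startswith(".") else "." + e.lower() for e in extensions}
    return PurePosixPath(file_name).suffix.lower() in allowed
```

Leaving the list unset keeps the "import all files" behavior, so the filter would be purely opt-in.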

I would like to do more here than just generalize for code and .txt. Some more design seems needed.

blublinsky · 2024-08-28T08:39:18Z

I'm not sure this has to be backwards compatible with code2parquet.
This was an initial requirement from @shahrokhDaijavad.

As such, I would suggest the following:

code is NOT the default
Does not matter to me

There are a lot of configuration keys that are code-specific. Maybe we should have a single key and have its value be a dictionary like we do for DataAccessFactory. For example {"programming_language_column" : "some name", }
We need them for code support

Separately, we have discussed the ability to add configuration to specify the list of extensions to import - that is, filtering of sorts.
In the last conversation with @nirmdesai, the conclusion was that we want all files.

Bytes-Explorer · 2024-08-28T10:53:32Z

Why is this PR required in the first place? I would suggest keeping the code2parquet module as it is and adding new modules as required. The code2parquet module is used in many places, and changing it will break code flows. I would not support this PR.

blublinsky · 2024-08-28T12:17:46Z

Why is this PR required in the first place? I would suggest keeping the code2parquet module as it is and adding new modules as required. The code2parquet module is used in many places, and changing it will break code flows. I would not support this PR.

@Bytes-Explorer please take it up with @nirmdesai. It was his request.

Bytes-Explorer · 2024-08-28T13:28:12Z

OK, will clarify. Let's not merge this PR until then.

nirmdesai · 2024-08-28T15:32:04Z

@Bytes-Explorer, @blublinsky, @touma-I: Team, since we have various notebooks and other artifacts that already depend on Code2Parquet, we cannot make breaking changes to this transform. It is on me that I did not explicitly clarify this earlier!

For now, it would be best to add Any2Parquet as a separate transform that can read any file content as binary and produce a parquet.
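A rough sketch of the "read any file content as binary" step, assuming the input is a zip archive as in the existing transform. The `zip_to_columns` helper and the column names are hypothetical. The result is a plain column dictionary; it could then be turned into a table with `pyarrow.Table.from_pydict` and written out with `pyarrow.parquet.write_table`, a step left out here to keep the example standard-library only.

```python
import io
import zipfile


def zip_to_columns(zip_bytes: bytes) -> dict:
    """Read every file in a zip archive as raw bytes (hypothetical helper).

    Returns a column-oriented dict with one entry per archive member.
    """
    columns = {"file_name": [], "contents": [], "size": []}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            data = archive.read(info)  # raw bytes, no decoding or language detection
            columns["file_name"].append(info.filename)
            columns["contents"].append(data)
            columns["size"].append(len(data))
    return columns
```

Keeping the contents as bytes avoids any assumption about text encoding, which matters once the input is no longer limited to code or text files.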

In the future, if all notebooks and users were using "pip install" to pin a specific stable version of DPK, we would be free to make breaking changes while developing the next release without affecting all the users. I know you all are moving in this direction already.

shahrokhDaijavad · 2024-08-28T16:14:01Z

Thanks for the clarifications, @nirmdesai. Since we now want to ingest more than just text files, doing Any2Parquet as a separate module is the best path forward.

In defense of what was done, the changes made to Code2Parquet were completely backward compatible with the old version and would not have broken any notebook or artifact if the name of the module and the path to its directory had not changed. At the same time, the name Code2Parquet was no longer appropriate if the module was handling both code and non-code text.

Having said that, let's move towards the new Any2Parquet.

touma-I

@touma-I to check if this is still relevant or should be closed without merge.

shahrokhDaijavad · 2024-10-14T15:27:46Z

@touma-I The code that Boris developed under this PR is valuable because it is a "generalization" of the current code2parquet. However, to avoid backward-compatibility issues with the notebooks and examples that use the current code2parquet, the most useful thing to do is to change this PR so that it keeps the current code2parquet as is (under the Code directory) and adds Boris's modified version, renamed any2parquet, as a separate transform under the universal directory.

blublinsky requested review from shahrokhDaijavad and daw3rd August 21, 2024 12:00

refactoring code to parquet to zip2parquet

22e7636

blublinsky force-pushed the zip2parquet branch from 44715f0 to 22e7636 Compare August 22, 2024 07:34

shahrokhDaijavad mentioned this pull request Aug 26, 2024

[Feature] Provide an operator that loads files content to parquet #543

Open

2 tasks

Bytes-Explorer self-requested a review August 28, 2024 10:53

touma-I self-requested a review October 14, 2024 12:36

touma-I requested changes Oct 14, 2024
