Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added folder_transform #691

Merged
merged 9 commits into from
Oct 15, 2024
Merged

added folder_transform #691

merged 9 commits into from
Oct 15, 2024

Conversation

blublinsky
Copy link
Collaborator

@blublinsky blublinsky commented Oct 10, 2024

Why are these changes needed?

After several conversations with Constantin, it appears useful to have a transform for processing folders rather then file.

Related issue number (if any).

#609
#705

@blublinsky
Copy link
Collaborator Author

It needs some tests, but this is a basic idea

logger.info(f"Number of files is {len(files)}, source profile {profile}")
if is_folder:
# folder transform
files = AbstractFolderTransform.get_folders(d_access=data_access)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this get_folders() be made part of DataAccess instead? If not, then doesn't this have to be runtime_config.get_transform_class().get_folders().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great minds, just moved it there

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for data access, discussed it with Constantin, and it is too application specific

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea. get_folders() cannot be in the DataAccess, because sometimes we know exactly the set of folders that we want to process (also the number of levels of subfolders that we want to take into consideration), and we don't need DataAccess to produce a list of those folders. But putting get_folders() in the runtime config will work and will provide maximum flexibility, because we can either use DataAccess to retrieve specific files, or we can just provide a list of folders ourselves.

Copy link
Collaborator

@cmadam cmadam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give me until tomorrow morning to try this FolderTransform with the cluster analysis transform in fuzzy dedup. If it works smoothly (and it should), I will approve this right away.

logger.info(f"Number of files is {len(files)}, source profile {profile}")
if is_folder:
# folder transform
files = AbstractFolderTransform.get_folders(d_access=data_access)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does spark have a different logic? I think this line should be:

            files = runtime_config.get_folders(d_access=data_access)

And the file data-processing-lib/spark/src/data_processing_spark/runtime/spark/runtime_configuration.py should contain the get_folders() function in the SparkTransformRuntimeConfiguration class:

    def get_folders(self, data_access: DataAccess) -> list[str]:
        """
        Get folders to process
        :param data_access: data access
        :return: list of folders to process
        """
        raise NotImplemented()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed

@touma-I touma-I self-requested a review October 11, 2024 10:27
Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmadam This PR does not seem to have an issue associated with it. Can you please create an issue explaining what problem this PR is trying to solve and why we chose this approach to solve the problem, and an example on how this can be used by the transforms. Given that this is on the open source, we want to make sure other folks can consume the asset. Thanks

@cmadam
Copy link
Collaborator

cmadam commented Oct 14, 2024

@touma-I : I have created Issue #705.

@blublinsky : can you please update the PR description, and associate it with Issue #705

@blublinsky
Copy link
Collaborator Author

blublinsky commented Oct 14, 2024

@touma-I : I have created Issue #705.

@blublinsky : can you please update the PR description, and associate it with Issue #705

its linked to 609 and 705

Copy link
Collaborator

@cmadam cmadam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just used this code for fuzzy dedup, and was able to run correctly the fuzzy dedup code in python, spark and ray. In the context of fuzzy dedup, I have 4 transforms, 2 of those transforms (ClusterAnalysisTransform and GetDuplicateListTransform) are now inheriting from AbstractFolderTransform, while the other two transforms(SignatureCalculationTransform and DataCleaningTransform) inherit from AbstractTableTransform.

@blublinsky blublinsky merged commit 97810af into dev Oct 15, 2024
83 checks passed
@cmadam cmadam mentioned this pull request Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants