added folder_transform #691

blublinsky · 2024-10-10T18:08:43Z

Why are these changes needed?

After several conversations with Constantin, it appears useful to have a transform that processes folders rather than files.

Related issue number (if any).

#609
#705

blublinsky · 2024-10-10T18:10:03Z

It needs some tests, but this is the basic idea.

daw3rd · 2024-10-10T19:52:04Z

data-processing-lib/python/src/data_processing/runtime/pure_python/transform_orchestrator.py

-        logger.info(f"Number of files is {len(files)}, source profile {profile}")
+        if is_folder:
+            # folder transform
+            files = AbstractFolderTransform.get_folders(d_access=data_access)


Can this get_folders() be made part of DataAccess instead? If not, doesn't this have to be runtime_config.get_transform_class().get_folders()?

Great minds; I just moved it there.

As for DataAccess, I discussed it with Constantin, and it is too application-specific.

This is a great idea. get_folders() cannot live in DataAccess, because sometimes we know exactly which folders we want to process (and how many levels of subfolders to consider), and we don't need DataAccess to produce that list. Putting get_folders() in the runtime config will work and gives maximum flexibility: we can either use DataAccess to retrieve specific files, or simply provide a list of folders ourselves.
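To make the two options above concrete, here is a minimal sketch. Every class name in it (`DataAccessStub`, `RuntimeConfigSketch`, `DiscoveredFolders`, `FixedFolders`) is hypothetical and stands in for the real library types; only the `get_folders()` method name comes from this thread.

```python
from abc import ABC, abstractmethod


class DataAccessStub:
    """Hypothetical stand-in for DataAccess: just holds a flat list of file paths."""

    def __init__(self, files: list[str]):
        self.files = files

    def list_files(self) -> list[str]:
        return list(self.files)


class RuntimeConfigSketch(ABC):
    """Hypothetical runtime configuration that owns folder selection."""

    @abstractmethod
    def get_folders(self, data_access: DataAccessStub) -> list[str]:
        ...


class DiscoveredFolders(RuntimeConfigSketch):
    """Option 1: derive the folders from what the data access layer can see."""

    def get_folders(self, data_access: DataAccessStub) -> list[str]:
        # Parent folder of every file, deduplicated and sorted.
        parents = {f.rsplit("/", 1)[0] for f in data_access.list_files() if "/" in f}
        return sorted(parents)


class FixedFolders(RuntimeConfigSketch):
    """Option 2: the caller already knows the folders; data access is not consulted."""

    def __init__(self, folders: list[str]):
        self.folders = folders

    def get_folders(self, data_access: DataAccessStub) -> list[str]:
        return list(self.folders)


access = DataAccessStub(
    ["bands/band_0/a.parquet", "bands/band_0/b.parquet", "bands/band_1/c.parquet"]
)
discovered = DiscoveredFolders().get_folders(access)
fixed = FixedFolders(["bands/band_1"]).get_folders(access)
```

Because the choice sits in the runtime configuration, the orchestrator can stay the same in both cases.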

cmadam

Give me until tomorrow morning to try this FolderTransform with the cluster analysis transform in fuzzy dedup. If it works smoothly (and it should), I will approve this right away.

cmadam · 2024-10-11T04:40:01Z

data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_orchestrator.py

-        logger.info(f"Number of files is {len(files)}, source profile {profile}")
+        if is_folder:
+            # folder transform
+            files = AbstractFolderTransform.get_folders(d_access=data_access)


Why does Spark have different logic? I think this line should be:

files = runtime_config.get_folders(d_access=data_access)

And the file data-processing-lib/spark/src/data_processing_spark/runtime/spark/runtime_configuration.py should contain the get_folders() function in the SparkTransformRuntimeConfiguration class:

    def get_folders(self, data_access: DataAccess) -> list[str]:
        """
        Get folders to process
        :param data_access: data access
        :return: list of folders to process
        """
        raise NotImplementedError()
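A small sketch of how the suggestion could fit together, under stated assumptions: `BaseRuntimeConfig`, `FolderRuntimeConfig`, and `select_work_items` are hypothetical names, and the returned folder paths are made up for illustration. One detail worth noting: in Python the exception to raise from an unimplemented method is `NotImplementedError`; `NotImplemented` is a constant meant for binary-operator methods and is not callable.

```python
class BaseRuntimeConfig:
    """Hypothetical base configuration with a default that must be overridden."""

    def get_folders(self, data_access) -> list[str]:
        # NotImplementedError is an exception class; NotImplemented is not.
        raise NotImplementedError("folder transforms must override get_folders()")


class FolderRuntimeConfig(BaseRuntimeConfig):
    """Hypothetical configuration for a folder transform."""

    def get_folders(self, data_access) -> list[str]:
        # Illustrative paths only.
        return ["out/bands/0", "out/bands/1"]


def select_work_items(runtime_config, data_access, is_folder: bool, files: list[str]) -> list[str]:
    """Return folders for folder transforms, otherwise the already-listed files."""
    if is_folder:
        return runtime_config.get_folders(data_access=data_access)
    return files


folder_items = select_work_items(FolderRuntimeConfig(), None, True, ["a.parquet"])
file_items = select_work_items(BaseRuntimeConfig(), None, False, ["a.parquet"])
```

With this shape, the Python, Ray, and Spark orchestrators could all share the same branch and differ only in how they distribute the returned items.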

Oops. Fixed

touma-I

@cmadam This PR does not seem to have an issue associated with it. Can you please create an issue explaining what problem this PR is trying to solve, why we chose this approach, and an example of how the transforms can use it? Given that this is open source, we want to make sure other folks can consume the asset. Thanks.

cmadam · 2024-10-14T12:51:00Z

@touma-I : I have created Issue #705.

@blublinsky : can you please update the PR description, and associate it with Issue #705

blublinsky · 2024-10-14T13:46:55Z

It's linked to #609 and #705.

cmadam

I have just used this code for fuzzy dedup, and was able to run the fuzzy dedup code correctly in Python, Spark, and Ray. In the context of fuzzy dedup, I have 4 transforms. Two of them (ClusterAnalysisTransform and GetDuplicateListTransform) now inherit from AbstractFolderTransform, while the other two (SignatureCalculationTransform and DataCleaningTransform) inherit from AbstractTableTransform.
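For readers new to the idea, a folder transform is invoked once per folder instead of once per file or table. The sketch below is illustrative only: `FolderTransformSketch`, `ConcatFolderTransform`, and the `transform(folder_name)` signature are assumptions made for this example, not the library's confirmed API, and the concatenation logic is unrelated to the actual fuzzy dedup transforms.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class FolderTransformSketch(ABC):
    """Hypothetical interface: called with a folder name rather than file contents."""

    @abstractmethod
    def transform(self, folder_name: str) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
        ...


class ConcatFolderTransform(FolderTransformSketch):
    """Toy folder transform: merges every file in the folder into one output."""

    def transform(self, folder_name: str) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
        folder = Path(folder_name)
        # Sort so the merged output is deterministic.
        parts = sorted(p for p in folder.iterdir() if p.is_file())
        merged = b"".join(p.read_bytes() for p in parts)
        # One (content, extension) pair plus simple statistics.
        return [(merged, ".txt")], {"files_read": len(parts)}


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"one\n")
    (Path(tmp) / "b.txt").write_bytes(b"two\n")
    outputs, stats = ConcatFolderTransform().transform(tmp)
```

The point of the design is that the transform decides how to read the folder, which suits steps that need to see all files of a group at once.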

issue is added

blublinsky requested review from daw3rd and cmadam October 10, 2024 18:09

daw3rd reviewed Oct 10, 2024

cmadam reviewed Oct 10, 2024

cmadam requested changes Oct 11, 2024

touma-I self-requested a review October 11, 2024 10:27

touma-I previously requested changes Oct 11, 2024

blublinsky force-pushed the folder_transform branch from 1973b56 to 7091a2e October 11, 2024 14:36

blublinsky requested review from cmadam and touma-I October 11, 2024 14:39

blublinsky force-pushed the folder_transform branch from 7091a2e to c728224 October 13, 2024 07:35

cmadam approved these changes Oct 14, 2024

blublinsky added 8 commits October 14, 2024 20:00

added folder_transform

ea2cc23

added folder_transform

4d30ac3

added folder_transform

a772d1d

added folder_transform

2f28ab4

added noop testing

b3588ef

added noop Ray testing

a0a1400

added noop Spark testing

c2dd53c

more data access simplifications

59d57df

blublinsky force-pushed the folder_transform branch from 371a712 to 59d57df October 14, 2024 19:00

documentation update

7b7736c

blublinsky merged commit 97810af into dev Oct 15, 2024
83 checks passed

cmadam mentioned this pull request Nov 15, 2024

Fuzzy dedup #699

Open
