-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added folder_transform #691
Conversation
It needs some tests, but this is a basic idea |
logger.info(f"Number of files is {len(files)}, source profile {profile}") | ||
if is_folder: | ||
# folder transform | ||
files = AbstractFolderTransform.get_folders(d_access=data_access) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this get_folders()
be made part of DataAccess instead? If not, then doesn't this have to be runtime_config.get_transform_class().get_folders()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great minds, just moved it there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for data access, discussed it with Constantin, and it is too application specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea. get_folders()
cannot be in the DataAccess
, because sometimes we know exactly the set of folders that we want to process (also the number of levels of subfolders that we want to take into consideration), and we don't need DataAccess
to produce a list of those folders. But putting get_folders()
in the runtime config will work and will provide maximum flexibility, because we can either use DataAccess
to retrieve specific files, or we can just provide a list of folders ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give me until tomorrow morning to try this FolderTransform
with the cluster analysis transform in fuzzy dedup. If it works smoothly (and it should), I will approve this right away.
logger.info(f"Number of files is {len(files)}, source profile {profile}") | ||
if is_folder: | ||
# folder transform | ||
files = AbstractFolderTransform.get_folders(d_access=data_access) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does spark have a different logic? I think this line should be:
files = runtime_config.get_folders(d_access=data_access)
And the file data-processing-lib/spark/src/data_processing_spark/runtime/spark/runtime_configuration.py
should contain the get_folders()
function in the SparkTransformRuntimeConfiguration
class:
def get_folders(self, data_access: DataAccess) -> list[str]:
"""
Get folders to process
:param data_access: data access
:return: list of folders to process
"""
raise NotImplemented()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmadam This PR does not seem to have an issue associated with it. Can you please create an issue explaining what problem this PR is trying to solve and why we chose this approach to solve the problem, and an example on how this can be used by the transforms. Given that this is on the open source, we want to make sure other folks can consume the asset. Thanks
1973b56
to
7091a2e
Compare
7091a2e
to
c728224
Compare
@touma-I : I have created Issue #705. @blublinsky : can you please update the PR description, and associate it with Issue #705 |
its linked to 609 and 705 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just used this code for fuzzy dedup, and was able to run correctly the fuzzy dedup code in python, spark and ray. In the context of fuzzy dedup, I have 4 transforms, 2 of those transforms (ClusterAnalysisTransform
and GetDuplicateListTransform
) are now inheriting from AbstractFolderTransform
, while the other two transforms(SignatureCalculationTransform
and DataCleaningTransform
) inherit from AbstractTableTransform
.
371a712
to
59d57df
Compare
Why are these changes needed?
After several conversations with Constantin, it appears useful to have a transform for processing folders rather then file.
Related issue number (if any).
#609
#705