Fuzzy dedup #699
base: dev
Conversation
Signed-off-by: Constantin M Adam <[email protected]>
Signed-off-by: nelson <[email protected]>
# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chown=dpk:root data-processing-lib-python/ data-processing-lib-python/
The new makefile will copy the whl file (not the source) to the context folder
Fixed in commit 6cc18cd
COPY src/ src/

# copy source data
COPY ./src/signature_calc_transform_python.py fdedup_transform_python.py
Questionable practice!!! Can we find an alternative?
The concern is with the renaming, not the move.
And why is signature_calc_transform the main entry point? Shouldn't it be fuzzy_dedup_python.py?
{ name = "Nelson Bore", email = "[email protected]" }, | ||
{ name = "Constantin Adam", email = "[email protected]" }, | ||
] | ||
dependencies = [ |
Please move to requirements.txt and use dynamic dependencies in the pyproject.toml
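A minimal sketch of the dynamic-dependencies setup, assuming setuptools as the build backend (the project name is taken from this PR; adjust as needed):

```toml
[project]
name = "dpk_fdedup_transform_python"
dynamic = ["dependencies"]

# read the dependency list from requirements.txt at build time
[tool.setuptools.dynamic]
dependencies = { file = ["requirements.txt"] }
```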
Fixed in commit 528457c
@@ -13,24 +12,30 @@ COPY --chown=ray:users data-processing-lib-python/ data-processing-lib-python/
RUN cd data-processing-lib-python && pip install --no-cache-dir -e .
should be getting the wheel instead of the source.
Fixed in commit 6cc18cd
]
dependencies = [
    "dpk_fdedup_transform_python==0.2.2.dev1",
move to requirements.txt and also use dynamic requirements
]
dependencies = [
    "dpk_fdedup_transform_python==0.2.2.dev1",
    "data-prep-toolkit-ray==0.2.2.dev1",
Should be using data-prep-toolkit[ray]; also, once the branch is updated with dev, this will change to 0.2.2.dev2.
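i.e., roughly this line in requirements.txt (extra and version per the comment above; exact pinning style may differ):

```
data-prep-toolkit[ray]==0.2.2.dev2
```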
Fixed in commit 2f61be7
# Copy in the data processing framework source/project and install it
# This is expected to be placed in the docker context before this is run (see the make image).
COPY --chown=spark:root data-processing-lib-python/ data-processing-lib-python/
Should be getting the wheel and installing the extras for ray and spark.
Fixed in commit 6cc18cd
]
dependencies = [
    "dpk_fdedup_transform_python==0.2.2.dev1",
    "data-prep-toolkit-spark==0.2.2.dev1",
use data-prep-toolkit[spark]
Fixed in commit 6cc18cd
@@ -0,0 +1,10 @@
pyarrow==16.1.0
Do you need it here? This is already installed by the runtime, and it will help to have a single place for this dependency.
Fixed, I removed pyarrow from all requirements.txt files.
You need to update the branch from dev and resolve any conflicts first before tackling some of these points.
@@ -0,0 +1,8 @@
pyspark
Please use >= or <= for the dependencies. Getting the latest is not a good idea.
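For example (the bounds below are illustrative, not a recommendation of specific versions):

```
pyspark>=3.5.0,<4.0.0
```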
Fixed in commit 528457c
Can this be part of the primary Makefile?
Some more words here to provide a gentle introduction would be nice. In addition, you need to describe all of the configuration keys. See doc_chunk for a template.
I am still working on the documentation.
    jaccard_similarity_threshold_key, jaccard_similarity_threshold_default
)
self.sort_output = config.get(sort_output_key, sort_output_default)
self.data_access = config.get("data_access")
If data_access is not provided, what happens? Either throw an exception, or use DataAccessLocal() as the default.
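A minimal sketch of the DataAccessLocal fallback, assuming the import path from the data-processing library and a stand-in transform class:

```python
from typing import Any

from data_processing.data_access import DataAccessLocal


class SomeTransform:  # stand-in for the transform whose __init__ is quoted above
    def __init__(self, config: dict[str, Any]):
        # fall back to local data access when none is configured;
        # raising an exception here is the alternative discussed above
        self.data_access = config.get("data_access")
        if self.data_access is None:
            self.data_access = DataAccessLocal()
```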
    folder_name = f"{folder_name}/"
return folder_name

def consolidate_band_segment_files(self, files: dict[str, bytes]) -> tuple[pl.DataFrame, dict[str, Any]]:
I think you should hide all these internal methods by prefixing them with _ or __.
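e.g., for the method quoted above (the class name is a placeholder; the body is unchanged):

```python
from typing import Any

import polars as pl


class SomeTransform:  # placeholder class, shown only for the rename
    def _consolidate_band_segment_files(self, files: dict[str, bytes]) -> tuple[pl.DataFrame, dict[str, Any]]:
        ...  # same body as before; the leading underscore marks the method as internal
```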
f"--{sort_output_cli_param}", | ||
type=bool, | ||
default=sort_output_default, | ||
help="Sort", |
Sort all output or within a file? And sort by what?
Signed-off-by: Constantin M Adam <[email protected]>
I still think you should rename this file to fdedup_transform_python.py. By convention, this indicates that this file provides the main() entry point to the transform execution.
parser.add_argument(
    "--document_id_column", type=str, required=False, help="name of the column that stores document text"
)
parser.add_argument("--seed", type=int, required=False, help="name of the column that stores document text")
document_id and seed seem to have the wrong help text.
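A sketch of what the corrected help strings might look like (the exact wording is a guess at the intent):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--document_id_column", type=str, required=False,
    help="name of the column that stores the document id",  # was: "document text"
)
parser.add_argument(
    "--seed", type=int, required=False,
    help="seed for the random number generator",  # was: "document text"
)
```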
super().__init__(params=params)
self.logger = get_logger(__name__)

def get_folders(self, data_access: DataAccess) -> list[str]:
This get_folders method should be defined as an @abstractmethod in a new super-class for this framework. I wouldn't suggest this for one class, but you seem to be using the pattern in your other transforms.
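A sketch of such a super-class; the class name is hypothetical, and the DataAccess import assumes the data-processing library layout:

```python
from abc import ABC, abstractmethod

from data_processing.data_access import DataAccess


class FolderTransformRuntime(ABC):  # hypothetical shared super-class
    @abstractmethod
    def get_folders(self, data_access: DataAccess) -> list[str]:
        """Return the list of input folders this runtime should process."""
        raise NotImplementedError
```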
super().__init__(
    name=short_name,
    transform_class=SignatureCalculationTransform,
    remove_from_metadata=[sigcalc_data_factory_key],
I don't believe this is the right key. It needs to be the key corresponding to the command line parameter for the s3 credentials. I'm not sure, but I believe in this case it is scdata_data_s3_cred. See https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/doc/ray-launcher-options.md
You can check in the metadata.json when using S3 to see what shows up there.
if __name__ == "__main__":
    launcher = PythonTransformLauncher(SignatureCalculationTransformConfiguration())
    logger.info("Launching noop transform")
This still says "noop"; maybe just remove this line.
# create parameters
input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test-data", "expected/cluster_analysis"))
output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test-data", "expected"))
You should never write directly to test-data. Write somewhere else (output is the convention), then manually copy if needed.
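i.e., something along these lines (the folder name follows the convention mentioned above):

```python
import os

# write results to an output folder; copy into test-data/expected manually
# only once the results have been validated as the new expected data
output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "output"))
```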
self.docs_to_remove_df = self.docs_to_remove_df.rename({"docs_to_remove": self.document_id_column})

def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
    self.logger.info(f"Transforming table with {table.num_rows} rows from file {file_name}")
Maybe make this debug so the KFP log doesn't get overwhelmed.
super().__init__(
    name=short_name,
    transform_class=transform_class,
    remove_from_metadata=[dataclean_data_factory_key],
Same comment as elsewhere about the correct key to pass here.
self.logger = get_logger(__name__)

def get_transform_config(
This function looks like a duplicate of that in DataCleaningPythonRuntime. Can you make a shared method somehow, either as a global or in a shared superclass?
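A sketch of the shared super-class route; the class name is hypothetical, and the signature follows the usual runtime interface in the data-processing library (verify against the actual base class):

```python
from typing import Any

from data_processing.data_access import DataAccessFactoryBase
from data_processing.transform import TransformStatistics


class FdedupRuntimeBase:  # hypothetical base shared by both runtimes
    def get_transform_config(
        self,
        data_access_factory: DataAccessFactoryBase,
        statistics: TransformStatistics,
        files: list[str],
    ) -> dict[str, Any]:
        # the logic currently duplicated in DataCleaningPythonRuntime and
        # this runtime would live here, once
        ...
```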
super().__init__(params=params)
self.logger = get_logger(__name__)

def get_transform_config(
Again, this function looks like a duplicate of that in DataCleaningPythonRuntime. Can you make a shared method somehow, either as a global or in a shared superclass?
Signed-off-by: Constantin M Adam <[email protected]>
ComponentUtils.add_settings_to_component(execute_data_cleaning_job, ONE_WEEK_SEC)
# FIXME: see https://github.com/kubeflow/pipelines/issues/10914
if os.getenv("KFPv2", "0") != "1":
    ComponentUtils.set_s3_env_vars_to_component(execute_data_cleaning_job, data_s3_access_secret)
As discussed, in KFP v2 the secret name is hard-coded (in kfp_v2_workflow_support/src/workflow_support/compile_utils/component.py) as a workaround for kubeflow/pipelines#10914. Thus, this pipeline is not expected to run on KFP v2. Should it be added to the blacklist, given that we currently can't restrict it to run only on v1 in the CI/CD tests? @roytman what do you think? Thanks
## Summary

The basic implementation of the fuzzy dedup is based on [MinHash](https://en.wikipedia.org/wiki/MinHash). Also see
Forgive me if this is a duplicate comment as I thought I had submitted once already, but...
- A more gentle introduction to what the transform does instead of only providing the links.
- The set of configuration keys should be documented. See doc_chunk for a nice example.
- This file needs to be linked from a ../README.md, which now only points to ray and python.
)

if __name__ == "__main__":
Do you really need these sub-transform main()s? They are not exposed in the Dockerfile (in the home dir) so are not even directly callable (in the standard way). Do we ever need to run this sub-transform manually outside of fdedup orchestrator/transform? If so, then it should be promoted to the home directory in the Dockerfile, otherwise maybe delete main()?
if __name__ == "__main__":
    # launcher = NOOPRayLauncher()
NOOP, although per the earlier comment, do we need the main() for the sub-transforms orchestrated by fdedup?
By convention, this file should be named cluster_analysis_s3_spark.py
By convention, this should be named data_cleaning_s3_spark.py
By convention, this should be named fuzzy_dedup_s3_spark.py
By convention, this should be named signature_calc_s3_spark.py.
	$(PIP) install --upgrade pip; \
	$(PIP) install -r requirements.txt; \
	fi
set-versions:
I guess you don't need this if you've renamed the file. If you go back to Makefile, you will also need to add the clean, test, build and publish targets.
Why are these changes needed?
Provides a fuzzy dedup implementation for Python, Spark and Ray.
Related issue number (if any).
#152
#79