Add decorator functionality to validate_schema function #255

Open · wants to merge 20 commits into base: planning-1.0-release

Conversation

@kunaljubce (Contributor) commented Jul 16, 2024

Proposed changes

Make validate_schema() behave both as a standalone function and as a decorator for other functions. Also added associated changes to tests and the README.

Took another stab at #140, as an extension to #144.
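
For illustration, usage could look roughly like this. This is a hedged sketch based on the argument order described in the implementation notes below (required_schema first, with df_to_be_validated as an optional argument); it is not copied from the PR, and get_df is a placeholder name.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

import quinn

spark = SparkSession.builder.getOrCreate()
required_schema = StructType([StructField("name", StringType(), True)])
df = spark.createDataFrame([("jose",)], ["name"])

# 1) called directly as a function on an existing DataFrame
quinn.validate_schema(required_schema, df_to_be_validated=df)

# 2) used as a decorator on a function that returns a DataFrame
@quinn.validate_schema(required_schema)
def get_df():
    return df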

Types of changes

What types of changes does your code introduce to quinn?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Further comments: Implementation details for validate_schema

We are implementing a decorator factory here so that the function can be used both as a decorator and as a regular callable. In this implementation:

The validate_schema function acts as both a decorator factory and a decorator. It takes required_schema, ignore_nullable, and an optional df_to_be_validated argument.

  • If df_to_be_validated is None, the function is being used as a decorator factory, and it returns the inner decorator function.
  • If df_to_be_validated is not None, the function is being called directly with a DataFrame, and the decorator is applied to df_to_be_validated immediately.

When validate_schema is called directly with a DataFrame, the validation logic gets executed by wrapping the DataFrame in a lambda function and immediately calling the decorator.
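
A minimal sketch of this pattern (simplified and not the exact PR code; the real schema checks are reduced to a basic missing-field test, and ValueError stands in for quinn's DataFrameMissingStructFieldError):

from functools import wraps

def validate_schema(required_schema, ignore_nullable=False, df_to_be_validated=None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            dataframe = func(*args, **kwargs)
            # stand-in for the real validation (missing fields, nullability handling, etc.)
            missing = [f for f in required_schema if f not in dataframe.schema]
            if missing:
                raise ValueError(f"missing StructFields: {missing}")
            return dataframe
        return wrapper

    if df_to_be_validated is None:
        # decorator-factory usage: @validate_schema(required_schema)
        return decorator
    # direct-call usage: wrap the DataFrame in a lambda and invoke the decorated callable immediately
    return decorator(lambda: df_to_be_validated)()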

@kunaljubce changed the title from "All changes to make validate_schema behave both as a function and as …" to "Add decorator functionality to validate_schema function" on Jul 16, 2024
@kunaljubce (Contributor, Author) commented Jul 16, 2024

@SemyonSinchenko @MrPowers @jeffbrennan This PR is ready for review now!

@SemyonSinchenko (Collaborator) left a comment:


To be honest, I do not like an implementation with deepcopy and three loops. I'm also not 100% sure how deepcopy() works with py4j.java_gateway.JavaObject.

        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            _all_struct_fields = copy.deepcopy(dataframe.schema)
            _required_schema = copy.deepcopy(required_schema)

Collaborator:

Why not check that the lengths of both schemas are the same before doing the deepcopy?

Contributor Author:

@SemyonSinchenko Are you suggesting that we check whether the lengths match, and only then do the deepcopy?

Collaborator:

Exactly! Deep copy may be quite expensive for big objects.
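
One way to read that suggestion, as a hedged sketch using the variable names from the diff above (inside the wrapper, before any copying): if the DataFrame already has fewer fields than the required schema, at least one required StructField must be missing, so the check can fail early.

dataframe = func(*args, **kwargs)
# cheap pre-check: fewer fields than required guarantees a missing StructField,
# so raise before paying for any deepcopy
if len(dataframe.schema) < len(required_schema):
    raise DataFrameMissingStructFieldError(
        f"DataFrame has {len(dataframe.schema)} StructFields, "
        f"but the required schema has {len(required_schema)}",
    )
_all_struct_fields = copy.deepcopy(dataframe.schema)
_required_schema = copy.deepcopy(required_schema)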

                raise DataFrameMissingStructFieldError(error_message)
            missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
            error_message = (
                f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}"
Collaborator:

In the case of nested schemas, this error message will be unreadable.

Contributor Author:

@SemyonSinchenko How do you suggest we present this?

Collaborator:

I would suggest first flattening all fields to the form parent.parent.field and only then rendering the missing fields. I would also suggest putting the flattening logic in a separate function.
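
A hedged sketch of such a helper (the name _flatten_field_names is hypothetical, not part of the PR):

from pyspark.sql.types import StructType

def _flatten_field_names(schema: StructType, prefix: str = "") -> list[str]:
    """Flatten a possibly nested schema into dotted names like parent.parent.field."""
    names: list[str] = []
    for field in schema.fields:
        full_name = f"{prefix}{field.name}"
        if isinstance(field.dataType, StructType):
            # recurse into nested structs, extending the dotted path
            names.extend(_flatten_field_names(field.dataType, prefix=f"{full_name}."))
        else:
            names.append(full_name)
    return names

The error message could then list only the missing dotted names instead of printing whole nested StructFields.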

    def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            _all_struct_fields = copy.deepcopy(dataframe.schema)
Collaborator:

We are doing a deep copy of two potentially long schemas on each call. I do not like it.

Contributor Author:

@SemyonSinchenko I took a step back and re-evaluated the implementation. The question is: do we even need deepcopy here? We are not changing the original DataFrames or their schemas. Is a shallow copy a better idea here? Let me know your thoughts!

Collaborator:

Yes, sounds good!


            if missing_struct_fields:
                raise DataFrameMissingStructFieldError(error_message)
            missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
Collaborator:

I would like to transform this comprehension into a loop, like:

for field_name in _required_schema.fieldNames():
   if field_name not in ...
   else:
       if ignore_nullable:
           %% compare name, dataType %%
       else:
        %% compare name, dataType, nullable %%
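
A runnable reading of that suggestion, assuming the wrapper's existing _required_schema and _all_struct_fields variables and the ignore_nullable flag (not the PR's actual code):

missing_struct_fields = []
for required_field in _required_schema.fields:
    if required_field.name not in _all_struct_fields.fieldNames():
        # the field is absent from the DataFrame schema entirely
        missing_struct_fields.append(required_field)
        continue
    existing_field = _all_struct_fields[required_field.name]
    if ignore_nullable:
        # compare dataType only (name already matched)
        if existing_field.dataType != required_field.dataType:
            missing_struct_fields.append(required_field)
    else:
        # compare dataType and nullable (name already matched)
        if (existing_field.dataType != required_field.dataType
                or existing_field.nullable != required_field.nullable):
            missing_struct_fields.append(required_field)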

Fateme Tardasti and others added 19 commits September 7, 2024 13:20

Add details to `CONTRIBUTING.md` on auto-assigning issues, pre-commit installation, and GitHub Actions local setup using 'act'.

* **Auto-assigning issues**: Add a section explaining auto-assigning issues on the comment 'take', referencing the configuration in `.github/workflows/assign-on-comment.yml`.
* **Pre-commit installation and execution**: Add a section detailing pre-commit installation and execution, referencing the configuration in `.pre-commit-config.yaml`.
* **GitHub Actions local setup using 'act'**: Add a section providing instructions for GitHub Actions local setup using 'act', referencing the configuration in `.github/workflows/ci.yml`. Include instructions for running specific jobs and handling MacBooks with M1 processors.
According to the review comment, change the pip installs to poetry installs.
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.7 to 3.6.8.
- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/[email protected]...@jupyterlab/[email protected])

---
updated-dependencies:
- dependency-name: jupyterlab
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
- Update deps
- Update Ruff
- Corresponding updates of pyproject
- Slightly update Makefile
- Drop Support of Spark2
- Drop Support of Spark 3.1-3.2
- Minimal Python is 3.10 from now on
- Apply new ruff rules
- Apply ruff linter

 On branch 202-drop-pyspark2
 Changes to be committed:
	modified:   .github/workflows/ci.yml
	modified:   Makefile
	modified:   poetry.lock
	modified:   pyproject.toml
	modified:   quinn/append_if_schema_identical.py
	modified:   quinn/dataframe_helpers.py
	modified:   quinn/keyword_finder.py
	modified:   quinn/math.py
	modified:   quinn/schema_helpers.py
	modified:   quinn/split_columns.py
	modified:   quinn/transformations.py
Related to mrpowers-io#237

Remove deprecated extension functions from the codebase
* Delete `quinn/extensions/dataframe_ext.py` and `quinn/extensions/spark_session_ext.py`
* Remove import statements for `dataframe_ext` and `spark_session_ext` in `quinn/extensions/__init__.py`
* Remove per-file ignores for `quinn/extensions/dataframe_ext.py` and `quinn/extensions/__init__.py` in `pyproject.toml`
* Delete test files `tests/extensions/test_dataframe_ext.py` and `tests/extensions/test_spark_session_ext.py`
* Create the `extensions/` directory.
* Create the `__init__.py` file within it with the license header.
* Create the `spark_session_ext.py` file within the extensions directory
* Update the doc-string in create_df to align with the parameter list.
* Format modified files & lint the code using ruff (successful)
As a matter of fact, a create_df function already exists under dataframe_helpers.py. The following are the changes introduced by this commit.
* Update `dataframe_helpers.py` by updating the doc-string in the create_df function.
* Remove quinn/extensions/ directory along with the two `.py` files involved.
* Remove the ruff check from the pyproject.toml.
* Finally, format the dataframe_helpers.py with ruff
* Remove the test_spark_connect.py file as it is not relevant anymore.
* Update the Makefile to remove spark-connect test
* Hardcode the hadoop version to 3 as 2 is EOL.
apply hotfix

update lock file
update makefile

add make command for ruff