Add decorator functionality to validate_schema function #255

Open · wants to merge 20 commits into base: planning-1.0-release

Conversation

@kunaljubce (Contributor) commented Jul 16, 2024

Proposed changes

Make validate_schema() behave both as a standalone function and as a decorator for other functions. Also added associated changes to tests and the README.

Took another stab at #140, as an extension to #144.
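
For illustration, usage could look roughly like this. This is a hedged sketch based on the argument order described in the implementation notes below (required_schema first, with df_to_be_validated as an optional argument); it is not copied from the PR, and get_df is a placeholder name.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

import quinn

spark = SparkSession.builder.getOrCreate()
required_schema = StructType([StructField("name", StringType(), True)])
df = spark.createDataFrame([("jose",)], ["name"])

# 1) called directly as a function on an existing DataFrame
quinn.validate_schema(required_schema, df_to_be_validated=df)

# 2) used as a decorator on a function that returns a DataFrame
@quinn.validate_schema(required_schema)
def get_df():
    return df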

Types of changes

What types of changes does your code introduce to quinn?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Further comments: Implementation details for validate_schema

We are implementing a decorator factory here so that the function can be used both as a decorator and as a regular callable. In this implementation:

The validate_schema function acts as both a decorator factory and a decorator. It takes required_schema, ignore_nullable, and an optional df_to_be_validated argument.

  • If df_to_be_validated is None, the function is being used as a decorator factory, and it returns the inner decorator function.
  • If df_to_be_validated is not None, the function is being called directly with a DataFrame, and the decorator is applied to df_to_be_validated immediately.

When validate_schema is called directly with a DataFrame, the validation logic gets executed by wrapping the DataFrame in a lambda function and immediately calling the decorator.
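
A minimal sketch of this pattern (simplified and not the exact PR code; the real schema checks are reduced to a basic missing-field test, and ValueError stands in for quinn's DataFrameMissingStructFieldError):

from functools import wraps

def validate_schema(required_schema, ignore_nullable=False, df_to_be_validated=None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            dataframe = func(*args, **kwargs)
            # stand-in for the real validation (missing fields, nullability handling, etc.)
            missing = [f for f in required_schema if f not in dataframe.schema]
            if missing:
                raise ValueError(f"missing StructFields: {missing}")
            return dataframe
        return wrapper

    if df_to_be_validated is None:
        # decorator-factory usage: @validate_schema(required_schema)
        return decorator
    # direct-call usage: wrap the DataFrame in a lambda and invoke the decorated callable immediately
    return decorator(lambda: df_to_be_validated)()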

@kunaljubce changed the title from "All changes to make validate_schema behave both as a function and as …" to "Add decorator functionality to validate_schema function" on Jul 16, 2024
@kunaljubce (Contributor, Author) commented Jul 16, 2024

@SemyonSinchenko @MrPowers @jeffbrennan This PR is ready for review now!

@SemyonSinchenko (Collaborator) left a comment:


To be honest, I do not like an implementation with deepcopy and three loops. I'm also not 100% sure how deepcopy() works with py4j.java_gateway.JavaObject.

        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            _all_struct_fields = copy.deepcopy(dataframe.schema)
            _required_schema = copy.deepcopy(required_schema)

Collaborator:

Why not check that the lengths of both schemas are the same before doing the deepcopy?

Contributor Author:

@SemyonSinchenko Are you suggesting that we check whether the lengths match, and only then do the deepcopy?

Collaborator:

Exactly! Deep copy may be quite expensive for big objects.
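
One way to read that suggestion, as a hedged sketch using the variable names from the diff above (inside the wrapper, before any copying): if the DataFrame already has fewer fields than the required schema, at least one required StructField must be missing, so the check can fail early.

dataframe = func(*args, **kwargs)
# cheap pre-check: fewer fields than required guarantees a missing StructField,
# so raise before paying for any deepcopy
if len(dataframe.schema) < len(required_schema):
    raise DataFrameMissingStructFieldError(
        f"DataFrame has {len(dataframe.schema)} StructFields, "
        f"but the required schema has {len(required_schema)}",
    )
_all_struct_fields = copy.deepcopy(dataframe.schema)
_required_schema = copy.deepcopy(required_schema)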

                raise DataFrameMissingStructFieldError(error_message)
            missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
            error_message = (
                f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}"
Collaborator:

In the case of nested schemas, this error message will be unreadable.

Contributor Author:

@SemyonSinchenko How do you suggest we present this?

Collaborator:

I would suggest first flattening all fields to the form parent.parent.field and only then rendering the missing fields. I would also suggest putting the flattening logic in a separate function.
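
A hedged sketch of such a helper (the name _flatten_field_names is hypothetical, not part of the PR):

from pyspark.sql.types import StructType

def _flatten_field_names(schema: StructType, prefix: str = "") -> list[str]:
    """Flatten a possibly nested schema into dotted names like parent.parent.field."""
    names: list[str] = []
    for field in schema.fields:
        full_name = f"{prefix}{field.name}"
        if isinstance(field.dataType, StructType):
            # recurse into nested structs, extending the dotted path
            names.extend(_flatten_field_names(field.dataType, prefix=f"{full_name}."))
        else:
            names.append(full_name)
    return names

The error message could then list only the missing dotted names instead of printing whole nested StructFields.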

    def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            _all_struct_fields = copy.deepcopy(dataframe.schema)
Collaborator:

We are doing a deep copy of two potentially long schemas on each call. I do not like it.

Contributor Author:

@SemyonSinchenko I took a step back and re-evaluated the implementation. The question is: do we even need deepcopy here? We are not changing the original DataFrames or their schemas. Is a shallow copy a better idea here? Let me know your thoughts!

Collaborator:

Yes, sounds good!


            if missing_struct_fields:
                raise DataFrameMissingStructFieldError(error_message)
            missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
Collaborator:

I would like to transform this comprehension into a loop, like:

for field_name in _required_schema.fieldNames():
   if field_name not in ...
   else:
       if ignore_nullable:
           %% compare name, dataType %%
       else:
        %% compare name, dataType, nullable %%
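
A runnable reading of that suggestion, assuming the wrapper's existing _required_schema and _all_struct_fields variables and the ignore_nullable flag (not the PR's actual code):

missing_struct_fields = []
for required_field in _required_schema.fields:
    if required_field.name not in _all_struct_fields.fieldNames():
        # the field is absent from the DataFrame schema entirely
        missing_struct_fields.append(required_field)
        continue
    existing_field = _all_struct_fields[required_field.name]
    if ignore_nullable:
        # compare dataType only (name already matched)
        if existing_field.dataType != required_field.dataType:
            missing_struct_fields.append(required_field)
    else:
        # compare dataType and nullable (name already matched)
        if (existing_field.dataType != required_field.dataType
                or existing_field.nullable != required_field.nullable):
            missing_struct_fields.append(required_field)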

Fateme Tardasti and others added 19 commits September 7, 2024 13:20

Add details to `CONTRIBUTING.md` on auto-assigning issues, pre-commit installation, and GitHub Actions local setup using 'act'.

* **Auto-assigning issues**: Add a section explaining auto-assigning issues on the comment 'take', referencing the configuration in `.github/workflows/assign-on-comment.yml`.
* **Pre-commit installation and execution**: Add a section detailing pre-commit installation and execution, referencing the configuration in `.pre-commit-config.yaml`.
* **GitHub Actions local setup using 'act'**: Add a section providing instructions for GitHub Actions local setup using 'act', referencing the configuration in `.github/workflows/ci.yml`. Include instructions for running specific jobs and handling MacBooks with M1 processors.
According to the review comment, change the pip installs to poetry installs.
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.7 to 3.6.8.
- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/[email protected]...@jupyterlab/[email protected])

---
updated-dependencies:
- dependency-name: jupyterlab
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
- Update deps
- Update Ruff
- Corresponding updates of pyproject
- Slightly update Makefile
- Drop Support of Spark2
- Drop Support of Spark 3.1-3.2
- Minimal Python is 3.10 from now on
- Apply new ruff rules
- Apply ruff linter

 On branch 202-drop-pyspark2
 Changes to be committed:
	modified:   .github/workflows/ci.yml
	modified:   Makefile
	modified:   poetry.lock
	modified:   pyproject.toml
	modified:   quinn/append_if_schema_identical.py
	modified:   quinn/dataframe_helpers.py
	modified:   quinn/keyword_finder.py
	modified:   quinn/math.py
	modified:   quinn/schema_helpers.py
	modified:   quinn/split_columns.py
	modified:   quinn/transformations.py
Related to mrpowers-io#237

Remove deprecated extension functions from the codebase
* Delete `quinn/extensions/dataframe_ext.py` and `quinn/extensions/spark_session_ext.py`
* Remove import statements for `dataframe_ext` and `spark_session_ext` in `quinn/extensions/__init__.py`
* Remove per-file ignores for `quinn/extensions/dataframe_ext.py` and `quinn/extensions/__init__.py` in `pyproject.toml`
* Delete test files `tests/extensions/test_dataframe_ext.py` and `tests/extensions/test_spark_session_ext.py`
* Create the `extensions/` directory.
* Create the `__init__.py` file within it with the license header.
* Create the `spark_session_ext.py` file within the extensions directory
* Update the doc-string in create_df to align with the parameter list.
* Format modified files & lint the code using ruff (successful)
As a matter of fact, a create_df function already exists under dataframe_helpers.py. The following are the changes introduced by this commit.
* Update `dataframe_helpers.py` by updating the doc-string in the create_df function.
* Remove quinn/extensions/ directory along with the two `.py` files involved.
* Remove the ruff check from the pyproject.toml.
* Finally, format the dataframe_helpers.py with ruff
* Remove the test_spark_connect.py file as it is not relevant anymore.
* Update the Makefile to remove spark-connect test
* Hardcode the hadoop version to 3 as 2 is EOL.
apply hotfix

update lock file
update makefile

add make command for ruff