Add decorator functionality to validate_schema function #255
base: planning-1.0-release
Conversation
…a decorator to other functions. Also added associated changes to tests and README.
@SemyonSinchenko @MrPowers @jeffbrennan This PR is ready for review now!
To be honest, I do not like an implementation with deepcopy and three loops. I'm also not 100% sure how deepcopy() works with py4j.java_gateway.JavaObject.
def wrapper(*args: object, **kwargs: object) -> DataFrame:
    dataframe = func(*args, **kwargs)
    _all_struct_fields = copy.deepcopy(dataframe.schema)
    _required_schema = copy.deepcopy(required_schema)
Why not check that the lengths of both schemas are the same? I mean, before doing the deepcopy.
@SemyonSinchenko Are you suggesting that we check whether the lengths match, and only then do the deepcopy?
Exactly! Deep copy may be quite expensive for big objects.
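A minimal sketch of that early check, assuming the wrapper shape quoted in this thread (the import path for DataFrameMissingStructFieldError and the exact mismatch message are assumptions, not the merged code):

```python
import copy
from typing import Callable

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType

from quinn.dataframe_validator import DataFrameMissingStructFieldError  # assumed path


def validate_schema(required_schema: StructType, ignore_nullable: bool = False):
    def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            # If the required schema has more fields than the DataFrame, at least
            # one field must be missing, so fail fast and skip the (potentially
            # expensive) copies below.
            if len(required_schema.fields) > len(dataframe.schema.fields):
                raise DataFrameMissingStructFieldError(
                    f"required schema has {len(required_schema.fields)} fields, "
                    f"DataFrame only has {len(dataframe.schema.fields)}",
                )
            _all_struct_fields = copy.deepcopy(dataframe.schema)
            _required_schema = copy.deepcopy(required_schema)
            # ... remaining comparison logic from the PR goes here ...
            return dataframe

        return wrapper

    return decorator
```

Using `>` rather than `!=` keeps the guard safe even if the DataFrame is allowed to carry extra columns beyond the required schema.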
    raise DataFrameMissingStructFieldError(error_message)

missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
error_message = (
    f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}"
)
In the case of nested schemas, this error message will be unreadable.
@SemyonSinchenko How do you suggest we put this?
I would suggest flattening all fields first into the form parent.parent.field, and only after that rendering all the missing fields. I would also suggest putting the flattening logic into a separate function.
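One possible shape for such a helper, as a hedged sketch (the name flatten_field_names is hypothetical; it compares names only, whereas a real version would probably carry dataType and nullability along with each path):

```python
from pyspark.sql.types import ArrayType, StructType


def flatten_field_names(schema: StructType, prefix: str = "") -> list[str]:
    """Flatten a possibly nested schema into dotted paths like parent.parent.field."""
    names: list[str] = []
    for field in schema.fields:
        full_name = f"{prefix}{field.name}"
        dtype = field.dataType
        if isinstance(dtype, StructType):
            # Recurse into nested structs so leaf fields get readable paths.
            names.extend(flatten_field_names(dtype, prefix=f"{full_name}."))
        elif isinstance(dtype, ArrayType) and isinstance(dtype.elementType, StructType):
            # Also descend into structs nested inside arrays.
            names.extend(flatten_field_names(dtype.elementType, prefix=f"{full_name}."))
        else:
            names.append(full_name)
    return names
```

The error message could then report only `set(flatten_field_names(_required_schema)) - set(flatten_field_names(_all_struct_fields))` instead of dumping both full StructTypes.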
def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
    def wrapper(*args: object, **kwargs: object) -> DataFrame:
        dataframe = func(*args, **kwargs)
        _all_struct_fields = copy.deepcopy(dataframe.schema)
We are doing a deep copy of two potentially long schemas on each call. I do not like it.
@SemyonSinchenko I took a step back and re-evaluated the implementation. The question is: do we even need deepcopy here? We are not changing the original DataFrames or their schemas. Is a shallow copy a better idea here? Lemme know your thoughts!
Yes, sounds good!
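For reference, the direction agreed on here, as a sketch of the inner wrapper only (so func and required_schema come from the enclosing decorator; this assumes the schemas are only read for comparison, never mutated):

```python
def wrapper(*args: object, **kwargs: object) -> DataFrame:
    dataframe = func(*args, **kwargs)
    # The schemas are only read, never mutated, so plain references are enough;
    # copy.copy() would give a defensive shallow copy if one is still wanted.
    _all_struct_fields = dataframe.schema
    _required_schema = required_schema
    # ... comparison logic unchanged ...
    return dataframe
```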
if missing_struct_fields:
    raise DataFrameMissingStructFieldError(error_message)

missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
I would like to transform this comprehension into a loop, like:

for field_name in _required_schema.fieldNames():
    if field_name not in ...
    else:
        if ignore_nullable:
            %% compare name, dataType %%
        else:
            %% compare name, dataType, nullable %%
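Spelled out, that loop might look roughly like the following (a sketch of the suggestion, not the merged code; looking fields up by name via a dict is an assumption, and truly ignoring nullability inside nested struct types would need a recursive comparison):

```python
missing_struct_fields = []
actual_fields = {field.name: field for field in _all_struct_fields.fields}
for required_field in _required_schema.fields:
    actual_field = actual_fields.get(required_field.name)
    if actual_field is None:
        # Field is absent from the DataFrame schema entirely.
        missing_struct_fields.append(required_field)
    elif ignore_nullable:
        # Compare name and dataType only.
        if actual_field.dataType != required_field.dataType:
            missing_struct_fields.append(required_field)
    else:
        # Compare name, dataType, and nullability.
        if (actual_field.dataType != required_field.dataType
                or actual_field.nullable != required_field.nullable):
            missing_struct_fields.append(required_field)
```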
…s setup

Add details to `CONTRIBUTING.md` on auto-assigning issues, pre-commit installation, and GitHub Actions local setup using 'act'.
* **Auto-assigning issues**: Add a section explaining auto-assigning issues on the comment 'take', referencing the configuration in `.github/workflows/assign-on-comment.yml`.
* **Pre-commit installation and execution**: Add a section detailing pre-commit installation and execution, referencing the configuration in `.pre-commit-config.yaml`.
* **GitHub Actions local setup using 'act'**: Add a section providing instructions for GitHub Actions local setup using 'act', referencing the configuration in `.github/workflows/ci.yml`. Include instructions for running specific jobs and handling MacBooks with M1 processors.
According to the review comment, change the pip installs to poetry installs.
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.7 to 3.6.8.
- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/[email protected]...@jupyterlab/[email protected])

updated-dependencies:
- dependency-name: jupyterlab
  dependency-type: direct:development

Signed-off-by: dependabot[bot] <[email protected]>
- Update deps
- Update Ruff
- Corresponding updates of pyproject
- Slightly update Makefile
- Drop support of Spark 2
- Drop support of Spark 3.1-3.2
- Minimal Python is 3.10 from now on
- Apply new ruff rules
- Apply ruff linter

On branch 202-drop-pyspark2
Changes to be committed:
  modified: .github/workflows/ci.yml
  modified: Makefile
  modified: poetry.lock
  modified: pyproject.toml
  modified: quinn/append_if_schema_identical.py
  modified: quinn/dataframe_helpers.py
  modified: quinn/keyword_finder.py
  modified: quinn/math.py
  modified: quinn/schema_helpers.py
  modified: quinn/split_columns.py
  modified: quinn/transformations.py
Related to mrpowers-io#237

Remove deprecated extension functions from the codebase
* Delete `quinn/extensions/dataframe_ext.py` and `quinn/extensions/spark_session_ext.py`
* Remove import statements for `dataframe_ext` and `spark_session_ext` in `quinn/extensions/__init__.py`
* Remove per-file ignores for `quinn/extensions/dataframe_ext.py` and `quinn/extensions/__init__.py` in `pyproject.toml`
* Delete test files `tests/extensions/test_dataframe_ext.py` and `tests/extensions/test_spark_session_ext.py`
* Create the `extensions/` directory.
* Create the `__init__.py` file within it with the license header.
* Create the `spark_session_ext.py` file within the extensions directory.
* Update the doc-string in create_df to align with the parameter list.
* Format modified files & lint the code using ruff (successful).
As a matter of fact, a create_df function already exists under dataframe_helpers.py. Following are the changes introduced by this commit.
* Update `dataframe_helpers.py` by updating the doc-string in the create_df function.
* Remove the quinn/extensions/ directory along with the two `.py` files involved.
* Remove the ruff check from pyproject.toml.
* Finally, format dataframe_helpers.py with ruff.
* Remove the test_spark_connect.py file as it is not relevant anymore.
* Update the Makefile to remove the spark-connect test.
* Hardcode the Hadoop version to 3, as 2 is EOL.
apply hotfix; update lock file
update makefile: add make command for ruff
Proposed changes
Make validate_schema() behave both as a function and as a decorator to other functions. Also added associated changes to tests and README. Took another stab at #140, as an extension to #144.
Types of changes
What types of changes does your code introduce to quinn?
Put an x in the boxes that apply.

Further comments: implementation details for validate_schema

So we are implementing a decorator factory here so that our function can be used both as a decorator as well as a callable function. In this implementation (a sketch follows the list):

* The validate_schema function acts as both a decorator factory and a decorator. It takes required_schema, ignore_nullable, and an optional df_to_be_validated argument.
* If df_to_be_validated is None, it means the function is being used as a decorator factory, and it returns the inner decorator function.
* If df_to_be_validated is not None, it means the function is being called directly with a DataFrame, and it applies the decorator to df_to_be_validated immediately.
* When validate_schema is called directly with a DataFrame, the validation logic gets executed by wrapping the DataFrame in a lambda function and immediately calling the decorator.
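A condensed sketch of that dual function/decorator behaviour, following the description above (the validation body itself is elided, and functools.wraps is an addition for preserving the wrapped function's metadata; this is not the exact merged code):

```python
import functools
from typing import Callable, Optional

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType


def validate_schema(
    required_schema: StructType,
    ignore_nullable: bool = False,
    df_to_be_validated: Optional[DataFrame] = None,
):
    """Works as @validate_schema(schema) on DataFrame-returning functions,
    or as a direct call: validate_schema(schema, df_to_be_validated=df)."""

    def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
        @functools.wraps(func)
        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            # ... validate dataframe.schema against required_schema,
            #     honouring ignore_nullable, and raise on a mismatch ...
            return dataframe

        return wrapper

    if df_to_be_validated is None:
        # Used as a decorator factory: return the decorator itself.
        return decorator
    # Called directly with a DataFrame: wrap it in a lambda and invoke the
    # decorated function immediately.
    return decorator(lambda: df_to_be_validated)()
```

Usage then reads either as `@validate_schema(required_schema)` above a DataFrame-returning function, or as `validate_schema(required_schema, df_to_be_validated=df)` for one-off validation.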