Add decorator functionality to validate_schema function #255
base: planning-1.0-release
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,10 +11,10 @@ | |
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from __future__ import annotations | ||
from __future__ import annotations # noqa: I001 | ||
|
||
import copy | ||
from typing import TYPE_CHECKING | ||
from typing import Any, Callable, TYPE_CHECKING | ||
|
||
if TYPE_CHECKING: | ||
from pyspark.sql import DataFrame | ||
|
@@ -52,38 +52,60 @@ def validate_presence_of_columns(df: DataFrame, required_col_names: list[str]) - | |
|
||
|
||
def validate_schema( | ||
df: DataFrame, | ||
required_schema: StructType, | ||
ignore_nullable: bool = False, | ||
) -> None: | ||
df_to_be_validated: DataFrame = None, | ||
) -> Callable[[Any, Any], Any]: | ||
"""Function that validate if a given DataFrame has a given StructType as its schema. | ||
Implemented as a decorator factory so can be used both as a standalone function or as | ||
a decorator to another function. | ||
|
||
:param df: DataFrame to validate | ||
:type df: DataFrame | ||
:param required_schema: StructType required for the DataFrame | ||
:type required_schema: StructType | ||
:param ignore_nullable: (Optional) A flag for if nullable fields should be | ||
ignored during validation | ||
:type ignore_nullable: bool, optional | ||
:param df_to_be_validated: DataFrame to validate, mandatory when called as a function. Not required | ||
when called as a decorator | ||
:type df_to_be_validated: DataFrame | ||
|
||
:raises DataFrameMissingStructFieldError: if any StructFields from the required | ||
schema are not included in the DataFrame schema | ||
""" | ||
_all_struct_fields = copy.deepcopy(df.schema) | ||
_required_schema = copy.deepcopy(required_schema) | ||
|
||
if ignore_nullable: | ||
for x in _all_struct_fields: | ||
x.nullable = None | ||
def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]: | ||
def wrapper(*args: object, **kwargs: object) -> DataFrame: | ||
dataframe = func(*args, **kwargs) | ||
_all_struct_fields = copy.deepcopy(dataframe.schema) | ||
_required_schema = copy.deepcopy(required_schema) | ||
|
||
Why not check that both schemas have the same length first? I mean before doing the deepcopy.
@SemyonSinchenko Are you suggesting that we check whether the lengths match, and only then do the deepcopy?
Exactly! A deep copy may be quite expensive for big objects. |
||
for x in _required_schema: | ||
x.nullable = None | ||
if ignore_nullable: | ||
for x in _all_struct_fields: | ||
x.nullable = None | ||
|
||
missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields] | ||
error_message = f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}" | ||
for x in _required_schema: | ||
x.nullable = None | ||
|
||
if missing_struct_fields: | ||
raise DataFrameMissingStructFieldError(error_message) | ||
missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields] | ||
I would like to transform this comprehension into a loop, like:
|
||
error_message = ( | ||
f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}" | ||
In the case of nested schemas, this error message will be unreadable.
@SemyonSinchenko How do you suggest we put this?
I would suggest flattening all fields first to the way |
||
) | ||
|
||
if missing_struct_fields: | ||
raise DataFrameMissingStructFieldError(error_message) | ||
|
||
print("Success! DataFrame matches the required schema!") | ||
|
||
return dataframe | ||
|
||
return wrapper | ||
|
||
if df_to_be_validated is None: | ||
# This means the function is being used as a decorator | ||
return decorator | ||
|
||
# This means the function is being called directly with a DataFrame | ||
return decorator(lambda: df_to_be_validated)() | ||
|
||
|
||
def validate_absence_of_columns(df: DataFrame, prohibited_col_names: list[str]) -> None: | ||
|
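The dual-use pattern in the diff above (one callable that works both as a decorator and as a direct call) can be sketched without Spark. This is a minimal illustration, not the PR's code: `validate_keys`, `MissingKeysError`, and `obj_to_be_validated` are hypothetical names, and plain dictionaries stand in for DataFrames.

```python
from __future__ import annotations

from functools import wraps
from typing import Any, Callable


class MissingKeysError(ValueError):
    """Hypothetical error, standing in for DataFrameMissingStructFieldError."""


def validate_keys(
    required_keys: set[str],
    obj_to_be_validated: dict[str, Any] | None = None,
) -> Any:
    """Validate a dict either directly or as the return value of a decorated function."""

    def decorator(func: Callable[..., dict[str, Any]]) -> Callable[..., dict[str, Any]]:
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> dict[str, Any]:
            result = func(*args, **kwargs)
            missing = sorted(required_keys - set(result))
            if missing:
                raise MissingKeysError(f"Missing keys: {missing}")
            return result

        return wrapper

    if obj_to_be_validated is None:
        # Used as @validate_keys(...): hand back the decorator itself.
        return decorator

    # Called directly: wrap a function that returns the object, then run it.
    return decorator(lambda: obj_to_be_validated)()


@validate_keys({"id"})
def load_record() -> dict[str, Any]:
    return {"id": 1, "name": "a"}


record = load_record()                           # validated on return
checked = validate_keys({"id"}, {"id": 2})       # validated immediately
```

One consequence of this design is that the return type depends on how the function is called (a decorator in one case, the validated object in the other), which is why the sketch annotates the return as `Any`.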
We are doing a deep copy of two potentially long schemas on each call. I do not like it.
@SemyonSinchenko I took a step back and re-evaluated the implementation. The question is - do we even need deepcopy here? We are not changing the original dfs or their schemas. Is shallow copy a better idea here? Lemme know your thoughts!
Yes, sounds good!
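One way to sidestep the copy question entirely is to compare derived keys instead of mutating `nullable` on copied fields. The sketch below is only an illustration of that idea: `Field` is a hypothetical stand-in for `StructField`, and `find_missing_fields` is not a function from this PR. With real `StructField` objects the key would presumably be built from `name`, `dataType`, and `nullable`; whether those data types hash as needed for a set is not verified here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (name, type, nullability)."""

    name: str
    data_type: str
    nullable: bool = True


def find_missing_fields(
    actual: list[Field],
    required: list[Field],
    ignore_nullable: bool = False,
) -> list[Field]:
    """Return required fields absent from `actual`, without copying or mutating either schema."""

    def key(field: Field) -> tuple:
        # Drop nullability from the comparison key instead of overwriting it on a copy.
        if ignore_nullable:
            return (field.name, field.data_type)
        return (field.name, field.data_type, field.nullable)

    actual_keys = {key(field) for field in actual}
    return [field for field in required if key(field) not in actual_keys]


actual_schema = [Field("id", "int", nullable=False), Field("name", "string")]
required_schema = [Field("id", "int", nullable=True)]

strict_missing = find_missing_fields(actual_schema, required_schema)
relaxed_missing = find_missing_fields(actual_schema, required_schema, ignore_nullable=True)
```

Because nothing is mutated, neither a deep nor a shallow copy is needed, and the original schemas stay untouched after validation.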