Discussion about implicit modification of PySpark Classes #74
Replies: 2 comments
-
I am a little mixed here. I like [...], and we should definitely get rid of all the [...]. At the same time, I hate monkey patching and think it's an antipattern, so I'm torn, haha.
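For context, the monkey-patching pattern under discussion looks roughly like this (a minimal sketch; the exact body of `isFalsy` and quinn's actual implementation may differ):

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

def isFalsy(self: Column) -> Column:
    """Sketch of one extension: true when the column is null or False."""
    return self.isNull() | (self == F.lit(False))

# Monkey patching: assigning onto the shared pyspark.sql.Column class means
# every Column in the process gains an isFalsy() method as a side effect
# of importing the extensions module.
Column.isFalsy = isFalsy
```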
-
I was thinking again about this topic, and I see that it is very dangerous to use. The current implementation depends on the order of imports. For example, this code block will work well:

```python
from pyspark.sql import functions as F
from quinn.extensions import *

source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())
```

Let's imagine that we have another file:

```python
from pyspark.sql import Column

def modify_col(input: Column) -> Column:
    ...
```

If we import this file after the import of quinn, we will break everything. The following code block won't work:

```python
from pyspark.sql import functions as F
from quinn.extensions import *
from .my_function import *

source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())
```

The same happens if we mix imports from quinn and from pyspark.sql in the same file: the result will depend on the order of imports. It is a really bad practice that may create a lot of confusion...
-
This is a place for discussing the topics from #26, #6, and #35, also mentioned in #42.
For me, the pattern of implicitly modifying existing PySpark classes via `import *` goes against the Zen of Python: "Explicit is better than implicit."
An example is:
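(A sketch, mirroring the snippet from the earlier comment; the point is that the method's origin is invisible at the call site:)

```python
from pyspark.sql import functions as F
from quinn.extensions import *  # implicitly patches pyspark.sql.Column

# Nothing on this line reveals that isFalsy() is not part of PySpark;
# it works only because of the wildcard import above.
source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())
```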
My suggestion is the following: raise a `DeprecationWarning` with a suggestion to use explicit functions instead of `import *`; see the sketch below.
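A minimal sketch of such a deprecation shim (the warning text and the replacement advice are assumptions, not quinn's actual code):

```python
import warnings
from pyspark.sql import Column
from pyspark.sql import functions as F

def isFalsy(self: Column) -> Column:
    # Keep the patched method working for now, but warn on every call
    # and point users toward an explicit function.
    warnings.warn(
        "Column.isFalsy() (added by `from quinn.extensions import *`) is "
        "deprecated; use an explicit quinn function instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return self.isNull() | (self == F.lit(False))

Column.isFalsy = isFalsy
```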