
Updated documentation for get_bcast_params() method
Signed-off-by: Constantin M Adam <[email protected]>
cmadam committed Oct 10, 2024
1 parent 36e5894 commit 8d9bb66
Showing 3 changed files with 9 additions and 5 deletions.
8 changes: 5 additions & 3 deletions data-processing-lib/doc/spark-runtime.md
@@ -41,9 +41,11 @@ of this parameter:

## Transforms

-* [SparkTransformRuntimeConfiguration](../spark/src/data_processing_spark/transform/runtime_configuration.py) allows
-to configure transform to use PySpark
-
+* [SparkTransformRuntimeConfiguration](../spark/src/data_processing_spark/runtime/spark/runtime_configuration.py)
+allows configuring a transform to use PySpark. In addition to the features of its base class
+[TransformRuntimeConfiguration](../python/src/data_processing/runtime/runtime_configuration.py),
+this class includes the `get_bcast_params()` method for retrieving very large configuration settings. Before
+starting the transform execution, the Spark runtime broadcasts these settings to all the workers.

## Runtime

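To make the documented behavior concrete, here is a minimal sketch of how a transform might override `get_bcast_params()` in its runtime configuration. The class name, the `config/doc_ids_to_remove.json` file, the dictionary key, the `dict[str, Any]` return annotation, and the `create_data_access()` / `get_file()` helpers are assumptions for illustration; only the method name and its `data_access_factory` parameter come from this commit.

```python
from typing import Any

# Import paths are inferred from the file locations referenced in this commit;
# the module providing DataAccessFactoryBase is an assumption.
from data_processing.data_access import DataAccessFactoryBase
from data_processing_spark.runtime.spark.runtime_configuration import (
    SparkTransformRuntimeConfiguration,
)


class FuzzyDedupRuntimeConfiguration(SparkTransformRuntimeConfiguration):
    """Hypothetical runtime configuration for a fuzzy dedup transform."""

    def get_bcast_params(self, data_access_factory: DataAccessFactoryBase) -> dict[str, Any]:
        # Build a data_access object and download the large configuration
        # parameter (here, a list of document IDs to remove); both helper
        # names are assumed, not taken from this commit.
        data_access = data_access_factory.create_data_access()
        doc_ids, _ = data_access.get_file("config/doc_ids_to_remove.json")
        # The Spark runtime broadcasts this dictionary to all workers before
        # spark_context.parallelize() is called.
        return {"doc_ids_to_remove": doc_ids}
```

Because the runtime broadcasts the dictionary once, each worker reads a shared copy instead of downloading the large parameter itself.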
@@ -36,7 +36,8 @@ def get_bcast_params(self, data_access_factory: DataAccessFactoryBase) -> dict[s
"""Allows retrieving and broadcasting to all the workers very large
configuration parameters, like the list of document IDs to remove for
fuzzy dedup, or the list of blocked web domains for block listing. This
-function is called after spark initialization, and before spark_context.parallelize()
+function is called by the spark runtime after spark initialization, and
+before spark_context.parallelize()
:param data_access_factory - creates data_access object to download the large config parameter
"""
return {}
@@ -48,7 +48,8 @@ def get_bcast_params(self, data_access_factory: DataAccessFactoryBase) -> dict[s
"""Allows retrieving and broadcasting to all the workers very large
configuration parameters, like the list of document IDs to remove for
fuzzy dedup, or the list of blocked web domains for block listing. This
-function is called after spark initialization, and before spark_context.parallelize().
+function is called by the spark runtime after spark initialization, and
+before spark_context.parallelize()
:param data_access_factory - creates data_access object to download the large config parameter
"""
return {}
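For context only, a rough sketch (not the actual runtime code, which is outside this commit) of how a Spark runtime could consume the returned dictionary: broadcast each entry with `SparkContext.broadcast()` after Spark initialization and before `parallelize()`. The `run_with_broadcast()` function and all of its parameters are hypothetical.

```python
from typing import Any

from pyspark import SparkContext


def run_with_broadcast(
    sc: SparkContext, runtime_config: Any, data_access_factory: Any, files: list[str]
) -> list:
    # 1. After Spark initialization, ask the runtime configuration for its
    #    (potentially very large) parameters.
    bcast_params = runtime_config.get_bcast_params(data_access_factory)
    # 2. Broadcast each entry once so every worker gets a read-only copy.
    broadcasts = {name: sc.broadcast(value) for name, value in bcast_params.items()}
    # 3. Only then distribute the work; tasks access the broadcast values
    #    through .value instead of re-downloading the large configuration.
    return (
        sc.parallelize(files)
        .map(lambda path: (path, {name: b.value for name, b in broadcasts.items()}))
        .collect()
    )
```

A real transform would use the broadcast values inside its per-record processing rather than returning them, but the ordering (call `get_bcast_params()`, broadcast, then `parallelize()`) follows the docstring above.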
