Alternate spark runtime implementation (#406)
* initial implementation
* Added dockerfile
* refactoring of code
* removed unnecessary abstract transform class
* removed unnecessary abstract transform class
* removed unnecessary abstract transform class
* fixed spark image help
* documentation update
* Added filter
* preparing for docid
* implementing docid
* implementing docid
* Add comments to implementation
* Add comments to implementation
* Add support for explicit parallelization
* Add support for explicit parallelization
* Add support for explicit parallelization
* Add support for explicit parallelization
* Addressed comments
* Addressed comments
* run pre commit
* small fixes
* addressed comments
* addressed comments - launcher refactoring
* added support for runtime
* small cleanup
* re factored doc id
* Use multi-stage build

Signed-off-by: Constantin M Adam <[email protected]>

* changed Spark version
* changed Spark version
* changed Spark version
* changed Spark version

---------

Signed-off-by: Constantin M Adam <[email protected]>
Co-authored-by: Constantin M Adam <[email protected]>
1 parent: 7c42b8f. Commit: 03cba30. Showing 84 changed files with 1,140 additions and 1,512 deletions.
# Spark Framework
The Spark runtime implementation is roughly based on the ideas from
[here](https://wrightturn.wordpress.com/2015/07/22/getting-spark-data-from-aws-s3-using-boto-and-pyspark/),
[here](https://medium.com/how-to-become-a-data-architect/get-best-performance-for-pyspark-jobs-using-parallelize-48c8fa03a21e)
and [here](https://medium.com/@shuklaprashant9264/alternate-of-for-loop-in-pyspark-25a00888ec35).
Spark itself is used primarily for execution parallelization; all data access goes through the
framework's [data access](data-access-factory.md) layer, thus preserving all of its implemented features. At
the start of execution, the list of files to process is obtained (using the data access framework)
and then split among the Spark workers, which read the actual data, transform it and write it back.
The implementation is based on Spark RDDs (for a comparison of the three Apache Spark APIs,
RDDs, DataFrames and Datasets, see this
[Databricks blog post](https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)).
As defined by Databricks:
```text
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an
immutable distributed collection of elements of your data, partitioned across nodes in your
cluster that can be operated in parallel with a low-level API that offers transformations
and actions.
```
This API fits perfectly into what we are implementing. It allows us to fully leverage our
existing DataAccess APIs, thus preserving all of the investment in flexible, reliable data
access. Additionally, the RDD's flexible low-level control allows us to work at the partition
level, thus limiting the amount of initialization and setup.
Note that in our approach a transform's processing is based on either binary or parquet data,
not Spark DataFrames or Datasets. We do not currently support these Spark APIs,
as they do not map well onto what we are implementing.

In our implementation we use
[pyspark.SparkContext.parallelize](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.parallelize.html)
to run multiple transforms in parallel, as the sketch below illustrates. We allow two options for
specifying the number of partitions, which determines how many partitions the RDD is divided into. See
[here](https://sparktpoint.com/how-to-create-rdd-using-parallelize/) for an explanation
of this parameter:
* If you specify a positive value of the parameter, Spark will attempt to evenly
distribute the data from seq into that many partitions. For example, if you have
a collection of 100 elements and you specify numSlices as 4, Spark will try
to create 4 partitions with approximately 25 elements in each partition.
* If you don’t specify this parameter, Spark will use a default value, which is
typically determined based on the cluster configuration or the available resources
(number of workers).
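
To make the partitioning concrete, here is a minimal sketch of this pattern; the file names and
the per-partition function are hypothetical, and the real logic lives in the `orchestrate` function
listed under Runtime below:

```python
from pyspark import SparkContext

sc = SparkContext(appName="parallelize-sketch")

# In the real runtime this list comes from the data access factory;
# it is hardcoded here purely for illustration.
files = [f"input/file_{i}.parquet" for i in range(100)]

def process_partition(paths):
    # Per-partition setup (e.g., transform initialization) happens once here,
    # which is why partition-level control limits initialization cost.
    for path in paths:
        yield f"processed {path}"  # stand-in for read / transform / write

# Explicit parallelization: numSlices=4 yields 4 partitions of ~25 files each.
rdd = sc.parallelize(files, 4)
results = rdd.mapPartitions(process_partition).collect()
sc.stop()
```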
## Transforms
* [SparkTransformRuntimeConfiguration](../spark/src/data_processing_spark/transform/runtime_configuration.py)
configures a transform to use PySpark; a hypothetical wiring sketch follows.
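
For illustration only, here is a hedged sketch of how a transform might be wired in. The transform
and configuration names are invented, and the constructor keywords are assumptions modeled on the
base TransformConfiguration pattern, not taken from this commit:

```python
from data_processing.transform import TransformConfiguration
from data_processing_spark.transform.runtime_configuration import SparkTransformRuntimeConfiguration


class MyTransform:
    """Stand-in for a real transform implementation (hypothetical)."""


class MyTransformConfiguration(TransformConfiguration):
    def __init__(self):
        # Assumed keyword names; check the base TransformConfiguration
        # class for the actual signature.
        super().__init__(name="my_transform", transform_class=MyTransform)


class MyTransformSparkRuntimeConfiguration(SparkTransformRuntimeConfiguration):
    def __init__(self):
        # Assumed keyword name; check runtime_configuration.py for the real one.
        super().__init__(transform_config=MyTransformConfiguration())
```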
## Runtime
The Spark runtime extends the base framework with the following set of components:
* [SparkTransformExecutionConfiguration](../spark/src/data_processing_spark/runtime/spark/execution_configuration.py)
configures Spark execution
* [SparkTransformFileProcessor](../spark/src/data_processing_spark/runtime/spark/transform_file_processor.py) extends
[AbstractTransformFileProcessor](../python/src/data_processing/runtime/transform_file_processor.py) to work with
PySpark
* [SparkTransformLauncher](../spark/src/data_processing_spark/runtime/spark/transform_launcher.py) launches
the PySpark runtime and executes a transform
* [orchestrate](../spark/src/data_processing_spark/runtime/spark/transform_orchestrator.py) orchestrates the
Spark-based execution
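
Putting these components together, a transform's entry point would typically create the launcher
from a runtime configuration and launch it. A hedged sketch follows: the `launch()` method name and
the constructor keyword mirror the base framework's launcher pattern and are assumptions here, and
`MyTransformSparkRuntimeConfiguration` is the hypothetical class sketched under Transforms above:

```python
from data_processing_spark.runtime.spark.transform_launcher import SparkTransformLauncher

if __name__ == "__main__":
    # Starts the Spark runtime and executes the transform over the input files.
    launcher = SparkTransformLauncher(runtime_config=MyTransformSparkRuntimeConfiguration())
    launcher.launch()
```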
2 changes: 1 addition & 1 deletion in data-processing-lib/python/src/data_processing/runtime/__init__.py
```diff
@@ -1,4 +1,4 @@
-from data_processing.runtime.execution_configuration import TransformExecutionConfiguration
+from data_processing.runtime.execution_configuration import TransformExecutionConfiguration, runtime_cli_prefix
 from data_processing.runtime.runtime_configuration import TransformRuntimeConfiguration
 from data_processing.runtime.transform_launcher import AbstractTransformLauncher, multi_launcher
 from data_processing.runtime.transform_file_processor import AbstractTransformFileProcessor
```
1 change: 0 additions & 1 deletion in data-processing-lib/python/src/data_processing/transform/__init__.py
17 changes: 0 additions & 17 deletions in data-processing-lib/python/src/data_processing/transform/abstract_transform.py (this file was deleted).