# Apache PySpark in Docker

A PySpark Docker container based on OpenJDK and Miniconda 3. PySpark 3+ images use OpenJDK 11; PySpark 2 images use OpenJDK 8.

## Running the container

By default, `spark-submit --help` is run:

```sh
docker run godatadriven/pyspark
```

To run your own job, make it accessible through a volume and pass the necessary arguments:

```sh
docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]
```

## Samples

The `samples` folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. The commands below map the current directory as `/job`, so run them from the root directory of this project; a sketch of what such a job might look like follows the commands.

```sh
# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with extra Spark options:
docker run -v $(pwd):/job godatadriven/pyspark \
    --name "I count myself" \
    --master "local[1]" \
    --conf "spark.ui.showConsoleProgress=True" \
    --conf "spark.ui.enabled=False" \
    /job/samples/word_counter.py "jobSampleArgument1"
```
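
For reference, here is a minimal sketch of what a self word counter job could look like. This is a hypothetical illustration, not the actual `samples/word_counter.py`: it assumes the job counts the words in its own source file and treats the optional application argument (e.g. `jobSampleArgument1`) as a label.

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit supplies the master URL and app name (see the options
    # above), so the job only needs to obtain (or create) the session.
    spark = SparkSession.builder.getOrCreate()

    # Optional app argument passed after the script path, e.g. "jobSampleArgument1".
    # (Hypothetical: the real sample may handle arguments differently.)
    label = sys.argv[1] if len(sys.argv) > 1 else "word_counter"

    # "Self" word counter: read this script's own source file and count its words.
    words = (
        spark.read.text(__file__)
        .rdd.flatMap(lambda row: row.value.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    for word, count in words.collect():
        print(f"{label}: {word} = {count}")

    spark.stop()
```

Because `spark-submit` provides the master and application name, the job calls `SparkSession.builder.getOrCreate()` rather than hard-coding them, which keeps the same script runnable with or without the extra options shown above.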