This is a fully dockerized Spark streaming example using a Kafka queue to explore the new back-pressure feature introduced in Spark 1.5.

- Install docker-machine & docker-compose (>= 1.5.1 required)

- Create / start the virtual machine

      docker-machine create rspark --driver virtualbox

  or

      docker-machine start rspark

- Update your environment

      eval "$(docker-machine env rspark)"
      export DOCKER_HOST_IP=`docker-machine ip rspark`

- Run zookeeper, kafka, kafkamanager, spark master and worker

      docker-compose up zookeeper kafka kafkamanager sparkmaster sparkworker
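
  A quick sanity check that all five services actually came up (the service names are the ones declared in docker-compose.yml):

      docker-compose ps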

- Configure kafkamanager and add the required topic (a command-line alternative is sketched below)
  - Zookeeper: zk:2181
  - Kafka version: 0.8.2.1
  - Topic: numbers
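
  If you prefer the command line to the kafka-manager UI, the topic can also be created with Kafka's own admin script from inside the kafka container (the partition and replication values here are only an example, and where the script lives depends on the image):

      kafka-topics.sh --create --zookeeper zk:2181 --partitions 1 --replication-factor 1 --topic numbers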

- Build and submit the spark streaming consumer. Unfortunately Spark expects jobs to be submitted to spark://master:7077 (the docker internal hostname), so first map the hostname master in your /etc/hosts (one way to do this is sketched below).

      sbt consumer/assembly consumer/sparkSubmit
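
  One way to add that mapping, assuming the Spark master's port is published on the docker-machine VM so that master should resolve to the VM's IP:

      echo "$(docker-machine ip rspark) master" | sudo tee -a /etc/hosts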

- Build and start the containerized producer

      sbt producer/docker:publishLocal
      docker-compose up producer

- Tweak the producer settings in docker-compose.yml and observe how your processing time / processing delay changes. The streaming UI of the running job can always be found via the Spark master UI (see the note below).
  - PRODUCER_RATE: the event rate
  - PRODUCER_MIN_DELAY: the min. processing weight of the produced messages
  - PRODUCER_MAX_DELAY: the max. processing weight of the produced messages

  After changing the settings, restart the producer:

      docker-compose up producer
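
  If the Spark master web UI port is published by this docker-compose.yml (8080 is the Spark default, so this is an assumption about the compose file), it can be opened on the VM address exported earlier:

      open http://$DOCKER_HOST_IP:8080   # on OS X; otherwise paste the URL into any browser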

- Tweak the Spark streaming settings in SparkConsumer.scala (for example the batch interval, or spark.streaming.backpressure.enabled, the Spark 1.5 property that switches the new back-pressure mechanism on and off) and resubmit the job.

      sbt consumer/assembly consumer/sparkSubmit