This is a fully dockerized Spark streaming example using a Kafka queue to explore the new back-pressure feature introduced in Spark 1.5.
Install docker-machine & docker-compose (requires >= 1.5.1)
Create / start the virtual machine
docker-machine create rspark --driver virtualbox
docker-machine start rspark
Update your environment
eval "$(docker-machine env rspark)"
export DOCKER_HOST_IP=`docker-machine ip rspark`
Run zookeeper, kafka, kafkamanager, spark master and worker
docker-compose up zookeeper kafka kafkamanager sparkmaster sparkworker
Configure kafkamanager and add the required topic
- Zookeeper:
- Kafka version:
- Topic:
- Zookeeper:
Build and submit the spark streaming consumer
Unfortunately, Spark expects jobs to be submitted to spark://master:7077 (the Docker-internal hostname), so map the hostname master to your Docker host IP in your /etc/hosts.
sbt consumer/assembly consumer/sparkSubmit
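The /etc/hosts mapping could be sketched as follows. This is a minimal sketch, assuming the hostname to map is `master` (taken from spark://master:7077) and that the address is the one exported as DOCKER_HOST_IP earlier; the helper name `hosts_entry` is hypothetical.

```shell
#!/bin/sh
# Print one /etc/hosts line mapping a hostname to an IP address.
# hosts_entry IP HOSTNAME   (hypothetical helper, not part of this repo)
hosts_entry() {
  printf '%s\t%s\n' "$1" "$2"
}

# Usage (needs sudo, so it is left commented out here). It appends the
# entry only if no line in /etc/hosts already contains the word "master":
# grep -qw master /etc/hosts || hosts_entry "$DOCKER_HOST_IP" master | sudo tee -a /etc/hosts
```

If the virtual machine is recreated and gets a new IP, the old entry has to be removed by hand before running this again.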
Build and start the containerized producer
sbt producer/docker:publishLocal
docker-compose up producer
Tweak the producer settings in
and observe how your processing time / processing delay changes.
Your Spark streaming UI should be here, but can always be found on the Spark master UI.
- PRODUCER_RATE: the event rate
- PRODUCER_MIN_DELAY: the min. processing weight of the produced messages
- PRODUCER_MAX_DELAY: the max. processing weight of the produced messages
docker-compose up producer
Tweak the Spark streaming settings in
and resubmit the job.
sbt consumer/assembly consumer/sparkSubmit
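For orientation, the back-pressure switch introduced in Spark 1.5 is the configuration key `spark.streaming.backpressure.enabled`. The sketch below only shows how such settings could be passed as `--conf` flags if the job were submitted with plain `spark-submit`; this project submits through sbt, so treat it as an illustration. The helper name `backpressure_flags`, the class name, and the jar path are hypothetical, and `spark.streaming.kafka.maxRatePerPartition` applies only if the consumer uses the direct Kafka stream, which is an assumption.

```shell
#!/bin/sh
# Build spark-submit --conf flags for back-pressure experiments.
# backpressure_flags ENABLED [MAX_RATE]   (hypothetical helper)
#   ENABLED:  true or false, value for spark.streaming.backpressure.enabled
#   MAX_RATE: optional cap in records/sec per Kafka partition
#             (spark.streaming.kafka.maxRatePerPartition, direct stream only)
backpressure_flags() {
  flags="--conf spark.streaming.backpressure.enabled=$1"
  if [ -n "${2:-}" ]; then
    flags="$flags --conf spark.streaming.kafka.maxRatePerPartition=$2"
  fi
  printf '%s\n' "$flags"
}

# Hypothetical usage; the class name and jar path are placeholders:
# spark-submit --master spark://master:7077 $(backpressure_flags true 1000) \
#   --class example.Consumer path/to/consumer-assembly.jar
```

Running the job once with the flag set to false and once with true, under the same producer settings, makes the effect on processing delay easier to compare.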