.
├── api # REST API associated with the data (ExpressJS)
├── cassandra # Cassandra Dockerfile and init script (Cassandra)
├── connect-cassandra # Kafka-Cassandra sink Dockerfile and configurations (Kafka Connect + Cassandra)
├── connect-elastic # Kafka-Elasticsearch sink Dockerfile and configurations (Kafka Connect + Elasticsearch)
├── docs # Documentation files and notebooks
├── ingestion # Data ingestion (python script + Kafka)
├── kubernetes # Kubernetes configuration files (Kubernetes)
├── spark # Spark Dockerfile and python scripts (Spark + python script)
├── stream # Kafka Streams application to filter and enrich the input data (Kafka Streams)
├── .gitattributes # .gitattributes file
├── .gitignore # .gitignore file
├── docker-compose.yaml # Base docker-compose file. Starts all the applications
├── LICENSE # License of the project
└── README.md # This file
- This project was created for the Technologies for Advanced Programming (TAP) course at the University of Catania (UniCT).
- The idea is to showcase a simple ETL pipeline using some of the most widely known technologies in the big data field.
- The main inspiration for this project was the OpenDota project, more specifically its open-source "core" component.
- Raw data comes from the Web API provided by Steam (Valve).
Step | Technology used |
---|---|
Data source | Steam API |
Data transport | Apache Kafka |
Data processing | Apache Kafka Streams - Apache Spark |
Data storage | Apache Cassandra - Elasticsearch |
Data visualization | Kibana |
Programming language | Python - Java |
Index | Service | Consumes (Kafka topic) | Produces (Kafka topic) |
---|---|---|---|
1 | Steam Web API | / | dota_raw |
2 | Kafka Streams | dota_raw | dota_single - dota_lineup |
3 | Cassandra | dota_single | / |
4 | Dotingestio2 API | dota_request | dota_response |
5 | Spark | dota_lineup - dota_request | dota_response |
6 | Elasticsearch | dota_single | / |
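
Row 1 of the table (Steam Web API → dota_raw) can be pictured with a short Python sketch. This is only a rough illustration of the flow, not the project's ingestion script: the kafka-python client, the `kafka:9092` broker address, and the placeholder API key are assumptions.

```python
import json
import time

import requests
from kafka import KafkaProducer  # assumption: the kafka-python client library

# Placeholder values; the real ones live in ingestion/settings.yaml
API_KEY = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
API_ENDPOINT = ("http://api.steampowered.com/IDOTA2Match_570/"
                "GetMatchHistoryBySequenceNum/V001/"
                "?key={}&start_at_match_seq_num={}")
TOPIC = "dota_raw"
INTERVAL = 10
match_seq_num = 4976549000

# assumption: 'kafka:9092' is the broker address inside the Docker network
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Fetch a batch of matches starting from the current sequence number
    response = requests.get(API_ENDPOINT.format(API_KEY, match_seq_num)).json()
    for match in response.get("result", {}).get("matches", []):
        producer.send(TOPIC, match)                 # one raw match per record on dota_raw
        match_seq_num = match["match_seq_num"] + 1  # advance past the last match seen
    producer.flush()
    time.sleep(INTERVAL)
```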
- To run the Elasticsearch container you may need to tweak the vm.max_map_count variable. See here
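  On most Linux hosts this means raising the limit before starting the stack, for example with `sudo sysctl -w vm.max_map_count=262144` (the minimum value Elasticsearch expects).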
- Download the DataStax Apache Kafka® Connector and place it in the connect-cassandra directory
- Make sure you are in the root directory, with the docker-compose.yaml file
- Create an ingestion/settings.yaml file with the following values (see ingestion/settings.yaml.example)
All the values present in the settings file can be overridden by an environment variable with the same name in all caps.
```yaml
# You need this to access the Steam Web API, which is used to fetch basic match data.
# You can safely use your main account to obtain the API key.
# You can request an API key here: https://steamcommunity.com/dev/apikey
api_key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# Steam Web API endpoint. You should not modify this unless you know what you are doing
api_endpoint: http://api.steampowered.com/IDOTA2Match_570/GetMatchHistoryBySequenceNum/V001/?key={}&start_at_match_seq_num={}

# Kafka topic the producer will send the data to. The Kafka Streams consumer expects this topic
topic: dota_raw

# Interval between each data fetch by the Python script
interval: 10

# 3 possible settings can be placed here:
# - The sequential match id of the first match you want to fetch, as a string
# - 'cassandra', will fetch the last sequential match id in the cassandra database
# - 'steam', will fetch the most recent sequential match id from the "history_endpoint"
match_seq_num: 4976549000 | 'steam' | 'cassandra'

# Steam Web API endpoint used when the 'steam' value is placed in "match_seq_num"
history_endpoint: https://api.steampowered.com/IDOTA2Match_570/GetMatchHistory/V001/key={}&matches_requested=1
```
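
As a rough illustration of the override rule above, a loader along the lines of the sketch below would give ALL-CAPS environment variables precedence over the file. The `load_settings` name and the use of PyYAML are assumptions, not the project's actual code.

```python
import os

import yaml  # assumption: PyYAML is available in the ingestion image


def load_settings(path="settings.yaml"):
    """Read settings.yaml, then let ALL-CAPS environment variables override each key."""
    with open(path) as f:
        settings = yaml.safe_load(f)
    for key in settings:
        # e.g. the API_KEY environment variable overrides the api_key entry
        # (values coming from the environment are plain strings)
        settings[key] = os.environ.get(key.upper(), settings[key])
    return settings
```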
- Start:
docker-compose up
- Stop:
docker-compose down
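- The stack can also be started in the background with `docker-compose up -d`; the logs of a single service can then be followed with `docker-compose logs -f <service-name>`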
- To run the Elasticsearch container you may need to tweak the vm.max_map_count variable. See here
- Make sure you are in the root directory, with the all-in-one-deploy.yaml file
- Make sure to edit the kubernetes/kafkaproducer-key.yaml file to add your Steam Web API key. All the settings shown above are taken from environment variables with the same name in all caps
- Start:
kubectl apply -f all-in-one-deploy.yaml
- Stop:
kubectl delete -f all-in-one-deploy.yaml
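- Once applied, pod status can be checked with `kubectl get pods`, and the logs of a single pod followed with `kubectl logs <pod-name>`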
Command | Description |
---|---|
`docker exec -it <container-name> bash` | Get a terminal into the running container |
`docker system prune` | Cleans your system of any stopped containers, images, and volumes |
`docker-compose build` | Rebuilds your containers (e.g. for database schema updates) |
`kubectl -n default rollout restart deploy` | Restart all Kubernetes pods |
- Add the much-needed replay parsing to gather much more information about each match.
- Make a usable user interface to fetch the data.
- Use clusters with more than one node for each of the distributed services.
- Improve performance.
- Use Kubernetes to its fullest.
- Use the recommended security layers, such as passwords and encryption.
- OpenDota
- TeamFortress wiki
- DataStax Apache Kafka Connector
- Structured Streaming Programming Guide
- Databricks: Deploying MLlib for Scoring in Structured Streaming
- I used deep learning to predict DotA 2
- Elasticsearch: Using Docker images in production
- Elasticsearch Service Sink Connector for Confluent Platform
- How to start multiple streaming queries in a single Spark application?