Dotingestion 2: there is no Dotingestion 1

Binder

Project structure

.
├── api                     # Rest API associated with the data (ExpressJS)
├── cassandra               # Cassandra Dockerfile and init script (Cassandra)
├── connect-cassandra       # Kafka-Cassandra sink Dockerfile and configurations (Kafka Connect + Cassandra)
├── connect-elastic         # Kafka-Elasticsearch sink Dockerfile and configurations (Kafka Connect + Elasticsearch)
├── docs                    # Documentation files and notebooks
├── ingestion               # Data ingestion (python script + Kafka)
├── kubernetes              # Kubernetes configuration files (Kubernetes)
├── spark                   # Spark Dockerfile and python scripts (Spark + python script)
├── stream                  # Kafka stream application to filter and enrich the input data (Kafka Streaming)
├── .gitattribute           # .gitattribute file
├── .gitignore              # .gitignore file
├── docker-compose.yaml     # Base docker-compose file. Starts all the applications
├── LICENSE                 # License of the project
└── README.md               # This file

Brief description

  • This is a project created for the Technologies for Advanced Programming (TAP) course at the University of Catania (UniCT).
  • The idea is to showcase a simple ETL pipeline using some of the most widely known technologies in the big data field.
  • The main inspiration for this project was the OpenDota project, more specifically its open-source "core" component.
  • Raw data comes from the Web API provided by Steam (Valve).

Technologies used

Step                  Technology used
Data source           Steam API
Data transport        Apache Kafka
Data processing       Apache Kafka Streams - Apache Spark
Data storage          Apache Cassandra - Elasticsearch
Data visualization    Kibana
Programming language  Python - Java

Pipeline

(Pipeline diagram: data flow through the services and Kafka topics listed below)

Index  Service            From Kafka topic            To Kafka topic
1      Steam Web API      /                           dota_raw
2      Kafka Streams      dota_raw                    dota_single - dota_lineup
3      Cassandra          dota_single                 /
4      Dotingestion2 API  dota_request                dota_response
5      Spark              dota_lineup - dota_request  dota_response
6      Elasticsearch      dota_single                 /
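
For a quick sanity check of this flow, the enriched matches that the Kafka Streams step publishes to dota_single can be inspected directly from the topic. Below is a minimal sketch using the kafka-python client; the client library and the broker address localhost:9092 are assumptions, not part of this repository.

    # Hypothetical helper: peek at the dota_single topic to verify that the
    # stream step is emitting enriched matches.
    # Assumptions: kafka-python is installed and a broker listens on localhost:9092.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "dota_single",                       # topic written by the Kafka Streams step
        bootstrap_servers="localhost:9092",  # assumption: default local broker
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        consumer_timeout_ms=10000,           # stop iterating after 10 s without messages
    )

    for message in consumer:
        print(message.value)                 # one enriched match per record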

Quickstart local (Docker)

System requirements

Steps

  • To run the Elasticsearch container you may need to tweak the vm.max_map_count variable. See here
  • Download DataStax Apache Kafka® Connector and place it in the connect-cassandra directory
  • Make sure you are in the root directory, with the docker-compose.yaml file
  • Create an ingestion/settings.yaml file with the following values (see ingestion/settings.yaml.example)
    # You need this to access the Steam Web API, which is used to fetch basic match data. You can safely use your main account to obtain the API key. You can request an API key here: https://steamcommunity.com/dev/apikey
    api_key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    # Steam Web API endpoint. You should not modify this unless you know what you are doing
    api_endpoint: http://api.steampowered.com/IDOTA2Match_570/GetMatchHistoryBySequenceNum/V001/?key={}&start_at_match_seq_num={}
    # Kafka topic the producer will send the data to. The Kafka streams consumer expects this topic
    topic: dota_raw
    # Interval between each data fetch by the python script
    interval: 10
    # Three possible values can be placed here:
    # - The sequential match id of the first match you want to fetch, as a string
    # - 'cassandra': fetch the last sequential match id stored in the Cassandra database
    # - 'steam': fetch the most recent sequential match id from the "history_endpoint"
    match_seq_num: 4976549000 | 'steam' | 'cassandra'
    # Steam API Web endpoint used when 'steam' value is placed in "match_seq_num"
    history_endpoint: https://api.steampowered.com/IDOTA2Match_570/GetMatchHistory/V001/?key={}&matches_requested=1
    All the values in the settings file can be overridden by an environment variable with the same name in all caps (see the sketch after these steps)
  • Start:
    docker-compose up
  • Stop:
    docker-compose down
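
As a rough illustration of how the ingestion step could consume these settings, the sketch below loads settings.yaml, lets all-caps environment variables override the values, fetches one page of matches from the Steam Web API and publishes it to the configured topic. The function names are hypothetical, the 'steam'/'cassandra' resolution of match_seq_num is omitted, and the real script in the ingestion directory may differ; PyYAML, requests, kafka-python and a broker on localhost:9092 are assumed.

    # Hypothetical sketch of the ingestion loop, not the actual ingestion script.
    # Assumptions: PyYAML, requests and kafka-python are installed, a broker
    # listens on localhost:9092, and match_seq_num is a plain sequence number.
    import json
    import os
    import time

    import requests
    import yaml
    from kafka import KafkaProducer


    def load_settings(path: str = "ingestion/settings.yaml") -> dict:
        """Read the YAML settings and override each key with an all-caps env var."""
        with open(path) as handle:
            settings = yaml.safe_load(handle)
        for key in settings:
            settings[key] = os.environ.get(key.upper(), settings[key])
        return settings


    def fetch_matches(settings: dict) -> list:
        """Fetch one page of matches starting from the configured sequence number."""
        url = settings["api_endpoint"].format(settings["api_key"], settings["match_seq_num"])
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.json().get("result", {}).get("matches", [])


    if __name__ == "__main__":
        settings = load_settings()
        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",  # assumption: default local broker
            value_serializer=lambda value: json.dumps(value).encode("utf-8"),
        )
        while True:
            matches = fetch_matches(settings)
            for match in matches:
                producer.send(settings["topic"], match)
            if matches:
                # start the next request right after the last match received
                settings["match_seq_num"] = matches[-1]["match_seq_num"] + 1
            producer.flush()
            time.sleep(int(settings["interval"]))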

Quickstart local (Kubernetes)

System requirements

Steps

  • To run the Elasticsearch container you may need to tweak the vm.max_map_count variable. See here
  • Make sure you are in the root directory, with the all-in-one-deploy.yaml file
  • Make sure to edit the kubernetes/kafkaproducer-key.yaml file to add your Steam Web API key. All the settings shown above are read from environment variables with the same name in all caps
  • Start:
    kubectl apply -f all-in-one-deploy.yaml
  • Stop:
    kubectl delete -f all-in-one-deploy.yaml

Useful commands

  • docker exec -it <container-name> bash Get a terminal into the running container
  • docker system prune Cleans your system of stopped containers, dangling images, unused networks, and build cache (add --volumes to also remove unused volumes)
  • docker-compose build Rebuilds the service images (e.g. for database schema updates)
  • kubectl -n default rollout restart deploy Restart all Kubernetes pods

TODO list

  • Add the much needed replay parsing to gather much more information about each match.
  • Make a usable user interface to fetch the data.
  • Use clusters with more than one node for each of the distributed services.
  • Improve performance.
  • Use Kubernetes to its fullest.
  • Add the recommended security layers, such as passwords and encryption.

Resources