EPIC Infrastructure backend monorepo. Contains all services, Kubernetes deployment files, Dataproc definitions, and more.
This project is designed to run on top of a pre-configured Kubernetes cluster. Basic knowledge of Kubernetes is required; a good start is this Udacity course. The project is structured into two separate parts: collection pipeline services and dashboard services.
This side is in charge of connecting to Twitter and collecting tweets 24/7. It has two services: TwitterStream (downloads tweets and sends them to Kafka) and tweet-store (receives tweets from Kafka and uploads them to Google Cloud Storage).

Tweets are stored in the `epic-collect` Google Cloud Storage bucket following an `EVENT/YYYY/MM/DD/HH/` folder structure (a tweet received at 2 PM on April 3rd, 2019 for the event winter would be stored in the folder `winter/2019/04/03/14`). Each tweet received is buffered; when the buffer reaches 1000 tweets, a file is created and uploaded to the corresponding folder.
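As a rough illustration of the buffering behavior described above (this is a sketch, not the actual tweet-store code; the class name and the file naming scheme are assumptions), the buffer-then-upload logic could look like this:

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Minimal sketch of the buffer-then-upload behavior described above. */
public class TweetBuffer {
    private static final int BUFFER_SIZE = 1000;          // flush threshold from the README
    private static final String BUCKET = "epic-collect";  // collection bucket from the README

    private final Storage storage = StorageOptions.getDefaultInstance().getService();
    private final String event;
    private final List<String> buffer = new ArrayList<>();

    public TweetBuffer(String event) {
        this.event = event;
    }

    /** Adds one raw tweet (JSON string) and flushes once 1000 tweets are buffered. */
    public synchronized void add(String tweetJson) {
        buffer.add(tweetJson);
        if (buffer.size() >= BUFFER_SIZE) {
            flush();
        }
    }

    private void flush() {
        // Object path follows EVENT/YYYY/MM/DD/HH/, e.g. winter/2019/04/03/14/<file>.json
        String folder = event + "/" + ZonedDateTime.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
        String objectName = folder + "/" + UUID.randomUUID() + ".json";
        byte[] content = String.join("\n", buffer).getBytes(StandardCharsets.UTF_8);
        storage.create(BlobInfo.newBuilder(BUCKET, objectName).build(), content);
        buffer.clear();
    }
}
```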
List of requirements for deploying or developing in this repository.
In order to work on the services you will need the following:

- Java 8 installed (Ex: `brew install java8`)
- Maven installed (Ex: `brew install maven`)
- Make installed (Ex: `brew install make`)
- Our authlib installed: `cd authlib && mvn install`
- Your local Maven installation set up to pull from the GitHub repository (read how to do so here)
- Logged in on your GCloud CLI: `gcloud auth login`
- A default token created using your GCloud user: `gcloud auth application-default login`
In order to deploy you will need:

- Docker CLI installed (Ex: `brew install docker`)
- A hub.docker.com account and your Docker CLI connected to it (`docker login`)
- Editor access to the Project EPIC Docker Hub organization
- Editor access to the GCloud project
- `kubectl` installed (Ex: `brew install kubectl`)
- `kubectl` connected to the corresponding cluster (Project EPIC: `gcloud container clusters get-credentials epic-prod --zone us-central1-c --project crypto-eon-164220`)
Requirements: the development requirements listed above.
- Read the Getting started guide for DropWizard and generate a new service skeleton: `mvn archetype:generate -DarchetypeGroupId=io.dropwizard.archetypes -DarchetypeArtifactId=java-simple -DarchetypeVersion=1.3.9`
- Add AuthLib as a dependency (follow instructions here)
- Add authentication to your service (follow instructions here)
- Add CORS configuration (see instructions here)
- Add a Makefile to the service (you can copy from the Makefile template)
- Add a Dockerfile to the service (you can copy from the Dockerfile template)
- Add a root resource under the resources folder and register it on the application (you can copy the root resource from the EventsAPI example)
- Add a `config.yml` file copying from this template
- Set up your configuration to retrieve production key-values (see this example)
- Set up your application to get configuration parameters from environment variables (see the sketch after this list)
- Run the project locally: `make run`
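The exact wiring is service-specific, but as a rough sketch (the `NewApiApplication`/`NewApiConfiguration` names are placeholders, not actual classes in this repo), environment-variable configuration and CORS in a DropWizard 1.3 application could look like this:

```java
import io.dropwizard.Application;
import io.dropwizard.configuration.EnvironmentVariableSubstitutor;
import io.dropwizard.configuration.SubstitutingSourceProvider;
import io.dropwizard.setup.Bootstrap;
import io.dropwizard.setup.Environment;
import org.eclipse.jetty.servlets.CrossOriginFilter;

import javax.servlet.DispatcherType;
import javax.servlet.FilterRegistration;
import java.util.EnumSet;

public class NewApiApplication extends Application<NewApiConfiguration> {

    public static void main(String[] args) throws Exception {
        new NewApiApplication().run(args);
    }

    @Override
    public void initialize(Bootstrap<NewApiConfiguration> bootstrap) {
        // Let ${ENV_VAR} placeholders in config.yml be filled from environment variables
        bootstrap.setConfigurationSourceProvider(new SubstitutingSourceProvider(
                bootstrap.getConfigurationSourceProvider(),
                new EnvironmentVariableSubstitutor(false)));
    }

    @Override
    public void run(NewApiConfiguration configuration, Environment environment) {
        // Basic CORS setup so the dashboard frontend can call this API from another origin
        FilterRegistration.Dynamic cors =
                environment.servlets().addFilter("CORS", CrossOriginFilter.class);
        cors.setInitParameter(CrossOriginFilter.ALLOWED_ORIGINS_PARAM, "*");
        cors.setInitParameter(CrossOriginFilter.ALLOWED_HEADERS_PARAM, "Authorization,Content-Type");
        cors.setInitParameter(CrossOriginFilter.ALLOWED_METHODS_PARAM, "GET,POST,PUT,DELETE,OPTIONS");
        cors.addMappingForUrlPatterns(EnumSet.of(DispatcherType.REQUEST), true, "/*");

        // Register resources here, e.g. environment.jersey().register(new RootResource());
    }
}

// Placeholder configuration class for this sketch; real services define their own fields.
class NewApiConfiguration extends io.dropwizard.Configuration {}
```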
- (ONLY FIRST TIME) Create a new Kubernetes definition file in the api folder
- Make sure your resources are protected with the right annotations (see how to do it here)
- Make sure you have health checks configured properly for external dependencies
- Add a new path field in the spec within `ingress.yaml`. Ex.:

  ```yaml
  - path: /new-api-folder/*
    backend:
      serviceName: new-api
      servicePort: 8080
  ```

- Update the image version in the `Makefile`
- Create and upload the Docker image: `make push`
- Update the Docker image version in your api definition file and apply it: `kubectl replace -f api/NEW.yml` (replace NEW with your api file name), or `kubectl apply -f api/NEW.yml`
- Apply the ingress changes: `kubectl apply -f ingress.yaml`
- Create managed Postgres instance (see cloudsql instructions)
- Create Dataproc workflow (see dataproc instructions)
- Create a Kubernetes cluster and deploy services (see kubernetes instructions)
How to run various queries on the system over new and old data.
Streaming collection for events happening at the moment
- Open dashboard.gerard.space
- Select Events on the sidebar.
- Press the pink button in the bottom-left corner
- Fill in the form with information about the event and use the keywords field to add the keywords to collect from. Read more about how Twitter tracking works here
- Open the desired table (see sections below)
- Click Query table
- Build the SQL statement for the query we are interested in (see syntax here, and the sketch after this list)
- Run the query
- Download the data by clicking Save results
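The console steps above are usually enough; if the same kind of query needs to run from code instead, a minimal sketch with the BigQuery Java client could look like the following (the project, dataset, table, and column names are placeholders for illustration, not actual names from this repo):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryExample {
    public static void main(String[] args) throws InterruptedException {
        // Uses the application-default credentials set up in the requirements section
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder query: fetch the text of collected tweets that mention "snow"
        String sql = "SELECT text FROM `my-project.my_dataset.winter` "
                   + "WHERE text LIKE '%snow%' LIMIT 100";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row ->
                System.out.println(row.get("text").getStringValue()));
    }
}
```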
Query on an event collected in the new infrastructure
- Open dashboard.gerard.space
- Select Events on the sidebar.
- Select the event to query
- Select the Dashboard tab on top
- (ONLY FIRST TIME) Click Create BigQuery Table
- Click Explore in BigQuery
- Open the historic dataset in BigQuery
- If the table exists: execute the query
- Else:
  - Click Create table
  - Set the table configuration to the following (if not specified, leave as defaulted; a programmatic equivalent is sketched after this list):
    - Create table from: "Google Cloud Storage"
    - Select file...: Browse to a file in the `epic-historic_tweets` bucket, in the corresponding folder, and select one file; then replace the filename with a wildcard (Ex: `epic-historic-tweets/2012 Canada Fires/*`)
    - File format: "JSON (newline delimited)"
    - Table type: "External table"
    - Table name: Fill with a distinct table name
    - Check the Auto detect - Schema and input parameters box
    - Advanced options:
      - Number of errors allowed: 2147483647
      - Check the Ignore unknown values box
  - Click Create table
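The same external table can also be defined from code. Below is a minimal sketch with the BigQuery Java client that mirrors the console settings above; the dataset and table names are placeholders, not existing resources:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateHistoricExternalTable {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Wildcard over the event folder, as in the console example above
        String sourceUri = "gs://epic-historic-tweets/2012 Canada Fires/*";

        ExternalTableDefinition definition =
                ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.json())
                        .setAutodetect(true)          // Auto detect - Schema and input parameters
                        .setMaxBadRecords(2147483647) // Number of errors allowed
                        .setIgnoreUnknownValues(true) // Ignore unknown values
                        .build();

        // Placeholder dataset/table names; pick a distinct table name
        TableId tableId = TableId.of("historic", "canada_fires_2012");
        bigquery.create(TableInfo.of(tableId, definition));
    }
}
```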
Google Cloud is giving an authorization error locally
- Log in on your GCloud CLI: `gcloud auth login`
- Make sure you have been added to the proper Google Cloud project.
- Create a default token using your GCloud user: `gcloud auth application-default login`
- Make sure you don't have any `GOOGLE_APPLICATION_CREDENTIALS` environment variable set.
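After these steps, a quick way to double-check that the application-default credentials resolve from Java is a tiny snippet like the one below (this helper is illustrative and not part of the repo):

```java
import com.google.auth.oauth2.GoogleCredentials;

import java.io.IOException;

public class CheckDefaultCredentials {
    public static void main(String[] args) {
        try {
            // Resolves the same application-default credentials the services use locally
            GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
            System.out.println("Application-default credentials found: " + credentials);
        } catch (IOException e) {
            // Thrown when no default credentials can be found;
            // rerun `gcloud auth application-default login` in that case
            System.err.println("No application-default credentials: " + e.getMessage());
        }
    }
}
```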