EPIC Infrastructure backend monorepo. Contains all services, Kubernetes deployment files, Dataproc definitions, and more.
This project is designed to run on top of a pre-configured Kubernetes cluster. Basic knowledge of Kubernetes is required; a good start is this Udacity course. The project is structured into two separate parts: collection pipeline services and dashboard services.
This side is in charge of connecting to Twitter and collecting tweets 24/7. It has two services: TwitterStream (downloads tweets and sends them to Kafka) and tweet-store (receives tweets from Kafka and uploads them to Google Cloud Storage).

Tweets are stored in the `epic-collect` Google Cloud Storage bucket following an `EVENT/YYYY/MM/DD/HH/` folder structure (a tweet received at 2 PM on April 3rd, 2019 for the event winter would be stored in the folder `winter/2019/04/03/14`). Each tweet received is buffered; when the buffer reaches 1000 tweets, a file is created and uploaded to the corresponding folder.
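As a rough illustration of the buffering behavior described above (this is a sketch, not the actual tweet-store code; the class name and the file naming scheme are assumptions), the buffer-then-upload logic could look like this:

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Minimal sketch of the buffer-then-upload behavior described above. */
public class TweetBuffer {
    private static final int BUFFER_SIZE = 1000;          // flush threshold from the README
    private static final String BUCKET = "epic-collect";  // collection bucket from the README

    private final Storage storage = StorageOptions.getDefaultInstance().getService();
    private final String event;
    private final List<String> buffer = new ArrayList<>();

    public TweetBuffer(String event) {
        this.event = event;
    }

    /** Adds one raw tweet (JSON string) and flushes once 1000 tweets are buffered. */
    public synchronized void add(String tweetJson) {
        buffer.add(tweetJson);
        if (buffer.size() >= BUFFER_SIZE) {
            flush();
        }
    }

    private void flush() {
        // Object path follows EVENT/YYYY/MM/DD/HH/, e.g. winter/2019/04/03/14/<file>.json
        String folder = event + "/" + ZonedDateTime.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
        String objectName = folder + "/" + UUID.randomUUID() + ".json";
        byte[] content = String.join("\n", buffer).getBytes(StandardCharsets.UTF_8);
        storage.create(BlobInfo.newBuilder(BUCKET, objectName).build(), content);
        buffer.clear();
    }
}
```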
List of requirements for deploying or developing in this repository.
In order to work on the services you will need the following:

- Java 8 installed (Ex: `brew install java8`)
- Maven installed (Ex: `brew install maven`)
- Make installed (Ex: `brew install make`)
- Our authlib installed: `cd authlib && mvn install`
- Your local Maven installation set up to pull from the GitHub repository (read how to do so here)
- Logged in on your GCloud CLI: `gcloud auth login`
- A default token created using your GCloud user: `gcloud auth application-default login`
In order to deploy you will need:

- Docker CLI installed (Ex: `brew install docker`)
- A hub.docker.com account and your Docker CLI connected to it (`docker login`)
- Editor access to the Project EPIC Docker Hub organization
- Editor access to the GCloud project
- `kubectl` installed (Ex: `brew install kubectl`)
- `kubectl` connected to the corresponding cluster (Project EPIC: `gcloud container clusters get-credentials epic-prod --zone us-central1-c --project crypto-eon-164220`)
Requirements: the development requirements listed above.
- Read the Getting started guide for DropWizard and generate a new service skeleton: `mvn archetype:generate -DarchetypeGroupId=io.dropwizard.archetypes -DarchetypeArtifactId=java-simple -DarchetypeVersion=1.3.9`
- Add AuthLib as a dependency (follow instructions here)
- Add authentication to your service (follow instructions here)
- Add CORS configuration (see instructions here)
- Add a Makefile to the service (you can copy from the Makefile template)
- Add a Dockerfile to the service (you can copy from the Dockerfile template)
- Add a root resource under the resources folder and register it on the application (you can copy the root resource from the EventsAPI example)
- Add a `config.yml` file copying from this template
- Set up your configuration to retrieve production key-values (see this example)
- Set up your application to get configuration parameters from environment variables (see the sketch after this list)
- Run the project locally: `make run`
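The exact wiring is service-specific, but as a rough sketch (the `NewApiApplication`/`NewApiConfiguration` names are placeholders, not actual classes in this repo), environment-variable configuration and CORS in a DropWizard 1.3 application could look like this:

```java
import io.dropwizard.Application;
import io.dropwizard.configuration.EnvironmentVariableSubstitutor;
import io.dropwizard.configuration.SubstitutingSourceProvider;
import io.dropwizard.setup.Bootstrap;
import io.dropwizard.setup.Environment;
import org.eclipse.jetty.servlets.CrossOriginFilter;

import javax.servlet.DispatcherType;
import javax.servlet.FilterRegistration;
import java.util.EnumSet;

public class NewApiApplication extends Application<NewApiConfiguration> {

    public static void main(String[] args) throws Exception {
        new NewApiApplication().run(args);
    }

    @Override
    public void initialize(Bootstrap<NewApiConfiguration> bootstrap) {
        // Let ${ENV_VAR} placeholders in config.yml be filled from environment variables
        bootstrap.setConfigurationSourceProvider(new SubstitutingSourceProvider(
                bootstrap.getConfigurationSourceProvider(),
                new EnvironmentVariableSubstitutor(false)));
    }

    @Override
    public void run(NewApiConfiguration configuration, Environment environment) {
        // Basic CORS setup so the dashboard frontend can call this API from another origin
        FilterRegistration.Dynamic cors =
                environment.servlets().addFilter("CORS", CrossOriginFilter.class);
        cors.setInitParameter(CrossOriginFilter.ALLOWED_ORIGINS_PARAM, "*");
        cors.setInitParameter(CrossOriginFilter.ALLOWED_HEADERS_PARAM, "Authorization,Content-Type");
        cors.setInitParameter(CrossOriginFilter.ALLOWED_METHODS_PARAM, "GET,POST,PUT,DELETE,OPTIONS");
        cors.addMappingForUrlPatterns(EnumSet.of(DispatcherType.REQUEST), true, "/*");

        // Register resources here, e.g. environment.jersey().register(new RootResource());
    }
}

// Placeholder configuration class for this sketch; real services define their own fields.
class NewApiConfiguration extends io.dropwizard.Configuration {}
```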
- (ONLY FIRST TIME) Create a new Kubernetes definition file in the api folder
- Make sure your resources are protected with the right annotations (see how to do it here)
- Make sure you have health checks configured properly for external dependencies
- Add a new path field in the spec within `ingress.yaml`. Ex.:

  ```yaml
  - path: /new-api-folder/*
    backend:
      serviceName: new-api
      servicePort: 8080
  ```

- Update the image version in the `Makefile`
- Create and upload the Docker image: `make push`
- Update the Docker image version in your api definition file and apply it: `kubectl replace -f api/NEW.yml` (replace NEW with your api file name), or `kubectl apply -f api/NEW.yml`
- Apply the ingress changes: `kubectl apply -f ingress.yaml`
- Create managed Postgres instance (see cloudsql instructions)
- Create Dataproc workflow (see dataproc instructions)
- Create a Kubernetes cluster and deploy services (see kubernetes instructions)
How to run various queries on the system over new and old data.
Streaming collection for events happening at the moment
- Open dashboard.gerard.space
- Select Events on the sidebar.
- Press the pink button in the bottom-left corner
- Fill in the form with information about the event and use the keywords field to add the keywords to collect from. Read more about how Twitter tracking works here
- Open the desired table (see sections below)
- Click Query table
- Build the SQL statement for the query we are interested in (see syntax here, and the sketch after this list)
- Run the query
- Download the data by clicking Save results
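The console steps above are usually enough; if the same kind of query needs to run from code instead, a minimal sketch with the BigQuery Java client could look like the following (the project, dataset, table, and column names are placeholders for illustration, not actual names from this repo):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryExample {
    public static void main(String[] args) throws InterruptedException {
        // Uses the application-default credentials set up in the requirements section
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder query: fetch the text of collected tweets that mention "snow"
        String sql = "SELECT text FROM `my-project.my_dataset.winter` "
                   + "WHERE text LIKE '%snow%' LIMIT 100";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row ->
                System.out.println(row.get("text").getStringValue()));
    }
}
```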
Query on an event collected in the new infrastructure
- Open dashboard.gerard.space
- Select Events on the sidebar.
- Select the event to query
- Select the Dashboard tab on top
- (ONLY FIRST TIME) Click Create BigQuery Table
- Click Explore in BigQuery
- Open the historic dataset in BigQuery
- If the table exists: execute the query
- Else:
  - Click Create table
  - Set the table configuration to the following (if not specified, leave as defaulted; a programmatic equivalent is sketched after this list):
    - Create table from: "Google Cloud Storage"
    - Select file...: Browse to a file in the `epic-historic_tweets` bucket, in the corresponding folder, and select one file; then replace the filename with a wildcard (Ex: `epic-historic-tweets/2012 Canada Fires/*`)
    - File format: "JSON (newline delimited)"
    - Table type: "External table"
    - Table name: Fill with a distinct table name
    - Check the Auto detect - Schema and input parameters box
    - Advanced options:
      - Number of errors allowed: 2147483647
      - Check the Ignore unknown values box
  - Click Create table
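The same external table can also be defined from code. Below is a minimal sketch with the BigQuery Java client that mirrors the console settings above; the dataset and table names are placeholders, not existing resources:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateHistoricExternalTable {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Wildcard over the event folder, as in the console example above
        String sourceUri = "gs://epic-historic-tweets/2012 Canada Fires/*";

        ExternalTableDefinition definition =
                ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.json())
                        .setAutodetect(true)          // Auto detect - Schema and input parameters
                        .setMaxBadRecords(2147483647) // Number of errors allowed
                        .setIgnoreUnknownValues(true) // Ignore unknown values
                        .build();

        // Placeholder dataset/table names; pick a distinct table name
        TableId tableId = TableId.of("historic", "canada_fires_2012");
        bigquery.create(TableInfo.of(tableId, definition));
    }
}
```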
Google Cloud is giving an authorization error locally
- Log in on your GCloud CLI: `gcloud auth login`
- Make sure you have been added to the proper Google Cloud project.
- Create a default token using your GCloud user: `gcloud auth application-default login`
- Make sure you don't have any `GOOGLE_APPLICATION_CREDENTIALS` environment variable set.
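After these steps, a quick way to double-check that the application-default credentials resolve from Java is a tiny snippet like the one below (this helper is illustrative and not part of the repo):

```java
import com.google.auth.oauth2.GoogleCredentials;

import java.io.IOException;

public class CheckDefaultCredentials {
    public static void main(String[] args) {
        try {
            // Resolves the same application-default credentials the services use locally
            GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
            System.out.println("Application-default credentials found: " + credentials);
        } catch (IOException e) {
            // Thrown when no default credentials can be found;
            // rerun `gcloud auth application-default login` in that case
            System.err.println("No application-default credentials: " + e.getMessage());
        }
    }
}
```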