docs(project): misc
Henry Lee committed Jul 20, 2024
1 parent 364fc59 commit fc58f98
Showing 4 changed files with 84 additions and 84 deletions.
58 changes: 29 additions & 29 deletions README.md
@@ -3,16 +3,16 @@
![Python CI](https://github.com/pycontw/PyCon-ETL/workflows/Python%20CI/badge.svg)
![Docker Image CI](https://github.com/pycontw/PyCon-ETL/workflows/Docker%20Image%20CI/badge.svg)

Using Airflow to implement our ETL pipelines
Using Airflow to implement our ETL pipelines.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Install](#install)
- [Installation](#installation)
- [Configuration](#configuration)
- [BigQuery (Optional)](#bigquery-optional)
- [Run](#run)
- [Local environment with Docker](#local-environment-with-docker)
- [Running the Project](#running-the-project)
- [Local Environment with Docker](#local-environment-with-docker)
- [Production](#production)
- [Contact](#contact)

@@ -21,69 +21,69 @@ Using Airflow to implement our ETL pipelines
- [Python 3.8+](https://www.python.org/downloads/release/python-3811/)
- [Docker](https://docs.docker.com/get-docker/)
- [Git](https://git-scm.com/book/zh-tw/v2/%E9%96%8B%E5%A7%8B-Git-%E5%AE%89%E8%A3%9D%E6%95%99%E5%AD%B8)
- [Poetry](https://python-poetry.org/docs/#installation) (Optional, only for creating virtual environment when developing)
- [Poetry](https://python-poetry.org/docs/#installation) (Optional, only for creating virtual environments during development)

## Install
## Installation

Install local environment for development:
Install the local environment for development:

```bash
# use poetry to create a virtual environment
# Use poetry to create a virtual environment
poetry install

# or use pip install on user existed python environment
# if your got any airflow error, check constraints-3.8.txt and re-install airflow dependencies
# Or use pip to install in your existing Python environment
# If you encounter any Airflow errors, check constraints-3.8.txt and reinstall Airflow dependencies
pip install -r requirements.txt
```
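If Airflow itself ever needs to be reinstalled against the pinned versions, a minimal sketch (assuming the constraints file targets Airflow 1.10.13 on Python 3.8, as the Contributing Guide states):

```bash
# Reinstall Airflow with the pinned constraints (sketch; adjust the version if the constraints file changes)
pip install "apache-airflow==1.10.13" --constraint constraints-3.8.txt
```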

## Configuration

1. `cp .env.template .env.staging` for dev/test. `cp .env.template .env.production` instead if you are going to start a production instance.
1. For development or testing, run `cp .env.template .env.staging`. For production, run `cp .env.template .env.production`.

2. Follow the instructions in `.env.<staging|production>` and fill in your secrets.
If you are running the staging instance for development as a sandbox and not going to access any specific third-party service, leave the `.env.staging` as-is should be fine.
If you are running the staging instance for development as a sandbox and do not need to access any specific third-party services, leaving `.env.staging` as-is should be fine.

> Find the maintainer if you don't have those secrets.
> Contact the maintainer if you don't have these secrets.
> **⚠ WARNING: About .env**
> Please don't use the .env for local development, or it might screw up the production tables.
> Please do not use the .env file for local development, as it might affect the production tables.
### BigQuery (Optional)

Setup the Authentication of GCP: <https://googleapis.dev/python/google-api-core/latest/auth.html>
*After invoking `gcloud auth application-default login`, you'll get a credentials.json resides in `$HOME/.config/gcloud/application_default_credentials.json`. Invoke `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have it.
* service-account.json: Please contact @david30907d using email, telegram, or discord. No worry about this json if you are running the sandbox staging instance for development.
Set up the Authentication for GCP: <https://googleapis.dev/python/google-api-core/latest/auth.html>
* After running `gcloud auth application-default login`, you will get a credentials.json file located at `$HOME/.config/gcloud/application_default_credentials.json`. Run `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have it.
* service-account.json: Please contact @david30907d via email, Telegram, or Discord. You do not need this json file if you are running the sandbox staging instance for development.
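
For reference, the two options above combined into one sketch (the first path is the gcloud default, the second is a hypothetical location for the service-account key):

```bash
# Option 1: application-default credentials created by gcloud
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/gcloud/application_default_credentials.json"

# Option 2: point to the service-account key obtained from the maintainer (hypothetical path)
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```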

## Run
## Running the Project

If you are a developer 👨‍💻, please check [Contributing Guide](./docs/CONTRIBUTING.md).
If you are a developer 👨‍💻, please check the [Contributing Guide](./docs/CONTRIBUTING.md).

If you are a mantainer 👨‍🔧, please check [Maintenance Guide](./docs/MAINTENANCE.md).
If you are a maintainer 👨‍🔧, please check the [Maintenance Guide](./docs/MAINTENANCE.md).

### Local environment with Docker
### Local Environment with Docker

dev/test environment:
For development/testing:

```bash
# build the dev/test local image
# Build the local dev/test image
make build-dev

# first time setup, create airflow db volume
# Create the Airflow DB volume during the first setup
docker volume create --name=airflow-db-volume

# start dev/test services
# Start dev/test services
make deploy-dev

# stop dev/test services
# Stop dev/test services
make down-dev
```
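
Once `make deploy-dev` finishes, a quick sanity check (a sketch; the web UI port depends on the compose file, and 8080 is only the Airflow default):

```bash
# List the running containers started by the dev compose file
docker ps

# Hit the Airflow webserver health endpoint (assumes the default 8080 port mapping)
curl http://localhost:8080/health
```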

> Difference between production and dev/test compose files is dev/test compose file use local build image, and production compose file use the image from docker hub.
> The difference between production and dev/test compose files is that the dev/test compose file uses a locally built image, while the production compose file uses the image from Docker Hub.
### Production

Please check [Production Deployment Guide](./docs/DEPLOYMENT.md).
Please check the [Production Deployment Guide](./docs/DEPLOYMENT.md).

## Contact

[PyCon TW Volunteer Data Team - Discord](https://discord.com/channels/752904426057892052/900721883383758879)
[PyCon TW Volunteer Data Team - Discord](https://discord.com/channels/752904426057892052/900721883383758879)
59 changes: 30 additions & 29 deletions docs/CONTRIBUTING.md
@@ -1,37 +1,37 @@
# Contributing Guide

## How to Contribute
## How to Contribute to this Project

1. Clone this repository:

```bash
git clone https://github.com/pycontw/pycon-etl
```
```bash
git clone https://github.com/pycontw/pycon-etl
```

2. Create a new branch:

```bash
git checkout -b <branch-name>
```
```bash
git checkout -b <branch-name>
```

3. Make your changes.

> **NOTICE:** We are still using Airflow v1, so please read the official document [Apache Airflow v1.10.13 Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.13/) to make sure your changes are compatible with our current version.
> **NOTICE:** We are still using Airflow v1, so please read the official document [Apache Airflow v1.10.13 Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.13/) to ensure your changes are compatible with our current version.

If your task uses an external service, add the connection and variable in the Airflow UI.
If your task uses an external service, add the connection and variable in the Airflow UI.

4. Test your changes in your local environment:

- Test that the DAG file is loaded successfully.
- Test that the task is running successfully.
- Ensure your code is formatted and linted correctly.
- Check whether the necessary dependencies are included in `requirements.txt`.
- Ensure the DAG file is loaded successfully (see the CLI sketch after this list).
- Verify that the task runs successfully.
- Confirm that your code is correctly formatted and linted.
- Check that all necessary dependencies are included in `requirements.txt`.

5. Push your branch:

```bash
git push origin <branch-name>
```
```bash
git push origin <branch-name>
```

6. Create a Pull Request (PR).
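
For step 4, a sketch of checking a DAG locally with the Airflow 1.10 CLI (the DAG and task IDs below are hypothetical placeholders):

```bash
# Confirm the DAG files parse and are registered
airflow list_dags

# List the tasks inside a specific DAG
airflow list_tasks my_dag_id

# Run a single task for a given execution date without involving the scheduler
airflow test my_dag_id my_task_id 2024-07-20
```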

@@ -41,38 +41,39 @@ git checkout -b <branch-name>

## Release Management

Please use [GitLab Flow](https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/), otherwise, you cannot pass docker hub CI
Please use [GitLab Flow](https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/); otherwise, you cannot pass Docker Hub CI.

## Dependency Management

Please use poetry to manage dependencies
Airflow dependencies are managed by `requirements.txt` and `constraints-3.8.txt` via `pip`. It is not recommended to use `poetry` or other tools.

```bash
poetry add <package>
poetry remove <package>
```
`constraints-3.8.txt` is used to pin the version of the Airflow dependencies, and `requirements.txt` is used to install user-defined dependencies.

If you are using a new package, please update `requirements.txt` by running `make deps`.
Please add or update dependencies in `requirements.txt`. Do not modify `constraints-3.8.txt` unless Airflow is updated.
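
As an illustration of how the two files work together, a minimal sketch (the package name is hypothetical, and `requirements.txt` may already reference the constraints file internally):

```bash
# Add a user-defined dependency
echo "requests==2.31.0" >> requirements.txt  # hypothetical package and version

# Install it while keeping Airflow's transitive dependencies pinned
pip install -r requirements.txt -c constraints-3.8.txt
```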

For more information, refer to the [Airflow Installation Documentation](https://airflow.apache.org/docs/apache-airflow/1.10.13/installation.html).

## Code Convention

### Airflow DAG

- Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidline
- Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines.

- examples 1. `ods/opening_crawler`: Crawlers written by @Rain. Those openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy. 2. `ods/survey_cake`: A manually triggered uploader that would upload questionnaires to bigquery. The uploader should be invoked after we receive the surveycake questionnaire.
- Examples:
1. `ods/opening_crawler`: Crawlers written by @Rain. These openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
2. `ods/survey_cake`: A manually triggered uploader that uploads questionnaires to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire.

- table name convention:
- Table name convention:
![img](https://miro.medium.com/max/1400/1*bppuEKMnL9gFnvoRHUO8CQ.png)

### Format

Please use `make format` to format your code before commit, otherwise, the CI will fail.
Please use `make format` to format your code before committing; otherwise, the CI will fail.

### Commit Message

Recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).
It is recommended to use [Commitizen](https://commitizen-tools.github.io/commitizen/).
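
A minimal sketch of how that typically looks (the installation method is up to you):

```bash
# Install Commitizen, then let it build a conventional commit message interactively
pip install commitizen
cz commit
```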

### CI/CD

Please check [.github/workflows](.github/workflows) for details.
Please check the [.github/workflows](.github/workflows) directory for details.
24 changes: 7 additions & 17 deletions docs/DEPLOYMENT.md
@@ -1,31 +1,21 @@
# Deployment Guide

1. Login to the data team's server:
1. `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
2. service:
1. Run: `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
2. Services:
* ETL: `/home/zhangtaiwei/pycon-etl`
* btw, metabase is located here: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`
* Metabase is located here: `/mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server`

2. Pull the latest codebase to this server: `sudo git pull`

3. Add Credentials (only need to do once):
* Airflow:
* Connections:
* kktix_api: `conn_id=kktix_api`, `host` and `extra(header)` are confidential since its KKTIX's private endpoint. Please DM @GTB or data team's teammembers for these credentials.
* extra: `{"Authorization": "bearer xxx"}`
* klaviyo_api: `conn_id=klaviyo_api`, `host` is <https://a.klaviyo.com/api>
* Variables:
* KLAVIYO_KEY: Create from <https://www.klaviyo.com/account#api-keys-tab>
* KLAVIYO_LIST_ID: Create from <https://www.klaviyo.com/lists>
* KLAVIYO_CAMPAIGN_ID: Create from <https://www.klaviyo.com/campaigns>
* kktix_events_endpoint: url path of kktix's `hosting_events`, ask @gtb for details!
3. Add credentials to the `.env` file (only needs to be done once).

4. Start the services:

```bash
# start production services
# Start production services
make deploy-prod

# stop production services
# Stop production services
# make down-prod
```
```
27 changes: 18 additions & 9 deletions docs/MAINTENANCE.md
@@ -4,17 +4,26 @@

Currently, the disk space is limited, so please check the disk space before running any ETL jobs.

Will deprecate this if we don't bump into out-of-disk issue any more.
This section will be deprecated if we no longer encounter out-of-disk issues.

1. Find topk biggest folders: `du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20`
2. Show the folder size: `du -hs xxxx`
3. delete those pretty big folder
4. `df -h`
1. Find the largest folders:
```bash
du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
```
2. Show the folder size:
```bash
du -hs xxxx
```
3. Delete the large folders identified.
4. Check disk space:
```bash
df -h
```
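
As an alternative to deleting `overlay2` folders by hand, Docker's own cleanup command is usually safer (a sketch; review the confirmation prompt first, since stopped containers and dangling images will be removed):

```bash
# Remove stopped containers, dangling images, unused networks, and build cache
docker system prune
```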

## Token expiration
## Token Expiration

Some api tokens might expire, please check.
Some API tokens might expire, so please check them regularly.

## Year to Year Jobs
## Year-to-Year Jobs

Please refer [Dev Data Team - Year to Year Jobs - HackMD](https://hackmd.io/R417olqPQSWnQYY1Oc_-Sw?view) for more details.
Please refer to [Dev Data Team - Year to Year Jobs - HackMD](https://hackmd.io/R417olqPQSWnQYY1Oc_-Sw?view) for more details.
