The goal of this project is to build an end-to-end batch data pipeline to perform ELT (Extract Load Transform) oncountrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. Data available at US_Accidents_March23. And perform ELT daily in order to analyse the casualty analysis, accidents distribution from historical data to till March 2023(the dataset will no longer be updated, and this version should be considered the latest).
- Overview
- Data Pipeline Overview
- Technologies
- Architecture
- ELT Steps
- The Dashboard
- Reproduction
- References & Resources
-
The Dataset selected for this project is the
A Countrywide Traffic Accident Dataset (2016 - 2023)
obtained from Kaggle. This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records. For more information about this dataset, please visit here.The columns in the Dataset and their description is available here
Data will be extracted from the source, and an end-to-end data pipeline will be constructed. The process includes creating an end-to-end pipeline, performing batch data processing, ingesting and storing data in a data lake, a pipeline to analytical warehouse and building interactive dashboards for visualization.
- Cloud: GCP
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Mage
- Data Warehouse: BigQuery
- Data Transformation: dbt-cloud
- Dashboard: Looker Studio
Steps in the ELT are as follows:
-
A Project is created on GCP
-
Infrastructure for the Project is created using Terraform which creates the following:
-
Data lake : Google Cloud Storage Bucket where the raw and cleaned-partitioned data will be stored
-
Data Warehouse: Three Datasets on BigQuery
-
Using docker image for Mage
Note: Lastest Mage image on Google Cloud Run can't upload large Local file or able to be mounted with a local file with it(not to my knowledge). So I host my local Mage with my file to upload it to GCS
-
-
Load data from BigQuerry to DBT for transformation and deployment
- ELT PIPELINE VISUALIZATION WITH MAGE AS ORCHESTRATION
Local CSV to GCS
GCS to BigQuerry
DBT
The dashboard is accessible from here
IMPORTANT: due to the csv being a bit large, I recomended using sample file of the dataset, I also built the pipeline for the sample.csv before implement the real dataset.
Set up GCP account
Install terraform cli with this
Install Docker via this link
Create your project, go to IAM and then Service account for
Download the json file for credential and authentification
Terraform is infrastructure as cloud for Mage
Note: Lastest Mage image on Google Cloud Run can't upload large Local file or able to be mounted with a local file with it(not to my knowledge). So I host my local Mage with my file to upload it to GCS, then use cloud run to load data from gcs to bigquerry
prerequisite google cloud CLI google cloud permissions Mage Terraform templates
Download Mage template:
git clone https://github.com/mage-ai/mage-ai-terraform-templates.git
go to gcp directory, and change the variables name using this video: part1,
modify the variable in variables.tf
variable "project_id" {
type = string
description = "The name of the project"
default = "my-data-project-13837"
}
Make sure the following APIs are enabled in this link Cloud SQL Admin API Cloud Filestore API Serverless VPC Access API
in the directory mage-ai-terraform-templates/gcp$
run
terraform init
terraform apply
Redeploy and set higher time out for cloud run
When you are done with this Mage instance, then run terraform-destroy
to avoid your free credit run out
run
git clone https://github.com/mage-ai/mage-zoomcamp.git mage-zoomcamp
cd into the mage directory. Rename dev.env to .env
run
pull mageai/mageai:latest
docker-compose build
docker-compose up
Navigate to http://localhost:6789 in your browser to access mage application
go to io_config.yaml and modify it by link it to your service account json file
If you on cloud run: go to the file strucure and right click to upload the json file
Implement the code in Mage floder to create the pipeline like here
Create a free account on dbt cloud Create a project with this. This include setting the role for the service account earlier, enable API and add deploy key to your github repo
Since you should be working on your sample files, the model should be like this
then add nightly deployment to have the transformed data in prod, to do step by step, following this video
go to (Looker Studio)[https://lookerstudio.google.com/u/0/navigation/reporting]
choose a new report, choose data source: BigQuerry and use the data in the prod schema
-Make some visualizations to get insights from the data
- Since Looker geo data must be in the format
{latitude},{longtiude}
. I create a new field call: geo_vaild:CONCAT('start_lat',',','start_long')
I would like to acknowledge the contributions of the following works:
- Moosavi, Sobhan, et al. "A Countrywide Traffic Accident Dataset.", 2019.
- Moosavi, Sobhan, et al. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.