Project US Accidents (2016 - 2023)


Overview

The goal of this project is to build an end-to-end batch data pipeline that performs ELT (Extract, Load, Transform) on a countrywide car accident dataset covering 49 states of the USA. The accident data were collected from February 2016 to March 2023 using multiple APIs that provide streaming traffic incident (or event) data; the data are available at US_Accidents_March23. The pipeline runs ELT daily to analyse casualties and the distribution of accidents across the historical data up to March 2023 (the dataset will no longer be updated, so this version should be considered the latest).


Problem statement


  • Dataset:

    The dataset selected for this project is "A Countrywide Traffic Accident Dataset (2016 - 2023)", obtained from Kaggle. This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset contains approximately 7.7 million accident records. For more information about this dataset, please visit here.

    The columns in the dataset and their descriptions are available here.

  • Objective:


Data will be extracted from the source and processed in batch through an end-to-end pipeline: the raw data is ingested and stored in a data lake, loaded into an analytical data warehouse, transformed there, and finally served through interactive dashboards for visualization.

Data Pipeline Overview

Technologies


  • Cloud: GCP
  • Infrastructure as code (IaC): Terraform
  • Workflow orchestration: Mage
  • Data Warehouse: BigQuery
  • Data Transformation: dbt-cloud
  • Dashboard: Looker Studio

Architecture


Architecture diagram (US_car_crash.drawio)

ELT Steps

Steps in the ELT are as follows:

  1. A Project is created on GCP

  2. Infrastructure for the project is created using Terraform, which provisions the following:

    • Data lake: a Google Cloud Storage bucket where the raw and cleaned, partitioned data will be stored

    • Data warehouse: three datasets on BigQuery

    • Mage, run from its Docker image

      Note: the latest Mage image on Google Cloud Run cannot upload a large local file and (to my knowledge) cannot be mounted with one, so I host Mage locally with the file and upload it to GCS from there.

  3. Data in BigQuery is loaded into dbt for transformation and deployment

    • ELT pipeline visualization, with Mage as the orchestrator

Local CSV to GCS

image
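For reference, here is a minimal sketch of the exporter block for this stage, based on Mage's standard Google Cloud Storage exporter template (not the exact project code). The bucket name and object key are placeholders, and the DataFrame is assumed to come from an upstream loader block that reads the local CSV.

from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_csv_to_gcs(df: DataFrame, **kwargs) -> None:
    # Credentials are read from io_config.yaml (see the Mage setup step below).
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    # Placeholder names: point these at the data lake bucket created by Terraform.
    bucket_name = 'your-gcs-bucket'
    object_key = 'us_accidents/sample.csv'

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )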

GCS to BigQuery

image
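Likewise, a rough sketch of the load into BigQuery using Mage's standard BigQuery exporter template; the table_id below is a placeholder and should point at one of the datasets created by Terraform.

from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    # Placeholder project.dataset.table; use a dataset created by Terraform.
    table_id = 'your-project-id.us_accidents_raw.accidents'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table on each run; dbt builds the downstream models
    )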

DBT

image

The Dashboard:


The dashboard is accessible from here

image

Reproduction:


IMPORTANT: because the CSV is fairly large, I recommend using a sample file of the dataset; I also built the pipeline against sample.csv before implementing it on the full dataset.
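For example, a minimal pandas sketch for cutting a sample from the full Kaggle export (the file names and the 100,000-row size are assumptions, not part of the original project):

import pandas as pd

# Assumed file names; adjust to wherever you downloaded the Kaggle export.
SOURCE = 'US_Accidents_March23.csv'
SAMPLE = 'sample.csv'
SAMPLE_ROWS = 100_000  # enough rows to exercise the pipeline without the full ~7.7M

# Read only the first SAMPLE_ROWS rows and write them out as the sample file.
pd.read_csv(SOURCE, nrows=SAMPLE_ROWS).to_csv(SAMPLE, index=False)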

1 Pre-requisites

Set up GCP account

Install the Terraform CLI with this

Install Docker via this link

2 Google Cloud Platform (GCP)

Create your project, go to IAM and then Service Accounts, and create a service account for each of:

Mage

image

DBT

image

Download the JSON key file for credentials and authentication

3 Terraform

Terraform is used as infrastructure as code to deploy Mage

Note: the latest Mage image on Google Cloud Run cannot upload a large local file and (to my knowledge) cannot be mounted with one, so I host Mage locally with the file to upload it to GCS, then use Cloud Run to load the data from GCS to BigQuery.

Prerequisites: Google Cloud CLI, Google Cloud permissions, Mage Terraform templates

Download the Mage Terraform templates: git clone https://github.com/mage-ai/mage-ai-terraform-templates.git

Go to the gcp directory and change the variable names, following this video: part1.

Modify the variables in variables.tf:

variable "project_id" {
  type        = string
  description = "The name of the project"
  default     = "my-data-project-13837"
}

Make sure the following APIs are enabled (via this link): Cloud SQL Admin API, Cloud Filestore API, Serverless VPC Access API

In the mage-ai-terraform-templates/gcp directory, run:

terraform init
terraform apply

Set up for Cloud Run

part2

Redeploy and set a higher timeout for Cloud Run

When you are done with this Mage instance, run terraform destroy so that your free credits do not run out

4 Mage

run

git clone https://github.com/mage-ai/mage-zoomcamp.git mage-zoomcamp

then cd into the mage-zoomcamp directory and rename dev.env to .env

run

docker pull mageai/mageai:latest

docker-compose build
docker-compose up

Navigate to http://localhost:6789 in your browser to access the Mage application

Go to io_config.yaml and modify it to point to your service account JSON key file

If you are on Cloud Run: go to the file structure and right-click to upload the JSON key file

Implement the code in the Mage folder to create the pipeline, as shown here (see also the block sketches under ELT Steps above)

5 DBT

Create a free account on dbt Cloud and create a project with this. This includes setting the role for the service account created earlier, enabling the API, and adding a deploy key to your GitHub repo.

Since you should be working on your sample files, the model should look like this:

image

Then add a nightly deployment job so the transformed data lands in prod; for a step-by-step guide, follow this video.

6 Looker Studio

Go to Looker Studio (https://lookerstudio.google.com/u/0/navigation/reporting)

Choose a new report, choose BigQuery as the data source, and use the data in the prod schema

Make some visualizations to get insights from the data

  • Since Looker Studio geo data must be in the format {latitude},{longitude}, I create a new calculated field called geo_valid: CONCAT(start_lat, ",", start_long)

References & Resources


Back To Top

Acknowledgements

I would like to acknowledge the contributions of the following works:

  • Moosavi, Sobhan, et al. "A Countrywide Traffic Accident Dataset.", 2019.
  • Moosavi, Sobhan, et al. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
