Data Engineering Challenge for Company A
Build an Extract, Transform, Load (ETL) pipeline that will take this data and move it into a production-ready database that can be easily queried to answer questions for the business.
Additionally, the infrastructure required for hosting and managing the ETL pipeline, including networking and storage, will be provisioned using Terraform.
Before setting up the environment, ensure that you have the following tools installed and configured:
- **AWS CLI**: Install the AWS Command Line Interface (AWS CLI) by following the instructions here. After installation, configure the AWS CLI with your credentials by running `aws configure`.
- **Terraform**: Terraform is used to provision the AWS infrastructure required for the project. You can install Terraform by following the instructions here. After installation, verify that Terraform is installed correctly by running `terraform --version`.
- **Python**: Ensure you have Python 3.x installed to execute the validation scripts for environment variables. You can download and install Python here.
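Before moving on, it is worth confirming that the tooling is available and that your AWS credentials work. The snippet below is a quick sanity check, not part of the project's scripts:

```bash
# Confirm the required tools are on the PATH
aws --version
terraform --version
python3 --version

# Verify that the configured AWS credentials are valid
# (prints the account ID and caller ARN)
aws sts get-caller-identity
```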
This project requires specific AWS environment variables to be set before running the ETL pipeline and the infrastructure provisioning process. These variables are needed for authenticating with AWS services and for ensuring that the pipeline and infrastructure run in the correct environment.
The following variables need to be configured:
- `AWS_PROFILE`: The AWS profile to be used for authentication.
- `AWS_DEFAULT_REGION`: The AWS region where the services will be hosted.
- `ENVIRONMENT`: The environment (e.g., development, staging, production) for which the pipeline and infrastructure are being deployed.
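For reference, exporting these variables by hand looks like the sketch below; the values are placeholders, and `make env` (described next) automates this and persists the exports to `~/.bashrc`:

```bash
# Example values only; substitute your own profile, region, and environment
export AWS_PROFILE=default
export AWS_DEFAULT_REGION=eu-west-1
export ENVIRONMENT=development
```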
The `Makefile` includes commands to automate setting up the environment, provisioning the infrastructure with Terraform, and deploying the Serverless application.
- `make env`: Runs a script to set the AWS environment variables (`AWS_PROFILE`, `AWS_DEFAULT_REGION`, `ENVIRONMENT`) and persists them to `~/.bashrc`.
- `make check`: Verifies that the required environment variables are set, using a Python script.
- `make terraform-init`: Initializes Terraform, downloading providers and setting up your environment for infrastructure provisioning.
- `make terraform-plan`: Runs Terraform's `plan` command to preview the infrastructure that will be created.
- `make terraform-apply`: Applies the Terraform plan to create the infrastructure automatically.
- `make terraform-destroy`: Destroys the Terraform-managed infrastructure automatically.
- `make sls-deploy`: Deploys the Serverless application.
- `make sls-remove`: Removes the Serverless application.
- `make all`: Runs the entire setup process in one step: it configures the environment variables, verifies them, initializes Terraform, plans the infrastructure, and applies it.
- `make clean`: Destroys all infrastructure created by Terraform and removes the Serverless application.
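Taken together, a typical end-to-end run of these targets looks roughly like the sketch below (`make all` bundles most of these steps into a single command):

```bash
make env              # write AWS_PROFILE, AWS_DEFAULT_REGION, ENVIRONMENT to ~/.bashrc
source ~/.bashrc      # load the variables into the current shell
make check            # validate the variables with the Python script
make terraform-init   # download providers and initialise Terraform
make terraform-plan   # preview the infrastructure changes
make terraform-apply  # create the S3 bucket, VPC, and networking
make sls-deploy       # deploy the ETL Lambda once the infrastructure exists

# When you are done, tear everything down:
make clean
```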
When generating sensitive information such as database passwords with Terraform (as this configuration does with the `random_password` resource), the generated values are stored in the Terraform state file (`tfstate`). If this state file is stored locally, there is a risk it could inadvertently be exposed if committed to version control, such as GitHub. To mitigate this risk, it is highly recommended to use a remote backend (such as AWS S3 with encryption and proper access control) to store the state file securely.
For more information on securely storing sensitive data in the Terraform state file, please refer to Terraform's documentation on sensitive data in state.
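As an illustration only (the bucket, key, and region below are placeholders, and this assumes an empty `backend "s3" {}` block is declared in the Terraform configuration), local state can be kept out of Git and the remote backend supplied at init time:

```bash
# Keep local state files out of version control until a remote backend is in place
echo "*.tfstate" >> .gitignore
echo "*.tfstate.backup" >> .gitignore

# Initialise Terraform against an encrypted S3 backend
# (requires a `backend "s3" {}` block in the configuration; values are placeholders)
terraform init \
  -backend-config="bucket=my-terraform-state-bucket" \
  -backend-config="key=etl/terraform.tfstate" \
  -backend-config="region=eu-west-1" \
  -backend-config="encrypt=true"
```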
After running `make env` or `make all`, the environment variables are saved to `~/.bashrc` so that they persist across terminal sessions. However, these variables might not be immediately available in your current session.
- Option 1: Restart your terminal session to apply the changes.
- Option 2: Manually source the updated `~/.bashrc` file by running `source ~/.bashrc`.
Failure to do this may result in the environment variables not being available in the current terminal session.
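A quick way to confirm the variables are visible in the current shell:

```bash
source ~/.bashrc
echo "$AWS_PROFILE $AWS_DEFAULT_REGION $ENVIRONMENT"
# All three values should be printed; an empty field means a variable is not set
```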
The infrastructure for this project is managed using Terraform. This setup includes the creation of an Amazon S3 bucket for storage, a VPC, subnets, and other necessary networking components. The infrastructure code is located in the `terraform/` directory.
The key configuration files for Terraform are:
- `resource.tf`: Defines the S3 bucket and related configurations, such as server-side encryption and object storage.
- `networks.tf`: Contains the configuration for the Virtual Private Cloud (VPC), public subnets, Internet Gateway (IGW), and route tables.
- `local.tf`: Defines local variables that are used across the Terraform configuration files, including common tags, network CIDR blocks, and bucket names.
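After `make terraform-apply` has run, you can confirm that the S3 bucket, VPC, subnets, and related networking were created by inspecting the state from the `terraform/` directory. The resource address in the last command is a placeholder; use one of the addresses printed by `terraform state list`:

```bash
cd terraform

# List every resource tracked in the Terraform state
terraform state list

# Show the attributes of a single resource (address is an example)
terraform state show aws_s3_bucket.this
```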
This project includes a Serverless Framework configuration to deploy an ETL Lambda function. After setting up the environment variables and provisioning the AWS infrastructure with Terraform, you can deploy the Serverless application.
Ensure the environment variables are correctly set by running `make env`.
- Install the Serverless Framework: `npm install -g serverless`
- Navigate to the `src/` directory: `cd src`
- Deploy the Serverless application to AWS: `serverless deploy --stage dev`
Alternatively, you can use the following `Makefile` command to handle deployment after Terraform has completed: `make sls-deploy`. This command deploys the Serverless Framework configuration automatically, making the ETL Lambda function available.
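Once deployed, the Serverless Framework can report what was created. The stage matches the `--stage dev` used above; the function name passed to `serverless logs` is a placeholder, so substitute the one listed by `serverless info`:

```bash
cd src

# Summarise the deployed service: stack, region, and functions
serverless info --stage dev

# Tail CloudWatch logs for the ETL function (function name is an example)
serverless logs --function etl --stage dev --tail
```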