m-labs skill assessment

Setup

Create a virtual environment and activate it
pip install -r requirements.txt
Install gcloud CLI and set your application default credentials using gcloud auth application-default login.

Usage

Extract and Transform data

Get an API key from https://api.congress.gov/sign-up/ and set the API_KEY variable.
python fetch_bills.py
A file called bills.json will be created.

Create table

python create_bills_table.py
A table called bills with the appropriate schema should be created in bigquery

Load data

python load_to_bigquery.py
The data from bills.json should be loaded to the bills table in bigquery. If it fails, check the Job History console in bigquery for errors.

Improvements

To productionalize this little ETL pipeline, I'd need to find more information about how often this API is updated, and figure out where within the dataset to stop ingesting to prevent duplicates. There doesn't seem to be a primary key right now, so we would possibly have to construct one based off of the bill type and the bill number, plus a third factor accounting for the fact that the bill numbers reset every several years. I'd like to containerize this and run it daily using github actions, or airflow. I'd pass the API_KEY in as a secret. I'd also define the data schema centrally somewhere, to avoid having it written out in two different locations.

One of my queries had a CTE that signified whether the bill was sent to committee or not. I'd like to add more columns about each bill based off of the title and action text, for easier semantic analysis. The CASE WHEN separator I used is likely a flawed way to segment the bills, so I'd want to do more experimenting to see what's an acceptable method of categorizing them. We could theoretically use some sort of text classification with sklearn or another NLP technique, but there are only so many congressional committees, so doing it by hand would probably ultimately be more accurate.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
create_bills_table.py		create_bills_table.py
fetch_bills.py		fetch_bills.py
queries.sql		queries.sql
requirements.txt		requirements.txt
write_to_bigquery.py		write_to_bigquery.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

m-labs skill assessment

Setup

Usage

Extract and Transform data

Create table

Load data

Improvements

About

Releases

Packages

Languages

crystaljgong/mlabs-assessment

Folders and files

Latest commit

History

Repository files navigation

m-labs skill assessment

Setup

Usage

Extract and Transform data

Create table

Load data

Improvements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages