This code implements an ETL (Extract, Transform, Load) process for global commodity trade data. It extracts trade data from a PostgreSQL database, along with country data from a JSON file and product code data from a CSV file. It then transforms the country and product code data to make them consistent with the trade data and merges all the data into a single DataFrame. Next, it creates dimensions for quantity, flow, and year and merges them with the trade data. The final output, a clean DataFrame of global commodity trade data, is the fact table: it contains the measurable facts about each trade transaction, such as the quantity, value, and weight of the commodities traded, as well as the trade flow direction and the time period of the trade. The data is then ready to be loaded into a data warehouse for further analysis.
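As a rough sketch of the extract-and-merge stage, the logic looks something like the following. The file names, column names, and password are illustrative assumptions; the actual code is in etl.ipynb:

import json

import pandas as pd
from sqlalchemy import create_engine

# Extract: trade facts from PostgreSQL, lookup data from flat files.
engine = create_engine("postgresql://my_user:my_password@localhost:5432/etl_python")
trades = pd.read_sql("SELECT * FROM trades", engine)  # hypothetical table name

with open("countries.json") as f:          # hypothetical file name
    countries = pd.json_normalize(json.load(f))
products = pd.read_csv("product_codes.csv")  # hypothetical file name

# Transform: align the lookup keys with the trade data (hypothetical columns).
countries = countries.rename(columns={"alpha-3": "country_code"})
products["code"] = products["code"].astype(str).str.zfill(6)

# Merge everything into the single DataFrame that becomes the fact table.
df = trades.merge(countries, on="country_code", how="left")
df = df.merge(products, left_on="commodity_code", right_on="code", how="left")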
Python 3.10, Pentaho, Docker
-
Clone or download the repository:
git clone https://github.com/betofleitass/etl_python_pentaho
-
Go to the project directory:
cd etl_python_pentaho
-
Create a virtual environment:
PowerShell:
python -m venv venv
venv\Scripts\Activate.ps1
Linux:
python3 -m venv venv
source venv/bin/activate
-
Install dependencies:
pip install -r requirements.txt
-
This project requires Docker:
Get Docker
-
Go to the project directory:
cd etl_python_pentaho
-
Run the container:
docker-compose -f docker-compose.yml up
-
Connect to the database:
docker exec -it postgres psql -U my_user -d etl_python --password
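The same connection can also be made from Python, which is handy for the notebook later. A minimal check, using the credentials from the psql command above (the password is set in docker-compose.yml and is an assumption here):

import psycopg2

# Adjust the password to match your docker-compose.yml.
conn = psycopg2.connect(host="localhost", port=5432, user="my_user",
                        password="my_password", dbname="etl_python")
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()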
-
Run the script (it may take up to 20 minutes):
\i /sources/public_trades.sql
-
Create a new database for the transformed data:
create database trades;
-
Change to the new database:
\c trades
-
Run the script to create the tables:
\i /sources/script_tables.sql
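For orientation, the star schema that script_tables.sql sets up can be pictured roughly as in this SQLAlchemy sketch. The table and column names are hypothetical, based on the dimensions described in the overview; the SQL script in the repository is authoritative:

from sqlalchemy import (Column, Float, ForeignKey, Integer, MetaData,
                        String, Table)

metadata = MetaData()

# Dimension tables for flow, year, and quantity (names are illustrative).
dim_flow = Table("dim_flow", metadata,
                 Column("flow_id", Integer, primary_key=True),
                 Column("flow", String))
dim_year = Table("dim_year", metadata,
                 Column("year_id", Integer, primary_key=True),
                 Column("year", Integer))
dim_quantity = Table("dim_quantity", metadata,
                     Column("quantity_id", Integer, primary_key=True),
                     Column("quantity_name", String))

# Fact table holding the measures: quantity, value, and weight.
fact_trades = Table("fact_trades", metadata,
                    Column("trade_id", Integer, primary_key=True),
                    Column("flow_id", Integer, ForeignKey("dim_flow.flow_id")),
                    Column("year_id", Integer, ForeignKey("dim_year.year_id")),
                    Column("quantity_id", Integer,
                           ForeignKey("dim_quantity.quantity_id")),
                    Column("quantity", Float),
                    Column("value", Float),
                    Column("weight", Float))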
-
Create a new database for Pentaho:
create database trades_pentaho;
-
Change to the new database:
\c trades_pentaho
-
Run the script to create the tables:
\i /sources/script_tables.sql
-
There are two ways to execute the ETL process.
The first is to run the notebook etl.ipynb, which contains the code along with instructions on how to run it.
The second is to import the Pentaho file and execute it.
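With either option, the load stage amounts to writing the dimensions and the fact table into the trades database. A minimal pandas sketch, with illustrative column and table names and an assumed password:

import pandas as pd
from sqlalchemy import create_engine

# Target database created earlier in this guide.
engine = create_engine("postgresql://my_user:my_password@localhost:5432/trades")

# Stand-in for the clean DataFrame produced by the transform stage.
df = pd.DataFrame({
    "year": [2019, 2020],
    "flow": ["Export", "Import"],
    "quantity": [1200.0, 850.0],
    "value": [30500.0, 21300.0],
    "weight": [400.0, 310.0],
})

# Load one dimension and the fact table; if_exists="append" keeps existing rows.
df[["year"]].drop_duplicates().to_sql("dim_year", engine,
                                      if_exists="append", index=False)
df.to_sql("fact_trades", engine, if_exists="append", index=False)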
- Fork this repository
- Create a branch with your feature:
git checkout -b my-feature
- Commit your changes:
git commit -m "feat: my new feature"
- Push to your branch:
git push origin my-feature
This project is under the MIT License.