New notebooks #21

Open
wants to merge 24 commits into
base: main
dc7e289
Added data processing workflow and (unimplemented) processing script
pmberg Aug 2, 2024
2bd2a54
Fixed setup.py installation issues with torch
pmberg Aug 2, 2024
6367ec8
Added a gitignore to exclude items in root starting with the string t…
pmberg Aug 7, 2024
8b6e3e7
Now ignores CSV files in root
pmberg Aug 7, 2024
e310f49
Wrote a script to process data and post it to the reweight repo
pmberg Aug 7, 2024
1849351
Reformatted code
pmberg Aug 7, 2024
5513fbe
Added Microsimulation lines to process_data
pmberg Aug 7, 2024
6bfab02
Reworked env in YAML file
pmberg Aug 7, 2024
1c2fa04
Merge branch 'main' of https://github.com/PolicyEngine/reweight into …
pmberg Aug 13, 2024
bb40b67
Add sketch of condensed code
nikhilwoodruff Aug 13, 2024
50a4081
Merge branch 'process-data' of https://github.com/PolicyEngine/reweig…
pmberg Aug 13, 2024
a96ac6a
Refactored process_data, splitting repeated code into two functions.
pmberg Aug 13, 2024
78be5c3
Reformatted process_data
pmberg Aug 13, 2024
9e1120d
Update reweight
pmberg Aug 13, 2024
ea5169f
Added scripts for PyPI publication.
pmberg Aug 14, 2024
3b6f0ab
Fixed pyproject.toml typo
pmberg Aug 14, 2024
dfc69af
Removed excess information from pyproject.toml
pmberg Aug 14, 2024
44c57a6
Added manual activation to the publish.yaml action
pmberg Aug 14, 2024
6afdc48
Fixed workflow-dispatch
pmberg Aug 14, 2024
d444456
Fixed workflow_dispatch
pmberg Aug 14, 2024
b4e6f93
Added new notebooks to documentation
pmberg Aug 15, 2024
fce6e81
Updated README files
pmberg Aug 15, 2024
06d481b
Reworked setup scripts to match PolicyEngine format
pmberg Aug 16, 2024
ce4d7db
Fixed a typo in setup.py
pmberg Aug 16, 2024
32 changes: 32 additions & 0 deletions .github/workflows/publish.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Publish to PyPI.org
on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      reason:
        description: "Reason for manual trigger"
        required: true
        default: "Testing workflow"
jobs:
  pypi:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Install package
        run: make install
      - name: Build package
        run: make
      - name: Publish a Python distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          user: __token__
          password: ${{ secrets.PYPI }}
          skip-existing: true
24 changes: 0 additions & 24 deletions .github/workflows/push.yaml
@@ -47,27 +47,3 @@ jobs:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          BRANCH: gh-pages
          FOLDER: docs/_build/html
  Publish:
    runs-on: ubuntu-latest
    if: |
      (github.repository == 'PolicyEngine/reweight')
      && (github.event.head_commit.message == 'Update reweight')
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Publish a git tag
        run: ".github/publish-git-tag.sh || true"
      - name: Install package
        run: make install
      - name: Build package
        run: make
      - name: Publish a Python distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          user: __token__
          password: ${{ secrets.PYPI }}
          skip-existing: true
28 changes: 28 additions & 0 deletions .github/workflows/schedule.yaml
@@ -0,0 +1,28 @@
name: Scheduled Data Processing

on:
  schedule:
    - cron: "0 0 1 * *" # Runs at 00:00 on the first day of every month
  push:
    branches: [main] # Runs on pushes to the main branch
  pull_request:
    branches: [main] # Runs on pull requests to the main branch

jobs:
  process_data:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Install dependencies
        run: make install
      - name: Run data processing script
        run: python reweight/logic/process_data.py
        env:
          POVERTYTRACKER_RAW_URL: ${{ secrets.POVERTYTRACKER_RAW_URL }}
          POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN: ${{ secrets.POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN }}
          API_GITHUB_TOKEN: ${{ secrets.API_GITHUB_TOKEN }}
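The data processing step depends on three secrets being exposed as environment variables. Below is a minimal sketch of a fail-fast check, assuming the script reads them from the environment under the same names as in the workflow. The helper `missing_env_vars` is hypothetical and not part of the repository.

```python
import os

# Names copied from the workflow's env block. Whether process_data.py reads
# all three under exactly these names is an assumption.
REQUIRED_ENV_VARS = (
    "POVERTYTRACKER_RAW_URL",
    "POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN",
    "API_GITHUB_TOKEN",
)


def missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running a check like this before any network call would turn a missing secret into a clear message rather than a failure partway through the run.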
7 changes: 6 additions & 1 deletion .gitignore
@@ -20,4 +20,9 @@ docs/_build

# Testing notebooks #
#####################
/*.ipynb
/*.ipynb
/test_*

# Temporary CSV files #
#######################
/*.csv
2 changes: 1 addition & 1 deletion README.md
@@ -1,3 +1,3 @@
# reweight

This library will contain logic for consistently reweighting survey data across the PolicyEngine simulation sofware.
This library consistently reweights survey data across the PolicyEngine simulation software. It includes a function called `reweight` and a script called `process_data.py` that runs `reweight` on PolicyEngine data.
2 changes: 2 additions & 0 deletions docs/_toc.yml
@@ -2,5 +2,7 @@ format: jb-book
root: index.md
chapters:
- file: current-features
- file: features/reweight
- file: features/process_data
- file: testing_notebooks/us-notebook
- file: testing_notebooks/uk-notebook
38 changes: 38 additions & 0 deletions docs/features/process_data.ipynb
@@ -0,0 +1,38 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The process_data file\n",
"\n",
"The process_data file in this repo is used to process, once monthly, survey data from PolicyEngine UK and PolicyEngine US, using the reweight function.\n",
"\n",
"## generate_country_weights\n",
"\n",
"This is a helper function that uses reweight to generate optimized weights for a specific country and year.\n",
"\n",
"## generate_country_csv\n",
"\n",
"This is a helper function that generates optimized weights for a country over multiple years, and then saves these weights as a CSV file.\n",
"\n",
"## Main body of code\n",
"\n",
"First, `generate_country_csv` is used to generate weights files for both the UK and the US. Then, a GitHub release is generated on the reweight repo, to which the two CSV files are uploaded with a simple helper function called `upload_file`.\n",
"\n",
"## Notes\n",
"\n",
"If you're developing on this, replace \"pmberg\" with your username, and make an environment variable titled API_GITHUB_TOKEN containing an appropriate GitHub API token.\n",
"\n",
"Also, the UK data sources are not publicly available, so if you're developing on this, you need authorization to get an API key that works with them. If you lack the necessary permissions at any stage, the code will not run."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
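The notebook above describes `generate_country_csv` only in prose. The following sketch shows one way a "weights per year, saved as CSV" helper could look, using only the standard library. Everything in it is an assumption made for illustration: the function names, the column layout, and the placeholder weights are hypothetical, and the actual script may organize its output differently.

```python
import csv
import io


def placeholder_weights(country, year, n_households=3):
    """Hypothetical stand-in for generate_country_weights; returns dummy weights."""
    return [1.0] * n_households


def write_weights_csv(country, years, out, weight_fn=placeholder_weights):
    """Write one column of weights per year to the file-like object `out`."""
    columns = [weight_fn(country, year) for year in years]
    writer = csv.writer(out)
    # Header layout (index column plus one column per year) is an assumption.
    writer.writerow(["household_index"] + [str(year) for year in years])
    # zip(*columns) turns per-year columns into per-household rows.
    for index, row in enumerate(zip(*columns)):
        writer.writerow([index, *row])


# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
write_weights_csv("uk", [2024, 2025], buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])
```

Passing the weight function as a parameter keeps the CSV layout testable without access to the restricted UK data.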
233 changes: 233 additions & 0 deletions docs/features/reweight.ipynb
@@ -0,0 +1,233 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reweighting Function Documentation\n",
"\n",
"## Purpose\n",
"\n",
"This notebook documents a Python function `reweight` that adjusts a set of initial weights to better match target statistics. It's particularly useful for calibrating survey data weights in microsimulation models, such as those used in PolicyEngine UK.\n",
"\n",
"## Import Required Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import torch\n",
"from torch.utils.tensorboard import SummaryWriter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Function Definition and Overview\n",
"\n",
"The reweight function uses an optimization process to adjust initial weights so that the weighted sum of estimates more closely matches a set of target values.\n",
"### Parameters\n",
"\n",
"`initial_weights (torch.Tensor):` Initial weights for survey data.\n",
"\n",
"`estimate_matrix (torch.Tensor):` Matrix of estimates from a microsimulation model.\n",
"\n",
"`target_names (iterable):` Names of target statistics (not used in the function body).\n",
"\n",
"`target_values (torch.Tensor):` Values of target statistics to match.\n",
"\n",
"`epochs (int, optional):` Number of optimization iterations. Default is 1000.\n",
"\n",
"`epoch_step (int, optional):` Interval for printing loss during optimization. Default is 100.\n",
"\n",
"### Returns\n",
"\n",
"`final_weights (torch.Tensor):` Adjusted weights after optimization."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def reweight(\n",
" initial_weights,\n",
" estimate_matrix,\n",
" target_names,\n",
" target_values,\n",
" epochs=1000,\n",
" epoch_step=100,\n",
"):\n",
" \"\"\"\n",
" Main reweighting function, suitable for PolicyEngine UK use (PolicyEngine US use and testing TK)\n",
"\n",
" To avoid the need for equivalisation factors, use relative error:\n",
" |predicted - actual|/actual\n",
"\n",
" Parameters:\n",
" household_weights (torch.Tensor): The initial weights given to survey data, which are to be\n",
" adjusted by this function.\n",
" estimate_matrix (torch.Tensor): A large matrix of estimates, obtained from e.g. a PolicyEngine\n",
" Microsimulation instance.\n",
" target_names (iterable): The names of a set of target statistics treated as ground truth.\n",
" target_values (torch.Tensor): The values of these target statistics.\n",
" epochs: The number of iterations that the optimization loop should run for.\n",
" epoch_step: The interval at which to print the loss during the optimization loop.\n",
"\n",
" Returns:\n",
" final_weights: a reweighted set of household weights, obtained through an optimization process\n",
" over mean squared errors with respect to the target values.\n",
" \"\"\"\n",
" # Initialize a TensorBoard writer\n",
" writer = SummaryWriter()\n",
"\n",
" # Create a Torch tensor of log weights\n",
" log_weights = torch.log(initial_weights)\n",
" log_weights.requires_grad_()\n",
"\n",
" # estimate_matrix (cross) exp(log_weights) = target_values\n",
"\n",
" optimizer = torch.optim.Adam([log_weights])\n",
"\n",
" # Report the initial loss:\n",
" targets_estimate = torch.exp(log_weights) @ estimate_matrix\n",
" # Calculate the loss\n",
" loss = torch.mean(\n",
" ((targets_estimate - target_values) / target_values) ** 2\n",
" )\n",
" print(f\"Initial loss: {loss.item()}\")\n",
"\n",
" # Training loop\n",
" for epoch in range(epochs):\n",
"\n",
" # Estimate the targets\n",
" targets_estimate = torch.exp(log_weights) @ estimate_matrix\n",
" # Calculate the loss\n",
" loss = torch.mean(\n",
" ((targets_estimate - target_values) / target_values) ** 2\n",
" )\n",
"\n",
" writer.add_scalar(\"Loss/train\", loss, epoch)\n",
"\n",
" optimizer.zero_grad()\n",
"\n",
" # Perform backpropagation\n",
" loss.backward()\n",
"\n",
" # Update weights\n",
" optimizer.step()\n",
"\n",
" # Print loss whenever the epoch number, when one-indexed, is divisible by epoch_step\n",
" if (epoch + 1) % epoch_step == 0:\n",
" print(f\"Epoch {epoch+1}, Loss: {loss.item()}\")\n",
"\n",
" writer.flush()\n",
"\n",
" return torch.exp(log_weights.detach())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage Example\n",
"Here's how you might use the reweight function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Initial loss: 0.14120370149612427\n",
"Epoch 100, Loss: 0.06793717294931412\n",
"Epoch 200, Loss: 0.03280560299754143\n",
"Epoch 300, Loss: 0.016901666298508644\n",
"Epoch 400, Loss: 0.010035503655672073\n",
"Epoch 500, Loss: 0.007239286322146654\n",
"Epoch 600, Loss: 0.0061649903655052185\n",
"Epoch 700, Loss: 0.005761378910392523\n",
"Epoch 800, Loss: 0.0055924332700669765\n",
"Epoch 900, Loss: 0.005493843927979469\n",
"Epoch 1000, Loss: 0.005410326179116964\n",
"Final weights: tensor([0.7894, 0.7471, 0.7306, 0.7218, 0.7163])\n"
]
}
],
"source": [
"# Prepare your data as PyTorch tensors\n",
"initial_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0])\n",
"estimate_matrix = torch.tensor([\n",
" [1.0, 2.0, 3.0],\n",
" [2.0, 3.0, 4.0],\n",
" [3.0, 4.0, 5.0],\n",
" [4.0, 5.0, 6.0],\n",
" [5.0, 6.0, 7.0]\n",
"])\n",
"target_names = [\"Stat1\", \"Stat2\", \"Stat3\"]\n",
"target_values = torch.tensor([10.0, 15.0, 20.0])\n",
"\n",
"# Call the function\n",
"final_weights = reweight(\n",
" initial_weights,\n",
" estimate_matrix,\n",
" target_names,\n",
" target_values,\n",
" epochs=1000,\n",
" epoch_step=100\n",
")\n",
"\n",
"print(\"Final weights:\", final_weights)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Important Notes\n",
"\n",
"* The function uses relative error (|predicted - actual|/actual) for optimization, avoiding the need for equivalisation factors.\n",
"\n",
"* It utilizes TensorBoard for logging the loss during training.\n",
"\n",
"* The optimization process uses the Adam optimizer and performs gradient descent on the log of the weights.\n",
"\n",
"## Warning\n",
"\n",
"This function expects input data in the form of PyTorch tensors. Using data in any other format (e.g., NumPy arrays, Pandas DataFrames) without converting to PyTorch tensors first will result in errors. Make sure to convert your input data to PyTorch tensors before passing them to the function."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "policyengine",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
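The "Initial loss" line in the notebook's usage example can be reproduced by hand. Below is a sketch in plain Python (no PyTorch), assuming the same toy data as the notebook. The function names are illustrative only, and the single update shown uses plain gradient descent with an arbitrary learning rate, not the Adam optimizer that `reweight` uses.

```python
import math


def weighted_estimates(weights, estimate_matrix):
    """Compute weights @ estimate_matrix: one weighted column sum per target."""
    n_targets = len(estimate_matrix[0])
    return [
        sum(w * row[j] for w, row in zip(weights, estimate_matrix))
        for j in range(n_targets)
    ]


def relative_mse(weights, estimate_matrix, target_values):
    """Mean over targets of ((estimate - target) / target) ** 2."""
    estimates = weighted_estimates(weights, estimate_matrix)
    errors = [((e - t) / t) ** 2 for e, t in zip(estimates, target_values)]
    return sum(errors) / len(errors)


def gradient_wrt_log_weights(log_weights, estimate_matrix, target_values):
    """Gradient of the loss with respect to each log weight (chain rule)."""
    weights = [math.exp(lw) for lw in log_weights]
    estimates = weighted_estimates(weights, estimate_matrix)
    n_targets = len(target_values)
    grads = []
    for w, row in zip(weights, estimate_matrix):
        d_loss_d_w = sum(
            2 * (estimates[j] - target_values[j]) / target_values[j] ** 2 * row[j]
            for j in range(n_targets)
        ) / n_targets
        grads.append(w * d_loss_d_w)  # d(weight)/d(log weight) = weight
    return grads


# Same toy data as the notebook's usage example.
estimate_matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0],
]
target_values = [10.0, 15.0, 20.0]
log_weights = [0.0] * 5  # log(1.0) for each initial weight

initial_weights = [math.exp(lw) for lw in log_weights]
# Estimates are [15, 20, 25], so the loss is (0.25 + 1/9 + 0.0625) / 3,
# approximately 0.1412, in line with the notebook's reported initial loss.
initial_loss = relative_mse(initial_weights, estimate_matrix, target_values)

# One plain gradient-descent step on the log weights (learning rate is arbitrary).
learning_rate = 0.01
grads = gradient_wrt_log_weights(log_weights, estimate_matrix, target_values)
log_weights = [lw - learning_rate * g for lw, g in zip(log_weights, grads)]
updated_weights = [math.exp(lw) for lw in log_weights]
updated_loss = relative_mse(updated_weights, estimate_matrix, target_values)

print(initial_loss, updated_loss)
```

Because the update is applied to the log of each weight and then exponentiated, the weights stay positive however large a step is taken.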
2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,4 +2,4 @@

The PolicyEngine reweight library is a library intended to reweight survey data based on known ground truth statistics, to adjust for sampling biases. This library is designed for use with the [PolicyEngine](https://policyengine.org) software packages.

Currently, this library is still very much a work in progress, and lacks e.g. systematic functions for the reweighting code, and the ability to reweight any survey data not already converted to PyTorch tensors.
Currently, this library is still very much a work in progress: it lacks, for example, a coherent versioning system and the ability to reweight survey data from outside PolicyEngine UK or PolicyEngine US.