PolicyEngine · pmberg · Aug 2, 2024 · Aug 7, 2024
diff --git a/.github/workflows/publish.yaml b/.github/workflows/publish.yaml
@@ -0,0 +1,32 @@
+name: Publish to PyPI.org
+on:
+  release:
+    types: [published]
+  workflow_dispatch:
+    inputs:
+      reason:
+        description: "Reason for manual trigger"
+        required: true
+        default: "Testing workflow"
+jobs:
+  pypi:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.9
+      - name: Install package
+        run: make install
+      - name: Build package
+        run: make
+      - name: Publish a Python distribution to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          user: __token__
+          password: ${{ secrets.PYPI }}
+          skip-existing: true
diff --git a/.github/workflows/push.yaml b/.github/workflows/push.yaml
@@ -47,27 +47,3 @@ jobs:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           BRANCH: gh-pages
           FOLDER: docs/_build/html
-  Publish:
-    runs-on: ubuntu-latest
-    if: |
-      (github.repository == 'PolicyEngine/reweight')
-      && (github.event.head_commit.message == 'Update reweight')
-    steps:
-      - name: Checkout repo
-        uses: actions/checkout@v3
-      - name: Setup Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: 3.9
-      - name: Publish a git tag
-        run: ".github/publish-git-tag.sh || true"
-      - name: Install package
-        run: make install
-      - name: Build package
-        run: make
-      - name: Publish a Python distribution to PyPI
-        uses: pypa/gh-action-pypi-publish@release/v1
-        with:
-          user: __token__
-          password: ${{ secrets.PYPI }}
-          skip-existing: true
diff --git a/.github/workflows/schedule.yaml b/.github/workflows/schedule.yaml
@@ -0,0 +1,28 @@
+name: Scheduled Data Processing
+
+on:
+  schedule:
+    - cron: "0 0 1 * *" # Runs at 00:00 on the first day of every month
+  push:
+    branches: [main] # Runs on pushes to the main branch
+  pull_request:
+    branches: [main] # Runs on pull requests to the main branch
+
+jobs:
+  process_data:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.9
+      - name: Install dependencies
+        run: make install
+      - name: Run data processing script
+        run: python reweight/logic/process_data.py
+        env:
+          POVERTYTRACKER_RAW_URL: ${{ secrets.POVERTYTRACKER_RAW_URL }}
+          POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN: ${{ secrets.POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN }}
+          API_GITHUB_TOKEN: ${{ secrets.API_GITHUB_TOKEN }}
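The workflow above passes three secrets to `process_data.py` as environment variables. A minimal sketch of how a script might read them and fail early with a clear message is shown here. The helper name `require_env` is hypothetical and not taken from the repository, and the actual script may handle this differently.

```python
import os

# Variable names taken from the workflow's `env:` block.
REQUIRED_VARS = (
    "POVERTYTRACKER_RAW_URL",
    "POLICYENGINE_GITHUB_MICRODATA_AUTH_TOKEN",
    "API_GITHUB_TOKEN",
)


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is missing or empty.

    Hypothetical helper for illustration; not part of the reweight package.
    """
    environ = os.environ if environ is None else environ
    # Treat empty strings like missing values.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Empty values are rejected as well as absent ones because a secret that is not defined for a run reaches the job as an empty string rather than being left unset.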
diff --git a/.gitignore b/.gitignore
@@ -20,4 +20,9 @@ docs/_build
 
 # Testing notebooks #
 #####################
-/*.ipynb
+/*.ipynb
+/test_*
+
+# Temporary CSV files #
+#######################
+/*.csv
diff --git a/README.md b/README.md
@@ -1,3 +1,3 @@
 # reweight
 
-This library will contain logic for consistently reweighting survey data across the PolicyEngine simulation sofware. 
+This library consistently reweights survey data across the PolicyEngine simulation software. It includes a function called `reweight` and a script called `process_data.py` that runs `reweight` on PolicyEngine data.
diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -2,5 +2,7 @@ format: jb-book
 root: index.md
 chapters:
   - file: current-features
+  - file: features/reweight
+  - file: features/process_data
   - file: testing_notebooks/us-notebook
   - file: testing_notebooks/uk-notebook
diff --git a/docs/features/process_data.ipynb b/docs/features/process_data.ipynb
@@ -0,0 +1,38 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The process_data file\n",
+    "\n",
+    "The `process_data` file in this repo uses the `reweight` function to process survey data from PolicyEngine UK and PolicyEngine US. It runs on a monthly schedule, and also on pushes and pull requests to `main`.\n",
+    "\n",
+    "## generate_country_weights\n",
+    "\n",
+    "This is a helper function that uses reweight to generate optimized weights for a specific country and year.\n",
+    "\n",
+    "## generate_country_csv\n",
+    "\n",
+    "This is a helper function that generates optimized weights for a country over multiple years, and then saves these weights as a CSV file.\n",
+    "\n",
+    "## Main body of code\n",
+    "\n",
+    "First, `generate_country_csv` is used to generate weights files for both the UK and the US. Then, a GitHub release is generated on the reweight repo, to which the two CSV files are uploaded with a simple helper function called `upload_file`.\n",
+    "\n",
+    "## Notes\n",
+    "\n",
+    "If you're developing on this, replace \"pmberg\" with your GitHub username, and set an environment variable named `API_GITHUB_TOKEN` containing an appropriate GitHub API token.\n",
+    "\n",
+    "Also, the UK data sources are not publicly available, so developing against them requires authorization to obtain a working API key. If you lack the necessary permissions at any stage, the code will not run."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
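The notebook above describes `generate_country_csv` as producing weights for several years and saving them to a CSV file. The sketch below illustrates that flow using only the standard library. The function name, signature, file name, and column layout (one column per year, one row per household) are all assumptions made for illustration, and `weights_for_year` is a hypothetical stand-in for `generate_country_weights`.

```python
import csv


def generate_country_csv_sketch(country, years, weights_for_year, path=None):
    """Write one column of weights per year to a CSV file and return its path.

    Illustrative only: the real `generate_country_csv` may use a different
    signature, file name, and column layout.
    """
    path = path or f"{country}_weights.csv"
    # One list of weights per year; all lists are assumed to be equally long.
    columns = [list(weights_for_year(country, year)) for year in years]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([str(year) for year in years])  # header row
        writer.writerows(zip(*columns))  # one row per household
    return path
```

Passing the weight generator in as an argument keeps the file-writing step testable without access to the restricted UK data.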
diff --git a/docs/features/reweight.ipynb b/docs/features/reweight.ipynb
@@ -0,0 +1,233 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Reweighting Function Documentation\n",
+    "\n",
+    "## Purpose\n",
+    "\n",
+    "This notebook documents a Python function `reweight` that adjusts a set of initial weights to better match target statistics. It's particularly useful for calibrating survey data weights in microsimulation models, such as those used in PolicyEngine UK.\n",
+    "\n",
+    "## Import Required Libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "from torch.utils.tensorboard import SummaryWriter"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Function Definition and Overview\n",
+    "\n",
+    "The reweight function uses an optimization process to adjust initial weights so that the weighted sum of estimates more closely matches a set of target values.\n",
+    "### Parameters\n",
+    "\n",
+    "`initial_weights (torch.Tensor):` Initial weights for survey data.\n",
+    "\n",
+    "`estimate_matrix (torch.Tensor):` Matrix of estimates from a microsimulation model.\n",
+    "\n",
+    "`target_names (iterable):` Names of target statistics (not used in the function body).\n",
+    "\n",
+    "`target_values (torch.Tensor):` Values of target statistics to match.\n",
+    "\n",
+    "`epochs (int, optional):` Number of optimization iterations. Default is 1000.\n",
+    "\n",
+    "`epoch_step (int, optional):` Interval for printing loss during optimization. Default is 100.\n",
+    "\n",
+    "### Returns\n",
+    "\n",
+    "`final_weights (torch.Tensor):` Adjusted weights after optimization."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def reweight(\n",
+    "    initial_weights,\n",
+    "    estimate_matrix,\n",
+    "    target_names,\n",
+    "    target_values,\n",
+    "    epochs=1000,\n",
+    "    epoch_step=100,\n",
+    "):\n",
+    "    \"\"\"\n",
+    "    Main reweighting function, suitable for PolicyEngine UK use (PolicyEngine US use and testing to come).\n",
+    "\n",
+    "    To avoid the need for equivalisation factors, use relative error:\n",
+    "    |predicted - actual|/actual\n",
+    "\n",
+    "    Parameters:\n",
+    "    initial_weights (torch.Tensor): The initial weights given to survey data, which are to be\n",
+    "    adjusted by this function.\n",
+    "    estimate_matrix (torch.Tensor): A large matrix of estimates, obtained from e.g. a PolicyEngine\n",
+    "    Microsimulation instance.\n",
+    "    target_names (iterable): The names of a set of target statistics treated as ground truth.\n",
+    "    target_values (torch.Tensor): The values of these target statistics.\n",
+    "    epochs: The number of iterations that the optimization loop should run for.\n",
+    "    epoch_step: The interval at which to print the loss during the optimization loop.\n",
+    "\n",
+    "    Returns:\n",
+    "    final_weights: a reweighted set of household weights, obtained through an optimization process\n",
+    "    over mean squared relative errors with respect to the target values.\n",
+    "    \"\"\"\n",
+    "    # Initialize a TensorBoard writer\n",
+    "    writer = SummaryWriter()\n",
+    "\n",
+    "    # Create a Torch tensor of log weights\n",
+    "    log_weights = torch.log(initial_weights)\n",
+    "    log_weights.requires_grad_()\n",
+    "\n",
+    "    # Goal: exp(log_weights) @ estimate_matrix should approximate target_values\n",
+    "\n",
+    "    optimizer = torch.optim.Adam([log_weights])\n",
+    "\n",
+    "    # Report the initial loss:\n",
+    "    targets_estimate = torch.exp(log_weights) @ estimate_matrix\n",
+    "    # Calculate the loss\n",
+    "    loss = torch.mean(\n",
+    "        ((targets_estimate - target_values) / target_values) ** 2\n",
+    "    )\n",
+    "    print(f\"Initial loss: {loss.item()}\")\n",
+    "\n",
+    "    # Training loop\n",
+    "    for epoch in range(epochs):\n",
+    "\n",
+    "        # Estimate the targets\n",
+    "        targets_estimate = torch.exp(log_weights) @ estimate_matrix\n",
+    "        # Calculate the loss\n",
+    "        loss = torch.mean(\n",
+    "            ((targets_estimate - target_values) / target_values) ** 2\n",
+    "        )\n",
+    "\n",
+    "        writer.add_scalar(\"Loss/train\", loss, epoch)\n",
+    "\n",
+    "        optimizer.zero_grad()\n",
+    "\n",
+    "        # Perform backpropagation\n",
+    "        loss.backward()\n",
+    "\n",
+    "        # Update weights\n",
+    "        optimizer.step()\n",
+    "\n",
+    "        # Print loss whenever the epoch number, when one-indexed, is divisible by epoch_step\n",
+    "        if (epoch + 1) % epoch_step == 0:\n",
+    "            print(f\"Epoch {epoch+1}, Loss: {loss.item()}\")\n",
+    "\n",
+    "    writer.flush()\n",
+    "\n",
+    "    return torch.exp(log_weights.detach())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Usage Example\n",
+    "Here's how you might use the reweight function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Initial loss: 0.14120370149612427\n",
+      "Epoch 100, Loss: 0.06793717294931412\n",
+      "Epoch 200, Loss: 0.03280560299754143\n",
+      "Epoch 300, Loss: 0.016901666298508644\n",
+      "Epoch 400, Loss: 0.010035503655672073\n",
+      "Epoch 500, Loss: 0.007239286322146654\n",
+      "Epoch 600, Loss: 0.0061649903655052185\n",
+      "Epoch 700, Loss: 0.005761378910392523\n",
+      "Epoch 800, Loss: 0.0055924332700669765\n",
+      "Epoch 900, Loss: 0.005493843927979469\n",
+      "Epoch 1000, Loss: 0.005410326179116964\n",
+      "Final weights: tensor([0.7894, 0.7471, 0.7306, 0.7218, 0.7163])\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Prepare your data as PyTorch tensors\n",
+    "initial_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0])\n",
+    "estimate_matrix = torch.tensor([\n",
+    "    [1.0, 2.0, 3.0],\n",
+    "    [2.0, 3.0, 4.0],\n",
+    "    [3.0, 4.0, 5.0],\n",
+    "    [4.0, 5.0, 6.0],\n",
+    "    [5.0, 6.0, 7.0]\n",
+    "])\n",
+    "target_names = [\"Stat1\", \"Stat2\", \"Stat3\"]\n",
+    "target_values = torch.tensor([10.0, 15.0, 20.0])\n",
+    "\n",
+    "# Call the function\n",
+    "final_weights = reweight(\n",
+    "    initial_weights,\n",
+    "    estimate_matrix,\n",
+    "    target_names,\n",
+    "    target_values,\n",
+    "    epochs=1000,\n",
+    "    epoch_step=100\n",
+    ")\n",
+    "\n",
+    "print(\"Final weights:\", final_weights)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Important Notes\n",
+    "\n",
+    "* The function minimizes the mean of squared relative errors, ((predicted - actual)/actual)^2, which avoids the need for equivalisation factors.\n",
+    "\n",
+    "* It utilizes TensorBoard for logging the loss during training.\n",
+    "\n",
+    "* The optimization applies the Adam optimizer to the log of the weights, which keeps the returned weights positive.\n",
+    "\n",
+    "## Warning\n",
+    "\n",
+    "This function expects input data in the form of PyTorch tensors. Using data in any other format (e.g., NumPy arrays, Pandas DataFrames) without converting to PyTorch tensors first will result in errors. Make sure to convert your input data to PyTorch tensors before passing them to the function."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "policyengine",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
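The loss that `reweight` minimizes can be checked by hand for the usage example in the notebook above. The sketch below recomputes it with plain Python lists instead of tensors. The function name `mean_squared_relative_error` is introduced here for illustration and does not appear in the notebook.

```python
def mean_squared_relative_error(weights, estimate_matrix, target_values):
    """Mean over targets of ((estimate - target) / target) ** 2."""
    n_targets = len(target_values)
    # Weighted column sums: estimates[j] = sum_i weights[i] * estimate_matrix[i][j]
    estimates = [
        sum(w * row[j] for w, row in zip(weights, estimate_matrix))
        for j in range(n_targets)
    ]
    squared_relative_errors = [
        ((estimate - target) / target) ** 2
        for estimate, target in zip(estimates, target_values)
    ]
    return sum(squared_relative_errors) / n_targets


# Same inputs as the notebook's usage example.
weights = [1.0] * 5
estimate_matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0],
]
target_values = [10.0, 15.0, 20.0]

initial_loss = mean_squared_relative_error(weights, estimate_matrix, target_values)
print(round(initial_loss, 6))  # 0.141204
```

With unit weights the column sums are 15, 20, and 25, so the relative errors are 1/2, 1/3, and 1/4. The mean of their squares is about 0.1412, which agrees with the "Initial loss" line in the notebook output.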
diff --git a/docs/index.md b/docs/index.md
@@ -2,4 +2,4 @@
 
 The PolicyEngine reweight library is a library intended to reweight survey data based on known ground truth statistics, to adjust for sampling biases. This library is designed for use with the [PolicyEngine](https://policyengine.org) software packages.
 
-Currently, this library is still very much a work in progress, and lacks e.g. systematic functions for the reweighting code, and the ability to reweight any survey data not already converted to PyTorch tensors.
+Currently, this library is still very much a work in progress. It lacks, for example, a coherent versioning system and the ability to reweight survey data from outside PolicyEngine UK or PolicyEngine US.