Skip to content

Use Case I: Partisan Analysis on US voter dataset

dkakkar edited this page Jan 30, 2020 · 3 revisions

Problem Introduction KNN

  • K Nearest Neighbor (KNN) calculation is a well known GIS operation which is hard to scale
  • The three common methods of finding KNN:
    • Traditional algorithms
    • Geohash based KNN
    • Index-based Search

image

Geohash Clustering with KNN search

  • Combining two popular spatial index for best results: Geohash and R-tree
  • A two layer approach:
  • Bottom layer : Geohash index based clustering
  • Top layer: R-tree index based searches
  • Extremely fast: 200,000 distance calculations per second on moderate resources: m4.xlarge EC2 with 300 GB EBS

image

Partisan Analysis

  • Construct individual levels of partisan exposure at several approximated geographies for each voter in US
  • Performing k-means clustering on a voter dataset of 180 million, with k=1000
  • Solution so far is in form Amazon AMI for easy and fast implementation
    • AMI replicates the entire KNN computation environment on launch
    • User friendly solution with no GIS expertise needed
    • Results stored as PostGIS dumps on S3 in instead of EBS
    • 87% compression in the result dataset obtained
    • 18TG of DB reduced to 1500 GB
    • Storage cost reduced form $1800/month to $35/month (98% reduction)