-
Notifications
You must be signed in to change notification settings - Fork 3
Use Case I: Partisan Analysis on US voter dataset
dkakkar edited this page Jan 30, 2020
·
3 revisions
- K Nearest Neighbor (KNN) calculation is a well known GIS operation which is hard to scale
- The three common methods of finding KNN:
- Traditional algorithms
- Geohash based KNN
- Index-based Search
- Combining two popular spatial index for best results: Geohash and R-tree
- A two layer approach:
- Bottom layer : Geohash index based clustering
- Top layer: R-tree index based searches
- Extremely fast: 200,000 distance calculations per second on moderate resources: m4.xlarge EC2 with 300 GB EBS
- Construct individual levels of partisan exposure at several approximated geographies for each voter in US
- Performing k-means clustering on a voter dataset of 180 million, with k=1000
- Solution so far is in form Amazon AMI for easy and fast implementation
- AMI replicates the entire KNN computation environment on launch
- User friendly solution with no GIS expertise needed
- Results stored as PostGIS dumps on S3 in instead of EBS
- 87% compression in the result dataset obtained
- 18TG of DB reduced to 1500 GB
- Storage cost reduced form $1800/month to $35/month (98% reduction)