Requires:

- docker
- docker-compose
- python3

To run the development setup (postgres backend defined in `sql/`, API backend defined in `api/`), use docker-compose:

```shell
docker-compose up
```
This should bring up the PostgreSQL server running on port 5434 and the API server running on port 5000.
- (optional) load in supporting parcel data for Alameda county (takes a few minutes):

  ```shell
  make install-python-dependencies
  make load-data
  ```
- (optional) test out the API:

  ```shell
  # if you haven't run already
  make install-python-dependencies
  # activate virtual env
  . .env/bin/activate
  # post trace data via the API
  python api/post.py
  # query the grid
  python api/query.py
  ```
The following are notes on this doc.
The temporal grid schema divides space-time into 30-minute chunks of 14m x 14m space. Timestamps are aligned to the hour and half hour and are stored in UTC. Each 14m x 14m grid cell is represented by a full plus code stored as text. I refer to the unique tuple of a 30-minute-aligned timestamp and a plus code as a place-time.
A domain is currently the SHA-256 hash of the timestamp and plus code; it is intended as the foreign key for external lookups (Note: my impression of the domain was unclear from the API document. I'm not really using it at the moment, so let me know how I can make it work better for you).
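As a sketch of how a place-time and its domain might be computed client-side (using only the standard library; the exact serialization that gets hashed into the domain is an assumption on my part and may differ from the UDFs):

```python
import hashlib
from datetime import datetime

def half_hour(ts: datetime) -> datetime:
    """Align a timestamp to the nearest previous half hour."""
    return ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)

def domain(ts: datetime, pluscode: str) -> str:
    """SHA-256 over the aligned timestamp and plus code.
    The concatenation used here is an assumption for illustration."""
    key = f"{half_hour(ts).isoformat()}{pluscode}"
    return hashlib.sha256(key.encode()).hexdigest()

ts = datetime(2020, 4, 3, 22, 36, 13)
print(half_hour(ts))  # 2020-04-03 22:30:00
print(domain(ts, "849VRPXM+JQ"))
```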
The current implementation of the grid is a sparse representation:

```sql
CREATE TABLE IF NOT EXISTS grid(
    time TIMESTAMP NOT NULL,
    pluscode TEXT NOT NULL,
    domain TEXT NOT NULL
);
CREATE UNIQUE INDEX IF NOT EXISTS grid_time_code_idx ON grid(time, pluscode);
CREATE INDEX IF NOT EXISTS domain_idx ON grid(domain);
```
The grid is supported by two UDFs:
- `plus_code(wkt_text)`: accepts WKT-formatted strings of geographic points and returns a full plus code:

  ```sql
  select start_time, plus_code(st_astext(geographic_location)) from location_history;
  -- +---------------------+-------------+
  -- | start_time          | plus_code   |
  -- |---------------------+-------------|
  -- | 2020-04-03 22:36:13 | 849VRPXM+JQ |
  -- | 2020-04-03 23:24:00 | 849VVP2M+5J |
  -- | 2020-04-03 23:31:52 | 849VVP2M+CV |
  ```
- `half_hour(timestamp)`: accepts `YYYY-MM-DD HH:MM:SS` formatted timestamps and returns the timestamp aligned to the nearest previous half hour:

  ```sql
  select half_hour(start_time), plus_code(st_astext(geographic_location)) from location_history;
  -- +---------------------+-------------+
  -- | start_time          | plus_code   |
  -- |---------------------+-------------|
  -- | 2020-04-03 22:30:00 | 849VRPXM+JQ |
  -- | 2020-04-03 23:00:00 | 849VVP2M+5J |
  -- | 2020-04-03 23:30:00 | 849VVP2M+CV |
  ```
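Internally, `plus_code` has to pull the longitude/latitude pair out of the WKT text before encoding it. A minimal sketch of that parsing step in plain Python (an illustration, not the actual UDF body; note that WKT orders coordinates longitude-first):

```python
import re

def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Extract (lat, lon) from a WKT string like 'POINT(-122.27 37.80)'.
    WKT lists longitude before latitude, so we swap on return."""
    m = re.match(r"POINT\s*\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)", wkt.strip())
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    lon, lat = float(m.group(1)), float(m.group(2))
    return lat, lon

print(parse_wkt_point("POINT(-122.2712 37.8044)"))  # (37.8044, -122.2712)
```

The real UDF would pass the resulting pair to an Open Location Code encoder to get the full plus code.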
Because I'm not sure how I want to handle arbitrary key-value pairs at each grid square, I'm currently defining two views:
- `grid_count`: stores the number of unique location records at each place-time:

  ```sql
  CREATE OR REPLACE VIEW grid_count AS
  SELECT count(distinct userid) as num,
         plus_code(st_astext(geographic_location)) as geocode,
         half_hour(start_time) as timestamp
  FROM location_history
  GROUP BY geocode, timestamp
  ORDER BY timestamp asc;

  select * from grid_count limit 10;
  -- +------+-------------+---------------------+
  -- | num  | geocode     | timestamp           |
  -- |------+-------------+---------------------|
  -- | 2    | 849VVP2M+9W | 2020-04-02 00:00:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 00:30:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 01:00:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 01:30:00 |
  -- | 1    | 849VVP2M+9V | 2020-04-02 02:00:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 02:00:00 |
  -- | 1    | 849VVP2M+9V | 2020-04-02 02:30:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 02:30:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 03:00:00 |
  -- | 1    | 849VVP2M+9W | 2020-04-02 03:30:00 |
  -- +------+-------------+---------------------+
  ```
- `grid_sick`: stores the number of unique location records at each place-time that correspond to users who have self-identified as "feeling sick":

  ```sql
  CREATE OR REPLACE VIEW grid_sick AS
  SELECT count(distinct lh.userid) as num_sick,
         plus_code(st_astext(geographic_location)) as geocode,
         half_hour(start_time) as timestamp
  FROM location_history as lh
  JOIN sick_continuous as sc
    ON lh.userid = sc.userid AND half_hour(start_time) = sc.timestamp
  GROUP BY geocode, half_hour(start_time)
  ORDER BY timestamp asc;

  select * from grid_sick limit 10;
  -- +------------+-------------+---------------------+
  -- | num_sick   | geocode     | timestamp           |
  -- |------------+-------------+---------------------|
  -- | 1          | 849VVP2M+9W | 2020-04-02 00:00:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 00:00:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 01:00:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 01:30:00 |
  -- | 1          | 849VVP2M+CW | 2020-04-03 01:30:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 02:00:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 02:30:00 |
  -- | 1          | 849VVP2M+9V | 2020-04-03 03:00:00 |
  -- | 1          | 849VVP2M+9W | 2020-04-03 03:00:00 |
  -- | 1          | 849VVP2M+9V | 2020-04-03 03:30:00 |
  -- +------------+-------------+---------------------+
  ```
The above grid schema exists independently of any individual data source. The two views defined above take advantage of a supporting schema that captures individual location-history traces, self-reports of feeling sick, and other municipal data sources.
Individual location histories can be captured by the following schema:

```sql
CREATE TABLE IF NOT EXISTS location_history(
    userid INT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL,
    geographic_location geography(POINT, 4326),
    semantic_location TEXT,
    address TEXT
);
CREATE INDEX IF NOT EXISTS geoloc_idx ON location_history USING GIST (geographic_location);
```
- `userid`: a unique identifier to tie a location history to a sickness report (see `sick` table schema below)
- `start_time`, `end_time`: store the UTC range of time that a user was at a given location
- `geographic_location`: a geographic `POINT` indicating a user's location
- `semantic_location`: the semantic name of the user's location (e.g. "Berkeley Bowl")
- `address`: the street address of the user's location (e.g. "1960 Adeline St, Oakland, CA 94607")
It is expected that at least one of `geographic_location`, `semantic_location`, and `address` is populated.
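The schema itself doesn't enforce that expectation, so a loader might check it before inserting. A minimal sketch of such a check (the helper name is my own):

```python
def has_location(geographic_location, semantic_location, address) -> bool:
    """True if at least one of the three location fields is populated
    (not None and not an empty string)."""
    return any(v is not None and v != ""
               for v in (geographic_location, semantic_location, address))

print(has_location(None, "Berkeley Bowl", None))  # True
print(has_location(None, None, ""))               # False
```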
The `sick` table captures reports of sickness over time. Through a survey or other means, a user (identified by `userid`) reports that they feel sick (`feels_sick = 1`) or not (`feels_sick = 0`) at a given time (likely specified at the granularity of a day).

The `sick_continuous` view takes advantage of TimescaleDB's aggregation API to carry a report of being sick or not forward through time so it is easier to join against (see the examples at the end of the SQL snippet below):
```sql
CREATE TABLE IF NOT EXISTS sick(
    userid INT NOT NULL,
    time TIMESTAMP NOT NULL,
    feels_sick INT NOT NULL
);
SELECT create_hypertable('sick', 'time');

CREATE OR REPLACE VIEW sick_continuous AS
SELECT timestamp, userid, COALESCE(locf, 0) as feels_sick FROM
(
    SELECT
        time_bucket_gapfill('30 minute', time, '2020-01-01 00:00:00'::timestamp, now()) as timestamp,
        userid,
        max(feels_sick) as feels_sick,
        locf(max(feels_sick)) as locf
    FROM sick
    GROUP BY timestamp, userid
) a
WHERE a.locf IS NOT NULL
ORDER BY timestamp ASC;
```
```sql
-- raw "sick" reports
select * from sick;
-- +----------+---------------------+--------------+
-- | userid   | time                | feels_sick   |
-- |----------+---------------------+--------------|
-- | 1        | 2020-04-02 00:00:00 | 0            |
-- | 1        | 2020-04-03 00:00:00 | 1            |
-- | 2        | 2020-04-03 00:00:00 | 1            |
-- +----------+---------------------+--------------+

-- made continuous
-- +------------------------+----------+--------------+
-- | timestamp              | userid   | feels_sick   |
-- |------------------------+----------+--------------|
-- | 2020-04-02 21:30:00-04 | 1        | 0            |
-- | 2020-04-02 22:00:00-04 | 1        | 0            |
-- | 2020-04-02 22:30:00-04 | 1        | 0            |
-- | 2020-04-02 23:00:00-04 | 1        | 0            |
-- | 2020-04-02 23:30:00-04 | 1        | 0            |
-- | 2020-04-03 00:00:00-04 | 2        | 1            |
-- | 2020-04-03 00:00:00-04 | 1        | 1            |
-- | 2020-04-03 00:30:00-04 | 2        | 1            |
-- | 2020-04-03 00:30:00-04 | 1        | 1            |
-- | 2020-04-03 01:00:00-04 | 1        | 1            |
-- | 2020-04-03 01:00:00-04 | 2        | 1            |
-- | 2020-04-03 01:30:00-04 | 2        | 1            |
-- | 2020-04-03 01:30:00-04 | 1        | 1            |
-- | 2020-04-03 02:00:00-04 | 1        | 1            |
```
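The gapfill-and-carry-forward behavior can be sketched in plain Python: generate 30-minute buckets over a time range and carry each user's last report forward (a simplification of TimescaleDB's `time_bucket_gapfill`/`locf` that, like the view's `IS NOT NULL` filter, emits nothing for a user before their first report):

```python
from datetime import datetime, timedelta

def sick_continuous(reports, start, end):
    """reports: list of (userid, time, feels_sick) tuples, sorted by time.
    Returns (timestamp, userid, feels_sick) rows for every 30-minute
    bucket in [start, end), carrying each user's last value forward."""
    by_user = {}
    for userid, t, feels_sick in reports:
        by_user.setdefault(userid, []).append((t, feels_sick))
    out, last = [], {}            # last: userid -> carried-forward value
    idx = {u: 0 for u in by_user}  # next unconsumed report per user
    t = start
    while t < end:
        for u, rs in by_user.items():
            # consume any reports at or before this bucket
            while idx[u] < len(rs) and rs[idx[u]][0] <= t:
                last[u] = rs[idx[u]][1]
                idx[u] += 1
            if u in last:  # no rows before the user's first report
                out.append((t, u, last[u]))
        t += timedelta(minutes=30)
    return out

reports = [(1, datetime(2020, 4, 2), 0), (1, datetime(2020, 4, 3), 1)]
rows = sick_continuous(reports, datetime(2020, 4, 1), datetime(2020, 4, 3, 1))
```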
I've done some work integrating parcel data from Alameda county, which allows us to map the following:

```
parcel # (APN) <----> geographic bounding box <---> user geographic coordinates
      ^
      |
      v
street address
```

I can document this later if we need; data files are available for download here (via the Alameda county data hub).
`api/app.py` implements a simple Flask application that supports:

- pushing data to the server: location tracing and sickness reporting (example: `api/post.py`)
- simple grid queries: count of people (sick, or total) in a place-time (example: `api/query.py`)
TODO: dockerize the API server
The following scripts have been written:

- `load_kml.py`: loads a KML export from Google timeline into the `location_history` table
- `load_takeout.py`: loads a GeoJSON export from Google takeout into the `location_history` table
- `load-alameda-geo.py`: loads in Alameda county shape files
- `load_alameda_address.py`: loads in Alameda county address records (ties APNs to street addresses)
Sample file: `data/history-2020-04-01.kml`

Load in sample location history data:

```shell
python scripts/load_kml.py 1 data/history-2020-04-01.kml
```
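For reference, the core of what `load_kml.py` has to do — extract Placemark point coordinates from the KML — can be sketched with the standard library (a simplified illustration; the actual script presumably also handles timestamps, addresses, and the user id argument):

```python
import xml.etree.ElementTree as ET

def placemark_points(kml_text):
    """Yield (lon, lat) for each <coordinates> element in a KML document.
    KML coordinates are 'lon,lat,altitude' triples."""
    root = ET.fromstring(kml_text)
    for coords in root.iter("{http://www.opengis.net/kml/2.2}coordinates"):
        lon, lat, *_ = coords.text.strip().split(",")
        yield float(lon), float(lat)

# made-up minimal KML for illustration
sample = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><Point><coordinates>-122.2712,37.8044,0</coordinates></Point></Placemark>
</Document></kml>"""
print(list(placemark_points(sample)))  # [(-122.2712, 37.8044)]
```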
To download for yourself:

- Navigate to maps.google.com
- Use the menu to navigate to "Your timeline"
- Pick the day you want, add places or export your location history (if you do not have location history enabled, you should be able to add a sequence of places)
- Use the "gear icon" in the lower right corner and select the option "Export this day to KML"
- Use the python script above to load it into the database. Choose a user id (`1`, above) that is unique (working on auto-assigning these)
- relies on PostgreSQL 11 (as of now)
- PostGIS:
  - install: `sudo apt install postgresql-11-postgis-3`
  - SQL: `CREATE EXTENSION postgis;`
- PL/Python3:
  - install: `sudo apt install postgresql-plpython3-11`
  - SQL: `CREATE EXTENSION plpython3u;`
- TimescaleDB:
  - install: look at website
  - SQL: `CREATE EXTENSION timescaledb CASCADE;`
- `setup.sql` defines all tables, functions, etc.:
  - remember to change the relative import paths in the `half_hour` and `plus_code` UDFs