Two scripts that pipe data from the airmonitor API into BigQuery.
First steps: create a service account on GCP with the role BigQuery Data Owner (or BigQuery Admin) plus Logs Writer. Create a key for that service account, download the JSON file and put `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/json"` (mind the quotation marks) in your `.bashrc`/`.zshrc`.
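A quick sanity check before running the scripts can save some confusion: the helper below (a hypothetical snippet, not part of this repo) just verifies that the environment variable is set and points at an existing file.

```python
import os

def credentials_ok(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """True if the env var is set and points at an existing key file."""
    path = os.environ.get(env_var)
    return bool(path) and os.path.isfile(path)

print(credentials_ok())  # False until the variable points at your JSON key
```

The Google Cloud client libraries read this variable automatically, so if the check passes, `google-cloud-bigquery` and `google-cloud-logging` should authenticate without further configuration.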
```
AQMesh
|   get_history.py
|   schema.py
|   query.py
|   scraper.py
|   tools.py
|
└───visu
        BigQueryInlineQuery.ipynb
        BigQueryPandasPlotly.ipynb
        global_air_quality.ipynb
```
`get_history.py`: requests all historic data (up to today) from the airmonitor API, reformats it and pipes it into BigQuery. If a new table/dataset needs to be created in the process (as specified in the top section of the file), the current table schema is read from `schema.py`. Logs are written to a file (`airmonitorHistory.log` by default) and to stdout.

`query.py`: contains a helper class, `Query`, that is used to organise and build a string for querying BigQuery.

`scraper.py`: in principle almost identical to `get_history.py`; this script should be run by e.g. a cron job to scrape the latest data off the API. It checks the timestamp of the latest entry in BigQuery for every available station and starts scraping from there. Logging to `Stackdriver.Logging` is enabled, so all log messages are available in GCP. It also logs to stdout, but not to a file (this can still be enabled if wanted).

`tools.py`: contains two functions needed by the visualisations, to unclutter the code. The first (`read_ts`) makes reading data from the BigQuery table easier; the second (`bounded_graph`) helps draw a bounded graph with `plotly`. Both are used in the visualisations described below.
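The query-building idea behind `query.py` can be sketched roughly as follows. The class name matches, but every method shown here is illustrative only, not the actual API of this repo:

```python
class Query:
    """Minimal sketch of a chainable BigQuery query-string builder."""

    def __init__(self, table):
        self.table = table
        self.columns = []
        self.conditions = []
        self.order = None
        self.max_rows = None

    def select(self, *cols):
        self.columns.extend(cols)
        return self

    def where(self, condition):
        self.conditions.append(condition)
        return self

    def order_by(self, col, desc=False):
        self.order = f"{col} DESC" if desc else col
        return self

    def limit(self, n):
        self.max_rows = n
        return self

    def build(self):
        # Assemble the parts in standard SQL order; default to SELECT *.
        cols = ", ".join(self.columns) or "*"
        parts = [f"SELECT {cols}", f"FROM `{self.table}`"]
        if self.conditions:
            parts.append("WHERE " + " AND ".join(self.conditions))
        if self.order:
            parts.append(f"ORDER BY {self.order}")
        if self.max_rows is not None:
            parts.append(f"LIMIT {self.max_rows}")
        return " ".join(parts)

q = (Query("project.dataset.table")
     .select("timestamp", "co")
     .where("station = 'A1'")
     .order_by("timestamp", desc=True)
     .limit(1))
print(q.build())
# SELECT timestamp, co FROM `project.dataset.table` WHERE station = 'A1' ORDER BY timestamp DESC LIMIT 1
```

A builder like this is convenient for the scraper's pattern of "fetch the latest timestamp per station", since only the `WHERE` clause changes between stations.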
`BigQueryInlineQuery.ipynb`: example of how to use Jupyter magic commands to query BigQuery and use the data (here with `matplotlib`).

`BigQueryPandasPlotly.ipynb`: example of how to use `pandas.io.gbq` to query BigQuery, visualising time series with `pandas` and `plotly`. Also tries to forecast time series (temperature and carbon monoxide in this example) using the package `fbprophet`.

`global_air_quality.ipynb`: uses the historical open data of the EPA to do the same as in `BigQueryPandasPlotly.ipynb`, but with a longer historical record to train the model on, resulting in better forecasts. In this example, a site in St. Louis, Missouri, was used, with an hourly temperature record going back to 2013.
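The bounded graphs in these notebooks come down to computing an envelope around a series. A plain-Python sketch of one common choice, a rolling mean ± k·std band (a hypothetical helper for illustration; the real `tools.py` and notebooks work with `pandas`/`plotly`):

```python
from statistics import mean, stdev

def rolling_band(values, window=3, k=2.0):
    """Rolling mean with mean +/- k*std bounds.

    Returns three lists (lower, mid, upper) that could be handed to
    plotly as three traces: the band edges and the central line.
    """
    lower, mid, upper = [], [], []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        m = mean(chunk)
        s = stdev(chunk)  # sample standard deviation of the window
        mid.append(m)
        lower.append(m - k * s)
        upper.append(m + k * s)
    return lower, mid, upper

lo, mi, hi = rolling_band([10, 12, 11, 13, 12], window=3)
print(mi)  # [11, 12, 12]
```

`fbprophet` produces a similar structure out of the box (`yhat` with `yhat_lower`/`yhat_upper`), which is why the forecast plots in the notebooks look like bounded graphs as well.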
It is recommended to use Python >= 3.7 to avoid problems caused by newer syntax features (e.g. string interpolation, type hinting, ...). All dependencies can be installed via `pip` (version numbers as of this writing):
| package | version |
|---|---|
| google-cloud-bigquery | 1.5.1 |
| google-cloud-logging | 1.8.0 |
| numpy | 1.15.2 |
| fbprophet | 0.3.post2 |
| jupyter notebook | 5.5.0 |
| matplotlib | 3.0.0 |
| pandas | 0.23.4 |
| pandas-gbq | 0.6.1 |
| plotly | 3.2.1 |
`fbprophet` depends on `pystan`, which needs quite a lot of RAM during installation (a few GB). If you run into problems, consider using a swapfile.