Transform vceregen renewable generation profiles (#3898)
* Add source metadata for vceregen

* Add profiles to vceregen dataset name

* Remove blank line in description

* Add blank Data Source template for vceregen

* Add links to download docs section

* Add availability section

* Add respondents section

* Add original data section

* Stash WIP of extraction

* Extract VCE tables to raw dask dfs

* Clean up warnings and restore EIA 176

* Revert to pandas concatenation

* Add latlonfips

* Add blank transform module for vceregen

* Fill out the basic vceregen transforms

* Add underscores back to function names

* Update time col calculation

* Update docstrings and comments to reflect new time cols

* Change merge to concat

* Remove dask, coerce dtypes on read-in

* override load_column_maps behavior

* Update addition of county and state name fields

* Add vceregen to init files and metadata so that it will run on dagster, update column names, and add new fields

* Add resource metadata for vceregen

* Clean county strings more

* Add release notes

* Add function to validate state_county_names and improve performance of add_time_cols function

* make for loops into dict comp, update loggers, and improve regex

* Add asset checks and remove inline checks

* Change hour_utc to datetime_utc

* Remove incorrect docstring

* Update dataset and field metadata

* Rename county col to county_or_subregion

* Update data_source docs page

* change axis=1 to axis=columns

* Update DOI to sandbox and temporarily xfail DOI test

* Change county_or_subregion to county_or_lake_name

* Change county_or_subregion to county_or_lake_name

* Update docs to explain solar cap fac

* Update regen to rare

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Update gsutil in zenodo-cache-sync

* Rename vceregen to vcerare

* Add back user project

* Update project path

* Update project to billing project

* Update dockerfile to replace gsutil with gcloud storage

* Update docs/release_notes.rst

Co-authored-by: E. Belfer <[email protected]>

* Update docs/release_notes.rst

Co-authored-by: E. Belfer <[email protected]>

* Update docs/templates/vcerare_child.rst.jinja

Co-authored-by: E. Belfer <[email protected]>

* First batch of little docs fixes

Co-authored-by: E. Belfer <[email protected]>

* Restructure _combine_city_county_records function

* Add link to zenodo archive to data source page

* Clarify 1 vs. 100 in data source page

* Spread out comments in the _prep_lat_long_fips_df function

* Update docstring for _prep_lat_long_fips_df

* Switch order of add_time_cols and make_cap_frac functions

* Update _combine_city_county_records and move assertion to asset checks

* Change all().all() to any().any()

* Add validations to merges

* docs cleanup tidbits

* Turn _combine_city_county_records function into _drop_city_records and a few other tweaks

* Make fips columns categorical and narrow scope of regex

* data source docs updates

* Add downloadable docs to vcerare data source and fix data source file name to vcerare from vceregen

* Remove 1.34 from field description for capacity_factor_solar_pv

* Add some logs and a function to null county_id_fips values from lakes and an asset check to match

* Update solar_pv metadata

* Rename RARE dataset in the release notes

* Add issue number to release notes

* Update field description for county_or_lake_name

* Update docstring for transform module

* Make all references to FIPS uppercase in notes and comments

* Correct inline comment in _null_non_county_fips_rows

* Fix asset check

* Minor late-night PR fixes

- Fix failing asset check.
- Change dtype for `hour_of_year` from float to int.
- Clarify some release notes / data source docs.

* Log during VCE RARE asset checks to see what's slow.

* Add simple notebook for processing vcerare data

* Re-enable Zenodo DOI validation unit test.

* Update docs to use gcloud storage not gsutil

* Try to reduce memory use & concurrency for VCE RARE dataset

* Retry policy for VCE + highmem use for VCE asset check.

* Bump VM RAM and remove very-high memory tag & retry

* Bump vCPUs to 16

* Add fancy charts to notebook

* Add link to VCE data in nightly build outputs. Other docs tweaks.

---------

Co-authored-by: e-belfer <[email protected]>
Co-authored-by: E. Belfer <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zane Selvans <[email protected]>
5 people authored Oct 19, 2024
1 parent f33711e commit 0ffb363
Showing 16 changed files with 1,147 additions and 13 deletions.
4 changes: 2 additions & 2 deletions devtools/generate_batch_config.py
@@ -56,8 +56,8 @@ def to_config(
}
],
"computeResource": {
"cpuMilli": 8000,
"memoryMib": int(63 * MIB_PER_GB),
"cpuMilli": 16000,
"memoryMib": int(127 * MIB_PER_GB),
"bootDiskMib": 100 * 1024,
},
"maxRunDuration": f"{60 * 60 * 12}s",
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -162,6 +162,7 @@ def data_sources_metadata_to_rst(app):
"epacems",
"phmsagas",
"gridpathratoolkit",
"vcerare",
]
package = PUDL_PACKAGE
extra_etl_groups = {"eia860": ["entity_eia"], "ferc1": ["glue"]}
@@ -213,6 +214,7 @@ def cleanup_rsts(app, exception):
(DOCS_DIR / "data_sources/epacems.rst").unlink()
(DOCS_DIR / "data_sources/phmsagas.rst").unlink()
(DOCS_DIR / "data_sources/gridpathratoolkit.rst").unlink()
(DOCS_DIR / "data_sources/vcerare.rst").unlink()


def cleanup_csv_dir(app, exception):
1 change: 1 addition & 0 deletions docs/data_access.rst
@@ -130,6 +130,7 @@ so we have moved to publishing all our hourly tables using the compressed, colum
* `FERC-714 Hourly Estimated State Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_estimated_state_demand.parquet>`__
* `FERC-714 Hourly Planning Area Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_planning_area_demand.parquet>`__
* `GridPath RA Toolkit Hourly Available Capacity Factors <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_gridpathratoolkit__hourly_available_capacity_factor.parquet>`__
* `VCE Resource Adequacy Renewable Energy Dataset <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_vcerare__hourly_available_capacity_factor.parquet>`__

Raw FERC DBF & XBRL data converted to SQLite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
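
To sanity check the new link added above, here is a minimal sketch of reading the VCE RARE hourly table straight from the nightly build bucket (assumes pandas and pyarrow, with fsspec/aiohttp available for HTTP reads; this snippet is illustrative and not part of the PUDL docs themselves):

    import pandas as pd

    # The new VCE RARE hourly capacity factor table from the nightly build outputs.
    URL = (
        "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/"
        "out_vcerare__hourly_available_capacity_factor.parquet"
    )

    # Hourly, county-level data covering five years: expect a sizeable download.
    vce = pd.read_parquet(URL)
    print(vce.shape)
    print(vce.columns.tolist())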
1 change: 1 addition & 0 deletions docs/data_sources/index.rst
@@ -18,6 +18,7 @@ The following data sources serve as the foundation for our data pipeline.
ferc714
phmsagas
gridpathratoolkit
vcerare
other_data

.. toctree::
22 changes: 12 additions & 10 deletions docs/dev/nightly_data_builds.rst
@@ -265,7 +265,7 @@ ways to install the Google Cloud SDK explained in the link above.

.. code::
-conda install -c conda-forge google-cloud-sdk
+mamba install -c conda-forge google-cloud-sdk
Log into the account you used to create your new project above by running:

@@ -297,16 +297,17 @@ that are available:

.. code::
-gsutil ls -lh gs://builds.catalyst.coop
+gcloud storage ls --long --readable-sizes gs://builds.catalyst.coop
You should see a list of directories with build IDs that have a naming convention:
``<YYYY-MM-DD-HHMM>-<short git commit SHA>-<git branch>``.

-To see what the outputs are for a given nightly build, you can use ``gsutil`` like this:
+To see what the outputs are for a given nightly build, you can use ``gcloud storage``
+like this:

.. code::
-gsutil ls -lh gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/
+gcloud storage ls --long --readable-sizes gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/
804.57 MiB 2024-01-03T11:19:15Z gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/censusdp1tract.sqlite
5.01 GiB 2024-01-03T11:20:02Z gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/core_epacems__hourly_emissions.parquet
@@ -337,22 +338,23 @@ To see what the outputs are for a given nightly build, you can use ``gsutil`` li
TOTAL: 25 objects, 23557650395 bytes (21.94 GiB)
If you want to copy these files down directly to your computer, you can use
-the ``gsutil cp`` command, which behaves very much like the Unix ``cp`` command:
+the ``gcloud storage cp`` command, which behaves very much like the Unix ``cp`` command:

.. code::
-gsutil cp gs://builds.catalyst.coop/<build ID>/pudl.sqlite ./
+gcloud storage cp gs://builds.catalyst.coop/<build ID>/pudl.sqlite ./
If you wanted to download all of the build outputs (more than 10GB!) you could use ``cp
-r`` on the whole directory:

.. code::
-gsutil cp -r gs://builds.catalyst.coop/<build ID>/ ./
+gcloud storage cp --recursive gs://builds.catalyst.coop/<build ID>/ ./
-For more details on how to use ``gsutil`` in general see the
-`online documentation <https://cloud.google.com/storage/docs/gsutil>`__ or run:
+For more background on ``gcloud storage`` see the
+`quickstart guide <https://cloud.google.com/storage/docs/discover-object-storage-gcloud>`__
+or check out the CLI documentation with:

.. code::
-gsutil --help
+gcloud storage --help
15 changes: 15 additions & 0 deletions docs/release_notes.rst
@@ -6,6 +6,21 @@ PUDL Release Notes
v2024.X.x (2024-XX-XX)
---------------------------------------------------------------------------------------

New Data
^^^^^^^^

Vibrant Clean Energy Resource Adequacy Renewable Energy (RARE) Power Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Integrate the VCE hourly capacity factor data for solar PV, onshore wind, and
offshore wind from 2019 through 2023. The data in this table were produced by
Vibrant Clean Energy, and are licensed to the public under the Creative Commons
Attribution 4.0 International license (CC-BY-4.0). This data complements the
WECC-wide GridPath RA Toolkit data currently incorporated into PUDL, providing
capacity factor data nation-wide with a different set of modeling assumptions and
a different granularity for the aggregation of outputs.
See :doc:`data_sources/gridpathratoolkit` and :doc:`data_sources/vcerare` for
more information. See :issue:`#3872`.

New Data Coverage
^^^^^^^^^^^^^^^^^

68 changes: 68 additions & 0 deletions docs/templates/vcerare_child.rst.jinja
@@ -0,0 +1,68 @@
{% extends "data_source_parent.rst.jinja" %}
{% block background %}
The data in the Resource Adequacy Renewable Energy (RARE) Power Dataset were produced by
Vibrant Clean Energy based on outputs from the NOAA HRRR model, and are licensed
to the public under the Creative Commons Attribution 4.0 International license
(CC-BY-4.0).

See the `README <https://doi.org/10.5281/zenodo.13937523>`__ archived on Zenodo for more
detailed information.
{% endblock %}

{% block download_docs %}
{% for filename in download_paths %}
* :download:`{{ filename.stem.replace("_", " ").title() }} ({{ filename.suffix.replace('.', '').upper() }}) <{{ filename }}>`
{% endfor %}
* `NOAA HRRR Model Overview <https://rapidrefresh.noaa.gov/hrrr/>`__
{% endblock %}


{% block availability %}
Hourly, county-level data from 2019 - 2023 is integrated into PUDL. A second release
covering 2014 - 2018 is expected in Q1 of 2025, and will be integrated into PUDL
pending funding availability.
{% endblock %}

{% block respondents %}
This data does not come from a government agency, and is not the result of compulsory
data reporting.
{% endblock %}

{% block original_data %}
The contents of the original CSVs are formatted so that Excel can display the
data without crashing. There's one file per year per generation type, and each
file contains an index column for time (simply 1, 2, 3...8760 to
represent the hours in a year) and columns for each county populated with capacity
factor values as a percentage from 0-100.
{% endblock %}

{% block notable_irregularities %}
Non-county regions
------------------

The original data include capacity factors for some non-county areas, including the Great
Lakes and two small cities (Bedford City, VA and Clifton Forge City, VA). These areas
were assigned "county" FIPS IDs, so there was not a 1:1 relationship
between the FIPS IDs and the named areas, and the geographic region implied by a
FIPS ID did not always correspond to the named area. We've dropped the cities -- one of
which contained no data -- and set the FIPS codes for the Great Lakes to NA. Note that
lakes bordering multiple states appear more than once in the data. VCE used a
nearest-neighbor technique to assign state waters to counties (this applies to coastal
areas as well).

Capacity factors > 1
--------------------
A few capacity factor values in the solar PV data exceed the maximum expected value of 1
(or 100 in the raw data -- PUDL converts the data from a percentage to a fraction to
match other reported capacity factor data). This is because panel power output improves
at colder panel temperatures: during cold, sunny periods, some solar capacity factor
values are greater than 1 (but less than 1.1).

8760-hour years
---------------
This data is primarily used for modeling purposes and conforms to the 8760 hour/year
standard regardless of leap years. This means that 2020 is missing data for December
31st.

{% endblock %}
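
To illustrate the percent-to-fraction conversion and the fixed 8760-hour years described above, here is a hedged sketch of reshaping a raw file of the kind described under "Original data". The file and column names are hypothetical (the output column names echo those mentioned in the commit messages), and the real PUDL transform differs:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    # Each raw CSV: one column of hours (1..8760) plus one column per county,
    # holding capacity factors as percentages from 0-100.
    raw = pd.read_csv("vce_solar_pv_2020.csv")

    tidy = raw.melt(
        id_vars=["hour_of_year"],
        var_name="county_or_lake_name",
        value_name="capacity_factor_solar_pv",
    )

    # PUDL stores capacity factors as fractions, so convert from 0-100 to 0-1.
    # Cold, sunny hours can leave solar PV values slightly above 1 (below 1.1).
    tidy["capacity_factor_solar_pv"] /= 100

    # Years always span exactly 8760 hours starting Jan 1, so in a leap year
    # like 2020 the final timestamp is Dec 30 23:00 UTC and Dec 31 is absent.
    tidy["datetime_utc"] = pd.Timestamp("2020-01-01", tz="UTC") + pd.to_timedelta(
        tidy["hour_of_year"] - 1, unit="h"
    )
    print(tidy["datetime_utc"].max())  # 2020-12-30 23:00:00+00:00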