Transform vceregen renewable generation profiles (#3898)
* Add source metadata for vceregen

* Add profiles to vceregen dataset name

* Remove blank line in description

* Add blank Data Source template for vceregen

* Add links to download docs section

* Add availability section

* Add respondents section

* Add original data section

* Stash WIP of extraction

* Extract VCE tables to raw dask dfs

* Clean up warnings and restore EIA 176

* Revert to pandas concatenation

* Add latlonfips

* Add blank transform module for vceregen

* Fill out the basic vceregen transforms

* Add underscores back to function names

* Update time col calculation

* Update docstrings and comments to reflect new time cols

* Change merge to concat

* Remove dask, coerce dtypes on read-in

* override load_column_maps behavior

* Update addition of county and state name fields

* Add vceregen to init files and metadata so that it will run on dagster, update column names, and add new fields

* Add resource metadata for vceregen

* Clean county strings more

* Add release notes

* Add function to validate state_county_names and improve performance of add_time_cols function

* make for loops into dict comp, update loggers, and improve regex

* Add asset checks and remove inline checks

* Change hour_utc to datetime_utc

* Remove incorrect docstring

* Update dataset and field metadata

* Rename county col to county_or_subregion

* Update data_source docs page

* change axis=1 to axis=columns

* Update DOI to sandbox and temporarily xfail DOI test

* Change county_or_subregion to county_or_lake_name

* Change county_or_subregion to county_or_lake_name

* Update docs to explain solar cap fac

* Update regen to rare

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Update gsutil in zenodo-cache-sync

* Rename vceregen to vcerare

* Add back user project

* Update project path

* Update project to billing project

* Update dockerfile to replace gsutil with gcloud storage

* Update docs/release_notes.rst

Co-authored-by: E. Belfer <[email protected]>

* Update docs/release_notes.rst

Co-authored-by: E. Belfer <[email protected]>

* Update docs/templates/vcerare_child.rst.jinja

Co-authored-by: E. Belfer <[email protected]>

* First batch of little docs fixes

Co-authored-by: E. Belfer <[email protected]>

* Restructure _combine_city_county_records function

* Add link to zenodo archive to data source page

* Clarify 1 vs. 100 in data source page

* Spread out comments in the _prep_lat_long_fips_df function

* Update docstring for _prep_lat_long_fips_df

* Switch order of add_time_cols and make_cap_frac functions

* Update _combine_city_county_records and move assertion to asset checks

* Change all().all() to any().any()

* Add validations to merges

* docs cleanup tidbits

* Turn _combine_city_county_records function into _drop_city_records and a few other tweaks

* Make fips columns categorical and narrow scope of regex

* data source docs updates

* Add downloadable docs to vcerare data source and fix data source file name to vcerare from vceregen

* Remove 1.34 from field description for capacity_factor_solar_pv

* Add some logs and a function to null county_id_fips values from lakes and an asset check to match

* Update solar_pv metadata

* Rename RARE dataset in the release notes

* Add issue number to release notes

* Update field description for county_or_lake_name

* Update docstring for transform module

* Make all references to FIPS uppercase in notes and comments

* Correct inline comment in _null_non_county_fips_rows

* Fix asset check

* Minor late-night PR fixes

- Fix failing asset check.
- Change dtype for `hour_of_year` from float to int.
- Clarify some release notes / data source docs.

* Log during VCE RARE asset checks to see what's slow.

* Add simple notebook for processing vcerare data

* Re-enable Zenodo DOI validation unit test.

* Update docs to use gcloud storage not gsutil

* Try to reduce memory use & concurrency for VCE RARE dataset

* Retry policy for VCE + highmem use for VCE asset check.

* Bump VM RAM and remove very-high memory tag & retry

* Bump vCPUs to 16

* Add fancy charts to notebook

* Add link to VCE data in nightly build outputs. Other docs tweaks.

---------

Co-authored-by: e-belfer <[email protected]>
Co-authored-by: E. Belfer <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zane Selvans <[email protected]>
5 people authored Oct 19, 2024
1 parent f33711e commit 0ffb363
Showing 16 changed files with 1,147 additions and 13 deletions.
4 changes: 2 additions & 2 deletions devtools/generate_batch_config.py
@@ -56,8 +56,8 @@ def to_config(
}
],
"computeResource": {
"cpuMilli": 8000,
"memoryMib": int(63 * MIB_PER_GB),
"cpuMilli": 16000,
"memoryMib": int(127 * MIB_PER_GB),
"bootDiskMib": 100 * 1024,
},
"maxRunDuration": f"{60 * 60 * 12}s",
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -162,6 +162,7 @@ def data_sources_metadata_to_rst(app):
"epacems",
"phmsagas",
"gridpathratoolkit",
"vcerare",
]
package = PUDL_PACKAGE
extra_etl_groups = {"eia860": ["entity_eia"], "ferc1": ["glue"]}
@@ -213,6 +214,7 @@ def cleanup_rsts(app, exception):
(DOCS_DIR / "data_sources/epacems.rst").unlink()
(DOCS_DIR / "data_sources/phmsagas.rst").unlink()
(DOCS_DIR / "data_sources/gridpathratoolkit.rst").unlink()
(DOCS_DIR / "data_sources/vcerare.rst").unlink()


def cleanup_csv_dir(app, exception):
1 change: 1 addition & 0 deletions docs/data_access.rst
@@ -130,6 +130,7 @@ so we have moved to publishing all our hourly tables using the compressed, colum
* `FERC-714 Hourly Estimated State Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_estimated_state_demand.parquet>`__
* `FERC-714 Hourly Planning Area Demand <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_ferc714__hourly_planning_area_demand.parquet>`__
* `GridPath RA Toolkit Hourly Available Capacity Factors <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_gridpathratoolkit__hourly_available_capacity_factor.parquet>`__
* `VCE Resource Adequacy Renewable Energy Dataset <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_vcerare__hourly_available_capacity_factor.parquet>`__

Raw FERC DBF & XBRL data converted to SQLite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
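
To sanity check the new link added above, here is a minimal sketch of reading the VCE RARE hourly table straight from the nightly build bucket (assumes pandas and pyarrow, with fsspec/aiohttp available for HTTP reads; this snippet is illustrative and not part of the PUDL docs themselves):

    import pandas as pd

    # The new VCE RARE hourly capacity factor table from the nightly build outputs.
    URL = (
        "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/"
        "out_vcerare__hourly_available_capacity_factor.parquet"
    )

    # Hourly, county-level data covering five years: expect a sizeable download.
    vce = pd.read_parquet(URL)
    print(vce.shape)
    print(vce.columns.tolist())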
1 change: 1 addition & 0 deletions docs/data_sources/index.rst
@@ -18,6 +18,7 @@ The following data sources serve as the foundation for our data pipeline.
ferc714
phmsagas
gridpathratoolkit
vcerare
other_data

.. toctree::
22 changes: 12 additions & 10 deletions docs/dev/nightly_data_builds.rst
@@ -265,7 +265,7 @@ ways to install the Google Cloud SDK explained in the link above.

.. code::
-conda install -c conda-forge google-cloud-sdk
+mamba install -c conda-forge google-cloud-sdk
Log into the account you used to create your new project above by running:

@@ -297,16 +297,17 @@ that are available:

.. code::
-gsutil ls -lh gs://builds.catalyst.coop
+gcloud storage ls --long --readable-sizes gs://builds.catalyst.coop
You should see a list of directories with build IDs that have a naming convention:
``<YYYY-MM-DD-HHMM>-<short git commit SHA>-<git branch>``.

-To see what the outputs are for a given nightly build, you can use ``gsutil`` like this:
+To see what the outputs are for a given nightly build, you can use ``gcloud storage``
+like this:

.. code::
-gsutil ls -lh gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/
+gcloud storage ls --long --readable-sizes gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/
804.57 MiB 2024-01-03T11:19:15Z gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/censusdp1tract.sqlite
5.01 GiB 2024-01-03T11:20:02Z gs://builds.catalyst.coop/2024-01-03-0605-e9a91be-dev/core_epacems__hourly_emissions.parquet
@@ -337,22 +338,23 @@ To see what the outputs are for a given nightly build, you can use ``gsutil`` li
TOTAL: 25 objects, 23557650395 bytes (21.94 GiB)
If you want to copy these files down directly to your computer, you can use
-the ``gsutil cp`` command, which behaves very much like the Unix ``cp`` command:
+the ``gcloud storage cp`` command, which behaves very much like the Unix ``cp`` command:

.. code::
-gsutil cp gs://builds.catalyst.coop/<build ID>/pudl.sqlite ./
+gcloud storage cp gs://builds.catalyst.coop/<build ID>/pudl.sqlite ./
If you wanted to download all of the build outputs (more than 10GB!) you could use ``cp
-r`` on the whole directory:

.. code::
-gsutil cp -r gs://builds.catalyst.coop/<build ID>/ ./
+gcloud storage cp --recursive gs://builds.catalyst.coop/<build ID>/ ./
-For more details on how to use ``gsutil`` in general see the
-`online documentation <https://cloud.google.com/storage/docs/gsutil>`__ or run:
+For more background on ``gcloud storage`` see the
+`quickstart guide <https://cloud.google.com/storage/docs/discover-object-storage-gcloud>`__
+or check out the CLI documentation with:

.. code::
-gsutil --help
+gcloud storage --help
15 changes: 15 additions & 0 deletions docs/release_notes.rst
@@ -6,6 +6,21 @@ PUDL Release Notes
v2024.X.x (2024-XX-XX)
---------------------------------------------------------------------------------------

New Data
^^^^^^^^

Vibrant Clean Energy Resource Adequacy Renewable Energy (RARE) Power Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Integrate the VCE hourly capacity factor data for solar PV, onshore wind, and
offshore wind from 2019 through 2023. The data in this table were produced by
Vibrant Clean Energy, and are licensed to the public under the Creative Commons
Attribution 4.0 International license (CC-BY-4.0). This data complements the
WECC-wide GridPath RA Toolkit data currently incorporated into PUDL, providing
capacity factor data nation-wide with a different set of modeling assumptions and
a different granularity for the aggregation of outputs.
See :doc:`data_sources/gridpathratoolkit` and :doc:`data_sources/vcerare` for
more information. See :issue:`#3872`.

New Data Coverage
^^^^^^^^^^^^^^^^^

68 changes: 68 additions & 0 deletions docs/templates/vcerare_child.rst.jinja
@@ -0,0 +1,68 @@
{% extends "data_source_parent.rst.jinja" %}
{% block background %}
The data in the Resource Adequacy Renewable Energy (RARE) Power Dataset were produced by
Vibrant Clean Energy based on outputs from the NOAA HRRR model, and are licensed
to the public under the Creative Commons Attribution 4.0 International license
(CC-BY-4.0).

See the `README <https://doi.org/10.5281/zenodo.13937523>`__ archived on Zenodo for more
detailed information.
{% endblock %}

{% block download_docs %}
{% for filename in download_paths %}
* :download:`{{ filename.stem.replace("_", " ").title() }} ({{ filename.suffix.replace('.', '').upper() }}) <{{ filename }}>`
{% endfor %}
* `NOAA HRRR Model Overview <https://rapidrefresh.noaa.gov/hrrr/>`__
{% endblock %}


{% block availability %}
Hourly, county-level data from 2019 - 2023 is integrated into PUDL. A second release
covering 2014 - 2018 is expected in Q1 of 2025, and will be integrated into PUDL
pending funding availability.
{% endblock %}

{% block respondents %}
This data does not come from a government agency, and is not the result of compulsory
data reporting.
{% endblock %}

{% block original_data %}
The contents of the original CSVs are formatted so that Excel can display the
data without crashing. There's one file per year per generation type, and each
file contains an index column for time (simply 1, 2, 3...8760 to
represent the hours in a year) and columns for each county populated with capacity
factor values as a percentage from 0-100.
{% endblock %}

{% block notable_irregularities %}
Non-county regions
------------------

The original data include capacity factors for some non-county areas, including the Great
Lakes and two small cities (Bedford City, VA and Clifton Forge City, VA). These areas
were assigned "county" FIPS IDs, so there was not a 1:1 relationship
between the FIPS IDs and the named areas, and the geographic region implied by a
FIPS ID did not always correspond to the named area. We've dropped the cities -- one of
which contained no data -- and set the FIPS codes for the Great Lakes to NA. Note that
lakes bordering multiple states appear more than once in the data. VCE used a
nearest-neighbor technique to assign state waters to counties (this applies to coastal
areas as well).

Capacity factors > 1
--------------------
A few capacity factor values in the solar PV data exceed the maximum expected value of 1
(or 100 in the raw data -- PUDL converts the data from a percentage to a fraction to
match other reported capacity factor data). This is because panel power output improves
at colder panel temperatures: during cold, sunny periods, some solar capacity factor
values are greater than 1 (but less than 1.1).

8760-hour years
---------------
This data is primarily used for modeling purposes and conforms to the 8760 hour/year
standard regardless of leap years. This means that 2020 is missing data for December
31st.

{% endblock %}
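
To illustrate the percent-to-fraction conversion and the fixed 8760-hour years described above, here is a hedged sketch of reshaping a raw file of the kind described under "Original data". The file and column names are hypothetical (the output column names echo those mentioned in the commit messages), and the real PUDL transform differs:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    # Each raw CSV: one column of hours (1..8760) plus one column per county,
    # holding capacity factors as percentages from 0-100.
    raw = pd.read_csv("vce_solar_pv_2020.csv")

    tidy = raw.melt(
        id_vars=["hour_of_year"],
        var_name="county_or_lake_name",
        value_name="capacity_factor_solar_pv",
    )

    # PUDL stores capacity factors as fractions, so convert from 0-100 to 0-1.
    # Cold, sunny hours can leave solar PV values slightly above 1 (below 1.1).
    tidy["capacity_factor_solar_pv"] /= 100

    # Years always span exactly 8760 hours starting Jan 1, so in a leap year
    # like 2020 the final timestamp is Dec 30 23:00 UTC and Dec 31 is absent.
    tidy["datetime_utc"] = pd.Timestamp("2020-01-01", tz="UTC") + pd.to_timedelta(
        tidy["hour_of_year"] - 1, unit="h"
    )
    print(tidy["datetime_utc"].max())  # 2020-12-30 23:00:00+00:00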