
Transform vceregen renewable generation profiles #3898

Merged: 106 commits into main from transform-vceregen on Oct 19, 2024

Conversation


@aesharpe aesharpe commented Oct 4, 2024

Overview

Closes #3874

What problem does this address?

Transform the vceregen tables!

What did you change?

Add and populate a transform module for the vceregen data source. The output is a single table that combines the three capacity factor tables with the FIPS table. I will highlight any other lingering questions in the comments.
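Roughly what that combination looks like (a sketch only; the function and frame names below are illustrative, not the actual module code):

import pandas as pd

# Illustrative sketch of the combine step: reshape each wide capacity factor
# table (one column per county/lake) to long format, line the three
# generation types up side by side, then attach the FIPS IDs.
def combine_capacity_factor_tables(
    raw_dfs: dict[str, pd.DataFrame], fips: pd.DataFrame
) -> pd.DataFrame:
    long_frames = []
    for df_name, df in raw_dfs.items():  # e.g. {"solar_pv": ..., "onshore_wind": ..., "offshore_wind": ...}
        long = (
            df.set_index(["report_year", "hour_of_year"])
            .stack()
            .reset_index()
            .rename(columns={"level_2": "county_state_names", 0: f"capacity_factor_{df_name}"})
            .set_index(["report_year", "hour_of_year", "county_state_names"])
        )
        long_frames.append(long)
    combined = pd.concat(long_frames, axis=1).reset_index()
    return combined.merge(fips, on="county_state_names", how="left")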

TO-DO: Other transforms

TO-DO: Once we agree on the transforms

Documentation

Make sure to update relevant aspects of the documentation.

TO-DO: Documentation

Testing

How did you make sure this worked? How can a reviewer verify this?

TO-DO: Testing

@aesharpe aesharpe changed the base branch from main to extract-vceregen October 4, 2024 23:40
@aesharpe aesharpe left a comment

Hello! Here is the basis for vceregen transforms. And here's a little bit of my reasoning.

There are four tables, but I decided to combine them all into one pseudo-output table because it's so big. I could have saved the FIPS table as its own asset, but that didn't feel particularly useful because we already have a table like that and this one is hyper-specific to this dataset.

To the best of my ability, these transforms are designed to conserve memory. If you see any further memory-saving opportunities, let me know!

There are plenty of other little defensive checks I could add; let me know if you see an opportunity to do so!

I tried to minimize function nesting wherever possible so you can clearly follow what's happening to the tables in out_vceregen__hourly_available_capacity_factor().

Comment on lines 73 to 74
TO-DO: decide whether to have the year_hour start at 0 or 1.
and update the date_range accordingly.
Member Author

Flagging this decision

Member

The data comes with an hour-of-year starting at 1, so I think we should probably stick with that convention.

Member

@aesharpe Can you explain why hour 1 is equal to YYYY-MM-DD 00:00 instead of 01:00? This isn't intuitive to me.

Member Author

This is tricky: in datetime, the year starts at 2022-01-01 00:00:00. If we bumped it up by an hour and correlated hour 1 with 2022-01-01 01:00:00, then the 8760th hour would land in the next year (2023-01-01 00:00:00) and wouldn't line up with the report year column, which would still say 2022.
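To make that concrete, here's a minimal sketch of the convention (not the actual transform code):

import pandas as pd

# Hour 1 maps to Jan 1 00:00:00 of the report year, so hour 8760 still
# falls inside the same calendar year.
report_year = 2022
hour_of_year = pd.Series(range(1, 8761))  # 1-indexed, as in the raw data
timestamps = pd.Timestamp(f"{report_year}-01-01") + pd.to_timedelta(hour_of_year - 1, unit="h")
assert timestamps.iloc[0] == pd.Timestamp("2022-01-01 00:00:00")
assert timestamps.iloc[-1] == pd.Timestamp("2022-12-31 23:00:00")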

Member Author

But yeah, it's weird. I could see us going either way.

Member

I think it's an arbitrary choice to index hour-of-year from 0 or 1. We went with 1 because the input data was indexed from 1 (1-8760 rather than 0-8759 for each year). I asked around about this on the socials and different people use different conventions. For example, searching for 8760 data I came across this NREL report, and its plots run from hours 1-8760, but you can find plenty of others that look like they start with 0. Sometimes this comes up as "hour beginning" (with 0) vs. "hour ending" (with 1).

@aesharpe aesharpe self-assigned this Oct 4, 2024
@aesharpe aesharpe requested a review from e-belfer October 4, 2024 23:50
@aesharpe aesharpe added the new-data (Requests for integration of new data.) and vcerare (Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets) labels on Oct 4, 2024
@e-belfer

e-belfer commented Oct 8, 2024

Some asset checks to consider:

@zaneselvans zaneselvans left a comment

  • There are a couple of small changes in the order of operations that may speed things up significantly.
  • The row count check looks like it will always fail in the fast ETL.
  • There are two lake name irregularities (misspelled Hurron, St. vs. Saint Clair).

Do we want to try and make the low-cardinality string columns into categoricals to save memory / space? If you read the Parquet into Pandas naively right now, it takes something like 25-28 GB of memory.
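For reference, a quick way to gauge the savings locally (a sketch; the file and column names are the ones discussed in this PR, but double-check them against the final asset):

import pandas as pd

# Rough local check of how much Categorical dtypes save on the big table.
df = pd.read_parquet("out_vcerare__hourly_available_capacity_factor.parquet")
print(f"before: {df.memory_usage(deep=True).sum() / 1e9:.1f} GB")
for col in ["state", "county_or_lake_name", "county_id_fips"]:
    df[col] = df[col].astype("category")
print(f"after: {df.memory_usage(deep=True).sum() / 1e9:.1f} GB")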


logger.info("Nulling FIPS IDs for lake rows")
lake_county_state_names = [
"lake_erie_ohio",
"lake_hurron_michigan",
Member

I think we should fix the spelling of Lake Huron (and also tell Pattern so they can fix it upstream)

"lake_michigan_michigan",
"lake_michigan_wisconsin",
"lake_ontario_new_york",
"lake_st_clair_michigan",
Member

Every other instance of "Saint" in all of the place names is spelled out. This is the only oddball. Can we standardize on "Saint" and ask Pattern to change it upstream when we update? (it's also the only name that includes punctuation in the original data)

.rename(
columns={"level_3": "county_state_names", 0: f"capacity_factor_{df_name}"}
)
.assign(county_state_names=lambda x: pd.Categorical(x.county_state_names))
Member

What's the intention behind making this a Categorical?

If we want the output Parquet file to use Categorical values to save memory (for the county_or_lake_name, state, or county_id_fips strings) then the fields need to have enum constraints set on them in FIELD_METADATA_BY_GROUP or FIELD_METADATA_BY_RESOURCE at the bottom of pudl/metadata/fields.py. This will also result in an error if any unexpected values show up.

This should be easy to do for state (we do it for the CEMS already), but it'd be harder for the other two columns, since we don't know all the values they'll take on until we've read the data in (unless we want to save them aside as a constant now that we know them).
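If we did want to pin them down as constants, a one-off snippet like this could dump the observed values (illustrative only, not part of the PR):

import pandas as pd

# Dump the distinct values so they could be pasted into an enum constraint.
df = pd.read_parquet("out_vcerare__hourly_available_capacity_factor.parquet")
for col in ["county_or_lake_name", "county_id_fips"]:
    values = sorted(df[col].dropna().unique())
    print(f"{col}: {len(values)} distinct values")
    print(values[:5], "...")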

@@ -1970,6 +2027,10 @@
"description": "The energy contained in fuel burned, measured in million BTU.",
"unit": "MMBtu",
},
"hour_of_year": {
"type": "integer",
Member

I changed this to integer.

def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame):
"""Check that the final output table is as expected."""
# Make sure the table is the expected length
if (length := len(asset_df)) != 136437000:
Member

This will fail in the fast ETL that only uses a single year, won't it? We have a pattern for breaking out the per-year expectations elsewhere that @jdangerx came up with.

I think we were able to get the asset checks running in the integration tests at some point, but they've stopped running again and I don't think we understand why. But in any case I think it'd fail if someone ran the fast ETL in the Dagster UI currently.
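For reference, the per-year pattern would look something like this (a sketch; ROWS_PER_YEAR is illustrative, derived by splitting the 136,437,000 total across an assumed five report years rather than taken from the real settings):

import pandas as pd

# Sketch of a per-year row count check that works in both the fast and full
# ETL. In practice the expected count would be 8760 hours times the number
# of county/lake columns.
ROWS_PER_YEAR = 27_287_400  # illustrative

def check_row_counts_per_year(df: pd.DataFrame) -> list[str]:
    errors = []
    for year, group in df.groupby("report_year"):
        if len(group) != ROWS_PER_YEAR:
            errors.append(f"{year}: expected {ROWS_PER_YEAR} rows, found {len(group)}")
    return errors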

Member

We asked Chris which of the two reports ought to be used as a reference and he indicated the 2020 one was preferable. Is there a reason to include this one too?

Member Author

My reasoning was that it has more detail about the methodology. When Chris said the other one was preferable, I didn't necessarily take that to mean that this one wasn't accurate, just that the other was more succinct. But we can remove it if you think that might not be what Chris intended!

blocking=True,
description="Check that output table is as expected.",
)
def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame): # noqa: C901
Member

Not a blocker, but with a lot of different checks inside a single function like this, we probably want to accumulate some kind of error report and then, at the end of the function, return a single AssetCheckResult that includes all of that information if there are any errors, rather than bailing as soon as we find any failure. Otherwise, if there are multiple failures, you'll have to run the whole asset check multiple times to find them all.
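Roughly the pattern being suggested (a sketch; the two checks inside are placeholders, not the real ones):

import dagster
import pandas as pd

def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame) -> dagster.AssetCheckResult:
    """Accumulate every failure and report them all at once."""
    errors: list[str] = []
    if asset_df["hour_of_year"].min() != 1:  # placeholder check
        errors.append("hour_of_year does not start at 1")
    if (asset_df.filter(like="capacity_factor_") < 0).any().any():  # placeholder check
        errors.append("negative capacity factors found")
    return dagster.AssetCheckResult(
        passed=not errors,
        metadata={"errors": "; ".join(errors) or "none"},
    )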


@e-belfer e-belfer left a comment

Some comments on the notebook, which is very close but doesn't currently work out of the box.

"\n",
"# Make the chart\n",
"plt.hist(plot_df[f\"capacity_factor_{gen_type}\"], bins=100, range=(0, 1))\n",
"plt.title(\n",
Member

We should add x and y axis labels here.
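Something like this, mirroring the cell above (label text is just a suggestion; plot_df and gen_type come from earlier cells):

import matplotlib.pyplot as plt

# Same histogram as above, with axis labels added.
plt.hist(plot_df[f"capacity_factor_{gen_type}"], bins=100, range=(0, 1))
plt.xlabel("Hourly capacity factor")
plt.ylabel("Number of county-hours")
plt.title(f"Distribution of {gen_type} capacity factors")
plt.show()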

"source": [
"#### Memory check - will it Excel? \n",
"\n",
"If you want to know whether your table is capable of being processed as a csv you can use this memory estimator. If the estimated memory exceeds 500 MB it's too big! Cutting columns or making the filter scope smaller will help reduce the file size."
Member

If we're saying "cutting columns", we should show people how to drop columns probably.
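For example, a small cell right before the memory check (the column names here are placeholders; swap in whatever you actually want to drop):

# Drop columns you don't need before exporting; names below are placeholders.
plot_df = plot_df.drop(columns=["county_id_fips", "hour_of_year"])
# Re-estimate the footprint after trimming.
print(f"{plot_df.memory_usage(deep=True).sum() / 1e6:.0f} MB")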

"metadata": {},
"source": [
"#### Download!\n",
"Specify your desired download location by filling in the `download_path` and running this cell will output the data to that location under the name `rare_power_data.csv`."
Member

We'll need to give people more guidance here, since path names often trip people up. Also, having the slash on the file name will cause problems if people just leave this blank.

I'd suggest:

# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = "" 
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
plot_df.to_csv(download_path+"rare_power_data.csv")
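Another option that sidesteps the trailing-slash problem entirely is pathlib (just a suggestion, not part of the review diff):

from pathlib import Path

# An empty string resolves to the notebook's own folder; no trailing slash needed.
download_path = Path("")  # e.g. Path("/home/user/Desktop/folder/data_folder")
plot_df.to_csv(download_path / "rare_power_data.csv")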

"metadata": {},
"outputs": [],
"source": [
"# Add the file path you want to download the data to\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to give people more guidance here, since path names often trip people up. Also, having the slash on the file name will cause problems if people just leave this blank.

I'd suggest:

# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = "" 
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
plot_df.to_csv(download_path+"rare_power_data.csv")

@zaneselvans

  • The two small order-of-operations changes didn't have big impacts.
  • Converting the input dataframes to use dtype_backend="pyarrow" also did not help.
  • Peak memory usage for the output asset appears to be about 13 GB (based on watching btop locally).
  • The df.stack() operations each take about 20 sec to run, and spike memory usage.
  • The df.concat() operation takes about 2 min to run and also spikes memory usage.

I haven't found a simple way around it, so I have created a new superchonk asset tag: "memory-use": "very-high" that limits concurrency to 1 for the VCE asset.

I've also added "memory-use": "high" to the asset check since it seems to be peaking at more than 4GB of memory usage.

@zaneselvans

  • Unfortunately concurrency limits only operate within a single tag group, so the very-high tag didn't help.
  • Additional configuration is required to make retries work, so it won't help us immediately.
  • Going with the brute force approach of making the VM bigger temporarily.

@zaneselvans

I've compiled some loose ends to clean up after the data release in #3914

@zaneselvans zaneselvans dismissed e-belfer’s stale review October 19, 2024 04:26

Need to get this in the builds tonight.

@aesharpe aesharpe added this pull request to the merge queue Oct 19, 2024
Merged via the queue into main with commit 0ffb363 Oct 19, 2024
17 checks passed
@aesharpe aesharpe deleted the transform-vceregen branch October 19, 2024 05:53
Labels: new-data (Requests for integration of new data.), vcerare (Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets)
Projects: Status: Done
Successfully merging this pull request may close these issues: Clean, transform, and sanity check the vceregen data
3 participants