
Transform vceregen renewable generation profiles #3898

Merged: 106 commits into main from transform-vceregen on Oct 19, 2024

Conversation


@aesharpe aesharpe commented Oct 4, 2024

Overview

Closes #3874

What problem does this address?

Transform the vceregen tables!

What did you change?

Add and populate a transform module for the vceregen data source. The output is a single table that combines the three capacity factor tables with the FIPS table. I will highlight any other lingering questions in the comments.
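Roughly what that combination looks like (a sketch only; the function and frame names below are illustrative, not the actual module code):

import pandas as pd

# Illustrative sketch of the combine step: reshape each wide capacity factor
# table (one column per county/lake) to long format, line the three
# generation types up side by side, then attach the FIPS IDs.
def combine_capacity_factor_tables(
    raw_dfs: dict[str, pd.DataFrame], fips: pd.DataFrame
) -> pd.DataFrame:
    long_frames = []
    for df_name, df in raw_dfs.items():  # e.g. {"solar_pv": ..., "onshore_wind": ..., "offshore_wind": ...}
        long = (
            df.set_index(["report_year", "hour_of_year"])
            .stack()
            .reset_index()
            .rename(columns={"level_2": "county_state_names", 0: f"capacity_factor_{df_name}"})
            .set_index(["report_year", "hour_of_year", "county_state_names"])
        )
        long_frames.append(long)
    combined = pd.concat(long_frames, axis=1).reset_index()
    return combined.merge(fips, on="county_state_names", how="left")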

TO-DO: Other transforms

TO-DO: Once we agree on the transforms

Documentation

Make sure to update relevant aspects of the documentation.

TO-DO: Documentation

Testing

How did you make sure this worked? How can a reviewer verify this?

TO-DO: Testing

@aesharpe aesharpe changed the base branch from main to extract-vceregen October 4, 2024 23:40
@aesharpe aesharpe left a comment

Hello! Here is the basis for vceregen transforms. And here's a little bit of my reasoning.

There are four tables, but I decided to combine them all into one pseudo-output table because it's so big. I could have saved the FIPS table as its own asset, but that didn't feel particularly useful because we already have a table like that and this one is hyper-specific to this dataset.

To the best of my ability, these transforms are designed to conserve memory. If you see any further memory-saving opportunities, let me know!

There are plenty of other little defensive checks I could add; let me know if you see an opportunity to do so!

I tried to minimize function nesting wherever possible so you can clearly follow what's happening to the tables in out_vceregen__hourly_available_capacity_factor().

Comment on lines 73 to 74
TO-DO: decide whether to have the year_hour start at 0 or 1.
and update the date_range accordingly.
Member Author

Flagging this decision

Member

The data comes with an hour-of-year starting at 1, so I think we should probably stick with that convention.

Member

@aesharpe Can you explain why hour 1 is equal to YYYY-MM-DD 00:00 instead of 01:00? This isn't intuitive to me.

Member Author

This is tricky: in datetime, the year starts at 2022-01-01 00:00:00. If we bumped it up by an hour and correlated hour 1 with 2022-01-01 01:00:00, then the 8760th hour would land in the next year (2023-01-01 00:00:00) and wouldn't line up with the report year column, which would still say 2022.
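To make that concrete, here's a minimal sketch of the convention (not the actual transform code):

import pandas as pd

# Hour 1 maps to Jan 1 00:00:00 of the report year, so hour 8760 still
# falls inside the same calendar year.
report_year = 2022
hour_of_year = pd.Series(range(1, 8761))  # 1-indexed, as in the raw data
timestamps = pd.Timestamp(f"{report_year}-01-01") + pd.to_timedelta(hour_of_year - 1, unit="h")
assert timestamps.iloc[0] == pd.Timestamp("2022-01-01 00:00:00")
assert timestamps.iloc[-1] == pd.Timestamp("2022-12-31 23:00:00")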

Member Author

But yeah, it's weird. I could see us going either way.

Member

I think it's an arbitrary choice to index hour-of-year from 0 or 1. We went with 1 because the input data was indexed from 1 (1-8760 rather than 0-8759 for each year). I asked around about this on the socials and different people use different conventions. For example, searching for 8760 data I came across this NREL report, and its plots run from hours 1-8760, but you can find plenty of others that look like they start with 0. Sometimes this comes up as "hour beginning" (with 0) vs. "hour ending" (with 1).

@aesharpe aesharpe self-assigned this Oct 4, 2024
@aesharpe aesharpe requested a review from e-belfer October 4, 2024 23:50
@aesharpe aesharpe added the new-data (Requests for integration of new data.) and vcerare (Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets) labels on Oct 4, 2024
@e-belfer

e-belfer commented Oct 8, 2024

Some asset checks to consider:

@zaneselvans zaneselvans left a comment

  • There are a couple of small changes in the order of operations that may speed things up significantly.
  • The row count check looks like it will always fail in the fast ETL.
  • There are two lake name irregularities (misspelled Hurron, St. vs. Saint Clair).

Do we want to try and make the low-cardinality string columns into categoricals to save memory / space? If you read the Parquet into Pandas naively right now, it takes something like 25-28 GB of memory.
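For reference, a quick way to gauge the savings locally (a sketch; the file and column names are the ones discussed in this PR, but double-check them against the final asset):

import pandas as pd

# Rough local check of how much Categorical dtypes save on the big table.
df = pd.read_parquet("out_vcerare__hourly_available_capacity_factor.parquet")
print(f"before: {df.memory_usage(deep=True).sum() / 1e9:.1f} GB")
for col in ["state", "county_or_lake_name", "county_id_fips"]:
    df[col] = df[col].astype("category")
print(f"after: {df.memory_usage(deep=True).sum() / 1e9:.1f} GB")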


logger.info("Nulling FIPS IDs for lake rows")
lake_county_state_names = [
"lake_erie_ohio",
"lake_hurron_michigan",
Member

I think we should fix the spelling of Lake Huron (and also tell Pattern so they can fix it upstream)

"lake_michigan_michigan",
"lake_michigan_wisconsin",
"lake_ontario_new_york",
"lake_st_clair_michigan",
Member

Every other instance of "Saint" in all of the place names is spelled out. This is the only oddball. Can we standardize on "Saint" and ask Pattern to change it upstream when we update? (it's also the only name that includes punctuation in the original data)

.rename(
columns={"level_3": "county_state_names", 0: f"capacity_factor_{df_name}"}
)
.assign(county_state_names=lambda x: pd.Categorical(x.county_state_names))
Member

What's the intention behind making this a Categorical?

If we want the output Parquet file to use Categorical values to save memory (for the county_or_lake_name, state, or county_id_fips strings) then the fields need to have enum constraints set on them in FIELD_METADATA_BY_GROUP or FIELD_METADATA_BY_RESOURCE at the bottom of pudl/metadata/fields.py. This will also result in an error if any unexpected values show up.

This should be easy to do for state (we do it for the CEMS already), but it'd be harder for the other two columns, since we don't know all the values they'll take on until we've read the data in (unless we want to save them aside as a constant now that we know them).
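If we did want to pin them down as constants, a one-off snippet like this could dump the observed values (illustrative only, not part of the PR):

import pandas as pd

# Dump the distinct values so they could be pasted into an enum constraint.
df = pd.read_parquet("out_vcerare__hourly_available_capacity_factor.parquet")
for col in ["county_or_lake_name", "county_id_fips"]:
    values = sorted(df[col].dropna().unique())
    print(f"{col}: {len(values)} distinct values")
    print(values[:5], "...")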

@@ -1970,6 +2027,10 @@
"description": "The energy contained in fuel burned, measured in million BTU.",
"unit": "MMBtu",
},
"hour_of_year": {
"type": "integer",
Member

I changed this to integer.

def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame):
"""Check that the final output table is as expected."""
# Make sure the table is the expected length
if (length := len(asset_df)) != 136437000:
Member

This will fail in the fast ETL that only uses a single year, won't it? We have a pattern for breaking out the per-year expectations elsewhere that @jdangerx came up with.

I think we were able to get the asset checks running in the integration tests at some point, but they've stopped running again and I don't think we understand why. But in any case I think it'd fail if someone ran the fast ETL in the Dagster UI currently.
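For reference, the per-year pattern would look something like this (a sketch; ROWS_PER_YEAR is illustrative, derived by splitting the 136,437,000 total across an assumed five report years rather than taken from the real settings):

import pandas as pd

# Sketch of a per-year row count check that works in both the fast and full
# ETL. In practice the expected count would be 8760 hours times the number
# of county/lake columns.
ROWS_PER_YEAR = 27_287_400  # illustrative

def check_row_counts_per_year(df: pd.DataFrame) -> list[str]:
    errors = []
    for year, group in df.groupby("report_year"):
        if len(group) != ROWS_PER_YEAR:
            errors.append(f"{year}: expected {ROWS_PER_YEAR} rows, found {len(group)}")
    return errors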

Member

We asked Chris which of the two reports ought to be used as a reference and he indicated the 2020 one was preferable. Is there a reason to include this one too?

Member Author

My reasoning was that it has more detail about the methodology. When Chris said the other one was preferable, I didn't necessarily take that to mean that this one wasn't accurate, just that the other was more succinct. But we can remove it if you think that might not be what Chris intended!

blocking=True,
description="Check that output table is as expected.",
)
def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame): # noqa: C901
Member

Not a blocker, but with a lot of different checks inside a single function like this, we probably want to accumulate some kind of error report and then, at the end of the function, return a single AssetCheckResult that includes all of that information if there are any errors, rather than bailing as soon as we find any failure. Otherwise, if there are multiple failures, you'll have to run the whole asset check multiple times to find them all.
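Roughly the pattern being suggested (a sketch; the two checks inside are placeholders, not the real ones):

import dagster
import pandas as pd

def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame) -> dagster.AssetCheckResult:
    """Accumulate every failure and report them all at once."""
    errors: list[str] = []
    if asset_df["hour_of_year"].min() != 1:  # placeholder check
        errors.append("hour_of_year does not start at 1")
    if (asset_df.filter(like="capacity_factor_") < 0).any().any():  # placeholder check
        errors.append("negative capacity factors found")
    return dagster.AssetCheckResult(
        passed=not errors,
        metadata={"errors": "; ".join(errors) or "none"},
    )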


@e-belfer e-belfer left a comment

Some comments on the notebook, which is very close but doesn't currently work out of the box.

"\n",
"# Make the chart\n",
"plt.hist(plot_df[f\"capacity_factor_{gen_type}\"], bins=100, range=(0, 1))\n",
"plt.title(\n",
Member

We should add x and y axis labels here.
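Something like this, mirroring the cell above (label text is just a suggestion; plot_df and gen_type come from earlier cells):

import matplotlib.pyplot as plt

# Same histogram as above, with axis labels added.
plt.hist(plot_df[f"capacity_factor_{gen_type}"], bins=100, range=(0, 1))
plt.xlabel("Hourly capacity factor")
plt.ylabel("Number of county-hours")
plt.title(f"Distribution of {gen_type} capacity factors")
plt.show()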

"source": [
"#### Memory check - will it Excel? \n",
"\n",
"If you want to know whether your table is capable of being processed as a csv you can use this memory estimator. If the estimated memory exceeds 500 MB it's too big! Cutting columns or making the filter scope smaller will help reduce the file size."
Member

If we're saying "cutting columns", we should show people how to drop columns probably.
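For example, a small cell right before the memory check (the column names here are placeholders; swap in whatever you actually want to drop):

# Drop columns you don't need before exporting; names below are placeholders.
plot_df = plot_df.drop(columns=["county_id_fips", "hour_of_year"])
# Re-estimate the footprint after trimming.
print(f"{plot_df.memory_usage(deep=True).sum() / 1e6:.0f} MB")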

"metadata": {},
"source": [
"#### Download!\n",
"Specify your desired download location by filling in the `download_path` and running this cell will output the data to that location under the name `rare_power_data.csv`."
Member

We'll need to give people more guidance here, since path names often trip people up. Also, having the slash on the file name will cause problems if people just leave this blank.

I'd suggest:

# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = "" 
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
plot_df.to_csv(download_path+"rare_power_data.csv")
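Another option that sidesteps the trailing-slash problem entirely is pathlib (just a suggestion, not part of the review diff):

from pathlib import Path

# An empty string resolves to the notebook's own folder; no trailing slash needed.
download_path = Path("")  # e.g. Path("/home/user/Desktop/folder/data_folder")
plot_df.to_csv(download_path / "rare_power_data.csv")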

"metadata": {},
"outputs": [],
"source": [
"# Add the file path you want to download the data to\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to give people more guidance here, since path names often trip people up. Also, having the slash on the file name will cause problems if people just leave this blank.

I'd suggest:

# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = "" 
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
plot_df.to_csv(download_path+"rare_power_data.csv")

@zaneselvans

  • The two small order-of-operations changes didn't have big impacts.
  • Converting the input dataframes to use dtype_backend="pyarrow" also did not help.
  • Peak memory usage for the output asset appears to be about 13 GB (based on watching btop locally).
  • The df.stack() operations each take about 20 sec to run, and spike memory usage.
  • The df.concat() operation takes about 2 min to run and also spikes memory usage.

I haven't found a simple way around it, so I have created a new superchonk asset tag: "memory-use": "very-high" that limits concurrency to 1 for the VCE asset.

I've also added "memory-use": "high" to the asset check since it seems to be peaking at more than 4GB of memory usage.

@zaneselvans

  • Unfortunately concurrency limits only operate within a single tag group, so the very-high tag didn't help.
  • Additional configuration is required to make retries work, so it won't help us immediately.
  • Going with the brute force approach of making the VM bigger temporarily.

@zaneselvans

I've compiled some loose ends to clean up after the data release in #3914

@zaneselvans zaneselvans dismissed e-belfer’s stale review October 19, 2024 04:26

Need to get this in the builds tonight.

@aesharpe aesharpe added this pull request to the merge queue Oct 19, 2024
Merged via the queue into main with commit 0ffb363 Oct 19, 2024
17 checks passed
@aesharpe aesharpe deleted the transform-vceregen branch October 19, 2024 05:53
Labels: new-data (Requests for integration of new data.), vcerare (Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets)
Projects: Status: Done
Successfully merging this pull request may close these issues: Clean, transform, and sanity check the vceregen data
3 participants