Transform vceregen renewable generation profiles #3898
Conversation
Hello! Here is the basis for the vceregen transforms, and a little bit of my reasoning.
There are four tables, but I decided to combine them all into one pseudo-output table because it's so big. I could have saved the FIPS table as its own asset, but that didn't feel particularly useful because we already have a table like that, and this one is hyper-specific to this dataset.
To the best of my ability, these transforms are designed to conserve memory. If you see any further memory-saving opportunities, let me know!
There are plenty of other little defensive checks I could add; let me know if you see an opportunity to do so!
I tried to minimize function nesting wherever possible so you can clearly follow what's happening to the tables in out_vceregen__hourly_available_capacity_factor().
src/pudl/transform/vceregen.py
Outdated
TO-DO: decide whether to have the year_hour start at 0 or 1.
and update the date_range accordingly.
Flagging this decision
The data comes with an hour-of-year starting at 1, so I think we should probably stick with that convention.
@aesharpe Can you explain why hour 1 is equal to YYYY-MM-DD 00:00 instead of 01:00? This isn't intuitive to me.
This is tricky: in datetime, the year starts at 2022-01-01 00:00:00. If we bumped it up by an hour and correlated hour 1 with 2022-01-01 01:00:00, then the 8760th hour would fall in the next year (2023-01-01 00:00:00) and not line up with the report_year column, which would still say 2022.
But yeah, it's weird. I could see us going either way.
I think it's an arbitrary choice to index hour-of-year from 0 or 1. We went with 1 because the input data was indexed from 1 (1-8760 rather than 0-8759 for each year). I asked around about this on the socials and different people use different conventions. E.g. searching for 8760 data I come across this NREL report and the plots run from hours 1-8760, but you can find plenty of others that look like they start with 0. Sometimes this comes up as "hour beginning" (with 0) vs. "hour ending" (with 1).
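To make the trade-off concrete, here's a minimal pandas sketch (not the PUDL implementation) of how a 1-indexed hour_of_year maps onto datetimes so that all 8760 hours stay inside the report year:

```python
# Sketch: mapping a 1-indexed hour_of_year onto datetimes, so hour 1 lands on
# midnight Jan 1 and hour 8760 stays inside the report year.
import pandas as pd

report_year = 2022
hours = pd.Series(range(1, 8761), name="hour_of_year")  # 1-indexed, as in the raw data

# Subtract 1 so hour 1 -> 2022-01-01 00:00:00 ("hour ending" convention).
timestamps = pd.Timestamp(f"{report_year}-01-01") + pd.to_timedelta(hours - 1, unit="h")

print(timestamps.iloc[0])   # 2022-01-01 00:00:00
print(timestamps.iloc[-1])  # 2022-12-31 23:00:00
```

Indexing from 0 instead would just drop the `- 1`, at the cost of hour 8760 spilling into the next calendar year if hour 1 were pinned to 01:00.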
Some asset checks to consider:
…erative/pudl into transform-vceregen
- Fix failing asset check. - Change dtype for `hour_of_year` from float to int. - Clarify some release notes / data source docs.
- There are a couple of small changes in the order of operations that may speed things up significantly.
- The row count check looks like it will always fail in the fast ETL.
- There are two lake name irregularities (misspelled "Hurron"; "St." vs. "Saint" Clair).
Do we want to try to make the low-cardinality string columns into categoricals to save memory/space? If you read the Parquet into pandas naively right now, it takes around 25-28 GB of memory.
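For a sense of the potential savings, here's a small illustrative sketch (the column values are made up) comparing an object-dtype string column to a Categorical one:

```python
# Sketch: memory footprint of a low-cardinality string column as object dtype
# vs. as a pandas Categorical. Values are invented for illustration.
import pandas as pd

df = pd.DataFrame({"state": ["michigan", "ohio", "wisconsin"] * 100_000})

as_object = df["state"].memory_usage(deep=True)
as_categorical = df["state"].astype("category").memory_usage(deep=True)

print(f"object: {as_object:,} bytes, categorical: {as_categorical:,} bytes")
```

With only a few distinct values repeated hundreds of thousands of times, the categorical representation (integer codes plus a tiny category table) is typically an order of magnitude smaller.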
src/pudl/transform/vcerare.py
Outdated
logger.info("Nulling FIPS IDs for lake rows")
lake_county_state_names = [
    "lake_erie_ohio",
    "lake_hurron_michigan",
I think we should fix the spelling of Lake Huron (and also tell Pattern so they can fix it upstream)
src/pudl/transform/vcerare.py
Outdated
    "lake_michigan_michigan",
    "lake_michigan_wisconsin",
    "lake_ontario_new_york",
    "lake_st_clair_michigan",
Every other instance of "Saint" in all of the place names is spelled out. This is the only oddball. Can we standardize on "Saint" and ask Pattern to change it upstream when we update? (it's also the only name that includes punctuation in the original data)
.rename(
    columns={"level_3": "county_state_names", 0: f"capacity_factor_{df_name}"}
)
.assign(county_state_names=lambda x: pd.Categorical(x.county_state_names))
What's the intention behind making this a Categorical?
If we want the output Parquet file to use Categorical values to save memory (for the county_or_lake_name, state, or county_id_fips strings), then the fields need to have enum constraints set on them in FIELD_METADATA_BY_GROUP or FIELD_METADATA_BY_RESOURCE at the bottom of pudl/metadata/fields.py. This will also result in an error if any unexpected values show up.
This should be easy to do for state (we do it for the CEMS already), but it'd be harder for the other two columns, since we don't know all the values they'll take on until we've read the data in (unless we want to save them aside as a constant now that we know them).
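As a rough illustration only (the real structure of FIELD_METADATA_BY_GROUP in pudl/metadata/fields.py may differ, and the group key and value list below are invented), the enum idea looks something like:

```python
# Hypothetical sketch of an enum constraint entry; the actual metadata layout
# in pudl/metadata/fields.py and the values shown here are assumptions.
FIELD_METADATA_BY_GROUP = {
    "vcerare": {
        "state": {
            "constraints": {
                # Only values in this list pass validation; anything else errors.
                "enum": ["michigan", "ohio", "wisconsin", "new_york"],
            }
        }
    }
}

def find_unexpected(column_values, enum):
    """Return the set of values not covered by the enum constraint."""
    return set(column_values) - set(enum)

enum = FIELD_METADATA_BY_GROUP["vcerare"]["state"]["constraints"]["enum"]
unexpected = find_unexpected(["ohio", "hurron"], enum)
print(unexpected)  # {'hurron'}
```

This is the error-on-unexpected-values behavior described above: a misspelling like "hurron" would surface immediately instead of silently becoming a new category.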
@@ -1970,6 +2027,10 @@
    "description": "The energy contained in fuel burned, measured in million BTU.",
    "unit": "MMBtu",
},
"hour_of_year": {
    "type": "integer",
I changed this to integer.
def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame):
    """Check that the final output table is as expected."""
    # Make sure the table is the expected length
    if (length := len(asset_df)) != 136437000:
This will fail in the fast ETL that only uses a single year, won't it? We have a pattern for breaking out the per-year expectations elsewhere that @jdangerx came up with.
I think we were able to get the asset checks running in the integration tests at some point, but they've stopped running again and I don't think we understand why. But in any case I think it'd fail if someone ran the fast ETL in the Dagster UI currently.
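One way to sketch that per-year pattern (the expected count and column name here are assumptions, not the real PUDL values) is to check each report year independently, so a single-year fast ETL passes on its own:

```python
# Sketch: per-year row count check instead of one hard-coded full-ETL total,
# so the fast (single-year) ETL doesn't trip a check meant for all years.
import pandas as pd

def check_row_counts(df: pd.DataFrame, rows_per_year: int) -> list[str]:
    """Collect one error message per report_year with an unexpected row count."""
    errors = []
    for year, group in df.groupby("report_year"):
        if len(group) != rows_per_year:
            errors.append(f"{year}: expected {rows_per_year} rows, got {len(group)}")
    return errors

# Tiny demo frame standing in for the real hourly table:
demo = pd.DataFrame({"report_year": [2022] * 3 + [2023] * 2})
print(check_row_counts(demo, rows_per_year=3))  # ['2023: expected 3 rows, got 2']
```

In the real table, `rows_per_year` would presumably be 8760 hours times the number of county/lake rows, rather than the tiny demo value used here.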
We asked Chris which of the two reports ought to be used as a reference and he indicated the 2020 one was preferable. Is there a reason to include this one too?
My reasoning was that it has more detail about the methodology. When Chris said the other one was preferable I didn't necessarily take that to mean that this one wasn't accurate, rather that the other was more succinct. But we can remove it if you think it might not be what Chris intended!
    blocking=True,
    description="Check that output table is as expected.",
)
def check_hourly_available_cap_fac_table(asset_df: pd.DataFrame):  # noqa: C901
Not a blocker, but with a lot of different checks inside a single function like this, we probably want to accumulate some kind of error report and then, at the end of the function, return the AssetCheckResult with all of that information if there are any errors, rather than bailing as soon as we find any failure. Otherwise, if there are multiple failures, you'll have to run the whole asset check multiple times to find them all.
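A minimal sketch of the accumulate-then-report pattern (the column names and conditions below are invented for illustration; the real check would package the error list into Dagster's AssetCheckResult):

```python
# Sketch: accumulate every check failure instead of raising on the first one.
# Column names and conditions are illustrative assumptions, not the real checks.
import pandas as pd

def check_table(df: pd.DataFrame) -> tuple[bool, list[str]]:
    errors: list[str] = []
    if df["capacity_factor_solar_pv"].gt(1.0).any():
        errors.append("solar capacity factors above 1.0")
    if df["hour_of_year"].min() < 1 or df["hour_of_year"].max() > 8760:
        errors.append("hour_of_year outside 1-8760")
    if df["county_or_lake_name"].isna().any():
        errors.append("null county_or_lake_name values")
    # Report every failure at once rather than bailing early.
    return (not errors, errors)
```

A single run then surfaces all problems together, e.g. `check_table(df)` might return `(False, ["solar capacity factors above 1.0", "null county_or_lake_name values"])`.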
Some comments on the notebook, which is very close but doesn't currently work out of the box.
"\n",
"# Make the chart\n",
"plt.hist(plot_df[f\"capacity_factor_{gen_type}\"], bins=100, range=(0, 1))\n",
"plt.title(\n",
We should add x and y axis labels here.
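For example (a sketch with dummy data; the label text is a suggestion, not from the notebook):

```python
# Sketch: the histogram cell with the suggested axis labels added.
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

values = [0.1, 0.4, 0.4, 0.9]  # stand-in for plot_df[f"capacity_factor_{gen_type}"]
plt.hist(values, bins=100, range=(0, 1))
plt.title("Capacity factor distribution")  # placeholder title
plt.xlabel("Capacity factor")
plt.ylabel("Number of hours")
```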
"source": [
"#### Memory check - will it Excel? \n",
"\n",
"If you want to know whether your table is capable of being processed as a csv you can use this memory estimator. If the estimated memory exceeds 500 MB it's too big! Cutting columns or making the filter scope smaller will help reduce the file size."
If we're saying "cutting columns", we should show people how to drop columns probably.
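A possible sketch for the notebook (plot_df and its column names are stand-ins here, not the real table schema):

```python
# Sketch: dropping columns you don't need before exporting, to shrink the CSV.
import pandas as pd

# Stand-in for the notebook's plot_df; column names are assumptions.
plot_df = pd.DataFrame({
    "county_or_lake_name": ["erie"],
    "state": ["ohio"],
    "capacity_factor_solar_pv": [0.5],
    "capacity_factor_onshore_wind": [0.3],
})

# Keep only what you need for the CSV:
slim_df = plot_df.drop(columns=["capacity_factor_onshore_wind"])
print(list(slim_df.columns))
```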
"metadata": {},
"source": [
"#### Download!\n",
"Specify your desired download location by filling in the `download_path` and running this cell will output the data to that location under the name `rare_power_data.csv`."
We'll need to give people more guidance here, since path names often trip people up. Also, having the slash on the file name will cause problems if people just leave this blank.
I'd suggest:
# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = ""
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
plot_df.to_csv(download_path+"rare_power_data.csv")
I haven't found a simple way around it, so I have created a new superchonk asset tag. I've also added
I've compiled some loose ends to clean up after the data release in #3914
Need to get this in the builds tonight.
Overview
Closes #3874
What problem does this address?
Transform the vceregen tables!

What did you change?
Add and populate a transform module for the vceregen data source. The output is a single table that combines the three capacity factor tables with the FIPS table. I will highlight any other lingering questions in the comments.

TO-DO: Other transforms
TO-DO: Once we agree on the transforms
Documentation
Make sure to update relevant aspects of the documentation.
TO-DO: Documentation
Testing
How did you make sure this worked? How can a reviewer verify this?
TO-DO: Testing