Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate FERC 2021-2022 data with new extractor #2811

Closed
12 of 15 tasks
Tracked by #2699
jdangerx opened this issue Aug 28, 2023 · 17 comments · Fixed by #2947
Closed
12 of 15 tasks
Tracked by #2699

Integrate FERC 2021-2022 data with new extractor #2811

jdangerx opened this issue Aug 28, 2023 · 17 comments · Fixed by #2947
Assignees
Labels
ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. xbrl Related to the FERC XBRL transition

Comments

@jdangerx
Copy link
Member

jdangerx commented Aug 28, 2023

Follow-up to #2810 :

Now that we've gotten the new, more comprehensive extractor working for 2021 data, we should try to get it working for both 2021 and 2022 data.

Note: we're starting work on this before #2810 is complete - there are still some test updates necessary there but we're working on those in parallel.

Scope

  1. jdangerx

Individual error classes below.

If you fix these errors but the assets still don't build, that's fine - we'll merge the fixes into this branch and catalog the new errors.

Error messages

  1. jdangerx
  2. 0 of 8
    ferc1 new-data xbrl
    jdangerx
  3. 0 of 5
    ferc1 new-data xbrl
  4. 0 of 2
    ferc1 new-data xbrl
    e-belfer
  5. 0 of 2
    ferc1 new-data xbrl
  6. 1 of 1
    ferc1 new-data xbrl
    e-belfer
  7. jdangerx
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2158, in merge_instant_and_duration_tables_xbrl
    out_df = pd.concat(
             ^^^^^^^^^^
  • plants_small_ferc1
  • plants_steam_ferc1
  • plants_hydro_ferc1
  • plants_pumped_storage_ferc1
AssertionError: Looks like there are multiple entries per year--not sure which to use for the start/end balance. params=UnstackBalancesToReportYearInstantXbrl(unstack_balances_to_report_year=True) primary_key_cols=['report_year', 'entity_id', 'electric_plant_classification_axis', 'utility_type_axis']
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    raise AssertionError(
  • balance_sheet_assets_ferc1
  • balance_sheet_liabilities_ferc1
  • cash_flow_ferc1
  • electric_plant_depreciation_changes_ferc1
  • electric_plant_depreciation_functional_ferc1
  • other_regulatory_liabilities_ferc1
  • retained_earnings_ferc1
ValueError: electric_energy_dispositions_ferc1 Duplicate primary keys when enforcing schema.
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/metadata/classes.py", line 1506, in enforce_schema
    raise ValueError(
  • electric_energy_dispositions_ferc1
@jdangerx jdangerx changed the title Test new extractor against 2021 + 2022 data and explode_ferc1 branch Test new extractor with 2021-2022 data Aug 28, 2023
@jdangerx
Copy link
Member Author

jdangerx commented Sep 18, 2023

The good news is that xbrl2sqlite works with the 2022 data; the other news is that there are a handful of different errors to crank through:


File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    df.loc[df.report_year == (df.year + 1), "balance_type"] = "starting_balance"
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set a frame with no defined index and a scalar
  • other_regulatory_liabilities_ferc1
  • cash_flow_ferc1
  • electric_plant_depreciation_changes_ferc1
  • balance_sheet_assets_ferc1
  • electric_plant_depreciation_functional_ferc1
  • retained_earnings_ferc1
  • balance_sheet_liabilities_ferc1
  • plant_in_service_ferc1

AssertionError: Unexpected number of columns dropped in wide_to_tidy:

  • electric_energy_dispositions_ferc1 (27 instead of 19)
  • electric_operating_expenses_ferc1 (171 instead of 0)
  • electric_energy_sources_ferc1 (27 instead of 10)
  • income_statement_ferc1 (70 instead of 2)
  • electric_operating_revenues_ferc1 (48 instead of 2)
File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 1236, in transform_end
    raise ValueError(

ValueError: plants_hydro_ferc1: Column capex_per_mw is entirely NULL!

(or, you know, other null columns)

  • plants_hydro_ferc1
  • electricity_sales_by_rate_schedule_ferc1
File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 287, in wide_to_tidy
    df_out = df.set_index(params.idx_cols).filter(regex=pat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

KeyError: "None of ['plant_function'] are in the columns"

(or other columns not in index)

  • depreciation_amortization_summary_ferc1
  • utility_plant_summary_ferc1
File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2899, in aggregate_duplicate_fuel_types_xbrl
    agg_row_fraction = (len(fuel_pk_dupes) + len(fuel_multi_unit)) / len(fuel_xbrl)
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~

ZeroDivisionError: division by zero
  • fuel_ferc1

@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. xbrl Related to the FERC XBRL transition labels Sep 20, 2023
@jdangerx
Copy link
Member Author

It looks like we have some remapping to do, to get valid 2022 data into the pipeline.

For starters, we have the following 4 entity_ids in the XBRL data that don't appear in our mapping yet:

    entity_id                    respondent_legal_name
176   C004679                 Transource Oklahoma, LLC
217   C011644          Pioneer Power and Light Company
218   C011745                     PJM Settlement, Inc.
219   C011785  Mountrail-Williams Electric Cooperative

I ran some simple difflib.get_close_matches() between these and the respondent_names in the DBF data, to get this list of potential matches. I used the identification_001_duration and f1_respondent_id tables in the XBRL / DBF data respectively.

  entity_id                                xbrl_name                                dbf_name  respondent_id
0   C004679                 Transource Oklahoma, LLC                Transource Maryland, LLC            521
1   C004679                 Transource Oklahoma, LLC            Transource Pennsylvania, LLC            520
2   C004679                 Transource Oklahoma, LLC                Transource Missouri, LLC            453
3   C011644          Pioneer Power and Light Company         Pioneer Power and Light Company            140
4   C011644          Pioneer Power and Light Company       Wisconsin Power and Light Company            194
5   C011644          Pioneer Power and Light Company      Interstate Power and Light Company            281
6   C011745                     PJM Settlement, Inc.                    PJM Settlement, Inc.            447
7   C011785  Mountrail-Williams Electric Cooperative         Lyon Rural Electric Cooperative            412
8   C011785  Mountrail-Williams Electric Cooperative  Southern Maryland Electric Cooperative            296
9   C011785  Mountrail-Williams Electric Cooperative             Oconto Electric Cooperative            205

Case-by-case:

  • C004679: looks like get_close_matches picked up on Transource being the most relevant thing here, and it looks like there is indeed no DBF respondent that matches with Transource Oklahoma - we should add this to the utility_id_ferc1.csv, with no corresponding DBF ID.
  • C011644: There's an exact match, with DBF respondent ID 140! Hooray! However, DBF respondent ID 140 is already mapped to another ID:
         utility_id_ferc1 utility_id_ferc1_xbrl  utility_id_ferc1_dbf
    403               414               R001140                   140
    3                <NA>               C011644                   140
    
    Weirdly, R001140 doesn't appear to show up anywhere in the XBRL identification_001_duration table. So we could probably just edit the utility_id_ferc1.csv to map DBF 140 to XBRL C011644 and PUDL 414.
  • C011745 : Another exact match, this time with DBF respondent ID 447! Also, this is already mapped to another ID too:
         utility_id_ferc1 utility_id_ferc1_xbrl  utility_id_ferc1_dbf
    152               153               C000030                   447
    6                <NA>               C011745                   447
    
    And unfortunately, both of these entity IDs show up in our XBRL table attached to PJM!
       utility_id_ferc1 utility_id_ferc1_xbrl  utility_id_ferc1_dbf                        xbrl_name                         dbf_name
    0               153               C000030                   447         PJM Interconnection, LLC             PJM Settlement, Inc.
    1              <NA>               C011745                   447             PJM Settlement, Inc.             PJM Settlement, Inc.
    
    But also, both PJM Interconnection and PJM Settlement have their own DBF ID, so I'm not sure why we mapped C000030 to 447 in the first place...
         respondent_id           respondent_name
    245            255  PJM Interconnection, LLC
    354            447      PJM Settlement, Inc.
    
  • C011785: the "closest matches" seem to be all different electric co-ops that seem unrelated to Mountrail-Williams so this looks like another straightforward addition to the map. I also looked at all DBF respondent_names that started with M and didn't find anything, so I feel confident that this utility just never reported in DBF format.

My questions to those who are more experienced in the mapping arena (@aesharpe ? @zaneselvans ? @katie-lamb ? ) are:

  1. Is it cool to remap DBF 140 to XBRL C011644 instead of XBRL R001140 for Pioneer?
  2. Is it cool to map C000030 to 255 and C011745 to 447 so PJM Interconnection and PJM Settlement are both mapped to themselves across the XBRL/DBF barrier?
  3. Is there anywhere else I should be looking for DBF IDs for Transource Oklahoma and Mountrail-Williams?

@jdangerx
Copy link
Member Author

jdangerx commented Sep 21, 2023

Well the good news is, I put in my draft remapping of XBRL -> DBF utility IDs and now more stuff builds...

diff --git a/src/pudl/package_data/glue/utility_id_ferc1.csv b/src/pudl/package_data/glue/utility_id_ferc1.csv
index cde4d3f71..a0623757c 100644
--- a/src/pudl/package_data/glue/utility_id_ferc1.csv
+++ b/src/pudl/package_data/glue/utility_id_ferc1.csv
@@ -151,7 +151,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 150,,1001
 151,,1003
 152,C000029,231
-153,C000030,447
+153,C000030,255
 154,C000038,250
 155,C000041,161
 156,C000045,294
@@ -402,7 +402,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 408,R001052,52
 409,R001072,72
 411,R001087,87
-414,R001140,140
+414,C011644,140
 419,R001268,268
 420,R001298,298
 421,R001301,301
@@ -412,10 +412,12 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 428,R001419,419
 429,R001422,422
 436,R001445,445
-437,,255
 439,R001515,515
 440,R001520,520
 441,R001521,521
 442,,278
 443,C010845,
 444,C011304,
+445,C011745,447
+446,C004679,
+447,C011785,
\ No newline at end of file

Now I only get two failures in the core asset group:

fuel_ferc1:

ValueError: fuel_ferc1 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool

And retained_earnings_ferc1:

ValueError: not enough values to unpack (expected 1, got 0)
...
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 4680, in add_previous_year_factoid
    [missing_year] = [
    ^^^^^^^^^^^^^^

@e-belfer
Copy link
Member

e-belfer commented Sep 21, 2023

@jdangerx see my PR re: the fuel_type_code null value failure, it's a simple fix: #2881

This was referenced Sep 25, 2023
@e-belfer e-belfer changed the title Test new extractor with 2021-2022 data Integrate FERC 2021-2022 data with new extractor Sep 25, 2023
@jdangerx
Copy link
Member Author

jdangerx commented Oct 17, 2023

OK, some notes on the validation tests:

current status

These are the two main things that I'm seeing:

  • some rows from 2021 are being dropped - this is likely due to an issue where the data for 2021 isn't associated with the 2021 report_year due to a missing ReportYear fact in the XBRL report. I'll use some trickery to turn the report_year into the year of the

  • in plants_steam, we're seeing ~750 plants being changed for every year, including DBF-only years. what's going on there?

next steps

  • use some trickery to fix the report year, so this doesn't happen anymore:
        entity_id                                        filing_name  report_year  start_date    end_date  coolants_and_water
    300   C000041  Southern_California_Edison_Company_form1_Q4_16...         2021  2020-01-01  2020-12-31           7649530.0
    955   C000041  Southern_California_Edison_Company_form1_Q4_16...         2022  2022-01-01  2022-12-31           9136620.0
    954   C000041  Southern_California_Edison_Company_form1_Q4_16...         2022  2021-01-01  2021-12-31           7890927.0
    
  • figure out what the deal is with all the changes in plants_steam

fbp_ferc1

Found 26112 rows, expected 25406. Off by 2.779%, allowed margin of 0.000%

Table name: denorm_fuel_by_plant_ferc1

  • Total 706 new rows:
    • 689 new rows in 2022, 17 new rows from 2021 - looks like duke energy carolinas + entergy new orleans had some updates
    • 2 rows updated: 1 from duke energy carolinas + 1 from MS power company
  • most have fuel_cost and fuel_mmbtu; primary fraction, gas fraction
  • some have oil/coal fractions
  • few have nuclear and waste fractions;
  • the gas_fraction_mmbtu distributions look similar between new data & all old data

fuel_ferc1

Found 49932 rows, expected 48815. Off by 2.288%, allowed margin of 0.000%

Table name: denorm_fuel_ferc1

  • 1117 new rows, nothing dropped from old tables, no updates.
    • report year all 2022
    • looks like fuel cost per mmbtu has similar shapes to all years but generally things are more expensive, that seems reasonable

plant_in_service_ferc1

Found 335365 rows, expected 315112. Off by 6.427%, allowed margin of 0.000%

Table name: denorm_plant_in_service_ferc1

  • 20,937 new rows
    • almost all from 2022
  • 8,649 rows that changed
    • all from 2021
    • these are rows where some 2021 values became null when reported with the 2022 data.
    • there is at least once instance where Vermont Electric Power Company reported a value for 2021 in the 2021 filing, that then shows up in VETCO's 2021 value in the 2022 filing - these seem like corrections.
  • 684 rows dropped
    • mostly (~80/y) from 1998-2003, but a small amount each other year.
    • this is due to some utility ID remapping to differentiate PJM Settlement, Inc. from PJM Interconnection, LLC.

plants_all_ferc1

Found 56312 rows, expected 54415. Off by 3.486%, allowed margin of 0.000%

Table name: denorm_plants_all_ferc1

  • 2028 new rows from 2022
  • 131 rows dropped
  • 21456 rows updated
    • ... about 750 from every year, and 900 from 2021.
    • Seems like most of them are getting values that were NA before... but from where?

plants_hydro_ferc1

Found 6979 rows, expected 6798. Off by 2.663%, allowed margin of 0.000%

Table name: denorm_plants_hydro_ferc1

  • 183 new rows, all from 2022, with lots of non-null values
  • 2 dropped rows from 2021 - but they look like weird fake plants:
    • Massachusetts Electric Company/"n/a" where all values are NaN or None except pudl plant ID
    • ALABAMA POWER COMPANY/"rain gauge" where all values are NaN or None except pudl plant ID, capex_per_mw = 0.0, opex_per_mwh = 0.0, project_num = 0.0

plants_pumped_storage_ferc1

Found 562 rows, expected 544. Off by 3.309%, allowed margin of 0.000%

Table name: denorm_plants_pumped_storage_ferc1

  • 18 new entries which are mostly non-NA, from 2022
  • no dropped/updated entries

plants_small_ferc1

Found 16988 rows, expected 16248. Off by 4.554%, allowed margin of 0.000%

Table name: denorm_plants_small_ferc1

  • 754 new entries - 752 from 2022, 2 from 2021
    • non-null counts seem reasonable, not a lot of license_id_ferc1 values, but that seems par for the course.
  • 14 dropped entries
    • seem like values that were erased in the 2021-2022 version of the 2021 data
  • 141 changed entries
    • there is one value here that seem like they were updated in the DBF data?
    • the rest look like updates from 2021 data that is being reinterpreted via the new dedupe structure

plants_steam_ferc1

Found 31783 rows, expected 30825. Off by 3.108%, allowed margin of 0.000%

Table name: denorm_plants_steam_ferc1

  • 1073 new rows - 1070 from 2022, 3 from 2021
  • 115 dropped rows, all from 2021
    • looks like maybe the ReportYear fact isn't reported for all 2021 data in the 2022 filings? e.g. for Southern California Edison, cr-00603 is a context that refers to 2021 data, but it doesn't have a ReportYear fact - only cr-00604 has a ReportYear fact (which is 2022, of course).
  • 21315 changed rows
    • follows the same pattern as plants_all_ferc1 - ~750/year and about 900 for 2021. I guess this is the source of that discrepancy.

purchased_power_ferc1

Found 204720 rows, expected 197947. Off by 3.422%, allowed margin of 0.000%

Table name: denorm_purchased_power_ferc1

  • 7122 new rows, of which 7052 are from 2022 and 70 are from 2021
  • 340 dropped rows, all from 2021
  • 0 updated rows

@jdangerx
Copy link
Member Author

jdangerx commented Oct 17, 2023

There are 26,113 individual facts which changed in the plants_steam table - of which 21,065 are changes where the plant_id_ferc1 changed number between the dev build and the build on this branch.

Of the remaining changes, we see a lot of "DBF data for a specific plant went from non-NA to NA," and a lot of "XBRL data for 2021 went from NA to not-NA." So it seems like we got more valid data for 2021, which is suspicious due to the aforementioned report_year bug. And we lost a bunch of data from before 2021! Which is bad!

I have a few questions:

  1. Why did the plant_id_ferc1s change so much?
  2. Why did we lose so much data from the DBF years? Is it related to the plant ID changes? If not, what else did I change that could have messed up this process?

I think the next steps are still:

  1. fix the report_year bug, so we can better understand how the 2021 data has changed without that confounding factor
  2. continue investigating the issues with the pre-2021 years - how does plant_id_ferc1 get assigned? why would it change? would that impact downstream data somehow?

@zaneselvans
Copy link
Member

zaneselvans commented Oct 17, 2023

plant_id_ferc1 is algorithmically assigned based on the cosine similarity between records, and the numbers themselves aren't expected to be stable if the data changes, so this by itself isn't surprising or concerning. The addition of a new year of data would be expected to change the clustering of records that end up being associated with each other and given the same ID.

If there are changes to the data, that could also be affecting the IDs that get assigned, if any of the columns which are used to calculate the cosine similarity are changing. This is the matching problem that @zschira is about to swap in a much faster algorithm for (though it should have the same results I think, as it's just a faster way to calculate only the top-K cosine similarities I think, rather than the entire gigantic matrix)

Is there any pattern to the DBF data that's getting nullified?

@zaneselvans
Copy link
Member

The code that does the FERC 1 plant ID assignment is in pudl.analysis.classify_plants_ferc1

@jdangerx
Copy link
Member Author

Woohoo! Turns out that if you ignore the plant_id_ferc1, the big data discrepancies in the plants_steam/all tables just go away!

There's still some shuffling around in the 2021 data but I think that is related to #2947 - I'll fix up the PR and re-run ETL to see if that fixes things.

@zaneselvans zaneselvans linked a pull request Oct 19, 2023 that will close this issue
@jdangerx
Copy link
Member Author

After applying the fix from #2947 , all the ferc1 tables have only the following sorts of changes:

  • new rows from 2022
  • updated rows from 2021 - due to report year fix
  • updated rows from < 2021 - due to utility remapping or FERC plant ID reassignment
    • in the case of denorm_plants_steam_ferc1, there some changes in the annual capex additions because we calculated this by grouping on FERC plant ID - order of a couple dozen per year.

So I updated the minmax row counts in validate/ferc1_test.py.

There is still a validation test failure in validate/plants_steam_ferc1_test.py:

FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.0) is below lower bound (0.5) in validation entitled Capability R...    

Which indicates that too many plants have a 0 capability ratio.

I think this is because the XBRL plants inherently have a much higher % of plants with a 0 capability ratio - see the table below. These 0 values are traceable directly to the raw XBRL files we're getting, so it's not a quirk of our processing.

years old new
2021 image image
2022 n/a image
all image image
xbrl-only image image
dbf-only image image

My pitch is to update the validation test so that it filters out the 0-values from 2021 and beyond before asserting stuff about the distribution. WDYT @zaneselvans ?

@zaneselvans
Copy link
Member

@jdangerx Conceptually that sounds right, but I don't know if that's an easy thing to do within the structure of the validation tests.

@jdangerx
Copy link
Member Author

jdangerx commented Oct 20, 2023

I just updated the "query" - it was easy!

diff --git a/src/pudl/validate.py b/src/pudl/validate.py
index 3fe248f68..7e3d7f79d 100644
--- a/src/pudl/validate.py
+++ b/src/pudl/validate.py
@@ -832,9 +832,9 @@ plants_steam_ferc1_capacity_ratios = [
         "data_col": "capability_ratio",
         "weight_col": "",
     },
-    {
+    {  # XBRL data (post-2021) reports 0 capability for ~22% of plants, so we exclude.
         "title": "Capability Ratio (tails)",
-        "query": "",
+        "query": "report_year < 2021 | plant_capability_mw > 0",
         "low_q": 0.05,
         "low_bound": 0.5,
         "hi_q": 0.95,

@zaneselvans
Copy link
Member

Ah okay great!

@jrea-rmi
Copy link
Collaborator

RMI would very much like to have the zero plant_capability_mw plant/generator/units included rather than dropped.

Reviewing the raw steam_electric_generating_plant_statistics_large_plants_402_duration, there are lots of zero cases that have valuable capital and operating cost data that we'd like to use.

So I suggest increasing the validation limit rather than dropping data.

@e-belfer
Copy link
Member

I believe the change is just to filter what's getting validated to ignore 0 records, rather than dropping actual data in the database @jrea-rmi. Here's the relevant commit.

@zaneselvans Let me know if I'm misrepresenting this, though!

@jrea-rmi
Copy link
Collaborator

oo that'd be an important difference. Thanks @e-belfer

@zaneselvans
Copy link
Member

Yes, that's my understanding as well @e-belfer.

@e-belfer e-belfer closed this as completed Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. xbrl Related to the FERC XBRL transition
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants