Extract more data from FERC XBRLs and handle that new data in ETL (#2821)

* Update to use new version of ferc-xbrl-extractor

* Fix issues arising from stricter typing used in pandas 2.1

* Use integer transmission circuits.

* Remove obsolete references to ferc1_schema tests.

* Make new extractor compatible with 2021 data

The new extractor added some data to the 2021 XBRL archives, which caused some integration and validation test failures. I added some plants to the pudl_id mapping spreadsheet, all of which are considered totals. That is, they aren't real plants, but we map them for the sake of giving them an ID (they are not connected to EIA records), because this is how we treat other total records reported to FERC1.

This also updates the way values are assigned to a slice of the ferc1_eia_train output spreadsheets. NA values were causing an issue, so I changed how the values are converted (a sketch of the pattern follows at the end of these notes).

This also updates the test_minmax_rows test to reflect the new rows in the 2021 data.


* Add a few plants to pudl_id_mapping

Totally new:

* 18012: pjm interconnection, llc / total
* 18013: new york state electric & gas corporation / see footnote
* 18014: southwest power pool, inc. / total
* 18015: public service company of colorado / community solar gardens
* 18016: the empire district electric company / n/a
* 18017: wisconsin electric power company / see footnote
* 18018: upper michigan energy resources company (pudl determined) / total
* 18019: new york transco, llc / total
* 18020: wilderness line holdings, llc / total
* 18021: mt. carmel public utility co / total

Mapped to existing PUDL ID:

* 8671: pacific gas & electric company, small hydroelectric generating plants
* 15000: idaho power company / hydro
* 15001: idaho power company / internal combustion
* 15068: public service company of colorado / conventional hydro
* 12926: midamerican energy company / ida grove ii wind farm (8 units at 2.3 mw each & 73 units at 2.52 mw each)
* 1287: alaska electric light and power company / salmon creek hyrdo

Note the misspelling of the plant name in 1287.

Changed:

* 15031: mt. carmel public utility co / not applicable -> ameren
  illinois company / not applicable

  This one had a mismatched utility_id_ferc: 222 corresponds to Ameren,
  not Mt. Carmel (397).

* Update validation test expectations.

There is some missing data due to messy deduplication (#2822), but we'll
improve the deduplication in #2899.
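
For reference, here is a minimal sketch of the NA-safe assignment pattern described
above. This is illustrative only: the toy DataFrame and values are hypothetical,
though the column names mirror the real ones in the ferc1_eia_train diff below.

```python
import pandas as pd

# Hypothetical miniature of the ferc1_eia_train output: compare two fuel
# type columns, but only where both sides are non-null.
df = pd.DataFrame(
    {
        "fuel_type_code_pudl_eia": ["coal", "gas", None],
        "fuel_type_code_pudl_ferc1": ["coal", None, "wind"],
    }
)

# Default to False, compute the comparison on fully non-null rows only,
# then write the results back, aligned on index.
df["fuel_type_code_pudl_diff"] = False
nona = df[
    df.fuel_type_code_pudl_eia.notna() & df.fuel_type_code_pudl_ferc1.notna()
].copy()
nona["fuel_type_code_pudl_diff"] = (
    nona.fuel_type_code_pudl_eia == nona.fuel_type_code_pudl_ferc1
)
df.update(nona)

# A nullable Int64 column stays integer-typed even when some values are NA,
# which plain int64 cannot represent.
df["installation_year_ferc1"] = pd.Series([1978, None, 2005]).astype("Int64")
```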

---------

Co-authored-by: zschira <[email protected]>
Co-authored-by: Zane Selvans <[email protected]>
Co-authored-by: Austen Sharpe <[email protected]>
4 people authored Oct 6, 2023
1 parent 8315219 commit e36cec5
Showing 16 changed files with 212 additions and 579 deletions.
69 changes: 27 additions & 42 deletions docs/dev/testing.rst
@@ -155,18 +155,19 @@ their own:
doc8 -> Check the documentation input files for syntactical correctness.
docs -> Remove old docs output and rebuild HTML from scratch with Sphinx
unit -> Run all the software unit tests.
-ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
minmax_rows -> Check that all outputs have the expected number of rows.
validate -> Run all data validation tests. This requires a complete PUDL DB.
-ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
jupyter -> Ensure that designated Jupyter notebooks can be executed.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
nuke -> Nuke & recreate SQLite & Parquet outputs, then run all tests and
data validations against the new outputs.
get_unmapped_ids -> Make the raw FERC1 DB and generate a PUDL database with only EIA in
order to generate any unmapped IDs.
-Note that not all of them literally run tests. For instance, to lint and
-build the documentation you can run:
+Note that not all of them literally run tests. For instance, to lint and build the
+documentation you can run:

.. code-block:: console
@@ -321,41 +322,25 @@ with the construction of that database. For example, the output routines:
We also use this option to run the data validations.

-Assuming you do want to run the ETL and build new databases as part of the test
-you're running, the contents of that database are determined by an ETL settings
-file. By default, the settings file that's used is
-``test/settings/integration-test.yml`` But it's also possible to use a
-different input file, generating a different database, and then run some
-tests against that database.
-
-For example, we test that FERC 1 data can be loaded into a PUDL database all
-by itself by running the ETL tests with a settings file that includes only A
-couple of FERC 1 tables for a single year. This is the ``ferc1_solo`` Tox
-test environment:
-
-.. code-block:: console
-
-   $ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py
-
-Similarly, we use the ``test/settings/full-integration-test.yml`` settings file
-to specify an exhaustive collection of input data, and then we run a test that
-checks that the database schemas extracted from all historical FERC 1 databases
-are compatible with each other. This is the ``ferc1_schema`` test:
-
-.. code-block:: console
-
-   $ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema
-
-The raw input data that all the tests use is ultimately coming from our
-`archives on Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__.
-However, you can optionally tell the tests to look in a different places for more
-rapidly accessible caches of that data and to force the download of a fresh
-copy (especially useful when you are testing the datastore functionality
-specifically). By default, the tests will use the datastore that's part of your
-local PUDL workspace.
-
-For example, to run the ETL portion of the integration tests and download
-fresh input data to a temporary datastore that's later deleted automatically:
+Assuming you do want to run the ETL and build new databases as part of the test you're
+running, the contents of that database are determined by an ETL settings file. By
+default, the settings file that's used is
+``src/pudl/package_data/settings/etl_fast.yml``. But it's also possible to use a
+different input file, generating a different database, and then run some tests against
+that database.
+
+We use the ``src/pudl/package_data/settings/etl_full.yml`` settings file to specify an
+exhaustive collection of input data.
+
+The raw input data that all the tests use ultimately comes from our `archives on
+Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__. However, you can
+optionally tell the tests to look in different places for more rapidly accessible
+caches of that data, or to force the download of a fresh copy (especially useful when
+you are testing the datastore functionality specifically). By default, the tests will
+use the datastore that's part of your local PUDL workspace.
+
+For example, to run the ETL portion of the integration tests and download fresh input
+data to a temporary datastore that's later deleted automatically:

.. code-block:: console
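To see what a given settings file will select before running anything, a quick sketch
(assuming PyYAML is installed and you are at the repository root; the structure inside
the file is whatever the YAML defines, so inspect your local copy):

```python
from pathlib import Path

import yaml

# Load the fast ETL settings that the tests use by default.
settings = yaml.safe_load(
    Path("src/pudl/package_data/settings/etl_fast.yml").read_text()
)
# Show the top-level sections, e.g. which datasets/years are included.
print(sorted(settings))
```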
184 changes: 0 additions & 184 deletions docs/pudl/pudl-etl.dot

This file was deleted.

8 changes: 4 additions & 4 deletions pyproject.toml
@@ -16,17 +16,17 @@ dependencies = [
"anyascii>=0.3.2,<0.4", # recordlinkage dependency
"boto3>=1.28.55",
"bottleneck>=1.3.4", # pandas[performance]
"catalystcoop.dbfread>=3,<3.1",
"catalystcoop.ferc-xbrl-extractor==0.8.3",
"catalystcoop.dbfread>=3.0,<3.1",
"catalystcoop.ferc-xbrl-extractor>=1.1.1,<1.2",
"coloredlogs>=14.0,<15.1", # Dagster requires 14.0
"dagster-webserver>=1.4,<1.5",
"dagster>=1.4,<1.5",
"dask>=2022.5,<2023.9.4",
"datapackage>=1.11,<1.16", # Transition datastore to use frictionless.
"email-validator>=1.0.3", # pydantic[email]
"fsspec>=2022.5,<2023.9.3",
"geopandas>=0.13,<0.15",
"gcsfs>=2022.5,<2023.9.3",
"geopandas>=0.13,<0.15",
"grpcio==1.57.0", # Required by dagster. Version works with MacOS
"grpcio-health-checking==1.57.0", # Required by dagster. Version works with MacOS
"grpcio-status==1.57.0", # Required by dagster. Version works with MacOS
@@ -38,7 +38,7 @@ dependencies = [
"numexpr>=2.8.0", # pandas[performance]
"numpy>=1.24,<2.0a0",
"openpyxl>=3.0.10", # pandas[excel]
"pandas>=2,<2.1",
"pandas[parquet,excel,fss,gcp,compression]>=2,<2.2",
"pyarrow>=12,<13", # pandas[parquet]
"pydantic>=1.7,<2",
"python-dotenv>=1,<1.1",
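As a sanity check on pins like the ones above, the `packaging` library (not part of
this diff) can test whether a version satisfies a specifier:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

extractor_pin = SpecifierSet(">=1.1.1,<1.2")
print(Version("1.1.1") in extractor_pin)  # True
print(Version("1.2.0") in extractor_pin)  # False: a minor bump needs a new pin
```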
19 changes: 9 additions & 10 deletions src/pudl/analysis/ferc1_eia_train.py
@@ -163,7 +163,6 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:
logger.debug("Prepping FERC-EIA table")
# Only want to keep the plant_name_ppe field which replaces plant_name_eia
ferc1_eia_prep = ferc1_eia.copy().drop(columns="plant_name_eia")
-
# Add utility_name_eia - this must happen before renaming the cols or else there
# will be duplicate utility_name_eia columns.
utils_eia860.loc[:, "report_year"] = utils_eia860.report_date.dt.year
@@ -183,23 +182,24 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:
ferc1_eia_prep = ferc1_eia_prep.rename(columns=RENAME_COLS_FERC1_EIA)[
list(RENAME_COLS_FERC1_EIA.values())
]

# Add in pct diff values
for pct_diff_col in [x for x in RENAME_COLS_FERC1_EIA.values() if "_pct_diff" in x]:
ferc1_eia_prep = _pct_diff(ferc1_eia_prep, pct_diff_col)

# Add in fuel_type_code_pudl diff (qualitative bool)
-ferc1_eia_prep.loc[
+ferc1_eia_prep["fuel_type_code_pudl_diff"] = False
+ferc1_eia_prep_nona = ferc1_eia_prep[
ferc1_eia_prep.fuel_type_code_pudl_eia.notna()
-& ferc1_eia_prep.fuel_type_code_pudl_ferc1.notna(),
-"fuel_type_code_pudl_diff",
-] = ferc1_eia_prep.fuel_type_code_pudl_eia == (
-ferc1_eia_prep.fuel_type_code_pudl_ferc1
+& ferc1_eia_prep.fuel_type_code_pudl_ferc1.notna()
+].copy()
+ferc1_eia_prep_nona["fuel_type_code_pudl_diff"] = (
+ferc1_eia_prep_nona.fuel_type_code_pudl_eia
+== ferc1_eia_prep_nona.fuel_type_code_pudl_ferc1
)
+ferc1_eia_prep.update(ferc1_eia_prep_nona)

# Add in installation_year diff (diff vs. pct_diff)
ferc1_eia_prep.loc[
:, "installation_year_ferc1"
ferc1_eia_prep.installation_year_ferc1.notna(), "installation_year_ferc1"
] = ferc1_eia_prep.installation_year_ferc1.astype("Int64")

ferc1_eia_prep.loc[
@@ -212,7 +212,6 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:

# Add best match col
ferc1_eia_prep = _is_best_match(ferc1_eia_prep)

return ferc1_eia_prep
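
A note on the `DataFrame.update` call introduced above: it aligns on the index and
only overwrites with non-NA values from the other frame, which is what makes the
masked-subset-then-update pattern NA-safe. A tiny illustration with toy data:

```python
import pandas as pd

left = pd.DataFrame({"val": [0.0, 0.0, 0.0]}, index=[0, 1, 2])
patch = pd.DataFrame({"val": [9.0, None]}, index=[1, 2])

# Row 1 gets 9.0; row 2's NA in `patch` is skipped; row 0 is not in `patch`.
left.update(patch)
print(left["val"].tolist())  # [0.0, 9.0, 0.0]
```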


