Extract more data from FERC XBRLs and handle that new data in ETL (#2821)

* Update to use new version of ferc-xbrl-extractor

* Fix issues arising from stricter typing used in pandas 2.1

* Use integer transmission circuits.

* Remove obsolete references to ferc1_schema tests.

* Make new extractor compatible with 2021 data

The new extractor added some data to the 2021 XBRL archives, which caused some integration and validation test failures. I added some plants to the pudl_id mapping spreadsheet, all of which are considered totals. That is, they aren't real plants, but we map them for the sake of giving them an ID (they are not connected to EIA records), because this is how we treat other total records reported to FERC1.

This also updates the way values are assigned to a slice of the ferc1_eia_train output spreadsheets. NA values were causing an issue, so I changed how the values are converted (a sketch of the pattern follows at the end of these notes).

This also updates the test_minmax_rows test to reflect the new rows in the 2021 data.


* Add a few plants to pudl_id_mapping

Totally new:

* 18012: pjm interconnection, llc / total
* 18013: new york state electric & gas corporation / see footnote
* 18014: southwest power pool, inc. / total
* 18015: public service company of colorado / community solar gardens
* 18016: the empire district electric company / n/a
* 18017: wisconsin electric power company / see footnote
* 18018: upper michigan energy resources company (pudl determined) / total
* 18019: new york transco, llc / total
* 18020: wilderness line holdings, llc / total
* 18021: mt. carmel public utility co / total

Mapped to existing PUDL ID:

* 8671: pacific gas & electric company, small hydroelectric generating plants
* 15000: idaho power company / hydro
* 15001: idaho power company / internal combustion
* 15068: public service company of colorado / conventional hydro
* 12926: midamerican energy company / ida grove ii wind farm (8 units at 2.3 mw each & 73 units at 2.52 mw each)
* 1287: alaska electric light and power company / salmon creek hyrdo

Note the misspelling of the plant name in 1287.

Changed:

* 15031: mt. carmel public utility co / not applicable -> ameren
  illinois company / not applicable

  This one had a mismatched utility_id_ferc: 222 corresponds to Ameren,
  not Mt. Carmel (397).

* Update validation test expectations.

There is some missing data due to messy deduplication (#2822), but we'll
improve the deduplication in #2899.
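
For reference, here is a minimal sketch of the NA-safe assignment pattern described
above. This is illustrative only: the toy DataFrame and values are hypothetical,
though the column names mirror the real ones in the ferc1_eia_train diff below.

```python
import pandas as pd

# Hypothetical miniature of the ferc1_eia_train output: compare two fuel
# type columns, but only where both sides are non-null.
df = pd.DataFrame(
    {
        "fuel_type_code_pudl_eia": ["coal", "gas", None],
        "fuel_type_code_pudl_ferc1": ["coal", None, "wind"],
    }
)

# Default to False, compute the comparison on fully non-null rows only,
# then write the results back, aligned on index.
df["fuel_type_code_pudl_diff"] = False
nona = df[
    df.fuel_type_code_pudl_eia.notna() & df.fuel_type_code_pudl_ferc1.notna()
].copy()
nona["fuel_type_code_pudl_diff"] = (
    nona.fuel_type_code_pudl_eia == nona.fuel_type_code_pudl_ferc1
)
df.update(nona)

# A nullable Int64 column stays integer-typed even when some values are NA,
# which plain int64 cannot represent.
df["installation_year_ferc1"] = pd.Series([1978, None, 2005]).astype("Int64")
```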

---------

Co-authored-by: zschira <[email protected]>
Co-authored-by: Zane Selvans <[email protected]>
Co-authored-by: Austen Sharpe <[email protected]>
4 people authored Oct 6, 2023
1 parent 8315219 commit e36cec5
Showing 16 changed files with 212 additions and 579 deletions.
69 changes: 27 additions & 42 deletions docs/dev/testing.rst
@@ -155,18 +155,19 @@ their own:
doc8 -> Check the documentation input files for syntactical correctness.
docs -> Remove old docs output and rebuild HTML from scratch with Sphinx
unit -> Run all the software unit tests.
-ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
minmax_rows -> Check that all outputs have the expected number of rows.
validate -> Run all data validation tests. This requires a complete PUDL DB.
-ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
jupyter -> Ensure that designated Jupyter notebooks can be executed.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
nuke -> Nuke & recreate SQLite & Parquet outputs, then run all tests and
data validations against the new outputs.
get_unmapped_ids -> Make the raw FERC1 DB and generate a PUDL database with only EIA in
order to generate any unmapped IDs.
-Note that not all of them literally run tests. For instance, to lint and
-build the documentation you can run:
+Note that not all of them literally run tests. For instance, to lint and build the
+documentation you can run:

.. code-block:: console
@@ -321,41 +322,25 @@ with the construction of that database. For example, the output routines:
We also use this option to run the data validations.

-Assuming you do want to run the ETL and build new databases as part of the test
-you're running, the contents of that database are determined by an ETL settings
-file. By default, the settings file that's used is
-``test/settings/integration-test.yml`` But it's also possible to use a
-different input file, generating a different database, and then run some
-tests against that database.
-
-For example, we test that FERC 1 data can be loaded into a PUDL database all
-by itself by running the ETL tests with a settings file that includes only A
-couple of FERC 1 tables for a single year. This is the ``ferc1_solo`` Tox
-test environment:
-
-.. code-block:: console
-
-   $ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py
-
-Similarly, we use the ``test/settings/full-integration-test.yml`` settings file
-to specify an exhaustive collection of input data, and then we run a test that
-checks that the database schemas extracted from all historical FERC 1 databases
-are compatible with each other. This is the ``ferc1_schema`` test:
-
-.. code-block:: console
-
-   $ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema
-
-The raw input data that all the tests use is ultimately coming from our
-`archives on Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__.
-However, you can optionally tell the tests to look in a different places for more
-rapidly accessible caches of that data and to force the download of a fresh
-copy (especially useful when you are testing the datastore functionality
-specifically). By default, the tests will use the datastore that's part of your
-local PUDL workspace.
-
-For example, to run the ETL portion of the integration tests and download
-fresh input data to a temporary datastore that's later deleted automatically:
+Assuming you do want to run the ETL and build new databases as part of the test you're
+running, the contents of that database are determined by an ETL settings file. By
+default, the settings file that's used is
+``src/pudl/package_data/settings/etl_fast.yml``. But it's also possible to use a
+different input file, generating a different database, and then run some tests against
+that database.
+
+We use the ``src/pudl/package_data/settings/etl_full.yml`` settings file to specify an
+exhaustive collection of input data.
+
+The raw input data that all the tests use ultimately comes from our `archives on
+Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__. However, you can
+optionally tell the tests to look in different places for more rapidly accessible
+caches of that data, or to force the download of a fresh copy (especially useful when
+you are testing the datastore functionality specifically). By default, the tests will
+use the datastore that's part of your local PUDL workspace.
+
+For example, to run the ETL portion of the integration tests and download fresh input
+data to a temporary datastore that's later deleted automatically:

.. code-block:: console
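To see what a given settings file will select before running anything, a quick sketch
(assuming PyYAML is installed and you are at the repository root; the structure inside
the file is whatever the YAML defines, so inspect your local copy):

```python
from pathlib import Path

import yaml

# Load the fast ETL settings that the tests use by default.
settings = yaml.safe_load(
    Path("src/pudl/package_data/settings/etl_fast.yml").read_text()
)
# Show the top-level sections, e.g. which datasets/years are included.
print(sorted(settings))
```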
184 changes: 0 additions & 184 deletions docs/pudl/pudl-etl.dot

This file was deleted.

8 changes: 4 additions & 4 deletions pyproject.toml
@@ -16,17 +16,17 @@ dependencies = [
"anyascii>=0.3.2,<0.4", # recordlinkage dependency
"boto3>=1.28.55",
"bottleneck>=1.3.4", # pandas[performance]
"catalystcoop.dbfread>=3,<3.1",
"catalystcoop.ferc-xbrl-extractor==0.8.3",
"catalystcoop.dbfread>=3.0,<3.1",
"catalystcoop.ferc-xbrl-extractor>=1.1.1,<1.2",
"coloredlogs>=14.0,<15.1", # Dagster requires 14.0
"dagster-webserver>=1.4,<1.5",
"dagster>=1.4,<1.5",
"dask>=2022.5,<2023.9.4",
"datapackage>=1.11,<1.16", # Transition datastore to use frictionless.
"email-validator>=1.0.3", # pydantic[email]
"fsspec>=2022.5,<2023.9.3",
"geopandas>=0.13,<0.15",
"gcsfs>=2022.5,<2023.9.3",
"geopandas>=0.13,<0.15",
"grpcio==1.57.0", # Required by dagster. Version works with MacOS
"grpcio-health-checking==1.57.0", # Required by dagster. Version works with MacOS
"grpcio-status==1.57.0", # Required by dagster. Version works with MacOS
@@ -38,7 +38,7 @@ dependencies = [
"numexpr>=2.8.0", # pandas[performance]
"numpy>=1.24,<2.0a0",
"openpyxl>=3.0.10", # pandas[excel]
"pandas>=2,<2.1",
"pandas[parquet,excel,fss,gcp,compression]>=2,<2.2",
"pyarrow>=12,<13", # pandas[parquet]
"pydantic>=1.7,<2",
"python-dotenv>=1,<1.1",
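As a sanity check on pins like the ones above, the `packaging` library (not part of
this diff) can test whether a version satisfies a specifier:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

extractor_pin = SpecifierSet(">=1.1.1,<1.2")
print(Version("1.1.1") in extractor_pin)  # True
print(Version("1.2.0") in extractor_pin)  # False: a minor bump needs a new pin
```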
19 changes: 9 additions & 10 deletions src/pudl/analysis/ferc1_eia_train.py
@@ -163,7 +163,6 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:
logger.debug("Prepping FERC-EIA table")
# Only want to keep the plant_name_ppe field which replaces plant_name_eia
ferc1_eia_prep = ferc1_eia.copy().drop(columns="plant_name_eia")
-
# Add utility_name_eia - this must happen before renaming the cols or else there
# will be duplicate utility_name_eia columns.
utils_eia860.loc[:, "report_year"] = utils_eia860.report_date.dt.year
@@ -183,23 +182,24 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:
ferc1_eia_prep = ferc1_eia_prep.rename(columns=RENAME_COLS_FERC1_EIA)[
list(RENAME_COLS_FERC1_EIA.values())
]

# Add in pct diff values
for pct_diff_col in [x for x in RENAME_COLS_FERC1_EIA.values() if "_pct_diff" in x]:
ferc1_eia_prep = _pct_diff(ferc1_eia_prep, pct_diff_col)

# Add in fuel_type_code_pudl diff (qualitative bool)
-ferc1_eia_prep.loc[
+ferc1_eia_prep["fuel_type_code_pudl_diff"] = False
+ferc1_eia_prep_nona = ferc1_eia_prep[
ferc1_eia_prep.fuel_type_code_pudl_eia.notna()
-& ferc1_eia_prep.fuel_type_code_pudl_ferc1.notna(),
-"fuel_type_code_pudl_diff",
-] = ferc1_eia_prep.fuel_type_code_pudl_eia == (
-ferc1_eia_prep.fuel_type_code_pudl_ferc1
+& ferc1_eia_prep.fuel_type_code_pudl_ferc1.notna()
+].copy()
+ferc1_eia_prep_nona["fuel_type_code_pudl_diff"] = (
+ferc1_eia_prep_nona.fuel_type_code_pudl_eia
+== ferc1_eia_prep_nona.fuel_type_code_pudl_ferc1
)
+ferc1_eia_prep.update(ferc1_eia_prep_nona)

# Add in installation_year diff (diff vs. pct_diff)
ferc1_eia_prep.loc[
:, "installation_year_ferc1"
ferc1_eia_prep.installation_year_ferc1.notna(), "installation_year_ferc1"
] = ferc1_eia_prep.installation_year_ferc1.astype("Int64")

ferc1_eia_prep.loc[
@@ -212,7 +212,6 @@ def _prep_ferc1_eia(ferc1_eia, utils_eia860) -> pd.DataFrame:

# Add best match col
ferc1_eia_prep = _is_best_match(ferc1_eia_prep)

return ferc1_eia_prep
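
A note on the `DataFrame.update` call introduced above: it aligns on the index and
only overwrites with non-NA values from the other frame, which is what makes the
masked-subset-then-update pattern NA-safe. A tiny illustration with toy data:

```python
import pandas as pd

left = pd.DataFrame({"val": [0.0, 0.0, 0.0]}, index=[0, 1, 2])
patch = pd.DataFrame({"val": [9.0, None]}, index=[1, 2])

# Row 1 gets 9.0; row 2's NA in `patch` is skipped; row 0 is not in `patch`.
left.update(patch)
print(left["val"].tolist())  # [0.0, 9.0, 0.0]
```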


