
Issue 3770 transform phmsagas data #3929

Draft · wants to merge 17 commits into main
Conversation

@seeess1 (Contributor) commented Oct 24, 2024

Overview

This is still WIP!

Relates to #3770.

What problem does this address?

Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

  • Added a new transform script for PHMSA data.
  • Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
  • Added a few other helper methods.
  • Updated column map files with 2023 values (carried over from the 2022 mappings).
  • Specifically changed the ordering of columns in src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Will come back to everything below once this draft PR has gone through an initial round of review.

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list


@e-belfer e-belfer self-requested a review October 24, 2024 18:11
@e-belfer e-belfer added the phmsa (Data from the Pipeline and Hazardous Material Safety Administration) and community labels Oct 24, 2024
@e-belfer (Member) left a comment:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

@@ -1009,19 +1009,19 @@ def remove_leading_zeros_from_numeric_strings(
     return df


-def fix_eia_na(df: pd.DataFrame) -> pd.DataFrame:
+def fix_na(df: pd.DataFrame) -> pd.DataFrame:
     """Replace common ill-posed EIA NA spreadsheet values with np.nan.
Member:

Suggested change
-"""Replace common ill-posed EIA NA spreadsheet values with np.nan.
+"""Replace common NA spreadsheet values with np.nan.

We should just update the docstring to match here.

Member:

If we're changing this function name, maybe something more descriptive like standardize_na_values() would be better than "fix" since I could see fixing NA values meaning filling them in with non-null values.
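A minimal sketch of the rename being suggested (the sentinel list here is illustrative, not the helper's actual set, which lives in pudl.helpers):

import numpy as np
import pandas as pd


def standardize_na_values(df: pd.DataFrame) -> pd.DataFrame:
    """Replace common ill-posed NA spreadsheet values with np.nan."""
    # Illustrative sentinels only; the real list lives in pudl.helpers.
    na_sentinels = ["", ".", "-", "--", "n/a", "N/A", "NA"]
    return df.replace(na_sentinels, np.nan)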


# Remove unwanted characters (parentheses, spaces, periods, and dashes) from the main phone number
phone_main = re.sub(
    r"[^\d]", "", phone_main.replace(".0", "")
Member:

Can you explain what the special instance of replace is doing here?
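One plausible reading (an assumption, not confirmed in this thread): phone numbers read from Excel without explicit dtypes arrive as floats, so casting them to str leaves a trailing ".0" that must be stripped before the digit filter runs:

import re

# Hypothetical value: a phone number cell read from Excel as a float.
phone_main = str(2025550123.0)                 # -> "2025550123.0"
phone_main = phone_main.replace(".0", "")      # drop the float-cast artifact
phone_main = re.sub(r"[^\d]", "", phone_main)  # keep digits only
print(phone_main)                              # -> "2025550123"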

intl_code = phone_main[:-10] # Digits before the last 10
main_number = phone_main[-10:] # Last 10 digits are the phone number
formatted_phone = (
    f"+{intl_code}-{main_number[:3]}-{main_number[3:6]}-{main_number[6:]}"
Member:

International phone numbers don't all follow the pattern 3-3-#, so we probably just want to report the number without enforcing formatting in this case.
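A sketch of the reviewer's suggestion, reusing the PR's variable names (assumed context): keep US-style grouping for 10-digit numbers and pass longer numbers through unformatted apart from a leading "+":

if len(phone_main) == 10:
    # US number: ###-###-####
    formatted_phone = f"{phone_main[:3]}-{phone_main[3:6]}-{phone_main[6:]}"
else:
    # International number: report the digits as-is rather than
    # enforcing a 3-3-4 grouping that may not apply.
    formatted_phone = f"+{phone_main}"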

additional_information,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,,,,,,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information
preparer_name,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name
preparer_title,,,,,,,,,,,,,,,,,,,,,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title
preparer_phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,pphone,pphone,pphone,pphone,pphone,pphone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone
Member:

Good catch!

@asset(
    ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
    io_manager_key=None,  # TODO: check on this one ("pudl_io_manager"? something else?)
    compute_kind=None,  # TODO: check on this one
Member:

All this does is add a pretty graphic to Dagster; it's totally optional, but if you want to include it, use compute_kind="pandas".
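For context, a sketch of the decorator with both TODOs resolved along the lines suggested; the asset name is hypothetical and "pudl_io_manager" is taken from the PR's own TODO comment, not confirmed here:

from dagster import AssetIn, asset


@asset(
    ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
    io_manager_key="pudl_io_manager",  # writes the finished table to the database
    compute_kind="pandas",  # purely cosmetic tag shown in the Dagster UI
)
def core_phmsagas__yearly_distribution(raw_data):  # hypothetical asset name
    ...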

# Ensure all "report_year" values have four digits
mask = df["report_year"] < 100

# Convert 2-digit years to appropriate 4-digit format (assume cutoff at year 50)
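A sketch of the conversion that comment describes, using the PR's cutoff of 50 (which the review below questions):

import numpy as np

# Two-digit years: 50-99 -> 1950-1999, 0-49 -> 2000-2049 (cutoff assumed).
mask = df["report_year"] < 100
df.loc[mask, "report_year"] = np.where(
    df.loc[mask, "report_year"] >= 50,
    df.loc[mask, "report_year"] + 1900,
    df.loc[mask, "report_year"] + 2000,
)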
Member:

Are you seeing values as low as 50? Anything lower than 1990 would be a red flag for me at this point (at least until we bring in the older data in #3290).

Member:

If this column is unreliable, we can also parse the spreadsheet data year and add it as a column during the extraction.

return exception_cols


def standardize_state_columns(df: pd.DataFrame, state_columns: list) -> pd.DataFrame:
Member:

See POLITICAL_SUBDIVISIONS in pudl.metadata.dfs - there's already a mapping that can be used to convert names to state code, so this function should be redundant.
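A hedged sketch of reusing that mapping; the subdivision_name column is an assumption, though subdivision_code is referenced later in this thread:

from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

# Build a full-name -> two-letter-code mapping from the existing metadata
# table instead of hand-maintaining a new dictionary.
state_map = dict(
    zip(
        POLITICAL_SUBDIVISIONS["subdivision_name"].str.title(),
        POLITICAL_SUBDIVISIONS["subdivision_code"],
    )
)
for col in state_columns:
    df[col] = df[col].str.title().map(state_map).fillna(df[col].str.upper())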

].apply(lambda col: col.str.title())

# List of state columns to standardize
df = standardize_state_columns(
Member:

See my comments in the helpers function - use POLITICAL_SUBDIVISIONS here instead of creating a new function, but this is definitely a step we still want to take! Would also be good to check that there aren't any non-valid entries here.

@zaneselvans - is there any conceivable reason we don't already use an enum constraint on the state field in fields.py? I can't imagine us not wanting to enforce restrictions on this one.

@zaneselvans (Member) commented Oct 26, 2024:

I think we've used an enum constraint for the state column in EPA CEMS because it's a huge table and we want it to be interpreted as a categorical value to save space, and we don't use foreign key constraints in the hourly data that's only written to parquet.

But for tables going into the database we can either constrain the value by defining a data source or table specific enum constraint on the field (see the bottom of pudl/metadata/fields.py -- this would result in it being a categorical dtype) or by defining a foreign key constraint that points to the subdivision_code column of the core_pudl__codes_subdivisions (which would result in it remaining a string).

We haven't historically applied either kind of constraint on the state fields, but we really should! The enum route is probably the simpler for now.
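A sketch of the enum route, following the per-resource pattern described for the bottom of pudl/metadata/fields.py (the exact structure and names here are assumptions):

from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

# Constrain "state" to known subdivision codes, which also makes the field
# a categorical dtype when the table is written out.
FIELD_METADATA_BY_RESOURCE = {
    "core_phmsagas__yearly_distribution": {
        "state": {
            "constraints": {"enum": sorted(POLITICAL_SUBDIVISIONS["subdivision_code"])}
        },
    },
}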

# (on the scale of thousands) do not actually sum up to "excavation_damage_total".
df[YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]] = df[
    YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]
].fillna(0)
Member:

We distinguish between NA as "not reported" and 0 as "reported as 0", so filling this would actually change the meaning here (from "I didn't fill this in" to "I had 0 incidents"). Please drop this step.

)

# Fill NA values with zeroes because these columns are simply counts.
# Note that "excavation_damage..." columns should sum up to the value in "excavation_damage_total". However, many rows
Member:

We should definitely add this validation as an asset check, and use fillna(0) there to do the comparison calculation.
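A sketch of that asset check (asset, function, and column names are all assumptions), with fillna(0) confined to the comparison so the stored table keeps its NAs:

from dagster import AssetCheckResult, asset_check

# Hypothetical list of the "excavation_damage..." subcategory columns.
EXCAVATION_DAMAGE_COLUMNS = ["excavation_damage_one_call", "excavation_damage_locating"]


@asset_check(asset="core_phmsagas__yearly_distribution")  # hypothetical asset name
def check_excavation_damage_totals(core_phmsagas__yearly_distribution):
    df = core_phmsagas__yearly_distribution
    subtotal = df[EXCAVATION_DAMAGE_COLUMNS].fillna(0).sum(axis=1)
    bad_rows = int((subtotal != df["excavation_damage_total"].fillna(0)).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"bad_rows": bad_rows})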

}

for col in state_columns:
    df[col] = df[col].replace(state_abbreviations).str.upper()
Member:

When you have a complete mapping between old and new values, the col.map() method should be much faster. However any value that appears in the series but not in the dictionary defining the mapping gets converted to NA.
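A sketch of that pattern: map() for speed, with a fillna() fallback to the original values so anything missing from the dictionary isn't silently nulled out:

for col in state_columns:
    # map() returns NA for values absent from state_abbreviations, so fall
    # back to the original value instead of dropping it.
    df[col] = df[col].map(state_abbreviations).fillna(df[col]).str.upper()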

Comment on lines +2191 to +2203
"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""
Member:

This docstring won't render correctly in the docs. Check how it looks in the built API documentation.

Suggested change
"""Standardize phone numbers in the specified columns of the DataFrame.

US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
    df: The DataFrame to modify.
    columns: A list of the names of the columns that need to be standardized

Returns:
    The modified DataFrame with standardized phone numbers in the same column.
"""

Comment on lines +2264 to +2267
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
Member:

Suggested change
"""Analyze columns of a DataFrame for missing or invalid values.

Note that this is purely for analysis and does not perform any data
transformation or cleaning. This function checks each column for
missing or custom missing values and prints a summary of the findings
for string (object), numeric, and datetime columns.

Labels: community, phmsa
Status: In progress
3 participants