
Issue 3770 transform phmsagas data #3929

Draft · wants to merge 17 commits into main
Conversation

@seeess1 (Contributor) commented Oct 24, 2024

Overview

This is still WIP!

Relates to #3770.

What problem does this address?

Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

  • Added a new transform script for PHMSA data.
  • Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
  • Added a few other helper methods.
  • Updated column map files with 2023 values (carried over from the 2022 mappings).
  • Specifically changed the ordering of columns in src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Will come back to everything below once this draft PR has gone through an initial round of review.

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list


@e-belfer e-belfer self-requested a review October 24, 2024 18:11
@e-belfer e-belfer added the phmsa (Data from the Pipeline and Hazardous Material Safety Administration) and community labels Oct 24, 2024
@e-belfer (Member) left a comment:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

@@ -1009,19 +1009,19 @@ def remove_leading_zeros_from_numeric_strings(
     return df


-def fix_eia_na(df: pd.DataFrame) -> pd.DataFrame:
+def fix_na(df: pd.DataFrame) -> pd.DataFrame:
     """Replace common ill-posed EIA NA spreadsheet values with np.nan.
Member:

Suggested change
-"""Replace common ill-posed EIA NA spreadsheet values with np.nan.
+"""Replace common NA spreadsheet values with np.nan.

We should just update the docstring to match here.

Member:

If we're changing this function name, maybe something more descriptive like standardize_na_values() would be better than "fix" since I could see fixing NA values meaning filling them in with non-null values.
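A minimal sketch of the rename being suggested (the sentinel list here is illustrative, not the helper's actual set, which lives in pudl.helpers):

import numpy as np
import pandas as pd


def standardize_na_values(df: pd.DataFrame) -> pd.DataFrame:
    """Replace common ill-posed NA spreadsheet values with np.nan."""
    # Illustrative sentinels only; the real list lives in pudl.helpers.
    na_sentinels = ["", ".", "-", "--", "n/a", "N/A", "NA"]
    return df.replace(na_sentinels, np.nan)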


# Remove unwanted characters (parentheses, spaces, periods, and dashes) from the main phone number
phone_main = re.sub(
    r"[^\d]", "", phone_main.replace(".0", "")
Member:

Can you explain what the special instance of replace is doing here?
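One plausible reading (an assumption, not confirmed in this thread): phone numbers read from Excel without explicit dtypes arrive as floats, so casting them to str leaves a trailing ".0" that must be stripped before the digit filter runs:

import re

# Hypothetical value: a phone number cell read from Excel as a float.
phone_main = str(2025550123.0)                 # -> "2025550123.0"
phone_main = phone_main.replace(".0", "")      # drop the float-cast artifact
phone_main = re.sub(r"[^\d]", "", phone_main)  # keep digits only
print(phone_main)                              # -> "2025550123"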

intl_code = phone_main[:-10] # Digits before the last 10
main_number = phone_main[-10:] # Last 10 digits are the phone number
formatted_phone = (
    f"+{intl_code}-{main_number[:3]}-{main_number[3:6]}-{main_number[6:]}"
Member:

International phone numbers don't all follow the pattern 3-3-#, so we probably just want to report the number without enforcing formatting in this case.
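A sketch of the reviewer's suggestion, reusing the PR's variable names (assumed context): keep US-style grouping for 10-digit numbers and pass longer numbers through unformatted apart from a leading "+":

if len(phone_main) == 10:
    # US number: ###-###-####
    formatted_phone = f"{phone_main[:3]}-{phone_main[3:6]}-{phone_main[6:]}"
else:
    # International number: report the digits as-is rather than
    # enforcing a 3-3-4 grouping that may not apply.
    formatted_phone = f"+{phone_main}"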

additional_information,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,,,,,,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information
preparer_name,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name
preparer_title,,,,,,,,,,,,,,,,,,,,,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title
preparer_phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,pphone,pphone,pphone,pphone,pphone,pphone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone
Member:

Good catch!

@asset(
    ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
    io_manager_key=None,  # TODO: check on this one ("pudl_io_manager"? something else?)
    compute_kind=None,  # TODO: check on this one
Member:

All this does is add a pretty graphic to Dagster; it's totally optional, but if you want to include it, use compute_kind="pandas".
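For context, a sketch of the decorator with both TODOs resolved along the lines suggested; the asset name is hypothetical and "pudl_io_manager" is taken from the PR's own TODO comment, not confirmed here:

from dagster import AssetIn, asset


@asset(
    ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
    io_manager_key="pudl_io_manager",  # writes the finished table to the database
    compute_kind="pandas",  # purely cosmetic tag shown in the Dagster UI
)
def core_phmsagas__yearly_distribution(raw_data):  # hypothetical asset name
    ...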

# Ensure all "report_year" values have four digits
mask = df["report_year"] < 100

# Convert 2-digit years to appropriate 4-digit format (assume cutoff at year 50)
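A sketch of the conversion that comment describes, using the PR's cutoff of 50 (which the review below questions):

import numpy as np

# Two-digit years: 50-99 -> 1950-1999, 0-49 -> 2000-2049 (cutoff assumed).
mask = df["report_year"] < 100
df.loc[mask, "report_year"] = np.where(
    df.loc[mask, "report_year"] >= 50,
    df.loc[mask, "report_year"] + 1900,
    df.loc[mask, "report_year"] + 2000,
)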
Member:

Are you seeing values as low as 50? Anything lower than 1990 would be a red flag for me at this point (at least until we bring in the older data in #3290).

Member:

If this column is unreliable, we can also parse the spreadsheet data year and add it as a column during the extraction.

return exception_cols


def standardize_state_columns(df: pd.DataFrame, state_columns: list) -> pd.DataFrame:
Member:

See POLITICAL_SUBDIVISIONS in pudl.metadata.dfs - there's already a mapping that can be used to convert names to state code, so this function should be redundant.
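A hedged sketch of reusing that mapping; the subdivision_name column is an assumption, though subdivision_code is referenced later in this thread:

from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

# Build a full-name -> two-letter-code mapping from the existing metadata
# table instead of hand-maintaining a new dictionary.
state_map = dict(
    zip(
        POLITICAL_SUBDIVISIONS["subdivision_name"].str.title(),
        POLITICAL_SUBDIVISIONS["subdivision_code"],
    )
)
for col in state_columns:
    df[col] = df[col].str.title().map(state_map).fillna(df[col].str.upper())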

].apply(lambda col: col.str.title())

# List of state columns to standardize
df = standardize_state_columns(
Member:

See my comments in the helpers function - use POLITICAL_SUBDIVISIONS here instead of creating a new function, but this is definitely a step we still want to take! Would also be good to check that there aren't any non-valid entries here.

@zaneselvans - is there any conceivable reason we don't already use an enum constraint on the state field in fields.py? I can't imagine us not wanting to enforce restrictions on this one.

@zaneselvans (Member) commented Oct 26, 2024:

I think we've used an enum constraint for the state column in EPA CEMS because it's a huge table and we want it to be interpreted as a categorical value to save space, and we don't use foreign key constraints in the hourly data that's only written to parquet.

But for tables going into the database we can either constrain the value by defining a data source or table specific enum constraint on the field (see the bottom of pudl/metadata/fields.py -- this would result in it being a categorical dtype) or by defining a foreign key constraint that points to the subdivision_code column of the core_pudl__codes_subdivisions (which would result in it remaining a string).

We haven't historically applied either kind of constraint on the state fields, but we really should! The enum route is probably the simpler for now.
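A sketch of the enum route, following the per-resource pattern described for the bottom of pudl/metadata/fields.py (the exact structure and names here are assumptions):

from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

# Constrain "state" to known subdivision codes, which also makes the field
# a categorical dtype when the table is written out.
FIELD_METADATA_BY_RESOURCE = {
    "core_phmsagas__yearly_distribution": {
        "state": {
            "constraints": {"enum": sorted(POLITICAL_SUBDIVISIONS["subdivision_code"])}
        },
    },
}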

# (on the scale of thousands) do not actually sum up to "excavation_damage_total".
df[YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]] = df[
    YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]
].fillna(0)
Member:

We distinguish between NA as "not reported" and 0 as "reported as 0", so filling this would actually change the meaning here (from "I didn't fill this in" to "I had 0 incidents"). Please drop this step.

)

# Fill NA values with zeroes because these columns are simply counts.
# Note that "excavation_damage..." columns should sum up to the value in "excavation_damage_total". However, many rows
Member:

We should definitely add this validation as an asset check, and use fillna(0) there to do the comparison calculation.
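A sketch of that asset check (asset, function, and column names are all assumptions), with fillna(0) confined to the comparison so the stored table keeps its NAs:

from dagster import AssetCheckResult, asset_check

# Hypothetical list of the "excavation_damage..." subcategory columns.
EXCAVATION_DAMAGE_COLUMNS = ["excavation_damage_one_call", "excavation_damage_locating"]


@asset_check(asset="core_phmsagas__yearly_distribution")  # hypothetical asset name
def check_excavation_damage_totals(core_phmsagas__yearly_distribution):
    df = core_phmsagas__yearly_distribution
    subtotal = df[EXCAVATION_DAMAGE_COLUMNS].fillna(0).sum(axis=1)
    bad_rows = int((subtotal != df["excavation_damage_total"].fillna(0)).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"bad_rows": bad_rows})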

}

for col in state_columns:
    df[col] = df[col].replace(state_abbreviations).str.upper()
Member:

When you have a complete mapping between old and new values, the col.map() method should be much faster. However any value that appears in the series but not in the dictionary defining the mapping gets converted to NA.
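A sketch of that pattern: map() for speed, with a fillna() fallback to the original values so anything missing from the dictionary isn't silently nulled out:

for col in state_columns:
    # map() returns NA for values absent from state_abbreviations, so fall
    # back to the original value instead of dropping it.
    df[col] = df[col].map(state_abbreviations).fillna(df[col]).str.upper()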

Comment on lines +2191 to +2203
"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""
Member:

This docstring won't render correctly in the docs. Check how it looks in the built API documentation.

Suggested change
"""Standardize phone numbers in the specified columns of the DataFrame.

US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
    df: The DataFrame to modify.
    columns: A list of the names of the columns that need to be standardized

Returns:
    The modified DataFrame with standardized phone numbers in the same column.
"""

Comment on lines +2264 to +2267
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
Member:

Suggested change
"""Analyze columns of a DataFrame for missing or invalid values.

Note that this is purely for analysis and does not perform any data
transformation or cleaning. This function checks each column for
missing or custom missing values and prints a summary of the findings
for string (object), numeric, and datetime columns.

Labels: community, phmsa
Status: In progress
3 participants