CSS (Carolina Solar Services) Documentation

Standard Operating Procedure

This SOP outlines the steps for setting up and running the Python-based program, which is designed for automated data processing for Carolina Solar Services.

Requirements

  • A computer with Python installed.
  • Internet access for downloading the program files.

Setup and Installation

  1. Clone the Repository:
    • Open a terminal or command prompt.
    • Run the command: git clone https://github.com/syuu-syuu/carolinasolar.git
    • This will download the program files to your local machine.
  2. Create a Python Virtual Environment:
    • Navigate to the cloned directory: cd carolinasolar
    • Create a virtual environment named 'venv' by running: python -m venv venv
    • Activate the virtual environment:
      • On Windows: venv\Scripts\activate
      • On macOS/Linux: source venv/bin/activate
  3. Install Dependencies:
    • Ensure you are in the root directory of the project (carolinasolar).
    • Install required Python packages: pip install -r requirements.txt

Running the Program

  1. Prepare Data Files:
    • Place all data files that need to be processed into the data folder within the project directory.
    • Ensure that geocoding data, site address data, and work order files are also placed in the data folder.
  2. Execute the Program:
    • Navigate to the src directory within the project: cd src
    • Run the program: python -u main.py

Post-Execution

  1. Processed Data:
    • Once the program has run successfully, processed data will be located in the output/exportedData directory.
    • Log files generated during the process will be saved in the output/log directory.
  2. Original Data Files:
    • The original data files that have been processed will be moved to the data/processed folder for record-keeping.

Documentation for Each Module

main

Overview

This module serves as the main function of the program. Its primary task is to identify and process files named with the suffix "Monthly.csv" located in the data folder at the upper level of the directory structure.

Functionalities

Each file is processed sequentially using the process_file function from the processFile module.

After processing a file, the program clears any logged messages accumulated in the getInfo.log_messages list, a variable defined at the module level.

Once all targeted files have been handled, the script compiles a summary. This summary is generated by sorting the Summary.summary DataFrame by the "Site Name" column. Finally, the sorted summary is saved as a CSV file at the "**../output/summary.csv**" path.
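
A sketch of this flow (the file-discovery pattern and exact calls are assumptions based on the description above):

```python
import glob
import os

from processFile import process_file
import getInfo
import Summary

# Process every "*Monthly.csv" file in the data folder one level up.
for path in sorted(glob.glob(os.path.join("..", "data", "*Monthly.csv"))):
    process_file(path)
    getInfo.log_messages.clear()  # reset the module-level log between files

# Sort the accumulated summary by site and export it.
summary = Summary.summary.sort_values(by="Site Name")
summary.to_csv("../output/summary.csv", index=False)
```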

processFile

Overview

The processFile.py module is a comprehensive unit that integrates the various sub-modules to execute the complete data processing workflow. It consists of a single function, process_file, which is designed to be called for each file in the data folder with the file path as its argument.

Functionalities

  • Initial Logging

    Logs introductory notes about display constraints for record counts and inverter column details.

  • Data Reading and Normalization

    Extracts the site name from the file path.

    Reads site data using the read_site function from the readData module.

    Processes data through a pipeline that renames columns (with the rename function) and normalizes site-specific data (with the normalize function).

  • Missing Data Handling

    Checks for missing data in columns including 'irradiance', 'temperature', 'wind speed', 'meter power', 'meter voltage', and 'inverter_n' using check_missing.

    Retrieves and logs work order records for any missing 'meter power' data using fetch_workorder. If there are no missing records, a corresponding message is logged.

  • Data Renaming and Output

    Reverts inverter columns to their original names before saving.

    Outputs the processed data into a CSV file in the "**../output/exportedData/**" directory.

    Writes log messages to a text file specific to the site in the "**../output/log/**" directory.

  • File Management

    Moves processed files to a designated "processed" directory for organizational purposes.

  • Summary Update

    Appends a new row with site-specific summaries to the global DataFrame Summary.summary.

    Resets the specific summary status variables for the next iteration.
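
A condensed sketch of how process_file might chain these steps; the helper imports, site-name extraction, and return shape of check_missing are assumptions based on the descriptions above:

```python
import os

from readData import read_site
from renameColumn import rename
from normalizeData import normalize
from checkMissing import check_missing
from checkWorkorder import fetch_workorder

def process_file(file_path):
    # Derive the site name from the file path (naming scheme assumed).
    site_name = os.path.basename(file_path).replace("Monthly.csv", "").strip()
    df = read_site(file_path)
    df = normalize(rename(df), site_name)
    df, missing_dates = check_missing(df)   # assumed return shape
    if missing_dates:
        fetch_workorder(missing_dates, site_name)
    # ... then revert inverter names, export the CSV and log file,
    # move the source file to data/processed, and append the site's
    # status row to Summary.summary.
```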

Notes

  • Ensure that the required directories ("**../output/exportedData/**", "**../output/log/**", "**../data/processed**") exist and have the necessary permissions for file operations.

readData

Overview

This module provides functionality for reading and preprocessing data files for site-specific analysis. It includes three functions: one for reading site data, one for reading work order records, and a helper function to detect the separator used in the original files.

Functionalities

  1. detect_separator

    • Purpose

      Determines the delimiter used in a data file.

    • Inputs

      line (string): A line from the data file.

    • Returns

      A string representing the detected delimiter (comma , or semicolon ;).
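
A minimal sketch of such a helper, assuming a simple count-based heuristic (the actual logic in detect_separator may differ):

```python
def detect_separator(line: str) -> str:
    """Guess the delimiter used in a data file from a single line.

    Heuristic: whichever of ';' or ',' occurs more often in the
    line is taken to be the delimiter.
    """
    return ";" if line.count(";") > line.count(",") else ","
```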

  2. read_site

    • Purpose

      Reads site data from a specified file path.

    • Inputs

      file_path (string): The path to the data file.

    • Process

      Reads the first 10 lines of the file to determine the header line, which contains the word "timestamp" (case insensitive).

      Identifies the correct delimiter by passing the detected header line to detect_separator.

      Uses pandas.read_csv to read the file, skipping rows up to the header.

      Checks for and handles an optional extra unit row immediately after the header.

    • Error Handling

      Logs and prints an error message if the header is not found; exits the program on this error condition.

    • Returns

      A pandas DataFrame with the site data.
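
A hedged sketch of this reading logic; the header scan and delimiter detection follow the description above, while the unit-row test shown is purely illustrative:

```python
import sys
import pandas as pd

def read_site(file_path: str) -> pd.DataFrame:
    # Scan the first 10 lines for a header row containing "timestamp".
    header_idx, header_line = None, None
    with open(file_path) as f:
        for i, line in enumerate(f):
            if i >= 10:
                break
            if "timestamp" in line.lower():
                header_idx, header_line = i, line
                break
    if header_idx is None:
        print(f"Header not found in {file_path}")
        sys.exit(1)

    sep = detect_separator(header_line)
    df = pd.read_csv(file_path, sep=sep, skiprows=header_idx)

    # Handle an optional extra unit row directly below the header
    # (the unit-pattern check here is an illustrative assumption).
    if df.iloc[0].astype(str).str.contains(r"W/m|kW|deg", regex=True).any():
        df = df.iloc[1:].reset_index(drop=True)
    return df
```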

  3. read_workorder

    • Purpose

      Reads work order data from a predefined file.

    • Process:

      Uses pandas.read_csv to read the work order file located at "**../data/(PE)OpenedWOs.csv**" with no row skipping.

    • Returns:

      A pandas DataFrame containing the work order data.

Notes

  • Ensure the file paths provided to these functions are correct and accessible.
  • The read_site function assumes a specific structure in the data files, particularly the presence of the word "timestamp" in the header of site data.

renameColumn

Overview

This module is responsible for the systematic restructuring of column names in a DataFrame.

Functions

  1. column_temperature

    • Purpose

      Standardizes temperature-related column names.

    • Process

      Identifies all columns containing "temperature".

      Renames the relevant column to "Temperature", prioritizing the column containing the keyword "ambient" if multiple temperature columns are found.

      Removes any redundant temperature columns.

  2. column_wind

    • Purpose

      Standardizes wind-related column names.

    • Process:

      Identifies all columns containing "wind" or "speed".

      Renames the first relevant column to "Wind Speed".

      Adds a new "Wind Speed" column with NaN values if none is found.

  3. column_voltage

    • Purpose

      Standardizes voltage-related column names.

    • Process:

      Identifies all columns containing "voltage".

      Renames the column with the fewest NaN values to "Meter Voltage" if multiple are found.

      Removes any redundant voltage columns.

  4. find_keywords

    • Purpose

      Searches for specific keywords within a column name.

    • Inputs

      column (string): The name of the column to check.

      keywords_list (list): A list of keywords or keyword pairs to search for in the column name.

    • Returns

      Boolean indicating whether the keywords were found in the column name.

  5. column_others

    • Purpose

      Handles the renaming of other essential columns of "Timestamp", "POA Irradiance", and "Meter Power".

    • Process

      Defines a keyword_mapping dictionary where keys are the new standardized column names and values are lists of keywords (or keyword pairs).

      Iterates through the **keyword_mapping** dictionary, and for each mapping checks each column in the DataFrame using find_keywords to see if it matches the specified keywords. (a) If a matching column is found, it's added to a rename_mapping dictionary as key with its new name as value; (b) If no matching column is found for a particular key, a new column with that key as the name is created, filled with NaN values.

      Renames the columns in the DataFrame based on the rename_mapping.
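
A sketch of how find_keywords and the column_others mapping loop might fit together; the keyword lists here are illustrative, not the module's actual values:

```python
import numpy as np
import pandas as pd

def find_keywords(column: str, keywords_list: list) -> bool:
    """True if the column name matches any entry in keywords_list.

    An entry may be a single keyword, or a list of keywords that must
    all appear in the column name (a keyword pair).
    """
    name = column.lower()
    for entry in keywords_list:
        if isinstance(entry, list):
            if all(kw in name for kw in entry):
                return True
        elif entry in name:
            return True
    return False

def column_others(df: pd.DataFrame) -> pd.DataFrame:
    keyword_mapping = {                      # illustrative keywords
        "Timestamp": ["timestamp"],
        "POA Irradiance": [["poa", "irradiance"], "irradiance"],
        "Meter Power": [["meter", "power"]],
    }
    rename_mapping = {}
    for new_name, keywords in keyword_mapping.items():
        match = next((c for c in df.columns if find_keywords(c, keywords)), None)
        if match is not None:
            rename_mapping[match] = new_name   # found: schedule the rename
        else:
            df[new_name] = np.nan              # not found: create a NaN column
    return df.rename(columns=rename_mapping)
```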

  6. column_inverter

    • Purpose

      Renames and organizes inverter-related columns.

    • Process:

      Renames any remaining columns, which are not identifiable through specific keywords, as "Inverter_x" (where x is a number).

      Maintains a record of the original inverter column names during the renaming process using a global dictionary called **name_mapping**. This dictionary is utilized later to revert the inverter column names to their original forms when exporting the processed data files.
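
A minimal sketch of this bookkeeping; the set of already-standardized column names is an assumption for illustration:

```python
import pandas as pd

name_mapping = {}  # global: new inverter name -> original column name

STANDARD_COLUMNS = {"Timestamp", "POA Irradiance", "Temperature",
                    "Wind Speed", "Meter Power", "Meter Voltage"}

def column_inverter(df: pd.DataFrame) -> pd.DataFrame:
    renames = {}
    leftovers = [c for c in df.columns if c not in STANDARD_COLUMNS]
    for i, col in enumerate(leftovers, start=1):
        new_name = f"Inverter_{i}"
        renames[col] = new_name
        name_mapping[new_name] = col  # remembered so export can revert
    return df.rename(columns=renames)
```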

  7. column_reorder

    • Purpose

      Reorders the columns in the DataFrame for standardization.

    • Process:

      Orders columns with specific names first, followed by sorted inverter columns.

  8. rename

    • Purpose

      Aggregates all column renaming functions into a pipeline.

    • Process

      Applies each column renaming function in sequence to the DataFrame.

Notes

  • The renaming process is sensitive to the presence of specific keywords in the column names.
  • The module maintains a record of original inverter column names, which can be useful for backtracking and reference purposes.

normalizeData

Overview

This module is mainly used for converting timestamp formats, identifying day or night based on geographical coordinates, and normalizing non-numeric data values.

Functions

  1. custom_to_datetime

    • Purpose

      Converts the 'Timestamp' column in a DataFrame to a standard datetime object.

    • Inputs

      df (DataFrame): The DataFrame processed by the previous module **renameColumn**.

    • Process

      Tries multiple datetime formats to find the correct one for conversion.

      Converts 'Timestamp' to a datetime object using the first matching format.

    • Error Handling

      Raises a ValueError if no suitable format is found, indicating an issue with the timestamp format in the data.

    • Returns

      The DataFrame with the 'Timestamp' column converted to datetime objects.
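
A sketch of the try-each-format approach; the candidate format list is illustrative:

```python
import pandas as pd

CANDIDATE_FORMATS = [        # illustrative; the real list may differ
    "%m/%d/%Y %H:%M",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y %H:%M",
]

def custom_to_datetime(df: pd.DataFrame) -> pd.DataFrame:
    for fmt in CANDIDATE_FORMATS:
        try:
            df["Timestamp"] = pd.to_datetime(df["Timestamp"], format=fmt)
            return df
        except ValueError:
            continue  # wrong format: try the next candidate
    raise ValueError("No suitable timestamp format found for 'Timestamp'")
```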

  2. determine_day_night

    • Purpose

      Assesses each timestamp in a DataFrame and categorizes it as either day or night, based on the site's geographical information and the calculated local sunrise and sunset times.

    • Inputs

      row (Series): A row from the DataFrame.

      lat, lng (float): Latitude and longitude of the site.

      tz (string): The timezone of the site.

    • Process

      Uses the Sun class from the suntime library, along with the latitude and longitude data, to calculate the sunrise and sunset times for the date of the given timestamp.

      Converts the calculated sunrise and sunset times from UTC to the local timezone of the site.

      Compares the timestamp from the DataFrame row with the local sunrise and sunset times. (a) If the timestamp falls between sunrise and sunset (inclusive), it is categorized as "Day"; (b) else, categorized as "Night".

    • Returns

      A string ("Day" or "Night") indicating the time of day for each timestamp.
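
A hedged sketch of this classification, assuming getTargetTime (from the getGeocoding module) performs the UTC-to-local conversion and that row timestamps are naive local times:

```python
import pytz
from suntime import Sun
from getGeocoding import getTargetTime

def determine_day_night(row, lat, lng, tz):
    sun = Sun(lat, lng)
    date = row["Timestamp"].date()
    # suntime returns timezone-aware UTC datetimes.
    sunrise = getTargetTime(sun.get_sunrise_time(date), "UTC", tz)
    sunset = getTargetTime(sun.get_sunset_time(date), "UTC", tz)
    ts = row["Timestamp"]
    if ts.tzinfo is None:
        ts = pytz.timezone(tz).localize(ts)  # make the comparison tz-aware
    return "Day" if sunrise <= ts <= sunset else "Night"
```

Applied row-wise, e.g. df.apply(determine_day_night, axis=1, args=(lat, lng, tz)).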

  3. normalize

    • Purpose

      Aggregates multiple functions to implement comprehensive data normalization, including converting timestamps and adding a 'Day/Night' column.

    • Inputs

      df (DataFrame): The DataFrame to be normalized.

      site_name (string): The name of the site, used for geolocation.

    • Process

      Transforms all non-timestamp columns in the DataFrame into numeric data types before any further processing.

      Uses the custom_to_datetime function to convert the 'Timestamp' column into a uniform datetime object format.

      Retrieves the geographical coordinates for the given site_name and determines the timezone corresponding to those coordinates.

      Applies determine_day_night to each row to create a new 'Day/Night' column and concatenates this new column with the original DataFrame.

    • Returns

      The normalized DataFrame with additional 'Day/Night' information.

Notes

  • Accurate geolocation information (latitude, longitude) is crucial for the functioning of the determine_day_night function.

getGeocoding

Overview

This module provides functions for retrieving geocoding information (latitude and longitude) for specific sites, determining the corresponding timezone based on geographical coordinates, and converting time data between different timezones.

Functions

    1. getGeocoding

      • Purpose

        Retrieves latitude and longitude for a given site from a predefined geocoding data file.

      • Inputs

        site_name (string): The name of the site for which geocoding information is required.

      • Process

        Reads a CSV file ("**../data/geoCoding.csv**") containing geocoding information.

        Searches for the specific row matching the site_name.

        Extracts latitude and longitude from the found row.

      • Returns

        A tuple containing (latitude, longitude) of the site.

        Returns (None, None) if the site is not found in the geocoding file.
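
A minimal sketch of this lookup, using the file path and column names documented in the Notes below:

```python
import pandas as pd

def getGeocoding(site_name: str):
    geo = pd.read_csv("../data/geoCoding.csv")
    row = geo[geo["Site Name"] == site_name]
    if row.empty:
        return None, None  # site not present in the geocoding file
    return row.iloc[0]["Latitude"], row.iloc[0]["Longitude"]
```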

    2. getTimeZone

      • Purpose

        Determines the timezone for a given set of geographical coordinates.

      • Inputs

        latitude, longitude (float): The geographical coordinates of the site.

      • Process

        Utilizes the TimezoneFinder library to find the timezone at the specified coordinates.

      • Returns

        The timezone as a string.

        If coordinates are not provided or invalid, returns None.
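
A sketch using timezonefinder's standard API:

```python
from timezonefinder import TimezoneFinder

def getTimeZone(latitude, longitude):
    if latitude is None or longitude is None:
        return None
    tf = TimezoneFinder()
    # Returns an IANA zone name such as "America/New_York", or None.
    return tf.timezone_at(lat=latitude, lng=longitude)
```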

    3. getTargetTime

      • Purpose

        Converts a given timestamp from its original timezone to a target timezone. (In the normalizeData module, this function is used to convert timestamps from UTC to the specific site's local timezone, enabling comparisons between local times and the local sunrise and sunset times.)

      • Inputs

        original_time (datetime): The timestamp to be converted.

        original_timezone, target_timezone (string): The original and target timezones for the conversion.

      • Process

        Localizes the original_time to its original timezone if it's not timezone-aware.

        Then converts the timestamp to the target timezone.

      • Returns

        The converted timestamp in the target timezone.
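
A sketch of the localize-then-convert pattern, assuming pytz is the timezone backend (zoneinfo would work equally well):

```python
import pytz

def getTargetTime(original_time, original_timezone, target_timezone):
    if original_time.tzinfo is None:
        # Naive timestamps are first localized to their original timezone.
        original_time = pytz.timezone(original_timezone).localize(original_time)
    return original_time.astimezone(pytz.timezone(target_timezone))
```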

Notes

  • Ensure that the "**../data/geoCoding.csv**" file exists and is properly formatted with "Site Name", "Latitude", and "Longitude" columns.
  • getTargetTime assumes that the input timestamp is either timezone-aware or in the original timezone specified.

getInfo

Overview

This module provides functionality for formatting raw DataFrame output into more user-friendly tabular formats, managing subsets of data for display, and logging messages that are later written to a log file.

Functions

    1. log

      • Purpose

        Stores and appends messages to a log list.

      • Inputs

        message (string): The content to be logged.

      • Process

        Appends the provided message to the log_messages list.

    2. get_info

      • Purpose

        Formats a subset of rows from a DataFrame into a table for display.

      • Inputs

        df (DataFrame): The DataFrame from which information needs to be extracted.

      • Process

        Retrieves a subset of the DataFrame using get_subset.

        Creates a PrettyTable with column names and selected data.

        Iterates over the subset rows, adding them to the table.

      • Returns

        A PrettyTable object representing the selected data in a table format.

    3. get_subset

      • Purpose

        Extracts a subset of rows and/or columns from a DataFrame for simplified display.

      • Inputs

        rows (DataFrame): The DataFrame from which the subset is to be extracted.

      • Process

        Limits the number of columns to 12 (first 9 and last 3) if there are more than 12 columns.

        Limits the number of rows to 20 if there are more than 20 rows.

      • Returns

        A subset of the DataFrame based on the defined constraints.
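
A sketch of how get_subset and get_info might work together, using the prettytable package:

```python
import pandas as pd
from prettytable import PrettyTable

def get_subset(rows: pd.DataFrame) -> pd.DataFrame:
    # At most 12 columns (first 9 and last 3) and 20 rows.
    if rows.shape[1] > 12:
        rows = pd.concat([rows.iloc[:, :9], rows.iloc[:, -3:]], axis=1)
    return rows.head(20)

def get_info(df: pd.DataFrame) -> PrettyTable:
    subset = get_subset(df)
    table = PrettyTable()
    table.field_names = list(subset.columns)
    for _, row in subset.iterrows():
        table.add_row(list(row))
    return table
```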

    4. format_workorder

      • Purpose

        Formats work order data into a readable table format.

      • Inputs

        workorder (DataFrame): The DataFrame containing work order data.

      • Process

        Creates a PrettyTable and populates it with data from the workorder DataFrame.

        Iterates through each row in the DataFrame, adding it to the table.

      • Returns

        A PrettyTable object displaying the work order data in a structured table format.

checkWorkorder

Overview

This module is designed to fetch and process work order data related to specific sites, particularly focusing on missing date ranges. It includes functions for checking date ranges, converting timezones, and filtering relevant work order records based on the missing dates in a dataset.

Functions

    1. is_any_date_in_range

      • Purpose

        Determines if any date within a specified range falls within a list of missing dates.

      • Inputs

        start, end, mark (datetime): the Start, Fault End, and Marked Complete/Incomplete dates defining the range.

        missing_dates (list): A list of missing dates to check against.

      • Process

        Generates a date range based on the provided start, end, or mark dates.

        Checks if any date in this range is present in the missing_dates list.

      • Returns

        A boolean indicating whether any date in the range matches the missing dates.
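
A hedged sketch, assuming the range ends at the Fault End date when available and otherwise at the Marked Complete/Incomplete date:

```python
import pandas as pd

def is_any_date_in_range(start, end, mark, missing_dates) -> bool:
    range_end = end if pd.notna(end) else mark   # assumed fallback order
    days = pd.date_range(pd.Timestamp(start).normalize(),
                         pd.Timestamp(range_end).normalize(), freq="D")
    missing = {pd.Timestamp(d).normalize() for d in missing_dates}
    return any(day in missing for day in days)
```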

    2. convert_time

      • Purpose

        Converts a time string from one timezone to another.

      • Inputs

        time_str (string): The time string to be converted.

        local_timezone, target_timezone (string): The original and target timezones for conversion.

      • Process

        Converts the time string to a datetime object.

        Uses getTargetTime to convert this time to the target timezone.

      • Returns

        The converted datetime object in the target timezone.

    3. fetch_workorder

      • Purpose

        Retrieves work order records related to specific missing dates for a given site name.

      • Inputs

        missing_dates (list): List of dates where data is missing.

        site_name (string): Name of the site for which work orders are being fetched.

      • Process:

        Reads work orders related to the site.

        Converts relevant time fields in the work orders to the local timezone.

        Identifies work orders related to the missing dates using is_any_date_in_range.

        Logs and formats the matched work orders.

      • Returns

        A DataFrame of matched work order records.

checkMissing

Overview

This module focuses on identifying and handling missing values in various data columns of a DataFrame. It employs various strategies to autofill missing values or flag them for further review.

Functions

    1. check_missing_irradiance

      • Purpose

        Identify and handle missing values in the 'POA Irradiance' column of a DataFrame. It also adjusts the 'Day/Night' column based on irradiance levels.

      • Process

        Identifies missing values (NaN) in the 'POA Irradiance' column and classifies these missing values based on whether they occur during the day or night, as indicated by the 'Day/Night' column.

        If there are missing values during the day, these are specifically noted for further review. The details of these day-time missing values are formatted and logged using the get_info function. If all missing values occur during the night, a log entry is made to indicate this.

        If daytime missing values are detected, the irradiance_status in the Summary module is marked with an "x" to indicate a data issue that needs attention.

        Adjusts the 'Day/Night' column based on irradiance values: If 'POA Irradiance' is greater than 1, the corresponding 'Day/Night' status is set to 'Day'; If 'POA Irradiance' is 1 or less (but not -999, the placeholder for missing values), the status is set to 'Night'.

      • Returns

        The modified DataFrame with filled missing values in 'POA Irradiance' and adjusted 'Day/Night' column.
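
The adjustment step alone might look like this sketch (adjust_day_night is a hypothetical helper name; the module performs this inside check_missing_irradiance):

```python
import pandas as pd

def adjust_day_night(df: pd.DataFrame) -> pd.DataFrame:
    poa = df["POA Irradiance"]
    df.loc[poa > 1, "Day/Night"] = "Day"
    # -999 is the missing-value placeholder and must be excluded here.
    df.loc[(poa <= 1) & (poa != -999), "Day/Night"] = "Night"
    return df
```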

    2. check_and_autofill_temperature_and_wind

      • Purpose

        Identify and manage missing values in the 'Temperature' and 'Wind Speed' columns of a DataFrame.

      • Process

        Iterates through the specified columns ('Temperature' and 'Wind Speed') to check for missing values (NaN).

        Determines the importance of these missing values based on a condition: if 'POA Irradiance' is greater than or equal to a defined threshold (CONDITION_VALUE, set at 100).

        All missing values in the specified columns are filled with a placeholder value of -999.

        Logs the total number of missing values found in each column. For missing values occurring when 'POA Irradiance' is above the threshold, the function logs detailed information about these cases. If no missing values are found when 'POA Irradiance' is above the threshold, a log entry confirms there is no significant missing data.

      • Returns

        The DataFrame with autofilled missing values in the 'Temperature' and 'Wind Speed' columns.

    3. check_and_autofill_Meter

      • Purpose

        Identify and handle missing values in the 'Meter Power' column of a DataFrame.

      • Process

        1. Identify Missing Meter Power Values

          Detects missing (NaN) values in the 'Meter Power' column.

          Classifies these missing values based on whether they occur during the day.

        2. Identify Conditions for Autofilling

          Identifies cases where missing meter power values can be autofilled, based on whether corresponding inverter data is present for the missing meter power.

        3. Autofilling Strategy:

          Fills all missing 'Meter Power' values with a placeholder value (-999).

          For rows with missing meter power during the day, the function attempts to autofill these based on the sum of the inverter values, if possible.

        4. Log and Flag Missing Values:

          Logs detailed information about missing 'Meter Power' values after possible autofill operations, particularly focusing on daytime data. Note that if all inverter data is present, missing 'Meter Power' values, once autofilled using the sum of inverter outputs, are no longer considered a data issue.

        5. Update Summary Module:

          If there are missing meter power values during the day with corresponding inverter issues, the function updates the production_status in the Summary module to indicate a data issue.
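
A hedged sketch of the autofill core (autofill_meter is a hypothetical helper name; the real function also handles the logging and Summary updates described above):

```python
import pandas as pd

def autofill_meter(df: pd.DataFrame) -> pd.DataFrame:
    inverter_cols = [c for c in df.columns if c.startswith("Inverter_")]
    missing_day = df["Meter Power"].isna() & (df["Day/Night"] == "Day")
    # Autofill only rows where every inverter reading is present.
    fillable = missing_day & df[inverter_cols].notna().all(axis=1)
    df.loc[fillable, "Meter Power"] = df.loc[fillable, inverter_cols].sum(axis=1)
    # Everything still missing receives the -999 placeholder.
    df["Meter Power"] = df["Meter Power"].fillna(-999)
    return df
```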

    4. check_and_autofill_inverter

      • Purpose

        Identify and handle missing inverter values in a DataFrame.

      • Process

        1. Identifying Missing Inverter Values

          Detects rows with any missing values in inverter columns (Inverter_ prefix).

          Differentiates them between day and night time using the 'Day/Night' column.

        2. Autofilling Strategy

          The function employs a two-step approach to autofill missing inverter values:

          Step 1: Site Off Condition: Fills missing inverter values with 0 when the site is determined to be off (either during the night or when meter power is ≤ 0).

          Step 2: More Detailed Inverter Analysis:

          For remaining missing inverter values, the function compares the total meter power with the combined output of all non-missing inverters:

          • If the meter power is less than or equal to the sum of the outputs of existing inverters, it suggests that the missing inverters are likely 'off'. In such cases, they are autofilled with 0.
          • Conversely, if the meter power is higher than the sum of the existing inverters, some of the missing inverters might actually be 'on'. In that case, the function calculates the average output of the non-missing inverters and uses it to estimate how many of the missing inverters are likely 'on' to account for the total meter power. Missing inverters judged to be 'on' are autofilled with a value of 1 (representing 'on'). Where the exact number of 'on' inverters cannot be determined precisely, the function may fill a conservative estimate of the missing inverters with 1, leaving the others as missing.
        3. Logging and Documentation:

          Logs the number of rows where missing inverter values have been autofilled, and provides detailed information about these autofilled records.

          In cases where it is challenging to determine the exact status of missing inverters (whether they should be on or off), the function logs this ambiguity and details the autofill strategy used.

          If there are still rows with missing inverter values after autofill attempts, these are flagged for further review.

        4. Updating Summary Module:

          If there are unresolved missing inverter values after the autofill process, the function updates the inverter_status in the Summary module to indicate potential data issues.

      • Returns

        The DataFrame with autofilled missing values in the 'Inverter_x' columns.
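
A simplified sketch of the two-step strategy; the estimation branch for partially 'on' inverters is summarized in a comment rather than fully implemented:

```python
import pandas as pd

def autofill_inverters(df: pd.DataFrame) -> pd.DataFrame:
    inv_cols = [c for c in df.columns if c.startswith("Inverter_")]

    # Step 1: site off (night, or meter power <= 0) -> missing inverters are 0.
    site_off = (df["Day/Night"] == "Night") | (df["Meter Power"] <= 0)
    df.loc[site_off, inv_cols] = df.loc[site_off, inv_cols].fillna(0)

    # Step 2: compare meter power against the sum of the known inverters.
    for idx, row in df[df[inv_cols].isna().any(axis=1)].iterrows():
        known_sum = row[inv_cols].sum()  # pandas skips NaNs by default
        if row["Meter Power"] <= known_sum:
            # The missing inverters are most likely off.
            df.loc[idx, inv_cols] = row[inv_cols].fillna(0)
        # else: estimate from the average output of the known inverters
        # how many missing inverters are 'on' and fill those with 1,
        # leaving genuinely ambiguous ones missing (omitted for brevity).
    return df
```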

    5. check_and_autofill_voltage

      • Purpose

        Deals with missing values in the 'Meter Voltage' column.

      • Process

        1. Identify Missing Voltage Values:

          Detects missing (NaN) values in the 'Meter Voltage' column.

          Differentiates cases based on whether the entire column is missing data or if only specific rows are missing.

        2. Autofilling Strategy:

          Complete Column Empty: If the entire 'Meter Voltage' column is missing, all values are filled with a placeholder value of -999. This scenario is logged as the column being completely empty.

          Partial Missing Values: When only some values in the 'Meter Voltage' column are missing, the function first identifies rows where the Meter Power is above 0. For these rows, the function calculates the average of the existing non-zero voltage readings and autofills the missing voltage values with this calculated average.

        3. Logging Autofill Actions:

          The function logs detailed information about the autofilling actions it performs. For rows where missing voltages are autofilled with the average, the function provides specific details about these cases, including the number of rows affected and the average voltage used for autofilling.

      • Returns

        The DataFrame with autofilled missing values in the 'Meter Voltage' column.
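
A sketch of this fallback logic (autofill_voltage is a hypothetical helper name):

```python
import pandas as pd

def autofill_voltage(df: pd.DataFrame) -> pd.DataFrame:
    voltage = df["Meter Voltage"]
    if voltage.isna().all():
        # The whole column is empty: fill with the -999 placeholder.
        df["Meter Voltage"] = -999
        return df
    # Partial gaps: average the existing non-zero readings and use that
    # average where the site is producing (Meter Power above 0).
    avg = voltage[voltage > 0].mean()
    fill_rows = voltage.isna() & (df["Meter Power"] > 0)
    df.loc[fill_rows, "Meter Voltage"] = avg
    return df
```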

    6. check_missing

      • Purpose

        Aggregates all the missing value checks and autofills into a single function.

      • Process

        Uses a pipeline approach, sequentially applying specialized functions to handle missing data in different columns:

        1. Irradiance Check with check_missing_irradiance function
        2. Temperature and Wind Speed Check with the check_and_autofill_temperature_and_wind function
        3. Inverter Check with check_and_autofill_inverter function
        4. Meter Power Check with the check_and_autofill_Meter function (potentially returning a list of dates (missing_dates) for further investigation in relevant work orders.)
        5. Voltage Check with check_and_autofill_voltage function
      • Returns:

        missing_dates: dates with missing 'Meter Power' that could not be autofilled. These can be passed to the **checkWorkorder** module to retrieve relevant work order records.
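
A sketch of the pipeline; the assumption that check_and_autofill_Meter returns the DataFrame together with missing_dates is illustrative:

```python
def check_missing(df):
    df = check_missing_irradiance(df)
    df = check_and_autofill_temperature_and_wind(df)
    df = check_and_autofill_inverter(df)
    df, missing_dates = check_and_autofill_Meter(df)  # assumed return shape
    df = check_and_autofill_voltage(df)
    return df, missing_dates
```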

Summary

Overview

The Summary module maintains a cumulative DataFrame that provides a structured summary of important missing data across multiple sites. It also keeps track of three status variables, for production, irradiance, and inverter issues, which temporarily record missing-data information for each site during processing. Once the data processing for a site is completed, the status values are appended to the summary DataFrame and cleared within the processFile module. Finally, the summary DataFrame is exported to a CSV file in the main module.
