This SOP outlines the steps for setting up and running the Python-based program, which automates data processing for Carolina Solar Services.
- A computer with Python installed.
- Internet access for downloading the program files.
- Clone the Repository:
  - Open a terminal or command prompt.
  - Run the command: `git clone https://github.com/syuu-syuu/carolinasolar.git`
  - This will download the program files to your local machine.
- Create a Python Virtual Environment:
  - Navigate to the cloned directory: `cd carolinasolar`
  - Create a virtual environment named `venv` by running: `python -m venv venv`
  - Activate the virtual environment:
    - On Windows: `venv\Scripts\activate`
    - On macOS/Linux: `source venv/bin/activate`
- Install Dependencies:
  - Ensure you are in the root directory of the project (`carolinasolar`).
  - Install the required Python packages: `pip install -r requirements.txt`
- Prepare Data Files:
  - Place all data files that need to be processed into the `data` folder within the project directory.
  - Ensure that geocoding data, site address data, and work order files are also placed in the `data` folder.
- Execute the Program:
  - Navigate to the `src` directory within the project: `cd src`
  - Run the program: `python -u main.py`
- Processed Data:
  - Once the program has run successfully, processed data will be located in the `output/exportedData` directory.
  - Log files generated during the process will be saved in the `output/log` directory.
- Original Data Files:
  - The original data files that have been processed will be moved to the `data/processed` folder for record-keeping.
This module serves as the main function of the program. Its primary task is to identify and process files named with the suffix "Monthly.csv" located in the data folder at the upper level of the directory structure.

Each file is processed sequentially using the `process_file` function from the `processFile` module. After processing a file, the program clears any logged messages accumulated in the `getInfo.log_messages` list, a variable defined at the module level.

Once all targeted files have been handled, the script compiles a summary: the rows of the `Summary.summary` DataFrame are ordered by the "Site Name" column, and the organized summary is saved as a CSV file at the "**../output/summary.csv**" path.
The `processFile.py` module integrates various sub-modules to execute the complete data processing workflow. It consists of a single function, `process_file`, which is called for each file in the data folder with the file path as its argument.

- **Initial Logging**: Logs introductory notes about display constraints for record counts and inverter column details.
- **Data Reading and Normalization**: Extracts the site name from the file path. Reads site data using the `read_site` function from the `readData` module. Processes data through a pipeline that renames columns (with the `rename` function) and normalizes data specific to the site (with the `normalize` function).
- **Missing Data Handling**: Checks for missing data in columns including 'irradiance', 'temperature', 'wind speed', 'meter power', 'meter voltage', and 'inverter_n' using `check_missing`. Retrieves and logs work order records for any missing 'meter power' data using `fetch_workorder`; if there are no missing records, a corresponding message is logged.
- **Data Renaming and Output**: Reverts inverter columns to their original names before saving. Outputs the processed data to a CSV file in the "**../output/exportedData/**" directory. Writes log messages to a site-specific text file in the "**../output/log/**" directory.
- **File Management**: Moves processed files to a designated "processed" directory for organizational purposes.
- **Summary Update**: Appends a new row with site-specific summaries to the global DataFrame `Summary.summary`, then resets the summary status variables for the next iteration.

- Ensure that the required directories ("**../output/exportedData/**", "**../output/log/**", "**../data/processed**") exist and have the necessary permissions for file operations.
This module provides functionality for reading and preprocessing data files for site-specific analysis. It includes three functions: one for reading site data, one for reading work order records, and a helper function that detects the separator used in the original files.

- `detect_separator`
  - **Purpose**: Determines the delimiter used in a data file.
  - **Inputs**: `line` (string): A line from the data file.
  - **Returns**: A string representing the detected delimiter (comma `,` or semicolon `;`).
- `read_site`
  - **Purpose**: Reads site data from a specified file path.
  - **Inputs**: `file_path` (string): The path to the data file.
  - **Process**: Reads the first 10 lines of the file to locate the header line, which contains the word "timestamp" (case insensitive). Identifies the correct delimiter by calling `detect_separator` on the header line. Uses `pandas.read_csv` to read the file, skipping rows up to the header. Checks for and handles an optional extra unit row immediately after the header.
  - **Error Handling**: Logs and prints an error message if the header is not found; exits the program on this error condition.
  - **Returns**: A pandas DataFrame with the site data.
- `read_workorder`
  - **Purpose**: Reads work order data from a predefined file.
  - **Process**: Uses `pandas.read_csv` to read the work order file located at "**../data/(PE)OpenedWOs.csv**" with no row skipping.
  - **Returns**: A pandas DataFrame containing the work order data.

- Ensure the file paths provided to these functions are correct and accessible.
- The `read_site` function assumes a specific structure in the data files, particularly the presence of the word "timestamp" in the header of site data.
This module is responsible for the systematic restructuring of column names in a DataFrame.

- `column_temperature`
  - **Purpose**: Standardizes temperature-related column names.
  - **Process**: Identifies all columns containing "temperature". Renames the relevant column to "Temperature", prioritizing the column containing the keyword "ambient" if multiple temperature columns are found. Removes any redundant temperature columns.
- `column_wind`
  - **Purpose**: Standardizes wind-related column names.
  - **Process**: Identifies all columns containing "wind" or "speed". Renames the first relevant column to "Wind Speed". Adds a new "Wind Speed" column with `NaN` values if none is found.
- `column_voltage`
  - **Purpose**: Standardizes voltage-related column names.
  - **Process**: Identifies all columns containing "voltage". Renames the column with the fewest `NaN` values to "Meter Voltage" if multiple are found. Removes any redundant voltage columns.
- `find_keywords`
  - **Purpose**: Searches for specific keywords within a column name.
  - **Inputs**: `column` (string): The name of the column to check. `keywords_list` (list): A list of keywords or keyword pairs to search for in the column name.
  - **Returns**: A boolean indicating whether the keywords were found in the column name.
- `column_others`
  - **Purpose**: Handles the renaming of the other essential columns: "Timestamp", "POA Irradiance", and "Meter Power".
  - **Process**: Defines a `keyword_mapping` dictionary whose keys are the new standardized column names and whose values are lists of keywords (or lists of keyword pairs). Iterates through the `keyword_mapping` dictionary, and for each mapping checks each column in the DataFrame using `find_keywords` to see if it matches the specified keywords: (a) if a matching column is found, it is added to a `rename_mapping` dictionary as a key with its new name as the value; (b) if no matching column is found for a particular key, a new column with that key as its name is created, filled with `NaN` values. Finally, renames the columns in the DataFrame based on the `rename_mapping`.
- `column_inverter`
  - **Purpose**: Renames and organizes inverter-related columns.
  - **Process**: Renames any remaining columns, which are not identifiable through specific keywords, as "Inverter_x" (where x is a number). Maintains a record of the original inverter column names during the renaming process using a global dictionary called `name_mapping`. This dictionary is used later to revert the inverter column names to their original forms when exporting the processed data files.
- `column_reorder`
  - **Purpose**: Reorders the columns in the DataFrame for standardization.
  - **Process**: Orders columns with specific names first, followed by sorted inverter columns.
- `rename`
  - **Purpose**: Aggregates all column renaming functions into a pipeline.
  - **Process**: Applies each column renaming function in sequence to the DataFrame.

- The renaming process is sensitive to the presence of specific keywords in the column names.
- The module maintains a record of original inverter column names, which can be useful for backtracking and reference purposes.
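The keyword-matching logic of `find_keywords` and `column_others` can be sketched as follows. The specific keywords in `KEYWORD_MAPPING` below are assumptions for illustration, not the module's actual mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping in the spirit of column_others: a standardized name
# maps to keywords (a plain string, or a list whose keywords must all match).
KEYWORD_MAPPING = {
    "Timestamp": ["timestamp"],
    "POA Irradiance": [["poa", "irradiance"], "irradiance"],
    "Meter Power": [["meter", "power"]],
}

def find_keywords(column: str, keywords_list) -> bool:
    col = column.lower()
    for kw in keywords_list:
        if isinstance(kw, list):            # all keywords in the pair must appear
            if all(k in col for k in kw):
                return True
        elif kw in col:                     # a single keyword is enough
            return True
    return False

def column_others(df: pd.DataFrame) -> pd.DataFrame:
    rename_mapping = {}
    for new_name, keywords in KEYWORD_MAPPING.items():
        match = next((c for c in df.columns if find_keywords(c, keywords)), None)
        if match is not None:
            rename_mapping[match] = new_name
        else:                               # no match: create the column with NaN
            df[new_name] = np.nan
    return df.rename(columns=rename_mapping)
```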
This module is mainly used for converting timestamp formats, identifying day or night based on geographical coordinates, and normalizing non-numeric data values.

- `custom_to_datetime`
  - **Purpose**: Converts the 'Timestamp' column in a DataFrame to a standard `datetime` object.
  - **Inputs**: `df` (DataFrame): The DataFrame processed by the preceding `renameColumn` module.
  - **Process**: Tries multiple datetime formats to find the correct one for conversion. Converts 'Timestamp' to a `datetime` object using the first matching format.
  - **Error Handling**: Raises a `ValueError` if no suitable format is found, indicating an issue with the timestamp format in the data.
  - **Returns**: The DataFrame with the 'Timestamp' column converted to `datetime` objects.
- `determine_day_night`
  - **Purpose**: Assesses each timestamp in a DataFrame and categorizes it as day or night based on the retrieved geographical information and the calculated local sunrise and sunset times.
  - **Inputs**: `row` (Series): A row from the DataFrame. `lat`, `lng` (float): Latitude and longitude of the site. `tz` (string): The timezone of the site.
  - **Process**: Uses the `Sun` class from the `suntime` library, along with the latitude and longitude data, to calculate the sunrise and sunset times for the date of the given timestamp. Converts the calculated sunrise and sunset times from UTC to the local timezone of the site. Compares the timestamp from the DataFrame row with the local sunrise and sunset times: (a) if the timestamp falls between sunrise and sunset (inclusive), it is categorized as "Day"; (b) otherwise, it is categorized as "Night".
  - **Returns**: A string ("Day" or "Night") indicating the time of day for the timestamp.
- `normalize`
  - **Purpose**: Aggregates multiple functions to implement comprehensive data normalization, including converting timestamps and adding a 'Day/Night' column.
  - **Inputs**: `df` (DataFrame): The DataFrame to be normalized. `site_name` (string): The name of the site, used for geolocation.
  - **Process**: Transforms all non-timestamp columns in the DataFrame into numeric data types before any further processing. Uses the `custom_to_datetime` function to convert the 'Timestamp' column into a uniform `datetime` object format. Retrieves the geographical coordinates for the given `site_name` and determines the timezone corresponding to those coordinates. Applies `determine_day_night` to each row to create a new 'Day/Night' column and concatenates this new column with the original DataFrame.
  - **Returns**: The normalized DataFrame with additional 'Day/Night' information.

- Accurate geolocation information (latitude, longitude) is crucial for the functioning of the `determine_day_night` function.
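The day/night comparison can be sketched as below. To keep the sketch self-contained, the `suntime` calculation is replaced with precomputed UTC sunrise/sunset arguments, and the standard-library `zoneinfo` stands in for the module's own timezone-conversion helper; both substitutions are assumptions for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def determine_day_night(row_time: datetime, sunrise_utc: datetime,
                        sunset_utc: datetime, tz: str) -> str:
    """Classify a timestamp as 'Day' or 'Night'. sunrise_utc/sunset_utc stand
    in for what suntime's Sun class would compute from lat/lng."""
    local = ZoneInfo(tz)
    sunrise = sunrise_utc.astimezone(local)   # UTC -> site-local time
    sunset = sunset_utc.astimezone(local)
    # Inclusive on both ends, per the module description.
    return "Day" if sunrise <= row_time <= sunset else "Night"
```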
This module provides functions for retrieving geocoding information (latitude and longitude) for specific sites, determining the corresponding timezone based on geographical coordinates, and converting time data between different timezones.

- `getGeocoding`
  - **Purpose**: Retrieves latitude and longitude for a given site from a predefined geocoding data file.
  - **Inputs**: `site_name` (string): The name of the site for which geocoding information is required.
  - **Process**: Reads a CSV file ("**../data/geoCoding.csv**") containing geocoding information. Searches for the specific row matching the `site_name`. Extracts latitude and longitude from the found row.
  - **Returns**: A tuple containing `(latitude, longitude)` of the site. Returns `(None, None)` if the site is not found in the geocoding file.
- `getTimeZone`
  - **Purpose**: Determines the timezone for a given set of geographical coordinates.
  - **Inputs**: `latitude`, `longitude` (float): The geographical coordinates of the site.
  - **Process**: Uses the `TimezoneFinder` library to find the timezone at the specified coordinates.
  - **Returns**: The timezone as a string. If coordinates are not provided or are invalid, returns `None`.
- `getTargetTime`
  - **Purpose**: Converts a given timestamp from its original timezone to a target timezone. (In the `normalizeData` module, this function is used to convert timestamps from UTC to the specific site's local timezone, enabling comparisons between local times and the local sunrise and sunset times.)
  - **Inputs**: `original_time` (datetime): The timestamp to be converted. `original_timezone`, `target_timezone` (string): The original and target timezones for the conversion.
  - **Process**: Localizes the `original_time` to its original timezone if it is not timezone-aware, then converts the timestamp to the target timezone.
  - **Returns**: The converted timestamp in the target timezone.

- Ensure that the "**../data/geoCoding.csv**" file exists and is properly formatted with "Site Name", "Latitude", and "Longitude" columns.
- `getTargetTime` assumes that the input timestamp is either timezone-aware or in the original timezone specified.
This module provides functionality for formatting raw DataFrame outputs into more user-friendly tabular formats, managing subsets of data for display, and logging messages that are later written to a log file.

- `log`
  - **Purpose**: Stores and appends messages to a log list.
  - **Inputs**: `message` (string): The content to be logged.
  - **Process**: Appends the provided message to the `log_messages` list.
- `get_info`
  - **Purpose**: Formats a subset of rows from a DataFrame into a table for display.
  - **Inputs**: `df` (DataFrame): The DataFrame from which information needs to be extracted.
  - **Process**: Retrieves a subset of the DataFrame using `get_subset`. Creates a `PrettyTable` with column names and selected data, iterating over the subset rows and adding them to the table.
  - **Returns**: A `PrettyTable` object representing the selected data in table format.
- `get_subset`
  - **Purpose**: Extracts a subset of rows and/or columns from a DataFrame for simplified display.
  - **Inputs**: `rows` (DataFrame): The DataFrame from which the subset is to be extracted.
  - **Process**: Limits the number of columns to 12 (the first 9 and the last 3) if there are more than 12 columns. Limits the number of rows to 20 if there are more than 20 rows.
  - **Returns**: A subset of the DataFrame based on the defined constraints.
- `format_workorder`
  - **Purpose**: Formats work order data into a readable table format.
  - **Inputs**: `workorder` (DataFrame): The DataFrame containing work order data.
  - **Process**: Creates a `PrettyTable` and populates it with data from the `workorder` DataFrame, iterating through each row and adding it to the table.
  - **Returns**: A `PrettyTable` object displaying the work order data in a structured table format.
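The row/column trimming performed by `get_subset` can be sketched as:

```python
import pandas as pd

def get_subset(rows: pd.DataFrame) -> pd.DataFrame:
    # Keep at most 12 columns: the first 9 and the last 3.
    if rows.shape[1] > 12:
        rows = pd.concat([rows.iloc[:, :9], rows.iloc[:, -3:]], axis=1)
    # Keep at most 20 rows.
    return rows.head(20)
```

The first-9-plus-last-3 split keeps the identifying columns (timestamp, weather, meter readings) visible while still showing the trailing inverter columns.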
This module is designed to fetch and process work order data related to specific sites, particularly focusing on missing date ranges. It includes functions for checking date ranges, converting timezones, and filtering relevant work order records based on the missing dates in a dataset.

- `is_any_date_in_range`
  - **Purpose**: Determines whether any date within a specified range falls within a list of missing dates.
  - **Inputs**: `start`, `end`, `mark` (datetime): the Start, Fault End, and Marked Complete/Incomplete dates defining the range. `missing_dates` (list): A list of missing dates to check against.
  - **Process**: Generates a date range based on the provided start, end, or mark dates. Checks whether any date in this range is present in the `missing_dates` list.
  - **Returns**: A boolean indicating whether any date in the range matches the missing dates.
- `convert_time`
  - **Purpose**: Converts a time string from one timezone to another.
  - **Inputs**: `time_str` (string): The time string to be converted. `local_timezone`, `target_timezone` (string): The original and target timezones for conversion.
  - **Process**: Converts the time string to a `datetime` object, then uses `getTargetTime` to convert this time to the target timezone.
  - **Returns**: The converted `datetime` object in the target timezone.
- `fetch_workorder`
  - **Purpose**: Retrieves work order records related to specific missing dates for a given site name.
  - **Inputs**: `missing_dates` (list): List of dates where data is missing. `site_name` (string): Name of the site for which work orders are being fetched.
  - **Process**: Reads work orders related to the site. Converts relevant time fields in the work orders to the local timezone. Identifies work orders related to the missing dates using `is_any_date_in_range`. Logs and formats the matched work orders.
  - **Returns**: A DataFrame of matched work order records.
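The range check can be sketched as follows. One detail is an assumption: the text does not say which of the Fault End (`end`) and Marked Complete/Incomplete (`mark`) dates takes precedence when both exist, so this sketch prefers `end` and falls back to `mark`.

```python
from datetime import date, timedelta

def is_any_date_in_range(start, end, mark, missing_dates) -> bool:
    """Return True if any day in [start, end-or-mark] is a missing date."""
    stop = end or mark          # assumed precedence: Fault End, then Marked date
    if start is None or stop is None:
        return False
    missing = set(missing_dates)  # O(1) membership tests
    current = start
    while current <= stop:
        if current in missing:
            return True
        current += timedelta(days=1)
    return False
```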
This module focuses on identifying and handling missing values in various data columns of a DataFrame. It employs various strategies to autofill missing values or flag them for further review.

- `check_missing_irradiance`
  - **Purpose**: Identifies and handles missing values in the 'POA Irradiance' column of a DataFrame. It also adjusts the 'Day/Night' column based on irradiance levels.
  - **Process**: Identifies missing values (`NaN`) in the 'POA Irradiance' column and classifies them based on whether they occur during the day or night, as indicated by the 'Day/Night' column. If there are missing values during the day, these are specifically noted for further review; the details of these daytime missing values are formatted and logged using the `get_info` function. If all missing values occur during the night, a log entry is made to indicate this. If daytime missing values are detected, the `irradiance_status` in the Summary module is marked with an "x" to indicate a data issue that needs attention. Adjusts the 'Day/Night' column based on irradiance values: if 'POA Irradiance' is greater than 1, the corresponding 'Day/Night' status is set to 'Day'; if 'POA Irradiance' is 1 or less (but not -999, the placeholder for missing values), the status is set to 'Night'.
  - **Returns**: The modified DataFrame with filled missing values in 'POA Irradiance' and an adjusted 'Day/Night' column.
- `check_and_autofill_temperature_and_wind`
  - **Purpose**: Identifies and manages missing values in the 'Temperature' and 'Wind Speed' columns of a DataFrame.
  - **Process**: Iterates through the specified columns ('Temperature' and 'Wind Speed') to check for missing values (`NaN`). Determines the importance of these missing values based on a condition: whether 'POA Irradiance' is greater than or equal to a defined threshold (`CONDITION_VALUE`, set at 100). All missing values in the specified columns are filled with a placeholder value of -999. Logs the total number of missing values found in each column. For missing values occurring when 'POA Irradiance' is above the threshold, the function logs detailed information about these cases; if no such missing values are found, a log entry confirms there is no significant missing data.
  - **Returns**: The DataFrame with autofilled missing values in the 'Temperature' and 'Wind Speed' columns.
- `check_and_autofill_Meter`
  - **Purpose**: Identifies and handles missing values in the 'Meter Power' column of a DataFrame.
  - **Process**:
    - **Identify Missing Meter Power Values**: Detects missing (`NaN`) values in the 'Meter Power' column and classifies them based on whether they occur during the day.
    - **Identify Conditions for Autofilling**: Identifies cases where missing meter power values can be autofilled based on whether corresponding inverter data is present for the missing meter power.
    - **Autofilling Strategy**: Fills all missing 'Meter Power' values with a placeholder value (-999). For rows with missing meter power during the day, the function attempts to autofill these values from the sum of the inverter values, where possible.
    - **Log and Flag Missing Values**: Logs detailed information about missing 'Meter Power' values after possible autofill operations, particularly focusing on daytime data. Note that if all inverter data is present, missing 'Meter Power' values, once autofilled using the sum of inverter outputs, are no longer considered a data issue.
    - **Update Summary Module**: If there are missing meter power values during the day with corresponding inverter issues, the function updates the `production_status` in the Summary module to indicate a data issue.
- `check_and_autofill_inverter`
  - **Purpose**: Identifies and handles missing inverter values in a DataFrame.
  - **Process**:
    - **Identifying Missing Inverter Values**: Detects rows with any missing values in inverter columns (the `Inverter_` prefix) and differentiates them between day and night using the 'Day/Night' column.
    - **Autofilling Strategy**: The function employs a two-step approach to autofill missing inverter values. Step 1 (Site Off Condition): fills missing inverter values with 0 when the site is determined to be off, either during the night or when meter power is ≤ 0. Step 2 (More Detailed Inverter Analysis): for the remaining missing inverter values, the function compares the total meter power with the combined output of all non-missing inverters. If the meter power is less than or equal to the sum of the outputs of the existing inverters, the missing inverters are likely 'off' and are autofilled with 0. Conversely, if the meter power is higher than the sum of the existing inverters, some of the missing inverters might actually be 'on'. The function then calculates the average output of the currently non-missing inverters and uses this average to estimate how many of the missing inverters are likely 'on' to account for the total meter power. Missing inverters regarded as "should be on" are autofilled with a value of 1 (representing 'on'). Where it is challenging to precisely determine the number of 'on' inverters, the function may fill a conservative estimate of missing inverters with 1, leaving the others as missing.
    - **Logging and Documentation**: Logs the number of rows where missing inverter values have been autofilled and provides detailed information about these autofilled records. In cases where it is challenging to determine the exact status of missing inverters (whether they should be on or off), the function logs this ambiguity and details the autofill strategy used. If rows with missing inverter values remain after the autofill attempts, these are flagged for further review.
    - **Updating Summary Module**: If there are unresolved missing inverter values after the autofill process, the function updates the `inverter_status` in the Summary module to indicate potential data issues.
  - **Returns**: The DataFrame with autofilled missing values in the 'Inverter_x' columns.
- `check_and_autofill_voltage`
  - **Purpose**: Deals with missing values in the 'Meter Voltage' column.
  - **Process**:
    - **Identify Missing Voltage Values**: Detects missing (`NaN`) values in the 'Meter Voltage' column and differentiates between the entire column missing data and only specific rows missing.
    - **Autofilling Strategy**: Complete column empty: if the entire 'Meter Voltage' column is missing, all values are filled with a placeholder value of -999, and this scenario is logged as the column being completely empty. Partial missing values: when only some values in the 'Meter Voltage' column are missing, the function first identifies rows where the Meter Power is above 0; for these rows, it calculates the average of the existing non-zero voltage readings and autofills the missing voltage values with this average.
    - **Logging Autofill Actions**: Logs detailed information about the autofilling actions performed. For rows where missing voltages are autofilled with the average, the function provides specific details, including the number of rows affected and the average voltage used for autofilling.
  - **Returns**: The DataFrame with autofilled missing values in the 'Meter Voltage' column.
- `check_missing`
  - **Purpose**: Aggregates all the missing value checks and autofills into a single function.
  - **Process**: Uses a pipeline approach, sequentially applying specialized functions to handle missing data in different columns: an irradiance check with the `check_missing_irradiance` function, a temperature and wind speed check with the `check_and_autofill_temperature_and_wind` function, an inverter check with the `check_and_autofill_inverter` function, a meter power check with the `check_and_autofill_Meter` function (potentially returning a list of dates, `missing_dates`, for further investigation in relevant work orders), and a voltage check with the `check_and_autofill_voltage` function.
  - **Returns**: `missing_dates`, the dates with missing 'Meter Power' that cannot be autofilled, which can be used in the `checkWorkorder` module to retrieve helpful work order records.
The Summary module maintains a cumulative DataFrame to provide a structured summary of important missing data across multiple sites. It also keeps track of three status variables, for production, irradiance, and inverter issues, which temporarily hold missing-data information for the specific site being processed. Once the data processing for a site is completed, the status data is appended to the summary DataFrame and cleared within the processFile module. Ultimately, the summary DataFrame is exported to a CSV file in the main module.