This SOP outlines the steps for setting up and running the Python-based program, which automates data processing for Carolina Solar Services.
- A computer with Python installed.
- Internet access for downloading the program files.
- Clone the Repository:
  - Open a terminal or command prompt.
  - Run the command: `git clone https://github.com/syuu-syuu/carolinasolar.git`
  - This will download the program files to your local machine.
- Create a Python Virtual Environment:
  - Navigate to the cloned directory: `cd carolinasolar`
  - Create a virtual environment named `venv` by running: `python -m venv venv`
  - Activate the virtual environment:
    - On Windows: `venv\Scripts\activate`
    - On macOS/Linux: `source venv/bin/activate`
- Install Dependencies:
  - Ensure you are in the root directory of the project (`carolinasolar`).
  - Install the required Python packages: `pip install -r requirements.txt`
- Prepare Data Files:
  - Place all data files that need to be processed into the `data` folder within the project directory.
  - Ensure that geocoding data, site address data, and work order files are also placed in the `data` folder.
- Execute the Program:
  - Navigate to the `src` directory within the project: `cd src`
  - Run the program: `python -u main.py`
- Processed Data:
  - Once the program has run successfully, processed data will be located in the `output/exportedData` directory.
  - Log files generated during the process will be saved in the `output/log` directory.
- Original Data Files:
  - The original data files that have been processed will be moved to the `data/processed` folder for record-keeping.
This module serves as the main function of the program. Its primary task is to identify and process files named with the suffix "Monthly.csv" located in the data folder at the upper level of the directory structure.

Each file is processed sequentially using the `process_file` function from the `processFile` module. After processing a file, the program clears any logged messages accumulated in the `getInfo.log_messages` list, a variable defined at the module level.

Once all targeted files have been handled, the script compiles a summary: the rows of the `Summary.summary` DataFrame are ordered by the "Site Name" column, and the organized summary is saved as a CSV file at the "**../output/summary.csv**" path.
The `processFile.py` module integrates various sub-modules to execute the complete data processing workflow. It consists of a single function, `process_file`, which is called for each file in the data folder with the file path as its argument.

- **Initial Logging**: Logs introductory notes about display constraints for record counts and inverter column details.
- **Data Reading and Normalization**: Extracts the site name from the file path. Reads site data using the `read_site` function from the `readData` module. Processes data through a pipeline that renames columns (with the `rename` function) and normalizes data specific to the site (with the `normalize` function).
- **Missing Data Handling**: Checks for missing data in columns including 'irradiance', 'temperature', 'wind speed', 'meter power', 'meter voltage', and 'inverter_n' using `check_missing`. Retrieves and logs work order records for any missing 'meter power' data using `fetch_workorder`; if there are no missing records, a corresponding message is logged.
- **Data Renaming and Output**: Reverts inverter columns to their original names before saving. Outputs the processed data to a CSV file in the "**../output/exportedData/**" directory. Writes log messages to a site-specific text file in the "**../output/log/**" directory.
- **File Management**: Moves processed files to a designated "processed" directory for organizational purposes.
- **Summary Update**: Appends a new row with site-specific summaries to the global DataFrame `Summary.summary`, then resets the summary status variables for the next iteration.

- Ensure that the required directories ("**../output/exportedData/**", "**../output/log/**", "**../data/processed**") exist and have the necessary permissions for file operations.
This module provides functionality for reading and preprocessing data files for site-specific analysis. It includes three functions: one for reading site data, one for reading work order records, and a helper function that detects the separator used in the original files.

- `detect_separator`
  - **Purpose**: Determines the delimiter used in a data file.
  - **Inputs**: `line` (string): A line from the data file.
  - **Returns**: A string representing the detected delimiter (comma `,` or semicolon `;`).
- `read_site`
  - **Purpose**: Reads site data from a specified file path.
  - **Inputs**: `file_path` (string): The path to the data file.
  - **Process**: Reads the first 10 lines of the file to locate the header line, which contains the word "timestamp" (case insensitive). Identifies the correct delimiter by calling `detect_separator` on the header line. Uses `pandas.read_csv` to read the file, skipping rows up to the header. Checks for and handles an optional extra unit row immediately after the header.
  - **Error Handling**: Logs and prints an error message if the header is not found; exits the program on this error condition.
  - **Returns**: A pandas DataFrame with the site data.
- `read_workorder`
  - **Purpose**: Reads work order data from a predefined file.
  - **Process**: Uses `pandas.read_csv` to read the work order file located at "**../data/(PE)OpenedWOs.csv**" with no row skipping.
  - **Returns**: A pandas DataFrame containing the work order data.

- Ensure the file paths provided to these functions are correct and accessible.
- The `read_site` function assumes a specific structure in the data files, particularly the presence of the word "timestamp" in the header of site data.
This module is responsible for the systematic restructuring of column names in a DataFrame.

- `column_temperature`
  - **Purpose**: Standardizes temperature-related column names.
  - **Process**: Identifies all columns containing "temperature". Renames the relevant column to "Temperature", prioritizing the column containing the keyword "ambient" if multiple temperature columns are found. Removes any redundant temperature columns.
- `column_wind`
  - **Purpose**: Standardizes wind-related column names.
  - **Process**: Identifies all columns containing "wind" or "speed". Renames the first relevant column to "Wind Speed". Adds a new "Wind Speed" column with `NaN` values if none is found.
- `column_voltage`
  - **Purpose**: Standardizes voltage-related column names.
  - **Process**: Identifies all columns containing "voltage". Renames the column with the fewest `NaN` values to "Meter Voltage" if multiple are found. Removes any redundant voltage columns.
- `find_keywords`
  - **Purpose**: Searches for specific keywords within a column name.
  - **Inputs**: `column` (string): The name of the column to check. `keywords_list` (list): A list of keywords or keyword pairs to search for in the column name.
  - **Returns**: A boolean indicating whether the keywords were found in the column name.
- `column_others`
  - **Purpose**: Handles the renaming of the other essential columns: "Timestamp", "POA Irradiance", and "Meter Power".
  - **Process**: Defines a `keyword_mapping` dictionary whose keys are the new standardized column names and whose values are lists of keywords (or lists of keyword pairs). Iterates through the `keyword_mapping` dictionary, and for each mapping checks each column in the DataFrame using `find_keywords` to see if it matches the specified keywords: (a) if a matching column is found, it is added to a `rename_mapping` dictionary as a key with its new name as the value; (b) if no matching column is found for a particular key, a new column with that key as its name is created, filled with `NaN` values. Finally, renames the columns in the DataFrame based on the `rename_mapping`.
- `column_inverter`
  - **Purpose**: Renames and organizes inverter-related columns.
  - **Process**: Renames any remaining columns, which are not identifiable through specific keywords, as "Inverter_x" (where x is a number). Maintains a record of the original inverter column names during the renaming process using a global dictionary called `name_mapping`. This dictionary is used later to revert the inverter column names to their original forms when exporting the processed data files.
- `column_reorder`
  - **Purpose**: Reorders the columns in the DataFrame for standardization.
  - **Process**: Orders columns with specific names first, followed by sorted inverter columns.
- `rename`
  - **Purpose**: Aggregates all column renaming functions into a pipeline.
  - **Process**: Applies each column renaming function in sequence to the DataFrame.

- The renaming process is sensitive to the presence of specific keywords in the column names.
- The module maintains a record of original inverter column names, which can be useful for backtracking and reference purposes.
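The keyword-matching logic of `find_keywords` and `column_others` can be sketched as follows. The specific keywords in `KEYWORD_MAPPING` below are assumptions for illustration, not the module's actual mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping in the spirit of column_others: a standardized name
# maps to keywords (a plain string, or a list whose keywords must all match).
KEYWORD_MAPPING = {
    "Timestamp": ["timestamp"],
    "POA Irradiance": [["poa", "irradiance"], "irradiance"],
    "Meter Power": [["meter", "power"]],
}

def find_keywords(column: str, keywords_list) -> bool:
    col = column.lower()
    for kw in keywords_list:
        if isinstance(kw, list):            # all keywords in the pair must appear
            if all(k in col for k in kw):
                return True
        elif kw in col:                     # a single keyword is enough
            return True
    return False

def column_others(df: pd.DataFrame) -> pd.DataFrame:
    rename_mapping = {}
    for new_name, keywords in KEYWORD_MAPPING.items():
        match = next((c for c in df.columns if find_keywords(c, keywords)), None)
        if match is not None:
            rename_mapping[match] = new_name
        else:                               # no match: create the column with NaN
            df[new_name] = np.nan
    return df.rename(columns=rename_mapping)
```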
This module is mainly used for converting timestamp formats, identifying day or night based on geographical coordinates, and normalizing non-numeric data values.

- `custom_to_datetime`
  - **Purpose**: Converts the 'Timestamp' column in a DataFrame to a standard `datetime` object.
  - **Inputs**: `df` (DataFrame): The DataFrame processed by the preceding `renameColumn` module.
  - **Process**: Tries multiple datetime formats to find the correct one for conversion. Converts 'Timestamp' to a `datetime` object using the first matching format.
  - **Error Handling**: Raises a `ValueError` if no suitable format is found, indicating an issue with the timestamp format in the data.
  - **Returns**: The DataFrame with the 'Timestamp' column converted to `datetime` objects.
- `determine_day_night`
  - **Purpose**: Assesses each timestamp in a DataFrame and categorizes it as day or night based on the retrieved geographical information and the calculated local sunrise and sunset times.
  - **Inputs**: `row` (Series): A row from the DataFrame. `lat`, `lng` (float): Latitude and longitude of the site. `tz` (string): The timezone of the site.
  - **Process**: Uses the `Sun` class from the `suntime` library, along with the latitude and longitude data, to calculate the sunrise and sunset times for the date of the given timestamp. Converts the calculated sunrise and sunset times from UTC to the local timezone of the site. Compares the timestamp from the DataFrame row with the local sunrise and sunset times: (a) if the timestamp falls between sunrise and sunset (inclusive), it is categorized as "Day"; (b) otherwise, it is categorized as "Night".
  - **Returns**: A string ("Day" or "Night") indicating the time of day for the timestamp.
- `normalize`
  - **Purpose**: Aggregates multiple functions to implement comprehensive data normalization, including converting timestamps and adding a 'Day/Night' column.
  - **Inputs**: `df` (DataFrame): The DataFrame to be normalized. `site_name` (string): The name of the site, used for geolocation.
  - **Process**: Transforms all non-timestamp columns in the DataFrame into numeric data types before any further processing. Uses the `custom_to_datetime` function to convert the 'Timestamp' column into a uniform `datetime` object format. Retrieves the geographical coordinates for the given `site_name` and determines the timezone corresponding to those coordinates. Applies `determine_day_night` to each row to create a new 'Day/Night' column and concatenates this new column with the original DataFrame.
  - **Returns**: The normalized DataFrame with additional 'Day/Night' information.

- Accurate geolocation information (latitude, longitude) is crucial for the functioning of the `determine_day_night` function.
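The day/night comparison can be sketched as below. To keep the sketch self-contained, the `suntime` calculation is replaced with precomputed UTC sunrise/sunset arguments, and the standard-library `zoneinfo` stands in for the module's own timezone-conversion helper; both substitutions are assumptions for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def determine_day_night(row_time: datetime, sunrise_utc: datetime,
                        sunset_utc: datetime, tz: str) -> str:
    """Classify a timestamp as 'Day' or 'Night'. sunrise_utc/sunset_utc stand
    in for what suntime's Sun class would compute from lat/lng."""
    local = ZoneInfo(tz)
    sunrise = sunrise_utc.astimezone(local)   # UTC -> site-local time
    sunset = sunset_utc.astimezone(local)
    # Inclusive on both ends, per the module description.
    return "Day" if sunrise <= row_time <= sunset else "Night"
```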
This module provides functions for retrieving geocoding information (latitude and longitude) for specific sites, determining the corresponding timezone based on geographical coordinates, and converting time data between different timezones.

- `getGeocoding`
  - **Purpose**: Retrieves latitude and longitude for a given site from a predefined geocoding data file.
  - **Inputs**: `site_name` (string): The name of the site for which geocoding information is required.
  - **Process**: Reads a CSV file ("**../data/geoCoding.csv**") containing geocoding information. Searches for the specific row matching the `site_name`. Extracts latitude and longitude from the found row.
  - **Returns**: A tuple containing `(latitude, longitude)` of the site. Returns `(None, None)` if the site is not found in the geocoding file.
- `getTimeZone`
  - **Purpose**: Determines the timezone for a given set of geographical coordinates.
  - **Inputs**: `latitude`, `longitude` (float): The geographical coordinates of the site.
  - **Process**: Uses the `TimezoneFinder` library to find the timezone at the specified coordinates.
  - **Returns**: The timezone as a string. If coordinates are not provided or are invalid, returns `None`.
- `getTargetTime`
  - **Purpose**: Converts a given timestamp from its original timezone to a target timezone. (In the `normalizeData` module, this function is used to convert timestamps from UTC to the specific site's local timezone, enabling comparisons between local times and the local sunrise and sunset times.)
  - **Inputs**: `original_time` (datetime): The timestamp to be converted. `original_timezone`, `target_timezone` (string): The original and target timezones for the conversion.
  - **Process**: Localizes the `original_time` to its original timezone if it is not timezone-aware, then converts the timestamp to the target timezone.
  - **Returns**: The converted timestamp in the target timezone.

- Ensure that the "**../data/geoCoding.csv**" file exists and is properly formatted with "Site Name", "Latitude", and "Longitude" columns.
- `getTargetTime` assumes that the input timestamp is either timezone-aware or in the original timezone specified.
This module provides functionality for formatting raw DataFrame outputs into more user-friendly tabular formats, managing subsets of data for display, and logging messages that are later written to a log file.

- `log`
  - **Purpose**: Stores and appends messages to a log list.
  - **Inputs**: `message` (string): The content to be logged.
  - **Process**: Appends the provided message to the `log_messages` list.
- `get_info`
  - **Purpose**: Formats a subset of rows from a DataFrame into a table for display.
  - **Inputs**: `df` (DataFrame): The DataFrame from which information needs to be extracted.
  - **Process**: Retrieves a subset of the DataFrame using `get_subset`. Creates a `PrettyTable` with column names and selected data, iterating over the subset rows and adding them to the table.
  - **Returns**: A `PrettyTable` object representing the selected data in table format.
- `get_subset`
  - **Purpose**: Extracts a subset of rows and/or columns from a DataFrame for simplified display.
  - **Inputs**: `rows` (DataFrame): The DataFrame from which the subset is to be extracted.
  - **Process**: Limits the number of columns to 12 (the first 9 and the last 3) if there are more than 12 columns. Limits the number of rows to 20 if there are more than 20 rows.
  - **Returns**: A subset of the DataFrame based on the defined constraints.
- `format_workorder`
  - **Purpose**: Formats work order data into a readable table format.
  - **Inputs**: `workorder` (DataFrame): The DataFrame containing work order data.
  - **Process**: Creates a `PrettyTable` and populates it with data from the `workorder` DataFrame, iterating through each row and adding it to the table.
  - **Returns**: A `PrettyTable` object displaying the work order data in a structured table format.
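The row/column trimming performed by `get_subset` can be sketched as:

```python
import pandas as pd

def get_subset(rows: pd.DataFrame) -> pd.DataFrame:
    # Keep at most 12 columns: the first 9 and the last 3.
    if rows.shape[1] > 12:
        rows = pd.concat([rows.iloc[:, :9], rows.iloc[:, -3:]], axis=1)
    # Keep at most 20 rows.
    return rows.head(20)
```

The first-9-plus-last-3 split keeps the identifying columns (timestamp, weather, meter readings) visible while still showing the trailing inverter columns.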
This module is designed to fetch and process work order data related to specific sites, particularly focusing on missing date ranges. It includes functions for checking date ranges, converting timezones, and filtering relevant work order records based on the missing dates in a dataset.

- `is_any_date_in_range`
  - **Purpose**: Determines whether any date within a specified range falls within a list of missing dates.
  - **Inputs**: `start`, `end`, `mark` (datetime): the Start, Fault End, and Marked Complete/Incomplete dates defining the range. `missing_dates` (list): A list of missing dates to check against.
  - **Process**: Generates a date range based on the provided start, end, or mark dates. Checks whether any date in this range is present in the `missing_dates` list.
  - **Returns**: A boolean indicating whether any date in the range matches the missing dates.
- `convert_time`
  - **Purpose**: Converts a time string from one timezone to another.
  - **Inputs**: `time_str` (string): The time string to be converted. `local_timezone`, `target_timezone` (string): The original and target timezones for conversion.
  - **Process**: Converts the time string to a `datetime` object, then uses `getTargetTime` to convert this time to the target timezone.
  - **Returns**: The converted `datetime` object in the target timezone.
- `fetch_workorder`
  - **Purpose**: Retrieves work order records related to specific missing dates for a given site name.
  - **Inputs**: `missing_dates` (list): List of dates where data is missing. `site_name` (string): Name of the site for which work orders are being fetched.
  - **Process**: Reads work orders related to the site. Converts relevant time fields in the work orders to the local timezone. Identifies work orders related to the missing dates using `is_any_date_in_range`. Logs and formats the matched work orders.
  - **Returns**: A DataFrame of matched work order records.
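The range check can be sketched as follows. One detail is an assumption: the text does not say which of the Fault End (`end`) and Marked Complete/Incomplete (`mark`) dates takes precedence when both exist, so this sketch prefers `end` and falls back to `mark`.

```python
from datetime import date, timedelta

def is_any_date_in_range(start, end, mark, missing_dates) -> bool:
    """Return True if any day in [start, end-or-mark] is a missing date."""
    stop = end or mark          # assumed precedence: Fault End, then Marked date
    if start is None or stop is None:
        return False
    missing = set(missing_dates)  # O(1) membership tests
    current = start
    while current <= stop:
        if current in missing:
            return True
        current += timedelta(days=1)
    return False
```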
This module focuses on identifying and handling missing values in various data columns of a DataFrame. It employs various strategies to autofill missing values or flag them for further review.

- `check_missing_irradiance`
  - **Purpose**: Identifies and handles missing values in the 'POA Irradiance' column of a DataFrame. It also adjusts the 'Day/Night' column based on irradiance levels.
  - **Process**: Identifies missing values (`NaN`) in the 'POA Irradiance' column and classifies them based on whether they occur during the day or night, as indicated by the 'Day/Night' column. If there are missing values during the day, these are specifically noted for further review; the details of these daytime missing values are formatted and logged using the `get_info` function. If all missing values occur during the night, a log entry is made to indicate this. If daytime missing values are detected, the `irradiance_status` in the Summary module is marked with an "x" to indicate a data issue that needs attention. Adjusts the 'Day/Night' column based on irradiance values: if 'POA Irradiance' is greater than 1, the corresponding 'Day/Night' status is set to 'Day'; if 'POA Irradiance' is 1 or less (but not -999, the placeholder for missing values), the status is set to 'Night'.
  - **Returns**: The modified DataFrame with filled missing values in 'POA Irradiance' and an adjusted 'Day/Night' column.
- `check_and_autofill_temperature_and_wind`
  - **Purpose**: Identifies and manages missing values in the 'Temperature' and 'Wind Speed' columns of a DataFrame.
  - **Process**: Iterates through the specified columns ('Temperature' and 'Wind Speed') to check for missing values (`NaN`). Determines the importance of these missing values based on a condition: whether 'POA Irradiance' is greater than or equal to a defined threshold (`CONDITION_VALUE`, set at 100). All missing values in the specified columns are filled with a placeholder value of -999. Logs the total number of missing values found in each column. For missing values occurring when 'POA Irradiance' is above the threshold, the function logs detailed information about these cases; if no such missing values are found, a log entry confirms there is no significant missing data.
  - **Returns**: The DataFrame with autofilled missing values in the 'Temperature' and 'Wind Speed' columns.
- `check_and_autofill_Meter`
  - **Purpose**: Identifies and handles missing values in the 'Meter Power' column of a DataFrame.
  - **Process**:
    - **Identify Missing Meter Power Values**: Detects missing (`NaN`) values in the 'Meter Power' column and classifies them based on whether they occur during the day.
    - **Identify Conditions for Autofilling**: Identifies cases where missing meter power values can be autofilled based on whether corresponding inverter data is present for the missing meter power.
    - **Autofilling Strategy**: Fills all missing 'Meter Power' values with a placeholder value (-999). For rows with missing meter power during the day, the function attempts to autofill these values from the sum of the inverter values, where possible.
    - **Log and Flag Missing Values**: Logs detailed information about missing 'Meter Power' values after possible autofill operations, particularly focusing on daytime data. Note that if all inverter data is present, missing 'Meter Power' values, once autofilled using the sum of inverter outputs, are no longer considered a data issue.
    - **Update Summary Module**: If there are missing meter power values during the day with corresponding inverter issues, the function updates the `production_status` in the Summary module to indicate a data issue.
- `check_and_autofill_inverter`
  - **Purpose**: Identifies and handles missing inverter values in a DataFrame.
  - **Process**:
    - **Identifying Missing Inverter Values**: Detects rows with any missing values in inverter columns (the `Inverter_` prefix) and differentiates them between day and night using the 'Day/Night' column.
    - **Autofilling Strategy**: The function employs a two-step approach to autofill missing inverter values. Step 1 (Site Off Condition): fills missing inverter values with 0 when the site is determined to be off, either during the night or when meter power is ≤ 0. Step 2 (More Detailed Inverter Analysis): for the remaining missing inverter values, the function compares the total meter power with the combined output of all non-missing inverters. If the meter power is less than or equal to the sum of the outputs of the existing inverters, the missing inverters are likely 'off' and are autofilled with 0. Conversely, if the meter power is higher than the sum of the existing inverters, some of the missing inverters might actually be 'on'. The function then calculates the average output of the currently non-missing inverters and uses this average to estimate how many of the missing inverters are likely 'on' to account for the total meter power. Missing inverters regarded as "should be on" are autofilled with a value of 1 (representing 'on'). Where it is challenging to precisely determine the number of 'on' inverters, the function may fill a conservative estimate of missing inverters with 1, leaving the others as missing.
    - **Logging and Documentation**: Logs the number of rows where missing inverter values have been autofilled and provides detailed information about these autofilled records. In cases where it is challenging to determine the exact status of missing inverters (whether they should be on or off), the function logs this ambiguity and details the autofill strategy used. If rows with missing inverter values remain after the autofill attempts, these are flagged for further review.
    - **Updating Summary Module**: If there are unresolved missing inverter values after the autofill process, the function updates the `inverter_status` in the Summary module to indicate potential data issues.
  - **Returns**: The DataFrame with autofilled missing values in the 'Inverter_x' columns.
- `check_and_autofill_voltage`
  - **Purpose**: Deals with missing values in the 'Meter Voltage' column.
  - **Process**:
    - **Identify Missing Voltage Values**: Detects missing (`NaN`) values in the 'Meter Voltage' column and differentiates between the entire column missing data and only specific rows missing.
    - **Autofilling Strategy**: Complete column empty: if the entire 'Meter Voltage' column is missing, all values are filled with a placeholder value of -999, and this scenario is logged as the column being completely empty. Partial missing values: when only some values in the 'Meter Voltage' column are missing, the function first identifies rows where the Meter Power is above 0; for these rows, it calculates the average of the existing non-zero voltage readings and autofills the missing voltage values with this average.
    - **Logging Autofill Actions**: Logs detailed information about the autofilling actions performed. For rows where missing voltages are autofilled with the average, the function provides specific details, including the number of rows affected and the average voltage used for autofilling.
  - **Returns**: The DataFrame with autofilled missing values in the 'Meter Voltage' column.
- `check_missing`
  - **Purpose**: Aggregates all the missing value checks and autofills into a single function.
  - **Process**: Uses a pipeline approach, sequentially applying specialized functions to handle missing data in different columns: an irradiance check with the `check_missing_irradiance` function, a temperature and wind speed check with the `check_and_autofill_temperature_and_wind` function, an inverter check with the `check_and_autofill_inverter` function, a meter power check with the `check_and_autofill_Meter` function (potentially returning a list of dates, `missing_dates`, for further investigation in relevant work orders), and a voltage check with the `check_and_autofill_voltage` function.
  - **Returns**: `missing_dates`, the dates with missing 'Meter Power' that cannot be autofilled, which can be used in the `checkWorkorder` module to retrieve helpful work order records.
The Summary module maintains a cumulative DataFrame to provide a structured summary of important missing data across multiple sites. It also keeps track of three status variables, for production, irradiance, and inverter issues, which temporarily hold missing-data information for the specific site being processed. Once the data processing for a site is completed, the status data is appended to the summary DataFrame and cleared within the processFile module. Ultimately, the summary DataFrame is exported to a CSV file in the main module.