SDAP-390 Update NetCDF reader tool for data match-up (#178)

* SDAP-390 Update NetCDF reader tool for data match-up
* Update CHANGELOG.md
* Update cdms_reader.py
* Update README.md
* Update cdms_reader.py
* Updated README.md.

Co-authored-by: Jordan Gethers <[email protected]>
Co-authored-by: nchung <[email protected]>
apache · Aug 9, 2022 · commit 1dc62e2 · 1 parent 5c96c3d

Showing 6 changed files with 546 additions and 211 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,8 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - SDAP-372: Updated `match_spark_doms` to interface with samos_cdms endpoint 
 - SDAP-393: Included `insitu` in ingress based on the value of `insituAPI.enabled` in `values.yaml`
 - SDAP-371: Renamed `/domssubset` endpoint to `/cdmssubset`
+- SDAP-390: Updated NetCDF reader tool for data matchup and added user functionality.
 - SDAP-396: Added saildrone insitu api to matchup
 ### Changed
+
+- SDAP-390: Changed `/doms` to `/cdms` and `doms_reader.py` to `cdms_reader.py`
 - domslist endpoint points to AWS insitu instead of doms insitu
 ### Deprecated
 ### Removed
@@ -32,5 +35,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Fixed issue where satellite to satellite matchups with the same dataset don't return the expected result
 - Fixed CSV and NetCDF matchup output bug
 - Fixed NetCDF output switching latitude and longitude
+
+### Security
 - Fixed import error causing `/timeSeriesSpark` queries to fail.
-### Security
+
diff --git a/tools/cdms/README.md b/tools/cdms/README.md
@@ -0,0 +1,85 @@
+# cdms_reader.py
+The functions in `cdms_reader.py` read a CDMS netCDF file into memory, assemble a list of matches from a primary (satellite) data set and a secondary (satellite or in situ) data set, and optionally output the matches to a CSV file. Each matched pair contains one primary data record and one secondary data record.
+
+A CDMS netCDF file holds two groups (`PrimaryData` and `SecondaryData`). The `matchIDs` netCDF variable contains pairs of IDs (matches), each of which references a primary data record and a secondary data record in their respective groups. These records have a many-to-many relationship: one primary record may match many secondary records, and one secondary record may match many primary records. The `assemble_matches` function assembles the individual data records into pairs based on their `dim` group dimension IDs as paired in the `matchIDs` variable.
+
+## Requirements
+This tool was developed and tested with Python 3.9.13.
+Imported packages:
+* argparse
+* string
+* netCDF4
+* sys
+* datetime
+* csv
+* collections
+* logging
+
+
+## Functions
+### Function: `assemble_matches(filename)`
+Read a CDMS netCDF file into memory and return a list of matches from the file.
+
+#### Parameters 
+- `filename` (str): the CDMS netCDF file name.
+
+#### Returns
+- `matches` (list): List of matches. 
+
+Each list element in `matches` is a dictionary organized as follows.
+For match `m`, netCDF group `GROUP` (`'PrimaryData'` or `'SecondaryData'`), and netCDF group variable `VARIABLE`:
+
+- `matches[m][GROUP]['matchID']`: netCDF `MatchedRecords` dimension ID for the match
+- `matches[m][GROUP]['GROUPID']`: GROUP netCDF `dim` dimension ID for the record
+- `matches[m][GROUP][VARIABLE]`: variable value
+
+For example, to access the timestamps of the primary data and the secondary data of the first match in the list, along with the `MatchedRecords` dimension ID and the groups' `dim` dimension ID:
+```python
+matches[0]['PrimaryData']['time']
+matches[0]['SecondaryData']['time']
+matches[0]['PrimaryData']['matchID']
+matches[0]['PrimaryData']['PrimaryDataID']
+matches[0]['SecondaryData']['SecondaryDataID']
+```
+
+        
+### Function: `matches_to_csv(matches, csvfile)`
+Write the CDMS matches to a CSV file. Include a header of column names which are based on the group and variable names from the netCDF file.
+
+#### Parameters
+- `matches` (list): the list of dictionaries containing the CDMS matches as returned from the `assemble_matches` function.
+- `csvfile` (str): the name of the CSV output file.
+
+### Function: `get_globals(filename)`
+Write the CDMS global attributes to a text file (the input file name with `.nc` replaced by `.txt`).
+The file also includes a description of where the different
+outputs go and how best to utilize this program.
+
+#### Parameters
+- `filename` (str): the name of the original '.nc' input file
+
+### Function: `create_logs(user_option, logName)`
+Write the CDMS log information to a file. Additionally, the user may
+opt to print this information directly to stdout, or discard it entirely.
+
+#### Parameters
+- `user_option` (str): The value parsed from the `-l`/`--log` option (`args.log`):
+`'Y'` writes the log to a file, `'N'` creates no log, and `'1'` prints the log to stdout.
+- `logName` (str): The name of the log file to write to,
+assuming the user did not use the `-l` option.
+
+## Usage
+For example, to read some CDMS netCDF file called `cdms_file.nc`:
+### Command line
+The main function of `cdms_reader.py` takes one `filename` parameter (`cdms_file.nc` in this example) for the CDMS netCDF file to read and calls the `assemble_matches` function. If the `-c` option is given, the `matches_to_csv` function is called to write the matches to the CSV file `cdms_file.csv`. If the `-g` option is given, the `get_globals` function is called to write the file's global attributes, along with a short explanation of how the files can best be used, to `cdms_file.txt`. Logs are written automatically to `cdms_file.log`, but can be omitted with `-l` or sent to stdout with `-l1`. The long options `--csv`, `--meta`, and `--log` are equivalent to `-c`, `-g`, and `-l`, except that the attached form `-l1` has no `--log1` equivalent.
+```
+python cdms_reader.py cdms_file.nc -c -g
+python3 cdms_reader.py cdms_file.nc -c -g
+python3 cdms_reader.py cdms_file.nc --csv --meta
+```
+
+### Importing `assemble_matches`
+```python
+from cdms_reader import assemble_matches
+matches = assemble_matches('cdms_file.nc')
+```
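
Because the README describes a many-to-many relationship, the same `PrimaryDataID` can appear in several entries of the list returned by `assemble_matches`. The following is a minimal sketch of regrouping the pairs by primary record. It is not part of this commit: `group_by_primary` and `sample_matches` are hypothetical names, and the sample values are invented stand-ins shaped like the documented structure, so the snippet runs without a netCDF file.

```python
from collections import defaultdict


def group_by_primary(matches):
    """Map each PrimaryDataID to the SecondaryDataIDs paired with it."""
    grouped = defaultdict(list)
    for match in matches:
        primary_id = match['PrimaryData']['PrimaryDataID']
        secondary_id = match['SecondaryData']['SecondaryDataID']
        grouped[primary_id].append(secondary_id)
    return dict(grouped)


# Invented stand-in for the output of assemble_matches('cdms_file.nc').
# Only the ID keys are filled in; a real match also carries the group variables.
sample_matches = [
    {'PrimaryData': {'matchID': 0, 'PrimaryDataID': 0},
     'SecondaryData': {'matchID': 0, 'SecondaryDataID': 0}},
    {'PrimaryData': {'matchID': 1, 'PrimaryDataID': 0},
     'SecondaryData': {'matchID': 1, 'SecondaryDataID': 1}},
    {'PrimaryData': {'matchID': 2, 'PrimaryDataID': 1},
     'SecondaryData': {'matchID': 2, 'SecondaryDataID': 1}},
]

grouped = group_by_primary(sample_matches)
print(grouped)
```

With a real file the IDs come from the `matchIDs` variable rather than plain Python integers, so this assumes those values are usable as dictionary keys.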
diff --git a/tools/cdms/cdms_reader.py b/tools/cdms/cdms_reader.py
@@ -0,0 +1,250 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import string
+from netCDF4 import Dataset, num2date
+import sys
+import datetime
+import csv
+from collections import OrderedDict
+import logging
+
+#TODO: Get rid of numpy errors?
+#TODO: Update big SDAP README
+
+LOGGER =  logging.getLogger("cdms_reader")
+
+def assemble_matches(filename):
+    """
+    Read a CDMS netCDF file and return a list of matches.
+    
+    Parameters
+    ----------
+    filename : str
+        The CDMS netCDF file name.
+    
+    Returns
+    -------
+    matches : list
+        List of matches. Each list element is a dictionary.
+        For match m, netCDF group GROUP (PrimaryData or SecondaryData), and
+        group variable VARIABLE:
+        matches[m][GROUP]['matchID']: MatchedRecords dimension ID for the match
+        matches[m][GROUP]['GROUPID']: GROUP dim dimension ID for the record
+        matches[m][GROUP][VARIABLE]: variable value 
+    """
+
+    try:
+        # Open the netCDF file
+        with Dataset(filename, 'r') as cdms_nc:
+            # Check that the number of groups is consistent w/ the MatchedGroups
+            # dimension
+            assert len(cdms_nc.groups) == cdms_nc.dimensions['MatchedGroups'].size,\
+                ("Number of groups isn't the same as MatchedGroups dimension.")
+
+            matches = []
+            matched_records = cdms_nc.dimensions['MatchedRecords'].size
+
+            # Loop through the match IDs to assemble matches
+            for match in range(0, matched_records):
+                match_dict = OrderedDict()
+                # Grab the data from each platform (group) in the match
+                for group_num, group in enumerate(cdms_nc.groups):
+                    match_dict[group] = OrderedDict()
+                    match_dict[group]['matchID'] = match
+                    ID = cdms_nc.variables['matchIDs'][match][group_num]
+                    match_dict[group][group + 'ID'] = ID
+                    for var in cdms_nc.groups[group].variables.keys():
+                        match_dict[group][var] = cdms_nc.groups[group][var][ID]
+
+                    # Create a UTC datetime field from timestamp
+                    dt = num2date(match_dict[group]['time'],
+                                  cdms_nc.groups[group]['time'].units)
+                    match_dict[group]['datetime'] = dt
+                LOGGER.info(match_dict)
+                matches.append(match_dict)
+
+            return matches
+    except (OSError, IOError) as err:
+        LOGGER.exception("Error reading netCDF file " + filename)
+        raise err
+
+def matches_to_csv(matches, csvfile):
+    """
+    Write the CDMS matches to a CSV file. Include a header of column names
+    which are based on the group and variable names from the netCDF file.
+    
+    Parameters
+    ----------
+    matches : list
+        The list of dictionaries containing the CDMS matches as returned from
+        assemble_matches.      
+    csvfile : str
+        The name of the CSV output file.
+    """
+    # Create a header for the CSV. Column names are GROUP_VARIABLE or
+    # GROUP_GROUPID.
+    header = []
+    for key, value in matches[0].items():
+        for otherkey in value.keys():
+            header.append(key + "_" + otherkey)
+
+    try:
+        # Write the CSV file
+        with open(csvfile, 'w') as output_file:
+            csv_writer = csv.writer(output_file)
+            csv_writer.writerow(header)
+            for match in matches:
+                row = []
+                for group, data in match.items():
+                    for value in data.values():
+                        row.append(value)
+                csv_writer.writerow(row)
+    except (OSError, IOError) as err:
+        LOGGER.exception("Error writing CSV file " + csvfile)
+        raise err
+
+def get_globals(filename):
+    """
+    Write the CDMS global attributes to a text file. Additionally,
+    within the file there will be a description of where all the different
+    outputs go and how best to utilize this program.
+    
+    Parameters
+    ----------      
+    filename : str
+        The name of the original '.nc' input file.
+    
+    """
+    x0 = "README / cdms_reader.py Program Use and Description:\n"
+    x1 = "\nThe cdms_reader.py program reads a CDMS netCDF (a NETCDF file with a matchIDs variable)\n"
+    x2 = "file into memory, assembles a list of matches of primary and secondary data\n"
+    x3 = "and optionally\n"
+    x4 = "output the matches to a CSV file. Each matched pair contains one primary\n"
+    x5 = "data record and one secondary data record.\n"
+    x6 = "\nBelow, this file will list the global attributes of the .nc (NETCDF) file.\n"
+    x7 = "If you wish to see a full dump of the data from the .nc file,\n"
+    x8 = "please utilize the ncdump command from NETCDF (or look at the CSV file).\n"
+    try:
+        with Dataset(filename, "r", format="NETCDF4") as ncFile:
+            txtName = filename.replace(".nc", ".txt")
+            with open(txtName, "w") as txt:
+                txt.write(x0 + x1 +x2 +x3 + x4 + x5 + x6 + x7 + x8)
+                txt.write("\nGlobal Attributes:\n")
+                for x in ncFile.ncattrs():
+                    txt.write(f'\t :{x} = "{ncFile.getncattr(x)}" ;\n')
+
+
+    except (OSError, IOError) as err:
+        LOGGER.exception("Error reading netCDF file " + filename)
+        print("Error reading file!")
+        raise err
+
+def create_logs(user_option, logName):
+    """
+    Write the CDMS log information to a file. Additionally, the user may
+    opt to print this information directly to stdout, or discard it entirely.
+    
+    Parameters
+    ----------      
+    user_option : str
+        The value parsed from the -l/--log option (args.log):
+        'Y' logs to a file, 'N' creates no log, '1' logs to stdout.
+    logName : str
+        The name of the log file we wish to write to,
+        assuming the user did not use the -l option.
+    """
+    if user_option == 'N':
+        print("** Note: No log was created **")
+
+
+    elif user_option == '1':
+        #prints the log contents to stdout
+        logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
+                        level=logging.INFO,
+                        datefmt='%Y-%m-%d %H:%M:%S',
+                        handlers=[
+                            logging.StreamHandler(sys.stdout)
+                            ])
+
+    else:
+        #prints log to a .log file
+        logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
+                        level=logging.INFO,
+                        datefmt='%Y-%m-%d %H:%M:%S',
+                        handlers=[
+                            logging.FileHandler(logName)
+                            ])
+        if user_option != 'Y':
+            print(f"** Bad usage of log option. Log will print to {logName} **")
+
+
+
+
+
+if __name__ == '__main__':
+    """
+    Execution:
+        python cdms_reader.py filename
+        OR
+        python3 cdms_reader.py filename 
+        OR
+        python3 cdms_reader.py filename -c -g 
+        OR
+        python3 cdms_reader.py filename --csv --meta
+
+    Note (For Help Try):
+            python3 cdms_reader.py -h
+            OR
+            python3 cdms_reader.py --help
+
+    """
+
+    u0 = '\n%(prog)s -h OR --help \n'
+    u1 = '%(prog)s filename -c -g\n%(prog)s filename --csv --meta\n'
+    u2 ='Use -l OR -l1 to modify destination of logs'
+    p = argparse.ArgumentParser(usage= u0 + u1 + u2)
+
+    #below block is to customize user options
+    p.add_argument('filename', help='CDMS netCDF file to read')
+    p.add_argument('-c', '--csv', nargs='?', const= 'Y', default='N',
+     help='Use -c or --csv to retrieve CSV output')
+    p.add_argument('-g', '--meta', nargs='?', const='Y', default='N',
+     help='Use -g or --meta to retrieve global attributes / metadata')
+    p.add_argument('-l', '--log', nargs='?', const='N', default='Y',
+     help='Use -l or --log to AVOID creating log files, OR use -l1 to print to stdout/console') 
+
+    #arguments are processed by the next line
+    args = p.parse_args()
+
+    logName = args.filename.replace(".nc", ".log")
+    create_logs(args.log, logName)
+
+    cdms_matches = assemble_matches(args.filename)
+
+    if args.csv == 'Y' :
+        matches_to_csv(cdms_matches, args.filename.replace(".nc",".csv"))
+
+    if args.meta == 'Y' :
+        get_globals(args.filename)
+
+
+
+
+
+
+
+
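
The `-l`/`--log` option above depends on how `argparse` treats `nargs='?'` together with `const` and `default`. The following is a minimal sketch of that behavior, not the script itself: `build_parser` is a hypothetical helper that reproduces only the `filename` and `-l` arguments with the same settings as in the diff.

```python
import argparse


def build_parser():
    """Hypothetical parser with the same -l/--log settings as cdms_reader.py."""
    p = argparse.ArgumentParser()
    p.add_argument('filename', help='CDMS netCDF file to read')
    # nargs='?': the option may appear with zero or one value.
    # default is used when the option is absent, const when it has no value.
    p.add_argument('-l', '--log', nargs='?', const='N', default='Y')
    return p


parser = build_parser()

# Option absent: default 'Y', so create_logs writes to the .log file.
no_flag = parser.parse_args(['cdms_file.nc']).log

# Bare -l or --log: const 'N', so no log is created.
bare_short = parser.parse_args(['cdms_file.nc', '-l']).log
bare_long = parser.parse_args(['cdms_file.nc', '--log']).log

# Attached value -l1: the value '1', so the log goes to stdout.
attached = parser.parse_args(['cdms_file.nc', '-l1']).log

print(no_flag, bare_short, bare_long, attached)
```

Any other value passed to `-l` falls through to the final branch of `create_logs`, which logs to the file and prints the "Bad usage" note.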