SDAP-390 Update NetCDF reader tool for data match-up (#178)
* SDAP -390 Update NetCDF reader tool for data match-up

* Update CHANGELOG.md

* Update cdms_reader.py

* Update README.md

* Update cdms_reader.py

* Updated README.md.

Co-authored-by: Jordan Gethers <[email protected]>
Co-authored-by: nchung <[email protected]>
3 people authored Aug 9, 2022
1 parent 5c96c3d commit 1dc62e2
Showing 6 changed files with 546 additions and 211 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -12,8 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- SDAP-372: Updated `match_spark_doms` to interface with samos_cdms endpoint
- SDAP-393: Included `insitu` in ingress based on the value of `insituAPI.enabled` in `values.yaml`
- SDAP-371: Renamed `/domssubset` endpoint to `/cdmssubset`
- SDAP-390: Updated NetCDF reader tool for data matchup and added user functionality.
- SDAP-396: Added saildrone insitu api to matchup
### Changed

- SDAP-390: Changed `/doms` to `/cdms` and `doms_reader.py` to `cdms_reader.py`
- domslist endpoint points to AWS insitu instead of doms insitu
### Deprecated
### Removed
@@ -32,5 +35,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fixed issue where satellite to satellite matchups with the same dataset don't return the expected result
- Fixed CSV and NetCDF matchup output bug
- Fixed NetCDF output switching latitude and longitude

### Security
- Fixed import error causing `/timeSeriesSpark` queries to fail.
### Security

85 changes: 85 additions & 0 deletions tools/cdms/README.md
@@ -0,0 +1,85 @@
# CDMS_reader.py
The functions in `cdms_reader.py` read a CDMS netCDF file into memory, assemble a list of matches from a primary (satellite) and secondary (satellite or in situ) data set, and optionally output the matches to a CSV file. Each matched pair contains one primary data record and one secondary data record.

A CDMS netCDF file holds two groups (`PrimaryData` and `SecondaryData`). The `matchIDs` netCDF variable contains pairs of IDs (matches), each of which references a primary data record and a secondary data record in their respective groups. These records have a many-to-many relationship: one primary record may match many secondary records, and one secondary record may match many primary records. The `assemble_matches` function assembles the individual data records into pairs based on their `dim` group dimension IDs as paired in the `matchIDs` variable.
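
The pairing can be pictured with plain Python lists. The sketch below does not read a netCDF file; the `primary`, `secondary`, and `match_ids` values and the `pair_records` helper are hypothetical stand-ins, used only to illustrate the many-to-many relationship.

```python
# Hypothetical stand-in data; a real CDMS file stores these in netCDF groups.
primary = [{"time": 100}, {"time": 200}]      # records in PrimaryData
secondary = [{"time": 110}, {"time": 190}]    # records in SecondaryData

# Each row pairs a primary record index with a secondary record index.
# Primary record 0 and secondary record 1 each take part in two matches.
match_ids = [(0, 0), (0, 1), (1, 1)]


def pair_records(match_ids, primary, secondary):
    """Assemble one dictionary per (primary, secondary) ID pair."""
    return [
        {"PrimaryData": primary[p], "SecondaryData": secondary[s]}
        for p, s in match_ids
    ]


pairs = pair_records(match_ids, primary, secondary)
```

Three matches come out of two primary and two secondary records because a record can be referenced by more than one row of IDs.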

## Requirements
This tool was developed and tested with Python 3.9.13.
Imported packages:
* argparse
* string
* netCDF4
* sys
* datetime
* csv
* collections
* logging


## Functions
### Function: `assemble_matches(filename)`
Read a CDMS netCDF file into memory and return a list of matches from the file.

#### Parameters
- `filename` (str): the CDMS netCDF file name.

#### Returns
- `matches` (list): List of matches.

Each list element in `matches` is a dictionary organized as follows:
For match `m`, netCDF group `GROUP` ('PrimaryData' or 'SecondaryData'), and netCDF group variable `VARIABLE`:

- `matches[m][GROUP]['matchID']`: netCDF `MatchedRecords` dimension ID for the match
- `matches[m][GROUP]['GROUPID']`: GROUP netCDF `dim` dimension ID for the record
- `matches[m][GROUP][VARIABLE]`: variable value

For example, to access the timestamps of the primary data and the secondary data of the first match in the list, along with the `MatchedRecords` dimension ID and the groups' `dim` dimension ID:
```python
matches[0]['PrimaryData']['time']
matches[0]['SecondaryData']['time']
matches[0]['PrimaryData']['matchID']
matches[0]['PrimaryData']['PrimaryDataID']
matches[0]['SecondaryData']['SecondaryDataID']
```
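
As a further illustration, the same keys can be used to compare the two sides of every match. This is a minimal sketch: the `matches` list below is hypothetical sample data laid out like the structure described above, and `time_offsets` is an illustrative helper, not part of `cdms_reader.py`. The time units depend on the `units` attribute of the file's `time` variable.

```python
# Hypothetical sample data laid out like the list assemble_matches returns.
matches = [
    {"PrimaryData": {"matchID": 0, "PrimaryDataID": 0, "time": 1000.0},
     "SecondaryData": {"matchID": 0, "SecondaryDataID": 4, "time": 1030.0}},
    {"PrimaryData": {"matchID": 1, "PrimaryDataID": 0, "time": 1000.0},
     "SecondaryData": {"matchID": 1, "SecondaryDataID": 7, "time": 985.0}},
]


def time_offsets(matches):
    """Return secondary-minus-primary time differences, one per match."""
    return [
        m["SecondaryData"]["time"] - m["PrimaryData"]["time"]
        for m in matches
    ]


offsets = time_offsets(matches)
```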

### Function: `matches_to_csv(matches, csvfile)`
Write the CDMS matches to a CSV file. Include a header of column names which are based on the group and variable names from the netCDF file.

#### Parameters:
- `matches` (list): the list of dictionaries containing the CDMS matches as returned from the `assemble_matches` function.
- `csvfile` (str): the name of the CSV output file.
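
The column naming can be sketched as follows. The example builds `GROUP_VARIABLE` column names from the first match and writes the rows to an in-memory buffer instead of a file; the sample `matches` list and the `build_header` helper are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data laid out like the list assemble_matches returns.
matches = [
    {"PrimaryData": {"matchID": 0, "PrimaryDataID": 0, "time": 1000.0},
     "SecondaryData": {"matchID": 0, "SecondaryDataID": 4, "time": 1030.0}},
]


def build_header(matches):
    """Build column names of the form GROUP_KEY from the first match."""
    return [
        f"{group}_{name}"
        for group, record in matches[0].items()
        for name in record
    ]


# Write the header and one row per match to an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(build_header(matches))
for match in matches:
    writer.writerow([v for record in match.values() for v in record.values()])
csv_text = buffer.getvalue()
```

Because the header is taken from the first match only, this approach assumes every match carries the same keys in the same order.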

### Function: `get_globals(filename)`
Write the CDMS global attributes to a text file. The file also includes
a description of where the tool's different outputs go and how best to
use the program. The text file takes the input file's name, with `.nc`
replaced by `.txt`.

#### Parameters:
- `filename` (str): the name of the original '.nc' input file

### Function: `create_logs(user_option, logName)`
Write the CDMS log information to a file. Additionally, the user may
opt to print this information directly to stdout, or discard it entirely.

#### Parameters
- `user_option` (str): the value parsed from the `-l`/`--log` argument
(`args.log`): `'Y'` (the default) writes the log to a file, `'N'` creates
no log, and `'1'` prints the log to stdout.
- `logName` (str): the name of the log file to write to when the log is
written to a file.
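
The three `user_option` values follow from how `argparse` treats an optional argument declared with `nargs='?'`. The sketch below rebuilds only the `filename` and `-l`/`--log` arguments with the same settings the script uses; `build_parser` is an illustrative helper name, not a function in `cdms_reader.py`.

```python
import argparse


def build_parser():
    """Minimal parser holding only the filename and log arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("filename")
    # default applies when -l is absent; const applies when -l has no value.
    parser.add_argument("-l", "--log", nargs="?", const="N", default="Y")
    return parser


parser = build_parser()
default_log = parser.parse_args(["cdms_file.nc"]).log         # option omitted
no_log = parser.parse_args(["cdms_file.nc", "-l"]).log        # bare -l
stdout_log = parser.parse_args(["cdms_file.nc", "-l1"]).log   # -l with value 1
```

Placing the option after the file name matters here: with `nargs='?'`, a bare `-l` written before the file name would consume the file name as its value.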

## Usage
For example, to read some CDMS netCDF file called `cdms_file.nc`:
### Command line
The main function for `cdms_reader.py` takes one `filename` parameter (the `cdms_file.nc` argument in this example) for the CDMS netCDF file to read and calls the `assemble_matches` function. If the `-c` option is given, the `matches_to_csv` function is called to write the matches to a CSV file, `cdms_file.csv`. If the `-g` option is given, the `get_globals` function is called to write the file's global attributes, along with a short explanation of how the output files can best be used, to `cdms_file.txt`. Logs are written automatically to `cdms_file.log`, but can be omitted with `-l` or printed to stdout with `-l1`. The long options `--csv`, `--meta`, and `--log` are equivalent to `-c`, `-g`, and `-l`, except that `--log` does not accept an attached value the way the recommended short form does (`-l1` works, `--log1` does not).
```
python cdms_reader.py cdms_file.nc -c -g
python3 cdms_reader.py cdms_file.nc -c -g
python3 cdms_reader.py cdms_file.nc --csv --meta
```
### Importing `assemble_matches`
```python
from cdms_reader import assemble_matches
matches = assemble_matches('cdms_file.nc')
```
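
Once the list is in memory, the many-to-many relationship can be explored with ordinary Python. The sketch below groups secondary record IDs under the primary record they were matched to. It runs on hypothetical sample data instead of a real file, and `secondary_ids_by_primary` is an illustrative helper that is not part of `cdms_reader.py`.

```python
from collections import defaultdict

# Hypothetical sample data laid out like the list assemble_matches returns.
matches = [
    {"PrimaryData": {"PrimaryDataID": 0}, "SecondaryData": {"SecondaryDataID": 4}},
    {"PrimaryData": {"PrimaryDataID": 0}, "SecondaryData": {"SecondaryDataID": 7}},
    {"PrimaryData": {"PrimaryDataID": 2}, "SecondaryData": {"SecondaryDataID": 7}},
]


def secondary_ids_by_primary(matches):
    """Map each primary record ID to the secondary IDs matched to it."""
    grouped = defaultdict(list)
    for match in matches:
        primary_id = match["PrimaryData"]["PrimaryDataID"]
        grouped[primary_id].append(match["SecondaryData"]["SecondaryDataID"])
    return dict(grouped)


grouped = secondary_ids_by_primary(matches)
```

With a real file, the IDs are values read from the netCDF `matchIDs` variable rather than plain Python integers, so converting them with `int()` before using them as dictionary keys may be worthwhile.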
250 changes: 250 additions & 0 deletions tools/cdms/cdms_reader.py
@@ -0,0 +1,250 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import string
from netCDF4 import Dataset, num2date
import sys
import datetime
import csv
from collections import OrderedDict
import logging

#TODO: Get rid of numpy errors?
#TODO: Update big SDAP README

LOGGER = logging.getLogger("cdms_reader")

def assemble_matches(filename):
    """
    Read a CDMS netCDF file and return a list of matches.
    Parameters
    ----------
    filename : str
        The CDMS netCDF file name.
    Returns
    -------
    matches : list
        List of matches. Each list element is a dictionary.
        For match m, netCDF group GROUP (PrimaryData or SecondaryData), and
        group variable VARIABLE:
        matches[m][GROUP]['matchID']: MatchedRecords dimension ID for the match
        matches[m][GROUP]['GROUPID']: GROUP dim dimension ID for the record
        matches[m][GROUP][VARIABLE]: variable value
    """

    try:
        # Open the netCDF file
        with Dataset(filename, 'r') as cdms_nc:
            # Check that the number of groups is consistent w/ the MatchedGroups
            # dimension
            assert len(cdms_nc.groups) == cdms_nc.dimensions['MatchedGroups'].size,\
                ("Number of groups isn't the same as MatchedGroups dimension.")

            matches = []
            matched_records = cdms_nc.dimensions['MatchedRecords'].size

            # Loop through the match IDs to assemble matches
            for match in range(0, matched_records):
                match_dict = OrderedDict()
                # Grab the data from each platform (group) in the match
                for group_num, group in enumerate(cdms_nc.groups):
                    match_dict[group] = OrderedDict()
                    match_dict[group]['matchID'] = match
                    ID = cdms_nc.variables['matchIDs'][match][group_num]
                    match_dict[group][group + 'ID'] = ID
                    for var in cdms_nc.groups[group].variables.keys():
                        match_dict[group][var] = cdms_nc.groups[group][var][ID]

                    # Create a UTC datetime field from timestamp
                    dt = num2date(match_dict[group]['time'],
                                  cdms_nc.groups[group]['time'].units)
                    match_dict[group]['datetime'] = dt
                LOGGER.info(match_dict)
                matches.append(match_dict)

            return matches
    except (OSError, IOError) as err:
        LOGGER.exception("Error reading netCDF file " + filename)
        raise err

def matches_to_csv(matches, csvfile):
    """
    Write the CDMS matches to a CSV file. Include a header of column names
    which are based on the group and variable names from the netCDF file.
    Parameters
    ----------
    matches : list
        The list of dictionaries containing the CDMS matches as returned from
        assemble_matches.
    csvfile : str
        The name of the CSV output file.
    """
    # Create a header for the CSV. Column names are GROUP_VARIABLE or
    # GROUP_GROUPID.
    header = []
    for key, value in matches[0].items():
        for otherkey in value.keys():
            header.append(key + "_" + otherkey)

    try:
        # Write the CSV file. newline='' lets the csv module control line
        # endings, which avoids blank rows on some platforms.
        with open(csvfile, 'w', newline='') as output_file:
            csv_writer = csv.writer(output_file)
            csv_writer.writerow(header)
            for match in matches:
                row = []
                for group, data in match.items():
                    for value in data.values():
                        row.append(value)
                csv_writer.writerow(row)
    except (OSError, IOError) as err:
        LOGGER.exception("Error writing CSV file " + csvfile)
        raise err

def get_globals(filename):
    """
    Write the CDMS global attributes to a text file. Additionally,
    within the file there will be a description of where all the different
    outputs go and how to best utilize this program.
    Parameters
    ----------
    filename : str
        The name of the original '.nc' input file.
    """
    x0 = "README / cdms_reader.py Program Use and Description:\n"
    x1 = "\nThe cdms_reader.py program reads a CDMS netCDF (a NETCDF file with a matchIDs variable)\n"
    x2 = "file into memory, assembles a list of matches of primary and secondary data\n"
    x3 = "and optionally\n"
    x4 = "outputs the matches to a CSV file. Each matched pair contains one primary\n"
    x5 = "data record and one secondary data record.\n"
    x6 = "\nBelow, this file will list the global attributes of the .nc (NETCDF) file.\n"
    x7 = "If you wish to see a full dump of the data from the .nc file,\n"
    x8 = "please utilize the ncdump command from NETCDF (or look at the CSV file).\n"
    try:
        with Dataset(filename, "r", format="NETCDF4") as ncFile:
            txtName = filename.replace(".nc", ".txt")
            with open(txtName, "w") as txt:
                txt.write(x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
                # End the heading line so the first attribute starts on its own line
                txt.write("\nGlobal Attributes:\n")
                for x in ncFile.ncattrs():
                    txt.write(f'\t :{x} = "{ncFile.getncattr(x)}" ;\n')

    except (OSError, IOError) as err:
        LOGGER.exception("Error reading netCDF file " + filename)
        print("Error reading file!")
        raise err

def create_logs(user_option, logName):
    """
    Write the CDMS log information to a file. Additionally, the user may
    opt to print this information directly to stdout, or discard it entirely.
    Parameters
    ----------
    user_option : str
        The value of args.log, which encodes the log option
        the user selected ('Y', 'N', or '1').
    logName : str
        The name of the log file we wish to write to,
        assuming the user did not use the -l option.
    """
    if user_option == 'N':
        print("** Note: No log was created **")

    elif user_option == '1':
        # Print the log contents to stdout
        logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
                            level=logging.INFO,
                            datefmt='%Y-%m-%d %H:%M:%S',
                            handlers=[
                                logging.StreamHandler(sys.stdout)
                            ])

    else:
        # Write the log to a .log file
        logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
                            level=logging.INFO,
                            datefmt='%Y-%m-%d %H:%M:%S',
                            handlers=[
                                logging.FileHandler(logName)
                            ])
        # Any value other than 'Y' reaching this branch is an unrecognized option
        if user_option != 'Y':
            print(f"** Bad usage of log option. Log will print to {logName} **")





if __name__ == '__main__':
    """
    Execution:
        python cdms_reader.py filename
        OR
        python3 cdms_reader.py filename
        OR
        python3 cdms_reader.py filename -c -g
        OR
        python3 cdms_reader.py filename --csv --meta
    Note (For Help Try):
        python3 cdms_reader.py -h
        OR
        python3 cdms_reader.py --help
    """

    u0 = '\n%(prog)s -h OR --help \n'
    u1 = '%(prog)s filename -c -g\n%(prog)s filename --csv --meta\n'
    u2 = 'Use -l OR -l1 to modify destination of logs'
    p = argparse.ArgumentParser(usage=u0 + u1 + u2)

    # The block below customizes the user options
    p.add_argument('filename', help='CDMS netCDF file to read')
    p.add_argument('-c', '--csv', nargs='?', const='Y', default='N',
                   help='Use -c or --csv to retrieve CSV output')
    p.add_argument('-g', '--meta', nargs='?', const='Y', default='N',
                   help='Use -g or --meta to retrieve global attributes / metadata')
    p.add_argument('-l', '--log', nargs='?', const='N', default='Y',
                   help='Use -l or --log to AVOID creating log files, OR use -l1 to print to stdout/console')

    # Arguments are processed by the next line
    args = p.parse_args()

    logName = args.filename.replace(".nc", ".log")
    create_logs(args.log, logName)

    cdms_matches = assemble_matches(args.filename)

    if args.csv == 'Y':
        matches_to_csv(cdms_matches, args.filename.replace(".nc", ".csv"))

    if args.meta == 'Y':
        get_globals(args.filename)







