Aerosol zonal plots #291

justin-richling · 2024-03-19T17:32:48Z

This PR will add calculations for zonal aerosol plots, and add new functionality for colorbar options.

Files changed:

adf_variable_defaults.yaml

Add more aerosol variables and colorbar options
Add constants section

adf_diag.py

Add automatic time series file generation for PMID and T if certain aerosol variables are requested
Change how derived quantities are created; add flexibility in number of constituents a variable needs

plotting_functions.py

Add compatibility for non-linear increments for color maps/ranges
Specify difference plot colorbar options from case colorbar options

regrid_and_vert_interp.py:

Fix small variable name typo

Add more aerosol info, aerosol calc list, and constants section. Also add subsection for difference colorbar options and flag for non_linear scale plotting/color scale

This moves away from the nco way of deriving variables and moves it to xarray. This is mostly so that any derived variable that has more than two constituents can be processed. The nco way only worked for two constituents. Also add check for any aerosol variables that will require rho (density of dry air) and create PMID and T time series files (if not already requested). Additionally, add section in derived variables to calculate aerosol values.

brianpm · 2024-03-21T21:33:07Z

lib/adf_diag.py

+                    variables[i] = ds[constit_list[0]].dims
+
+                #Get coordinate values from the first constituent file
+                coords={'lat': ds[constit_list[0]].lat.values,


Could this be changed to
coords = ds.coords

That way we get the metadata in the coordinate variables, and we don't have to make assumptions about what coordinates are present. I'm thinking of a day when we can ingest data on unstructured grids.

I tried this way, but when it gets to the aerosols (vertical levels) it seems to have a mismatch between coordinates and dimensions:

ValueError: coordinate ilev has dimensions ('ilev',), but these are not a subset of the DataArray dimensions ('time', 'lev', 'lat', 'lon')

coords = ds.coords

Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0 * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.3 357.5 358.8 * lev (lev) float64 0.008135 0.01834 0.03482 ... 983.2 991.2 997.5 * ilev (ilev) float64 0.004259 0.01201 0.02467 ... 987.4 995.1 1e+03 * time (time) object 1997-02-01 00:00:00 ... 2001-01-01 00:00:00

As per @nusbaume suggestion for setting the dims once:

const_dims = ds_constits[constit_list[0]].dims

('time', 'lev', 'lat', 'lon')

So I just grabbed the dimensions and coordinates from the first constituent dataArray

#Grab subset for first constituent for dimensions and coordinates ds_const0 = ds[constit_list[0]]

coords = ds_const0.coords KeysView(Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0 * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.3 357.5 358.8 * lev (lev) float64 0.008135 0.01834 0.03482 ... 983.2 991.2 997.5 * time (time) object 1997-02-01 00:00:00 ... 2001-01-01 00:00:00)

const_dims = ds_const0.dims const_dims ('time', 'lev', 'lat', 'lon')

It seemed to work, let me know if there is anything that stands out about going this way.

brianpm · 2024-03-21T23:34:01Z

lib/adf_diag.py

+                    data_arrays.append(da)
+
+                #NOTE: this will need to be changed when derived equations are more than sums! - JR
+                ds[var] = sum(data_arrays)


Doesn't this break RESTOM which is a difference?

Agreed, for now we could just have a specific check for RESTOM here, e.g.:

Suggested change

ds[var] = sum(data_arrays)

if var == "RESTOM":

#Assume FSNT comes before FLNT:

ds[var] = data_arrays[0] - data_arrays[1]

else:

ds[var] = sum(data_arrays)

brianpm · 2024-03-22T00:21:13Z

lib/adf_diag.py

+
+                variables = {}
+                for i in constit_list:
+                    variables[i] = ds[constit_list[0]].dims


I'm a little confused about what this does. Looks like the keys of variables will be the names of the constituents of the derived quantity, and the values will all be the same tuple of the first constituent's dimensions.

variables then gets used below to loop through the dimensions.

I wonder if you get the same result if by replacing lines 1071-1092 with this:

s = 0 for v in constit_list: s += ds[v] s.name = var

and then either ds[var] = s, or just write s.to_netcdf(...) when it is ready for output (see my comment below about the output file).

I might vote to remove this section of code entirely, and instead just extract the dims once:

const_dims = ds[constit_list[0]].dims

as they should (?) be the same between all of the constituents.

Then down below instead of looping over the variables quantity for the constituent name just loop over constit_list again. This will help reduce the total number of variables in this function, and the total number of iterations it has to perform.

brianpm · 2024-03-22T00:23:29Z

lib/adf_diag.py

+                    if var == "SO4":
+                        ds[var] = ds[var]*(96/115)
+
+                ds.to_netcdf(derived_file, unlimited_dims='time', mode='w')


I think that ds has the constituents and the derived values?

Do we want to store all of them in the output file? I think in the version using ncap2 that was necessary because of how NCO works, but since we already have the constituents in files, I think this would be replicating data.

Agreed, I would remove all of the variables in ds except for the derived variable just before you run the to_netcdf command.

brianpm · 2024-03-22T22:06:12Z

@justin-richling and @nusbaume

Following up on an email I sent you about this... I wonder if we should provide an improved approach for derived variables. If we think bringing in numexpr as a new dependency is a reasonable approach, I think I've got a potential drop-in replacement, as follows:

provide a new "variable_default" entry called formula that would define the derived variable in terms of the derivable_from list.
Modify the vars_to_derive variable like this:

            # Loop over CAM history variables:
            list_of_commands = []
            vars_to_derive = {}
            # create copy of var list that can be modified for derivable variables
            diag_var_list = self.diag_var_list
            for var in diag_var_list:
                if var not in hist_file_var_list:
                    vres = res.get(var, {})
                    if ("derivable_from" in vres) and ("formula" in vres):
                        constit_list = vres["derivable_from"]
                        expression = vres["formula"]
                        for constit in constit_list:
                            if constit not in diag_var_list:
                                diag_var_list.append(constit)
                        vars_to_derive[var] = [constit_list, expression]
                        continue
                    else:
                        msg = f"WARNING: {var} is not in the file {hist_files[0]}."
                        msg += " No time series will be generated."
                        print(msg)
                        continue

replace derive_variables with the following (UNTESTED):

import numexpr as nex


def derive(expression, **inputs):
    """Derive output using expression, with input data suppled in inputs.

    expression : str
        A string describing the expression to evaluate, can contain
        numbers, variables, and basic operators. 

    inputs : dict
        A dictionary with keys being the variables used in expression
        and the values being the data (either xr.DataArray or numpy arrays)

    return
        The result of expression

    NOTES
    -----
        Example: expression = "PRECC + PRECL" 
                 inputs = {"PRECC": arr1, "PRECL": arr2}
    """
    return nex.evaluate(expression, inputs)  



def derive_xrWrap(expression, *args, **kwargs):

    # maybe this isn't kosher, but this allows the user
    # to either provide named arguments 
    # or a dictionary without using **dict
    if (len(args) == 1) and (isinstance(args[0], dict)):
        kwargs = args[0]

    template = kwargs[[*kwargs][0]] # whatever is the first input
    # pop the things that are not part of expression inputs
    dims = kwargs.pop("dims", (template.dims if hasattr(template, 'dims') else None))
    coords = kwargs.pop("coords", (template.coords if hasattr(template, 'coords') else None))
    attrs = kwargs.pop("attrs", template.attrs)
    ans = derive(expression, kwargs)
    return xr.DataArray(ans, dims=dims, coords=coords, attrs=attrs)



def derive_variables(self, vars_to_derive=None, ts_dir=None, overwrite=None):
    """
    Derive variables according to expression and inputs provided in vars_to_derive dict.

    Caution: this method _may still_ assume that there will be one time series file per variable

    If the file for the derived variable exists, the kwarg `overwrite` determines
    whether to overwrite the file (true) or exit with a warning message.
    """

    for var in vars_to_derive:
        constit_list = vars_to_derive[var][0]
        expression = vars_to_derive[var][1]
        # get constituent files:
        constit_files = {}
        for c in constit_list:
            if glob.glob(os.path.join(ts_dir, f"*.{c}.*")):
                constit_files[c] = sorted(glob.glob(os.path.join(ts_dir, f"*.{c}.*")))
            else:
                ermsg = f"{c} files were not present; {var} cannot be calculated."
                ermsg += f" Please remove {var} from diag_var_list or find the relevant CAM files."
                raise FileNotFoundError(ermsg)
        # from the constit_files, get the data
        constit_data = {}
        for c in constit_files:
            if len(constit_files[c] > 1):
                constit_data[c] = xr.open_mfdataset(constit_files[c])[c]
            else:
                constit_data[c] = xr.open_dataset(constit_files[c][0])[c]
        # output file specification:
        # TODO: this might not work for multi-file cases
        derived_file = constit_files[[*constit_files][0]].replace(constit_list[0], var)
        if Path(derived_file).is_file():
            if overwrite:
                Path(derived_file).unlink()
            else:
                print(
                    f"[{__name__}] Warning: {var} file was found and overwrite is False. Will use existing file."
                )
                continue
        # derive the variable using expression:
        result = derive_xrWrap(expression, **constit_data)  # defaults to copying metadata from 1st constituent
        # TODO: provide a way to send updated metadata to derive_xrWrap, maybe from additional variable_defaults info
        # Save output:
        if 'time' in result.dims:
            udim = 'time'
        else:
            udim = None
        result.to_netcdf(derived_file, unlimited_dims=udim, mode='w')

brianpm · 2024-03-25T16:22:54Z

I thought about this PR some more. I think given my previous suggestions, it should be acceptable to have derived variables that could be derived from derived variables themselves. For example, say PRECT is derived from PRECC + PRECL. Then the convective fraction of precipitation could be derived as PRECCFRAC = PRECC / PRECT. This seems fine, but requires doing the derived variables in a correct order. I believe that we could just reorder the vars_to_derive using the following code:

def get_derive_order(vres):
    """provides a valid order for deriving variables
       such that derived quantities that are dependencies
       for other quantities are earlier in the output list

        parameters
        ----------
        vres : dict
            dict of each derivable variable with values that are the dependencies (as list)

        return
        ------
        list
            The list of derived variables in a valid order for derivations to succeed
    """
    # use deque for efficient inserting/appending
    from collections import deque
    # Get the full set of values
    derivable = set(vres.keys())  # This also serves as our definition of having dependencies

    q = deque()

    while len(q) < len(derivable):  # every derivable needs to be in q
        for v in derivable:  # Scan through all derivables
            if v not in q:
                q.append(v)
                vindex = q.index(v)  # we might need to know this position
            if v in vres:
                vdepends = vres[v]
                if not isinstance(vdepends, list):
                    vdepends = [vdepends] # in case it's just a single dependency given as a str
                vdtest = [vd for vd in vdepends if vd in vres]  # list of dependencies with dependencies
                if not vdtest:
                    continue # v in q and no dependencies have dependencies
                else:
                    vdd = [vd for vd in enumerate(vdtest) if vd in q] # true if already in q
                    if any(vdd):
                        ndx = vdd.index(True)  # index of first True value
                        q.insert(vindex, vdd[ndx]) # put ahead of v in q
                        break  # leave the for-loop, to re-start the scan
            else:
                print(f"ERROR: encountered {v} which is not listed as a derived quantity")
    return list(q)  # convert deque to list

And that could be used right before calling derive_variables:

if vars_to_derive:
    var_order = get_derive_order(vars_to_derive)  # already a dict as per my other suggestion
    vars_to_derive_ordered = {k: vars_to_derive[k] for k in var_order}
    self.derive_variables(vars_to_derive=vars_to_derive_ordered, ts_dir=ts_dir[case_idx])

brianpm · 2024-03-25T16:28:17Z

@justin-richling -- Sorry for all this code, I think I got carried away! If you and/or @nusbaume would prefer to move ahead with your PR, and then I can put my suggested changes in separately, that would be fine with me. Let me know.

nusbaume

Good work! Just have some minor change requests and a few suggestions that will hopefully help us maintain current functionality while waiting for the new derived variable infrastructure.

lib/adf_diag.py

nusbaume · 2024-04-06T20:55:07Z

lib/adf_diag.py

+
+                variables = {}
+                for i in constit_list:
+                    variables[i] = ds[constit_list[0]].dims


I might vote to remove this section of code entirely, and instead just extract the dims once:

const_dims = ds[constit_list[0]].dims

as they should (?) be the same between all of the constituents.

Then down below instead of looping over the variables quantity for the constituent name just loop over constit_list again. This will help reduce the total number of variables in this function, and the total number of iterations it has to perform.

nusbaume · 2024-04-06T21:05:00Z

lib/adf_diag.py

+                    data_arrays.append(da)
+
+                #NOTE: this will need to be changed when derived equations are more than sums! - JR
+                ds[var] = sum(data_arrays)


Agreed, for now we could just have a specific check for RESTOM here, e.g.:

Suggested change

ds[var] = sum(data_arrays)

if var == "RESTOM":

#Assume FSNT comes before FLNT:

ds[var] = data_arrays[0] - data_arrays[1]

else:

ds[var] = sum(data_arrays)

lib/adf_diag.py

nusbaume · 2024-04-06T21:12:26Z

lib/adf_diag.py

+                    if var == "SO4":
+                        ds[var] = ds[var]*(96/115)
+
+                ds.to_netcdf(derived_file, unlimited_dims='time', mode='w')


Agreed, I would remove all of the variables in ds except for the derived variable just before you run the to_netcdf command.

lib/adf_variable_defaults.yaml

Clean up derived variable function to grab coordinates and dimensions for new variable netcdf file. Also, remove constituents from derived variable file since they will have their own.

justin-richling · 2024-04-16T20:42:04Z

@nusbaume I believe I've addressed your and @brianpm's suggestions/changes. Let me know if anything else needs attention, thanks!

nusbaume

Everything looks good to me now, thanks!

justin-richling added 4 commits March 19, 2024 11:16

Fix incorrect variable name

b928839

Update adf_variable_defaults.yaml

1ec9d80

Add more aerosol info, aerosol calc list, and constants section. Also add subsection for difference colorbar options and flag for non_linear scale plotting/color scale

Add non linear colorbar option and difference cmap opts

6ea5aa1

justin-richling requested review from brianpm and nusbaume March 19, 2024 17:32

justin-richling added 3 commits March 19, 2024 11:33

Update adf_diag.py

a56ebf7

Fix github errors

82c1317

Update plotting_functions.py

8ea3838

brianpm reviewed Mar 21, 2024

View reviewed changes

brianpm reviewed Mar 22, 2024

View reviewed changes

nusbaume requested changes Apr 8, 2024

View reviewed changes

justin-richling added 3 commits April 16, 2024 14:03

Change greek symbols for units

c896ad0

Update adf_diag.py

e440ecb

Clean up derived variable function to grab coordinates and dimensions for new variable netcdf file. Also, remove constituents from derived variable file since they will have their own.

Add statement about derived variables in docstring

5626dd5

nusbaume approved these changes Apr 19, 2024

View reviewed changes

justin-richling merged commit 2ddd16b into NCAR:main Apr 19, 2024
7 checks passed

justin-richling deleted the aerosol-zonal-plots branch April 23, 2024 20:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aerosol zonal plots #291

Aerosol zonal plots #291

justin-richling commented Mar 19, 2024 •

edited

Loading

brianpm Mar 21, 2024

justin-richling Apr 9, 2024

brianpm Mar 21, 2024

nusbaume Apr 6, 2024

brianpm Mar 22, 2024 •

edited

Loading

nusbaume Apr 6, 2024

brianpm Mar 22, 2024

nusbaume Apr 6, 2024

brianpm commented Mar 22, 2024

brianpm commented Mar 25, 2024

brianpm commented Mar 25, 2024

nusbaume left a comment

nusbaume Apr 6, 2024

nusbaume Apr 6, 2024

nusbaume Apr 6, 2024

justin-richling commented Apr 16, 2024

nusbaume left a comment

-                ds[var] = sum(data_arrays)
+                if var == "RESTOM":
+                   #Assume FSNT comes before FLNT:
+                   ds[var] = data_arrays[0] - data_arrays[1]
+                else:
+                   ds[var] = sum(data_arrays)

Aerosol zonal plots #291

Aerosol zonal plots #291

Conversation

justin-richling commented Mar 19, 2024 • edited Loading

Files changed:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brianpm Mar 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brianpm commented Mar 22, 2024

brianpm commented Mar 25, 2024

brianpm commented Mar 25, 2024

nusbaume left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

justin-richling commented Apr 16, 2024

nusbaume left a comment

Choose a reason for hiding this comment

justin-richling commented Mar 19, 2024 •

edited

Loading

brianpm Mar 22, 2024 •

edited

Loading