Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aerosol zonal plots #291

Merged
merged 10 commits into from
Apr 19, 2024
Merged

Conversation

justin-richling
Copy link
Collaborator

@justin-richling justin-richling commented Mar 19, 2024

This PR will add calculations for zonal aerosol plots, and add new functionality for colorbar options.

Files changed:

adf_variable_defaults.yaml

  • Add more aerosol variables and colorbar options
  • Add constants section

adf_diag.py

  • Add automatic time series file generation for PMID and T if certain aerosol variables are requested
  • Change how derived quantities are created; add flexibility in number of constituents a variable needs

plotting_functions.py

  • Add compatibility for non-linear increments for color maps/ranges
  • Specify difference plot colorbar options from case colorbar options

regrid_and_vert_interp.py:

  • Fix small variable name typo

Add more aerosol info, aerosol calc list, and constants section.

Also add subsection for difference colorbar options and flag for non_linear scale plotting/color scale
This moves away from the nco way of deriving variables and moves it to xarray. This is mostly so that any derived variable that has more than two constituents can be processed. The nco way only worked for two constituents.

Also add check for any aerosol variables that will require rho (density of dry air) and create PMID and T time series files (if not already requested). Additionally, add section in derived variables to calculate aerosol values.
lib/adf_diag.py Outdated
variables[i] = ds[constit_list[0]].dims

#Get coordinate values from the first constituent file
coords={'lat': ds[constit_list[0]].lat.values,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be changed to
coords = ds.coords

That way we get the metadata in the coordinate variables, and we don't have to make assumptions about what coordinates are present. I'm thinking of a day when we can ingest data on unstructured grids.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this way, but when it gets to the aerosols (vertical levels) it seems to have a mismatch between coordinates and dimensions:

ValueError: coordinate ilev has dimensions ('ilev',), but these are not a subset of the DataArray dimensions ('time', 'lev', 'lat', 'lon')


coords = ds.coords

Coordinates:
  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.3 357.5 358.8
  * lev      (lev) float64 0.008135 0.01834 0.03482 ... 983.2 991.2 997.5
  * ilev     (ilev) float64 0.004259 0.01201 0.02467 ... 987.4 995.1 1e+03
  * time     (time) object 1997-02-01 00:00:00 ... 2001-01-01 00:00:00

As per @nusbaume suggestion for setting the dims once:

const_dims = ds_constits[constit_list[0]].dims

('time', 'lev', 'lat', 'lon')

So I just grabbed the dimensions and coordinates from the first constituent dataArray

#Grab subset for first constituent for dimensions and coordinates
ds_const0 = ds[constit_list[0]]
coords = ds_const0.coords

KeysView(Coordinates:
  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.3 357.5 358.8
  * lev      (lev) float64 0.008135 0.01834 0.03482 ... 983.2 991.2 997.5
  * time     (time) object 1997-02-01 00:00:00 ... 2001-01-01 00:00:00)
const_dims = ds_const0.dims
const_dims ('time', 'lev', 'lat', 'lon') 

It seemed to work, let me know if there is anything that stands out about going this way.

lib/adf_diag.py Outdated
data_arrays.append(da)

#NOTE: this will need to be changed when derived equations are more than sums! - JR
ds[var] = sum(data_arrays)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this break RESTOM which is a difference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, for now we could just have a specific check for RESTOM here, e.g.:

Suggested change
ds[var] = sum(data_arrays)
if var == "RESTOM":
#Assume FSNT comes before FLNT:
ds[var] = data_arrays[0] - data_arrays[1]
else:
ds[var] = sum(data_arrays)

lib/adf_diag.py Outdated

variables = {}
for i in constit_list:
variables[i] = ds[constit_list[0]].dims
Copy link
Collaborator

@brianpm brianpm Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about what this does. Looks like the keys of variables will be the names of the constituents of the derived quantity, and the values will all be the same tuple of the first constituent's dimensions.

variables then gets used below to loop through the dimensions.

I wonder if you get the same result if by replacing lines 1071-1092 with this:

s = 0
for v in constit_list:
    s += ds[v]
s.name = var

and then either ds[var] = s, or just write s.to_netcdf(...) when it is ready for output (see my comment below about the output file).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might vote to remove this section of code entirely, and instead just extract the dims once:

const_dims = ds[constit_list[0]].dims

as they should (?) be the same between all of the constituents.

Then down below instead of looping over the variables quantity for the constituent name just loop over constit_list again. This will help reduce the total number of variables in this function, and the total number of iterations it has to perform.

lib/adf_diag.py Outdated
if var == "SO4":
ds[var] = ds[var]*(96/115)

ds.to_netcdf(derived_file, unlimited_dims='time', mode='w')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that ds has the constituents and the derived values?

Do we want to store all of them in the output file? I think in the version using ncap2 that was necessary because of how NCO works, but since we already have the constituents in files, I think this would be replicating data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I would remove all of the variables in ds except for the derived variable just before you run the to_netcdf command.

@brianpm
Copy link
Collaborator

brianpm commented Mar 22, 2024

@justin-richling and @nusbaume

Following up on an email I sent you about this... I wonder if we should provide an improved approach for derived variables. If we think bringing in numexpr as a new dependency is a reasonable approach, I think I've got a potential drop-in replacement, as follows:

  1. provide a new "variable_default" entry called formula that would define the derived variable in terms of the derivable_from list.
  2. Modify the vars_to_derive variable like this:
            # Loop over CAM history variables:
            list_of_commands = []
            vars_to_derive = {}
            # create copy of var list that can be modified for derivable variables
            diag_var_list = self.diag_var_list
            for var in diag_var_list:
                if var not in hist_file_var_list:
                    vres = res.get(var, {})
                    if ("derivable_from" in vres) and ("formula" in vres):
                        constit_list = vres["derivable_from"]
                        expression = vres["formula"]
                        for constit in constit_list:
                            if constit not in diag_var_list:
                                diag_var_list.append(constit)
                        vars_to_derive[var] = [constit_list, expression]
                        continue
                    else:
                        msg = f"WARNING: {var} is not in the file {hist_files[0]}."
                        msg += " No time series will be generated."
                        print(msg)
                        continue
  1. replace derive_variables with the following (UNTESTED):
import numexpr as nex


def derive(expression, **inputs):
    """Derive output using expression, with input data suppled in inputs.

    expression : str
        A string describing the expression to evaluate, can contain
        numbers, variables, and basic operators. 

    inputs : dict
        A dictionary with keys being the variables used in expression
        and the values being the data (either xr.DataArray or numpy arrays)

    return
        The result of expression

    NOTES
    -----
        Example: expression = "PRECC + PRECL" 
                 inputs = {"PRECC": arr1, "PRECL": arr2}
    """
    return nex.evaluate(expression, inputs)  



def derive_xrWrap(expression, *args, **kwargs):

    # maybe this isn't kosher, but this allows the user
    # to either provide named arguments 
    # or a dictionary without using **dict
    if (len(args) == 1) and (isinstance(args[0], dict)):
        kwargs = args[0]

    template = kwargs[[*kwargs][0]] # whatever is the first input
    # pop the things that are not part of expression inputs
    dims = kwargs.pop("dims", (template.dims if hasattr(template, 'dims') else None))
    coords = kwargs.pop("coords", (template.coords if hasattr(template, 'coords') else None))
    attrs = kwargs.pop("attrs", template.attrs)
    ans = derive(expression, kwargs)
    return xr.DataArray(ans, dims=dims, coords=coords, attrs=attrs)



def derive_variables(self, vars_to_derive=None, ts_dir=None, overwrite=None):
    """
    Derive variables according to expression and inputs provided in vars_to_derive dict.

    Caution: this method _may still_ assume that there will be one time series file per variable

    If the file for the derived variable exists, the kwarg `overwrite` determines
    whether to overwrite the file (true) or exit with a warning message.
    """

    for var in vars_to_derive:
        constit_list = vars_to_derive[var][0]
        expression = vars_to_derive[var][1]
        # get constituent files:
        constit_files = {}
        for c in constit_list:
            if glob.glob(os.path.join(ts_dir, f"*.{c}.*")):
                constit_files[c] = sorted(glob.glob(os.path.join(ts_dir, f"*.{c}.*")))
            else:
                ermsg = f"{c} files were not present; {var} cannot be calculated."
                ermsg += f" Please remove {var} from diag_var_list or find the relevant CAM files."
                raise FileNotFoundError(ermsg)
        # from the constit_files, get the data
        constit_data = {}
        for c in constit_files:
            if len(constit_files[c] > 1):
                constit_data[c] = xr.open_mfdataset(constit_files[c])[c]
            else:
                constit_data[c] = xr.open_dataset(constit_files[c][0])[c]
        # output file specification:
        # TODO: this might not work for multi-file cases
        derived_file = constit_files[[*constit_files][0]].replace(constit_list[0], var)
        if Path(derived_file).is_file():
            if overwrite:
                Path(derived_file).unlink()
            else:
                print(
                    f"[{__name__}] Warning: {var} file was found and overwrite is False. Will use existing file."
                )
                continue
        # derive the variable using expression:
        result = derive_xrWrap(expression, **constit_data)  # defaults to copying metadata from 1st constituent
        # TODO: provide a way to send updated metadata to derive_xrWrap, maybe from additional variable_defaults info
        # Save output:
        if 'time' in result.dims:
            udim = 'time'
        else:
            udim = None
        result.to_netcdf(derived_file, unlimited_dims=udim, mode='w')

@brianpm
Copy link
Collaborator

brianpm commented Mar 25, 2024

I thought about this PR some more. I think given my previous suggestions, it should be acceptable to have derived variables that could be derived from derived variables themselves. For example, say PRECT is derived from PRECC + PRECL. Then the convective fraction of precipitation could be derived as PRECCFRAC = PRECC / PRECT. This seems fine, but requires doing the derived variables in a correct order. I believe that we could just reorder the vars_to_derive using the following code:

def get_derive_order(vres):
    """provides a valid order for deriving variables
       such that derived quantities that are dependencies
       for other quantities are earlier in the output list

        parameters
        ----------
        vres : dict
            dict of each derivable variable with values that are the dependencies (as list)

        return
        ------
        list
            The list of derived variables in a valid order for derivations to succeed
    """
    # use deque for efficient inserting/appending
    from collections import deque
    # Get the full set of values
    derivable = set(vres.keys())  # This also serves as our definition of having dependencies

    q = deque()

    while len(q) < len(derivable):  # every derivable needs to be in q
        for v in derivable:  # Scan through all derivables
            if v not in q:
                q.append(v)
                vindex = q.index(v)  # we might need to know this position
            if v in vres:
                vdepends = vres[v]
                if not isinstance(vdepends, list):
                    vdepends = [vdepends] # in case it's just a single dependency given as a str
                vdtest = [vd for vd in vdepends if vd in vres]  # list of dependencies with dependencies
                if not vdtest:
                    continue # v in q and no dependencies have dependencies
                else:
                    vdd = [vd for vd in enumerate(vdtest) if vd in q] # true if already in q
                    if any(vdd):
                        ndx = vdd.index(True)  # index of first True value
                        q.insert(vindex, vdd[ndx]) # put ahead of v in q
                        break  # leave the for-loop, to re-start the scan
            else:
                print(f"ERROR: encountered {v} which is not listed as a derived quantity")
    return list(q)  # convert deque to list

And that could be used right before calling derive_variables:

if vars_to_derive:
    var_order = get_derive_order(vars_to_derive)  # already a dict as per my other suggestion
    vars_to_derive_ordered = {k: vars_to_derive[k] for k in var_order}
    self.derive_variables(vars_to_derive=vars_to_derive_ordered, ts_dir=ts_dir[case_idx])

@brianpm
Copy link
Collaborator

brianpm commented Mar 25, 2024

@justin-richling -- Sorry for all this code, I think I got carried away! If you and/or @nusbaume would prefer to move ahead with your PR, and then I can put my suggested changes in separately, that would be fine with me. Let me know.

Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Just have some minor change requests and a few suggestions that will hopefully help us maintain current functionality while waiting for the new derived variable infrastructure.

lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated

variables = {}
for i in constit_list:
variables[i] = ds[constit_list[0]].dims
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might vote to remove this section of code entirely, and instead just extract the dims once:

const_dims = ds[constit_list[0]].dims

as they should (?) be the same between all of the constituents.

Then down below instead of looping over the variables quantity for the constituent name just loop over constit_list again. This will help reduce the total number of variables in this function, and the total number of iterations it has to perform.

lib/adf_diag.py Outdated
data_arrays.append(da)

#NOTE: this will need to be changed when derived equations are more than sums! - JR
ds[var] = sum(data_arrays)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, for now we could just have a specific check for RESTOM here, e.g.:

Suggested change
ds[var] = sum(data_arrays)
if var == "RESTOM":
#Assume FSNT comes before FLNT:
ds[var] = data_arrays[0] - data_arrays[1]
else:
ds[var] = sum(data_arrays)

lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated
if var == "SO4":
ds[var] = ds[var]*(96/115)

ds.to_netcdf(derived_file, unlimited_dims='time', mode='w')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I would remove all of the variables in ds except for the derived variable just before you run the to_netcdf command.

lib/adf_variable_defaults.yaml Show resolved Hide resolved
lib/adf_variable_defaults.yaml Outdated Show resolved Hide resolved
Clean up derived variable function to grab coordinates and dimensions for new variable netcdf file. Also, remove constituents from derived variable file since they will have their own.
@justin-richling
Copy link
Collaborator Author

@nusbaume I believe I've addressed your and @brianpm's suggestions/changes. Let me know if anything else needs attention, thanks!

Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me now, thanks!

@justin-richling justin-richling merged commit 2ddd16b into NCAR:main Apr 19, 2024
7 checks passed
@justin-richling justin-richling deleted the aerosol-zonal-plots branch April 23, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants