Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calibration superfluous edges #2033

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

SamGriffithsMO
Copy link
Contributor

@SamGriffithsMO SamGriffithsMO commented Sep 26, 2024

Addresses https://github.com/MetOffice/improver_suite/issues/1750

Description

Updates gridded calibration CLI to align itself with spot calibration, e.g. The CLI now takes a directory and it is responsible for identifying which files are required.

This is additionally driven by a desire to remove >day long offsets in the suite configurations, as these require handling before producing a Cylc graph

Related

Testing:

  • Rewrite acceptance tests (?)
  • Ran tests and they passed OK
  • Added new tests for the new feature(s)

Comment on lines 100 to 101
forecast = get_cube_from_directory(forecast_directory, cycle_point=cycle_point, max_days_offset=max_days_offset)
truth = get_cube_from_directory(truth_directory, cycle_point=cycle_point, max_days_offset=max_days_offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this is still in draft so likely you are doing a minimal change for testing. However, if/when you come closer to pulling this out of draft, we would want to ensure feature parity between the CLI and the plugin, whetehr that means changing EstimateCoefficientsForEnsembleCalibration UI and behaviour to call get_cube_from_directory, or otherwise defining a metaplugin to do achieve this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about this issue, but hadn't considered a metaplugin, which I think might be the way to go. Thanks :)

@cpelley
Copy link
Contributor

cpelley commented Sep 26, 2024

ping @gavinevans - you might be interested in this.

@cpelley
Copy link
Contributor

cpelley commented Sep 26, 2024

Thanks @SamGriffithsMO, this looks very promising 👍

@SamGriffithsMO
Copy link
Contributor Author

Still wip

  1. I am still looking for a performant method to extract the validity time from a netcdf, instead of from the filename
  2. I need to write tests

@cpelley
Copy link
Contributor

cpelley commented Sep 30, 2024

I am still looking for a performant method to extract the validity time from a netcdf, instead of from the filename

100% agree in using metadata rather information encoded in the filepath if possible.
However. Better metadata within the data from improver to increase traceability is something I have been wanting to address the entire time I have been here (on the 'todo' so-to-speak). If the metadata available right now isn't sufficient or deriving the validity time becomes performance inhibitive with the number of files involved, it might still be worth considering the filepath as a short term solution. An excuse for us to look at enforcing traceability through improved metadata within improver.

Comment on lines +318 to +324
0bd96af6cb5c6caa045e397589dd0ce3b498af837d989fe73326f5e9459c6054 ./construct-reliability-tables/basic/forecast/forecast_0.nc
fbc14286b4ce41e2e60df0870ae4911c1b00a38ec96912f43c6187fcaf7d02f6 ./construct-reliability-tables/basic/forecast/forecast_1.nc
0d0edf9751a2019db952907700b02499ec9f1c360db4591a8012ca247a841c73 ./construct-reliability-tables/basic/kgo_aggregated.nc
902e5cb9d3dc5d2b78bb99aff8370f9815adf5064b2caeb7abed73a56a897a43 ./construct-reliability-tables/basic/kgo_single_value_bins.nc
72d4fd0655d1b7a2bc11d85741ec944f195c59813ae629e6858116c4e09eccb0 ./construct-reliability-tables/basic/kgo_without_single_value_bins.nc
8ed50464c34b8673d98d1256d1c11b9eeea911dc79f7f75d425a590bf8697301 ./construct-reliability-tables/basic/truth_0.nc
3999adb3749052d9efdfab863427a20a1fabbca06ff430c6c9cf5f89d1ea4d60 ./construct-reliability-tables/basic/truth_1.nc
8ed50464c34b8673d98d1256d1c11b9eeea911dc79f7f75d425a590bf8697301 ./construct-reliability-tables/basic/truth/truth_0.nc
3999adb3749052d9efdfab863427a20a1fabbca06ff430c6c9cf5f89d1ea4d60 ./construct-reliability-tables/basic/truth/truth_1.nc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found best practices around the acceptances tests to be fairly obscure. This change is required to match the updated expected inputs for the cli, but it is unclear to me how we would implement this change for all users? And how much of a pain it would be?

If this is too much of a pain, I could make some changes to get_cube_from_directory


if len(cubes) < 2:
if verbose:
print(f"Not enough files found in {directory}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be good to include the number of cubes that was found in this message (i.e. 1 or 0)
I don't recall improver dev. guidelines around use of warnings etc. (you know?)

@@ -1366,6 +1367,61 @@ def process(
return coefficients_cubelist


class MetaEstimateCoefficientsForEnsembleCalibration(BasePlugin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a beast of a name 😋
How about this:

Suggested change
class MetaEstimateCoefficientsForEnsembleCalibration(BasePlugin):
class MetaEnsembleCoeffEstimator(BasePlugin):

@cpelley
Copy link
Contributor

cpelley commented Oct 10, 2024

@SamGriffithsMO, have you given thought as to how to incorporate this change? (implications for BoM, USAF etc.)
An option to consider might be to create two new CLI's and making no functional change to the existing two CLI's. What you think?

Comment on lines +1388 to +1397
self.distribution = distribution
self.truth_attribute = truth_attribute
self.cycle_point = cycle_point
self.max_days_offset = max_days_offset
self.point_by_point = point_by_point
self.use_default_initial_guess = use_default_initial_guess
self.units = units
self.predictor = predictor
self.tolerance = tolerance
self.max_iterations = max_iterations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's good form to suffix these with an underscore to indicate 'private' (considered private by convention, not in the same sense of private in c/c++ of course).
Basically, all data should be hidden within the class. If we really want to expose this data to users, then they should be exposed via properties (setters & getters). This stops users from easily tweaking values they aren't intended to be tweaking.

Comment on lines +546 to +552
self.truth_attribute = truth_attribute
self.n_probability_bins = n_probability_bins
self.single_value_lower_limit = single_value_lower_limit
self.single_value_upper_limit = single_value_upper_limit
self.cycle_point = cycle_point
self.max_days_offset = max_days_offset
self.aggregate_coordinates = aggregate_coordinates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per above


if forecast and truth:
plugin = ConstructReliabilityCalibrationTables(
# truth_attribute=self.truth_attribute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this commended arg. here for?

Comment on lines +72 to +80
return MetaConstructReliabilityCalibrationTables(
truth_attribute=truth_attribute,
n_probability_bins=n_probability_bins,
single_value_lower_limit=single_value_lower_limit,
single_value_upper_limit=single_value_upper_limit,
)(forecast, truth, aggregate_coordinates)
aggregate_coordinates=aggregate_coordinates,
cycle_point=cycle_point,
max_days_offset=max_days_offset,
)(forecast_directory, truth_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, there is a lot to unpack here in terms of the implications going forwards.
Our future vision actually separates loading, polling, file identity from the IMPROVER plugins.
In this vision, I the existing IMPROVER plugins wouldn't be modified to identify the relevant files. Instead a plugin would specifically handle this and feed into the existing IMPROVER plugins.

We should chat offline.

Comment on lines +16 to +18
forecast_directory: cli.inputpath,
truth_directory: cli.inputpath,
land_sea_mask: cli.inputcube = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

Suggested change
forecast_directory: cli.inputpath,
truth_directory: cli.inputpath,
land_sea_mask: cli.inputcube = None,
forecast_cubes: cli.calib_input_dir,
truth_cubes: cli.calib_input_dir,
land_sea_mask: cli.inputcube = None,

predictor=predictor,
tolerance=tolerance,
max_iterations=max_iterations,
)
return plugin(forecast, truth, landsea_mask=land_sea_mask)
return plugin(forecast_directory, truth_directory, land_sea_mask=land_sea_mask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

Suggested change
return plugin(forecast_directory, truth_directory, land_sea_mask=land_sea_mask)
return plugin(**cubes)

Comment on lines +16 to +18
forecast_directory: cli.inputpath,
truth_directory: cli.inputpath,
land_sea_mask: cli.inputcube = None,
Copy link
Contributor

@cpelley cpelley Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or actually (enabling continued CLI behaviour)

Suggested change
forecast_directory: cli.inputpath,
truth_directory: cli.inputpath,
land_sea_mask: cli.inputcube = None,
*cubes: Union[cli.calib_input_dir, cli.inputcube],

Copy link
Contributor

@cpelley cpelley Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat to this. I'm not a fan of the use of the clize typing mechanism within IMPROVER. However, this might facilitate maintaining existing CLI functionality and behaviour while also facilitating separating 'load' from the plugin itself (useful on our journey towards our long-term goals).

@SamGriffithsMO
Copy link
Contributor Author

Putting in to draft to address what Carwyn and I discussed offline. Move the loading to clize (more inline with other plugins), and therefore allow the plugin to run whether given specific cubes or the directories as planed here

@SamGriffithsMO SamGriffithsMO marked this pull request as draft October 10, 2024 14:23
Copy link
Contributor

@gavinevans gavinevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SamGriffithsMO 👍

I'll written some comments below. I haven't totally finished but I'll post this now, as this PR has been pulled back into draft.

Comment on lines +273 to +279
def get_cube_from_directory(
directory,
cycle_point=None,
max_days_offset=None,
date_format="%Y%m%dT%H%MZ",
verbose=False,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you add type hints for this function?

def get_cube_from_directory(
    directory: pathlib.Path,
    cycle_point: Optional[str],
    max_days_offset: Optional[int],
    date_format: str = "%Y%m%dT%H%MZ",
    verbose: bool = False,
) -> Cube:

Comment on lines +281 to +298
loads and merges all netCDF files in a directory

To switch on the max offset filter, both cycle_point and max_days_offset
need to be provided

Args:
directory (pathlib.Path):
The path to the directory.
cycle_point (str):
The cycle point of the forecast, used to filter files
max_days_offset (int):
Maximum number of days before cycle_point to consider files,
Defined as a postive int that is subtracted from the cycle_point
date_format (str):
format of the cyclepoint and datetime in the filename, used by
datetime.strptime
verbose (bool):
switch on verbose output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove the types from this docstring e.g. pathlib.Path as that should be added to the type hint, rather than here? Could you also capitalise the start of sentences and add full stops for readability?

Comment on lines +292 to +293
Maximum number of days before cycle_point to consider files,
Defined as a postive int that is subtracted from the cycle_point
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Maximum number of days before cycle_point to consider files,
Defined as a postive int that is subtracted from the cycle_point
Maximum number of days before cycle_point to consider files.
Defined as a positive int that is subtracted from the cycle_point

switch on verbose output

Returns:
Cube
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add something more descriptive here? E.g. Cube created by merging all netCDF files within the directory.

Comment on lines +1375 to +1387
def __init__(
self,
distribution,
truth_attribute,
cycle_point=None,
max_days_offset=None,
point_by_point=False,
use_default_initial_guess=False,
units=None,
predictor="mean",
tolerance: float = 0.02,
max_iterations: int = 1000,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you add type hints here? These would be similar as already exist for the EstimateCoefficientsForEnsembleCalibration plugin.

distribution,
truth_attribute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This truth_attribute variable is actually no longer required in this CLI, as the forecasts and truths are being provided by separated directories, rather than this repository needing to separate the forecasts and truths, so I think the usage of truth_attribute in this CLI can be removed.

def __init__(
self,
distribution,
truth_attribute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truth_attribute isn't used, so this can be removed.

aggregate_coordinates: list = None,
cycle_point: Optional[str] = None,
max_days_offset: Optional[int] = None,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add docstrings for the __init__ and process methods?


if forecast and truth:
plugin = ConstructReliabilityCalibrationTables(
# truth_attribute=self.truth_attribute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the truth_attribute should be removed, as it's no longer used.

@@ -11,12 +11,16 @@
@cli.clizefy
@cli.with_output
def process(
*cubes: cli.inputcube,
truth_attribute,
forecast_directory: cli.inputpath,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted by @cpelley, these changes would notably impact other users that use this CLI currently. We would need to communicate this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants