Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reported_cases_opts() #346

Open
Tracked by #423
seabbs opened this issue Jan 23, 2023 · 29 comments
Open
Tracked by #423

reported_cases_opts() #346

seabbs opened this issue Jan 23, 2023 · 29 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@seabbs
Copy link
Contributor

seabbs commented Jan 23, 2023

Currently, all inputs except for reported_cases are managed via a helper function (i.e delay_opts()) that enables specifying options. It would make sense to standardized reported_cases to be in line with this approach. This would also make sense as currently there are several data processing steps (i.e to deal with missing dates) that occur inside package code that are not well surfaced to the user. Putting these in a reported_cases_opts() function would help resolve this and bring the user closer to the data being used for model fitting which is generally a good idea.

@seabbs seabbs added enhancement New feature or request help wanted Extra attention is needed labels Jan 23, 2023
@sbfnk sbfnk mentioned this issue Jul 18, 2023
18 tasks
@sbfnk
Copy link
Contributor

sbfnk commented Jan 29, 2024

Perhaps data_opts() would be better if it's to have preprocessing etc.? I think "reported cases" is a bit too specific a name for the argument anyway (could be hospitalisations etc.).

@sbfnk
Copy link
Contributor

sbfnk commented Mar 6, 2024

A downside of this approach is that, if we want a common data_opts() for all the three models contained in the package (which I think we do) then we can't use this to document requirements of the data (estimate_truncation and estimate_secondary: data frames with specific columns; estimate_truncation: list) and will instead have to point to the specific function documentation.

An alternative solution would be to keep the reported_cases/primary/obs argument(s) as first argument and add preprocessing_opts() for filtering zeros etc. This would also have a potential benefit of being slightly easier to integrate in a pipe.

@sbfnk
Copy link
Contributor

sbfnk commented Mar 11, 2024

Trying to separate the steps involved in closing the issue if going with the approach suggested in the previous comment:

  • Add filter_leading_zeros and zero_threshold to estimate_secondary and estimate_truncation as explicit arguments
  • Put these two into a common preprocessing_opts() (or a shorter name?) function and call this from all models
  • Review if any other preprocessing steps should be made explicit / optional
  • rename first argument of all three models to data

Once this issue is closed we can then make the data column argument more flexible, addressing #505

@jamesmbaazam
Copy link
Contributor

  • rename first argument of all three models to data

Makes sense to me. Would bullet 4 require us to deprecate the current names?

@sbfnk
Copy link
Contributor

sbfnk commented Mar 11, 2024

Yes, I think so.

@jamesmbaazam
Copy link
Contributor

I can work on this if it's good to go.

@sbfnk
Copy link
Contributor

sbfnk commented Mar 11, 2024

Yes, it should be good to go - I think ideally addressing each of the bullet points in sequence using separate PRs.

@jamesmbaazam
Copy link
Contributor

jamesmbaazam commented Apr 25, 2024

I might be wrong but the proposed preprocessing_opts() function is just create_clean_reported_cases() right?.

@sbfnk
Copy link
Contributor

sbfnk commented Apr 26, 2024

I might be wrong but the proposed preprocessing_opts() function is just create_clean_reported_cases() right?.

Good question. There are two options:

  1. we create a preprocessing function that would create clean data that can then passed to the relevant functions
  2. we create a preprocessing_opts() function that collates the arguments (as with the other ..._opts() functions) which are passed to create_clean_reported_cases() within the relevant functions. This would add a preprocessing (or other?) argument to those functions.

Option (2) is might be the easier one as it doesn't require updating any internal logic (e.g. where some internal processing is done before calling create_clean_reported_cases(). It's perhaps also more in line with the existing function/argument structure. But I'm open to arguments for (1).

@seabbs
Copy link
Contributor Author

seabbs commented Apr 29, 2024

  1. Is definitely more in line with other bits of the logic but I much prefer the idea of making the data preprocessing apparent and accessible to people (i.e 1.)

@sbfnk
Copy link
Contributor

sbfnk commented Apr 29, 2024

If going down that route we probably want to rename it to create_clean_data() or preprocess_data() in line with the other changes triggered by this issue.

@jamesmbaazam
Copy link
Contributor

  1. Is definitely more in line with other bits of the logic but I much prefer the idea of making the data preprocessing apparent and accessible to people (i.e 1.)

Two options seem to be apparent here:

  • The data argument can only accept objects run through preprocessing_opts()
  • If users want to pipe the data, they must do raw_data %>% preprocessing_opts() %>% estimate_*().

@jamesmbaazam
Copy link
Contributor

jamesmbaazam commented Apr 29, 2024

If going down that route we probably want to rename it to create_clean_data() or preprocess_data() in line with the other changes triggered by this issue.

  • Yes, I noticed from Deprecate obs, reports, and reported_cases in favour of data #607 that we need a PR run to rename data objects and functions to more neutral names.
  • Additionally, I would suggest splitting up create_clean_reported_cases() into:
    • add_horizon() - completes the dates and adds the horizon window
    • add_breakpoints()
    • filter_leading_zeros()
    • handle_zero_threshold()

In particular, the zero_threshold cleaning step (https://github.com/epiforecasts/EpiNow2/blob/main/R/create.R#L60-L77) deserves being in a separate function.

@sbfnk
Copy link
Contributor

sbfnk commented Apr 30, 2024

I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.

@jamesmbaazam
Copy link
Contributor

I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.

Agreed. They're not user-facing so not a priority for this release.

@sbfnk
Copy link
Contributor

sbfnk commented Apr 30, 2024

The suggestions here could also help address #547 and #640

@sbfnk sbfnk added this to the CRAN v1.6 release milestone May 1, 2024
@jamesmbaazam jamesmbaazam mentioned this issue May 2, 2024
7 tasks
@sbfnk
Copy link
Contributor

sbfnk commented May 21, 2024

Proposed set up for data handling / cleaning would be to

  • distinguish between NA (no value) and missing dates (can be accumulated) Distinguish NA (missing) from NA (accumulated) #547
  • remove the horizon argument from estimate_infections etc.
  • add a forecast argument with function forecast_opts with arguments horizon, frequency, and maybe future (from rt_opts()) - internally this will add future dates to the passed data frame with NA values at a given frequency (which will ensure it works with the set up for accumulating incidence if desired)
  • add a clean_data() or similar function that does the other steps mentioned above, i.e. filtering leading zeroes and handling the zero threshold. This would be in line with reported_cases_opts() #346 (comment)

In principle the horizon stuff could also be separate in an add_horizon() function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future argument in rt_opts(). The present could be inferred as the last known data point.

@sbfnk
Copy link
Contributor

sbfnk commented May 21, 2024

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see #505

@seabbs
Copy link
Contributor Author

seabbs commented May 21, 2024

an add_horizon() fun

I like this idea a lot

@sbfnk
Copy link
Contributor

sbfnk commented May 22, 2024

@jamesmbaazam what do you think?

@jamesmbaazam
Copy link
Contributor

jamesmbaazam commented May 22, 2024

I like all the points.

In principle the horizon stuff could also be separate in an add_horizon() function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future argument in rt_opts(). The present could be inferred as the last known data point.

I think the forecasting stuff should probably be done internally using forecast_opts() rather than being a preprocessing step. It removes one step in the model setup.

Maybe, frequency -> interval?? (Unless I don't understand what frequency means here).

add a clean_data() or similar function that does the other steps mentioned above, i.e. filtering leading zeroes and handling the zero threshold. This would be in line with #346 (comment)

Maybe, more specifically, handle_zero_cases()?

@jamesmbaazam
Copy link
Contributor

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see #505

This seems to suit the use case for linelist but not suggesting we take it on as a dependency.

@seabbs
Copy link
Contributor Author

seabbs commented May 22, 2024

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see

I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.

Or is the proposal you then track their column name through the code base? Again I am a bit sceptical of the value added here to most users.

@sbfnk
Copy link
Contributor

sbfnk commented May 22, 2024

I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.

I generally agree, just flagging that if ever going ahead with the suggestion in #371 (comment) we might need a way to point out which column in a passed data frame corresponds to which observation model (though that'll likely look different from a data_opts() option).

Somewhere we might also want users to specify what the data represent, i.e. #505

@sbfnk
Copy link
Contributor

sbfnk commented Aug 2, 2024

It seems we broadly have two options here:

obs |>
  rename(value = confirm) |>
  filter_leading_zeroes() |>
  apply_zero_threshold(threshold = 10) |>
  add_horizon(n = 3, frequency = 7) |>
  estimate_infections()

or

obs |>
  estimate_infections(
    data = data_opts(col = "confirm", zero_threshold = 10),
    forecast = forecast_opts(n = 3, frequency = 7)
  )

@jamesmbaazam
Copy link
Contributor

jamesmbaazam commented Aug 2, 2024

I think the second cannot be piped that way as by definition, the pipe passes the data to the first argument. So it will rather be

  estimate_infections(
    data = data_opts(obs, col = "confirm", zero_threshold = 10),
    forecast = forecast_opts(n = 3, frequency = 7)
  )

Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.

@sbfnk
Copy link
Contributor

sbfnk commented Aug 2, 2024

If the case data set becomes part of data_opts yes, but I don't think it has to be (though of course it might make sense for it to do so).

@sbfnk
Copy link
Contributor

sbfnk commented Aug 2, 2024

Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.

That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections() they have no way of accessing that information.

@jamesmbaazam
Copy link
Contributor

If the case data set becomes part of data_opts yes, but I don't think it has to be (though of course it might make sense for it to do so).

We already have a data argument so if we are going by the original second option, it will have to be named something else, then the piping will work.

That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections() they have no way of accessing that information.

Valid point. Alternatively, we could improve the logging and messaging in the current setup to report all of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants