Add optimized simultaneous ECDF bands #2368

sethaxen · 2024-08-19T20:12:22Z

Description

This pull request adds the optimization method from https://doi.org/10.1007/s11222-022-10090-6 for computing simultaneous ECDF bands and ~~switches to using it as the default approach~~ uses it as the default approach sometimes based on a heuristic.

Part of #2309. The changes are non-breaking, since the docstring of plot_ecdf intentionally does not document which method is the default.

Example

import numpy as np
import scipy.stats

import matplotlib.pyplot as plt
import arviz as az

np.random.seed(0)

dist = scipy.stats.expon()
x = dist.rvs(1000)

eval_points = np.linspace(0, 10, 100)

fig, axs = plt.subplots(2, 1, figsize=(8, 5))
%time az.plot_ecdf(x, ax=axs[0], eval_points=eval_points, confidence_bands=True, cdf=dist.cdf, difference=True)
%time az.plot_ecdf(x, ax=axs[1], eval_points=eval_points, confidence_bands='simulated', cdf=dist.cdf, difference=True)
axs[0].set_title('Optimized confidence bands')
axs[1].set_title('Simulated confidence bands')
plt.tight_layout()

CPU times: user 261 ms, sys: 116 μs, total: 261 ms
Wall time: 261 ms
CPU times: user 96.2 ms, sys: 32 μs, total: 96.3 ms
Wall time: 96.3 ms

Note that while optimized confidence bands are more stable than simulated, with this implementation they are not necessarily faster. Perhaps that can be improved using Numba.

Checklist

Does the PR follow official
PR format?
Has included a sample plot to visually illustrate the changes? (only for plot-related functions)
Is the new feature properly documented with an example?
Does the PR include new or updated tests to cover the new feature (using pytest fixture pattern)?
Is the code style correct (follows pylint and black guidelines)?
Is the new feature listed in the New features
section of the changelog?

📚 Documentation preview 📚: https://arviz--2368.org.readthedocs.build/en/2368/

codecov · 2024-08-19T21:23:34Z

Codecov Report

Attention: Patch coverage is 82.27848% with 14 lines in your changes missing coverage. Please review.

Project coverage is 86.99%. Comparing base (e6b5e2b) to head (054efa6).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
arviz/stats/ecdf_utils.py	80.28%	14 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2368      +/-   ##
==========================================
- Coverage   87.01%   86.99%   -0.03%     
==========================================
  Files         124      124              
  Lines       12788    12862      +74     
==========================================
+ Hits        11128    11189      +61     
- Misses       1660     1673      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

sethaxen · 2024-08-20T08:03:19Z

arviz/stats/ecdf_utils.py

+    return prob_right
+
+
+def _ecdf_band_optimization_objective(


This should ideally be jitted with numba if available for a likely significant speed-up. Currently the problem is that the functions it calls use scipy.stats.binom.interval and scipy.stats.binom.pmf, which are not JIT-able. If we reimplement these functions here, then we could probably get a significant speed-up.

interval calls ppf, which defers to Boost's implementation, which in turn uses TOMS Algorithm 748 to find a root in an interval: https://doi.org/10.1145/210089.210111, so reimplementing that to be JIT-able would be the bottleneck. Which seems like it could be a lot of work.

Ah wait, since interval is called outside the for loop and only pmf is called within, this could be simple.

@conditional_jit doesn't seem to like it if I call another conditionally-jitted function within the first conditionally-jitted function, e.g.

E numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend) E Untyped global name '_update_ecdf_band_interior_probabilities': Cannot determine Numba type of <class 'arviz.utils.maybe_numba_fn'>

Is there a way to make this work?

sethaxen · 2024-08-22T10:47:54Z

I worked out how to JIT compile with numba and definitely see large speed-ups for low ndraws and len(evaluation_points). For large ndraws the sizes of the prob_conditional matrices increases, and for large len(evaluation_points) the number of matmuls increase, so the paper mentions the following approach for efficiency:

The computation time required can be further reduced by using a grid of pre-computed values as the adjustment parameters, and interpolate for different values of N in log-log scale.

bayesplot defaults to interpolation for ndraws >= 100 & len(evaluation_points) > 200. I'll need to think more about whether this would work for arbitrary ECDFs or only for cases where the CDF at eval points is a uniform grid between 0 and 1 (as is the case for uniform eval points for uniform distribution).

sethaxen · 2024-08-22T12:45:16Z

I'll need to think more about whether this would work for arbitrary ECDFs or only for cases where the CDF at eval points is a uniform grid between 0 and 1 (as is the case for uniform eval points for uniform distribution).

Yeah, that's the only context where this makes sense, which is the same context as PIT plots. Ergo, I'll leave interpolation for a separate PR, and in this PR I'll use a benchmark to heuristically identify when we should switch from optimized bands to simulated as the default in plot_ecdf.

OriolAbril

changes look good, not sure if you still need help with numba, if so it wasn't clear to me what wasn't working. Do you have more changes planned for ecdf functionality? I wanted to test porting it to arviz-stats too

sethaxen · 2024-08-28T18:06:17Z

changes look good, not sure if you still need help with numba, if so it wasn't clear to me what wasn't working.

No, it works now and is tested.

Do you have more changes planned for ecdf functionality?

Yes, mainly to make this efficient for multi-panel PIT plots. For those cases one wants to compute the band once and then add it to each subplot. And the band can be computed more efficiently in many cases by us fitting the band probability using optimization for a wide range of values and then interpolating.

Efficiency boosts shouldn't block porting it. It's just that as I implement the changes, it may motivate some code restructuring, so maybe it's better to hold off on porting until that is finished?

Later I will add the rank ECDF plots and corresponding bands, but that's strictly addition of new features and maybe should be added directly to arviz-stats.

sethaxen added 10 commits August 19, 2024 19:46

Add optimized ECDF confidence bands

13258dc

Test ECDF optimized confidence bands

3e17e5c

Default to optimized confidence bands

5c6b091

Use more descriptive names

aac15cc

Support optimized ECDF bands in plots

23c37d5

Test optimized ECDF bands in plots

70e2ee6

Run black on new code

e0c43be

Drop unnecessary rvs keyword from examples

2dfa73f

Update changelog

3f8704e

Fix or disable linting errors

52a483d

sethaxen requested a review from OriolAbril August 19, 2024 21:43

sethaxen commented Aug 20, 2024

View reviewed changes

sethaxen changed the title ~~Add optimized simultaneous ECDF bands~~ [WIP] Add optimized simultaneous ECDF bands Aug 22, 2024

sethaxen removed the request for review from OriolAbril August 22, 2024 10:35

sethaxen added 7 commits August 22, 2024 15:48

Speed up optimized bands with numba if available

a583bab

Make functions work even when jit is disabled

bbb29f4

Fix pylint errors

d227118

Disable pylint warning

003946c

Add heuristic for switching between band methods

020dff8

Add test for low and high number of draws

a209ef4

Run black

054efa6

sethaxen changed the title ~~[WIP] Add optimized simultaneous ECDF bands~~ Add optimized simultaneous ECDF bands Aug 27, 2024

sethaxen requested a review from OriolAbril August 27, 2024 10:15

OriolAbril approved these changes Aug 28, 2024

View reviewed changes

OriolAbril merged commit 5a928fb into main Aug 28, 2024
12 checks passed

sethaxen deleted the ecdf_bands_opt branch September 16, 2024 08:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optimized simultaneous ECDF bands #2368

Add optimized simultaneous ECDF bands #2368

sethaxen commented Aug 19, 2024 •

edited

Loading

codecov bot commented Aug 19, 2024 •

edited

Loading

sethaxen Aug 20, 2024

sethaxen Aug 20, 2024

sethaxen Aug 20, 2024 •

edited

Loading

sethaxen commented Aug 22, 2024

sethaxen commented Aug 22, 2024

OriolAbril left a comment

sethaxen commented Aug 28, 2024

Add optimized simultaneous ECDF bands #2368

Add optimized simultaneous ECDF bands #2368

Conversation

sethaxen commented Aug 19, 2024 • edited Loading

Description

Example

Checklist

codecov bot commented Aug 19, 2024 • edited Loading

Codecov Report

sethaxen Aug 20, 2024

Choose a reason for hiding this comment

sethaxen Aug 20, 2024

Choose a reason for hiding this comment

sethaxen Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

sethaxen commented Aug 22, 2024

sethaxen commented Aug 22, 2024

OriolAbril left a comment

Choose a reason for hiding this comment

sethaxen commented Aug 28, 2024

sethaxen commented Aug 19, 2024 •

edited

Loading

codecov bot commented Aug 19, 2024 •

edited

Loading

sethaxen Aug 20, 2024 •

edited

Loading