Skip to content

Commit

Permalink
Merge pull request #1055 from CamDavidsonPilon/v0.24.11
Browse files Browse the repository at this point in the history
v0.24.11
  • Loading branch information
CamDavidsonPilon authored Jun 18, 2020
2 parents 9be518a + bf1caed commit 40ccdf1
Show file tree
Hide file tree
Showing 17 changed files with 820 additions and 591 deletions.
12 changes: 11 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,16 @@
## Changelog

#### 0.24.10
#### 0.24.11 - 2020-06-17

##### New features
- new spline regression model `CRCSplineFitter` based on the paper "A flexible parametric accelerated failure time model" by Michael J. Crowther, Patrick Royston, Mark Clements.
- new survival probability calibration tool `lifelines.calibration.survival_probability_calibration` to help validate regression models. Based on “Graphical calibration curves and the integrated calibration index (ICI) for survival models” by P. Austin, F. Harrell, and D. van Klaveren.

##### API Changes
- (and bug fix) scalar parameters in regression models were not being penalized by `penalizer` - we now penalizing everything except intercept terms in linear relationships.


#### 0.24.10 - 2020-06-16

##### New features
- New improvements when using splines model in CoxPHFitter - it should offer much better prediction and baseline-hazard estimation, including extrapolation and interpolation.
Expand Down
30 changes: 27 additions & 3 deletions docs/Survival Regression.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1016,10 +1016,10 @@ Prime Minister Stephen Harper.
.. note:: Because of the nature of the model, estimated survival functions of individuals can increase. This is an expected artifact of Aalen's additive model.


Model selection in survival regression
=========================================
Model selection and calibration in survival regression
==========================================================

Parametric vs Semi-parametric models
Parametric vs semi-parametric models
---------------------------------------
Above, we've displayed two *semi-parametric* models (Cox model and Aalen's model), and a family of *parametric* models. Which should you choose? What are the advantages and disadvantages of either? I suggest reading the two following StackExchange answers to get a better idea of what experts think:

Expand Down Expand Up @@ -1147,6 +1147,30 @@ into a training set and a testing set fits itself on the training set and evalua
Also, lifelines has wrappers for `compatibility with scikit learn`_ for making cross-validation and grid-search even easier.


Model probability calibration
---------------------------------------------------

New in *lifelines* v0.24.11 is the :func:`~lifelines.calibration.survival_probability_calibration` function to measure your fitted survival model against observed frequencies of events. We follow the advice in "Graphical calibration curves and the integrated calibration index (ICI) for survival models" by P. Austin and co., and use create a smoothed calibration curve using a flexible spline regression model (this avoids the traditional problem of binning the continuous-valued probability, and handles censored data).


.. code:: python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
from lifelines.calibration import survival_probability_calibration
regression_dataset = load_rossi()
cph = CoxPHFitter(baseline_estimation_method="spline", n_baseline_knots=3)
cph.fit(rossi, "week", "arrest")
survival_probability_calibration(cph, rossi, t0=25)
.. image:: images/survival_calibration_probablilty.png




.. _Assessing Cox model fit using residuals: jupyter_notebooks/Cox%20residuals.html
.. _Testing the Proportional Hazard Assumptions: jupyter_notebooks/Proportional%20hazard%20assumption.html
.. _Custom Regression Models: jupyter_notebooks/Custom%20Regression%20Models.html
Expand Down
Binary file added docs/images/survival_calibration_probablilty.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
992 changes: 503 additions & 489 deletions docs/jupyter_notebooks/Custom Regression Models.ipynb

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,10 @@ def __init__(self, n_baseline_knots, *args, **kwargs):
super(CRCSplineFitter, self).__init__(*args, **kwargs)

def _create_initial_point(self, Ts, E, entries, weights, Xs):
return [
{
**{"beta_": np.zeros(len(Xs.mappings["beta_"])), "gamma0_": np.array([0.0]), "gamma1_": np.array([0.1])},
**{"gamma%d_" % i: np.array([0.0]) for i in range(2, self.n_baseline_knots)},
}
]
return {
**{"beta_": np.zeros(len(Xs.mappings["beta_"])), "gamma0_": np.array([0.0]), "gamma1_": np.array([0.1])},
**{"gamma%d_" % i: np.array([0.0]) for i in range(2, self.n_baseline_knots)},
}

def set_knots(self, T, E):
self.knots = np.percentile(np.log(T[E.astype(bool).values]), np.linspace(5, 95, self.n_baseline_knots))
Expand Down Expand Up @@ -111,6 +109,3 @@ def S(t, x):
cf = CRCSplineFitter(4).fit(df, "T", "E", regressors=regressors)
cf.print_summary()
cf.predict_hazard(df)[[0, 1, 2, 3]].plot()


cph = CoxPHFitter(baseline_estimation_method="spline").fit(df.drop("constant", axis=1), "T", "E")
25 changes: 12 additions & 13 deletions examples/royston_parmar_splines.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,9 +20,7 @@ def relu(x):

def basis(self, x, knot, min_knot, max_knot):
lambda_ = (max_knot - knot) / (max_knot - min_knot)
return self.relu(x - knot) ** 3 - (
lambda_ * self.relu(x - min_knot) ** 3 + (1 - lambda_) * self.relu(x - max_knot) ** 3
)
return self.relu(x - knot) ** 3 - (lambda_ * self.relu(x - min_knot) ** 3 + (1 - lambda_) * self.relu(x - max_knot) ** 3)


class PHSplineFitter(SplineFitter, ParametricRegressionFitter):
Expand All @@ -39,6 +37,14 @@ class PHSplineFitter(SplineFitter, ParametricRegressionFitter):

KNOTS = [0.1972, 1.769, 6.728]

def _create_initial_point(self, Ts, E, entries, weights, Xs):
return {
"beta_": np.zeros(len(Xs.mappings["beta_"])),
"phi0_": np.array([0.0]),
"phi1_": np.array([0.1]),
"phi2_": np.array([0.0]),
}

def _cumulative_hazard(self, params, T, Xs):
exp_Xbeta = np.exp(np.dot(Xs["beta_"], params["beta_"]))
lT = np.log(T)
Expand Down Expand Up @@ -73,8 +79,7 @@ def _cumulative_hazard(self, params, T, Xs):
+ (
params["phi0_"]
+ params["phi1_"] * lT
+ params["phi2_"]
* self.basis(lT, np.log(self.KNOTS[1]), np.log(self.KNOTS[0]), np.log(self.KNOTS[-1]))
+ params["phi2_"] * self.basis(lT, np.log(self.KNOTS[1]), np.log(self.KNOTS[0]), np.log(self.KNOTS[-1]))
)
)
)
Expand Down Expand Up @@ -124,11 +129,7 @@ def _cumulative_hazard(self, params, T, Xs):


# check PH(1) Weibull model (different parameterization from lifelines)
regressors = {
"beta_": ["binned_lp_(-8.257, -7.608]", "binned_lp_(-7.608, -3.793]"],
"phi0_": ["constant"],
"phi1_": ["constant"],
}
regressors = {"beta_": ["binned_lp_(-8.257, -7.608]", "binned_lp_(-7.608, -3.793]"], "phi0_": ["constant"], "phi1_": ["constant"]}
waf = WeibullFitter().fit(df[columns_needed_for_fitting], "T", "E", regressors=regressors).print_summary()


Expand All @@ -155,8 +156,6 @@ def _cumulative_hazard(self, params, T, Xs):

# looks like figure 2 from paper.
pof.predict_hazard(
pd.DataFrame(
{"binned_lp_(-8.257, -7.608]": [0, 0, 1], "binned_lp_(-7.608, -3.793]": [0, 1, 0], "constant": [1.0, 1, 1]}
)
pd.DataFrame({"binned_lp_(-8.257, -7.608]": [0, 0, 1], "binned_lp_(-7.608, -3.793]": [0, 1, 0], "constant": [1.0, 1, 1]})
).plot()
plt.show()
110 changes: 53 additions & 57 deletions examples/survival_calibration.py
Original file line number Diff line number Diff line change
@@ -1,76 +1,72 @@
# -*- coding: utf-8 -*-
"""
Smoothed calibration curves for time-to-event models

https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8570

"""
def survival_probability_calibration(model, training_df, t0):
"""
Smoothed calibration curves for time-to-event models
https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8570
def ccl(p):
return np.log(-np.log(1 - p))
"""

def ccl(p):
return np.log(-np.log(1 - p))

from lifelines.datasets import load_regression_dataset
T = model.duration_col
E = model.event_col

df = load_regression_dataset()
T = "T"
E = "E"
predictions_at_t0 = np.clip(1 - model.predict_survival_function(training_df, times=[t0]).T.squeeze(), 1e-10, 1 - 1e-10)

T_0 = 15
# create new dataset with the predictions
prediction_df = pd.DataFrame(
{"ccl_at_%d" % t0: ccl(predictions_at_t0), "constant": 1, T: model.durations, E: model.event_observed}
)

# fit new dataset to flexible spline model
# this new model connects prediction probabilities and actual survival. It should be very flexible, almost to the point of overfitting. It's goal is just to smooth out the data!
knots = 3
regressors = {"beta_": ["ccl_at_%d" % t0], "gamma0_": ["constant"], "gamma1_": ["constant"], "gamma2_": ["constant"]}

# fit original model and make survival predictions
cph = CoxPHFitter(baseline_estimation_method="spline", n_baseline_knots=4).fit(df, T, E)
predictions_at_T_0 = 1 - cph.predict_survival_function(df, times=[T_0]).T.squeeze()
cll_predictions_at_T_0 = ccl(predictions_at_T_0)
# this model is from examples/royson_crowther_clements_splines.py
crc = CRCSplineFitter(knots, penalizer=0)
if CensoringType.is_right_censoring(model):
crc.fit_right_censoring(prediction_df, T, E, regressors=regressors)
elif CensoringType.is_left_censoring(model):
crc.fit_left_censoring(prediction_df, T, E, regressors=regressors)
elif CensoringType.is_interval_censoring(model):
crc.fit_interval_censoring(prediction_df, T, E, regressors=regressors)

# create new dataset with the predictions
prediction_df = pd.DataFrame({"ccl_at_%d" % T_0: cll_predictions_at_T_0, "constant": 1, "week": df[T], "arrest": df[E]})
# predict new model at values 0 to 1, but remember to ccl it!
x = np.linspace(np.clip(predictions_at_t0.min() - 0.01, 0, 1), np.clip(predictions_at_t0.max() + 0.01, 0, 1), 100)
y = 1 - crc.predict_survival_function(pd.DataFrame({"ccl_at_%d" % t0: ccl(x), "constant": 1}), times=[t0]).T.squeeze()

# fit new dataset to flexible spline model
# this new model connects prediction probabilities and actual survival. It should be very flexible, almost to the point of overfitting. It's goal is just to smooth out the data!
fig, ax = plt.subplots()
# plot our results
ax.set_title("Smoothed calibration curve of \npredicted vs observed probabilities of t ≤ %d mortality" % t0)

regressors = {
"beta_": ["ccl_at_%d" % T_0],
"gamma0_": ["constant"],
"gamma1_": ["constant"],
"gamma2_": ["constant"],
"gamma3_": ["constant"],
}
# this model is from examples/royson_crowther_clements_splines.py
crc = CRCSplineFitter(4).fit(prediction_df, "week", "arrest", regressors=regressors)
color = "tab:red"
ax.plot(x, y, label="smoothed calibration curve, %d knots" % knots)
ax.set_xlabel("Predicted probability of \nt ≤ %d mortality" % t0)
ax.set_ylabel("Observed probability of \nt ≤ %d mortality" % t0, color=color)
ax.tick_params(axis="y", labelcolor=color)

# predict new model at values 0 to 1, but remember to ccl it!
x = np.linspace(np.clip(predictions_at_T_0.min() - 0.05, 0, 1), np.clip(predictions_at_T_0.max() + 0.05, 0, 1), 100)
y = 1 - crc.predict_survival_function(pd.DataFrame({"ccl_at_%d" % T_0: ccl(x), "constant": 1}), times=[T_0]).T.squeeze()
# plot x=y line
ax.plot(x, x, c="k", ls="--")
ax.legend()

# plot histogram of our original predictions
color = "tab:blue"
twin_ax = ax.twinx()
twin_ax.set_ylabel("Count of \npredicted probabilities", color=color) # we already handled the x-label with ax1
twin_ax.tick_params(axis="y", labelcolor=color)
twin_ax.hist(predictions_at_t0, alpha=0.3, bins="sqrt", color=color)

# plot our results
fig, ax = plt.subplots()
ax.set_title("Smoothed calibration curve of predicted vs observed probabilities")
plt.ylim(x[0], x[-1])
plt.xlim(x[0], x[-1])
plt.tight_layout()

deltas = ((1 - crc.predict_survival_function(prediction_df, times=[t0])).T.squeeze() - predictions_at_t0).abs()
ICI = deltas.mean()
E50 = np.percentile(deltas, 50)
print("ICI = ", ICI)
print("E50 = ", E50)

color = "tab:red"
ax.plot(x, y, label="smoothed calibration curve", color=color)
ax.set_xlabel("Predicted probability of \nt=%d mortality" % T_0)
ax.set_ylabel("Observed probability of \nt=%d mortality" % T_0, color=color)
ax.tick_params(axis="y", labelcolor=color)

# plot x=y line
ax.plot(x, x, c="k", ls="--")


# plot histogram of our original predictions
color = "tab:blue"
twin_ax = ax.twinx()
twin_ax.set_ylabel("histogram of \npredicted probabilities", color=color) # we already handled the x-label with ax1
twin_ax.tick_params(axis="y", labelcolor=color)


twin_ax.hist(predictions_at_T_0, alpha=0.3, bins="sqrt", color=color)

ax.legend()
plt.tight_layout()
return ax, ICI, E50
1 change: 1 addition & 0 deletions lifelines/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
from lifelines.fitters.generalized_gamma_regression_fitter import GeneralizedGammaRegressionFitter
from lifelines.fitters.spline_fitter import SplineFitter
from lifelines.fitters.mixture_cure_fitter import MixtureCureFitter
from lifelines.fitters.crc_spline_fitter import CRCSplineFitter


from lifelines.version import __version__
Expand Down
104 changes: 104 additions & 0 deletions lifelines/calibration.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from lifelines.utils import CensoringType
from lifelines.fitters import RegressionFitter
from lifelines import CRCSplineFitter


def survival_probability_calibration(model: RegressionFitter, training_df: pd.DataFrame, t0: float, ax=None):
r"""
Smoothed calibration curves for time-to-event models. This is analogous to
calibration curves for classification models, extended to handle survival probabilities
and censoring. Produces a matplotlib figure and some metrics.
We want to calibrate our model's prediction of :math:`P(T < \text{t0})` against the observed frequencies.
Parameters
-------------
model:
a fitted lifelines regression model to be evaluated
training_df: DataFrame
the DataFrame used to train the model
t0: float
the time to evaluate the probability of event occurring prior at.
Returns
----------
ax:
mpl axes
ICI:
mean absolute difference between predicted and observed
E50:
median absolute difference between predicted and observed
https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8570
"""

def ccl(p):
return np.log(-np.log(1 - p))

if ax is None:
ax = plt.gca()

T = model.duration_col
E = model.event_col

predictions_at_t0 = np.clip(1 - model.predict_survival_function(training_df, times=[t0]).T.squeeze(), 1e-10, 1 - 1e-10)

# create new dataset with the predictions
prediction_df = pd.DataFrame(
{"ccl_at_%d" % t0: ccl(predictions_at_t0), "constant": 1, T: model.durations, E: model.event_observed}
)

# fit new dataset to flexible spline model
# this new model connects prediction probabilities and actual survival. It should be very flexible, almost to the point of overfitting. It's goal is just to smooth out the data!
knots = 3
regressors = {"beta_": ["ccl_at_%d" % t0], "gamma0_": ["constant"], "gamma1_": ["constant"], "gamma2_": ["constant"]}

# this model is from examples/royson_crowther_clements_splines.py
crc = CRCSplineFitter(knots, penalizer=0)
if CensoringType.is_right_censoring(model):
crc.fit_right_censoring(prediction_df, T, E, regressors=regressors)
elif CensoringType.is_left_censoring(model):
crc.fit_left_censoring(prediction_df, T, E, regressors=regressors)
elif CensoringType.is_interval_censoring(model):
crc.fit_interval_censoring(prediction_df, T, E, regressors=regressors)

# predict new model at values 0 to 1, but remember to ccl it!
x = np.linspace(np.clip(predictions_at_t0.min() - 0.01, 0, 1), np.clip(predictions_at_t0.max() + 0.01, 0, 1), 100)
y = 1 - crc.predict_survival_function(pd.DataFrame({"ccl_at_%d" % t0: ccl(x), "constant": 1}), times=[t0]).T.squeeze()

# plot our results
ax.set_title("Smoothed calibration curve of \npredicted vs observed probabilities of t ≤ %d mortality" % t0)

color = "tab:red"
ax.plot(x, y, label="smoothed calibration curve")
ax.set_xlabel("Predicted probability of \nt ≤ %d mortality" % t0)
ax.set_ylabel("Observed probability of \nt ≤ %d mortality" % t0, color=color)
ax.tick_params(axis="y", labelcolor=color)

# plot x=y line
ax.plot(x, x, c="k", ls="--")
ax.legend()

# plot histogram of our original predictions
color = "tab:blue"
twin_ax = ax.twinx()
twin_ax.set_ylabel("Count of \npredicted probabilities", color=color) # we already handled the x-label with ax1
twin_ax.tick_params(axis="y", labelcolor=color)
twin_ax.hist(predictions_at_t0, alpha=0.3, bins="sqrt", color=color)

plt.tight_layout()

deltas = ((1 - crc.predict_survival_function(prediction_df, times=[t0])).T.squeeze() - predictions_at_t0).abs()
ICI = deltas.mean()
E50 = np.percentile(deltas, 50)
print("ICI = ", ICI)
print("E50 = ", E50)

return ax, ICI, E50
Loading

0 comments on commit 40ccdf1

Please sign in to comment.