Methods for estimating the prevalence of cases in a mixture population based on genetic risk scores.
This repository contains the Python 3 implementation of the proportion estimation algorithms. The following instructions assume that you have a working Python (>=3.6) installation obtained either through Anaconda (recommended) or through other means (which requires pip).
A separate Matlab implementation is provided in the `matlab` subdirectory with its own instructions.
- Install Python >= 3.6.
- Clone this repository and add this folder to your `PYTHONPATH`: `git clone https://github.com/bdevans/DPE.git`
- Install the requirements using one of the provided requirements files:
  - If you use a standard Python distribution: `pip3 install -r requirements.txt`.
  - Or, if you use Anaconda: `conda env create -f environment.yml`. Then run `source activate dpe` to activate the environment.
This should take approximately 1-2 minutes to complete, depending on the speed of your internet connection and the number of dependencies already satisfied.
The examples given were tested with Python 3.8.2 on macOS 11.4 (20F71) running in a Docker container (version 20.10.7, build f0df350). The exact packages installed with `conda` during testing are given in `.devcontainer/frozen_environment.yml`.
Three main code files are relevant for review purposes:

- `scripts/run_examples.py`: the main script for applying the proportion estimation algorithms to the example data sets.
- `dpe/datasets.py`: utilities for generating, loading and saving synthetic data sets.
- `dpe/estimate.py`: the main routines for estimating proportions in a mixture distribution.
Once the requirements are installed (and the environment activated, if necessary), run the example script with:
python scripts/run_examples.py
The proportion estimates and confidence intervals are then generated, with a plain text summary printed to the console (when `verbose > 0`) and written to a log file (named after the data set file with a `.log` extension).
The analysis parameters may be changed by editing the `run_examples.py` script. It is recommended to keep the total number of mixture bootstraps (`n_mix * n_boot`) below around 10,000 when using the KDE method, as larger values may take a long time to finish.
Optionally, the file `datasets.py` may be edited to change the construction parameters and generate new data sets to analyse.
When running with the parameters `N_M = 100` and `N_B = 100` on a 2015 15" MacBook Pro (2.8 GHz Intel Core i7) with 8 threads, processing each data set takes around 40 minutes. The majority of this time is spent running the KDE algorithm (the other three methods take approximately 1 minute each). This run time increases considerably with the number of mixtures generated (`n_mix`) and/or the number of bootstraps generated for each mixture (`n_boot`). Accordingly, a short simulation for demonstration purposes has been configured in `run_examples.py` by using fewer bootstraps (`N_B = 100`) than in the manuscript (`N_B = 1000`), although this can be edited for longer runs if desired. The output produced for the first data set (`p_C = 0.25`) is given below for `N_M = 100` and `N_B = 100`, parallelised across 8 threads:
================================================================================
Running on example dataset: p_C = 0.25
Loading dataset: example_pC025...
Running 100 bootstraps with 8 processors...
Method: 100%|████████████████████████████████████| 4/4 [19:41<00:00, 295.33s/it]
Method | Estimated p_C | Estimated p_N
======================================================
Excess point | 0.18200 | 0.81800
Excess (µ±σ) | 0.13660 +/- 0.018 | 0.86340 +/- 0.018
Bias | -0.04560 | 0.04560
C.I. (95.0%) | 0.19040 , 0.20320 | 0.79680 , 0.80960
------------------------------------------------------
Means point | 0.24279 | 0.75721
Means (µ±σ) | 0.24264 +/- 0.020 | 0.75736 +/- 0.020
Bias | -0.00026 | 0.00026
C.I. (95.0%) | 0.20433 , 0.28278 | 0.71722 , 0.79567
------------------------------------------------------
EMD point | 0.24313 | 0.75687
EMD (µ±σ) | 0.24399 +/- 0.019 | 0.75601 +/- 0.019
Bias | 0.00093 | -0.00093
C.I. (95.0%) | 0.20334 , 0.27881 | 0.72119 , 0.79666
------------------------------------------------------
KDE point | 0.24843 | 0.75157
KDE (µ±σ) | 0.25001 +/- 0.022 | 0.74999 +/- 0.022
Bias | 0.00136 | -0.00136
C.I. (95.0%) | 0.20503 , 0.29097 | 0.70903 , 0.79497
------------------------------------------------------
Ground Truth | 0.25000 | 0.75000
======================================================
Elapsed time = 2503.352 seconds
================================================================================
Additionally, a `results` directory will be created with a subdirectory for each data set processed, containing a `csv` file with the initial point estimates and bootstrap estimates for each method.
The results are reproducible by default since a seed (`42`) is set for the pseudo-random number generator. This seed may be changed to any integer in the range `[0, 2^32)` (or set to `None` for a random seed) to explore variations in the results due to stochasticity in sampling.
The main requirement is to prepare a dictionary (`dict`) containing the keys `R_C`, `R_N` and `Mix`. The associated values should be one-dimensional arrays (or lists) of the GRS scores for the "Cases Reference" (`R_C`) distribution, the "Non-cases Reference" (`R_N`) distribution and the "Mixture" (`Mix`) distribution, respectively.
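For example, a minimal sketch of such a dictionary built from synthetic scores (all values below are illustrative only, not taken from the study data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative synthetic GRS scores; in practice these would be measured scores
scores = {
    "R_C": rng.normal(0.27, 0.03, size=1000),   # cases reference
    "R_N": rng.normal(0.23, 0.03, size=2000),   # non-cases reference
    "Mix": rng.normal(0.24, 0.03, size=5000),   # mixture with an unknown proportion of cases
}
```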
Alternatively, a `csv` file may be prepared and loaded with the function `datasets.load_dataset(filename)`, as demonstrated in the `run_examples.py` script. The `csv` file should contain the header `Group,GRS` followed by pairs of `code,score` values (one per line for each GRS score), where the code is `1` for the Cases Reference distribution, `2` for the Non-cases Reference distribution and `3` for the Mixture distribution.
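Assuming such a file has been prepared (the filename below is illustrative) and the `dpe` package is on your `PYTHONPATH`, it can be loaded like this:

```python
from dpe import datasets

# The file must have a "Group,GRS" header and one "code,score" pair per line,
# where the code is 1 (cases reference), 2 (non-cases reference) or 3 (mixture)
scores = datasets.load_dataset("example_pC025.csv")  # illustrative filename
```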
Once the GRS scores have been prepared in a suitable form, they may be passed to the `analyse_mixture()` function, as demonstrated in the `run_examples.py` script. Further details about this function are given in the following sections.
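A minimal sketch of such a call is shown below. The import path is assumed (`run_examples.py` demonstrates the canonical usage); all arguments are from the documented signature, with values matching the short demonstration run above.

```python
from dpe.estimate import analyse_mixture  # import path assumed; see run_examples.py

summary, bootstraps = analyse_mixture(
    scores,          # dict with keys "R_C", "R_N" and "Mix"
    bins="fd",       # Freedman-Diaconis binning
    methods="all",   # run Excess, Means, EMD and KDE
    n_boot=100,      # bootstraps per mixture (N_B)
    n_mix=100,       # Monte Carlo mixtures (N_M)
    alpha=0.05,      # 95% confidence intervals
    seed=42,         # reproducible sampling
    n_jobs=-1,       # use all available CPUs
    true_pC=0.25,    # optional ground truth for comparison
)

print(summary["EMD"]["p_C"])  # point estimate of the proportion of cases
```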
The main algorithm is as follows. Details of the specific methods may be found in the methods section of the accompanying manuscript.
bins <-- generate_bins({R_C, R_N, Mix}, method='fd') # Freedman-Diaconis
p^hat <-- {} # Dictionary (hashtable) of proportion estimates
p^cor <-- {} # Dictionary (hashtable) of corrected proportion estimates
p^mbe <-- {} # Dictionary (hashtable) of mixture-bootstrap estimates
CI <-- {} # Dictionary (hashtable) of confidence intervals (2-tuple)
for meth in ["Excess", "Means", "EMD", "KDE"]:
p^hat[meth] <-- get_point_estimate({R_C, R_N, Mix}, bins, meth) # Proportion of cases
# Calculate confidence intervals around the initial proportion estimates
p^mbe[meth] <-- [] # Create empty list of estimates for each method
for m in 1 to N_M: # Number of Monte Carlo mixtures
Mix_meth_m <-- get_monte_carlo_mixture({R_C, R_N}, p^hat[meth])
for b in 1 to N_B: # Number of bootstraps
Mix_meth_m_b <-- get_bootstrap({Mix_meth_m}, bins) # Sample with replacement
p_meth_m_b <-- get_point_estimate({R_C, R_N, Mix_meth_m_b}, bins, meth)
p^mbe[meth].append(p_meth_m_b)
end
end
p^cor[meth] <-- correct_estimate(p^hat[meth], p^mbe[meth]) # Use the set of all bootstraps of all mixtures for each method
CI[meth] <-- get_confidence_intervals(p^cor[meth], p^mbe[meth], alpha=0.05)
end
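To make the last two steps concrete, here is a minimal `numpy` sketch of a plain bootstrap bias correction and a recentred percentile interval. These are illustrative stand-ins rather than the package's implementation: `analyse_mixture()` defaults to the BCa method for confidence intervals (`ci_method="bca"`).

```python
import numpy as np

def correct_estimate(p_hat, boot_estimates):
    # Plain bootstrap bias correction: subtract (mean of bootstraps - point estimate)
    bias = np.mean(boot_estimates) - p_hat
    return float(np.clip(p_hat - bias, 0.0, 1.0))

def get_confidence_intervals(p_corrected, boot_estimates, alpha=0.05):
    # Recentred percentile interval: shift the bootstrap spread onto the corrected estimate
    lower, upper = np.quantile(boot_estimates, [alpha / 2, 1 - alpha / 2])
    centre = np.mean(boot_estimates)
    return (float(np.clip(p_corrected + lower - centre, 0.0, 1.0)),
            float(np.clip(p_corrected + upper - centre, 0.0, 1.0)))
```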
def analyse_mixture(scores, bins='fd', methods='all',
n_boot=1000, boot_size=-1, n_mix=0, alpha=0.05,
ci_method="bca", correct_bias=False, seed=None,
n_jobs=1, verbose=1, true_pC=None, logfile=''):
- `scores` (`dict`): A required dictionary of the form `{'R_C': array_of_cases_scores, 'R_N': array_of_non-cases_scores, 'Mix': array_of_mixture_scores}`.
- `bins` (`str`): A string specifying the binning method: `['auto', 'fd', 'doane', 'scott', 'rice', 'sturges', 'sqrt']`. Default: `'fd'`. Alternatively, a dictionary: `{'width': bin_width, 'min': min_edge, 'max': max_edge, 'edges': array_of_bin_edges, 'centers': array_of_bin_centers, 'n': number_of_bins}`.
- `methods` (`str`): A string with the name of the method, or `'all'` to run all methods (default). Alternatively, a list of method names (strings), `["Excess", "Means", "EMD", "KDE"]`, or a dictionary of (bool) flags, `{'Excess': True, 'Means': True, 'EMD': True, 'KDE': True}`.
- `n_boot` (`int`): Number of bootstraps of the mixture to generate. Default: `1000`.
- `boot_size` (`int`): The size of each mixture bootstrap. Default is the same size as the mixture.
- `n_mix` (`int`): Number of mixtures to construct based on the initial point estimate. Default: `0`.
- `alpha` (`float`): The alpha value for calculating confidence intervals from the bootstrap distributions. Default: `0.05`.
- `ci_method` (`str`): The name of the method used to calculate the confidence intervals. Default: `'bca'`.
- `correct_bias` (`bool`): A boolean flag specifying whether to apply the bootstrap correction method. Default: `False`.
- `seed` (`int`): An optional value to seed the random number generator (in the range `[0, (2^32)-1]`) for reproducibility of the sampling used for confidence intervals. Default: `None`.
- `n_jobs` (`int`): Number of bootstrap jobs to run in parallel. Default: `1` (`n_jobs = -1` runs on all CPUs).
- `verbose` (`int`): Integer to control the level of output (`0`, `1`, `2`). Set to `-1` to turn off all console output except the progress bars.
- `true_pC` (`float`): Optionally pass the true proportion to show a comparison with the estimated proportion(s).
- `logfile` (`str`): Optional filename for the output logs. Default: `"proportion_estimates.log"`.
`(summary, bootstraps)` (`tuple`): A tuple consisting of the following data structures.

- `summary` (`dict`): A nested dictionary with a key for each estimation method, within which is a dictionary with the following keys:
  - `p_C`: the prevalence estimate

  Optionally, if bootstrapping is used:
  - `CI`: the confidence intervals around the prevalence
  - `mean`: the mean of the bootstrapped estimates
  - `std`: the standard deviation of the bootstrapped estimates
  - `p_cor_C`: the corrected prevalence estimate, when `correct_bias == True`
- `bootstraps` (`DataFrame`): A `pandas` dataframe of the proportion estimates. The first row is the point estimate. The remaining `n_boot * n_mix` rows are the bootstrapped estimates. Each column is named after the corresponding estimation method.
Additionally, the logfile is written to the working directory.
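As a usage sketch (assuming `summary` and `bootstraps` were returned by the call shown earlier, with bootstrapping enabled), the returned structures can be inspected like this:

```python
# Inspect the returned structures from analyse_mixture()
for method, stats in summary.items():
    print(f"{method}: p_C = {stats['p_C']:.3f}, CI = {stats.get('CI')}")

print(bootstraps.describe())  # pandas summary of the bootstrap estimates per method
```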
We bootstrap (i.e. sample with replacement) from the available data, systematically varying sample size and mixture proportion. We then apply the following methods to yield estimates of the true proportion and, from those, calculate the errors throughout the bootstrap parameter space.
- Excess

  ```python
  number_low = len(RM[RM <= population_median])
  number_high = len(RM[RM > population_median])
  p_hat_C = abs(number_high - number_low) / len(RM)
  p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
  ```
- Means

  ```python
  mu_C, mu_N = methods["Means"]["mu_C"], methods["Means"]["mu_N"]
  if mu_C > mu_N:  # This should be the case
      p_hat_C = (RM.mean() - mu_N) / (mu_C - mu_N)
  else:
      p_hat_C = (mu_N - RM.mean()) / (mu_N - mu_C)
  p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
  ```
- EMD

  $$EMD(P,Q)={\frac {\sum _{i=1}^{m}\sum _{j=1}^{n}f_{i,j}d_{i,j}}{\sum _{i=1}^{m}\sum _{j=1}^{n}f_{i,j}}}$$

  ```python
  def interpolate_CDF(scores, x_i, min_edge, max_edge):
      """Interpolate the cumulative density function of the scores
      at the points in the array `x_i`.
      """
      x = [min_edge, *sorted(scores), max_edge]
      y = np.linspace(0, 1, num=len(x), endpoint=True)
      (iv, ii) = np.unique(x, return_index=True)
      return np.interp(x_i, iv, y[ii])

  # Interpolate the CDFs at the same points for comparison
  CDF_1 = interpolate_CDF(scores["R_C"], bins['centers'], bins['min'], bins['max'])
  CDF_2 = interpolate_CDF(scores["R_N"], bins['centers'], bins['min'], bins['max'])
  CDF_M = interpolate_CDF(RM, bins['centers'], bins['min'], bins['max'])

  # EMDs computed with interpolated CDFs
  EMD_1_2 = sum(abs(CDF_1 - CDF_2))
  EMD_M_1 = sum(abs(CDF_M - CDF_1))
  EMD_M_2 = sum(abs(CDF_M - CDF_2))

  p_hat_C = 0.5 * (1 + (EMD_M_2 - EMD_M_1) / EMD_1_2)
  ```
- KDE
  - Convolve a Gaussian kernel (bandwidth set with the FD method by default) with each datum for each reference population (`R_C` and `R_N`) to produce two distribution templates: `kde_R_C` and `kde_R_N`.
  - Create a new model which is the weighted sum of these two distributions (initialised to equal amplitudes: `amp_R_C = amp_R_N = 1`): `y = amp_R_C * kde_R_C + amp_R_N * kde_R_N`.
  - Convolve the same kernel with the unknown mixture data to form `kde_M`.
  - Fit the combined model to the smoothed mixture, allowing the amplitude of each reference distribution to vary, and optimise with the least-squares algorithm: `p_hat_C = amp_R_C / (amp_R_C + amp_R_N)`.

  ```python
  def fit_KDE_model(Mix, bins, model, params_mix, kernel, method='leastsq'):
      x = bins["centers"]
      bw = bins['width']
      kde_mix = KernelDensity(kernel=kernel, bandwidth=bw).fit(Mix[:, np.newaxis])
      res_mix = model.fit(np.exp(kde_mix.score_samples(x[:, np.newaxis])),
                          x=x, params=params_mix, method=method)
      amp_R_C = res_mix.params['amp_1'].value
      amp_R_N = res_mix.params['amp_2'].value
      return amp_R_C / (amp_R_C + amp_R_N)
  ```
Note: The population of cases is sometimes referred to as Reference 1 and the population of non-cases as Reference 2. Accordingly, these notations may be used interchangeably:

- `R_C == Ref1`, `R_N == Ref2`
- `p_C == p1`, `p_N == p2`
|               | Excess | Means | EMD | KDE |
|---------------|--------|-------|-----|-----|
| Advantages    | Simple to calculate. | Simple to calculate. | Computationally cheap. | Accurate. |
|               | Relies only on summary statistics. | Relies only on summary statistics. | | |
|               | | Works for even highly overlapping distributions. | | |
| ------------- | ------------ | ------------ | ------------ | ------------ |
| Disadvantages | Less accurate when the distributions significantly overlap. | Less accurate when data is not normally distributed. | Requires a choice of bin size. | Requires a choice of bandwidth and kernel. |
|               | Not recommended for data analysis (included for comparison only). | | | Computationally expensive. |
If you use this software, please cite our accompanying publication:
@article{EvansSlowinskiEstimatingDiseasePrevalence2021,
author = {Evans, Benjamin D. and S{\l}owi{\'n}ski, Piotr and Hattersley, Andrew T. and Jones, Samuel E. and Sharp, Seth and Kimmitt, Robert A. and Weedon, Michael N. and Oram, Richard A. and Tsaneva-Atanasova, Krasimira and Thomas, Nicholas J.},
title = {Estimating disease prevalence in large datasets using genetic risk scores},
volume = {12},
number = {1},
pages = {6441},
journal = {Nature Communications},
issn = {2041-1723},
year = {2021},
month = nov,
url = {https://www.nature.com/articles/s41467-021-26501-7},
doi = {10.1038/s41467-021-26501-7},
}