pyCopCor is a framework that implements different copula-based correlation approaches. The framework is a result of my dissertation; if you use any of these methods, you are welcome to cite it. While I created the empirical approach, the other approaches are based on other works, which you'll find in the References section. However, I did spend some time parallelising and optimising some of the methods for modern CPUs supporting AVX2.
- Marginals
- Variable Selection based on the Work of Schweizer and Wolff
- Copula Entropy
- Dissertation (Cite)
- References
To work with the copula functions, you first need the marginals of your data. For 1-D numpy arrays X, Y, and Z, these can be computed using:
import pycopcor.marginal as pcm
fx_0 = pcm.density(X[:])
fx_1 = pcm.density(Y[:])
fx_2 = pcm.density(Z[:])
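For instance, the arrays could hold synthetic data (an illustrative sketch; the distributions and the sample size are arbitrary choices, not requirements of pyCopCor):

import numpy

n = 10_000                                   # illustrative sample size
rng = numpy.random.default_rng(42)
X = rng.normal(0, 1, n)                      # standard normal sample
Y = 0.8 * X + 0.6 * rng.normal(0, 1, n)      # linearly dependent on X
Z = rng.uniform(0, 1, n)                     # independent of X and Y

The resulting marginals fx_0, fx_1, and fx_2 are reused in the examples below.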
One promising approach for dependence or correlation analysis was created by Schweizer and Wolff [1], based on the dissertation of Wolff [2]. Using copulas, they described a set of measures of dependence:
- Spearman's $\rho = 12 \int_0^1 \int_0^1 (C(u,v)-uv) \, du \, dv$
- $\gamma = \left(90 \int_0^1 \int_0^1 (C(u,v)-uv)^2 \, du \, dv\right)^{\frac{1}{2}}$
- $\sigma = 12 \int_0^1 \int_0^1 |C(u,v)-uv| \, du \, dv$

where $C(u,v)$ is the copula of the two variables and $uv$ is the copula of two independent variables.
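To make the definitions concrete, here is a plain-numpy sketch (independent of pyCopCor) that approximates the three integrals with Riemann sums over the empirical copula on a grid; the function names and grid resolution are illustrative:

import numpy
from scipy.stats import rankdata

def empirical_copula_grid(x, y, m=50):
    # Evaluate the empirical copula C(u, v) on an m-by-m grid.
    n = len(x)
    u = rankdata(x) / n   # pseudo-observations in (0, 1]
    v = rankdata(y) / n
    grid = numpy.arange(1, m + 1) / m
    # C(a, b) = (1/n) * #{i : u_i <= a and v_i <= b}
    C = ((u[None, None, :] <= grid[:, None, None])
         & (v[None, None, :] <= grid[None, :, None])).mean(axis=2)
    return C, grid

C, grid = empirical_copula_grid(X, Y)       # synthetic X, Y from above
Pi = grid[:, None] * grid[None, :]          # independence copula uv
rho = 12 * numpy.mean(C - Pi)
sigma = 12 * numpy.mean(numpy.abs(C - Pi))
gamma = numpy.sqrt(90 * numpy.mean((C - Pi) ** 2))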
Seth and Príncipe [3] proposed using such dependence measures for variable selection. With pyCopCor, the measures can be computed from the marginals:
import pycopcor.copula.wolff as pcw
pcw.spearman(fx_0,fx_1)
pcw.sigma(fx_0,fx_1)
pcw.gamma(fx_0,fx_1)
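Continuing the synthetic example from above, the dependent pair (X, Y) should score clearly higher than the independent pair (X, Z); the exact values depend on the sample:

pcw.sigma(fx_0, fx_1)   # X and Y are dependent: expected well above zero
pcw.sigma(fx_0, fx_2)   # X and Z are independent: expected close to zero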
More recent work from Blumentritt and Schmid [4] and from Ma and Sun [5] showed that copula entropy is mutual information for continuous-valued variables. Two ways to calculate the copula entropy exist: using histograms, as done in [4,5], or using the empirical copula, as shown in my dissertation [0, p. 41]. Both approaches have benefits and drawbacks, some of which are shown in the notebooks folder.
As the theory around the histogram approach is well covered, I focus on the empirical approach. The copula entropy is based on the copula density, the derivative of the copula:

$c(u,v) = \frac{\partial^2 C(u,v)}{\partial u \, \partial v}$
However, the empirical copula is a step function and therefore not differentiable. Its steps can be smoothed with the sigmoid function, which has a convenient derivative:
$\sigma(x) = \frac{1}{1+e^{-x}}$

$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)\sigma(-x)$
So the empirical copula $\hat{C}(u,v) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(u_i \leq u)\,\mathbf{1}(v_i \leq v)$ can be smoothed by replacing the indicator functions $\mathbf{1}(\cdot)$ with sigmoids, which allows the computation of the derivative and thereby the calculation of the copula entropy.
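A minimal sketch of this idea (not pyCopCor's internal implementation; the pseudo-observation arrays u_i, v_i and the bandwidth h are illustrative assumptions):

import numpy

def sigmoid(x):
    return 1.0 / (1.0 + numpy.exp(-x))

def smoothed_copula(u_i, v_i, u, v, h=0.05):
    # Empirical copula with indicators replaced by sigmoids of bandwidth h.
    return numpy.mean(sigmoid((u - u_i) / h) * sigmoid((v - v_i) / h))

def smoothed_copula_density(u_i, v_i, u, v, h=0.05):
    # Mixed partial derivative of the smoothed copula, using
    # d/dx sigmoid(x) = sigmoid(x) * sigmoid(-x).
    su = sigmoid((u - u_i) / h) * sigmoid(-(u - u_i) / h) / h
    sv = sigmoid((v - v_i) / h) * sigmoid(-(v - v_i) / h) / h
    return numpy.mean(su * sv)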
As also shown in my dissertation [0, p. 39 ff.], the copula entropy corresponds to the mutual information of the involved variables.
Please be aware that normalisation plays an important role in this scenario. Normalising to the numerical maximum appears most similar to the traditional mutual information, as shown in the notebook section; the results of my dissertation indicate the same.
Please be aware that Blumentritt and Schmid [4] suggested a different normalisation, related to the Gaussian copula, which has its own benefits. The higher-dimensional version of this normalisation is shown in [4].
Compute the copula entropy:
import pycopcor.copula.entropy as pce
# histogram based approaches
pce.histogram_2d(fx_0,fx_1)
pce.histogram_3d(fx_0,fx_1,fx_2)
# empirical approaches
pce.empirical_2d(fx_0,fx_1)
pce.empirical_3d(fx_0,fx_1,fx_2)
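Continuing the synthetic example (assuming fx_0, fx_1, fx_2 from the marginals section), the dependent pair should yield a noticeably larger value than the independent pair; exact numbers depend on the sample:

i_xy = pce.empirical_2d(fx_0, fx_1)   # X and Y are dependent
i_xz = pce.empirical_2d(fx_0, fx_2)   # X and Z are independent; expected smaller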
Normalise the copula entropy:

import numpy
import scipy.optimize
import pycopcor.copula.entropy as pce
import pycopcor.marginal as pcm
# for 2d data:
## based on the numerical maximum, as in [0]
n = 10_000  # number of samples; choose to match your data
rng = numpy.random.default_rng()
ref = rng.normal(0,1,n)
f_ref = pcm.density(ref)
i_cd_ref_e = pce.empirical_2d(f_ref,f_ref)
norm_max_e = lambda x: x/i_cd_ref_e
i_cd_ref_h = pce.histogram_2d(f_ref,f_ref)
norm_max_h = lambda x: x/i_cd_ref_h
## based on gaussian copula, as in [4]: inverts I = -1/2 * ln(1 - rho^2),
## the mutual information of a bivariate Gaussian, to get the equivalent correlation
norm_gauss = lambda x: numpy.sqrt(1 - numpy.exp(-2*x))
# for 3d data
## based on the numerical maximum, as in [0]
rng = numpy.random.default_rng()
ref = rng.normal(0,1,n)
f_ref = pcm.density(ref)
i_cd_ref_e = pce.empirical_3d(f_ref,f_ref,f_ref)
norm_max_e = lambda x: x/i_cd_ref_e
i_cd_ref_h = pce.histogram_3d(f_ref,f_ref,f_ref)
norm_max_h = lambda x: x/i_cd_ref_h
## based on gaussian copula, as in [4]: inverts the mutual information of a
## 3-dimensional equicorrelated Gaussian, -1/2 * ln((1-p)^(d-1) * (1+(d-1)*p))
def norm_gauss_3d(val):
    d_1 = 3 - 1  # dimension minus one
    def f(p):
        return -1/2 * numpy.log((1 - p)**d_1 * (1 + d_1*p)) - val
    ret = scipy.optimize.root_scalar(f, bracket=(0, 1))
    if not ret.converged:
        raise RuntimeError("Not Converged")
    return ret.root
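Putting it together for the 3-D case (reusing fx_0, fx_1, and fx_2 from the marginals example; an illustrative sketch):

i_xyz = pce.empirical_3d(fx_0, fx_1, fx_2)
print(norm_max_e(i_xyz))      # normalised to the numerical maximum, as in [0]
print(norm_gauss_3d(i_xyz))   # normalised via the Gaussian copula, as in [4]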
Feel free to open a ticket on GitHub or email me if you have any questions.
[0] Gocht-Zech, Andreas
Ein Framework zur Optimierung der Energieeffizienz von HPC-Anwendungen auf der Basis von Machine-Learning-Methoden (2022)
Dissertation, Technische Universität Dresden
https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-819405
[1] Schweizer, B. and Wolff, Edward F.
On Nonparametric Measures of Dependence for Random Variables (1981)
The Annals of Statistics, Vol. 9, No. 4
DOI: 10.1214/aos/1176345528
[2] Wolff, Edward F.
Measures of dependence derived from copulas (1977)
Dissertation
https://search.proquest.com/docview/302846303
[3] Seth, Sohan and Príncipe, José Carlos
Variable Selection: A Statistical Dependence Perspective (2010)
2010 Ninth International Conference on Machine Learning and Applications
DOI: 10.1109/ICMLA.2010.148
[4] Blumentritt, Thomas and Schmid, Friedrich
Mutual information as a measure of multivariate association: analytical properties and statistical estimation (2012)
Journal of Statistical Computation and Simulation, Vol. 82, No. 9
DOI: 10.1080/00949655.2011.575782
[5] Ma, J. and Sun, Z.
Mutual information is copula entropy (2011)
Tsinghua Science and Technology, Vol. 16, No. 1
DOI: 10.1016/S1007-0214(11)70008-6
[6] Timme, Nicholas and Alford, Wesley and Flecker, Benjamin and Beggs, John M.
Synergy, redundancy, and multivariate information measures: an experimentalist's perspective (2014)
Journal of Computational Neuroscience, Vol. 36, No. 2
DOI: 10.1007/s10827-013-0458-4
[7] Brown, Gavin and Pocock, Adam and Zhao, Ming-Jie and Luján, Mikel
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection (2012)
Journal of Machine Learning Research, Vol. 13
https://www.jmlr.org/papers/volume13/brown12a/brown12a.pdf
[8] Gocht, A. and Lehmann, C. and Schöne, R.
A New Approach for Automated Feature Selection (2018)
2018 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/BigData.2018.8622548