This package provides the first scalable implementation of Vector Quantile Regression (VQR), ready for large real-world datasets. In addition, it provides a powerful extension which makes VQR non-linear in the covariates, via a learnable transformation. The package is easy to use via a familiar sklearn-style API.
Refer to our paper [1] for further details about nonlinear VQR, and please cite our work if you use this package:
@article{rosenberg2022fast,
title={Fast Nonlinear Vector Quantile Regression},
author={Rosenberg, Aviv A and Vedula, Sanketh and Romano, Yaniv and Bronstein, Alex M},
journal={arXiv preprint arXiv:2205.14977},
year={2022}
}
Quantile regression [2] (QR) is a well-known method which estimates a conditional quantile of a target variable $\text{Y}$, given covariates $\mathbf{X}$.
Vector quantiles extend the notion of quantiles to high-dimensional variables [3].
Vector quantile regression (VQR) is the estimation of the conditional vector quantile function $Q_{\mathbf{Y}|\mathbf{X}}$ from samples of $(\mathbf{X},\mathbf{Y})$.
VQR is a highly general approach, as it allows for assumption-free estimation of the conditional vector quantile function, which is a fundamental quantity that fully represents the distribution of $\mathbf{Y}$ given $\mathbf{X}$.
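To make the scalar starting point concrete, here is a minimal sketch of how a single quantile can be obtained by minimizing the pinball (check) loss, which is the objective underlying classical quantile regression. For simplicity the sketch has no covariates and uses a grid search; the function name `pinball_loss`, the toy data and the grid are illustrative assumptions and are not part of this package.

```python
import numpy as np


def pinball_loss(y, q, tau):
    """Average pinball (check) loss of predicting q as the tau-th quantile of y."""
    diff = y - q
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))


rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=10_000)  # skewed toy target, no covariates
tau = 0.9

# Minimize the loss over a grid of candidate quantile values.
candidates = np.linspace(y.min(), y.max(), 2_001)
losses = np.array([pinball_loss(y, q, tau) for q in candidates])
q_hat = candidates[np.argmin(losses)]

print(f"pinball minimizer:  {q_hat:.3f}")
print(f"empirical quantile: {np.quantile(y, tau):.3f}")
```

The two printed values should agree up to the grid resolution. Linear QR minimizes the same loss over linear functions of the covariates instead of over a single constant.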
Below is an illustration of vector quantiles of a 2d star-shaped distribution.
- Data is sampled uniformly from a 2d star-shaped region (middle, gray dots).
- Vector quantiles are overlaid on their data distribution (middle, colored dots).
- The vector quantile function (VQF) $Q_{\mathbf{Y}}: [0,1]^d \to \mathbb{R}^d$ is a mapping which satisfies:
  - Strong representation: $\mathbf{Y}=Q_{\mathbf{Y}}(\mathbf{U})$ where $\mathbf{U}\sim\mathbb{U}[0,1]^d$.
  - Co-monotonicity: $(Q_{\mathbf{Y}}(\boldsymbol{u})-Q_{\mathbf{Y}}(\boldsymbol{u}'))^{\top}(\boldsymbol{u}-\boldsymbol{u}')\geq 0$.
- Different colors correspond to $\alpha$-contours, each containing $100\cdot(1-2\alpha)^d$ percent of the data, a generalization of confidence intervals for vector-valued variables.
  - For example, for $\alpha=0.02$, roughly 92% of the data is contained within the contour.
  - The shape of the distribution is correctly modelled, without any distributional assumptions.
- For $Q_{\mathbf{Y}}(\boldsymbol{u})=[Q_1(\boldsymbol{u}),Q_2(\boldsymbol{u})]^{\top}$ and $\boldsymbol{u}=(u_1,u_2)$, the components $Q_1, Q_2$ of the VQF are depicted as surfaces (left, right) with the corresponding vector quantiles overlaid.
  - On $Q_1$, increasing $u_1$ for a fixed $u_2$ produces a monotonically increasing curve.
  - This corresponds to a quantile function for $\text{Y}_1$ given that $\text{Y}_2$ is at a value corresponding to its $u_2$-th quantile (and vice versa for $Q_2$).
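The properties listed above can be checked numerically on a toy example. The sketch below uses a hand-written map `toy_vqf` (an assumption made for illustration, not something provided by this package) that is the gradient of a convex potential and is therefore co-monotone. It also assumes that the $\alpha$-contour is the image under the VQF of the boundary of the box $[\alpha, 1-\alpha]^d$, so the fraction of data inside the contour equals the fraction of levels $\boldsymbol{u}$ inside that box.

```python
import numpy as np


def toy_vqf(u):
    """Toy 2d VQF: gradient of the convex potential u1**2 - u1 + exp(u2)."""
    u = np.atleast_2d(u)
    return np.stack([2.0 * u[:, 0] - 1.0, np.exp(u[:, 1])], axis=1)


rng = np.random.default_rng(0)
n, d, alpha = 200_000, 2, 0.02

# Strong representation: pushing U ~ Uniform[0,1]^d through Q yields samples of Y.
U = rng.uniform(size=(n, d))
Y = toy_vqf(U)

# Co-monotonicity: (Q(u) - Q(u'))^T (u - u') >= 0, checked on random pairs (u, u').
U2 = rng.uniform(size=(n, d))
inner = np.sum((Y - toy_vqf(U2)) * (U - U2), axis=1)
comonotone = bool(np.all(inner >= 0))

# Fraction of samples whose level u falls inside the box [alpha, 1 - alpha]^d.
inside = np.all((U >= alpha) & (U <= 1 - alpha), axis=1)
frac_inside = float(inside.mean())
expected = (1 - 2 * alpha) ** d

print(f"{comonotone=}, {frac_inside=:.4f}, {expected=:.4f}")
```

With $\alpha=0.02$ and $d=2$ the expected fraction is $(1-0.04)^2 = 0.9216$, which matches the "roughly 92%" figure quoted above.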
Nonlinear VQR (NL-VQR) outperforms linear VQR and Conditional VAE (C-VAE) [4] on challenging distribution estimation tasks. The metric shown is KDE-L1 distribution distance (lower is better). Comparisons on two synthetic datasets are shown below.
Conditional banana: In this dataset both the mean of the distribution and its shape change as a nonlinear function of the covariates $\mathbf{X}$.
Rotating stars: Features a nonlinear relationship between the covariates and the quantile function (a rotation matrix), where the conditional mean remains the same for any $\mathbf{X}$.
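As a rough illustration of what a KDE-L1 distribution distance measures, the following sketch estimates the density of two 2d sample sets with a Gaussian kernel density estimate on a shared grid and integrates the absolute difference. The bandwidth, grid size and function names are assumptions chosen for this example; the exact definition used for the reported comparisons may differ, so please refer to the paper for it.

```python
import numpy as np


def kde_on_grid(samples, grid, bandwidth):
    """Isotropic Gaussian KDE of `samples` (n, d) evaluated at `grid` (m, d)."""
    d = samples.shape[1]
    sq_dist = np.sum((grid[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    norm = (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth**2).mean(axis=1) / norm


def kde_l1_distance(a, b, bandwidth=0.3, points_per_axis=40):
    """Approximate L1 distance between the KDEs of two 2d sample sets."""
    both = np.concatenate([a, b], axis=0)
    lo = both.min(axis=0) - 3 * bandwidth
    hi = both.max(axis=0) + 3 * bandwidth
    axes = [np.linspace(lo[j], hi[j], points_per_axis) for j in range(2)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 2)
    cell_area = np.prod([ax[1] - ax[0] for ax in axes])
    diff = np.abs(kde_on_grid(a, grid, bandwidth) - kde_on_grid(b, grid, bandwidth))
    return float(diff.sum() * cell_area)  # Riemann sum of |p - q| over the grid


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
same_dist = rng.normal(size=(500, 2))         # fresh samples, same distribution
shifted = rng.normal(loc=2.0, size=(500, 2))  # samples from a shifted distribution

d_same = kde_l1_distance(reference, same_dist)
d_shifted = kde_l1_distance(reference, shifted)
print(f"{d_same=:.3f}, {d_shifted=:.3f}")
```

Samples from the same distribution give a small distance, while the shifted samples give a value closer to the maximum of 2, which is why lower is better.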
The Nonlinear VQR implementation in this package can be used for performing scalar, i.e. $d=1$, quantile regression.
Synthetic glasses: A bi-modal distribution in which the modes' distance depends on $\mathbf{X}$.
- Vector quantile estimation (VQE): Given samples of a vector-valued random variable $\mathbf{Y}$, estimate its vector quantile function $Q_{\mathbf{Y}}(\boldsymbol{u})$.
- Vector quantile regression (VQR): Given samples from a joint distribution of $(\mathbf{X},\mathbf{Y})$ where $\mathbf{X}$ contains covariates ("feature vector") and $\mathbf{Y}$ is the target variable, estimate the conditional vector quantile function $Q_{\mathbf{Y}|\mathbf{X}}(\boldsymbol{u};\boldsymbol{x})$ (CVQF).
- Vector monotone rearrangement (VMR): an optional refinement procedure for estimated CVQFs which guarantees that the output is a valid quantile function, with no co-monotonicity violations.
- Support for arbitrary learnable non-linear functions of the covariates $g_{\boldsymbol{\theta}}(\boldsymbol{x})$, where the parameters $\boldsymbol{\theta}$ are fitted jointly with the VQR model. Can provide any pytorch model as the learnable transformation.
- Sampling: After fitting VQR, new samples can be generated from the conditional distribution. Thus VQR can be used as a generative model which can be fitted on samples, without making any distributional assumptions.
- Calculating quantile $\alpha$-contours: the equivalent of $\alpha$-confidence regions for high-dimensional data.
- Works for any $d\geq 1$. Specifically, for $d=1$, provides an incredibly fast method for performing nonlinear scalar quantile regression which estimates multiple quantiles of the target variable simultaneously.
- Multiple solvers supported as backends. The VQE/VQR API can work with different solver implementations which can provide different benefits and tradeoffs. Easy to integrate new solvers.
- GPU support.
- Coverage and area calculation: measures whether samples are within some $\alpha$-contour of the fitted quantile function, and also the area of these contours.
- Plotting: Basic capabilities for plotting 2d and 3d quantile functions.
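To give some intuition for the rearrangement feature listed above, consider the scalar case $d=1$, where co-monotonicity reduces to the quantile function being non-decreasing in the level. There, the classical monotone rearrangement amounts to sorting the estimated quantiles. The sketch below shows only this one-dimensional analogue on made-up numbers; it is not the VMR procedure implemented in this package, and the helper name is hypothetical.

```python
import numpy as np


def rearrange_scalar_quantiles(q_values):
    """Monotone rearrangement for d=1: sort estimates along the level axis."""
    return np.sort(np.asarray(q_values, dtype=float), axis=-1)


levels = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
# Made-up quantile estimates with one crossing (0.7 at level 0.3 > 0.6 at level 0.5).
q_est = np.array([0.2, 0.7, 0.6, 1.3, 1.9])

q_fixed = rearrange_scalar_quantiles(q_est)

crossings_before = int(np.sum(np.diff(q_est) < 0))
crossings_after = int(np.sum(np.diff(q_fixed) < 0))
print(f"{crossings_before=}, {crossings_after=}")
for u, q in zip(levels, q_fixed):
    print(f"level {u:.1f}: {q:.1f}")
```

Sorting keeps the same set of estimated values but reassigns them to levels so that no quantile crossings remain.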
Simply install the vqr package via pip:
pip install vqr
To run the example notebooks, please clone this repo and install the supplied conda environment.
conda env update -f environment.yml -n vqr
conda activate vqr
Below is a minimal usage example for VQR, demonstrating fitting linear VQR, sampling from the conditional distribution, and calculating coverage at a specified $\alpha$.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

from vqr import VectorQuantileRegressor
from vqr.solvers.regularized_lse import RegularizedDualVQRSolver

N, d, k, T = 5000, 2, 1, 20
N_test = N // 10
seed = 42
alpha = 0.05

# Generate some data (or load from elsewhere).
X, Y = make_regression(
    n_samples=N, n_features=k, n_targets=d, noise=0.1, random_state=seed
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=N_test, shuffle=True, random_state=seed
)

# Create the VQR solver and regressor.
vqr_solver = RegularizedDualVQRSolver(
    verbose=True, epsilon=1e-2, num_epochs=1000, lr=0.9
)
vqr = VectorQuantileRegressor(n_levels=T, solver=vqr_solver)

# Fit the model on the data.
vqr.fit(X_train, Y_train)

# Marginal coverage calculation: for each test point, calculate the
# conditional quantiles given x, and check whether the corresponding y is covered
# in the alpha-contour.
cov_test = np.mean(
    [vqr.coverage(Y_test[[i]], X_test[[i]], alpha=alpha) for i in range(N_test)]
)
print(f"{cov_test=}")

# Sample from the fitted conditional distribution, given a specific x.
Y_sampled = vqr.sample(n=100, x=X_test[0])

# Calculate conditional coverage given a sample x.
cov_sampled = vqr.coverage(Y_sampled, x=X_test[0], alpha=alpha)
print(f"{cov_sampled=}")
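When reading the coverage values printed by the example, it can help to know what to compare them against. Since an $\alpha$-contour is stated to contain $100\cdot(1-2\alpha)^d$ percent of the data, the sketch below computes that nominal level for the example's `alpha` and `d`, together with an approximate binomial tolerance for `N_test` points. The helper names are hypothetical, and the sketch assumes that the coverage values are fractions in $[0,1]$ and that the fitted model is reasonably well calibrated.

```python
import math


def nominal_coverage(alpha: float, d: int) -> float:
    """Fraction of data an alpha-contour is expected to contain: (1 - 2*alpha)^d."""
    return (1.0 - 2.0 * alpha) ** d


def coverage_tolerance(coverage: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% binomial half-width for an empirical coverage over n points."""
    return z * math.sqrt(coverage * (1.0 - coverage) / n)


alpha, d, n_test = 0.05, 2, 500  # values taken from the usage example
target = nominal_coverage(alpha, d)
tol = coverage_tolerance(target, n_test)
print(f"nominal coverage: {target:.3f} +/- {tol:.3f}")
```

For the example's settings the nominal level is $(1-0.1)^2 = 0.81$, so a marginal coverage far from this value may indicate that the solver needs more epochs or a different `epsilon`.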
For further examples, please refer to the example notebooks in the notebooks/ folder of this repo.
Footnotes
1. Rosenberg, A.A., Vedula, S., Romano, Y. and Bronstein, A.M., 2022. Fast Nonlinear Vector Quantile Regression. arXiv preprint arXiv:2205.14977.
2. Koenker, R. and Bassett Jr, G., 1978. Regression quantiles. Econometrica: journal of the Econometric Society, pp.33-50.
3. Carlier, G., Chernozhukov, V. and Galichon, A., 2016. Vector quantile regression: an optimal transport approach. The Annals of Statistics, 44(3), pp.1165-1192.
4. Feldman, S., Bates, S. and Romano, Y., 2021. Calibrated multiple-output quantile regression with representation learning. arXiv preprint arXiv:2110.00816.