\input{common/header.tex}
\inbpdocument
\chapter{Warped Mixture Models}
\label{ch:warped}
\begin{quotation}
``What, exactly, is a cluster?''
\hspace*{\fill} - Bernhard Sch\"{o}lkopf, personal communication
\end{quotation}
Previous chapters showed how the probabilistic nature of \gp{}s sometimes allows the automatic determination of the appropriate structure when building models of functions.
One can also take advantage of this property when composing \gp{}s with other models, automatically trading off complexity between the \gp{} and the other parts of the model.
This chapter considers a simple example: a Gaussian mixture model warped by a draw from a \gp{}.
This novel model produces clusters (density manifolds) having arbitrary nonparametric shapes.
We call the proposed model the \emph{infinite warped mixture model} (\iwmm{}).
%\Cref{fig:generative} shows a set of manifolds and datapoints sampled from the prior defined by this model.
The probabilistic nature of the \iwmm{} lets us automatically infer the number, dimension, and shape of a set of nonlinear manifolds, and summarize those manifolds in a low-dimensional latent space.
%The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data.
%The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred.
%We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function.
%We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
The work comprising the bulk of this chapter was done in collaboration with Tomoharu Iwata and Zoubin Ghahramani, and appeared in \citet{IwaDuvGha12}.
The main idea was born out of a conversation between Tomoharu and myself, and together we wrote almost all of the code as well as the paper.
Tomoharu ran most of the experiments, and Zoubin Ghahramani provided guidance and many helpful suggestions throughout the project.
\section{The Gaussian process latent variable model}
\label{sec:gplvm}
\begin{figure}[t]
\centering
\begin{tabular}{ll}
\qquad \qquad \qquad Warping function: $y = f(x)$ & \\
\reflectbox{\includegraphics[width=0.8\textwidth,clip,trim=1.3cm 0.79cm 0cm 0cm]{\gplvmfiguresdir/gplvm_1d_draw_9} } &
\raisebox{3cm}{\rotatebox{90}{Warped density: $p(y)$}} \\
\qquad \qquad \qquad Latent density: $p(x)$ &
\end{tabular}
\caption[One-dimensional Gaussian process latent variable model]{
A draw from a one-dimensional Gaussian process latent variable model.
\emph{Bottom:} the density of a set of samples from a 1D Gaussian specifying the distribution $p(x)$ in the latent space.
\emph{Top left:} A function $y = f(x)$ drawn from a \gp{} prior.
Grey lines show points being mapped through $f$.
\emph{Right:} A nonparametric density $p(y)$ defined by warping the latent density through the sampled function.}
\label{fig:oned-gplvm}
\end{figure}
\begin{figure}[t]
\centering
\begin{tabular}{ccc}
Latent space $p(\vx)$ & & Observed space $p(\vy)$ \\
\fbox{\includegraphics[height=15em,width=15em]{\gplvmfiguresdir/gplvm-latent-crop}} &
\raisebox{7em}{$\overset{ \textstyle \vf(\vx)}{ \mathlarger{\mathlarger{\mathlarger{\mathlarger{\mathlarger{ \rightarrow}}}}}}$} &
\fbox{\includegraphics[height=15em,width=15em]{\gplvmfiguresdir/gplvm-draw-crop}}
\end{tabular}
\caption[Two-dimensional Gaussian process latent variable model]{A draw from a two-dimensional Gaussian process latent variable model.
\emph{Left:} Isocontours and samples from a 2D Gaussian, specifying the distribution $p(\vx)$ in the latent space.
\emph{Right:} The observed density $p(\vy)$ has a nonparametric shape, defined by warping the latent density through a function drawn from a \gp{} prior.}
\label{fig:twod-gplvm}
\end{figure}
The \iwmm{} can be viewed as an extension of the Gaussian process latent variable model (\gplvm{})~\citep{lawrence2004gaussian}, a probabilistic model of nonlinear manifolds.
The \gplvm{} smoothly warps a Gaussian density into a more complicated distribution, using a draw from a \gp{}.
Usually, we say that the Gaussian density is defined in a ``latent space'' having $Q$ dimensions, and the warped density is defined in the ``observed space'' having $D$ dimensions.
A generative definition of the \gplvm{} is:
%
\begin{align}
\textnormal{latent coordinates } \vX = (\vx_1, \vx_2, \ldots, \vx_{N})\tra & \simiid \N{\vx}{0}{\vI_Q} \\
\textnormal{warping functions } \vf = (f_1, f_2, \ldots, f_D)\tra & \simiid \GPt{0}{\seard \kernplus \kWN} \\
\textnormal{observed datapoints } \vY = (\vy_1, \vy_2, \ldots, \vy_N)\tra & = \vf(\vX)
\end{align}
Under the \gplvm{}, the probability of the observations $\vY$ given the latent coordinates $\vX$, integrating over the mapping functions $\vf$, is simply a product of \gp{} likelihoods:
\begin{align}
p(\vY | \vX, \vtheta)
& = \prod_{d=1}^D p(\vY_{:,d} | \vX, \vtheta)
= \prod_{d=1}^D \N{\vY_{:,d}}{0}{\vK_\vtheta} \\
& = (2 \pi)^{-\frac{DN}{2}} |\vK_\vtheta|^{-\frac{D}{2}} \exp \left( -\frac{1}{2} {\rm tr}( \vY\tra \vK_\vtheta\inv \vY) \right),
\label{eq:py_x}
\end{align}
%
where $\vtheta$ are the kernel parameters and $\vK_\vtheta$ is the Gram matrix $k_\vtheta(\vX, \vX)$.
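To make this likelihood concrete, the following Python sketch (purely illustrative, and independent of the code released with this chapter) evaluates the log of \eqref{eq:py_x} under a squared-exponential plus white-noise kernel; the parameter names are placeholders rather than those used in our implementation.
\begin{verbatim}
import numpy as np

def se_kernel(X, lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    # Squared-exponential kernel plus white noise, on latent points X (N x Q).
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = signal_var * np.exp(-0.5 * sq / lengthscale**2)
    return K + noise_var * np.eye(X.shape[0])

def gplvm_log_likelihood(Y, X, **kernel_params):
    # log p(Y | X, theta): D independent GP likelihoods sharing one Gram matrix.
    N, D = Y.shape
    K = se_kernel(X, **kernel_params)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # K^{-1} Y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log |K|
    return (-0.5 * D * N * np.log(2 * np.pi)
            - 0.5 * D * log_det
            - 0.5 * np.trace(Y.T @ alpha))
\end{verbatim}
The Cholesky factorization replaces an explicit matrix inverse only for numerical stability; the quantity computed is exactly the log of \eqref{eq:py_x}.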
%In this chapter, we use an \kSE + \kWN kernel.
% with an additive noise term:
%\begin{align}
%k(\vx_{n},\vx_{m}) &= \alpha \exp\left( - \frac{1}{2 \ell^2}(\vx_n - \vx_m)\tra (\vx_n - \vx_m) \right)
%+ \delta_{nm} \beta\inv.
%\end{align}
%where $\alpha$ is
%$\ell$ is the length scale,
%and $\beta$ is the variance of the additive noise.
%This likelihood is simply the product of $D$ independent Gaussian process likelihoods, one for each output dimension.
Typically, the \gplvm{} is used for dimensionality reduction or visualization, and the latent coordinates are set by maximizing \eqref{eq:py_x}.
In that setting, the Gaussian prior density on $\vx$ is essentially a regularizer which keeps the latent coordinates from spreading arbitrarily far apart.
One can also approximately integrate out $\vX$, which is the approach taken in this chapter.
%In contrast, we integrate out the latent coordinates, and the \iwmm{} places a more flexible parameterization on $p(\vx)$ than a single isotropic Gaussian.
%Just as the \gplvm{} can be viewed as a manifold learning algorithm, the \iwmm{} can be viewed as learning a set of manifolds, one for each cluster.
\section{The infinite warped mixture model}
\label{sec:iwmm-definition}
\begin{figure}
\centering
\begin{tabular}{ccc}
Latent space $p(\vx)$ & & Observed space $p(\vy)$ \\
\fbox{\includegraphics[trim=2em 0em 0em 0em, clip, height=15em,width=15em]{\warpedfiguresdir/iwmm_latent_N200_seed2155_darker}} &
\raisebox{7em}{$\overset{ \textstyle \vf(\vx)}{ \mathlarger{\mathlarger{\mathlarger{\mathlarger{\mathlarger{ \rightarrow}}}}}}$} &
\fbox{\includegraphics[trim=8em 6em 6em 2em, clip, height=15em,width=15em]{\warpedfiguresdir/iwmm_N200_seed2155_points1000000}}
\end{tabular}
\caption[A draw from the infinite warped mixture model prior]{
A sample from the \iwmm{} prior.
\emph{Left:} In the latent space, a mixture distribution is sampled from a Dirichlet process mixture of Gaussians.
\emph{Right:} The latent mixture is smoothly warped to produce a set of non-Gaussian manifolds in the observed space.}
\label{fig:generative}
\end{figure}
This section defines the infinite warped mixture model (\iwmm{}).
Like the \gplvm{}, the \iwmm{} assumes a smooth nonlinear mapping from a latent density to an observed density.
The only difference is that the \iwmm{} assumes that the latent density is an infinite Gaussian mixture model~(\iGMM{})~\citep{rasmussen2000infinite}:
%
\begin{align}
p(\vx ) = \sum_{c=1}^{\infty} \lambda_c \, {\cal N}(\vx|\bm{\mu}_{c},\vR_{c}\inv)
\end{align}
%
where $\lambda_{c}$, $\bm{\mu}_{c}$ and $\vR_{c}$ denote the mixture weight, mean, and precision matrix of the $c^{\text{\tiny th}}$ mixture component.
The \iwmm{} can be seen as a generalization of either the \gplvm{} or the \iGMM{}:
The \iwmm{} with a single fixed spherical Gaussian density on the latent coordinates $p(\vx)$ corresponds to the \gplvm{}, while the \iwmm{} with fixed mapping ${\vy = \vx}$ and
${Q = D}$ corresponds to the \iGMM{}.
If the clusters being modeled do not happen to have Gaussian shapes, a flexible model of cluster shapes is required to correctly estimate the number of clusters.
For example, a mixture of Gaussians fit to a single non-Gaussian cluster (such as one that is curved or heavy-tailed) will report that the data contains many Gaussian clusters.
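To illustrate the generative process just described, the following Python sketch draws a sample from a simplified version of the \iwmm{} prior: cluster assignments come from a Chinese restaurant process, each cluster receives its own latent Gaussian, and the latent points are warped through $D$ independent \gp{} draws.
The hyperpriors used here (and the variable names) are arbitrary illustrative choices rather than the conjugate priors used in our implementation.
\begin{verbatim}
import numpy as np

def sample_iwmm_prior(N, Q=2, D=2, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Cluster assignments from a Chinese restaurant process.
    z = []
    for n in range(N):
        counts = np.bincount(z) if z else np.array([])
        probs = np.append(counts, alpha).astype(float)
        z.append(rng.choice(len(probs), p=probs / probs.sum()))
    z = np.array(z)
    # 2. Each cluster gets its own Gaussian in the latent space.
    X = np.zeros((N, Q))
    for c in np.unique(z):
        mu = rng.normal(0.0, 2.0, size=Q)
        A = rng.normal(size=(Q, Q))
        cov = 0.1 * A @ A.T + 0.05 * np.eye(Q)
        idx = z == c
        X[idx] = rng.multivariate_normal(mu, cov, size=idx.sum())
    # 3. Warp the latent points through D independent GP draws.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-0.5 * sq) + 1e-6 * np.eye(N)
    Y = np.linalg.cholesky(K) @ rng.normal(size=(N, D))
    return X, z, Y
\end{verbatim}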
% move up?
%\subsection{Identifiability}
%The \gplvm{} is capable of modeling multi-modal densities, as is the \iGMM{}.
%How can one distinguish between a single latent cluster warped into two observed clusters, and two latent clusters which haven't been distorted?
%We propose that, in general,
\section{Inference}
\label{sec:iwmm-inference}
As discussed in \cref{sec:useful-properties}, one of the main advantages of \gp{} priors is that, given inputs $\vX$, outputs $\vY$ and kernel parameters $\vtheta$, one can analytically integrate over functions mapping $\vX$ to $\vY$.
However, inference becomes more difficult when one introduces uncertainty about the kernel parameters or the input locations $\vX$.
This section outlines how to compute approximate posterior distributions over all parameters in the \iwmm{} given only a set of observations $\vY$.
Further details can be found in appendix~\ref{sec:iwmm-inference-details}.
We first place conjugate priors on the parameters of the Gaussian mixture components, allowing analytic integration over latent cluster shapes, given the assignments of points to clusters.
The only remaining variables to infer are the latent points $\vX$, the cluster assignments $\CLAS$, and the kernel parameters $\vtheta$.
%In particular, we'll alternate collapsed Gibbs sampling of $\CLAS$ and Hamiltonian Monte Carlo sampling of $\vX$.
%
%Given $\CLAS$, we can calculate the gradient of the un-normalized posterior distribution of $\vX$, integrating over warping functions.
%This gradient allows us to sample $\vX$ using Hamiltonian Monte Carlo (\HMC{}).
%
We can obtain samples from their posterior
%$p(\vX,\CLAS|\vY,\bm{\theta},\vS,\nu,\vu,r,\eta)$
$p(\vX,\CLAS, \vtheta | \vY)$ by iterating two steps:
%
\begin{enumerate}
\item Given a sample of the latent points $\vX$, sample the discrete cluster memberships $\CLAS$ using collapsed Gibbs sampling, integrating out the \iGMM{} parameters \eqref{eq:gibbs}.
\item Given the cluster assignments $\CLAS$, sample the continuous latent coordinates $\vX$ and kernel parameters $\vtheta$ using Hamiltonian Monte Carlo (\HMC{})~\citep[chapter 30]{mackay2003information}.
The relevant equations are given by \cref{eq:warped-hmc1,eq:warped-hmc2,eq:warped-hmc3,eq:warped-hmc4}.
\end{enumerate}
The complexity of each iteration of \HMC{} is dominated by the $\mathcal{O}(N^3)$ computation of $\vK\inv$.
This complexity could be improved by making use of an inducing-point approximation~\citep{quinonero2005unifying,snelson2006sparse}.
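Schematically, the whole sampler is a loop over these two updates.
In the Python outline below, \texttt{gibbs\_sweep} and \texttt{hmc\_update} are hypothetical placeholders standing in for the collapsed Gibbs update of \eqref{eq:gibbs} and the \HMC{} update of \cref{eq:warped-hmc1,eq:warped-hmc2,eq:warped-hmc3,eq:warped-hmc4}; only the structure of the algorithm is meant to be taken literally.
\begin{verbatim}
def run_iwmm_sampler(Y, num_iters, init_X, init_theta,
                     gibbs_sweep, hmc_update):
    # Alternate collapsed Gibbs sampling of cluster assignments with
    # HMC over latent coordinates and kernel parameters.
    X, theta = init_X, init_theta
    z = [0] * len(Y)              # start with every point in one cluster
    samples = []
    for it in range(num_iters):
        # Step 1: resample memberships, iGMM parameters integrated out.
        z = gibbs_sweep(X, z)
        # Step 2: resample latent coordinates and kernel parameters,
        # with the warping functions integrated out analytically.
        X, theta = hmc_update(X, theta, z, Y)
        samples.append((X.copy(), list(z), theta))
    return samples
\end{verbatim}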
\subsubsection{Posterior predictive density}
One disadvantage of the \gplvm{} is that its predictive density has no closed form, and the \iwmm{} inherits this problem.
To approximate the predictive density, we first sample latent points, then sample warpings of those points into the observed space. % and warpings from the posterior and map the sampled points through the sampled warpings.
The Gaussian noise added to each observation by the $\kWN$ kernel component means that each sample adds a Gaussian to the Monte Carlo estimate of the predictive density.
Details can be found in appendix \ref{sec:iwmm-predictive-density}.
This procedure was used to generate the plots of posterior density in \cref{fig:generative,fig:warping,fig:posterior}.
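A minimal Python sketch of this Monte Carlo estimate is given below; \texttt{warped\_samples} holds, for each retained posterior sample, the warped locations of latent points drawn from that sample's latent mixture, and all names are illustrative rather than taken from the released code.
\begin{verbatim}
import numpy as np

def predictive_density(y_star, warped_samples, noise_var):
    # Each warped latent point contributes an equally-weighted Gaussian,
    # with variance given by the white-noise kernel component.
    total, count = 0.0, 0
    D = y_star.shape[0]
    norm = (2.0 * np.pi * noise_var) ** (-0.5 * D)
    for F in warped_samples:            # F has shape (M, D)
        sq = np.sum((F - y_star[None, :]) ** 2, axis=1)
        total += np.sum(norm * np.exp(-0.5 * sq / noise_var))
        count += F.shape[0]
    return total / count
\end{verbatim}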
\section{Related work}
\label{sec:iwmm-related-work}
The literature on manifold learning, clustering and dimensionality reduction is extensive.
This section highlights some of the most relevant related work.
\subsubsection{Extensions of the \sgplvm{}}
The \gplvm{} has been used effectively in a wide variety of applications~\citep{lawrence2004gaussian,salzmann2008local,lawrence2009non}.
The latent positions $\vX$ in the \gplvm{} are typically obtained by maximum a posteriori estimation or variational Bayesian inference~\citep{titsias2010bayesian}, placing a single fixed spherical Gaussian prior on $\vx$.
\citet{geiger2009rank} introduced a regularized extension of the \gplvm{} that allows estimation of the dimension of the latent space, in which the latent variables and their intrinsic dimensionality are simultaneously optimized.
The \iwmm{} can also infer the intrinsic dimensionality of nonlinear manifolds:
the Gaussian covariance parameters for each latent cluster allow the variance of irrelevant dimensions to become small.
The marginal likelihood of the latent Gaussian mixture will favor using as few dimensions as possible to describe each cluster.
Because each latent cluster has a different set of parameters, each cluster can have a different effective dimension in the observed space,
%This allows the \iwmm{} to model manifolds of differing dimension in the observed space
as demonstrated in \cref{fig:warping}(c).
\citet{nickisch2010gaussian} considered several modifications of the \gplvm{} which model the latent density using a mixture of Gaussians centered around the latent points.
They approximated the observed density $p(\vy)$ by a second mixture of Gaussians, obtained by moment-matching the density obtained by warping each latent Gaussian into the observed space.
Because their model was not generative, training was done by maximizing a leave-some-out predictive density.
This method had poor predictive performance compared to simple baselines.
%, and did not have a generative clustering model.
%spherical Gaussian at each latent datapoint, and approximated the implied density in the observed space with another set of Gaussians, one for each datapoint.
%This model enjoys some of the flexibility of the \iwmm{}, but
\subsubsection{Related linear models}
The \iwmm{} can also be viewed as a generalization of the mixture of probabilistic principal component analyzers~\citep{tipping1999mixtures}, or the mixture of factor analyzers~\citep{ghahramani2000variational}, where the linear mapping is replaced by a draw from a \gp{}, and the number of components is infinite.
\subsubsection{Non-probabilistic methods}
There exist non-probabilistic clustering methods which can find clusters with complex shapes, such as spectral clustering~\citep{ng2002spectral} and nonlinear manifold clustering~\citep{cao2006nonlinear,elhamifar2011sparse}.
Spectral clustering finds clusters by first forming a similarity graph, then finding a low-dimensional latent representation using the graph, and finally clustering the latent coordinates via k-means.
The performance of spectral clustering depends on parameters which are usually set manually, such as the number of clusters, the number of neighbors, and the variance parameter used for constructing the similarity graph.
The \iwmm{} infers such parameters automatically, and has no need to construct a similarity graph.
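For concreteness, a bare-bones sketch of the spectral clustering pipeline just described is given below (illustrative Python); note that the number of clusters and the similarity bandwidth must both be supplied by hand.
\begin{verbatim}
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(Y, n_clusters, sigma):
    # Gaussian similarity graph -> normalized graph Laplacian ->
    # embed into the smallest eigenvectors -> k-means.
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    d = np.sqrt(W.sum(axis=1))
    L = np.eye(len(Y)) - W / d[:, None] / d[None, :]
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]                        # smallest eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize
    _, labels = kmeans2(U, n_clusters, minit='++')
    return labels
\end{verbatim}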
%The \iwmm{} can be viewed as a probabilistic generative model for spectral clustering, where the both methods find clusters in a nonlinearly mapped low-dimensional space.
The kernel Gaussian mixture model~\citep{wang2003kernel} can also find non-Gaussian shaped clusters.
This model estimates a \GMM{} in the implicit infinite-dimensional feature space defined by the kernel mapping of the observed space.
%However, the kernel \GMM{} uses a fixed nonlinear mapping function, with no likelihood term that encourages the latent points to be well-modeled by a \GMM{}.
However, the kernel parameters must be set by cross-validation.
%and the data points mapped by a fixed function are not guaranteed
%to be distributed according to a GMM.
In contrast, the \iwmm{} infers the mapping function such that the latent coordinates will be well-modeled by a mixture of Gaussians.
%The infinite mixtures of Gaussian process experts~\citep{rasmussen2002infinite}
%uses Dirichlet process and Gaussian process priors for regression problems.
%The mixture of ppca
%ppca \citep{tipping1999probabilistic}
%mixture of ppca \citep{tipping1999mixtures}
%one Gaussian process function
\subsubsection{Nonparametric cluster shapes}
To the best of our knowledge, the only other Bayesian clustering method with nonparametric cluster shapes is that of \citet{rodriguez2012univariate}, who introduced a nonparametric model of \emph{unimodal} clusters for one-dimensional data, in which each cluster's density strictly decreases away from its mode.
\subsubsection{Deep Gaussian processes}
An elegant way to construct a \gplvm{} having a more structured latent density $p(\vx)$ is to use a second \gplvm{} to model the prior density of the latent coordinates $\vX$.
This latent \gplvm{} can have a third \gplvm{} modeling its latent density, etc.
This model class was considered by \citet{damianou2012deep}, who also tested to what extent each layer's latent representation grouped together points having the same label.
They found that when modeling \MNIST{} hand-written digits, nearest-neighbour classification performed best in the 4th layer of a 5-layer-deep nested \gplvm{}, suggesting that the latent density might have been implicitly forming clusters at that layer.
\section{Experimental results}
\label{sec:iwmm-experiments}
\subsection{Synthetic datasets}
Figure~\ref{fig:warping} demonstrates the proposed model on four synthetic datasets.
%
\def\inclatentpic#1{\fbox{\includegraphics[width=0.22\columnwidth, height=0.2\columnwidth]{\warpedfiguresdir/#1}}}
\begin{figure}[t]
\centering
{\tabcolsep=0.3em
\begin{tabular}{cccc}
\multicolumn{4}{c}{Observed space} \\
\inclatentpic{spiral2_x3_observed_coordinates_epoch5000} &
\inclatentpic{halfcircles_N100K3_x3_observed_coordinates_epoch5000} &
\inclatentpic{circles_N50K2_x3_observed_coordinates_epoch5000} &
\inclatentpic{pinwheel_N50K5_x3_observed_coordinates_epoch5000} \\
$\uparrow$ & $\uparrow$ & $\uparrow$ & $\uparrow$ \\
\inclatentpic{spiral2_x_latent_coordinates_epoch5000} &
\inclatentpic{halfcircles_N100K3_x_latent_coordinates_epoch5000} &
\inclatentpic{circles_N50K2_x_latent_coordinates_epoch5000} &
\inclatentpic{pinwheel_N50K5_x_latent_coordinates_epoch5000} \\
\multicolumn{4}{c}{Latent space} \\
(a) 2-curve & (b) 3-semi & (c) 2-circle & (d) Pinwheel \\
\end{tabular}}
\caption[Recovering clusters on synthetic data]{
\emph{Top row:} Observed unlabeled data points (black), and cluster densities inferred by the \iwmm{} (colors).
\emph{Bottom row:} Latent coordinates and Gaussian components from a single sample from the posterior.
Each circle plotted in the latent space corresponds to a datapoint in the observed space.}
\label{fig:warping}
\end{figure}
%
None of these datasets can be appropriately clustered by a Gaussian mixture model (\GMM{}).
For example, consider the 2-curve data shown in \cref{fig:warping}(a), where 100 data points lie in each of two curved lines in a two-dimensional observed space.
%as shown at the top row of .
A \GMM{} having only two components cannot separate the two curved lines, while a \GMM{} with many components could separate the two lines only by breaking each line into many clusters.
In contrast, the \iwmm{} separates the two non-Gaussian clusters in the observed space, representing them using two Gaussian-shaped clusters in the latent space.
\Cref{fig:warping}(b) shows a similar dataset having three clusters.
\Cref{fig:warping}(c) shows an interesting manifold learning challenge: a dataset consisting of two concentric circles.
The outer circle is modeled in the latent space of the \iwmm{} by a Gaussian with one effective degree of freedom.
This narrow Gaussian is fit to the outer circle in the observed space by bending its two ends until they cross over.
In contrast, the sampler fails to discover the 1D topology of the inner circle, modeling it with a 2D manifold instead.
This example demonstrates that each cluster in the \iwmm{} can have a different effective dimension.
\Cref{fig:warping}(d) shows a five-armed variant of the pinwheel dataset of \citet{adams2009archipelago}, generated by warping a mixture of Gaussians into a spiral.
This generative process closely matches the assumptions of the \iwmm{}.
Unsurprisingly, the \iwmm{} is able to recover an analogous latent structure, and its predictive density follows the observed data manifolds.
\subsection{Clustering face images}
We also examined the \iwmm{}'s ability to model images without extensive pre-processing.
We constructed a dataset consisting of 50 greyscale $32 \times 32$ pixel images of two individuals from the \UMIST{} faces dataset~\citep{umistfaces}.
Each of the two series of images shows a different person turning his head to the right.
\begin{figure}[h!]
\centering
\includegraphics[width=0.5\columnwidth]{\warpedfiguresdir/faces2}
\caption[Latent clusters of face images]{A sample from the 2-dimensional latent space of the \iwmm{} when modeling a series of face images.
Images are rendered at their latent 2D coordinates.
The \iwmm{} reports that the data consists of two separate manifolds, both approximately one-dimensional, which both share the same head-turning structure.}
\label{fig:faces}
\end{figure}
\Cref{fig:faces} shows a sample from the posterior over latent coordinates and density, with each image rendered at its location in the latent space.
The observed space has $32 \times 32 = 1024$ dimensions.
The model has recovered three interpretable features of the dataset:
First, that there are two distinct faces.
Second, that each set of images lies approximately along a smooth one-dimensional manifold.
Third, that the two manifolds share roughly the same structure: the front-facing images of both individuals lie close to one another, as do the side-facing images.
\subsection{Density estimation}
\begin{figure}[ht!]
\centering
\begin{tabular}{cc}
(a) \iwmm{} & (b) \gplvm{} \\
\includegraphics[width=0.4\columnwidth]{\warpedfiguresdir/result_spiral2_ydistribution} &
\includegraphics[width=0.4\columnwidth]{\warpedfiguresdir/result_spiral2all_gplvm_ydistribution}
\end{tabular}
\caption[Comparing density estimates of the \sgplvm{} and the \siwmm{}]{
\emph{Left:} Posterior density inferred by the \iwmm{} in the observed space, on the 2-curve data.
\emph{Right:} Posterior density inferred by an \iwmm{} restricted to have only one cluster, a model equivalent to a fully-Bayesian \gplvm{}.}
\label{fig:posterior}
\end{figure}
\Cref{fig:posterior}(a) shows the posterior density in the observed space inferred by the \iwmm{} on the 2-curve data, computed using 1000 samples from the Markov chain.
%The two separate manifolds of high density implied by the two curved lines was recovered by the \iwmm{}.
The \iwmm{} correctly recovered the separation of the density into two unconnected manifolds.
% was recovered by the iWMM{}.
%Also note that the density along the manifold varies along with the density of data, shown in \cref{fig:warping}(a).
%The warped densities output by our model are actually a richer representation than simply a set of manifolds, since they also define a density at each point in the observed space.
%The density computed using multiple samples from MCMC is clearer than the density computed from a single sample shown at the bottom of Figure~\ref{fig:warping} (a).
This result can be compared to the density manifold recovered by the fully-Bayesian \gplvm{}, equivalent to a special case of the \iwmm{} having only a single cluster.
\Cref{fig:posterior}(b) shows that the \gplvm{} places significant density connecting the two ends of the clusters, since it must reproduce the observed density manifold by warping a single Gaussian.
\subsection{Mixing}
An interesting side-effect of learning the number of latent clusters is that this added flexibility can help the sampler escape poor local optima.
\Cref{fig:infer} shows samples of the latent coordinates and clusters of the \iwmm{} over a single run of a Markov chain modeling the 2-curve data.
%
\def\incwarpmixpic#1{\fbox{\includegraphics[width=0.22\columnwidth, height=0.2\columnwidth]{\warpedfiguresdir/#1}}}
\begin{figure}
\centering
{\tabcolsep=0.3em
\begin{tabular}{cccc}
\incwarpmixpic{spiral2all_o_latent_coordinates_epoch1}&
\incwarpmixpic{spiral2all_o_latent_coordinates_epoch500} &
\incwarpmixpic{spiral2all_o_latent_coordinates_epoch1800}&
\incwarpmixpic{spiral2all_o_latent_coordinates_epoch3000}\\
(a) Initialization & (b) Iteration 500 & (c) Iteration 1800 & (d) Iteration 3000 \\
\end{tabular}}
\caption[Visualization of the behavior of a sampler for the \siwmm{}]{
%The inferred infinite GMMs over iterations in the two-dimensional latent space with the \siwmm{} using the 2-curve data. Labels indicate the number of iterations of the sampler, and the color of each point represents its ordering in the observed coordinates.}
Latent coordinates and densities of the \iwmm{}, plotted throughout one run of a Markov chain.}
\label{fig:infer}
\end{figure}
%
\Cref{fig:infer}(a) shows the latent coordinates initialized at the observed coordinates, starting with one latent component.
After 500 iterations, each curved line was modeled by two components.
After 1800 iterations, the left curved line was modeled by a single component.
After 3000 iterations, the right curved line was also modeled by a single component, and the dataset was appropriately clustered.
This configuration was relatively stable, and a similar state was found at the 5000th iteration.
%In this way, the \iwmm{} can find latent coordinates
%by flexibly changing the number of components.
\subsection{Visualization}
Next, we briefly investigate the utility of the \iwmm{} for low-dimensional visualization of data.
%
\def\incdensitypic#1{\fbox{\includegraphics[width=0.3\columnwidth, height=0.3\columnwidth]{\warpedfiguresdir/#1}}}%
\begin{figure}[ht!]
\centering
{\tabcolsep=0.3em
\begin{tabular}{cc}
\incdensitypic{spiral2all_o_latent_coordinates} &
\incdensitypic{spiral2all_wm2_o_latent_coordinates} \\
%\incdensitypic{spiral2_gplvm2_o_latent_coordinates} &
%\incdensitypic{spiral2_fixedgaussgplvm_o_latent_coordinates}\\
(a) \iwmm{} & (b) \iwmm{} ($C=1$) %& (c) \gplvm{} & (d) \vbgplvm{}
\end{tabular}}
\caption[Comparison of latent coordinate estimates]{Latent coordinates of the 2-curve data, estimated by two different methods.
% by (a) \iwmm{}, (b) \iwmm{} ($C=1$), (c) \gplvm{}, and (d) Bayesian \gplvm{}.
}
\label{fig:latent}
\end{figure}
%
\Cref{fig:latent}(a) shows the latent coordinates obtained by averaging over 1000 samples from the posterior of the \iwmm{}.
%Because rotating the latent coordinates does not change their probability, simple averaging may not be an adequate way to summarize the posterior.
%However, this result shows characteristics of the latent coordinates obtained by the \iwmm{}.
The estimated latent coordinates are clearly separated, forming two straight lines.
This result is an example of the \iwmm{} recovering the original topology of the data before it was warped to produce observations.
For comparison, \cref{fig:latent}(b) shows the latent coordinates estimated by the fully-Bayesian \gplvm{}, in which case the latent coordinates lie in two sections of a single straight line.
%\Cref{fig:latent}(c) shows the result of standard \MAP{} estimation in the \gplvm{}, and \cref{fig:latent}(d) shows the result of variational Bayesian inference of \citet{titsias2010bayesian}.
%These methods did not unfold the two curved lines, since the effective dimension of their latent representation is fixed beforehand.
%because in these models, no force to favor forming a low-dimensional manifold.
%In contrast, the \iwmm{} effectively formed a low-dimensional representation in the latent space.
%Regardless of the dimension of the latent space, the \iwmm{} will tend to model each cluster with as low-dimensional a Gaussian as possible,
%This is because, if the data in a cluster can be made to lie in a low-dimensional plane,
%since a narrowly-shaped Gaussian will assign the latent coordinates much higher likelihood than a spherical Gaussian.
\subsection{Clustering performance}
We more formally evaluated the density estimation and clustering performance of the proposed model using four real datasets: iris, glass, wine and vowel, obtained from the \LIBSVM{} multi-class datasets~\citep{chang2011libsvm}, in addition to the four synthetic datasets shown above: 2-curve, 3-semi, 2-circle and pinwheel~\citep{adams2009archipelago}.
The statistics of these datasets are summarized in \cref{tab:statistics}.
\begin{table}[ht!]
\centering
\caption[Datasets used for evaluation of the \siwmm{}]
{Statistics of the datasets used for evaluation.}
\label{tab:statistics}
\begin{tabular}{rrrrrrrrr}
\hline
& 2-curve & 3-semi & 2-circle & Pinwheel & Iris & Glass & Wine & Vowel \\
\hline
dataset size: $N$ & 100 & 300 & 100 & 250 & 150 & 214 & 178 & 528 \\
dimension: $D$ & 2 & 2 & 2 & 2 & 4 & 9 & 13 & 10 \\
num. clusters: $C$ & 2 & 3 & 2 & 5 & 3 & 7 & 3 & 11 \\
\hline
\end{tabular}
\end{table}
%
For each experiment, we show the results of ten-fold cross-validation.
Results in bold are not significantly different from the best performing method in each column according to a paired t-test.
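The bolding rule can be implemented as a paired t-test over the per-fold scores; the short Python sketch below assumes a conventional 0.05 significance threshold, which we state here only as an illustrative choice.
\begin{verbatim}
from scipy.stats import ttest_rel

def not_significantly_worse(best_fold_scores, method_fold_scores, alpha=0.05):
    # Bold a method if its per-fold scores are not significantly
    # different from those of the best method in the same column.
    _, p_value = ttest_rel(best_fold_scores, method_fold_scores)
    return p_value >= alpha
\end{verbatim}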
%In each experiment, 10\% of the data was used for testing.
%
%
%We used two evaluation measurements: test point likelihood and Rand index
%for evaluating in terms of density estimation performance
%and clustering performance, respectively.
% the Gaussian-Wishart prior is quite different in these cases, placing more mass on higher-dimensional manifolds when the dimension of the latent space is high. Placing a hyper-prior allowing the concentration parameter $\eta$ to vary may shrink these differences.
\begin{table}[ht!]
\centering
\caption[Clustering performance comparison]
{Average Rand index for evaluating clustering performance.}
\label{tab:rand}
\begin{tabular}{lrrrrrrrr}
\hline
& 2-curve & 3-semi & 2-circle & Pinwheel & Iris & Glass & Wine & Vowel \\
\hline
\iGMM{} & $0.52$ & $0.79$ & $0.83$ & $0.81$ & $0.78$ & $0.60$ & $0.72$ & $\mathbf{0.76}$ \\
\iwmm{}(Q=2) & $\mathbf{0.86}$ & $\mathbf{0.99}$ & $\mathbf{0.89}$ & $\mathbf{0.94}$ & $\mathbf{0.81}$ & $\mathbf{0.65}$ & $0.65$ & $0.50$ \\
\iwmm{}(Q=D) & $\mathbf{0.86}$ & $\mathbf{0.99}$ & $\mathbf{0.89}$ & $\mathbf{0.94}$ & $0.77$ & $0.62$ & $\mathbf{0.77}$ & $\mathbf{0.76}$ \\
\hline
\end{tabular}
\end{table}
%
\Cref{tab:rand} compares the clustering performance of the \iwmm{} with the \iGMM{}, quantified by the Rand index~\citep{rand1971objective}, which measures the correspondence between inferred cluster labels and true cluster labels.
Since the manifold on which the observed data lies can be at most $D$-dimensional, we set the latent dimension $Q$ equal to the observed dimension $D$.
We also included the $Q = 2$ case in an attempt to characterize how much modeling power is lost by forcing the latent representation to be visualizable.
%The \iGMM{} is another probabilistic generative model commonly used for clustering, which can be seen as a special case of the \iwmm{} in which the Gaussian clusters are not warped.
These experiments were designed to measure the extent to which nonparametric cluster shapes help to estimate meaningful clusters.
To eliminate any differences due to different inference procedures, we used identical code for the \iGMM{} and \iwmm{}, the only difference being that the warping function was set to the identity $\vy = \vx$.
%When the latent dimensionality was set to two ($Q=2$), clustering performance was degraded.% did not work well
Both variants of the \iwmm{} usually outperformed the \iGMM{} on this measure.
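For reference, the Rand index used in \cref{tab:rand} can be computed directly from its definition, as in the following Python sketch: it is the fraction of point pairs on which the inferred and true labelings agree about whether the pair belongs to the same cluster.
\begin{verbatim}
from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Fraction of pairs placed consistently by the two labelings.
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total
\end{verbatim}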
\subsection{Density estimation performance}
Next, we compared the \iwmm{} in terms of predictive density against kernel density estimation (\KDE{}), the \iGMM{}, and the fully-Bayesian \gplvm{}.
%All samplers were run for 5000 iterations.
For \KDE{}, the kernel width was estimated by maximizing the leave-one-out density.
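A minimal Python sketch of this bandwidth selection, assuming a Gaussian kernel and an arbitrary grid of candidate bandwidths, is:
\begin{verbatim}
import numpy as np

def loo_log_density(Y, h):
    # Mean leave-one-out log density of a Gaussian KDE with bandwidth h.
    N, D = Y.shape
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq / h ** 2) / (2.0 * np.pi * h ** 2) ** (D / 2.0)
    np.fill_diagonal(K, 0.0)
    return np.mean(np.log(K.sum(axis=1) / (N - 1)))

def select_bandwidth(Y, candidates=np.logspace(-2, 1, 30)):
    # Grid search for the bandwidth maximizing the leave-one-out density.
    return max(candidates, key=lambda h: loo_log_density(Y, h))
\end{verbatim}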
%Although all these methods are consistent density estimators, we expect the \iwmm{} to have an advantage for finite datasets since its density contours can follow the data density.
%With WM and \iwmm{}, we set the dimensionality of the latent space
%at that of the observed space $Q=D$ or $Q=2$.
%We include the $Q=2$ case in order to attempt to quantify the
%The proposed models achieved higher test log likelihoods than the KDE, MPW
%
\Cref{tab:likelihood} lists average test log likelihoods.
%
\begin{table}[ht!]
\centering
\caption[Predictive likelihood comparison]
{Average test log-likelihoods for evaluating density estimation performance.}
\label{tab:likelihood}
\begin{tabular}{lrrrrrrrr}
\hline
& 2-curve & 3-semi & 2-circle & Pinwheel & Iris & Glass & Wine & Vowel \\
\hline
\KDE{} & $-2.47$ & $-0.38$ & $-1.92$ & $-1.47$ & $\mathbf{-1.87}$ & $1.26$ & $-2.73$ & $\mathbf{6.06}$ \\
\iGMM{} & $-3.28$ & $-2.26$ & $-2.21$ & $-2.12$ & $-1.91$ & $3.00$ & $\mathbf{-1.87}$ & $-0.67$ \\
\gplvm{}(Q=2) & $-1.02$ & $-0.36$ & $\mathbf{-0.78}$ & $\mathbf{-0.78}$ & $-1.91$ & $\mathbf{5.70}$ & $\mathbf{-1.95}$ & $\mathbf{6.04}$ \\
\gplvm{}(Q=D) & $-1.02$ & $-0.36$ & $\mathbf{-0.78}$ & $\mathbf{-0.78}$ & $-1.86$ & $\mathbf{5.59}$ & $-2.89$ & $-0.29$ \\
\iwmm{}(Q=2) & $\mathbf{-0.90}$ & $\mathbf{-0.18}$ & $\mathbf{-1.02}$ & $\mathbf{-0.79}$ & $\mathbf{-1.88}$ & $\mathbf{5.76}$ & $\mathbf{-1.96}$ & $\mathbf{5.91}$ \\
\iwmm{}(Q=D) & $\mathbf{-0.90}$ & $\mathbf{-0.18}$ & $\mathbf{-1.02}$ & $\mathbf{-0.79}$ & $\mathbf{-1.71}$ & $\mathbf{5.70}$ & $-3.14$ & $-0.35$ \\
\hline
\end{tabular}
\end{table}
The \iwmm{} usually achieved higher test likelihoods than the \KDE{} and the \iGMM{}.
%The \iwmm{} achieved better performance than WM for each latent dimensionality,
%except on the wine dataset.
% This result indicates that it is important to assume an infinite mixture model in the latent space for density estimation.%, and that even a fully Bayesian version of the \gplvm{} is not necessarily a good density model.
%
The \gplvm{} performed competitively with the \iwmm{}, although it never significantly outperformed the corresponding \iwmm{} having the same latent dimension.
The sometimes large differences between performance in the $Q = 2$ case and the $Q = D$ case of these two methods may be attributed to the fact that when the observed dimension is high, many samples are required from the latent distribution in order to produce accurate estimates of the posterior predictive density at the test locations.
This difficulty might be resolved by using a warping with back-constraints~\citep{Lawrence06localdistance}, which would allow a more direct evaluation of the density at a given point in the observed space.
\subsubsection{Source code}
Code to reproduce all the above figures and experiments is available at \\\url{http://www.github.com/duvenaud/warped-mixtures}.
\section{Conclusions}
This chapter introduced a simple generative model of non-Gaussian density manifolds which can infer nonparametric cluster shapes, low-dimensional representations of varying dimension per cluster, and density estimates which smoothly follow the contours of each cluster.
We also introduced a sampler for this model which integrates out both the cluster parameters and the warping function exactly at each step.
%We further demonstrated that allowing non-parametric cluster shapes improves clustering performance over the Dirichlet process Mixture of Gaussians.
%Relatively flexible nonparametric clustering models already exist
Non-probabilistic methods such as spectral clustering can also produce nonparametric cluster shapes, but usually lack principled methods other than cross-validation for setting kernel parameters, the number of clusters, and the implicit dimension of the learned manifolds.
This chapter showed that using a fully generative model allows these model choices to be determined automatically.
%Many methods have been proposed which can perform some combination of clustering, manifold learning, density estimation and visualization.
%We demonstrated that a simple but flexible probabilistic generative model can perform well at all these tasks.
%Since the proposed model is generative, it can also be extended to handle missing data, integrate with other probabilistic models, and use other families of distributions for the latent components.
\section{Future work}
%One design decision which may be worth revisiting is the use the same warping for all mixture components. It would be simple, and typically faster, to use a separate set of GPs to model the warping of each latent cluster.
\subsubsection{More sophisticated latent density models}
The Dirichlet process mixture of Gaussians in the latent space of the \iwmm{} could easily be replaced by a more sophisticated density model, such as a hierarchical Dirichlet process~\citep{teh2006hierarchical}, or a Dirichlet diffusion tree~\citep{neal2003density}.
Another straightforward extension would be to make inference more scalable by using sparse Gaussian processes~\citep{quinonero2005unifying,snelson2006sparse} or more advanced Hamiltonian Monte Carlo methods~\citep{zhang2011quasi}.
\subsubsection{A finite cluster count model}
\citet{miller2013inconsistent} note that the Dirichlet process assumes infinitely many clusters, and that estimates of the number of clusters in a dataset based on Bayesian inference are inconsistent under this model.
They propose a consistent alternative which still allows efficient Gibbs sampling, called the mixture of finite mixtures.
Replacing the Dirichlet process with a mixture of finite mixtures could improve the consistency properties of the \iwmm{}.
%As mentioned above, another natural alternative would be to use nested \gplvm{} density models, recovering a deep \gp{} latent variable model.
%An interesting open question is whether
%TODO: talk about deep GPs
\subsubsection{Semi-supervised learning}
A straightforward extension of the \iwmm{} would be a semi-supervised version of the model.
The \iwmm{} could allow label propagation along regions of high density in the latent space, even if the individual points in those regions are stretched far apart along low-dimensional manifolds in the observed space.
Another natural extension would be to allow a separate warping for each cluster, producing a mixture of warped Gaussians, rather than a warped mixture of Gaussians.% which would also improve inference speed.
\subsubsection{Learning the topology of data manifolds}
Some datasets naturally live on manifolds which are not simply-connected.
For example, motion capture data or video of a person walking in a circle can be said to live on a torus, with one coordinate specifying the phase of the person's walking cycle, and another specifying how far around the circle they are.
%What if we were to use the structured kernels of \cref{ch:kernels,ch:grammar} to specify the prior on warpings?
As shown in \cref{sec:topological-manifolds}, using structured kernels to specify the warping of a latent space gives rise to interesting topologies on the observed density manifold.
If a suitable method for computing the marginal likelihood of a \gplvm{} is available, an automatic search similar to the one described in \cref{ch:grammar} may be able to automatically discover the topology of the data manifold.
% Todo: Cite paper from Jeff's talk that suggests using DP mixtures to estimate the number of clusters.
%This model is
%Table \ref{tab:count} shows the inferred number of components
%by the iGMM and \iwmm{}.
%With the synthetic data sets,
%the number of components inferred by the \iwmm{} was
%
%\begin{table*}[t!]
%\centering
%\caption{Average inferred number of components.}
%\label{tab:count}
%\begin{tabular}{lrrrrrrrr}
%\hline
% & 2-curve & 3-semi & 2-circle & Pinwheel & Iris & Glass & Wine & Vowel \\
%\hline
%iGMM & 2.7 & 4.5 & 3.8 & 4.2 & 2.0 & 2.3 & 2.7 & 5.0 \\
%iWMM($Q=2$) & 2.2 & 3.0 & 3.2 & 4.6 & 2.6 & 3.7 & 3.1 & 2.8 \\
%iWMM($Q=D$) & 2.2 & 3.0 & 3.2 & 4.6 & 2.0 & 2.8 & 5.9 & 5.1 \\
%\hline
%true & 2 & 3 & 2 & 5 & 3 & 7 & 3 & 11 \\
%\hline
%\end{tabular}
%\end{table*}
%
%In contrast to the many somewhat ad-hoc clustering, manifold learning, density estimation, and visualization methods introduced recently, we show that a simple generative model can perform well at all of these tasks.
%We introduced a nonparametric Bayesian clustering method capable of inferring nonlinearly separable clusters.
%In the experiments, we demonstrated that our model is effective for density estimation, and performs much better than infinite Gaussian mixture models at discovering meaningful clusters.
%\subsection*{Acknowledgements}
%The authors would like to thank Dominique Perrault-Joncas, Carl Edward Rasmussen, and Ryan Prescott Adams for helpful discussions.
%\pagebreak
%\section{Open Questions}
%\subsubsection{How can we extend this model to handle heavy-tailed clusters?}
\outbpdocument{
\bibliographystyle{plainnat}
\bibliography{references.bib}
}