\input{common/header.tex}
\inbpdocument
\chapter{Discussion}
\label{ch:discussion}
This chapter summarizes the contributions of this thesis, articulates some of the questions raised by this work, and relates the kernel-based model-building procedure of \cref{ch:kernels,ch:grammar,ch:description} to the larger project of automating statistics and machine learning.
%We with a summary of the contributions of this thesis.
\section{Summary of contributions}
The main contribution of this thesis was to develop a way to automate the construction of structured, interpretable nonparametric regression models using Gaussian processes.
This was done in several parts:
First, \cref{ch:kernels} presented a systematic overview of kernel construction techniques, and examined the resulting \gp{} priors.
Next, \cref{ch:grammar} demonstrated the viability of a search over an open-ended space of kernels, and showed that the corresponding \gp{} models could be automatically decomposed into diverse parts, allowing visualization of the structure found in the data.
\Cref{ch:description} showed that parts of kernels can sometimes be described modularly, allowing automatically written text to be included in detailed reports describing \gp{} models.
An example report is included in appendix~\ref{ch:example-solar}.
Together, these chapters demonstrate a proof-of-concept of what could be called an ``automatic statistician'', capable of performing some of the model construction and analysis currently carried out by experts.
The second half of this thesis examined several extensions of Gaussian processes, all of which enabled the automatic determination of model choices that were previously set by trial and error or cross-validation.
\Cref{ch:deep-limits} characterized and visualized deep Gaussian processes, related them to existing deep neural networks, and derived novel deep kernels.
\Cref{ch:additive} investigated additive \gp{}s, %a family of models consisting of sums of functions of all subsets of input variables,
and showed that they have the same covariance as a \gp{} using dropout.
\Cref{ch:warped} extended the \gp{} latent variable model into a Bayesian clustering model which automatically infers the nonparametric shape of each cluster, as well as the number of clusters.
% TODO: move to the relevant chapter
%\section{Future Work}
%How far can we push the kernel structure discovery procedure outlined in this thesis?
%\subsubsection{Multiple Dimensions}
%The experiments in \cref{ch:additive} showed that the structure search had very good performance in moderate dimensions.
%\subsubsection{Input and Output Warpings}
\section{Structured versus unstructured models}
One question left unanswered by this thesis is when one should prefer the structured, kernel-based models described in \cref{ch:kernels,ch:grammar,ch:description,ch:additive} over the relatively unstructured deep \gp{} models described in \cref{ch:deep-limits}.
This section considers some advantages and disadvantages of the two approaches.
\begin{itemize}
\item {\bf Difficulty of optimization.}
The discrete nature of the space of composite kernel structures can be seen as both a blessing and a curse.
Certainly, a mixed discrete-and-continuous search space requires more complex optimization procedures than the continuous-only optimization possible in deep \gp{}s.
However, the discrete nature of the space of composite kernels offers the possibility of learning heuristics to suggest which types of structure to add, based on features of the dataset, or previous model fits.
For example, finding periodic structure or growing variance in the residuals suggests adding periodic or linear components to the kernel, respectively.
It is not clear whether such heuristics could easily be constructed for optimizing the variational parameters of a deep \gp{}.
\item {\bf Long-range extrapolation.}
Another open question is whether the inductive bias of deep \gp{}s can be made to allow the sorts of long-range extrapolation shown in \cref{ch:kernels,ch:grammar}.
As an example, consider the problem of extrapolating a periodic function.
A deep \gp{} could learn a latent representation similar to that of the periodic kernel, projecting into a basis equivalent to $[\sin(x), \cos(x)]$ in the first hidden layer; a sketch of this construction is given just after this list.
However, to extrapolate a periodic function, the $\sin$ and $\cos$ features would have to keep repeating beyond the input range of the training data, which would not happen if each layer assumed only local smoothness.
One obvious possibility is to marry the two approaches, building deep \gp{}s with structured kernels.
However, this approach may sacrifice some of the advantages of interpretability, and inference would become more difficult.
Another point to consider is that, in high dimensions, the distinction between interpolation and extrapolation becomes less meaningful.
If the training and test data both live on a low-dimensional manifold, then learning a suitable representation of that manifold may be sufficient for obtaining high predictive accuracy.
\item {\bf Ease of interpretation.}
Historically, the statistics community has put more emphasis on the interpretability and meaning of models than the machine learning community, which has focused more on predictive performance.
To begin to automate the practice of statistics, developing model-description procedures for powerful open-ended model classes seems to be a necessary step. %promising direction.% with the most low-hanging fruit.
At first glance, automatic model description may seem to require a decomposition of the model being described into discrete components, as in the additive decomposition demonstrated in \cref{ch:kernels,ch:grammar,ch:description}.
%For example, \cref{ch:description} showed that composite kernels allow automatic visualization and desription of low-dimensional structure.
On the other hand, most probabilistic models allow a form of summarization of high-dimensional structure through sampling from the posterior.
For the specific case of deep \gp{}s, \citet{damianou2012deep} showed that these models allow summarization through examining the dimension of each latent layer, visualizing latent coordinates, and examining how the predictive distribution changes as one moves in the latent space.
Perhaps more sophisticated procedures could also allow intelligible text-based descriptions of such models.
\end{itemize}
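To make the extrapolation example above concrete, the following is a minimal sketch of the construction referred to there, written in generic notation: the lengthscale $\ell$ and the warping $u$ are chosen purely for illustration and do not follow any particular chapter's conventions. Applying a unit-variance squared-exponential kernel to the warped inputs $u(x) = (\cos x, \sin x)$ recovers a one-dimensional periodic kernel:
%
\begin{align}
k_{\textnormal{SE}}\!\left(u(x), u(x')\right)
&= \exp\!\left( -\frac{\left\| u(x) - u(x') \right\|^2}{2\ell^2} \right) \nonumber \\
&= \exp\!\left( -\frac{2 - 2\cos(x - x')}{2\ell^2} \right)
 = \exp\!\left( -\frac{2 \sin^2\!\left( \tfrac{x - x'}{2} \right)}{\ell^2} \right),
\end{align}
%
since $\left\| u(x) - u(x') \right\|^2 = (\cos x - \cos x')^2 + (\sin x - \sin x')^2 = 2 - 2\cos(x - x')$.
A deep \gp{} whose first layer learned such a warping would inherit this repeating structure, but only if the learned mapping itself continued the $(\cos x, \sin x)$ pattern beyond the range of the training inputs.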
The warped mixture model of \cref{ch:warped} represents a compromise between these two approaches, combining a discrete clustering model with an unstructured warping function.
However, the explicit clustering model may be unnecessary: the results of \citet{damianou2012deep} suggest that clustering can be automatically (but implicitly) accomplished by a sufficiently deep, unstructured \gp{}.
\section{Approaches to automating model construction}
\label{top-down-vs-bottom-up}
This thesis is a small part of a larger push to automate the practice of model building and inference.
Broadly speaking, this problem is being attacked from two directions.
From the top down, the probabilistic programming community is developing automatic inference engines for extremely broad classes of models~\citep{goodman2008church,mansinghka2014venture,Wood-AISTATS-2014,koller1997effective,milch20071,stan-software:2014} such as the class of all computable distributions~\citep{solomonoff1964formal, li2009introduction}.
As discussed in \cref{sec:ingredients}, model construction procedures can usually be seen as a search through an open-ended model class.
%For example, \citet{liang2010learning}
%Exact inference in sufficiently powerful model classes is \#P-hard (cite), so approximate inference algorithms are necessary.
Exhaustive search strategies have been constructed for the space of computable distributions~\citep{hutter2002fastest,schmidhuber2002speed,levin1973universal}, but they remain impractically slow.
An alternative, bottom-up, approach is to design procedures which extend and combine existing model classes for which relatively efficient inference algorithms are already known.
The language of models proposed in \cref{ch:grammar} is an example of this bottom-up approach.
Another example is \citet{roger-grosse-thesis}, who built an open-ended language of matrix decomposition models and a corresponding compositional language of relatively efficient approximate inference algorithms.
Similarly, \citet{christian-thesis} showed how to compose inference algorithms for sequence models.
These approaches have the advantage that inference is usually feasible for any model in the language.
However, extending these languages may require developing new inference algorithms.
If sufficiently powerful building-blocks are composed, the line between the top-down and bottom-up approaches becomes blurred.
For example, deep generative models~\citep{adams2010learning,damianou2012deep,rippel2013high,bengio2013generalized,salakhutdinov2009deep} could be considered an example of the bottom-up approach, since they compose individual model ``layers'' to produce more powerful models.
However, large neural nets can capture enough different types of structure that they could also be seen as an example of the universalist, top-down approach.
%As well, inference in these models has so far been done in a one-size-fits all approach, ignoring the particular structure of a given problem.
\section{Conclusion}
It seems clear that one way or another, large parts of the existing practice of model-building will eventually be automated.
However, it remains to be seen which of the above model-building approaches will be most useful.
I hope that this thesis will contribute to our understanding of the strengths and weaknesses of these different approaches, and to the adoption of more powerful model classes by practitioners in other fields.
% Benefit: systematically explores model space?
%\section{Approaches to automating model description}
%The machine learning community has so far focused largely on producing efficient inference strategies for powerful model classes, which is sufficient for improving predictive performance.
%\section{Automating Statistics}
%Automating the process of statistical modeling would have a tremendous impact on fields that currently rely on expert statisticians, machine learning researchers, and data scientists.
%While fitting simple models (such as linear regression) is largely automated by standard software packages, there has been little work on the automatic construction of flexible but interpretable models.
\outbpdocument{
\bibliographystyle{plainnat}
\bibliography{references.bib}
}
\iffalse
\section{Why assume zero mean?}
\section{Why not learn the mean function instead?}
One might ask: besides integrating over the magnitude, what is the advantage of moving the mean function into the covariance function?
After all, mean functions are certainly more interpretable than a posterior distribution over functions.
Instead of searching over a large class of covariance functions, which seems strange and unnatural, we might consider simply searching over a large class of structured mean functions, assuming a simple \iid noise model.
This is the approach taken by practically every other regression technique: neural networks, decision trees, boosting, \etc.
If we could integrate over a wide class of possible mean functions, we would have a very powerful learning and inference method.
The problem faced by all of these methods is the well-known problem of \emph{overfitting}.
If we are forced to choose just a single function with which to make predictions, we must carefully control the flexibility of the model we learn, generally by preferring ``simple'' functions or by choosing a function from a restricted set.
If, on the other hand, we are allowed to keep in mind many possible explanations of the data, \emph{there is no need to penalize complexity}. [cite Occam's razor paper?]
The power of putting structure into the covariance function is that doing so allows us to implicitly integrate over many functions, maintaining a posterior distribution over infinitely many functions, instead of choosing just one.
In fact, each of the functions being considered can be infinitely complex, without causing any form of overfitting.
For example, each of the samples shown in \cref{fig:gp-post} varies randomly over the whole real line, never repeating, each one requiring an infinite amount of information to describe.
Choosing the one function which best fits the data will almost certainly cause overfitting.
However, if we integrate over many such functions, we will end up with a posterior putting mass on only those functions which are compatible with the data.
In other words, the parts of the function that we can determine from the data will be predicted with certainty, but the parts which are still uncertain will give rise to a wide range of predictions.
To repeat: \emph{there is no need to assume that the function being modeled is simple, or to prefer simple explanations} in order to avoid overfitting, if we integrate over many possible explanations rather than choosing just the one.
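To make this concrete, the standard zero-mean \gp{} predictive equations carry out exactly this integration in closed form. The notation below is generic rather than this thesis's macros: $K_{\mathbf{X}\mathbf{X}}$ denotes the matrix of kernel evaluations at the training inputs, $K_{\star\mathbf{X}}$ the evaluations between test and training inputs, and $\sigma^2_{\textnormal{noise}}$ the observation noise variance:
%
\begin{align}
\mathbf{f}_\star \,|\, \mathbf{y} &\sim \Nt{\mathbf{m}_\star}{\mathbf{C}_\star}, \quad \textnormal{where} \nonumber \\
\mathbf{m}_\star &= K_{\star\mathbf{X}} \left( K_{\mathbf{X}\mathbf{X}} + \sigma^2_{\textnormal{noise}} I \right)^{-1} \mathbf{y}, \\
\mathbf{C}_\star &= K_{\star\star} - K_{\star\mathbf{X}} \left( K_{\mathbf{X}\mathbf{X}} + \sigma^2_{\textnormal{noise}} I \right)^{-1} K_{\mathbf{X}\star}. \nonumber
\end{align}
%
No single best-fitting function is ever selected; the predictive mean and covariance summarize the entire posterior over functions compatible with the data, which is why no explicit complexity penalty is required.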
In Chapter~\ref{ch:grammar}, we will compare a system which estimates a parametric covariance function against one which estimates a parametric mean function.
\section{Signal versus noise}
In most derivations of Gaussian processes, the model is given as
%
\begin{align}
y = f(\inputVar) + \epsilon, \quad \textnormal{where} \quad \epsilon \simiid \Nt{0}{\sigma^2_{\textnormal{noise}}}
\end{align}
However, $\epsilon$ can equivalently be thought of as another Gaussian process, and so this model can be written as $y(\inputVar) \sim \GP\left(0, k + \delta \right)$, where $\delta$ is a white-noise kernel equal to $\sigma^2_{\textnormal{noise}}$ when $\inputVar = \inputVar'$ and zero otherwise. The lack of a hard distinction between the noise model and the signal model raises the larger question: which part of a model represents signal, and which represents noise?
We believe that it depends on what you want to do with the model: in general, there is no hard distinction between signal and noise.
For example, we often do not care about the short-term variations in a function, only about the long-term trend.
However, in many other cases, we wish to de-trend our data to see more clearly how much a particular part of the signal deviated from normal.
\fi