forked from EddyRivasLab/easel
-
Notifications
You must be signed in to change notification settings - Fork 0
/
esl_histogram.tex
299 lines (235 loc) · 13.9 KB
/
esl_histogram.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
The \eslmod{histogram} module is for collecting scores, fitting
them to expected distributions, and displaying them.
The histogram automatically reallocates its bins as data points
arrive, so the caller only needs to provide some initial guidance
about bin size and ``phase'' (offset of the bins relative to the real
number line). It accumulates counts in 64-bit unsigned integers, so
it can handle over $10^19$ total counts. Optionally (and provided
that the caller knows it has enough memory to support this), a
``full'' histogram can be created and used to collect a sorted vector
of raw (unbinned) values.
Various different ways of fitting histogram data to different sorts of
expected distributions are supported, with interfaces to all of
Easel's statistical distribution modules. Data fitting is oriented
toward the case where the values are scores, with high scores being of
the most interest; for instance, routines for obtaining and fitting
the right (high-scoring) tail are provided, but not for the left tail.
Several of the output functions output data as XY data files suitable
for input into the popular and freely available \prog{xmgrace}
graphing program [\url{http://plasma-gate.weizmann.ac.il/Grace/}].
The API for the \eslmod{histogram} module is summarized in
Table~\ref{tbl:histogram_api}.
\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
\apisubhead{Collecting data in an \ccode{ESL\_HISTOGRAM}}\\
\hyperlink{func:esl_histogram_Create()}{\ccode{esl\_histogram\_Create()}} & Create a new \ccode{ESL\_HISTOGRAM}.\\
\hyperlink{func:esl_histogram_CreateFull()}{\ccode{esl\_histogram\_CreateFull()}} & A \ccode{ESL\_HISTOGRAM} to keep all data samples.\\
\hyperlink{func:esl_histogram_Destroy()}{\ccode{esl\_histogram\_Destroy()}} & Frees a \ccode{ESL\_HISTOGRAM}.\\
\hyperlink{func:esl_histogram_Add()}{\ccode{esl\_histogram\_Add()}} & Add a sample to the histogram.\\
\apisubhead{Declarations about binned data, before fitting}\\
\hyperlink{func:esl_histogram_DeclareCensoring()}{\ccode{esl\_histogram\_DeclareCensoring()}} & Collected data were left-censored.\\
\hyperlink{func:esl_histogram_DeclareRounding()}{\ccode{esl\_histogram\_DeclareRounding()}} & Declare collected data were no more accurate than bins.\\
\hyperlink{func:esl_histogram_SetTail()}{\ccode{esl\_histogram\_SetTail()}} & Declare only tail $>$ some threshold is considered "observed".\\
\hyperlink{func:esl_histogram_SetTailByMass()}{\ccode{esl\_histogram\_SetTailByMass()}} & Declare only right tail mass is considered "observed".\\
\apisubhead{Accessing raw data samples}\\
\hyperlink{func:esl_histogram_GetRank()}{\ccode{esl\_histogram\_GetRank()}} & Retrieve n'th high score.\\
\hyperlink{func:esl_histogram_GetData()}{\ccode{esl\_histogram\_GetData()}} & Retrieve vector of all raw scores.\\
\hyperlink{func:esl_histogram_GetTail()}{\ccode{esl\_histogram\_GetTail()}} & Retrieve all raw scores above some threshold.\\
\hyperlink{func:esl_histogram_GetTailByMass()}{\ccode{esl\_histogram\_GetTailByMass()}} & Retrieve all raw scores in right tail mass.\\
\apisubhead{Setting expected counts}\\
\hyperlink{func:esl_histogram_SetExpect()}{\ccode{esl\_histogram\_SetExpect()}} & Set expected counts for complete distribution.\\
\hyperlink{func:esl_histogram_SetExpectedTail()}{\ccode{esl\_histogram\_SetExpectedTail()}} & Set expected counts for right tail.\\
\apisubhead{Output}\\
\hyperlink{func:esl_histogram_Write()}{\ccode{esl\_histogram\_Write()}} & Print a "pretty" ASCII histogram.\\
\hyperlink{func:esl_histogram_Plot()}{\ccode{esl\_histogram\_Plot()}} & Output a histogram in xmgrace XY format.\\
\hyperlink{func:esl_histogram_PlotSurvival()}{\ccode{esl\_histogram\_PlotSurvival()}} & Output $P(X>x)$ in xmgrace XY format.\\
\hyperlink{func:esl_histogram_PlotQQ()}{\ccode{esl\_histogram\_PlotQQ()}} & Output a Q-Q plot in xmgrace XY format.\\
\hyperlink{func:esl_histogram_Goodness()}{\ccode{esl\_histogram\_Goodness()}} & Evaluate fit between observed, expected. \\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{histogram} API.}
\label{tbl:histogram_api}
\end{table}
\subsection{Example of using the histogram API}
The example code below stores 10,000 samples from a Gumbel
distribution in a histogram, retrieves a vector containing the sorted
samples, fits a Gumbel distribution to that dataset, sets the expected
counts in the histogram, prints the observed and expected counts in an
ASCII histogram, and evaluates the goodness-of-fit.
\input{cexcerpts/histogram_example}
Some points of interest:
\begin{itemize}
\item When the histogram is created, the arguments \ccode(-100, 100, 0.5)
tell it to bin data into bins of width 0.5, initially
starting at -100 and ending at 100. This initialization
is described below (see ``Specifying binning of data values'').
\item Samples are collected one at a time with
\ccode{esl\_histogram\_Add()}.
\item After the data have been collected in a \emph{full} histogram, a
vector of sorted raw data values can be retrieved using functions
like \ccode{esl\_histogram\_GetData()}, and used to fit parameters
of an expected distribution to the data.
\item In addition to the observed binned counts, you can optionally
set \emph{expected} binned counts in the histogram by calling
\ccode{esl\_histogram\_SetExpect()} and providing pointers
to an appropriate distribution function and its parameters.
\item The \ccode{esl\_histogram\_Print()} function shows an ASCII text
representation of the observed counts (and expected counts, if set)
that looks a lot like FASTA's nice histogram output.
\item The \ccode{esl\_histogram\_Goodness()} function compares the
observed and expected binned counts, and calculates two goodness of
fit tests: a G-test, and a $\chi^2$ test.
\end{itemize}
\subsection{Specifying binning of data values}
The histogram collects data values into bins. When the histogram is
created, the bin width and the relative offset of the bins is
permanently set, and an initial range is allocated.
For example, the call \ccode{esl\_histogram\_Create(-10, 10, 0.5)}
creates 40 bins of width 0.5 from -10 to 10, with the first bin
collecting scores from $-10 < x \leq -9.5$, and the last bin
collecting scores $9.5 < x \leq 10.0$.
The lower bound of the initialization permanently sets the relative
offset of the bins. That is, \ccode{esl\_histogram\_Create(-10, 10,
0.5)} makes the first bin $-10 < x \leq -9.5$, whereas
\ccode{esl\_histogram\_Create(-10.1, 9.9, 0.5)} makes the first bin
$-10.1 < x \leq -9.6$.
Aside from that, the initial range is only a suggestion. You can add
any real-valued $x$ to the histogram. The histogram will silently
reallocate itself to a wider range as needed. The ability of a
histogram to store data is effectively unlimited. Up to $2^{64}-1$
(more than $10^{19}$) counts can be collected. The histogram requires 16
bytes of storage per bin, and the number of bins it allocates scales
as $x_{\mbox{max}} - x_{\mbox{min}} / w$.
\subsection{Optional collection of raw data values: full histograms}
Normally a histogram would store only binned counts, so it can
efficiently summarize even very large numbers of samples.
In some cases it is useful to keep a list of the raw data values --
for instance, for more accurate parameter fitting to expected
distributions. This can be done by creating a ``full'' histogram with
\ccode{esl\_histogram\_CreateFull()} instead of
\ccode{esl\_histogram\_Create()}. (The example code above did this,
because it did parameter fitting to the raw data.) After data have
been collected in a full histogram, individual raw values or pointers
to sorted arrays of raw values can be retrieved using the
\ccode{esl\_histogram\_Get*} functions.
A full histogram may require much more memory: about 4 bytes per data
point. You may not want to use full histograms if your problem
involves collecting many ($> 10^9$, say) data points.
\subsection{Different parameter fitting scenarios}
By default, the data you collect are assumed to be \emph{complete}.
You observed all samples; if you fit to any expected distribution, the
expected distribution is assumed to describe the complete data; the
parameters of the expected distribution are to be fitted to an array
of the complete raw data samples; and any goodness of fit test is to
be applied to the complete data. This is the simplest, most obvious
case.
Other situations may arise. In addition to complete data, Easel is
designed to deal with four other cases:
\begin{enumerate}
\item The collected data are complete, and they are fit to a
distribution that describes the complete data, but parameter
fitting is done only in the right (highest-scoring) tail. This
makes parameter fitting focus on the most important,
high-scoring region of a score distribution, and ignore
low-scoring outliers.
\item The collected data are complete, but they are fit to a
distribution that only describes the right (highest scoring)
tail, and the goodness-of-fit test is only performed on that
tail. This case arises when we don't know the form of the
expected distribution for the complete data, but the tail
follows a predictable decay (an exponential tail, for example).
\item The collected data are left-censored such that no values $<
\phi$ were recorded in the histogram, but the data are fit to a
complete distribution that predicts the probability even of the
censored (unobserved) values. Goodness of fit is only evaluated
in the observed data. (This case is what is actually meant by
left-censored data.)
\item The high-scoring right tail of the collected data are fit as the
\emph{binned} counts in the histogram (not raw sample values) to
a distribution that describes the tail, such as an
exponential. This case becomes useful when the raw data values
have limited precision (because of rounding, for example), which
can cause numerical problems with parameter fitting to tails.
Another case where this is useful is when there are so many data
points that the data must be binned just as a matter of
practicality (not enough memory to hold a full histogram).
\end{enumerate}
A variety of other situations can be dealt with by using different
combinations of the function calls that deal with these four cases.
\subsubsection{Focusing parameter fitting on the highest scores}
An example of focusing a Gumbel parameter fit on the right half of an
observed distribution:
\input{cexcerpts/histogram_example2}
The key differences from the complete data case are:
\begin{itemize}
\item Only the high-scoring 50\% of the data samples are
retrieved, by calling
\ccode{esl\_histogram\_GetTailByMass(h, 0.5, \&xv, \&n, \&z)}.
This returns \ccode{z}, the number of samples that
were \emph{censored}.
\item These data are fit to a Gumbel distribution
as a \emph{left-censored} dataset by calling
\ccode{esl\_gumbel\_FitCensored(xv, n, z, xv[0], \&mu, \&lambda)}.
\end{itemize}
The expected counts and the goodness of fit tests are still evaluated
for the complete data, even though the fit was performed only on the
highest scores.
\subsubsection{Fitting to a tail distribution}
An example of fitting an exponential tail to the high-scoring 10\% of
a Gumbel-distributed dataset:
\input{cexcerpts/histogram_example3}
The differences to note are:
\begin{itemize}
\item The tail is fit as if it is \emph{complete} data as far
as the exponential distribution is concerned.
\item As a result, to use the exponential tail to predict expected
data, we have to keep in mind how much probability mass the tail
is supposed to predict (here, 10\%), and that
is provided to
\ccode{esl\_histogram\_SetExpectedTail()}, which specifically
calculates expected counts for a tail.
\end{itemize}
\subsubsection{Fitting left-censored data}
Fitting a Gumbel distribution to data that are \emph{truly} left
censored looks a lot like the case where we extracted the high scoring
data for a censored fit:
\input{cexcerpts/histogram_example4}
\subsubsection{Fitting binned data to a tail distribution}
Normally, you want to fit parameters to the actual individual data
samples, not to binned data, because you'll get more accurate results.
An exception can arise when the data samples have limited precision
because they've been rounded off. Most distributions are not sensitive
to this, but some tail densities are, especially those with
singularities ($P(X=x) \rightarrow \infty$) at their origin. In such a
case, a fit to binned data may be superior, especially if you can
match the histogram's bins to the rounding procedure that was used.
The following code shows an example of fitting for samples that were
already rounded up to the nearest integer before adding them to the
histogram:
\input{cexcerpts/histogram_example5}
Issues to note:
\begin{itemize}
\item The \ccode{esl\_histogram\_Create(-100, 100, 1.0)} call
defined bins that exactly match the rounding procedure
defined by \ccode{ceil(x)} -- all $x$ that are rounded
to the same value by \ccode{ceil(x)} would also go in
the same bin of the histogram.
\item The \ccode{esl\_histogram\_SetTailByMass()} function sets flags
in the histogram to demarcate the desired tail. However,
because the data have been binned, and we can only define the
tail by a range of bins, it will generally be impossible to
match the requested tail mass with adequate accuracy; the actual
tail mass is $\geq$ the requested tail mass. It is returned
to the caller, and it is the actual mass, not the requested mass,
that should be used when setting expected counts.
\item The \ccode{esl\_histogram\_SetRounding()} declaration
sets a flag in the histogram that tells binned parameter
fitting functions that the origin of the fitted
density ($\mu$) should be set at the lower bound of the smallest bin,
rather than the smallest raw data value observed in that
bin.
\end{itemize}