

The \eslmod{histogram} module is for collecting scores, fitting
them to expected distributions, and displaying them.

The histogram automatically reallocates its bins as data points
arrive, so the caller only needs to provide some initial guidance
about bin size and ``phase'' (offset of the bins relative to the real
number line).  It accumulates counts in 64-bit unsigned integers, so
it can handle over $10^{19}$ total counts.  Optionally (and provided
that the caller knows it has enough memory to support this), a
``full'' histogram can be created and used to collect a sorted vector
of raw (unbinned) values.

Several ways of fitting histogram data to expected distributions are
supported, with interfaces to all of
Easel's statistical distribution modules. Data fitting is oriented
toward the case where the values are scores, with high scores being of
the most interest; for instance, routines for obtaining and fitting
the right (high-scoring) tail are provided, but not for the left tail.

Several of the output functions output data as XY data files suitable
for input into the popular and freely available \prog{xmgrace}
graphing program [\url{http://plasma-gate.weizmann.ac.il/Grace/}].

The API for the \eslmod{histogram} module is summarized in
Table~\ref{tbl:histogram_api}.

\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
    \apisubhead{Collecting data in an \ccode{ESL\_HISTOGRAM}}\\
\hyperlink{func:esl_histogram_Create()}{\ccode{esl\_histogram\_Create()}} & Create a new \ccode{ESL\_HISTOGRAM}.\\
\hyperlink{func:esl_histogram_CreateFull()}{\ccode{esl\_histogram\_CreateFull()}} & A \ccode{ESL\_HISTOGRAM} to keep all data samples.\\
\hyperlink{func:esl_histogram_Destroy()}{\ccode{esl\_histogram\_Destroy()}} & Frees a \ccode{ESL\_HISTOGRAM}.\\
\hyperlink{func:esl_histogram_Add()}{\ccode{esl\_histogram\_Add()}} & Add a sample to the histogram.\\
    \apisubhead{Declarations about binned data, before fitting}\\
\hyperlink{func:esl_histogram_DeclareCensoring()}{\ccode{esl\_histogram\_DeclareCensoring()}} & Collected data were left-censored.\\
\hyperlink{func:esl_histogram_DeclareRounding()}{\ccode{esl\_histogram\_DeclareRounding()}} & Declare collected data were no more accurate than bins.\\
\hyperlink{func:esl_histogram_SetTail()}{\ccode{esl\_histogram\_SetTail()}} & Declare only tail $>$ some threshold is considered ``observed''.\\
\hyperlink{func:esl_histogram_SetTailByMass()}{\ccode{esl\_histogram\_SetTailByMass()}} & Declare only right tail mass is considered ``observed''.\\
    \apisubhead{Accessing raw data samples}\\
\hyperlink{func:esl_histogram_GetRank()}{\ccode{esl\_histogram\_GetRank()}} & Retrieve the $n$th highest score.\\
\hyperlink{func:esl_histogram_GetData()}{\ccode{esl\_histogram\_GetData()}} & Retrieve vector of all raw scores.\\
\hyperlink{func:esl_histogram_GetTail()}{\ccode{esl\_histogram\_GetTail()}} & Retrieve all raw scores above some threshold.\\
\hyperlink{func:esl_histogram_GetTailByMass()}{\ccode{esl\_histogram\_GetTailByMass()}} & Retrieve all raw scores in right tail mass.\\
    \apisubhead{Setting expected counts}\\
\hyperlink{func:esl_histogram_SetExpect()}{\ccode{esl\_histogram\_SetExpect()}} & Set expected counts for complete distribution.\\
\hyperlink{func:esl_histogram_SetExpectedTail()}{\ccode{esl\_histogram\_SetExpectedTail()}} & Set expected counts for right tail.\\
    \apisubhead{Output}\\
\hyperlink{func:esl_histogram_Write()}{\ccode{esl\_histogram\_Write()}} & Print a ``pretty'' ASCII histogram.\\
\hyperlink{func:esl_histogram_Plot()}{\ccode{esl\_histogram\_Plot()}} & Output a histogram in xmgrace XY format.\\
\hyperlink{func:esl_histogram_PlotSurvival()}{\ccode{esl\_histogram\_PlotSurvival()}} & Output $P(X>x)$ in xmgrace XY format.\\
\hyperlink{func:esl_histogram_PlotQQ()}{\ccode{esl\_histogram\_PlotQQ()}} & Output a Q-Q plot in xmgrace XY format.\\
\hyperlink{func:esl_histogram_Goodness()}{\ccode{esl\_histogram\_Goodness()}} & Evaluate fit between observed, expected. \\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{histogram} API.}
\label{tbl:histogram_api}
\end{table}

\subsection{Example of using the histogram API}

The example code below stores 10,000 samples from a Gumbel
distribution in a histogram, retrieves a vector containing the sorted
samples, fits a Gumbel distribution to that dataset, sets the expected
counts in the histogram, prints the observed and expected counts in an
ASCII histogram, and evaluates the goodness-of-fit.

\input{cexcerpts/histogram_example}

Some points of interest:

\begin{itemize}
\item When the histogram is created, the arguments \ccode{(-100, 100, 0.5)}
      tell it to bin data into bins of width 0.5, initially
      starting at $-100$ and ending at $100$. This initialization
      is described below (see ``Specifying binning of data values'').

\item Samples are collected one at a time with
  \ccode{esl\_histogram\_Add()}.

\item After the data have been collected in a \emph{full} histogram, a
   vector of sorted raw data values can be retrieved using functions
   like \ccode{esl\_histogram\_GetData()}, and used to fit parameters
   of an expected distribution to the data.

\item In addition to the observed binned counts, you can optionally
   set \emph{expected} binned counts in the histogram by calling
   \ccode{esl\_histogram\_SetExpect()} and providing pointers
   to an appropriate distribution function and its parameters.

\item The \ccode{esl\_histogram\_Write()} function shows an ASCII text
   representation of the observed counts (and expected counts, if set)
   that looks a lot like FASTA's nice histogram output.

\item The \ccode{esl\_histogram\_Goodness()} function compares the
   observed and expected binned counts, and performs two
   goodness-of-fit tests: a G-test and a $\chi^2$ test.
\end{itemize}
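
For reference, the two statistics named in the last item have the
following textbook forms, where $O_i$ and $E_i$ are the observed and
expected counts in bin $i$ of $m$ bins. This is a sketch of the
standard definitions only; how \ccode{esl\_histogram\_Goodness()}
groups bins or chooses degrees of freedom is not described here, and
the symbols $O_i$, $E_i$, and $m$ are introduced for illustration.

\[
  G = 2 \sum_{i=1}^{m} O_i \ln \frac{O_i}{E_i},
  \qquad
  \chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}.
\]

Both statistics are zero when observed and expected counts agree
exactly, and both grow as the fit worsens. For instance, with observed
counts $(10, 20, 30)$ and expected counts $(20, 20, 20)$, $\chi^2 =
5 + 0 + 5 = 10$ and $G = 2\,(10 \ln 0.5 + 30 \ln 1.5) \approx 10.46$.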


\subsection{Specifying binning of data values}

The histogram collects data values into bins. When the histogram is
created, the bin width and the relative offset of the bins are
permanently set, and an initial range is allocated. 

For example, the call \ccode{esl\_histogram\_Create(-10, 10, 0.5)}
creates 40 bins of width 0.5 from $-10$ to $10$, with the first bin
collecting scores from $-10 < x \leq -9.5$, and the last bin
collecting scores $9.5 < x \leq 10.0$.

The lower bound of the initialization permanently sets the relative
offset of the bins. That is, \ccode{esl\_histogram\_Create(-10, 10,
0.5)} makes the first bin $-10 < x \leq -9.5$, whereas
\ccode{esl\_histogram\_Create(-10.1, 9.9, 0.5)} makes the first bin
$-10.1 < x \leq -9.6$.
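
The mapping from a value to its bin can be sketched in a few lines of
C. This is a minimal illustration of the half-open
$\mbox{lower} < x \leq \mbox{upper}$ convention described above, not
Easel's own code; the function name \ccode{score\_to\_bin()} and its
arguments are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not part of Easel: map a value x to the index of
 * the bin that would hold it, for bins of width w whose lower edges sit
 * at bmin + i*w. Bin i covers the half-open interval
 *     bmin + i*w  <  x  <=  bmin + (i+1)*w,
 * matching the convention described in the text. An index below 0, or
 * beyond the initially allocated range, means the histogram would have
 * to grow in order to hold x. */
int
score_to_bin(double x, double bmin, double w)
{
  double t;
  long   k;

  assert(w > 0.0);
  t = (x - bmin) / w;       /* position in units of bin width           */
  k = (long) t;             /* truncate toward zero                      */
  if ((double) k < t) k++;  /* round up: k is now ceil(t), without libm  */
  return (int) (k - 1);     /* an upper edge belongs to the bin below it */
}
```

With \ccode{bmin = -10} and \ccode{w = 0.5}, a value of $-9.5$ maps to
bin 0 and $10.0$ maps to bin 39, while $-10.0$ itself maps to bin
$-1$, just below the initial range. Changing \ccode{bmin} shifts every
bin edge, which is why the lower bound fixes the phase of the bins.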

Aside from that, the initial range is only a suggestion. You can add
any real-valued $x$ to the histogram. The histogram will silently
reallocate itself to a wider range as needed.  The ability of a
histogram to store data is effectively unlimited. Up to $2^{64}-1$
(more than $10^{19}$) counts can be collected. The histogram requires 16
bytes of storage per bin, and the number of bins it allocates scales
as $(x_{\mbox{max}} - x_{\mbox{min}}) / w$.
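
As a rough worked example of this storage estimate (taking the figure
of 16 bytes per bin at face value, and writing $n_{\mbox{bins}}$ for
the number of bins, a symbol introduced here for illustration), the
initial range from $-10$ to $10$ with $w = 0.5$ gives:

\[
  n_{\mbox{bins}} \approx \frac{x_{\mbox{max}} - x_{\mbox{min}}}{w}
                  = \frac{10 - (-10)}{0.5} = 40,
  \qquad
  40 \times 16~\mbox{bytes} = 640~\mbox{bytes}.
\]

If samples later arrive anywhere between $-1000$ and $1000$, the same
bin width needs about $2000 / 0.5 = 4000$ bins, or roughly 64,000
bytes. Any extra headroom added during reallocation is ignored in this
estimate.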

\subsection{Optional collection of raw data values: full histograms}

Normally a histogram stores only binned counts, so it can
efficiently summarize even very large numbers of samples.

In some cases it is useful to keep a list of the raw data values --
for instance, for more accurate parameter fitting to expected
distributions. This can be done by creating a ``full'' histogram with
\ccode{esl\_histogram\_CreateFull()} instead of
\ccode{esl\_histogram\_Create()}. (The example code above did this,
because it did parameter fitting to the raw data.) After data have
been collected in a full histogram, individual raw values or pointers
to sorted arrays of raw values can be retrieved using the
\ccode{esl\_histogram\_Get*} functions.

A full histogram may require much more memory: about 4 bytes per data
point. You may not want to use full histograms if your problem
involves collecting many ($> 10^9$, say) data points.
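
A back-of-envelope estimate, using the figure of about 4 bytes per
data point quoted above, shows why $10^9$ samples is a reasonable
point to hesitate:

\[
  10^{9}~\mbox{samples} \times 4~\mbox{bytes per sample}
  = 4 \times 10^{9}~\mbox{bytes} \approx 3.7~\mbox{GiB}.
\]

This is in addition to the binned counts, which are typically
negligible by comparison.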


\subsection{Different parameter fitting scenarios}

By default, the data you collect are assumed to be \emph{complete}.
You observed all samples; if you fit to any expected distribution, the
expected distribution is assumed to describe the complete data; the
parameters of the expected distribution are to be fitted to an array
of the complete raw data samples; and any goodness of fit test is to
be applied to the complete data. This is the simplest, most obvious
case.

Other situations may arise. In addition to complete data, Easel is
designed to deal with four other cases:

\begin{enumerate}
\item The collected data are complete, and they are fit to a
      distribution that describes the complete data, but parameter
      fitting is done only in the right (highest-scoring) tail. This
      makes parameter fitting focus on the most important,
      high-scoring region of a score distribution, and ignore
      low-scoring outliers.

\item The collected data are complete, but they are fit to a
      distribution that only describes the right (highest scoring)
      tail, and the goodness-of-fit test is only performed on that
      tail. This case arises when we don't know the form of the
      expected distribution for the complete data, but the tail
      follows a predictable decay (an exponential tail, for example).

\item The collected data are left-censored such that no values $<
      \phi$ were recorded in the histogram, but the data are fit to a
      complete distribution that predicts the probability even of the
      censored (unobserved) values. Goodness of fit is only evaluated
      in the observed data. (This case is what is actually meant by
      left-censored data.)

\item The high-scoring right tail of the collected data is fit using the
      \emph{binned} counts in the histogram (not raw sample values) to
      a distribution that describes the tail, such as an
      exponential. This case becomes useful when the raw data values
      have limited precision (because of rounding, for example), which
      can cause numerical problems with parameter fitting to tails.
      Another case where this is useful is when there are so many data
      points that the data must be binned just as a matter of
      practicality (not enough memory to hold a full histogram).
\end{enumerate}

A variety of other situations can be dealt with by using different
combinations of the function calls that deal with these four cases.


\subsubsection{Focusing parameter fitting on the highest scores}

An example of focusing a Gumbel parameter fit on the right half of an
observed distribution:

\input{cexcerpts/histogram_example2}

The key differences from the complete data case are:

\begin{itemize}
\item Only the high-scoring 50\% of the data samples are
      retrieved, by calling 
      \ccode{esl\_histogram\_GetTailByMass(h, 0.5, \&xv, \&n, \&z)}.
      This returns \ccode{z}, the number of samples that 
      were \emph{censored}.

\item These data are fit to a Gumbel distribution
      as a \emph{left-censored} dataset by calling
      \ccode{esl\_gumbel\_FitCensored(xv, n, z, xv[0], \&mu, \&lambda)}.
\end{itemize}

The expected counts and the goodness of fit tests are still evaluated
for the complete data, even though the fit was performed only on the
highest scores.
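
The censored fit can be motivated by the standard log-likelihood for
left-censored Gumbel data, sketched here with location $\mu$ and scale
$\lambda$. This is the textbook form; it is assumed, not confirmed,
to be what \ccode{esl\_gumbel\_FitCensored()} maximizes. Let
$x_1 \ldots x_n$ be the $n$ observed samples, $z$ the number of
censored samples, and $\phi$ the censoring threshold (presumably the
role played by the \ccode{xv[0]} argument in the call above):

\[
  \log L(\mu, \lambda) = n \log \lambda
     - \sum_{i=1}^{n} \lambda (x_i - \mu)
     - \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
     - z \, e^{-\lambda (\phi - \mu)}.
\]

The first three terms are the Gumbel log density summed over the
observed samples. The last term is $z$ times the log of the Gumbel
distribution function at $\phi$, that is, the log probability that a
sample falls at or below the threshold and so goes unobserved.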


\subsubsection{Fitting to a tail distribution}

An example of fitting an exponential tail to the high-scoring 10\% of
a Gumbel-distributed dataset:

\input{cexcerpts/histogram_example3}

The differences to note are:

\begin{itemize}
\item The tail is fit as if it is \emph{complete} data as far
      as the exponential distribution is concerned.

\item As a result, to use the exponential tail to predict expected
      data, we have to keep in mind how much probability mass the tail
      is supposed to predict (here, 10\%), and that
      is provided to
      \ccode{esl\_histogram\_SetExpectedTail()}, which specifically
      calculates expected counts for a tail.
\end{itemize}
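
One way to picture what setting expected tail counts involves is the
following sketch. It rests on assumptions and is not a statement of
the exact computation inside
\ccode{esl\_histogram\_SetExpectedTail()}; the symbols $N$, $p$, and
$F$ are introduced here for illustration. If $N$ is the total number
of samples, $p$ the tail mass, and $F$ the distribution function of
the tail density, then the expected count in a tail bin spanning
$a < x \leq b$ is

\[
  E(a,b) = N \, p \, \left[ F(b) - F(a) \right].
\]

For an exponential tail with origin $\mu$ and rate $\lambda$,
$F(x) = 1 - e^{-\lambda (x - \mu)}$ for $x \geq \mu$. With
$N = 10000$, $p = 0.1$, and $\lambda = 1$, a bin of width 1 immediately
above the origin would expect about
$10000 \times 0.1 \times (1 - e^{-1}) \approx 632$ counts.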

\subsubsection{Fitting left-censored data}

Fitting a Gumbel distribution to data that are \emph{truly} left
censored looks a lot like the case where we extracted the high scoring
data for a censored fit:

\input{cexcerpts/histogram_example4}

\subsubsection{Fitting binned data to a tail distribution}

Normally, you want to fit parameters to the actual individual data
samples, not to binned data, because you'll get more accurate results.
An exception can arise when the data samples have limited precision
because they've been rounded off. Most distributions are not sensitive
to this, but some tail densities are, especially those with
singularities ($P(X=x) \rightarrow \infty$) at their origin. In such a
case, a fit to binned data may be superior, especially if you can
match the histogram's bins to the rounding procedure that was used.

The following code shows an example of fitting for samples that were
already rounded up to the nearest integer before adding them to the
histogram:

\input{cexcerpts/histogram_example5}

Issues to note:

\begin{itemize}
\item The \ccode{esl\_histogram\_Create(-100, 100, 1.0)} call
      defined bins that exactly match the rounding procedure
      defined by \ccode{ceil(x)} -- all $x$ that are rounded
      to the same value by \ccode{ceil(x)} would also go in
      the same bin of the histogram.

\item The \ccode{esl\_histogram\_SetTailByMass()} function sets flags
      in the histogram to demarcate the desired tail.  However,
      because the data have been binned, and we can only define the
      tail by a range of bins, it will generally be impossible to
      match the requested tail mass exactly; the actual
      tail mass is $\geq$ the requested tail mass. It is returned
      to the caller, and it is the actual mass, not the requested mass,
      that should be used when setting expected counts.

\item The \ccode{esl\_histogram\_DeclareRounding()} declaration
      sets a flag in the histogram that tells binned parameter
      fitting functions that the origin of the fitted
      density ($\mu$) should be set at the lower bound of the smallest bin,
      rather than the smallest raw data value observed in that 
      bin. 
\end{itemize}
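
The remark above, that a tail defined on binned data covers at least
the requested mass, can be illustrated with a short C sketch. It is
not Easel's implementation; the function
\ccode{tail\_by\_mass\_binned()} and its arguments are hypothetical,
and it only shows why whole-bin boundaries force the actual mass to be
$\geq$ the requested mass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Easel: given binned counts
 * obs[0..nb-1] and a requested right-tail mass pmass (0 < pmass <= 1),
 * find the lowest bin that must be included so that the tail holds at
 * least pmass of all counts. Because whole bins are taken, the mass
 * actually covered is >= pmass; that actual mass is returned, and the
 * index of the first tail bin is stored in *ret_first. */
double
tail_by_mass_binned(const unsigned long long *obs, int nb, double pmass,
                    int *ret_first)
{
  unsigned long long total = 0;
  unsigned long long sum   = 0;
  int b;

  assert(obs != NULL && ret_first != NULL);
  assert(nb > 0 && pmass > 0.0 && pmass <= 1.0);
  for (b = 0; b < nb; b++) total += obs[b];
  assert(total > 0);

  /* Walk down from the highest bin, accumulating counts. */
  for (b = nb - 1; b >= 0; b--) {
    sum += obs[b];
    if ((double) sum / (double) total >= pmass) break;
  }
  if (b < 0) b = 0;  /* defensive: the tail is then the whole histogram */
  *ret_first = b;
  return (double) sum / (double) total;
}
```

With counts \ccode{\{50, 30, 15, 5\}} and a requested mass of 0.10,
the top bin alone holds only 0.05 of the data, so the tail must start
one bin lower and the actual mass is 0.20. It is this returned value,
not 0.10, that would be passed on when setting expected counts.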