Key changes in v0.2

The LDA sequential and multiprocessing models now default to a Cythonized version of the collapsed Gibbs sampling loop, which gives much shorter training times.
The various methods used to quantify distances between numerical representations of semantic features of data (words, documents, topics) now default to using metric functions. In particular, distances between probability distributions are computed as the Jensen-Shannon distance; other sorts of vectors (e.g., from LSA or from BEAGLE) are compared using angular distance. vsm.spatial also includes a wrapper for any distance or similarity function found in scipy.spatial.distance.
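To make the two default distances concrete, here is a minimal sketch written with only the Python standard library. It is not the code in vsm.spatial: the function names jensen_shannon_distance and angular_distance are hypothetical, the inputs are assumed to be plain lists, and the probability vectors are assumed to be already normalized. Whether vsm rescales the angle (for example, by dividing by pi) is not stated here, so the sketch returns the raw angle in radians.

```python
import math


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2).

    Assumes p and q are equal-length probability vectors that each sum to 1.
    Taking the square root is what makes this a metric.
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # Midpoint distribution; it is positive wherever p or q is positive.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))


def angular_distance(u, v):
    """Angle in radians between two nonzero vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


if __name__ == "__main__":
    topic_a = [0.7, 0.2, 0.1]
    topic_b = [0.1, 0.3, 0.6]
    print(jensen_shannon_distance(topic_a, topic_b))
    print(angular_distance([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))
```

Both functions are symmetric and return zero for identical inputs (angular distance also returns zero for vectors that differ only by a positive scale factor), which is the practical reason to prefer them over raw divergences or similarity scores when sorting or clustering.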
Most of the plotting and clustering functionality has been migrated to an extension vsm.extension.clustering, as there are many possibilities in this direction and the core of vsm should limit itself to providing a stable source of data for these.
Likewise, the corpus-building tools have been migrated to an extension, vsm.extension.corpusbuilders. There are many ways to build a corpus, and corpus data and metadata arrive in many different forms. The core of vsm should limit itself to providing a stable target data structure for the corpus preparation stage of the workflow.
Importing the various classes that vsm provides is now much simpler. In the style of numpy, import vsm or from vsm import * should bring in most of the commonly used classes and functions.
