Byblo is a software package for the construction of large-scale distributional thesauri. It provides an efficient yet flexible framework for calculating all pair-wise similarities between terms in a corpus.
Naively, a distributional thesaurus can be thought of much like a traditional thesaurus: it allows one to look up a word, returning a list of synonyms. Unlike traditional thesauri, however, the synonyms are not manually curated by humans; they are calculated using a statistical model estimated from a text corpus. Notionally, the similarity between two terms might be calculated from the intersection of the features of the terms, extracted from a large corpus of text. For example, if the phrases "christmas holiday" and "xmas holiday" occur very frequently in the corpus, the model might indicate that christmas and xmas are similar. Here we have decided that co-occurring words are features; both christmas and xmas share the feature holiday, so they are similar.
Unfortunately, a distributional thesaurus is actually not at all like a manually curated one. Fundamentally, we are defining a notion of similarity that is wholly different from synonymy. Worse still, that notion changes depending on the corpus, feature selection, and similarity measure. The previous paragraph gives a ludicrously high-level overview of the project's purpose, for those who are only mildly curious. For those who are intent on using the software, let me start again:
A distributional thesaurus is a resource that contains the estimated substitutability of entries. Typically these entries are terms or phrases. The thesaurus can be queried with an entry (the base entry) and returns a list of other entries (the base entry's neighbours). These neighbours can be used as part of an external text-processing pipeline, for example to expand queries during information retrieval. The entries are extracted from a text corpus, along with features of each entry. The similarity of entries is calculated as a function of their features. Underpinning this process is Harris' distributional hypothesis: "a word is characterised by the company it keeps". As mentioned above, it is hard to pin down the precise notion of similarity used because there are so many variables.
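To make the querying side concrete, here is a minimal sketch of how a pre-built thesaurus might be held in memory and used for query expansion. This is purely illustrative and is not Byblo's API; the class name and the example entries are invented for the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a thesaurus as a map from each base entry to its
// neighbours, ordered by decreasing similarity.
public class ThesaurusLookupExample {
    public static void main(String[] args) {
        Map<String, List<String>> thesaurus = new HashMap<String, List<String>>();
        thesaurus.put("happy", Arrays.asList("pleased", "glad", "thrilled"));

        // Query expansion: augment a search term with its nearest neighbours.
        String query = "happy";
        List<String> neighbours = thesaurus.get(query);
        System.out.println("expanded query: " + query + " " + neighbours);
    }
}
```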
To provide an intuition, here is an example:
Take as our input corpus a balanced collection of English-language text (such as Wikipedia). Let our entries be all unique terms in the corpus. We shall select as the features of a base entry the frequencies of all terms that co-occur with it in the corpus within a window of ±1 terms. Finally, the similarity function will be Cosine, which represents the feature sets as high-dimensional vectors and calculates similarity as the cosine of the angle between them, so similarity falls as the vectors approach orthogonality. The thesaurus build process proceeds as follows (a minimal code sketch of these steps appears after the list):
- Tokenise the corpus, extracting a list of all unique terms. For each entry, record occurrences of all the other entries within a window of ±1. For example, if we encounter the string "the big red bus", the entry-features produced are the:big, big:the, big:red, red:big, red:bus, and bus:red. Note that this step of the process is not (yet) covered by the provided software.
- Count the occurrences of each unique feature with each unique entry. So we may find that the string "red bus" occurs 100 times in the corpus, and that "bus red" occurs 2 times. In this case we produce the features bus:red:102 and red:bus:102 (remember the feature window is ±1 term). In words, we are saying that "red" occurs as a feature of "bus" 102 times, and "bus" occurs as a feature of "red" 102 times. For each entry, construct a multi-set of all the features it occurs with. So "bus" may occur with "green" 23 times, and with "big" 376 times, etc.
- Convert the feature multi-set for each entry to a vector, where each feature is a dimension and the magnitude along a dimension is the frequency.
- For all pairs of entries, calculate their similarity as the cosine of the angle between their feature vectors: sim(u, v) = (u · v) / (‖u‖ ‖v‖).
- For each base entry, select as its neighbours the k entries with the highest similarity.
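The following self-contained sketch walks through these five steps on a single toy sentence. It is purely illustrative: this is not how Byblo itself is implemented, and all class and variable names are invented for the example.

```java
import java.util.*;

// Illustrative toy pipeline: ±1-window co-occurrence counting followed by
// cosine similarity between the resulting sparse feature vectors.
public class ToyThesaurus {

    public static void main(String[] args) {
        String[] tokens = "the big red bus stopped near the big red house".split(" ");

        // Steps 1-2: for each entry, count the frequency of every feature,
        // where a feature is any token occurring within a window of +/-1.
        Map<String, Map<String, Integer>> vectors = new HashMap<String, Map<String, Integer>>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = Math.max(0, i - 1); j <= Math.min(tokens.length - 1, i + 1); j++) {
                if (j == i) continue;
                Map<String, Integer> vec = vectors.get(tokens[i]);
                if (vec == null) {
                    vec = new HashMap<String, Integer>();
                    vectors.put(tokens[i], vec);
                }
                Integer count = vec.get(tokens[j]);
                vec.put(tokens[j], count == null ? 1 : count + 1);
            }
        }

        // Steps 3-4: treat each feature multi-set as a sparse vector and
        // compute the cosine similarity between a pair of entries.
        System.out.println("sim(bus, house) = " + cosine(vectors.get("bus"), vectors.get("house")));

        // Step 5: rank all other entries by similarity to a base entry and
        // keep the top k as its neighbours.
        final String base = "bus";
        final Map<String, Map<String, Integer>> vecs = vectors;
        List<String> neighbours = new ArrayList<String>(vectors.keySet());
        neighbours.remove(base);
        Collections.sort(neighbours, new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(cosine(vecs.get(base), vecs.get(b)),
                                      cosine(vecs.get(base), vecs.get(a)));
            }
        });
        int k = 3;
        System.out.println("top " + k + " neighbours of " + base + ": "
                + neighbours.subList(0, Math.min(k, neighbours.size())));
    }

    // Cosine of the angle between two sparse frequency vectors.
    static double cosine(Map<String, Integer> u, Map<String, Integer> v) {
        double dot = 0, normU = 0, normV = 0;
        for (Map.Entry<String, Integer> e : u.entrySet()) {
            normU += e.getValue() * (double) e.getValue();
            Integer w = v.get(e.getKey());
            if (w != null) dot += e.getValue() * (double) w;
        }
        for (int w : v.values()) normV += w * (double) w;
        return dot / (Math.sqrt(normU) * Math.sqrt(normV));
    }
}
```

A real build replaces the toy sentence with a large corpus, which is why Byblo's pipeline counts, filters, and sorts on disk rather than holding everything in memory as this sketch does.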
The resultant thesaurus will have a highly semantic notion of similarity. The neighbours of a word are likely to be strongly related in terms of meaning and topic, but not at all in terms of syntax. For example, the top neighbours of happy could be pleased, impressed, satisfied, surprised, disappointed, thrilled, upset, and glad. Notice that while some of these - such as the adjectival sense of pleased - are good synonyms, others, such as disappointed, are antonyms.
The software is primarily distributed in source-code form. Binaries may be available sporadically, and on request. The source code can be acquired from the github repository: click the Downloads button and select a version to download an archive of the source code.
The project requires Java 6 to be installed on the system. It also requires a unix-like command line, such as Linux or Mac OS X; Windows users may be able to get it to work through Cygwin. The software is compiled using an Apache Ant script. In addition, the following Java libraries are required:
- Google Guava 10.0.1 -- A library containing numerous useful features and tools; mostly the kind of things that should have been included with Java in the first place.
- JCommander 1.19 -- A framework for parsing command line arguments in a very elegant way.
- Fastutil 6.4.1 -- A library of collections that handle primitive data types efficiently. It also includes improved implementations of most of the standard collections framework.
- Commons Logging 1.1.1 -- A very lightweight wrapper API that enables logging frameworks to be configured and "plugged in" at runtime.
- MLCL Lib 0.1.0 -- A collection of generic Java utilities and classes developed by the authors of Byblo, for use in this and other projects.
The following additional dependencies are optional:
- JUnit 4.10 -- Required for unit testing the project.
- Log4J 1.2 -- Can be used as a replacement for JDK 1.4 Logging, at the user's discretion. Simply place the log4j jar files in the libs directory before building, or in dist/libs at run-time.
All except JCommander are available in pre-compiled binary form; simply place the .jar files in the /libs/ directory. In the case of JCommander, you must first compile it using Maven.
For convenience, a script has been provided to automatically download and compile the dependencies. Simply run:
$ cd libs
$ ./download_libraries.sh
Compiling the software from a source distribution
First download the dependencies as described above. Then, to compile the software from the command line:
$ ant dist
This will compile the source code and create a new directory /dist/ containing the project jar file, along with a copy of the various required libraries.
This section describes how to build the project with Netbeans 7. First acquire the source code as described above.
- Download and compile all the library dependencies (see above), placing the jar files in the libs/ directory.
- Start Netbeans and select "File -> New Project" from the menu bar. Select Java / Java Project with Existing Sources and click Next.
- Enter the Project Name as "Byblo", and select the Project Folder as the location of the project source code. Click Next.
- To Source Package Folders click Add Folder and select src. To Test Package Folders click Add Folder and select test. Click Next, then Finish.
- Right click on Libraries in the Projects view, and select Add JAR/Folder. Select all .jar files in the libs directory and click Choose.
From here you can run the project by clicking Run -> Run Main Project from the menu bar, and selecting uk.ac.susx.mlcl.byblo.Byblo as the main class.
The byblo.sh script is designed to be the primary point of usage for the thesaurus-building software. It runs a complete build process, from frequency counting to K-Nearest-Neighbours, in a single pass, providing all the most commonly used functionality of the underlying software.
$ ./byblo.sh [<options>] [@<config>] <file>
Where the arguments are:
- <file> Input instances file containing entry/feature pairs.
- @<config> Options and input files can be read from a file specified directly after an '@' character. Options in this file should be specified exactly as they would be at the command line, and the file may contain additional @ references to other config files.
- <options> Any number of the option switches.
There are a large number of options. To view a complete list, enter ./byblo.sh --help or view the wiki page on Running the Software.
This project is supported by a TSB (Technology Strategy Board) grant, reference GCL-100934, and by the EPSRC Doctoral Training Account Scheme.
Special thanks to all members of the Machine Learning and Computational Linguistics Lab, Department of Informatics, University of Sussex, for all the helpful input.
To contribute to the project you should fork the git repository. First click the "Fork" button on github. Then open a console and type the following:
$ git clone git@github.com:[your-user-name]/Byblo.git
$ cd Byblo
$ git remote add upstream git@github.com:hamishmorgan/Byblo.git
$ git fetch upstream
If you have changes to contribute back to the main project, send me a pull request by clicking the "Pull Request" button in your fork of the repository. For a detailed description click here.
This software is distributed under the 3-clause BSD Licence.