
Latent Dirichlet Allocation (LDA)


Summary


Latent Dirichlet Allocation (LDA) is a topic model used to represent the topics "occurring" in data, which is usually a collection of documents.

Our LDA implementation can be used to develop a model based on NGO Project Summaries (or other text). The different "clusters", or topics, of the LDA model should be representative of the variety of types of NGOs.

Implementation

Background

Like most implementations of LDA, ours is unguided / unsupervised, which is to say that the model's topics are derived from the input data alone (i.e. NGO Project Summaries), without direct input or influence from the user.

We used Gensim's LDA Multicore Module for the creation of the model itself.

For an explanation of how LDA actually works, this article may be of use. However, please note that a deeper understanding of LDA is not crucial for understanding the remainder of this Wiki Page, as the actual building of the LDA Model is handled by the above Gensim Module. What we built primarily deals with handling everything before that final step, in addition to some final testing.

Specifics

We wrote our Unguided LDA implementation in the form of a Python class titled NGOUnguidedLDA, which can build an LDA model representing NGO topics in the following manner:

  • After initial pre-processing of the text of each NGO used in training data, a training_dict is created containing the most significant words appearing in the training data. The designation of "significant" words is not based on high frequency alone; a significant word also cannot appear in too much of the training data. Why? Because if a word were to appear in the majority of NGO summaries, that word serves little purpose in helping differentiate these NGOs and form clusters.
  • Next, a corpus is created by selecting, for each project, only the words that can be found in training_dict. As a result, we now have a list of the most significant words of our training data, separated by project. It's important to note that the user may choose (see the Usage section below) to use a tf-idf based corpus instead of the default Bag of Words corpus described above. For more information on tf-idf and its benefits, see the TF-IDF section below.
  • The corpus is then fed into Gensim's LDA Multicore Module, giving us the lda_model, complete with the number of clusters (topics) specified by the user. A sketch of these steps appears below this list.
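
To make these steps concrete, here is a minimal sketch using Gensim's public API (the parameter values are illustrative assumptions, not the defaults used by our class):

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# processed_projects: a list of token lists, one per project summary
training_dict = Dictionary(processed_projects)

# Drop words appearing in too much of the training data; keep the rest
training_dict.filter_extremes(no_above=0.5, keep_n=10000)

# Bag of Words corpus: only words present in training_dict survive
corpus = [training_dict.doc2bow(tokens) for tokens in processed_projects]

# Train the LDA model with a user-specified number of clusters (topics)
lda_model = LdaMulticore(corpus, num_topics=10, id2word=training_dict, workers=3)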

Testing functionality built into the NGOUnguidedLDA class is detailed in the Usage section.

TF-IDF

TF-IDF refers to term frequency–inverse document frequency, a statistic used to reflect a word's importance to the projects it belongs to.

TF stands for Term Frequency: how often a word occurs within a given document. IDF stands for Inverse Document Frequency, which weights "rare" words in the corpus; a word that rarely occurs across the corpus (increasing its significance to the documents it does belong to) receives a high IDF score. TF-IDF is the product of these two values.

The main benefit of using TF-IDF is that it gives extra weight to words that are uncommon in the overall corpus, increasing their significance for classifying the projects they belong to.
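
If the tf_idf option is enabled, the Bag of Words corpus is re-weighted before training. Gensim's TfidfModel is one straightforward way to do this (whether our class uses this exact call is an assumption based on the description above):

from gensim.models import TfidfModel

# Learn IDF weights from the Bag of Words corpus built earlier
tfidf = TfidfModel(corpus)

# Re-weight each document: score(t, d) = tf(t, d) * idf(t),
# so terms that are rare corpus-wide receive higher weights
corpus_tfidf = tfidf[corpus]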

Usage

NLTK

In order to use the NGOUnguidedLDA class, it may be necessary to use the NLTK Downloader to obtain "stopwords", "wordnet" (required by the WordNetLemmatizer), and other resources from the Natural Language Toolkit. For more information, please see the NLTK documentation on Installing NLTK data.
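
For example (the resource names are the standard NLTK identifiers):

import nltk

nltk.download("stopwords")  # stop word lists used during pre-processing
nltk.download("wordnet")    # corpus required by nltk's WordNetLemmatizer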

Input Data

The sample dataset we used with our LDA implementation uses project summaries provided by NGOs; it is important to note that this dataset is an example of clean text as opposed to the "scraped" text from websites often used with our classifier algorithms (see Conclusions and Findings for further explanation).

The data itself is in JSON format; see a snippet from the project summaries JSON below:

[
    ...
    {
        "theme": "Health",
        "_id": 38774,
        "summary": "Prenatal vitamins could be the most critically-needed medicine in resource-poor communities...",
        "title": "Prenatal Vitamins for Women in Resource-Poor Areas"
    }
    ...
]

As demonstrated in lda_main.py, sklearn's train_test_split() can be easily used to split the input data between training_data and testing_data.

import json
from sklearn.model_selection import train_test_split

with open("project_summaries.json", "r") as datafile:
    input_data = json.load(datafile)
training_data, testing_data = train_test_split(input_data, test_size=0.01)

Basic functionality

Our NGOUnguidedLDA object is created by passing in training_data (obtained in the manner shown above).

clusterer = NGOUnguidedLDA(training_data)

For detailed information on how to use the NGOUnguidedLDA class, please refer to this Jupyter Notebook. The functions are also listed below for convenience.

Please note that since the NGOUnguidedLDA class was written with the project_summaries JSON in mind as the input data, some keys may need to be changed in the functions in order for the code to run (e.g. the user may need to swap the "summary" key below for a "text" key, depending on the input data).

# Excerpt from process_projects(); detect() is a language detector
# (e.g. from the langdetect package) that returns codes like "en"
for project_dict in self.training_data:
    project_text = project_dict["summary"]
    # Keep only non-empty, English-language summaries
    if project_text and detect(project_text) == "en":
        processed_text = preprocess_text(project_text)
        self.processed_projects.append(processed_text)
return self.processed_projects

Storing to disk

At some point during its use, you may be interested in storing your NGOUnguidedLDA object to disk. This may be useful if you want to use any of the following at a later point:

  • The LDA model itself
  • The word corpus used to create the LDA model (representing the "important" words for each project)
  • The training dictionary used to create the word corpus (containing words deemed important from the input data)
  • The processed version of the input text used for all of the above.

This can be done with the following command, where "clusterer" is what you named your object and "UnguidedLDAClusterer.joblib" is the filename or path of the file in which the clusterer will be stored.

import joblib

joblib.dump(clusterer, "UnguidedLDAClusterer.joblib")

For more detailed information on joblib.dump, please see the JobLib Docs.
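
The stored object can later be restored with joblib.load:

import joblib

# Restores the NGOUnguidedLDA object, including its model, corpus, and dictionary
clusterer = joblib.load("UnguidedLDAClusterer.joblib")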

Functions

The NGOUnguidedLDA Class contains the following functions:

__init__(self, training_data): Creates our NGOUnguidedLDA object.

process_projects(self): Returns a "processed" (see preprocess_text below) version of the training data.

create_training_dict(self, max_proportion, num_keep): Creates a dictionary containing the most significant words appearing in the training data.

create_lda_model(self, num_topics, num_workers, tf_idf): Builds our word corpus, then creates and trains the LDA Model.

print_lda_topics(self): Prints out the top words for each topic in the LDA Model.

test_lda_model(self, testing_data, top_topics, words_per_topic): Tests LDA Model by printing out the most likely topics for each project in our testing data. "Scores" indicate the probability of a project being part of that topic.

The UnguidedLDA.py file also contains the function preprocess_text(text), which tokenizes, stems, and removes stop words from a string of text.
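
Putting these together, a typical session might look like the following (the argument values are illustrative assumptions, not recommended defaults):

clusterer = NGOUnguidedLDA(training_data)
clusterer.process_projects()                 # clean and tokenize the summaries
clusterer.create_training_dict(max_proportion=0.5, num_keep=10000)
clusterer.create_lda_model(num_topics=10, num_workers=3, tf_idf=False)
clusterer.print_lda_topics()                 # inspect the discovered topics
clusterer.test_lda_model(testing_data, top_topics=3, words_per_topic=5)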

Application

The primary application for this LDA implementation is "clustering" together new topics to better represent the variety of NGOs. This will be especially important if Global Giving decides to dramatically change the current categorization structure and/or add new topics (see New Categorization Schema).

Although the test_lda_model functionality does allow one to "assign" new projects (in the testing data) by matching them with the best-fitting LDA model topic, we received mixed results with this. In general, the core purpose of this LDA implementation is to help generate topics, not to assign projects to topics.

One important benefit of this LDA implementation is that, because it forms its model based solely on project/NGO text, it is not at all dependent upon the current topic structure or schema.

Findings & Conclusions

Necessity of Clean Text

As noted above, we used a Project Summaries dataset to test this LDA implementation, as our attempts at topic creation with LDA were much more successful when "clean" text (i.e. text that was directly written) was used, as opposed to text scraped from websites (which may contain HTML tags or other non-English tokens in the mix). Although this disparity may decrease organically with an improved text scraper, it seems fair to assume for now that clean text such as written Project Summaries is the best choice of training/testing data for our clustering methods, LDA or otherwise.

Possible Need for Guided / Semi-Supervised LDA

While this implementation of Unguided LDA shows promise, it didn't quite give us the results we were looking for, both in terms of topic creation by the LDA model and the "testing" of the model by matching new projects to topics (which is part of why we began to look into clustering methods such as Document Embeddings). It's possible that while an Unguided LDA implementation can be effective at categorizing items at a fairly high level (e.g. categorizing unknown newspaper articles into broad categories such as Science, Politics, Finance, etc.), it might not be well-suited for the more specific categorization we were trying to achieve (i.e. finding topics within the broader "topic" of NGOs), especially with a relatively limited set of data.

For this reason, we began to look into a Guided or Semi-Supervised version of LDA, primarily based on this article. Although we were unable to get a working implementation from it, this is definitely something interesting for Global Giving to look into in the future.