
Welcome

Welcome to the GlobalGiving-Depth wiki! This wiki offers an in-depth view of Hack4Impact's initiatives this semester in partnership with GlobalGiving. This page serves as an index, linking to subpages that describe each individual approach to the problem. Check the sidebar for quick links to the pages on this wiki.

Problem

With many potential contacts to make, GlobalGiving needs to be able to make an informed choice when reaching out to nonprofits to bring into its network. GlobalGiving's network consists of many organizations based in the US along with some nonprofits in other countries. However, the process of finding and applying to GlobalGiving remains significantly easier within the United States. In certain countries, factors including limited internet connectivity and lack of access to the documents required by GlobalGiving's vetting process have led to slower onboarding and discovery. As a result, GlobalGiving aims to use data science techniques to preemptively find and reach out to nonprofits around the world, streamlining the process of acquiring and benefitting more nonprofits outside the United States.

In late 2018, Hack4Impact provided GlobalGiving with a solution that obtained basic information about many new organizations not yet part of GlobalGiving's network. Now, GlobalGiving aims to fill out these records with more detailed information about the work these organizations do and for whom, in order to further streamline the process of benefitting them. Whereas last semester's problem was about discovering the breadth of organizations around the world, this semester is about depth.

Approaches

Along with the many algorithms we provide in this repo, we also spent some time developing a new categorization scheme that takes into account statistical trends in the data and implements mechanisms that enforce the consistency of those trends. This idea came about while trying to imagine an ideal categorization scheme for classifying new NGOs.


Classification:

Classification algorithms are one way to characterize the work of unknown, new NGOs. By feeding an NGO's summary text into a properly trained classifier model, you can obtain a set of categories that describes that NGO with some degree of accuracy.
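To make this concrete, here is a minimal sketch of that flow using a scikit-learn pipeline. The summaries and labels are placeholders standing in for a real training set, and the single-label setup is a simplification of the classifiers in this repo:

```python
# A minimal sketch: train a text classifier on NGO summaries, then
# predict a category for an unseen summary. Data is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

summaries = [
    "We provide clean drinking water to rural villages.",
    "Our afterschool program teaches children to read.",
    "We plant trees to restore degraded forest land.",
]
labels = ["Water", "Education", "Environment"]

# TF-IDF features feed a linear classifier trained with stochastic gradient descent.
model = make_pipeline(TfidfVectorizer(), SGDClassifier())
model.fit(summaries, labels)

# Classify a new, unknown NGO from its summary text.
print(model.predict(["We dig wells and install water filters."]))
```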


Clustering:

Clustering algorithms offer the possibility of generating new sets of categories, or a better understanding of the connections and similarities between NGOs. K-Means with Document Embeddings seems to be the most promising initiative in this category.

We attempted to design a Semi-Supervised LDA algorithm based on an article published online, but were unable to get the code to run. Here is the article for reference.


Data Processing/Visualization:

Most of the processing work involved obtaining data, seeing what it looked like, and getting it into a form we could analyze. Preprocessing such as TF-IDF scoring and count vectorizing is not included here, but stock scikit-learn preprocessors were used in many of the algorithms.
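For reference, a brief sketch of that stock preprocessing on placeholder texts (`get_feature_names_out` assumes a recent scikit-learn release):

```python
# Count vectorizing vs. TF-IDF scoring on two toy documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "clean water for rural communities",
    "education programs for rural children",
]

# Count vectorizing: raw term frequencies per document.
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# TF-IDF scoring: term frequencies reweighted by how rare each term is.
tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)
print(dict(zip(tfidf.get_feature_names_out(), scores.toarray()[0].round(2))))
```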

Conclusions

Classification

The general consensus among our team is that classification with the current data and category sets is not especially useful. Our classifiers yield a maximum F1 score of 0.67 in tests, which is not accurate enough to provide significant value. The SGD Classifier depends on a scikit-learn implementation of SGD that is already heavily optimized, so any improvements will have to come from the categorization scheme or from larger, cleaner datasets.

The SGD Classifier can be trained on any arbitrary set of labels, so its performance depends on how predictable an NGO's categories are given its summary text. Improvements in performance can therefore come from either cleaner text or more predictable, consistent categories. The Bag of Words classifier's performance, however, depends mostly on the word dictionaries provided to it, which can be built and rebuilt in many ways. It is hard to say which approach to building dictionaries is best; we built ours from the corpus of website text in each category's training set and removed common words. We also added words semantically similar to those in each category's corpus using the Datamuse API, which worked very effectively. Further study would be needed to determine whether better dictionaries could be generated. BOW is less adaptable and more difficult to work with, but it can outperform SGD with the right dictionaries. It should be noted that SGD can operate on other types of features, such as TF-IDF vectors or even document embeddings (which we haven't tried, but could be fruitful), whereas BOW cannot.
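As an illustration, here is one way such a dictionary could be built and then expanded through the Datamuse API. The corpus handling, stopword list, and helper names are simplified assumptions, not the exact code used in this repo:

```python
# Build a category dictionary from its training corpus, drop common
# words, then expand it with semantically similar words from Datamuse.
from collections import Counter
import requests

def similar_words(word, limit=5):
    """Fetch words Datamuse considers semantically similar ('means like')."""
    resp = requests.get("https://api.datamuse.com/words",
                        params={"ml": word, "max": limit})
    return [entry["word"] for entry in resp.json()]

def build_dictionary(corpus_texts, stopwords, top_n=50):
    # Count non-stopword terms across the category's training corpus.
    counts = Counter(
        word
        for text in corpus_texts
        for word in text.lower().split()
        if word not in stopwords
    )
    dictionary = {word for word, _ in counts.most_common(top_n)}
    # Expand the dictionary with semantic neighbors of each seed word.
    for word in list(dictionary):
        dictionary.update(similar_words(word))
    return dictionary
```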

Moving forward, we provide classifiers in this repository with the intent that they can be used on future datasets, whether cleaner than the project summaries dataset or carrying different label sets (such as the labels specified in the recategorization scheme).

Clustering

The clustering algorithms provided in this repository are useful for discovering which categories are logically necessary. We generally searched for ways to generate 'what' categories (those describing what an NGO does) that may not have been considered before. The most fruitful and novel approach was clustering document embeddings with K-Means, which we believe has potential for use in developing new categories. More detail about this process can be found on the Document Embeddings page, and approaches to the process can be seen in the document embeddings and centroid analysis Jupyter notebooks.
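A minimal sketch of the approach follows; it assumes gensim's Doc2Vec for the embeddings (the notebooks may use a different embedding model) and placeholder summaries:

```python
# Embed NGO summaries as vectors, then cluster the vectors with K-Means;
# each resulting cluster suggests a candidate 'what' category.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

summaries = [
    "we build schools and train teachers",
    "scholarships for girls in secondary school",
    "drilling wells for safe drinking water",
    "rainwater harvesting and sanitation systems",
]

# Embed each summary as a fixed-length document vector.
docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(summaries)]
embedder = Doc2Vec(docs, vector_size=32, min_count=1, epochs=50)
vectors = [embedder.dv[i] for i in range(len(summaries))]

# Cluster the embeddings into candidate categories.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)
```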

LDA is not particularly novel or useful, but it still generates topics of semantically similar words. We don't believe there is a great deal of insight to be gleaned from it, but it is simple and easy to use, so we included it in case a quick way to generate topics from a dataset is needed.
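For example, a quick topic run with scikit-learn's LDA implementation on placeholder texts:

```python
# Fit two LDA topics over toy documents and print the top words in each.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "clinics vaccines maternal health nurses",
    "teachers classrooms literacy tutoring",
    "hospital medicine doctors patients",
    "school books students reading",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:
    # Highest-weighted terms characterize the topic.
    print([terms[i] for i in topic.argsort()[-4:]])
```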

Data Processing/Visualization

The HTML parser we used to build some of our datasets has a few faults and could be improved. Recursive parsing by following links on a page proved to be a roadblock, as NGO websites come in innumerable formats with many edge cases. In most of our approaches we decided to forgo NGO website parsing in favor of cleaner datasets, like the project summaries dataset, that can be obtained through an API; further study and experimentation would be needed to develop an HTML parser that rivals the cleanliness of API-retrieved datasets.
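For context, a bare-bones sketch of this kind of recursive parsing with requests and BeautifulSoup; even this naive version hints at the edge cases (relative URLs, redirects, link loops, JavaScript navigation) a production parser would need to handle:

```python
# Naively collect visible text from a page and the pages it links to.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(url, depth=1, seen=None):
    """Gather page text recursively, up to a fixed link depth."""
    seen = seen or set()
    if depth < 0 or url in seen:
        return []
    seen.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=5).text, "html.parser")
    except requests.RequestException:
        return []
    texts = [soup.get_text(separator=" ", strip=True)]
    # Follow every anchor on the page, resolving relative URLs.
    for link in soup.find_all("a", href=True):
        texts += crawl(urljoin(url, link["href"]), depth - 1, seen)
    return texts
```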

Overall Conclusions

This was a very interesting and difficult problem to address: we did not find an easy way to uniformly characterize NGOs with a single algorithm. Our hope is that accurate classification can provide a starting point for characterizing new and unknown NGOs, and accurate classification starts with a predictable and consistent categorization scheme.

We anticipate that GlobalGiving could expand on this work by using the recategorization scheme proposal as a guideline for developing a new categorization scheme, should they deem it necessary. Developing and implementing such a scheme would most likely improve classification to the point where we could classify any unknown NGO, so long as we had its website URL. The exact threshold for acceptable classification accuracy is up to GlobalGiving, but we recommend an F1 score above 0.9 as a guideline for 'informative classification'.

Developing this scheme would be an involved process, as generating a new set of categories would be informed not only by the logical structure of the data as revealed by clustering, but also by the needs of GlobalGiving's users and NGOs. Balancing these constraints would require human labor and critical thought. It is our hope that the provided clustering algorithms can serve as tools to aid this process, though we are aware they will not solve it entirely.

Past Presentation Slides