This project seeks to identify the defining features of the American Frontier by detecting “frontier language” and tracking its persistence over time through natural language processing and machine learning. The main difficulties lie in two tasks: first, handling a dataset of considerable size (more than 1,200 lengthy text documents); second, the possibility that no natural language processing algorithm can analyze the data out of the box, in which case either the algorithm or the data would have to be heavily modified before any analysis could proceed.
- Jean P. Vazquez
- Kevin Chen
- Zhiwei Tang
Our data consisted of two historical datasets: Frederick Jackson Turner’s speeches from http://xroads.virginia.edu/~hyper/turner/ and a folder containing the histories of several hundred U.S. counties, organized by state, from www.dropbox.com/county-histories. To compare against these, we used two modern datasets from https://www.presidency.ucsb.edu/documents: presidential nominees’ nomination acceptance speeches and political party platforms. Several steps were required before the data could be used. First, all files were converted to .txt format; the county histories were already in this format, but all other files required conversion. Next, several preprocessing steps streamlined the reading process: all punctuation marks and special characters were filtered out. Lastly, each text was flattened so that no distinction remained between lines or sentences, only separate words.
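The preprocessing steps above can be sketched as follows. This is a minimal sketch, not the project’s actual pipeline; the function names and the exact character filter (keeping only letters and whitespace) are assumptions:

```python
import re
from pathlib import Path


def clean_text(text):
    """Strip punctuation and special characters, lowercase the text,
    and collapse line/sentence breaks so only separate words remain."""
    # Replace anything that is not a letter or whitespace with a space
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    # Splitting on whitespace erases line and sentence boundaries
    return cleaned.split()


def preprocess_file(path):
    """Apply the cleaning step to one .txt document on disk."""
    raw = Path(path).read_text(encoding="utf-8", errors="ignore")
    return clean_text(raw)
```

Running `clean_text` over each converted .txt file yields a flat word list per document, which is the form the later frequency analyses consume.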
The project was divided into two phases. The original first phase consisted of finding as many “frontier words” as possible within the historical datasets: words used much more frequently by “frontier-esque” speakers, which would then serve as the basis for the second phase, in which the change in “frontier-esque” behavior over time would be measured. However, several attempts to find these words with unsupervised machine learning algorithms met with extremely limited success. Consulting with Dr. Lapets about finding words tied to a theme such as “frontier” only confirmed what those results suggested: no machine learning or natural language processing algorithm would yield useful results unless the method was supervised and very heavily modified. Consequently, rather than continuing to search for words, the project shifted immediately to the second phase, using a list of “frontier words” supplied by the project head, Dr. Martin Fitzsbein. The second phase uses these “frontier words” to place candidates on a continuum according to how “frontier-esque” they are determined to be. To determine a candidate’s “frontierness,” three separate methods were applied across both historical datasets: term frequency analysis, tf-idf score analysis, and what was dubbed “near word association.”
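The first two scoring methods can be sketched as below. This is an illustrative sketch only: the frontier word list here is a made-up placeholder (the actual list came from Dr. Fitzsbein), and each document is assumed to already be a flat list of lowercase words.

```python
import math
from collections import Counter

# Placeholder list for illustration; the real list was provided by the project head
FRONTIER_WORDS = {"pioneer", "wilderness", "settler", "homestead"}


def term_frequency_score(words, frontier_words=FRONTIER_WORDS):
    """Fraction of a document's tokens that are frontier words."""
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in frontier_words) / len(words)


def tfidf_scores(docs, frontier_words=FRONTIER_WORDS):
    """Per-document sum of tf-idf weights restricted to frontier words.

    A document scores high when it uses frontier words that are
    rare across the rest of the corpus."""
    n = len(docs)
    # Document frequency: in how many documents each frontier word appears
    df = Counter()
    for doc in docs:
        for w in set(doc) & frontier_words:
            df[w] += 1
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        score = 0.0
        for w in frontier_words:
            if df[w]:
                tf = counts[w] / total
                idf = math.log(n / df[w])
                score += tf * idf
        scores.append(score)
    return scores
```

Sorting candidates by either score yields the continuum described above; the “near word association” method would additionally inspect the context windows around each frontier word.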