Frontier Culture and Modern Politics in the US

This project seeks to understand the defining features of the American Frontier by identifying “frontier language” and its persistence over time through natural language processing and machine learning. The main difficulties lie with two tasks: first, handling a dataset of this size (in excess of 1,200 lengthy text documents); second, the possibility that no natural language processing algorithm can analyze the data out of the box, in which case either the algorithm or the data would have to be heavily modified to obtain our analyses.

Team Members

  • Jean P. Vazquez
  • Kevin Chen
  • Zhiwei Tang

Dataset

Our data consisted of two historical datasets: Frederick Jackson Turner’s speeches from http://xroads.virginia.edu/~hyper/turner/, and a folder containing the histories of several hundred U.S. counties, organized by state, from www.dropbox.com/county-histories. These were compared against two modern datasets: presidential nominees’ nomination acceptance speeches and political party platforms, both from https://www.presidency.ucsb.edu/documents. Several steps were required before the data could be used. First, all files were converted to .txt format; the county histories were already in this format, but all other documents required conversion. Next, several preprocessing steps streamlined the reading process: all punctuation marks and special characters were filtered out. Finally, each text was flattened so that there was no distinction between lines or sentences, only separate words.
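The repository does not specify the exact preprocessing code, but a minimal sketch of the steps described above might look like the following. The function name, the folder name `county_histories`, and the character filter are illustrative assumptions, not the project’s actual implementation.

    import os
    import re

    def preprocess(path):
        """Read one .txt document and reduce it to a flat list of words.

        Mirrors the steps described above: punctuation and special characters
        are stripped, and line/sentence boundaries are discarded.
        (Names and the exact filter here are illustrative assumptions.)
        """
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()
        # Replace anything that is not a letter, digit, or whitespace.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        # Splitting on whitespace collapses line and sentence boundaries.
        return text.split()

    # Example: preprocess every county-history file in a (hypothetical) folder.
    corpus = {
        name: preprocess(os.path.join("county_histories", name))
        for name in os.listdir("county_histories")
        if name.endswith(".txt")
    }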

Approach

The project was divided into two phases. The original first phase consisted of finding as many “frontier words” as possible within the historical datasets — words used much more frequently by “frontier-esque” persons — which would then serve as the basis for the second phase, in which the change in “frontier-esque” behavior over time would be measured. However, several attempts to find these words with unsupervised machine learning algorithms met with extremely limited success. Consulting with Dr. Lapets about finding words given a theme such as “frontier” only confirmed what the earlier results suggested: no machine learning or natural language processing algorithm would yield useful results unless the method was supervised and very heavily modified. As a result, instead of attempting to discover the words, the project shifted directly to the second phase, using a list of “frontier words” provided by the project head, Dr. Martin Fitzsbein. The second phase uses these “frontier words” to place candidates on a continuum according to how “frontier-esque” they are determined to be. To determine the “frontierness” of a candidate, three separate methods were applied across both historical datasets: term frequency analysis, tf-idf score analysis, and what was dubbed “near word association” (a rough sketch of all three follows below).
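The project’s exact scoring code is in the repository and the Final Report; the sketch below only illustrates the three methods under stated assumptions. The frontier word list shown is a made-up subset, and the `near_word_association` function reflects one plausible reading of that name (frontier words co-occurring within a small window), not necessarily the project’s definition.

    import math
    from collections import Counter

    # Illustrative subset only; the real list came from Dr. Fitzsbein.
    FRONTIER_WORDS = {"frontier", "settler", "pioneer", "wilderness"}

    def term_frequency_score(words, frontier_words=FRONTIER_WORDS):
        """Fraction of a document's tokens that are frontier words."""
        counts = Counter(words)
        total = sum(counts.values())
        return sum(counts[w] for w in frontier_words) / total if total else 0.0

    def tfidf_score(words, corpus, frontier_words=FRONTIER_WORDS):
        """Sum of smoothed tf-idf weights of the frontier words in one document.

        `corpus` is a list of token lists; idf is computed over that corpus.
        """
        counts = Counter(words)
        doc_sets = [set(doc) for doc in corpus]
        score = 0.0
        for w in frontier_words:
            tf = counts[w] / len(words) if words else 0.0
            df = sum(1 for doc in doc_sets if w in doc)
            idf = math.log((1 + len(doc_sets)) / (1 + df)) + 1  # smoothed idf
            score += tf * idf
        return score

    def near_word_association(words, frontier_words=FRONTIER_WORDS, window=5):
        """One reading of "near word association": how often frontier words
        appear within `window` tokens of each other, normalized by length."""
        positions = [i for i, w in enumerate(words) if w in frontier_words]
        hits = 0
        for i, p in enumerate(positions):
            for q in positions[i + 1:]:
                if q - p <= window:
                    hits += 1
                else:
                    break
        return hits / len(words) if words else 0.0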

For more information on this project and the final results, refer to the Final Report PDF.
