In political discourse and geopolitical analysis, national leaders’ words hold profound significance, often serving as harbingers of pivotal historical moments. From impassioned rallying cries to calls for caution, presidential speeches preceding major conflicts encapsulate the multifaceted dynamics of decision-making at the apex of governance. This project aims to use deep learning techniques to decode the subtle nuances and underlying patterns of US presidential rhetoric that may signal US involvement in major wars. While accurate classification is desirable, we seek to take a step further and identify discriminative features between the two classes (i.e., interpretable learning).
Through an interdisciplinary fusion of machine learning and historical inquiry, we aspire to unearth insights into the predictive capacity of neural networks in discerning the preparatory rhetoric of US presidents preceding war. Indeed, as the venerable Prussian General and military theorist Carl von Clausewitz admonishes, “War is not merely an act of policy but a true political instrument, a continuation of political intercourse carried on with other means.”1
We aim to shed light on the interplay between the verbiage of national leaders and the inexorable currents of history that they set in motion. In addition to probing the efficacy of deep learning and natural language processing (NLP) while navigating the challenges inherent in the analysis of protracted textual corpora, we endeavor to examine how presidential rhetoric shapes, reflects and occasionally catalyzes the nation’s trajectory toward pivotal global events. We aim to gauge the impact of leaders’ orations on national decisions and international relations, furnishing novel insights and fresh perspectives on matters of global import.
Moreover, this interdisciplinary approach provides valuable tools for policymakers, historians, and the wider public. Deciphering the recurrent motifs within presidential addresses holds the potential to inform prognostication or influence forthcoming events, thereby exemplifying the enduring relevance of Clausewitzian principles in conjunction with contemporary technological innovations. In doing so, it bridges age-old theories with cutting-edge methodologies, fostering a more comprehensive understanding of how leaders adeptly frame their rhetoric to galvanize support for political endeavors. While strong accuracy matters for a classification task of this consequence, we also seek to make our model results interpretable: deep neural network classifiers are, to most, black boxes, so we apply interpretable learning techniques to shed light on how and why our models predict as they do.
The data for this project comes from Kaggle; the dataset’s author scraped the speeches from The Miller Center at the University of Virginia.2 We added a column to the dataset that represents our binary categorical response variable (War): an observation is encoded as 1 if the US entered a major war within one year of the president’s speech and 0 otherwise. We derived wars’ start dates from the US Congressional Research Service.3
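A minimal sketch of this labeling step, assuming a pandas DataFrame with a date column; the war start dates shown are an illustrative subset, not the full list we derive from the CRS report:

```python
import pandas as pd

# Illustrative subset of major-war start dates (see the CRS report for the full list we use)
war_start_dates = pd.to_datetime(["1801-05-10", "1812-06-18", "1846-04-25"])

speeches = pd.read_csv("speeches.csv", parse_dates=["date"])  # "date" column name assumed

def entered_war_within_year(speech_date):
    """Return 1 if a major war began within one year after the speech, else 0."""
    return int(any(speech_date <= start <= speech_date + pd.DateOffset(years=1)
                   for start in war_start_dates))

speeches["War"] = speeches["date"].apply(entered_war_within_year)
```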
We perform some slight cleaning and preprocessing to set up the data for modeling. First, we check for null values and find one missing transcript for a speech delivered by Thomas Jefferson on Nov. 8, 1808; we locate the transcript via the Miller Center and add it to the dataset. Next, because the first war we consider (the First Barbary War) started in 1801, we filter the dataset to speeches dated after 1800.
Several transcripts end with the president’s signature; we remove the signature text from the transcripts column given that the president is identifiable from the president column and that text is not important for our modeling purposes. The transcripts also contain instances of long integers and floating point numbers when a president describes various treasury and debt statistics, for example. We remove floating point numbers and integers from the transcripts. Additionally, we convert the transcripts to lowercase and remove punctuation.
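A sketch of the number, case, and punctuation cleanup (the regular expressions are illustrative, and signature removal is handled separately):

```python
import re
import string

def clean_transcript(text):
    """Lowercase a transcript, strip integers and floats, and remove punctuation."""
    text = text.lower()
    text = re.sub(r"\d+\.\d+|\d+", " ", text)                          # drop floats and integers
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                           # collapse extra whitespace

speeches["transcripts"] = speeches["transcripts"].apply(clean_transcript)
```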
After cleaning the data and adding our response variable, the dataset contains 964 observations and exhibits significant class imbalance. There are 883 observations classified as War = 0 and 81 classified as War = 1; roughly 92% of the speeches were not delivered within one year of the US entering a major war. We use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes, and, as the authors suggest, we combine SMOTE with random undersampling of the majority class.4 We combine these transformations into a single pipeline.
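The resampling pipeline can be built with the imbalanced-learn package; the sketch below assumes X holds the vectorized features and y the War labels, and the sampling ratios are illustrative rather than our exact settings:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Oversample the minority (War = 1) class with SMOTE, then randomly undersample
# the majority class, as suggested by Chawla et al.
resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
    ("undersample", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
])

X_resampled, y_resampled = resampler.fit_resample(X, y)
```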
With the classes relatively balanced and the text minimally cleaned, we now convert the text data into a format suitable for our modeling purposes.
We leverage a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to tokenize and vectorize the raw text data, converting the speeches into fixed-length vectors that we pass as inputs to our models.5 We experiment with various model architectures using a binary cross-entropy loss function, and we evaluate model performance across accuracy, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The models train on 80% of the data; we use half of the remaining 20% for validation and half for testing. We train each model for ten epochs using batches of size 32.
We battled with shape mismatches when trying to feed the vectorized representations into the BERT model because we stacked the predictor features before applying the resampling pipeline, so we set up a separate pipeline to transform the text data for BERT. In this second pipeline, we use the same approach as before, but we append the input IDs and attention masks to lists so they can be directly accessed during training and evaluation.
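A simplified sketch of this second pipeline using the Hugging Face tokenizer (the 512-token maximum length and padding settings are assumptions for illustration):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

input_ids, attention_masks = [], []
for text in speeches["transcripts"]:
    encoded = tokenizer.encode_plus(
        text,
        max_length=512,          # BERT's maximum input length
        truncation=True,
        padding="max_length",
        return_attention_mask=True,
    )
    input_ids.append(encoded["input_ids"])
    attention_masks.append(encoded["attention_mask"])
```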
The models we experiment with include:
- Multilayer Perceptron (MLP)
- Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM)
- LSTM with Attention
- BERT
This section describes our models, interpretable learning approaches, and results.
Our MLP consists of two dense hidden layers with ReLU activation followed by dropout regularization and an output layer with a sigmoid activation function. We apply L2 regularization of 0.01 to the kernel weights in all dense layers to prevent overfitting. When compiling the model, we use the Stochastic Gradient Descent optimizer with a learning rate of 0.001 and Nesterov momentum of 0.99.
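A sketch of this architecture in Keras; the hidden-layer widths, dropout rates, and 768-dimensional input are illustrative, while the regularization, activation, and optimizer settings follow the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

mlp = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(768,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid", kernel_regularizer=regularizers.l2(0.01)),
])

mlp.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.99, nesterov=True),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```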
The MLP performs relatively well; the training and validation accuracy steadily improve, for the most part, and surpass 0.7 by epoch ten, and the training and validation loss steadily decrease.
In our second model, we reshape the input data to include a timestep dimension before it’s fed into the LSTM layer, allowing the model to effectively capture temporal dependencies in the input data. With 128 units, the LSTM layer uses hyperbolic tangent activation, Glorot uniform kernel initialization, and orthogonal recurrent initialization, along with dropout of 0.1 and recurrent dropout of 0.1 for regularization. Next comes a densely connected layer of 64 units with ReLU activation, He normal initialization, and L2 regularization of 0.1. We add a dropout layer to apply further regularization and mitigate overfitting. Given that we’re performing binary classification, the final layer is a dense output layer with a sigmoid activation function. We apply L2 regularization to the kernel weights in both dense layers to further prevent overfitting. When compiling the model, we use the Adam optimizer with a learning rate of 0.001.
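A sketch of this model in Keras, assuming 768-dimensional BERT vectors reshaped to a single timestep (the post-dense dropout rate is illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

lstm_model = tf.keras.Sequential([
    layers.Reshape((1, 768), input_shape=(768,)),            # add a timestep dimension
    layers.LSTM(128, activation="tanh",
                kernel_initializer="glorot_uniform",
                recurrent_initializer="orthogonal",
                dropout=0.1, recurrent_dropout=0.1),
    layers.Dense(64, activation="relu",
                 kernel_initializer="he_normal",
                 kernel_regularizer=regularizers.l2(0.1)),
    layers.Dropout(0.3),                                      # illustrative rate
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(0.1)),
])

lstm_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                   loss="binary_crossentropy", metrics=["accuracy"])
```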
The RNN architecture with an LSTM layer performs better than the MLP; although the training and validation accuracy fluctuate somewhat, they steadily increase and reach over 0.9 by epoch ten. The training and validation loss steadily decrease across epochs.
This model architecture is the same as the previous model except that it includes a custom attention layer between the LSTM layer and the first dense layer that dynamically weighs the input sequence elements based on their importance. As with the second model, we use the Adam optimizer with a learning rate of 0.001.
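One way to implement such a layer is with a small additive (Bahdanau-style) attention module; the sketch below is a hedged illustration of the idea rather than our exact implementation, and it assumes the preceding LSTM layer returns its full output sequence (return_sequences=True) with a fixed number of timesteps:

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionLayer(layers.Layer):
    """Additive attention that weighs and pools the LSTM's output sequence."""

    def build(self, input_shape):
        # input_shape: (batch, timesteps, features); timesteps must be fixed here
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform", trainable=True)
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        scores = tf.nn.tanh(tf.matmul(inputs, self.W) + self.b)   # (batch, timesteps, 1)
        weights = tf.nn.softmax(scores, axis=1)                   # attention weights per timestep
        return tf.reduce_sum(inputs * weights, axis=1)            # weighted sum over timesteps
```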
Adding the attention layer seems to have improved performance compared to the previous two models. We observe the training and validation accuracy increasing steadily, except for a drop in validation accuracy in epoch nine. The training and validation loss decrease steadily and barely diverge.
The fourth model, fine-tuned on our dataset, utilizes self-attention mechanisms to process and analyze text segments in relation to their broader context within each speech. In training this model, we switch from the binary cross-entropy loss function to sparse categorical cross-entropy and use the Adam optimizer, specifying several hyperparameter values (a compilation sketch follows the list):
- learning rate = 3e-7
- β1 = 0.9
- β2 = 0.999
- ε = 1e-08
- clipnorm = 2.0
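A minimal sketch of this fine-tuning setup, assuming the Hugging Face TFBertForSequenceClassification model in TensorFlow:

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

bert_model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=3e-7, beta_1=0.9, beta_2=0.999, epsilon=1e-08, clipnorm=2.0),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```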
The fine-tuned BERT model performs well; the training and validation accuracy increase over the first few epochs but remain nearly constant thereafter, while the training and validation loss drop consistently.
The RNNs and BERT perform best on the training and validation sets, at least in terms of training and validation accuracy. BERT quickly reaches training and validation accuracy of over 0.9, while the RNNs take longer to get there. However, BERT takes much longer to train. Next, we use AUC-ROC and F1-Scores to compare model performance on the test dataset.
The RNNs and BERT all achieve AUC-ROC values over 0.9, although the LSTM with Attention has the highest value at 0.982, and the RNN with LSTM slightly outperforms the others in terms of F1-Score, achieving 0.929. The MLP achieves a comparatively decent AUC-ROC but a much lower F1-Score, indicating that it performs relatively poorly once false positives and false negatives are given heightened importance.
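These test-set metrics can be computed with scikit-learn; the sketch below assumes a model that outputs the probability of War = 1 for each test example:

```python
from sklearn.metrics import roc_auc_score, f1_score

probs = model.predict(X_test).ravel()        # predicted probability of War = 1
preds = (probs >= 0.5).astype(int)           # threshold at 0.5 for the F1-Score

print(f"AUC-ROC: {roc_auc_score(y_test, probs):.3f}")
print(f"F1-Score: {f1_score(y_test, preds):.3f}")
```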
While accurate classification is desirable, we also try to identify discriminative features between the two classes. We use three approaches to help interpret our models and results.
In our first approach, we extract and analyze the attention weights from our third model. To do so, we create another model using the Model class, specifying the same inputs as our third model but setting the output to that of the attention layer. This allows us to extract the attention weights, providing insights into how the attention mechanism weighs different parts of the input sequence.
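A sketch of this extraction step; the model and layer names are hypothetical, and the averaging over the attention layer's output is illustrative:

```python
from tensorflow.keras.models import Model

# Companion model whose output is the attention layer's output rather than the prediction
attention_extractor = Model(inputs=attention_model.input,
                            outputs=attention_model.get_layer("attention_layer").output)

attention_outputs = attention_extractor.predict(X_test)
mean_attention = attention_outputs.mean(axis=-1)   # one mean attention value per sequence
```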
The x-axis of the distribution plot above represents mean attention weights, which indicate the average importance that the attention mechanism assigned to different pieces of the input sequences. The y-axis represents the frequency of sequences with a particular mean attention weight. The plot allows us to compare the mean attention weight distributions between the two classes; we observe some overlap but reasonably clear separation between the distributions of mean attention weights for the two classes, suggesting that the attention mechanism effectively captures differences between the classes.
For our second approach, we use the Local Interpretable Model-agnostic Explanations (LIME) package to help explain predictions from our second model.6 Approximating our complex model via a local linear explanation model enables us to analyze and visualize the influence of individual features on prediction outcomes, helping identify key attributes that distinguish between classes and providing a basis for deeper analysis and justification of the model’s decisions.
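Because the model's inputs are BERT embedding dimensions rather than raw words, one way to apply LIME here is through its tabular explainer; a sketch, assuming lstm_model, X_train, and X_test refer to the second model and its vectorized data:

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def predict_proba(x):
    """Wrap the Keras model so LIME receives per-class probabilities."""
    p = lstm_model.predict(x).ravel()
    return np.column_stack([1 - p, p])

explainer = LimeTabularExplainer(X_train, mode="classification",
                                 class_names=["No War", "War"])
explanation = explainer.explain_instance(X_test[0], predict_proba, num_features=10)
explanation.show_in_notebook()
```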
The chart above shows the dimensions that contributed most to a single prediction from the model; the bars indicate magnitude and whether the feature influenced the model toward or away from a prediction of War = 1. Investigating local explanations can provide insight into whether or not the model’s decisions align with human decision-making.
The third way we add interpretability is by employing the SHapley Additive exPlanations (SHAP) package to visualize feature importance values from the second model.7 In contrast with LIME, SHAP values explain how features affect a model globally.
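A sketch of this step using SHAP's model-agnostic Kernel explainer (the background-sample size and number of explained instances are illustrative):

```python
import shap

# A small background sample keeps the Kernel SHAP estimates tractable
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(lambda x: lstm_model.predict(x).ravel(), background)

shap_values = explainer.shap_values(X_test[:25])
shap.summary_plot(shap_values, X_test[:25])   # global view of feature importance
```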
The visualization illustrates the most influential features SHAP identified for our second model, ranked by the largest mean magnitude associated with war predictions. By comparing SHAP with LIME, we observe that the key features influencing local predictions often differ significantly from those impacting global outcomes. This contrast highlights the unique insights each method brings to model interpretability.
Our experiments not only demonstrate the potential for deep learning techniques to reveal patterns in US presidential rhetoric but also hint at their predictive power in determining involvement in future wars. The diverse neural network architectures we constructed and the pre-trained BERT model we utilized show that gated RNNs and transformer-based architectures can accurately classify text inputs of varying lengths, even in the face of extensive raw texts.
Exciting avenues for future research in this area could include experimentation with more advanced transformer models for classification as well as different language encoding techniques, such as sub-word tokenization. These explorations hold the promise of further enhancing our understanding and application of deep learning in text analysis.
Given that presidential speech transcripts are often long, future research can experiment with emerging techniques to overcome the input sequence length limitation of powerful transformer-based models like BERT, whose self-attention mechanism can process a maximum of 512 tokens. Overcoming such limitations requires careful preprocessing; researchers have explored truncation and chunking strategies, among others.8 Other newer approaches, like BigBird and Longformer, use sparse attention mechanisms with larger maximum token limits, and others fine-tune BERT to work with longer text data, including ChunkBERT and BERT For Longer Texts (BELT).9,10 Future research on our topic of focus would benefit from experimenting with similar approaches and evaluating model performance when the inputs capture most, if not all, of the longer-form texts.
Research has shown that BERT-based gated approaches, which use a fully connected encoding unit and apply a gate mechanism to update state memory, are computationally inefficient given the quadratic time complexity of self-attention in long-text modeling. A recent paper proposes addressing these issues with what the authors call a Recurrent Attention Network (RAN).11 The RAN model uses positional multi-head self-attention on local windows for dependency extraction and employs a Global Perception Cell (GPC) vector to propagate information across windows, concatenating it with the tokens in subsequent windows. The GPC vector acts as a window-level contextual representation and maintains long-distance memory, enhancing local and global understanding. Additionally, a memory review mechanism allows the GPC vector from the last window to serve as a document-level representation for classification tasks. Thus, future research on our topic of interest might look to leverage similarly powerful transformer-based models while optimizing efficiency.
There is much room for improving interpretability in the field of deep learning generally, and more specifically in the context of these larger models. Researchers recently developed an approach called ProtoryNet, which makes predictions by finding the most similar prototype for each sentence in a sequence and feeding an RNN backbone the proximity of each sentence to its active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which the authors refer to as ‘prototype trajectories.’ These trajectories enable intuitive, fine-grained interpretation of the RNN model’s reasoning process.12 Future research in long-text modeling (among other topics) might try to leverage ProtoryNet and other emergent approaches to increase model explainability and shine some light on the ‘black box’ of these models’ decision-making.
If we were to spend more time and expand our analysis, we would leverage the Party variable to examine whether partisan differences exist in war rhetoric or whether any distinguishing patterns emerge when including party affiliation as an input feature. Another way we might augment our research is by including speeches from the leaders of many countries operating under differing government structures with varying degrees of openness. The context would change slightly in that, for the current study, we operate under the assumption that US presidents need to get buy-in from citizens and massage the national psyche to support their cause; otherwise, the president risks losing power. The same assumption would likely not hold, or at least would need to be adjusted, for authoritarian regimes. Expanding the dataset to include speeches from different countries with different forms of government may open interesting avenues for future research into national leaders’ rhetoric and accountability.
[1]: von Clausewitz, C. (1997). On War (J. J. Graham, Trans.). Wordsworth Editions.
[2]: Lilleberg, J. (2020). United States presidential speeches. Kaggle. https://www.kaggle.com/datasets/littleotter/united-states-presidential-speeches. Data scraped from The Miller Center at the University of Virginia, https://millercenter.org/the-presidency/presidential-speeches.
[3]: Barbara Salazar Torreon and Carly A. Miller, US Congressional Research Service. (2024). U.S. Periods of War and Dates of Recent Conflicts, available at https://sgp.fas.org/crs/natsec/RS21405.pdf.
[4]: Nitesh V. Chawla et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, pp. 321-357.
[5]: Jacob Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
[6]: Marco Tulio Ribeiro et al. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144.
[7]: Scott M. Lundberg and Su-In Lee. (2017). A Unified Approach to Interpreting Model Predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (pp. 4765-4774). Curran Associates, Inc.
[8]: Zican Dong et al. (2023). A Survey on Long Text Modeling with Transformers. arXiv 2302.14502v1. See also Park et al. (2022). Efficient Classification of Long Documents Using Transformers. arXiv 2203.11258v1.
[9]: Aman Jaiswal and Evangelos Milios. (2023). Breaking the Token Barrier: Chunking and Convolution for Efficient Longer Text Classification with BERT. arXiv 2310.20558v1.
[10]: Michal Brzozowski. (2023). Fine-tuning BERT model for arbitrarily long texts, Part 1. MIM AI. See also Michal Brzozowski. (2023). Fine-tuning BERT model for arbitrarily long texts, Part 2. MIM AI. For technical documentation, see Michal Brzozowski and Marek Wachnicki. (2023). Welcome to BELT (BERT For Longer Text)’s documentation. MIM AI.
[11]: Xianming Li et al. (2023). Recurrent Attention Networks for Long-text Modeling. Findings of the Association for Computational Linguistics: ACL 2023, pp. 3006-3019.
[12]: Dat Hong et al. (2023). ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. Journal of Machine Learning Research 24, pp. 1-39.
Python Module Files (helper functions, classes)
This Python module file includes the BertSequenceVectorizer class, which we designed to convert input text into vector representations using a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model.
- Features:
  - BERT-based Vectorization: Utilizes a pre-trained BERT model to generate vector representations of input text.
  - Tokenization: Employs the BERT tokenizer to tokenize input text before vectorization.
  - Customizable Sequence Length: Allows customization of the maximum length of input sequences for vectorization.
- Usage: Upon instantiation of the BertSequenceVectorizer object, the class automatically loads a pre-trained BERT model (bert-base-uncased by default) and its corresponding tokenizer, specifying the maximum length of input sequences for vectorization.
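A condensed sketch of what the class does (a PyTorch backend and [CLS]-token pooling are assumed here for illustration; see BertSeqVect.py for the actual implementation):

```python
import torch
from transformers import BertModel, BertTokenizer

class BertSequenceVectorizer:
    """Convert raw text into fixed-length vectors with a pre-trained BERT model."""

    def __init__(self, model_name="bert-base-uncased", max_len=512):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.max_len = max_len

    def vectorize(self, text):
        # Tokenize, run the text through BERT, and return the [CLS] embedding
        inputs = self.tokenizer(text, max_length=self.max_len, truncation=True,
                                padding="max_length", return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze().numpy()
```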
This Python module file contains a helper function for plotting model history (accuracy, validation accuracy, loss, and validation loss).
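A minimal sketch of such a helper, assuming a Keras History object with the standard accuracy and loss keys:

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training/validation accuracy and loss across epochs."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(history.history["accuracy"], label="train accuracy")
    ax1.plot(history.history["val_accuracy"], label="val accuracy")
    ax1.set_xlabel("epoch")
    ax1.legend()
    ax2.plot(history.history["loss"], label="train loss")
    ax2.plot(history.history["val_loss"], label="val loss")
    ax2.set_xlabel("epoch")
    ax2.legend()
    plt.show()
```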
Jupyter Notebooks
This Jupyter Notebook contains the code we used to clean the input data (speeches.csv) and set up the training, testing, and validation sets. In this notebook, we use the pre-trained BERT model and vectorizer (see BertSeqVect.py) to tokenize and vectorize the text data.
This Jupyter Notebook contains code and visualizations from our exploratory data analysis.
This Jupyter Notebook contains our code for the modeling experiments. We experiment with four models: (1) an MLP, (2) a gated RNN (LSTM), (3) the same RNN with an attention mechanism, and (4) a pre-trained BERT model. After developing these models, we use the third model to begin exploring various ways to perform interpretable learning and discern how the models differentiate the two classes.
Data Files
This file contains the cleaned data that we use for modeling.
This file contains the original source data.
This file contains the testing features (the vector representations of the input text).
This file contains the training features (the vector representations of the input text).
This file contains the validation features (the vector representations of the input text).
This file contains the testing labels (binary response variable 'War').
This file contains the training labels (binary response variable 'War').
This file contains the validation labels (binary response variable 'War').