Capitol Letters: Detecting Differential Attitudes in Legislator Communications Using NLP

This repo contains the code and figures created for my master's thesis, which examined legislator communication through one-minute speeches on the House floor and on Twitter.


It covers the end-to-end workflow for scraping one-minute speech data and analyzing it using gpt-3.5-turbo.

First, I provide code to collect all links to C-SPAN videos that mention a given phrase; in this analysis that phrase is "one-minute" (scrape 117th one minute speeches.ipynb). I also provide the code I used to isolate just the one-minute speeches from the transcript data, since the transcripts contain other House session communication (clean_transcript_data_11*.ipynb). The result of this cleaning can be found in the folder One Minute Congresional Speech Data. After cleaning the transcript data, the goal is to score these speeches for valence, arousal, confidence, empathy, and sympathy (VACES). This is done using OpenAI's gpt-3.5-turbo (openapi_getVACES_115.ipynb). Similar scoring is done for the 116th and 117th Congresses. The resulting scores are available in the VACES output directory, which contains each speech, the speaker, the speaker's party identification, and the speech's score for each VACES measure.

The goal of this project was to explore differential emotion among Members of Congress, so Kolmogorov-Smirnov (KS) testing is done on the score distributions by party to determine whether the distributions differ (ks-testing VAC_11*.ipynb). Figures of the distributions are also created and combined in make kde-figure-115-116-117.ipynb.

One could use this code to expand the work done here to analyze House Session transcripts on any topic.

Components of repo

  • scraping and cleaning code
    • scrape 117th one minute speeches.ipynb, a program which uses selenium to scrape the links to all C-SPAN House sessions which mention "one-minute", then scrapes the transcripts of those House sessions using the link to each C-SPAN video. Similar collection was done for the 115th and 116th Congresses.
    • clean_transcript_data_115.ipynb, code which isolates just the one-minute speeches from House session transcripts for the 115th Congress
    • clean_transcript_data_116.ipynb, code which isolates just the one-minute speeches from House session transcripts for the 116th Congress
    • clean_transcript_data_117.ipynb, code which isolates just the one-minute speeches from House session transcripts for the 117th Congress
  • One Minute Congresional Speech Data
    • cleaned_transcript_data_115th.csv, cleaned_transcript_data_116th.csv, cleaned_transcript_data_117th.csv, the resulting output from the cleaning pipeline above.
  • open ai code
    • openapi_getVACES_115.ipynb, code which scores one-minute speeches using gpt-3.5-turbo
  • VACES output
    • Contains the collected one-minute speeches for the 115th, 116th, and 117th Congresses along with their Valence, Arousal, Confidence, Empathy, and Sympathy scores, which were produced with OpenAI's gpt-3.5-turbo in openapi_getVACES_115.ipynb
  • figure and ks code
    • kde figures and ks testing
      • ks-testing VAC_115.ipynb, KS testing for the 115th Congress and figure generation
      • ks-testing VAC_116.ipynb, KS testing for the 116th Congress and figure generation
      • ks-testing VAC_117.ipynb, KS testing for the 117th Congress and figure generation
      • make kde-figure-115-116-117.ipynb, combines the figures into one.
    • sentiment figures
      • cleaning_and_sentiment_scoring_116.ipynb, cleans Tweets and scores sentiment for 116th Congress
      • cleaning_and_sentiment_scoring_117.ipynb, cleans Tweets and scores sentiment for 117th Congress

Twitter Analysis Replication

I can't provide the raw tweet data due to the Twitter Academic API usage agreement. Instead, I provide files which contain the date of each tweet, the tweet's sentiment score as scored by VADER, and the political party of the tweet's author. These are found in the folder CapitolLetters/twitter sentiment data: tweets_115.csv, tweets_116.csv, and tweets_117.csv. This is enough to replicate the figures found in my thesis using the following notebooks:

figure and ks code/sentiment figures/cleaning_and_sentiment_scoring_115.ipynb

figure and ks code/sentiment figures/cleaning_and_sentiment_scoring_116.ipynb

figure and ks code/sentiment figures/cleaning_and_sentiment_scoring_117.ipynb
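
As a rough illustration of how the shared files could be summarized, here is a minimal sketch that averages VADER sentiment by party and month. The column names (date, sentiment, party) and the sample rows are assumptions made up for this example; check the headers of the real tweets_11*.csv files before adapting it.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows in an ASSUMED layout; the real files may use other column names.
SAMPLE = """date,sentiment,party
2021-01-05,0.4,D
2021-01-20,0.2,D
2021-01-09,-0.1,R
2021-02-11,0.3,R
"""


def mean_sentiment_by_party_month(rows):
    """Average the sentiment score for each (party, YYYY-MM) pair."""
    buckets = defaultdict(list)
    for row in rows:
        month = row["date"][:7]  # keep only the YYYY-MM part of the date
        buckets[(row["party"], month)].append(float(row["sentiment"]))
    return {key: mean(values) for key, values in sorted(buckets.items())}


averages = mean_sentiment_by_party_month(csv.DictReader(io.StringIO(SAMPLE)))
for (party, month), value in averages.items():
    print(party, month, round(value, 3))
```

To run this on a real file, replace io.StringIO(SAMPLE) with an open file handle for one of the tweets CSVs.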

VACES Analysis Replication

scraping data

The data was scraped from the C-SPAN website, using selenium.

To capture the transcript you'll need a C-SPAN account (free to make). Then navigate to their API page and use the GET \mentions feature.

Then set the appropriate min date, max date, and search term ("one minute speeches" for my use case), as well as session type = 25 (House sessions). This will output JSON with links to all videos which mention the search term in the given date range. See scrape 117th one minute speeches.ipynb for code to capture these links and the corresponding transcripts.

Once you have the transcripts, the next step (for me) was to isolate the one-minute speech data; see clean_transcript_data_115.ipynb for an example.

The resulting cleaned one-minute speech data can be found in the folder One Minute Congresional Speech Data:

  • cleaned_transcript_data_115th.csv
  • cleaned_transcript_data_116th.csv
  • cleaned_transcript_data_117th.csv

using OpenAI to score data

From here, the one-minute speeches are scored using gpt-3.5-turbo; see openapi_getVACES_115.ipynb for a full example.

Most notably, you will need your own OpenAI API key to replicate this analysis.

Essentially, you need to pass the function the text you want to score, the definition of the emotion dimension, and the dimension's name.

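from openai import OpenAI  # assumes the openai>=1.0 client; the notebook may use an older interface

client = OpenAI()  # reads your key from the OPENAI_API_KEY environment variable

def label_text_using_gpt(text, defn, name):
    chat_completion = client.chat.completions.create(model="gpt-3.5-turbo", messages=[{
            "role": "system", "content": "You are a helpful assistant for labelling text data."
        }, {
            "role": "user", "content": f"The following is part of a congressional speech. Given this definition of '{name}': '{defn}' and this text: '{text}', provide a score for the '{name}' of the text between -1 and 1. Please give me ONLY a number between -1 and 1 as your response."
        }])
    return chat_completion.choices[0].message.content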

Example input:

valence = "the degree of pleasure/positivity i.e. valence is the positive--negative or pleasure--displeasure dimension, where high valence (1) indicates the content of text is all positive or pleasurable and low valence (-1) indicates the content of the text is not positive or pleasurable"

text_ex = "My name is Cara Nix, I like ice cream and cookies and Taylor Swift"

result = label_text_using_gpt(text_ex, valence, "valence")

The result is then a score for the valence of the text!
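
Because the model's reply comes back as text, it has to be converted to a number before analysis, and an occasional reply may not be a clean number. The following is a minimal sketch of one way to guard against that. The helper names parse_score and score_with_retries are hypothetical and are not taken from the notebooks; the retry count is also an assumption.

```python
import re

# Matches an optionally signed integer or decimal such as "1", "-0.35" or ".5".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_score(reply):
    """Return the first number in the reply as a float, or None if it is
    missing or falls outside the requested range of -1 to 1."""
    match = _NUMBER.search(reply)
    if match is None:
        return None
    value = float(match.group())
    return value if -1.0 <= value <= 1.0 else None


def score_with_retries(label_fn, text, defn, name, attempts=3):
    """Call label_fn(text, defn, name) until it yields a usable score.

    label_fn is any function that returns the model's reply as a string.
    Returns None if no usable score is obtained within `attempts` tries.
    """
    for _ in range(attempts):
        try:
            score = parse_score(label_fn(text, defn, name))
        except Exception:
            # Treat API errors the same as unusable replies and try again.
            continue
        if score is not None:
            return score
    return None
```

With the function above, this would be called as score_with_retries(label_text_using_gpt, text_ex, valence, "valence"). Passing the labelling function in as an argument also makes it easy to test the wrapper with a stand-in function that makes no API calls.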

The scores for valence, arousal, confidence, empathy, and sympathy for the one-minute speeches of the 115th-117th Congresses are available in VACES output/speeches_VACES_115.csv, VACES output/speeches_VAC_116.csv, VACES output/speeches_ES_116.csv, and VACES output/speeches_VACES_117.csv.

analyzing results

After passing in all the one-minute speeches and getting scores for valence, arousal, confidence, empathy, and sympathy for the 115th-117th Congresses, it is time to analyze these scores.

path to folder: figure and ks code/kde figures and ks testing

  • ks-testing VAC_115.ipynb
  • ks-testing VAC_116.ipynb
  • ks-testing VAC_117.ipynb

Then to combine these figures, use make kde-figure-115-116-117.ipynb.
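
For readers unfamiliar with the KS test, the two-sample statistic is the largest vertical gap between the two groups' empirical cumulative distributions. Below is a minimal pure-Python sketch of that statistic, written for illustration and not copied from the notebooks. The function name ks_statistic and the sample scores are made up. In practice a library routine such as scipy.stats.ks_2samp is the usual choice, since it also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up valence scores, for illustration only (not taken from the VACES output).
party_one = [-0.2, 0.1, 0.4, 0.6]
party_two = [0.3, 0.5, 0.7, 0.9]
print(ks_statistic(party_one, party_two))
```

A statistic near 0 means the two score distributions are nearly identical, while a value near 1 means they barely overlap.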


Contributors' names and contact info

ex. Cara Nix, [email protected]
