Wednesday, January 20, 2016

State of the Union Speeches and Data

I've done a couple of posts on the SOTU speeches.  In the past these dealt with word count, approval, and the vague notion that the applause the president receives is related to his approval rating at the time (a correlation that was in fact lower this year).

Wired had a good article highlighting the sentiment in the current and previous State of the Union (SOTU) speeches.  They went through the speeches from several past years, highlighted the events that occurred each year, and gave the frequency of the terms in each speech that communicated the impact of those events.  This post does not duplicate that article, but I saw its graph and wanted to see whether I would get similar sentiment scores for the speeches.  I used the 'syuzhet' library in R to conduct the analysis (big thanks to Matthew Jockers for the package).

The graph is similar to the one in the Wired article, but not identical.  The differences likely come from the smoothing involved and perhaps from a different sentiment analysis technique.  We do see a similar finding for the most recent SOTU speech:  it ended with the highest sentiment score of all the speeches.  Several of the speeches in my analysis curve upward toward the end, which fits the general idea of "ending on a positive note".  Additionally, one can see "valleys" of lower sentiment between the 50 and 75 time intervals.  This isn't too surprising, given that the same speechwriter is being used and that the SOTU perhaps follows a fairly standard sentiment form (another analysis perhaps?).
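To make the "time interval" axis and the smoothing a bit more concrete, here is a rough Python sketch of the general idea: score each sentence against a word list, stretch the scores onto 100 evenly spaced points, and smooth them.  This is not the 'syuzhet' implementation or the code behind the graph.  The tiny lexicon, the function names, and the moving-average window are all assumptions made up for illustration.

```python
import re

# Hypothetical toy lexicon; real sentiment lexicons hold thousands of entries.
TOY_LEXICON = {"strong": 1.0, "hope": 1.0, "great": 1.0,
               "fear": -1.0, "war": -1.0, "crisis": -1.0}


def sentence_scores(text, lexicon=TOY_LEXICON):
    """Split text into sentences and sum the lexicon values in each one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        sum(lexicon.get(word, 0.0) for word in re.findall(r"[a-z']+", s.lower()))
        for s in sentences
    ]


def resample(values, n_points=100):
    """Linearly interpolate values onto n_points evenly spaced positions,
    so speeches of different lengths share one 'time' axis."""
    if len(values) == 1:
        return [values[0]] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(values) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def moving_average(values, window=5):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


speech = "We face a crisis. We have hope. Our union is strong and great."
curve = moving_average(resample(sentence_scores(speech)))
```

With every speech on the same 100-point axis, the curves can be overlaid even though the speeches differ in length, and features such as a dip between points 50 and 75 become comparable across years.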

This same library has a function that assigns certain words to emotional categories.  The 10 categories include a positive/negative categorization.  Along with these, I added the applause count for each speech and the approval rating at the time of each speech.  The matrix below shows the correlation value for each pair of categories, with a corresponding color.  I also computed a p-value for each relationship; those >.1 were given bubbles.
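As a rough illustration of what word-to-category scoring looks like, here is a small Python sketch.  It is not the library's function.  The word list below is a made-up stand-in for a real emotion lexicon, and `emotion_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical toy mapping from words to emotion categories.
# A real emotion lexicon is far larger; these entries are illustrative only.
TOY_EMOTIONS = {
    "threat": ["fear", "anger", "negative"],
    "attack": ["fear", "anger", "negative"],
    "proud": ["joy", "trust", "positive"],
    "progress": ["anticipation", "positive"],
}


def emotion_counts(text, lexicon=TOY_EMOTIONS):
    """Count how many words in the text fall into each emotion category.
    A word can belong to several categories and is counted once in each."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for category in lexicon.get(word, []):
            counts[category] += 1
    return counts


counts = emotion_counts("We are proud of our progress despite the threat.")
```

Running this over each speech gives one row of category counts per year, which is the kind of table a correlation matrix can then be built from.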


There's a lot that could be said about the speeches here, but I'll only mention a few things that I thought were interesting.  The applause/approval rating correlation showed a weaker value than last year (-.5), which isn't too surprising since this is probably spurious anyway.  Negative word categorization and applause had a higher correlation than positive word categorization and applause.  That is, when comparing applause and negative word use across speeches, these counts varied in a similar way (applause count higher - negative word count higher and vice versa).  Speeches with words categorized as "anger" or "fear" had a weak correlation with the applause count.  Conversely, speeches with words categorized in emotions like "joy", "surprise", and "trust" show a stronger correlation with applause count in those same speeches.  So perhaps to get more applause in general, certain positive words are better than others?  Yoda's advice about fear would make sense here, in that words associated with fear tend to vary similarly to words associated with anger.
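For readers who want to see what "varying in a similar way" means numerically, here is a small Python sketch of a pairwise Pearson correlation with a p-value for each pair.  It is not the code behind the matrix above.  The p-value here comes from a permutation test, which I've swapped in because it needs nothing beyond the standard library, and the numbers in `toy_speeches` are made-up placeholders, not the real counts.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: the share of shuffles of y whose
    |correlation| with x is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def correlation_table(columns):
    """Return {(name_a, name_b): (r, p)} for every pair of columns."""
    names = list(columns)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            table[(a, b)] = (
                pearson(columns[a], columns[b]),
                permutation_p_value(columns[a], columns[b]),
            )
    return table


# Placeholder values for illustration only; NOT the real SOTU numbers.
toy_speeches = {
    "applause": [80, 95, 70, 88, 76, 90, 85],
    "negative": [40, 52, 35, 47, 38, 50, 44],
    "joy": [30, 28, 33, 25, 31, 27, 29],
}

for pair, (r, p) in correlation_table(toy_speeches).items():
    flag = "  (p > .1)" if p > 0.1 else ""
    print(pair, round(r, 2), round(p, 3), flag)
```

With only a handful of speeches, even a fairly large correlation can have a high p-value, which is one reason to flag the weak relationships rather than read much into them.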

We also see a decent amount of correlation among the more positive emotions as well as among the more negative emotions.  This refers back to the common "curve" that these speeches may have, in that the sentiment used year over year tends to be similar, or at least the emotional categorization of words follows similar patterns.

Thanks to Matthew Jockers, Taiyun Wei, and Hadley Wickham for their work on the 'syuzhet', 'corrplot', and 'ggplot' packages respectively.  Code for the above analysis is on my GitHub page.