A Computational Linguistic Analysis of the 2016 Presidential Candidates

Published: 01/13/2016

Last week, a video analyzing the language Donald Trump has been using went viral. It was quite interesting - highlighting the supposed differences in how Trump uses language compared to the other candidates. It talks about the number of syllables in a word, rearranging the words in a sentence, and a few other topics.

However, as it was a manual linguistic analysis, it only looked at a short, 100-word response to Jimmy Kimmel. This got me thinking about how all of the candidates use language ... especially on a larger scale. So, being a Computational Linguist, I decided to try and find out. Here's a larger scale analysis of the language used by the candidates across the 2016 primary field.

Minor Disclaimers:

"All models are wrong, but some are useful." - George Box

First off, the analysis here is computational linguistics (i.e. done by computers), and computers are not perfect at understanding human language, so take the results with a grain of salt. For instance, the syllable counter does not know whether 105 is pronounced "one-hundred-five" or "one-oh-five", so it labels all numbers as unrecognized. Likewise, the periodic sentence identifier had a 90% accuracy on the dataset it was published on, so we expect it to make some mistakes.

Second, this is just pointing out trends in the data. It is not meant to make a political statement one way or the other. Noting that a candidate uses language one way glosses over semantics, meaning, intent, etc. And ... who knows if this is part of their campaign strategy or not?

Data and Code:

I got the data from various parts of the web. First, I have the transcripts of the October 13th Democratic Debate and the October 28th Republican Debate. Since these cover different topics, I analyzed them separately. I also have the campaign launch announcement transcripts for most of the candidates. You can find the data and the code for all of this blog post on GitHub.

Imperatives:

In the video, it is pointed out that Trump uses a lot of imperatives such as "Look at what happened in ..." So, I decided to check whether he used more or fewer of them than the other candidates. This makes use of pattern.en from the University of Antwerp. Each sentence is labeled as "Indicative", "Imperative", "Conditional", or "Subjunctive". However, it turns out that the most any candidate used the subjunctive was twice, so I've removed that label from the analysis.
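As a rough illustration of the tallying step, here is a minimal sketch that turns per-sentence mood labels into percentages. It assumes the labels have already been produced by a mood classifier such as pattern.en; the function name `mood_shares` and the example labels are hypothetical and are not taken from the GitHub code.

```python
from collections import Counter

# Moods kept in the analysis; "subjunctive" is dropped because it was so rare.
KEPT_MOODS = ("indicative", "imperative", "conditional")


def mood_shares(labels):
    """Return the percentage of sentences per kept mood.

    `labels` is an iterable of mood strings, one per sentence. Labels that
    are not in KEPT_MOODS (e.g. "subjunctive") are ignored entirely.
    """
    counts = Counter(label.lower() for label in labels)
    total = sum(counts[m] for m in KEPT_MOODS)
    if total == 0:
        return {m: 0.0 for m in KEPT_MOODS}
    return {m: 100.0 * counts[m] / total for m in KEPT_MOODS}


# Hypothetical labels for one speaker (NOT real data from the transcripts).
example_labels = (
    ["indicative"] * 17 + ["imperative"] * 2 + ["conditional", "subjunctive"]
)
shares = mood_shares(example_labels)
```

Working in percentages rather than raw counts keeps a long speech from looking more imperative-heavy than a short one simply because it has more sentences.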

Imperative Republican Debate — In the Republican Debate, Trump actually used fewer Imperatives than many of the other candidates.

Looking at the graph, at least while debating, it does not seem that Trump is using more imperatives than anyone else. In fact, it looks like he is using significantly fewer than both Bush and Rubio. However, imperatives are still a small portion of the sentences overall, with no one above 15%. The same holds true for the Democratic Debate - almost all sentences are indicative.

Imperative Democratic Debate — In the Democratic Debate, the majority of the sentences were also indicative.

Perhaps, though, there is a difference between debating and delivering a speech that a candidate has had a lot of time to prepare. This does not really seem to be the case. The percentage of imperatives is about the same - often around 6%. Chafee actually uses the imperative the most in his announcement speech, at 14%. Trump's bar just looks bigger because he had a much longer speech.

Imperative Campaign — In a prepared speech, most candidates are using the indicative and again Trump is not the biggest user of the imperative ... he is just the longest winded.

So, overall, despite the analysis of the Jimmy Kimmel response, it does not look like Trump uses the imperative more than anyone else running for President.

Syllables:

Another interesting part of the video is the claim that Trump uses words with fewer syllables, with the implication that this is intentional. It is very difficult for a computer to recognize that he swapped "California" for "San Bernardino", but we can just look at the raw number of syllables in the words a candidate used. Fewer than one percent of all words from any candidate had 5 or more syllables, so I excluded them from the graphs. The unrecognized category comes from the fact that counting syllables is actually a non-trivial task for computers and sometimes, they just don't know. However, it is likely that unrecognized words have more syllables, so this may skew the data toward fewer syllables. Unrecognized words are slightly more common in the debate transcripts, likely since the candidates are talking off the top of their heads. This data was analyzed using NLTK with the CMU Pronunciation Dictionary.
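To make the counting concrete, here is a minimal sketch of syllable counting against a CMU-style pronunciation dictionary. In that format every vowel phoneme carries a stress digit (0, 1, or 2), so counting those phonemes counts syllables. The function names, the choice to use the first listed pronunciation, and the tiny stand-in dictionary are assumptions for illustration; this is not the code from the GitHub repository.

```python
from collections import Counter


def count_syllables(word, pron_dict):
    """Return the syllable count of `word`, or None if it is not in the dictionary."""
    prons = pron_dict.get(word.lower())
    if not prons:
        return None
    # Assumption: use the first listed pronunciation when there are several.
    return sum(1 for phoneme in prons[0] if phoneme[-1].isdigit())


def syllable_histogram(words, pron_dict):
    """Bucket words as "1", "2", "3", "4", "5+", or "unrecognized"."""
    hist = Counter()
    for word in words:
        n = count_syllables(word, pron_dict)
        if n is None:
            hist["unrecognized"] += 1
        elif n >= 5:
            hist["5+"] += 1
        else:
            hist[str(n)] += 1
    return hist


# Tiny stand-in dictionary in the CMU format, for illustration only.
# With NLTK installed and the corpus downloaded, the full dictionary is
# available as nltk.corpus.cmudict.dict().
TOY_DICT = {
    "look": [["L", "UH1", "K"]],
    "at": [["AE1", "T"]],
    "california": [["K", "AE2", "L", "AH0", "F", "AO1", "R", "N", "Y", "AH0"]],
}

hist = syllable_histogram(["Look", "at", "California", "105"], TOY_DICT)
```

Note that "105" lands in the unrecognized bucket, which is the behavior described in the disclaimers: the dictionary has no entry for numerals, so no guess is made.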

Here, the data do seem to support the claim that Trump uses fewer syllables than other candidates, but Christie gives him a run for his money. In the campaign announcements, over 3 out of 4 of Trump's words are one syllable, with only roughly 5% being 3 or more syllables. For one-syllable words, Christie is only one percentage point lower. On the other end of the spectrum, Chafee and Webb have only 3 out of 5 of their words being one syllable. More than 10% of their words are 3 or more syllables.

Syllables Campaign — Trump and Christie use fewer syllables while Chafee and Webb use more.

However, if you look at the debate transcripts, most of the differences disappear with the exception of Fiorina. Now, this could just be because they are all discussing the same topic and using similar words, or it could imply a specific word choice when writing speeches.

Syllables Republican Debate — Most of the differences in syllable usage disappeared, aside from Fiorina who actually did speak more than the average candidate.

Syllables Democratic Debate — The Democratic Debate was quite close.

The video does reference a recent study that applies the "Flesch–Kincaid" readability test to the campaign launch announcements. This study has also been widely shared. However, this test relies on two factors: words in a sentence and syllables in a word. Unsurprisingly, my results on syllable length mirror the results of that study since it is the same data and one of the two factors. However, it appears that a lot of the differences would disappear when looking at debate transcripts.
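For reference, here is a minimal sketch of the Flesch–Kincaid formula, assuming the study used the grade-level variant of the test (the post does not say which variant). The function name and the example counts are hypothetical.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts.

    The score depends on exactly two ratios: average words per sentence
    and average syllables per word.
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts (NOT taken from any candidate's speech).
grade = flesch_kincaid_grade(n_words=100, n_sentences=10, n_syllables=150)
```

Because syllables per word enters the score directly, a speaker who uses mostly one-syllable words will score lower even if sentence length is unchanged.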

Periodic Sentences:

The video also points out that Trump frequently rearranges a sentence so that the most important word comes last. Again, this is rather difficult for a computer to discern; however, work has been done on determining whether a sentence is "Periodic" or "Loose". A periodic sentence puts the main clause at the end of the sentence whereas a loose sentence puts it at the front. Loose sentences are more common, but historically, periodic sentences were not as rare. Though this does not quite capture the impact of putting the most important word at the end of the sentence, having the main clause at the end may be similar. To do this, I reimplemented Algorithm II from Section 3 of this paper from EMNLP, which is one of the major publication venues for computational linguistics. It is worth noting that the authors only had a 90% success rate with this method, so differences may be smaller than they appear here.

First off, looking at the Republican Debate, we see that Trump hardly uses periodic sentences, but the "Other" category is quite large - perhaps implying that his rearrangement of words confuses the algorithm completely. We can also see that Cruz uses periodic sentences more than anyone else in the debate.

Periodic Republican Debate — Cruz used the most periodic sentences in the Republican Debate.

Looking at the campaign announcements, we see that Trump again has the most "Other" sentences that are not classified by the algorithm and few periodic sentences. Here, Huckabee has the most periodic sentences with Christie and Rubio also having quite a few. For the Democrats, Sanders and Webb have more than the others, but it appears to be less common on average than among Republicans.

Periodic Campaign — Huckabee uses the most periodic sentences and Chafee the least.

Sentence Types:

Finally, I decided to look at the type of sentences candidates were using. Again, I used methods from the same EMNLP paper, but used Algorithm I to classify sentences as either "Simple", "Complex", "Compound", "Compound-Complex", or "Other". In debates and campaign speeches, "Simple" and "Complex" sentences are by far the most common, though you do see "Compound" and "Compound-Complex" sentences.
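For readers unfamiliar with these labels, here is a minimal sketch of the textbook definitions, which depend only on how many independent and dependent clauses a sentence contains. This is not Algorithm I from the paper: it assumes the clause counts are already known (in practice they would have to come from a parser), and the function name is hypothetical.

```python
def sentence_type(independent_clauses, dependent_clauses):
    """Map clause counts to a traditional sentence type.

    Simple:           one independent clause, no dependent clauses
    Complex:          one independent clause, one or more dependent clauses
    Compound:         two or more independent clauses, no dependent clauses
    Compound-Complex: two or more independent clauses, one or more dependent
    Other:            anything without an independent clause (e.g. a fragment)
    """
    if independent_clauses < 1 or dependent_clauses < 0:
        return "Other"
    if independent_clauses == 1:
        return "Simple" if dependent_clauses == 0 else "Complex"
    return "Compound" if dependent_clauses == 0 else "Compound-Complex"


# "We will win." has one independent clause and no dependent clause.
label = sentence_type(independent_clauses=1, dependent_clauses=0)
```

Under these definitions, fragments and other constructions without a recognizable main clause fall into "Other".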

In the campaign announcements, every candidate except Trump and Chafee used more "Complex" sentences than "Simple". However, Trump did this by a large margin. He uses "Simple" over "Complex" at a 1.5x rate while Chafee is at a 1.2x rate. Everyone else is below 1.0x.
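The "1.5x" figure is simply the number of "Simple" sentences divided by the number of "Complex" ones. A minimal sketch, with a hypothetical function name and made-up counts chosen only to reproduce a 1.5x ratio:

```python
from collections import Counter


def simple_to_complex_ratio(labels):
    """Return count("Simple") / count("Complex"), or None if there are no Complex sentences."""
    counts = Counter(labels)
    if counts["Complex"] == 0:
        return None
    return counts["Simple"] / counts["Complex"]


# Hypothetical labels (NOT real data): 30 Simple, 20 Complex, 5 Compound.
hypothetical_labels = ["Simple"] * 30 + ["Complex"] * 20 + ["Compound"] * 5
ratio = simple_to_complex_ratio(hypothetical_labels)
```

A ratio above 1.0 means the speaker used more "Simple" than "Complex" sentences; other sentence types do not affect it.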

Sentence Type Campaign — Trump and Chafee use "Simple" sentences way more often than the rest of the field.

In the Democratic Debate, the same thing holds. Only Chafee is using simple sentences more than complex.

Sentence Type democrat — Chafee uses "Simple" sentences the most.

In the Republican Debate, again we see that Trump uses simple sentences the most. However, we also see Kasich and Fiorina using them more than complex sentences. Unfortunately, I do not have their campaign announcements transcribed to compare overall.

Sentence Type Republican — Trump still uses "Simple" sentences the most in the Republican debate. However, Fiorina and Kasich also use them more than "Complex".

... and to Conclude?

Looking at a larger scale and at more candidates, we do see that Trump is using simpler sentence constructs and fewer syllables than the rest of the Presidential Primary Field. However, we do not see that he is using more imperatives or periodic sentences than other candidates. Overall, we do see that there are more similarities than differences among the candidates. We also see that Trump has more sentences that our algorithms classify as "Other" than everyone else - perhaps meaning that he is breaking our methods and differing from standard speech.

So, most candidates use speech quite similarly, but there is some validity to the viral video's claim that Trump uses language differently from everyone else.

Kenton Murray