
Troll Tweets and Trump's: How Similar are They?

About

With the Mueller Report released a few weeks ago, a lot of people have been talking about the ways in which Russia may have interfered with the 2016 US election. Amidst all the talk about collusion and obstruction, one facet that has gone largely overlooked is how Russia attempted to use social media to influence the election. One way Russia did so was through "troll factories", which produced incendiary tweets to polarize public sentiment stateside.

When I saw that FiveThirtyEight had written about these fake Russian tweets and compiled a dataset of them, I was curious to explore it. As you'll see in the wordclouds, some features of these tweets seemed reminiscent of Trump's tweets, so I decided to compare the two! For this project, I used this dataset of 200,000 Russian troll tweets and this dataset of Trump's tweets between mid-2015 and Election Day 2016, both from Kaggle.

In order to compare Trump's tweets to the Russian ones, I hope to answer the following questions:

What are the Trump and Russian tweets about? What words do they use most often?
When in the election cycle did they both tweet the most, and the least?
How does the sentiment of each vary over the election cycle?
Did the length of the tweets change over time?
Which words are the most "Trumpian", and which ones are the least?


Analysis

Cleaning the Data

Our first task was cleaning the data to get meaningful text for analysis. For each dataset, I started by selecting the columns that pertained to the analysis, namely the date tweeted, the hashtag, the time tweeted, and the tweet text.

From there, I noticed two key elements of the tweet text that needed cleaning: the presence of corrupted or non-English characters in the Russian tweets, and the presence of links. I figured that the text in links would not be germane to this analysis, so I removed both "https://..." and "t.co/..." URLs.
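
As a rough illustration of the link-removal step, here is a minimal Python sketch (the post's own analysis was done in R, and its exact pattern is not shown). The regular expression, the `strip_urls` helper, and the sample tweet are all assumptions made for this example.

```python
import re

# Matches "http://..." or "https://..." links as well as bare "t.co/..." short links.
# This pattern is an assumption; the original post does not show the one it used.
URL_PATTERN = re.compile(r"(?:https?://\S+|\bt\.co/\S+)")


def strip_urls(text: str) -> str:
    """Remove URLs from a tweet and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


# Invented sample tweet, not taken from either dataset.
cleaned = strip_urls("Big rally tonight! https://t.co/abc123 See you there t.co/xyz")
print(cleaned)
```

Collapsing whitespace after the substitution keeps the later character counts from being inflated by the gaps the links leave behind.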

From here, I began feature extraction--converting the timestamp into meaningful dates, and extracting the sentiment and tweet length.

I then used the syuzhet package to get a numeric sentiment "score" for each tweet, and used stringr to get the length of each tweet. Each of these columns was appended to its dataset, and the two datasets were then combined into a single data frame for analysis.

All analysis for this project was done in R.
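
To make the feature-extraction step concrete, here is a small Python sketch of the same idea (illustrative only, since the post itself used R with syuzhet and stringr). The four-word lexicon is a toy stand-in, not syuzhet's actual dictionary or algorithm, and the timestamp format, field names, and sample tweet are assumptions.

```python
from datetime import datetime

# Toy lexicon for illustration only; it is NOT syuzhet's dictionary.
TOY_LEXICON = {"great": 1.0, "win": 0.5, "bad": -0.75, "sad": -0.5}


def sentiment_score(text: str) -> float:
    """Sum the lexicon values of the words in a tweet (unknown words count as 0)."""
    words = (word.strip(".,!?:;\"'").lower() for word in text.split())
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


def extract_features(text: str, timestamp: str, author: str) -> dict:
    """Build one row of the combined table from an already cleaned tweet."""
    # Assumed timestamp layout; the real datasets may use a different one.
    when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "author": author,
        "month": when.strftime("%Y-%m"),
        "sentiment": sentiment_score(text),
        "length": len(text),
    }


# Invented sample tweet, not taken from either dataset.
row = extract_features("Great night, we will win!", "2016-10-05 20:15:00", "trump")
print(row)
```

Storing the month as a `YYYY-MM` string makes it easy to group tweets by month later on.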

What Do They Talk About?

So what do these tweets talk about anyway? As you'll see, on the face of it, the words used in the two can be awfully similar.

In fact, can you tell which is which? Scroll for the answer!

A closer look reveals that the figure to the left is Trump's and the one to the right is Russian.

Still, we see a number of commonalities in what they talk about. Beyond the obvious focus on Trump, "Hillary" features prominently in both, for instance. Trump's focus tends to be broader, though, covering topics like the news (Fox News and CNN), polling, and his signature "Make America Great Again" campaign slogan. Conversely, the Russian wordcloud places a larger emphasis on Obama than Trump's does. Interestingly, the Russian one also seems to put more emphasis on race, with words like "black" and "white" both featuring in it but not in Trump's.

Now, let's turn to understanding how these tweets have changed over time.

How Have These Tweets Changed Over Time?

Tweet Frequency

In order to explore how these tweets have evolved over time, we will look at three core metrics: the number of tweets, the sentiment of these tweets, and the number of characters in them.

Starting with the first one, we have a plot of Trump's tweets per month, as well as a plot of Trump's tweets and Russia's on the same axes.

Based on the above two plots, the most striking thing to me is the sheer volume of Russian tweets. Trump's tweets of roughly 400 per month (around 13 a day!) are barely visible in blue in the plot on the right, showing the Russian tweets relative to Trump's. Remember that this is on a subset of only 200,000 tweets, and FiveThirtyEight says that there are over 3 million fake Russian tweets!

Beyond the scale, we can see that Trump gradually tweeted less after a spike in October 2015, although there was a large jump in October 2016. It is important to note that the Trump dataset only runs until the 11th of November 2016, which is why the last bar in the left figure looks so low.

Unlike Trump, whose tweets decreased as Election Day approached, Russian tweets ramped up dramatically. The month with the most tweets was, unsurprisingly, October 2016, just before the election. Interestingly, though, the volume of these tweets stayed high into January 2017, when Trump was inaugurated as president.
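
The monthly counts behind plots like the ones discussed above can be tallied in a few lines. This is a Python sketch under assumed field names (`author`, `month`), with invented rows standing in for the real data.

```python
from collections import Counter


def tweets_per_month(rows):
    """Count tweets for each (author, month) pair."""
    return Counter((row["author"], row["month"]) for row in rows)


# Invented rows for illustration; real rows would come from the combined table.
rows = [
    {"author": "trump", "month": "2016-10"},
    {"author": "russian", "month": "2016-10"},
    {"author": "russian", "month": "2016-10"},
    {"author": "russian", "month": "2016-11"},
]
counts = tweets_per_month(rows)
for (author, month), n in sorted(counts.items()):
    print(author, month, n)
```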

Tweet Sentiment

With the numerical scores we obtained for sentiment, we now plot those over time for the months we have both Trump's and Russian tweets.

Unsurprisingly, as alluded to in the wordclouds, the Russian tweets tend to be far more negative than Trump's. In fact, while almost all of Trump's months had net positive sentiment scores, only a handful of months saw net positive Russian tweets.

What I find interesting in this figure is how the spike in Trump's sentiment around April 2016 broadly aligns with the Russian spike in May 2016, and how they both dip sharply in the subsequent months. While it's hard to pinpoint a precise reason for these spikes, April 2016 saw Trump take a large lead in the Republican primaries. Perhaps his success in the primaries and ultimate confirmation as the nominee contributed to the spike in sentiment. Similarly, events like Bernie Sanders' endorsement of Hillary Clinton in July and the Democratic National Convention could have contributed to Trump's distinctly harsher tone.

Tweet Length

Another facet I wanted to explore was tweet length. On average, who wrote more per tweet?

Interestingly, as the length of Russian tweets grew on average, Trump's tweets consistently decreased in length! An important caveat to note here is the fact that this character count excludes URLs, which were very prevalent in Russian tweets. On the whole, it's interesting to see how as Election Day approached, Trump went with shorter and fewer tweets, while the Russians did the opposite.

Do Hashtags Impact Tweet Content?

An interesting dimension of this dataset was the use of hashtags. One of the most distinctive parts of Trump's Twitter feed is the abundant #MakeAmericaGreatAgain hashtag. I was curious to see how Russian tweets used hashtags, and whether the presence of a hashtag impacted the content of a tweet. The graph below shows how the use of hashtags changed with time.

While Trump's hashtags were more iconic, it appears that the Russian tweets used far more hashtags in almost every month! Trump's use of hashtags crashed in May 2016 from 40% of tweets down to just over 20%, while the Russian use of hashtags fluctuated between 40% and 60% of all tweets. So, does the presence of a hashtag affect the substance of the tweet?

The graph above shows how sentiment varies based on both the tweeter and the presence of a hashtag. For the most part, there isn't a substantial month-to-month difference in sentiment between Russian tweets with and without hashtags. Strikingly, that isn't the case with Trump! Trump's tweets with a hashtag were overwhelmingly positive, with a sentiment score of around 0.5 on average. Without a hashtag, they were consistently lower, even going below the sentiment of Russian tweets in one month!

One possible explanation is that Trump's #MAGA tweets contain a sense of optimism and hope for the future, leading to a positive sentiment score.
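
The share of tweets containing a hashtag, per author and month, could be computed along these lines. This Python sketch detects hashtags from the tweet text with a simple regular expression, which is an assumption (the datasets also had a hashtag column); the field names and sample rows are invented.

```python
import re
from collections import defaultdict

# A "#" followed by word characters; a simplification of Twitter's real hashtag rules.
HASHTAG = re.compile(r"#\w+")


def hashtag_share(rows):
    """Return {(author, month): fraction of tweets that contain a hashtag}."""
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for row in rows:
        key = (row["author"], row["month"])
        totals[key] += 1
        if HASHTAG.search(row["text"]):
            tagged[key] += 1
    return {key: tagged[key] / totals[key] for key in totals}


# Invented rows for illustration only.
rows = [
    {"author": "trump", "month": "2016-05", "text": "Thank you Indiana! #MAGA"},
    {"author": "trump", "month": "2016-05", "text": "Heading to the rally now"},
    {"author": "russian", "month": "2016-05", "text": "#tcot #news breaking"},
]
shares = hashtag_share(rows)
print(shares)
```

The same boolean (hashtag present or not) can be used as a grouping key when averaging sentiment, which is how a comparison like the one above would be set up.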

How Can We Distinguish A Trump Tweet from a Russian Tweet?

Having explored features of the tweets like sentiment, length, and hashtags, we now arrive at the core question: how can we distinguish a Trump tweet from a fake Russian one, and how reliably can we do so using text analytics?

In order to do this, we first split our tweets into training and test datasets. I created a binary label marking each tweet as Trump's or non-Trump's, and then I built a Document-Term Matrix, logging the frequency of each word in each tweet. From here, I used a LASSO model to select the most important terms in predicting whether a tweet is Trump's or not. With these variables, I created a logistic regression model to give a probabilistic estimate of whether a tweet is Trump's or not.

From this model, we can conclude that a word with a highly positive coefficient is very "Trumpian", or characteristic of Trump, and a word with a highly negative coefficient is very "non-Trumpian", or characteristic of tweets that aren't Trump's. This is because the presence of a word with a highly positive coefficient increases the probability of a tweet being classified as Trump's.
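
To show the mechanics, here is a self-contained Python sketch that builds a tiny document-term matrix and fits an L1-penalised (LASSO-style) logistic regression by proximal gradient descent. Note the simplification: it folds term selection and model fitting into a single penalised fit rather than reproducing the two-stage LASSO-then-logistic approach described above. The tweets, penalty, step size, and function names are all invented for illustration.

```python
import math
from collections import Counter


def build_dtm(tweets):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in tweet i."""
    tokenized = [tweet.lower().split() for tweet in tweets]
    vocab = sorted({word for words in tokenized for word in words})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for words in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(words).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, cutoff):
    """Shrink value toward zero by cutoff; weak terms end up exactly at zero."""
    if value > cutoff:
        return value - cutoff
    if value < -cutoff:
        return value + cutoff
    return 0.0


def fit_l1_logistic(matrix, labels, penalty=0.05, step=0.2, iterations=500):
    """Proximal gradient descent on the L1-penalised logistic loss."""
    n, p = len(matrix), len(matrix[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(iterations):
        # Prediction error (predicted probability minus label) for every tweet.
        errors = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - y
            for row, y in zip(matrix, labels)
        ]
        intercept -= step * sum(errors) / n  # the intercept is not penalised
        for j in range(p):
            gradient = sum(e * row[j] for e, row in zip(errors, matrix)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient, step * penalty)
    return intercept, weights


# Invented toy tweets: label 1 stands for "Trump", 0 for "not Trump".
tweets = [
    "great rally tonight",
    "great poll numbers",
    "tcot media tonight",
    "tcot black news",
]
labels = [1, 1, 0, 0]
vocab, dtm = build_dtm(tweets)
intercept, weights = fit_l1_logistic(dtm, labels)
coefficients = dict(zip(vocab, weights))
for word, weight in sorted(coefficients.items(), key=lambda item: item[1]):
    print(f"{word:>8} {weight:+.3f}")
```

In this toy data, "tonight" appears once in each class, so it carries no signal and the penalty keeps its coefficient at zero, while "great" and "tcot" end up with coefficients of opposite sign.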
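
For reference, the standard logistic regression model behind this interpretation can be written as follows, where $x_j$ is the count of word $j$ in a tweet and $\beta_j$ is that word's coefficient (the symbols are introduced here for illustration and are not taken from the original post):

```latex
P(\text{Trump} \mid x) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j} \beta_j x_j\right)\right)}
```

Each additional occurrence of word $j$ multiplies the odds that the tweet is Trump's by $e^{\beta_j}$, so a positive $\beta_j$ pushes the probability up and a negative one pushes it down.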

So, which words are most "Trumpian"? Below are wordclouds of the most "Trumpian" and "non-Trumpian" words.

The figures above show which words are most and least Trumpian, respectively. We see that, unsurprisingly, positive, grand words we commonly associate with Trump's discourse, like "great" and "big", appear to be the best indicators of whether a tweet is by Trump. He also seemed to focus more on polling, something which was largely omitted by Russian trolls.

Turning to the non-Trumpian words, we see the term #tcot jump out. A little Googling tells us that it's an abbreviation for "top conservatives on Twitter", a way that conservative Twitter users identify themselves. More tellingly, the word "black" also stands out. Based on FiveThirtyEight's article on these tweets, a lot of them were fake Black Lives Matter posts, meant to stir up controversy and promote racial divisions. Thus, it makes sense that this word features so prominently.

How Well Can This Model Classify Trump's Tweets?

Quite simply put, really well!

In order to measure how well this model works, we applied it to our test dataset. We had a misclassification rate of around 15%, indicating that the model classified a tweet correctly 85% of the time. In order to more robustly measure the efficacy of our model, we used a ROC curve, which plots the model's true positive rate against its false positive rate across classification thresholds. Our ROC curve using testing data is shown below.

Our model gives us an Area Under the Curve, or AUC, of 0.855, which is far better than the 0.50 baseline of random guessing. Overall, this tells us that this model is very effective in predicting whether a tweet is Trump's or not.
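
To make these metrics concrete, here is a small Python sketch that computes the misclassification rate, one point on the ROC curve, and the AUC from predicted probabilities. The labels and scores are invented toy values, not the model's actual output, and the function names are hypothetical.

```python
def misclassification_rate(labels, scores, threshold=0.5):
    """Fraction of tweets whose thresholded prediction disagrees with the label."""
    wrong = sum((score >= threshold) != bool(label)
                for label, score in zip(labels, scores))
    return wrong / len(labels)


def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at a single threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    true_positive_rate = sum(s >= threshold for s in positives) / len(positives)
    false_positive_rate = sum(s >= threshold for s in negatives) / len(negatives)
    return false_positive_rate, true_positive_rate


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented toy values: label 1 stands for "Trump", scores are predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
error = misclassification_rate(labels, scores)
point = roc_point(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(round(error, 3), point, round(area, 3))
```

Sweeping the threshold from 1 down to 0 and plotting each `roc_point` traces out the ROC curve; the pairwise formulation used in `auc` is equivalent to the area under that curve.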

Conclusions

This project has explored a range of topics surrounding troll Russian tweets and how they compare to President Trump's in the year before the election. Here's what I found:

Russian tweets outnumber Trump's many times over, and the number of Russian tweets spiked towards Election Day 2016
Trump's tweets were more positive than the Russian ones, and the sentiments had similar peaks and troughs
Trump's tweets were longer than the Russian ones, but got shorter as Election Day approached. The Russian tweets got longer.
Hashtags matter with Trump! While Russian tweets' sentiment didn't vary very much whether there was a hashtag or not, Trump tweets with a hashtag were substantially more positive
We can reliably distinguish Trump's tweets from fake Russian ones using LASSO and logistic regression
Some of the most "Trumpian" words are "great", "big" and "very", while the least Trumpian word was #tcot, which stands for Top Conservatives on Twitter, a common hashtag used by conservatives.
Our model was able to classify Trump's tweets with 85% accuracy, and has an AUC of 0.855 on the testing data.

Thanks for reading! Title photo attribution: https://www.wired.com/story/how-americans-wound-up-on-twitters-list-of-russian-bots/
