R Text Mining Tutorial: Exploratory Analysis

This tutorial was created to help students complete task 2 of the Coursera SwiftKey Data Science Capstone: Exploratory Analysis. After completing this tutorial, you will know how to:

  • Prepare data for analysis.
  • Complete common EDA tasks such as counting the frequency of words and of word combinations.

Load Data

For this tutorial, we will analyze tweets from Donald Trump, which can be downloaded from Kaggle. Like the Coursera assignment, the data is saved as a CSV file. Because it contains column names, we will use the function read.csv() to import it and save it to a data frame. The text data provided for the assignment does not have column names, so I’d recommend using the readLines() function to import it as a character vector.

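#load data (keep text columns as character strings, not factors)
text <- read.csv("clinton-trump-tweets/tweets.csv", stringsAsFactors = FALSE)
#drop unnecessary columns
text <- text[, c("id", "handle", "text")]
#keep only Trump's tweets (assumes his handle is stored as "realDonaldTrump")
tweets.trump <- text[text$handle == "realDonaldTrump", ]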
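
For the assignment's plain-text files, importing with readLines() could look something like the sketch below. The temporary file and its two lines are made up so the example runs anywhere; swap in the path to your own file. The encoding and skipNul settings are my assumptions about what is usually helpful for messy text, not requirements of the assignment.

```r
# Write a tiny stand-in file (hypothetical contents, for illustration only)
sample.file <- tempfile(fileext = ".txt")
writeLines(c("First line of text.", "Second line of text."), sample.file)

# Open a connection, read every line into a character vector, then close it
con <- file(sample.file, open = "r")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Each element of the vector is one line of the file
length(lines)  # 2
```

Closing the connection yourself avoids warnings about unused connections when you read several large files in a row.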

Word Frequencies

A common first step in exploratory analysis is to identify the top words in your body of text. There are a few different methods for doing this; a popular one is the freq_terms() function from the qdap package. Filler words like “the” and “a” don’t provide us any insight, so we remove them with the stopwords argument, passing it the list of words returned by the stopwords() function from the tm package. We also add some of our own words that we don’t want included in the word count.

library(qdap)
library(tm)
#count the 100 most frequent words, skipping stop words
word.freq <- freq_terms(text.var = tweets.trump$text, top = 100, at.least = 1,
                        #words to exclude
                        stopwords = c(stopwords("english"), "trump", "hillary", "will",
                                      "donald", "realdonaldtrump", "amp", "httpstco"))
#plot the ten most frequent words
plot(word.freq[1:10,])

[Figure: plot of the ten most frequent words]
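
If you want to see what is going on under the hood, or can't install qdap, the same counts can be approximated in base R. This is only a rough sketch, not what freq_terms() does internally: the sample tweets and the short stop word list are made up, and the cleaning is deliberately crude. The second half extends the idea to pairs of adjacent words (bigrams), which is one way to count combinations of words.

```r
# Made-up sample tweets standing in for tweets.trump$text
sample.tweets <- c("Make America great again",
                   "We will make America safe again",
                   "Thank you America")

# Lower-case, keep only letters and spaces, then split each tweet into words
clean <- gsub("[^a-z ]", "", tolower(sample.tweets))
tokens <- lapply(strsplit(clean, " +"), function(w) w[w != ""])

# Single words: drop a small hand-picked stop word list, then count and sort
stop.words <- c("we", "will", "you")
words <- unlist(tokens)
words <- words[!words %in% stop.words]
word.counts <- sort(table(words), decreasing = TRUE)

# Word pairs: built per tweet so that a pair never spans two tweets
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
bigram.counts <- sort(table(bigrams), decreasing = TRUE)

head(word.counts, 3)
head(bigram.counts, 3)
```

In this toy example "america" is the top word and "make america" is the top pair. On real tweets you would want more careful cleaning (links, mentions, emoji) before trusting the counts.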

A nice way to visualize word frequency is with a word cloud. Based on the visual below, can you derive any insights, such as which candidates or issues Trump focused on the most?

library(wordcloud)
#draw up to 75 words, sized by frequency
wordcloud(words = word.freq$WORD, freq = word.freq$FREQ,
          max.words = 75, colors = "blue")

[Figure: word cloud of the most frequent words]
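
Word clouds place words at random, so the picture changes every time you run the code. Here is a small sketch of how you might make the layout repeatable. The data frame below is made up and only mimics the WORD and FREQ columns used above; the seed value is arbitrary.

```r
library(wordcloud)

# Hypothetical frequencies, shaped like the WORD / FREQ columns used above
demo.freq <- data.frame(WORD = c("america", "great", "jobs", "again"),
                        FREQ = c(12, 9, 5, 4),
                        stringsAsFactors = FALSE)

# Fix the random seed so the layout is the same on every run
set.seed(42)

# random.order = FALSE plots the most frequent words first, near the center
wordcloud(words = demo.freq$WORD, freq = demo.freq$FREQ,
          min.freq = 1, max.words = 75,
          random.order = FALSE, colors = "blue")
```

Setting min.freq = 1 matters for small samples, because the default hides words that appear fewer than three times.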

tm Package

If you plan on using functions from the tm package to process your data, you will first have to turn your text data into a corpus object. A corpus is essentially a collection of documents and is “the main structure for managing documents in tm” (Feinerer). The `VectorSource()` function is used first to treat each element of the vector as a separate document.

#save tweets to vector
vector <- as.vector(tweets.trump$text)

#save vector as documents
corpus.source <- VectorSource(vector)

#save as a corpus object (collection of documents)
corpus <- VCorpus(corpus.source)
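
Once you have a corpus, it is worth checking that it holds what you expect before moving on. The sketch below uses two made-up tweets instead of the real data, and the two cleaning steps (lower-casing and removing punctuation) are just common choices I am assuming here, not a required recipe.

```r
library(tm)

# Made-up tweets standing in for tweets.trump$text
sample.tweets <- c("Make America GREAT again!", "Thank you, Ohio!")

# Build the corpus: one document per element of the vector
corpus <- VCorpus(VectorSource(sample.tweets))
length(corpus)        # 2

# Look at the raw text of the first document
content(corpus[[1]])

# Apply cleaning steps to every document with tm_map()
corpus.clean <- tm_map(corpus, content_transformer(tolower))
corpus.clean <- tm_map(corpus.clean, removePunctuation)
content(corpus.clean[[1]])  # "make america great again"
```

Base R functions such as tolower() need to be wrapped in content_transformer() so that tm_map() returns proper documents, while tm's own helpers like removePunctuation can be passed directly.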

Rest of tutorial coming soon…

Extra Resources

Peer Review / Tutoring: stuck on this assignment? Send me your code at hdykiel@gmail.com with a note explaining what you are having trouble with and I will try to help you debug it and understand what is going on. Donations welcomed.

DataCamp: look for their “Intro to Text Mining: Bag of Words” class. $25 a month gets you access to all their classes, which take you through concepts step by step. It’s not as project-based as Coursera, but the skills you learn will definitely help you with Coursera projects. I’ve found it helpful to take notes as I work through the exercises, so that I have something to refer to if I decide to cancel the membership.

Please post any other training materials you found helpful in the comment section and I will update the list!
