This tutorial was created to help students complete task 2 of the Coursera SwiftKey Data Science Capstone: Exploratory Analysis. After completing this tutorial, you will know how to:
- Prepare data for analysis.
- Complete common EDA tasks such as counting the frequency of words and word combinations.
For this tutorial, we will analyze tweets from Donald Trump, which can be downloaded from Kaggle. Unlike the Coursera assignment data, this dataset is saved as a CSV file with column names, so we will use the read.csv() function to import it into a data frame. The text data provided for the assignment has no column names and is plain text with one document per line, so I'd recommend importing it with the readLines() function as a character vector.
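For the assignment files, a minimal readLines() sketch might look like the following. To keep the sketch self-contained, it writes a small stand-in file first; in the assignment you would point readLines() at the real file path instead.

```r
# Stand-in file so the sketch runs on its own; for the assignment,
# pass the path of the provided text file to readLines() instead.
sample_file <- tempfile(fileext = ".txt")
writeLines(c("first line of text", "second line of text"), sample_file)

# readLines() returns a character vector with one element per line.
# skipNul = TRUE avoids warnings from embedded nul characters,
# which raw text dumps sometimes contain.
text <- readLines(sample_file, encoding = "UTF-8", skipNul = TRUE)
length(text)  # one element per line of the file
```

Each element of the resulting vector is one document (one line of the file), which is exactly the shape the corpus-building steps later in this tutorial expect.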
# load data
tweets.trump <- read.csv("clinton-trump-tweets/tweets.csv", stringsAsFactors = FALSE)
# drop unnecessary columns
tweets.trump <- tweets.trump[, c("id", "handle", "text")]
# the file contains both candidates' tweets, so keep only Trump's
tweets.trump <- tweets.trump[tweets.trump$handle == "realDonaldTrump", ]
A common first exploratory analysis step is to identify the top words in your body of text. There are a few different methods for doing this; a popular one is the freq_terms() function from the qdap package. Filler words like "the" and "a" that don't provide any insight can be removed with the stopwords argument, using the list of words returned by the stopwords() function from the tm package. We also add some words of our own that we don't want included in the word count.
library(qdap)
library(tm)

word.freq <- freq_terms(text.var = tweets.trump$text,
                        top = 100,
                        at.least = 1,
                        # words to exclude
                        stopwords = c(stopwords("english"),
                                      "trump", "hillary", "will", "donald",
                                      "realdonaldtrump", "amp", "httpstco"))
plot(word.freq[1:10, ])
A nice way to visualize word frequency is with a word cloud. Based on the visual below, can you derive any insights, such as which candidates or issues Trump focused on most?
library(wordcloud)

wordcloud(words = word.freq$WORD,
          freq = word.freq$FREQ,
          max.words = 75,
          col = "blue")
If you plan on using functions from the tm package to process your data, you will first have to turn your text data into a corpus object. A corpus is essentially a collection of documents and is "the main structure for managing documents in tm" (Feinerer). The VectorSource() function is first used to transform the vector elements into documents.
# save tweets to a character vector
vector <- as.vector(tweets.trump$text)
# wrap the vector as a source of documents
corpus.source <- VectorSource(vector)
# save as a corpus object (collection of documents)
corpus <- VCorpus(corpus.source)
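Once the corpus is built, it is worth sanity-checking it before any further processing. A small self-contained sketch (the two-tweet vector here is a stand-in for tweets.trump$text):

```r
library(tm)

# Tiny stand-in vector; in the tutorial this would be tweets.trump$text
tweets <- c("Make America great again", "Crooked media at it again")
corpus <- VCorpus(VectorSource(tweets))

length(corpus)        # number of documents: one per tweet
content(corpus[[1]])  # the text of the first document
inspect(corpus[1])    # tm's summary view of a sub-corpus
```

Double brackets ([[i]]) return a single document, while single brackets ([i]) return a sub-corpus, which is why content() uses the former and inspect() the latter.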
Rest of tutorial coming soon…
Peer Review / Tutoring: stuck on this assignment? Send me your code at firstname.lastname@example.org with a note explaining what you are having trouble with and I will try to help you debug it and understand what is going on. Donations welcomed.
DataCamp: look for their "Intro to Text Mining: Bag of Words" class. $25 a month gets you access to all their classes, which walk you through concepts step by step. It's not as project-based as Coursera, but the skills you learn will definitely help you with the Coursera projects. I've found it helpful to take notes as you work through the exercises so you have something to refer to if you decide to cancel the membership.
Please post any other training materials you found helpful in the comments section and I will update the list!