In my last post I demonstrated how R can be used to scrape information from the internet. In particular, I scraped the archives of DailyHaiku for their poetic treasures.
Today I want to show how to perform a sentiment analysis on those Haiku. Sentiment analysis is a very popular sub-area of natural language processing that is used to systematically identify, extract, and quantify affective states from text. In its most basic form it tells you whether a statement, be it a word, sentence, paragraph, or even a whole book, is positive or negative.
For the demonstration I will use R and the tidytext package, because I just love how tidytext integrates into the R tidyverse. However, several alternative packages for R and other programming languages exist. If you are into Python, for example, you should really check out the Natural Language Toolkit.
Preparation
Before we begin we have to do some groundwork. As mentioned, I will use data from a previous post. If you missed that one, you can download the needed R object here and use load() to add it to your environment, or you can run the code block below to achieve the same.
if (!exists("haiku_clean")) {
  if (!file.exists("haiku_clean.RData")) {
    res <- tryCatch(download.file("http://bit.ly/haikuC_rdata",
                                  "haiku_clean.RData", mode = "wb"),
                    error = function(e) 1)
  }
  load("haiku_clean.RData")
}

Now there should be a haiku_clean object in your environment (if you are unsure, you can test it with exists("haiku_clean")).
To make the individual Haiku easily identifiable, we number them consecutively.
haiku_clean <- haiku_clean %>%
  mutate(h_number = row_number())

Next, we load the required packages for today's task.
library(tidyverse)
library(tidytext)

Note: Today's code examples will rely heavily on piping with the "%>%" operator. Although piping IMHO is much more human readable than traditional R code, some readers might still be overwhelmed by multiple consecutive pipes. One of the beauties of pipelines, however, is that you can break them before any "%>%" and inspect the intermediate result up to that point. This way you can easily trace the changes in data and data organisation from the beginning to the end of a pipeline and find out what each individual command does.
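To illustrate, here is a small made-up example: the full pipeline first, then the shortened version you would run to inspect the data after the first step (the n_chars column is invented purely for this illustration).

# full pipeline: add a character count, then sort, then keep the top rows
haiku_clean %>%
  mutate(n_chars = nchar(text)) %>%
  arrange(desc(n_chars)) %>%
  head(3)

# shortened pipeline: break before the second "%>%" to see the intermediate result
haiku_clean %>%
  mutate(n_chars = nchar(text))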
Get the Sentiments
Tokenization
Several approaches for sentiment analysis exist. Today I will focus on a rather basic variant: unigram-based sentiment analysis. That is, we will assign a sentiment score to each single word and sum the scores up to get an overall score for the analyzed text. This is a very straightforward and computationally inexpensive approach, but it can easily lead to errors if one word changes the sentiment of another word in its vicinity (e.g., negations). Nonetheless, it is a good place to start your sentiment analysis career.
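To make the idea concrete, here is a tiny base-R sketch of unigram scoring; the words and scores are invented for illustration.

# toy lexicon: each word is mapped to a fixed sentiment score
toy_lexicon <- c(lovely = 2, rain = 0, dreadful = -3)

# unigram scoring: look up each word of a "text" and sum the scores
words <- c("lovely", "dreadful", "rain")
sum(toy_lexicon[words], na.rm = TRUE)
#> [1] -1

# caveat: "not lovely" would still contribute +2, because the
# unigram approach ignores the negation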
Because the analysis will be based on single words, we first have to extract them from the Haiku.
haiku_tidy <- haiku_clean %>%
  unnest_tokens(word, text)

Note: When dealing with words, one has to think about capitalization. In most cases it is a good idea to unify it and change all characters to lower-case. unnest_tokens() does this by default.
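If you ever need to keep the original capitalization (e.g., to preserve proper nouns), you can turn this behavior off via the to_lower argument:

# keep the original capitalization instead of converting to lower-case
haiku_clean %>%
  unnest_tokens(word, text, to_lower = FALSE)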
Taking a quick peek at the most frequent words, we see that most of them (e.g., the, of, a, in, ...) are not very informative. They are known as stop words.
haiku_tidy %>%
  count(word, sort = TRUE)

#> # A tibble: 5,816 × 2
#> word n
#> <chr> <int>
#> 1 the 2556
#> 2 of 854
#> 3 a 841
#> 4 in 556
#> 5 on 346
#> 6 my 310
#> 7 to 248
#> 8 moon 209
#> 9 and 166
#> 10 i 145
#> # ... with 5,806 more rows
The tidytext package comes with a predefined lexicon of stop words, which can be used to remove them from your text.
haiku_tidy <- haiku_tidy %>%
  anti_join(stop_words)

After the stop word removal, the most common words are much more haiku-ish.
haiku_tidy %>%
  count(word, sort = TRUE)

#> # A tibble: 5,419 × 2
#> word n
#> <chr> <int>
#> 1 moon 209
#> 2 rain 144
#> 3 snow 140
#> 4 summer 132
#> 5 morning 118
#> 6 spring 114
#> 7 winter 114
#> 8 sky 111
#> 9 wind 105
#> 10 night 97
#> # ... with 5,409 more rows
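Side note: if some frequent but uninformative domain-specific words had remained, we could have extended the predefined lexicon before the anti_join(). A minimal sketch (the added words are just placeholders):

# extend the predefined stop word lexicon with custom entries
my_stop_words <- bind_rows(stop_words,
                           tibble(word    = c("someword", "otherword"),
                                  lexicon = "custom"))

haiku_tidy %>%
  anti_join(my_stop_words)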
Sentiment Lexicons
Now that we have extracted the single words, the next step is to assign each word a sentiment. The common way to do so is to use a sentiment lexicon. Sentiment lexicons consist of words and their associated sentiments. However, sentiment is topic and medium dependent. An excellent sentiment lexicon for tweets, for example, might be only mediocre for poetry or for chapters in a novel. Keep this in mind when you use a ready-made sentiment lexicon from the internet and always compare its original range of use to the problem at hand. In some cases it might be best to build your own.
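If you do build your own, the format is simple: one row per word plus its sentiment. A minimal sketch with invented words and scores, joined just like the predefined lexicons further below:

# a hand-crafted mini lexicon in the same shape as AFINN (word + numeric score)
my_lexicon <- tibble(word  = c("moon", "blossom", "frost"),
                     score = c(1, 2, -1))

haiku_tidy %>%
  inner_join(my_lexicon, by = "word")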
Tidytext comes with three predefined sentiment lexicons, and we will use all three so that you can get a feel for their differences. All are well evaluated and have been used in several scientific publications.
AFINN
The AFINN lexicon rates the valence of words with an integer between -5 and +5. It was originally created to evaluate the sentiment of tweets.
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("afinn") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word score
#> <chr> <int>
#> 1 terrorizes -3
#> 2 dickhead -4
#> 3 expose -1
#> 4 laughs 1
#> 5 tops 2
#> 6 convivial 2
#> 7 terrorized -3
#> 8 unsettled -1
#> 9 offends -2
#> 10 moaning -2
BING
The Bing lexicon assigns either positive or negative valence. It was originally created to evaluate the sentiment of social media (e.g., reviews, forum discussions, and blogs).
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("bing") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word sentiment
#> <chr> <chr>
#> 1 togetherness positive
#> 2 doomed negative
#> 3 freedoms positive
#> 4 lugubrious negative
#> 5 triumphantly positive
#> 6 denial negative
#> 7 top-heavy negative
#> 8 unresolved negative
#> 9 paralyzed negative
#> 10 nourishment positive
NRC
The NRC Word-Emotion Association Lexicon assigns positive or negative valence like Bing. Moreover, it quantifies how strongly a word is linked to the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust on a four-level scale.
Tip: Check the NRC link for further interesting lexicons - e.g., one for emoticons!
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("nrc") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word sentiment
#> <chr> <chr>
#> 1 threatening disgust
#> 2 disaster sadness
#> 3 forced fear
#> 4 manslaughter surprise
#> 5 transcendence surprise
#> 6 criticize fear
#> 7 thriving anticipation
#> 8 unsatisfactory negative
#> 9 penance sadness
#> 10 occasional surprise
Assign Sentiments
To assign the sentiments we make use of dplyr’s inner_join() function.
sentiment_afinn <- haiku_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(h_number) %>%
  summarise(score_afinn = sum(score)) %>%
  ungroup()

To get a sentiment score from the Bing and NRC lexicons, we have to subtract the number of negative from the number of positive assignments per Haiku.
sentiment_bing <- haiku_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(h_number, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(score_bing = positive - negative) %>%
  select(-positive, -negative) %>%
  ungroup()

sentiment_nrc <- haiku_tidy %>%
  inner_join(get_sentiments("nrc")) %>%
  count(h_number, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  setNames(c(names(.)[1], paste0('nrc_', names(.)[-1]))) %>%
  mutate(score_nrc = nrc_positive - nrc_negative) %>%
  ungroup()

Note that of the overall 3225 Haiku, only 746 were assigned a sentiment by AFINN, 1253 by Bing, and 2105 by NRC.
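Each of these score tables contains one row per Haiku that received at least one sentiment assignment, so you can verify the coverage numbers with nrow():

nrow(haiku_clean)      # 3225: total number of Haiku
nrow(sentiment_afinn)  #  746: Haiku with at least one AFINN-rated word
nrow(sentiment_bing)   # 1253: ... with at least one Bing-rated word
nrow(sentiment_nrc)    # 2105: ... with at least one NRC-rated word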
Next, we combine all sentiment ratings and fill the missing values with zeros.
haiku_sentiments <- Reduce(full_join,
                           list(sentiment_nrc, sentiment_bing, sentiment_afinn)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)))
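Note: mutate_each() has since been deprecated in newer dplyr releases. If the line above complains on your system, the following equivalent (dplyr 1.0.0 and later) should work:

# equivalent of the mutate_each() line above in post-1.0.0 dplyr:
# replace all NA values with 0 using across()
haiku_sentiments <- Reduce(full_join,
                           list(sentiment_nrc, sentiment_bing, sentiment_afinn)) %>%
  mutate(across(everything(), ~ replace(., is.na(.), 0)))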
When we look at the correlations between the sentiment scores of the three lexicons, we see that there is only small to medium agreement between them.

h_cors <- haiku_sentiments %>%
  select(starts_with("score")) %>%
  cor() %>%
  round(digits = 2)

upper <- h_cors
upper[upper.tri(h_cors)] <- ""
knitr::kable(upper, format = "html", booktabs = TRUE)

|             | score_nrc | score_bing | score_afinn |
|-------------|-----------|------------|-------------|
| score_nrc   | 1         |            |             |
| score_bing  | 0.4       | 1          |             |
| score_afinn | 0.31      | 0.41       | 1           |
This weak association does not seem to become much stronger when we split across the NRC emotion types.
haiku_sentiments %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  filter(intensity > 0) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  ggplot(aes(x = score_nrc, y = score_bing)) +
  geom_hex(bins = 5) +
  facet_wrap(~emotion, nrow = 2)

Take a Look at the Results
Finally, it is a good idea to take a closer look at the results. Do the sentiment ratings seem sensible to you as a human?
To do so, we have to link the sentiment scores back to the original Haiku (before stop word removal and tokenization). Again, we fill the missing values with zeros.
haiku_full <- full_join(haiku_clean, haiku_sentiments) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)),
              starts_with("score"), starts_with("nrc"))

Let's take a look at the 5 most positive Haiku according to the Bing lexicon.
haiku_full %>%
  top_n(5, score_bing) %>%
  select(text, author, score_bing) %>%
  arrange(desc(score_bing)) %>%
  slice(1:5) %>%
  knitr::kable(format = "html", booktabs = TRUE)

| text | author | score_bing |
|---|---|---|
| afternoon light-bloom soft heavy hot lean against me, an old lover | John Moore Williams | 5 |
| the glow of windfall peaches morning cool | Ann K. Schwader | 3 |
| strong breeze clouds sweep by the top of the tree | Angela Kublik | 3 |
| soft breeze the gentle hum of a beehive | Anna Maris | 2 |
| river opening the politician promises faster | LeRoy Gorman | 2 |
Note: In this pipeline there is some redundancy. The standard way to extract the "top" of anything is top_n(); however, there are several ties in the data, so it returns more than the requested five entries. Hence, I used slice() to take only the first five. This makes the top_n() selection redundant; sorting alone would have been sufficient. I have included it anyway so you can see the standard procedure.
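For reference, the leaner variant mentioned above would look like this:

# sorting plus slice() alone is sufficient when ties make top_n() over-deliver
haiku_full %>%
  select(text, author, score_bing) %>%
  arrange(desc(score_bing)) %>%
  slice(1:5)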
The Haiku indeed seem positive, although I wonder if they would also be the most positive according to a human judge.
Let’s take a look at the 5 most negative Haiku according to the AFINN lexicon.
haiku_full %>%
  top_n(-5, score_afinn) %>%
  select(text, author, score_afinn) %>%
  arrange(score_afinn) %>%
  slice(1:5) %>%
  knitr::kable(format = "html", booktabs = TRUE)

| text | author | score_afinn |
|---|---|---|
| asshole questioning doesn’t know about haiku 5-7-5, bitch | Aaron Marko | -10 |
| Each morning, the stink of fox piss warns the poetDon’t prettify me! | Temple Cone | -6 |
| coffee with Buddha I talk about my worries his dick-ass smile. | Thomas Trofimuk | -6 |
| paul mccartney, a conspiracy theory: he’s really a douche | Aaron Marko | -6 |
| lost on the Appalachian trail I murder the map | Carolyne Rohrig | -5 |
Well, they are quite negative (and vivid).
I guess you have already gotten the hang of extracting Haiku together with their corresponding sentiment, so I will conclude the last lexicon with a short inspirational example of what can be done with the sentiment assignments. Let's take a look at how the emotional sentiment of the published Haiku has changed over time.
haiku_full %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  filter(!emotion %in% c("positive", "negative")) %>%
  ggplot(aes(x = date, y = intensity, color = emotion, fill = emotion)) +
  geom_smooth(se = FALSE) +
  scale_color_brewer(palette = "Dark2")

#> `geom_smooth()` using method = 'gam'
We can see interesting patterns here. There seems to be a general decline in emotion between 2006 and 2010. After that, anticipation, sadness, and fear seem to have stagnated, while for the rest we see an upward trend over the last two years. Of course, this is only a crude (over?)interpretation of the plot. We could improve the interpretability by adding confidence bands to the plot with geom_smooth(se = TRUE), or you could try some more sophisticated statistical methods. Whatever you try, have fun! And I'm happy to read about your exploits in the comments section.
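For reference, here is the same pipeline with confidence bands turned on; only the geom_smooth() call changes (the added scale_fill_brewer() is my own suggestion so the bands match the line colors):

haiku_full %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  filter(!emotion %in% c("positive", "negative")) %>%
  ggplot(aes(x = date, y = intensity, color = emotion, fill = emotion)) +
  geom_smooth(se = TRUE) +               # se = TRUE draws the confidence bands
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2")   # assumed addition to color the bands consistently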
Closing Remarks
I hope you have enjoyed this little exploration and have gained a little insight into sentiment analysis.
If you are going to use the described approach be aware of its two major shortcomings:
- the sentiment of one word can be changed by the presence of another word (the unigram approach does NOT account for this)
- the sentiment lexicons might not be adequate for your use case
Further, I have not done any data cleaning. With some more data preparation the results can be easily improved. For a hands-on example see Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym.
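As a small taste of such cleaning, you could stem the tokens before the sentiment join to collapse inflected word forms. A sketch using the SnowballC package (which appears in the loaded namespaces of my session below):

# collapse inflected forms onto a common stem, e.g. "blossoms" -> "blossom"
haiku_tidy %>%
  mutate(stem = SnowballC::wordStem(word))

# caveat: stems do not always match lexicon entries verbatim,
# so inspect the result before joining sentiments on the stems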
In this post I have relied heavily on pipes and tidy R. If you want to learn more about those concepts, check out R for Data Science by the great Hadley Wickham. If you want to deepen your knowledge about the tidytext package and text analysis in R, see Text Mining with R. Both books are also available as free eBooks at bookdown. Of course pipes, text analysis, and R will again be topics in future posts on BernhardLearns.com - so please stay tuned!
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()

#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] hexbin_1.27.1 tidytext_0.1.2.900 dplyr_0.5.0
#> [4] purrr_0.2.2 readr_1.1.0 tidyr_0.6.1
#> [7] tibble_1.3.0 ggplot2_2.2.1 tidyverse_1.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.10 RColorBrewer_1.1-2 highr_0.6
#> [4] cellranger_1.1.0 plyr_1.8.4 tokenizers_0.1.4
#> [7] forcats_0.2.0 tools_3.3.2 digest_0.6.12
#> [10] jsonlite_1.2 lubridate_1.6.0 evaluate_0.10
#> [13] nlme_3.1-131 gtable_0.2.0 lattice_0.20-34
#> [16] mgcv_1.8-17 Matrix_1.2-8 psych_1.6.12
#> [19] DBI_0.6-1 yaml_2.1.14 parallel_3.3.2
#> [22] haven_1.0.0 janeaustenr_0.1.4 xml2_1.1.1
#> [25] stringr_1.2.0 httr_1.2.1 knitr_1.15.1
#> [28] hms_0.3 rprojroot_1.2 grid_3.3.2
#> [31] R6_2.2.0 readxl_1.0.0 foreign_0.8-67
#> [34] rmarkdown_1.5 modelr_0.1.0 reshape2_1.4.2
#> [37] magrittr_1.5 codetools_0.2-15 SnowballC_0.5.1
#> [40] backports_1.0.5 scales_0.4.1 htmltools_0.3.5
#> [43] rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
#> [46] colorspace_1.3-2 labeling_0.3 stringi_1.1.5
#> [49] lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2