In my last post I demonstrated how R can be used to scrape information from the internet. In particular, I scraped the archives of DailyHaiku for their poetic treasures.
Today I want to show how to perform a sentiment analysis on those Haiku. Sentiment analysis is a very popular sub-area of natural language processing that is used to systematically identify, extract, and quantify affective states from text. In its most basic form it tells you whether a statement, be it a word, sentence, paragraph, or even a whole book, is positive or negative.
For the demonstration I will use R and the tidytext package, because I just love how tidytext integrates into the R tidyverse. However, several alternative packages for R and other programming languages exist. If you are into Python, for example, you should really check out the Natural Language Toolkit.
Preparation
Before we begin we have to do some groundwork. As mentioned, I will use data from a previous post. If you missed that one, you can download the needed R object here and use load() to add it to your environment, or you can run the code block below to achieve the same.
if (!exists("haiku_clean")) {
  if (!file.exists("haiku_clean.RData")) {
    res <- tryCatch(download.file("http://bit.ly/haikuC_rdata",
                                  "haiku_clean.RData", mode = "wb"),
                    error = function(e) 1)
  }
  load("haiku_clean.RData")
}

Now there should be a haiku_clean object in your environment (if you are unsure, you can test it with exists("haiku_clean")).
To make the individual Haiku easily identifiable, we number them consecutively.
haiku_clean <- haiku_clean %>%
  mutate(h_number = row_number())

Next, we load the required packages for today's task.
library(tidyverse)
library(tidytext)

Note: Today's code examples will rely heavily on piping with the "%>%" operator. Although piping IMHO is much more human readable than traditional R code, some readers might still be overwhelmed by multiple consecutive pipes. One of the beauties of pipelines, however, is that you can break them before any "%>%" and inspect the intermediate result up to that point. This way you can easily trace the changes in data and data organisation from the beginning to the end of a pipeline and find out what each individual command does.
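To illustrate, here is a small made-up example: the full pipeline first, then the shortened version you would run to inspect the data after the first step (the n_chars column is invented purely for this illustration).

# full pipeline: add a character count, then sort, then keep the top rows
haiku_clean %>%
  mutate(n_chars = nchar(text)) %>%
  arrange(desc(n_chars)) %>%
  head(3)

# shortened pipeline: break before the second "%>%" to see the intermediate result
haiku_clean %>%
  mutate(n_chars = nchar(text))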
Get the Sentiments
Tokenization
Several approaches for sentiment analysis exist. Today I will focus on a rather basic variant: unigram-based sentiment analysis. That is, we will assign a sentiment score to each single word and sum the scores up to get an overall score for the analyzed text. This is a very straightforward and computationally inexpensive approach, but it can easily lead to errors if one word changes the sentiment of another word in its vicinity (e.g., negations). Nonetheless, it is a good place to start your sentiment analysis career.
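To make the idea concrete, here is a tiny base-R sketch of unigram scoring; the words and scores are invented for illustration.

# toy lexicon: each word is mapped to a fixed sentiment score
toy_lexicon <- c(lovely = 2, rain = 0, dreadful = -3)

# unigram scoring: look up each word of a "text" and sum the scores
words <- c("lovely", "dreadful", "rain")
sum(toy_lexicon[words], na.rm = TRUE)
#> [1] -1

# caveat: "not lovely" would still contribute +2, because the
# unigram approach ignores the negation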
Because the analysis will be based on single words, we first have to extract them from the Haiku.
haiku_tidy <- haiku_clean %>%
  unnest_tokens(word, text)

Note: When dealing with words, one has to think about capitalization. In most cases it is a good idea to unify it and change all characters to lower-case. unnest_tokens() does this by default.
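If you ever need to keep the original capitalization (e.g., to preserve proper nouns), you can turn this behavior off via the to_lower argument:

# keep the original capitalization instead of converting to lower-case
haiku_clean %>%
  unnest_tokens(word, text, to_lower = FALSE)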
Taking a quick peek at the most frequent words, we see that most of them (e.g., the, of, a, in, ...) are not very informative. They are known as stop words.
haiku_tidy %>%
  count(word, sort = TRUE)

#> # A tibble: 5,816 × 2
#> word n
#> <chr> <int>
#> 1 the 2556
#> 2 of 854
#> 3 a 841
#> 4 in 556
#> 5 on 346
#> 6 my 310
#> 7 to 248
#> 8 moon 209
#> 9 and 166
#> 10 i 145
#> # ... with 5,806 more rows
The tidytext package comes with a predefined lexicon of stop words, which can be used to remove them from your text.
haiku_tidy <- haiku_tidy %>%
  anti_join(stop_words)

After the stop word removal, the most common words are much more haiku-ish.
haiku_tidy %>%
  count(word, sort = TRUE)

#> # A tibble: 5,419 × 2
#> word n
#> <chr> <int>
#> 1 moon 209
#> 2 rain 144
#> 3 snow 140
#> 4 summer 132
#> 5 morning 118
#> 6 spring 114
#> 7 winter 114
#> 8 sky 111
#> 9 wind 105
#> 10 night 97
#> # ... with 5,409 more rows
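Side note: if some frequent but uninformative domain-specific words had remained, we could have extended the predefined lexicon before the anti_join(). A minimal sketch (the added words are just placeholders):

# extend the predefined stop word lexicon with custom entries
my_stop_words <- bind_rows(stop_words,
                           tibble(word    = c("someword", "otherword"),
                                  lexicon = "custom"))

haiku_tidy %>%
  anti_join(my_stop_words)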
Sentiment Lexicons
Now that we have extracted the single words, the next step is to assign each word a sentiment. The common way to do so is to use a sentiment lexicon. Sentiment lexicons consist of words and their associated sentiments. However, sentiment is topic and medium dependent. An excellent sentiment lexicon for tweets, for example, might be only mediocre for poetry or for chapters in a novel. Keep this in mind when you use a ready-made sentiment lexicon from the internet and always compare its original range of use to the problem at hand. In some cases it might be best to build your own.
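If you do build your own, the format is simple: one row per word plus its sentiment. A minimal sketch with invented words and scores, joined just like the predefined lexicons further below:

# a hand-crafted mini lexicon in the same shape as AFINN (word + numeric score)
my_lexicon <- tibble(word  = c("moon", "blossom", "frost"),
                     score = c(1, 2, -1))

haiku_tidy %>%
  inner_join(my_lexicon, by = "word")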
Tidytext comes with three predefined sentiment lexicons, and we will use all three so that you can get a feel for their differences. All are well evaluated and have been used in several scientific publications.
AFINN
The AFINN lexicon rates the valence of words with an integer between -5 and +5. It was originally created to evaluate the sentiment of tweets.
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("afinn") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word score
#> <chr> <int>
#> 1 terrorizes -3
#> 2 dickhead -4
#> 3 expose -1
#> 4 laughs 1
#> 5 tops 2
#> 6 convivial 2
#> 7 terrorized -3
#> 8 unsettled -1
#> 9 offends -2
#> 10 moaning -2
BING
The Bing lexicon assigns either positive or negative valence. It was originally created to evaluate the sentiment of social media (e.g., reviews, forum discussions, and blogs).
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("bing") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word sentiment
#> <chr> <chr>
#> 1 togetherness positive
#> 2 doomed negative
#> 3 freedoms positive
#> 4 lugubrious negative
#> 5 triumphantly positive
#> 6 denial negative
#> 7 top-heavy negative
#> 8 unresolved negative
#> 9 paralyzed negative
#> 10 nourishment positive
NRC
The NRC Word-Emotion Association Lexicon assigns positive or negative valence like Bing. Moreover, it quantifies how strongly a word is linked to the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust on a four-level scale.
Tip: Check the NRC link for further interesting lexicons - e.g., one for emoticons!
Let’s take a quick peek.
set.seed(0) # for replicability
get_sentiments("nrc") %>%
  sample_n(10)

#> # A tibble: 10 × 2
#> word sentiment
#> <chr> <chr>
#> 1 threatening disgust
#> 2 disaster sadness
#> 3 forced fear
#> 4 manslaughter surprise
#> 5 transcendence surprise
#> 6 criticize fear
#> 7 thriving anticipation
#> 8 unsatisfactory negative
#> 9 penance sadness
#> 10 occasional surprise
Assign Sentiments
To assign the sentiments we make use of dplyr’s inner_join() function.
sentiment_afinn <- haiku_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(h_number) %>%
  summarise(score_afinn = sum(score)) %>%
  ungroup()

To get a sentiment score from the Bing and NRC lexicons, we have to subtract the number of negative from the number of positive assignments per Haiku.
sentiment_bing <- haiku_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(h_number, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(score_bing = positive - negative) %>%
  select(-positive, -negative) %>%
  ungroup()

sentiment_nrc <- haiku_tidy %>%
  inner_join(get_sentiments("nrc")) %>%
  count(h_number, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  setNames(c(names(.)[1], paste0('nrc_', names(.)[-1]))) %>%
  mutate(score_nrc = nrc_positive - nrc_negative) %>%
  ungroup()

Note that of the overall 3225 Haiku, only 746 were assigned a sentiment by AFINN, 1253 by Bing, and 2105 by NRC.
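Each of these score tables contains one row per Haiku that received at least one sentiment assignment, so you can verify the coverage numbers with nrow():

nrow(haiku_clean)      # 3225: total number of Haiku
nrow(sentiment_afinn)  #  746: Haiku with at least one AFINN-rated word
nrow(sentiment_bing)   # 1253: ... with at least one Bing-rated word
nrow(sentiment_nrc)    # 2105: ... with at least one NRC-rated word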
Next, we combine all sentiment ratings and fill the missing values with zeros.
haiku_sentiments <- Reduce(full_join,
                           list(sentiment_nrc, sentiment_bing, sentiment_afinn)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)))
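Note: mutate_each() has since been deprecated in newer dplyr releases. If the line above complains on your system, the following equivalent (dplyr 1.0.0 and later) should work:

# equivalent of the mutate_each() line above in post-1.0.0 dplyr:
# replace all NA values with 0 using across()
haiku_sentiments <- Reduce(full_join,
                           list(sentiment_nrc, sentiment_bing, sentiment_afinn)) %>%
  mutate(across(everything(), ~ replace(., is.na(.), 0)))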
When we look at the correlations between the sentiment scores of the three lexicons, we see that there is only small to medium agreement between them.

h_cors <- haiku_sentiments %>%
  select(starts_with("score")) %>%
  cor() %>%
  round(digits = 2)

upper <- h_cors
upper[upper.tri(h_cors)] <- ""
knitr::kable(upper, format = "html", booktabs = TRUE)

|             | score_nrc | score_bing | score_afinn |
|-------------|-----------|------------|-------------|
| score_nrc   | 1         |            |             |
| score_bing  | 0.4       | 1          |             |
| score_afinn | 0.31      | 0.41       | 1           |
This weak association does not seem to become much stronger when we split across the NRC emotion types.
haiku_sentiments %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  filter(intensity > 0) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  ggplot(aes(x = score_nrc, y = score_bing)) +
  geom_hex(bins = 5) +
  facet_wrap(~emotion, nrow = 2)

Take a Look at the Results
Finally, it is a good idea to take a closer look at the results. Do the sentiment ratings seem sensible to you as a human?
To do so, we have to link the sentiment scores back to the original Haiku (before stop word removal and tokenization). Again, we fill the missing values with zeros.
haiku_full <- full_join(haiku_clean, haiku_sentiments) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)),
              starts_with("score"), starts_with("nrc"))

Let's take a look at the 5 most positive Haiku according to the Bing lexicon.
haiku_full %>%
  top_n(5, score_bing) %>%
  select(text, author, score_bing) %>%
  arrange(desc(score_bing)) %>%
  slice(1:5) %>%
  knitr::kable(format = "html", booktabs = TRUE)

| text | author | score_bing |
|---|---|---|
| afternoon light-bloom soft heavy hot lean against me, an old lover | John Moore Williams | 5 |
| the glow of windfall peaches morning cool | Ann K. Schwader | 3 |
| strong breeze clouds sweep by the top of the tree | Angela Kublik | 3 |
| soft breeze the gentle hum of a beehive | Anna Maris | 2 |
| river opening the politician promises faster | LeRoy Gorman | 2 |
Note: In this pipeline there is some redundancy. The standard way to extract the "top" of anything is top_n(); however, there are several ties in the data, so it returns more than the requested five entries. Hence, I used slice() to take only the first five. This makes the top_n() selection redundant; sorting alone would have been sufficient. I have included it anyway so you can see the standard procedure.
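For reference, the leaner variant mentioned above would look like this:

# sorting plus slice() alone is sufficient when ties make top_n() over-deliver
haiku_full %>%
  select(text, author, score_bing) %>%
  arrange(desc(score_bing)) %>%
  slice(1:5)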
The Haiku indeed seem positive, although I wonder if they would also be the most positive according to a human judge.
Let’s take a look at the 5 most negative Haiku according to the AFINN lexicon.
haiku_full %>%
  top_n(-5, score_afinn) %>%
  select(text, author, score_afinn) %>%
  arrange(score_afinn) %>%
  slice(1:5) %>%
  knitr::kable(format = "html", booktabs = TRUE)

| text | author | score_afinn |
|---|---|---|
| asshole questioning doesn’t know about haiku 5-7-5, bitch | Aaron Marko | -10 |
| Each morning, the stink of fox piss warns the poetDon’t prettify me! | Temple Cone | -6 |
| coffee with Buddha I talk about my worries his dick-ass smile. | Thomas Trofimuk | -6 |
| paul mccartney, a conspiracy theory: he’s really a douche | Aaron Marko | -6 |
| lost on the Appalachian trail I murder the map | Carolyne Rohrig | -5 |
Well, they are quite negative (and vivid).
I guess you have already gotten the hang of extracting Haiku together with their corresponding sentiment, so I will conclude the last lexicon with a short inspirational example of what can be done with the sentiment assignments. Let's take a look at how the emotional sentiment of the published Haiku has changed over time.
haiku_full %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  filter(!emotion %in% c("positive", "negative")) %>%
  ggplot(aes(x = date, y = intensity, color = emotion, fill = emotion)) +
  geom_smooth(se = FALSE) +
  scale_color_brewer(palette = "Dark2")

#> `geom_smooth()` using method = 'gam'
We can see interesting patterns here. There seems to be a general decline in emotion between 2006 and 2010. After that, anticipation, sadness, and fear seem to have stagnated, while for the rest we see an upward trend over the last two years. Of course, this is only a crude (over?)interpretation of the plot. We could improve the interpretability by adding confidence bands to the plot with geom_smooth(se = TRUE), or you could try some more sophisticated statistical methods. Whatever you try, have fun! And I'm happy to read about your exploits in the comments section.
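For reference, here is the same pipeline with confidence bands turned on; only the geom_smooth() call changes (the added scale_fill_brewer() is my own suggestion so the bands match the line colors):

haiku_full %>%
  gather(emotion, intensity, starts_with("nrc_")) %>%
  mutate(emotion = substring(emotion, 5)) %>%
  filter(!emotion %in% c("positive", "negative")) %>%
  ggplot(aes(x = date, y = intensity, color = emotion, fill = emotion)) +
  geom_smooth(se = TRUE) +               # se = TRUE draws the confidence bands
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2")   # assumed addition to color the bands consistently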
Closing Remarks
I hope you have enjoyed this little exploration and have gained a little insight into sentiment analysis.
If you are going to use the described approach be aware of its two major shortcomings:
- the sentiment of one word can be changed by the presence of another word (the unigram approach does NOT account for this)
- the sentiment lexicons might not be adequate for your use case
Further, I have not done any data cleaning. With some more data preparation the results can be easily improved. For a hands-on example see Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym.
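As a small taste of such cleaning, you could stem the tokens before the sentiment join to collapse inflected word forms. A sketch using the SnowballC package (which appears in the loaded namespaces of my session below):

# collapse inflected forms onto a common stem, e.g. "blossoms" -> "blossom"
haiku_tidy %>%
  mutate(stem = SnowballC::wordStem(word))

# caveat: stems do not always match lexicon entries verbatim,
# so inspect the result before joining sentiments on the stems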
In this post I have relied heavily on pipes and tidy R. If you want to learn more about those concepts, check out R for Data Science by the great Hadley Wickham. If you want to deepen your knowledge about the tidytext package and text analysis in R, see Text Mining with R. Both books are also available as free eBooks at bookdown. Of course pipes, text analysis, and R will again be topics in future posts on BernhardLearns.com - so please stay tuned!
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()

#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] hexbin_1.27.1 tidytext_0.1.2.900 dplyr_0.5.0
#> [4] purrr_0.2.2 readr_1.1.0 tidyr_0.6.1
#> [7] tibble_1.3.0 ggplot2_2.2.1 tidyverse_1.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.10 RColorBrewer_1.1-2 highr_0.6
#> [4] cellranger_1.1.0 plyr_1.8.4 tokenizers_0.1.4
#> [7] forcats_0.2.0 tools_3.3.2 digest_0.6.12
#> [10] jsonlite_1.2 lubridate_1.6.0 evaluate_0.10
#> [13] nlme_3.1-131 gtable_0.2.0 lattice_0.20-34
#> [16] mgcv_1.8-17 Matrix_1.2-8 psych_1.6.12
#> [19] DBI_0.6-1 yaml_2.1.14 parallel_3.3.2
#> [22] haven_1.0.0 janeaustenr_0.1.4 xml2_1.1.1
#> [25] stringr_1.2.0 httr_1.2.1 knitr_1.15.1
#> [28] hms_0.3 rprojroot_1.2 grid_3.3.2
#> [31] R6_2.2.0 readxl_1.0.0 foreign_0.8-67
#> [34] rmarkdown_1.5 modelr_0.1.0 reshape2_1.4.2
#> [37] magrittr_1.5 codetools_0.2-15 SnowballC_0.5.1
#> [40] backports_1.0.5 scales_0.4.1 htmltools_0.3.5
#> [43] rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
#> [46] colorspace_1.3-2 labeling_0.3 stringi_1.1.5
#> [49] lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2