Last week we saw how to assign sentiment to words. Although it worked reasonably well, there were some problems. For instance, many words in our text had no counterpart in the sentiment lexicons. There are several reasons for this:
- not every word carries a sentiment
- the lexicons were created for other types of text
- and, finally, we haven’t cleaned our data!
Data cleaning is an important part of every data analysis - this applies to dealing with words as well.
Today I want to show you three cleaning techniques for words in R:
- Stemming
- Lemmatization
- Replacing with a more common synonym
But first we need some data to experiment on. For simplicity I will use the haiku_tidy object from my last post - if you have missed that one, you can download the needed R object here and use load() to add it to your environment, or you can run the code-block below to achieve the same.
if (!exists("haiku_tidy")){
  if (!file.exists("haiku_tidy.RData")){
    res <- tryCatch(download.file("http://bit.ly/haiku_tidy",
                                  "haiku_tidy.RData", mode = "wb"),
                    error = function(e) 1)
  }
  load("haiku_tidy.RData")
}
Next we need some basic R packages for the work flow. Further packages for the particular cleaning steps will be added when needed.
library(tidyverse) # R is better when it is tidy
library(stringr) # for string manipulation
To speed things up, we will not work on every word instance but keep only the unique ones. If needed, the results for the unique words can easily be mapped back to the larger data frame (a small sketch of this mapping follows Table 1 below). Some of the techniques I am going to present are quite computationally expensive, so if you have a much larger data set they might not be feasible. I have added system.time() to the particular work steps so that you can see and decide for yourself.
Further, we apply some basic cleaning:
- removing the possessive ending: ’s
- removing all words containing non-alphabetic characters (depending on the task at hand this might be a bad idea - e.g., in social media, emoticons can be very informative)
- removing missing values
lemma_unique <- haiku_tidy %>%
  select(word) %>%
  mutate(word_clean = str_replace_all(word, "\u2019s|'s", "")) %>%
  mutate(word_clean = ifelse(str_detect(word_clean, "[^[:alpha:]]"), NA, word_clean)) %>%
  filter(!duplicated(word_clean)) %>%
  filter(!is.na(word_clean)) %>%
  arrange(word)
Table 1: head(lemma_unique)
| word | word_clean |
|---|---|
| яs | я |
| abandon | abandon |
| abandoned | abandoned |
| abandoning | abandoning |
| absent | absent |
| absently | absently |
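As mentioned above, the results for the unique words can easily be mapped back to all word instances in haiku_tidy. Here is a minimal sketch (the object name haiku_mapped is just for illustration); we recompute word_clean on the full data frame because duplicates and non-alphabetic words were dropped from lemma_unique:
# map the unique-word results back to every word instance
haiku_mapped <- haiku_tidy %>%
  mutate(word_clean = str_replace_all(word, "\u2019s|'s", "")) %>%
  mutate(word_clean = ifelse(str_detect(word_clean, "[^[:alpha:]]"), NA, word_clean)) %>%
  left_join(lemma_unique %>% select(-word), by = "word_clean")
At this point the join only brings word_clean along, but the very same call also carries the stems, lemmata and synonyms back to every instance once those columns have been added in the steps below.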
Stemming
We can see in Table 1 that many words are very similar, e.g.,
- abandon, abandoned, abandoning
- add, added, adding
- apologies, apologize, apology
Based on specific rules these words can be reduced to their (word) stems. This process is called stemming. In R this can be done with the SnowballC package.
library(SnowballC)
system.time(
  lemma_unique <- lemma_unique %>%
    mutate(word_stem = wordStem(word_clean, language = "english"))
)
#> user  system elapsed
#> 0.02    0.00    0.02
Positive points for stemming are:
- It is super fast (just take a look at the system.time())
- Algorithms exist for many languages
- It groups together related words
Negative points for stemming are (see the small example below):
- Stems are not always words themselves (very problematic if you plan to work with a lexicon)
- Sometimes unrelated words are grouped together
- Sometimes related words are not grouped together
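To illustrate these pitfalls, here is a tiny example with words that are not from our haiku data. The first three should all collapse to the non-word stem "univers" (over-stemming, non-word result), while the clearly related "data" and "datum" should stay apart (under-stemming):
# illustrative words only (SnowballC is already loaded above)
wordStem(c("universal", "university", "universe",  # all reduced to the non-word "univers"
           "data", "datum"),                        # related words that are not grouped
         language = "english")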
Lemmatization
In contrast to stemming, lemmatization is much more sophisticated. Lemmatization is the process of grouping together the inflected forms of a word. The resulting lemma can be analyzed as a single item.
In R itself there is no package for lemmatization. However, the package koRpus is a wrapper for the free third-party software TreeTagger. Several wrappers for other programming languages (e.g., Java, Ruby, Python, …) exist as well. Before you continue, download TreeTagger and install it on your computer. Don’t forget to download the parameter file for the needed language as well.
TreeTagger has to be installed on your system for the next step to work!
library(koRpus)
system.time(
  lemma_tagged <- treetag(lemma_unique$word_clean, treetagger = "manual",
                          format = "obj", TT.tknz = FALSE, lang = "en",
                          TT.options = list(path = "c:/BP/TreeTagger",
                                            preset = "en"))
)
#> user  system elapsed
#> 0.64    0.17    1.50
This took considerably longer than stemming, but even for larger text corpora it should finish in a reasonable time, especially if you lemmatize only the unique words and map the result back to all instances.
Note: If you input each word on its own (like we just did) instead of entering whole sentences, then TreeTagger’s wclass (word-class) tag might be wrong. Depending on the job at hand this can be a problem. Does it matter to you whether, e.g., love is a noun or a verb?
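If the word class matters for your analysis, you could tag whole sentences instead of isolated words, so that TreeTagger can use the context. A hedged sketch - the sentence is made up, and the TreeTagger path is the same assumption as above, so adjust it to your installation:
sentence_tagged <- treetag("I love the love that you give me.",
                           treetagger = "manual", format = "obj",
                           TT.tknz = TRUE, lang = "en",   # let the text be tokenized for us
                           TT.options = list(path = "c:/BP/TreeTagger",
                                             preset = "en"))
sentence_tagged@TT.res[, c("token", "lemma", "wclass")]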
From the lemma_tagged object we need the TT.res table.
lemma_tagged_tbl <- tbl_df(lemma_tagged@TT.res)
We join this table with the data frame of unique words and skip words with no identified lemma.
lemma_unique <- lemma_unique %>%
  left_join(lemma_tagged_tbl %>%
              filter(lemma != "<unknown>") %>%
              select(token, lemma, wclass),
            by = c("word_clean" = "token")) %>%
  arrange(word)
Positive points for lemmatization are:
- It overcomes the three major problems of stemming (results are always words; it neither groups together unrelated words nor misses grouping together related ones)
- If you have to link your data to further lexicons, lemmatized versions often exist; they are smaller, so data joins are faster
- TreeTagger supports many languages
Negative points for lemmatization are:
- It is computationally more expensive than stemming
- Different words meaning the same thing (synonyms) are not grouped together
Replacing with a more common synonym
Using lemmata instead of arbitrary word inflections helps to group together all forms of one word. However, we are often interested in the meaning of a word and not in the particular form that represents it.
To solve this problem we can look up the most common synonym of the words.
For English words we can use the famous WordNet and its R wrapper in the wordnet package.
WordNet has to be installed on your system for the next step to work!
library(wordnet)
The synonyms() function does not support all TreeTagger word classes, so we will use a little wrapper that simply returns the word itself in those cases instead of throwing an error.
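Before wrapping it, you can try synonyms() directly. Note that the wordnet package has to find your WordNet installation, e.g., via the WNHOME environment variable or setDict(); the path in the comment is only a placeholder:
# setDict("C:/Program Files (x86)/WordNet/2.1/dict")  # placeholder path - adjust or set WNHOME
synonyms("love", "NOUN")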
synonyms_failsafe <- function(word, pos){
  tryCatch({
    syn_list <- list(syn = synonyms(word, toupper(pos)))
    if (length(syn_list[["syn"]]) == 0) syn_list[["syn"]][[1]] <- word
    syn_list
  },
  error = function(err){
    return(list(syn = word))
  })
}
system.time(
  lemma_unique <- lemma_unique %>%
    mutate(word_synonym = map2(lemma, wclass, synonyms_failsafe))
)
#> user    system elapsed
#> 515.99   35.60  379.38
Finding all the synonyms really took a long time!
To identify the most common synonym we have to use a word frequency list. In this case we will rely on one extracted from the British National Corpus. Several lists were compiled by Adam Kilgarriff - we can use the lemmatized version (all synonyms returned by WordNet are lemmata).
The word classes in the list are abbreviated, so we have to modify them a little to match them with wclass.
if (!exists("word_frequencies")){
  if (!file.exists("lemma.num")){
    res <- tryCatch(download.file("http://www.kilgarriff.co.uk/BNClists/lemma.num",
                                  "lemma.num", mode = "wb"),
                    error = function(e) 1)
  }
  word_frequencies <-
    readr::read_table2("lemma.num",
                       col_names = c("sort_order", "frequency", "word", "wclass"))
  # harmonize wclass types with existing
  word_frequencies <- word_frequencies %>%
    mutate(wclass = case_when(.$wclass == "conj" ~ "conjunction",
                              .$wclass == "adv"  ~ "adverb",
                              .$wclass == "v"    ~ "verb",
                              .$wclass == "det"  ~ "determiner",
                              .$wclass == "pron" ~ "pronoun",
                              .$wclass == "a"    ~ "adjective",
                              .$wclass == "n"    ~ "noun",
                              .$wclass == "prep" ~ "preposition"))
}
We build a little function to return the most frequent synonym. In case none of the synonyms is in the frequency list, it returns NA. We replace those NAs with the original lemma.
frequent_synonym <- function(syn_list, pos = NA, word_frequencies){
  syn_vector <- syn_list$syn
  if (!is.na(pos) && pos %in% unique(word_frequencies$wclass)){
    syn_tbl <- tibble(word = syn_vector,
                      wclass = pos)
  } else {
    syn_tbl <- tibble(word = syn_vector)
  }
  suppressMessages(
    syn_tbl <- syn_tbl %>%
      inner_join(word_frequencies) %>%
      arrange(desc(frequency))   # most frequent synonym first
  )
  return(ifelse(nrow(syn_tbl) == 0, NA, syn_tbl$word[[1]]))
}
Note: The frequent_synonym() function can exploit knowledge about the word class. However, I didn’t use this feature, as the word class was extracted from single words, not from words in a sentence, and is therefore unreliable.
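A quick sanity check with a hand-made synonym list (the three words are arbitrary examples of mine): the call should return whichever of the supplied words is the most frequent in the BNC list.
frequent_synonym(list(syn = c("vacate", "empty", "forsake")),
                 word_frequencies = word_frequencies)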
system.time(
  lemma_unique <- lemma_unique %>%
    mutate(synonym = map_chr(word_synonym, frequent_synonym,
                             word_frequencies = word_frequencies)) %>%
    mutate(synonym = ifelse(is.na(synonym), lemma, synonym))
)
#> user   system elapsed
#> 17.96    0.00   17.95
Well, this took some time as well.
Positive points for replacing with a more common synonym are:
- Words with the same meaning are grouped together
Negative points for replacing with a more common synonym are:
- It is computationally expensive!
- Loss of more fine-grained information
- At least in R there is no multi-language wrapper, as far as I know. However, following the given example it should be easy enough to create your own solution, as long as you have a synonym list and a frequency list. If no sensible frequency list is available, you should be able to compile your own from the Google Ngram data using the wrapper offered by the ngramr package.
Comparing the Results
To get a feeling of how the techniques work I encourage you to take a little time and flip through the whole results. If you have not tried out the given code yourself, then you can download the results as a csv file here. For a snippet, take a look at Table 2.
Table 2: head() of cleaned word versions in lemma_unique
| word | word_clean | word_stem | lemma | synonym |
|---|---|---|---|---|
| яs | я | я | | |
| abandon | abandon | abandon | abandon | empty |
| abandoned | abandoned | abandon | abandon | empty |
| abandoning | abandoning | abandon | abandon | empty |
| absent | absent | absent | absent | absent |
| absently | absently | absent | absently | absently |
To quickly get a hint of the usefulness of the cleaning techniques, we can check how many words in our list can be linked to a sentiment in the Bing lexicon.
Note: The stemming results are not included, because the word stems are not a good match for the given sentiment lexicon. In other analyses that do not rely on lexicons, e.g., topic models, this is not a problem.
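If you want to check this claim for yourself, the analogous count for the stems can be computed in the same way (this is not part of the comparison below; expect a low number, since many stems are not dictionary words):
n_stem <- lemma_unique %>%
  inner_join(tidytext::get_sentiments("bing"),
             by = c("word_stem" = "word")) %>%
  nrow()
n_stem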
n_orig <- lemma_unique %>%
  inner_join(tidytext::get_sentiments("bing"),
             by = c("word" = "word")) %>%
  nrow()
n_orig
#> [1] 519
n_clean <- lemma_unique %>%
  inner_join(tidytext::get_sentiments("bing"),
             by = c("word_clean" = "word")) %>%
  nrow()
n_clean
#> [1] 525
n_lemma <- lemma_unique %>%
  inner_join(tidytext::get_sentiments("bing"),
             by = c("lemma" = "word")) %>%
  nrow()
n_lemma
#> [1] 672
n_synonym <- lemma_unique %>%
  inner_join(tidytext::get_sentiments("bing"),
             by = c("synonym" = "word")) %>%
  nrow()
n_synonym
#> [1] 819
Out of the original 5222 unique words, a sentiment was assigned to only 519. Simply removing possessive endings increased the number to 525. The step to lemmatization is considerably larger and yields 672 assignments. After replacing the lemmata with their most common synonym, the number of assignments even soars to 819.
Closing Remarks
Data cleaning is essential. I hope the given example was illustrative and you have seen how it can benefit your analyses.
Of course this overview was by no means exhaustive. E.g., it might be a good idea to check for typos and correct them. Did TreeTagger fail on those words without a lemma just because they were misspelled? The hunspell package might be a good starting point for solving this problem. If you try it out, then please share your experiences in the comment section.
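A minimal sketch of such a check, assuming the hunspell package with its default English dictionary: we look at the cleaned words for which TreeTagger found no lemma and let hunspell flag them and suggest a correction.
library(hunspell)
unknown_words <- lemma_unique %>%
  filter(is.na(lemma), !is.na(word_clean)) %>%          # words without a lemma
  mutate(correct = hunspell_check(word_clean),           # TRUE if the dictionary knows the word
         suggestion = map_chr(hunspell_suggest(word_clean),
                              ~ ifelse(length(.) > 0, .[[1]], NA_character_))) %>%
  filter(!correct)                                       # keep only the suspected typos
head(unknown_words)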
Another idea is to use WordNet for replacing words with hypernyms (a broader category that includes the original word) instead of synonyms. This would allow us to condense words to their concepts. Again, I’m very interested in your exploits. So don’t hesitate to share them.
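A hedged sketch of such a hypernym lookup with the wordnet package, where "@" is WordNet’s pointer symbol for hypernymy. The helper name and the restriction to nouns are my own simplifications; WordNet has to be found on your system as above.
hypernyms <- function(word, pos = "NOUN"){
  term_filter <- getTermFilter("ExactMatchFilter", word, TRUE)
  terms <- getIndexTerms(pos, 1, term_filter)
  if (length(terms) == 0) return(NA_character_)
  synsets <- getSynsets(terms[[1]])
  related <- tryCatch(getRelatedSynsets(synsets[[1]], "@"),   # "@" = hypernym pointer
                      error = function(e) list())
  if (length(related) == 0) return(NA_character_)
  unique(unlist(lapply(related, getWord)))
}
hypernyms("haiku")   # should lead towards the broader concept "poem"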
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] wordnet_0.1-11 koRpus_0.10-2 data.table_1.10.4
#> [4] SnowballC_0.5.1 stringr_1.2.0 dplyr_0.5.0
#> [7] purrr_0.2.2 readr_1.1.0 tidyr_0.6.1
#> [10] tibble_1.3.0 ggplot2_2.2.1 tidyverse_1.1.1
#> [13] kableExtra_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] reshape2_1.4.2 rJava_0.9-8 haven_1.0.0
#> [4] lattice_0.20-34 colorspace_1.3-2 htmltools_0.3.5
#> [7] tidytext_0.1.2.900 yaml_2.1.14 XML_3.98-1.5
#> [10] foreign_0.8-67 DBI_0.6-1 selectr_0.3-1
#> [13] modelr_0.1.0 readxl_1.0.0 plyr_1.8.4
#> [16] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0
#> [19] rvest_0.3.2 psych_1.6.12 evaluate_0.10
#> [22] knitr_1.15.1 forcats_0.2.0 parallel_3.3.2
#> [25] highr_0.6 tokenizers_0.1.4 broom_0.4.2
#> [28] Rcpp_0.12.10 scales_0.4.1 backports_1.0.5
#> [31] jsonlite_1.2 mnormt_1.5-5 hms_0.3
#> [34] digest_0.6.12 stringi_1.1.5 grid_3.3.2
#> [37] rprojroot_1.2 tools_3.3.2 magrittr_1.5
#> [40] lazyeval_0.2.0 janeaustenr_0.1.4 Matrix_1.2-8
#> [43] xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0
#> [46] rmarkdown_1.5 httr_1.2.1 R6_2.2.0
#> [49] nlme_3.1-131