In machine learning and natural language processing this is normally done with topic models. These statistical models try to discover the hidden semantic structure, represented by abstract topics, within a collection of documents. In this post I will try out Latent Dirichlet Allocation (LDA), the most common topic model currently in use, as well as an upgraded version of LDA: the Correlated Topic Model (CTM).
Both model types assume that:
- Every document is a mixture of topics. E.g., document 1 is 60% topic A, 30% topic B, and 10% topic C, while document 2 is 99% topic B and half a percent each of topics A and C. This is in contrast to many other clustering algorithms (e.g., k-means), which assign each object to one distinct group only.
- Every topic is a mixture of words. E.g., in topic A the words “data”, “machine”, and “algorithm” are the most common, while in topic C the most common words are “homework”, “grade”, and “task” - the word “solution” is equally likely in both topics. A small made-up sketch of both mixtures follows right after this list.
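To make these two assumptions more tangible, here is a minimal toy sketch (all probabilities are made up for illustration) of how such mixtures can be written down as tables - the same shape the gamma (document-topic) and beta (topic-word) matrices will take later on.
library(tibble)

# made-up document-topic probabilities (gamma); each document's rows sum to 1
doc_topics <- tribble(
  ~document, ~topic, ~gamma,
  "doc 1",   "A",    0.600,
  "doc 1",   "B",    0.300,
  "doc 1",   "C",    0.100,
  "doc 2",   "A",    0.005,
  "doc 2",   "B",    0.990,
  "doc 2",   "C",    0.005
)

# made-up topic-word probabilities (beta); "solution" is equally likely in A and C
topic_words <- tribble(
  ~topic, ~term,      ~beta,
  "A",    "data",     0.09,
  "A",    "machine",  0.08,
  "A",    "solution", 0.05,
  "C",    "homework", 0.10,
  "C",    "grade",    0.07,
  "C",    "solution", 0.05
)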
Although both model types are widely used and quite powerful, I have doubts whether they will be able to group Haiku by topic. The first problem is that Haiku are pretty short (the traditional form has only 17 syllables and the free form is even shorter). So there is not much information per document to work with but a lot of sparsity. Secondly, Haiku normally concentrate on the natural world - e.g., changing seasons, falling leaves, blossoming flowers and melting snow. Even when focusing on other themes like love or sadness, they normally use images from the natural world. In terms of finding different topics this might pose a serious problem, because most Haiku are quite similar in this regard.
Anyway, let's give it a try. All those how-tos and illustrative examples in which everything works perfectly, while any transfer to one's own problem is destined for frustration, are boring anyhow.
Preparation
First, we will load the central libraries for this post. In some code blocks I will use functions from other libraries, or I want to emphasize the package - in those cases I will call the respective function directly by package_name::function_name().
library(magrittr) # for the %$% operator
library(tidyverse) # R is better when it is tidy
library(tidytext) # for easy handling of text
library(ldatuning) # for mathematical hint of number of topics
library(topicmodels) # nomen est omen
Next, we need the cleaned and lemmatized Haiku data (haiku_clean and lemma_unique). You can load() it or use the code below for your convenience.
if (!exists("haiku_clean") || !exists("lemma_unique")){
  if (!file.exists("text_lemma.RData")){
    # download the prepared data if it is not available locally
    res <- tryCatch(download.file("http://bit.ly/haiku_lemma",
                                  "text_lemma.RData", mode = "wb"),
                    error = function(e) 1)
  }
  load("text_lemma.RData")
}
haiku_clean <- haiku_clean %>%
  mutate(h_number = row_number())      # number the Haiku to identify documents later

haiku_lemma <- haiku_clean %>%
  unnest_tokens(word, text) %>%        # one row per word
  select(h_number, word) %>%
  left_join(lemma_unique %>%
              select(word:lemma, synonym))   # attach the lemma of each word
Latent Dirichlet Allocation
To use the text data with the topicmodels package, we have to rearrange it into a document-term matrix.
dtm_lemma <- haiku_lemma %>%
  count(h_number, lemma) %>%       # lemma frequencies per Haiku
  filter(!is.na(lemma)) %>%        # drop words without a known lemma
  cast_dtm(h_number, lemma, n)
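As a quick sanity check (not strictly necessary for the analysis), printing the matrix reports its dimensions and sparsity, and the slam package - which tm pulls in anyway - gives us the row and column sums:
dtm_lemma                            # number of documents, number of terms, sparsity

summary(slam::row_sums(dtm_lemma))   # how many lemma tokens per Haiku
summary(slam::col_sums(dtm_lemma))   # how often each lemma occurs overall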
Number of Topics
Determining the number of topics is no simple task and none with an unambiguously right answer either. From my point of view the most important criterion for selecting a solution is that it is useful. In most use cases documents are grouped by topics so that humans can filter and select documents based on those topics. Accordingly, the topics must make sense to humans. This can only be achieved by a real human comparing different solutions and deciding which is the most sensible one. Further, different use cases might need different levels of granularity. While in some situations a few coarse topics suffice, other scenarios might call for a more fine-grained differentiation.

That said, of course algorithms have been proposed to relieve us humans from the time-consuming endeavor of skimming through a multitude of solutions to find the optimal number of topics. The ldatuning package implements four of them. Because I have no idea how many topics to expect and no requirements regarding granularity, I will try them out.
The topicmodels package allows quite a lot of fine-tuning. Although the defaults are sensible, we will tweak the settings a little for demonstration purposes. Concretely, we discard the first 2500 iterations with burnin, because they are most likely not good anyway. Further, we increase the number of iterations to 5000 and request 5 different starts, from which we keep only the run with the best result. To make everything reproducible, we set an individual seed for each start. For ease of use we put those hyper-parameters in a list.
control_list_gibbs <- list(
  burnin = 2500,    # discard the first 2500 iterations
  iter = 5000,      # number of Gibbs iterations
  seed = 0:4,       # one seed per start for reproducibility
  nstart = 5,       # number of independent runs
  best = TRUE       # keep only the best of the 5 runs
)
In addition, we set mc.cores to 4 so that the search for the optimal number of topics is performed on all four of my CPU cores.
system.time(
  topic_number_lemma <- FindTopicsNumber(
    dtm_lemma,
    topics = c(seq(from = 2, to = 9, by = 1), seq(10, 20, 2), seq(25, 50, 5)),
    metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
    method = "Gibbs",
    control = control_list_gibbs,
    mc.cores = 4L,
    verbose = TRUE
  )
)
#> fit models... done.
#> calculate metrics:
#> Griffiths2004... done.
#> CaoJuan2009... done.
#> Arun2010... done.
#> Deveaud2014... done.
#> user system elapsed
#> 3.71 0.17 848.88
To quickly get an understanding of the results, we take a look at their visualization.
FindTopicsNumber_plot(topic_number_lemma)
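The plot rescales all four metrics to a common range; Griffiths2004 and Deveaud2014 are to be maximized, CaoJuan2009 and Arun2010 to be minimized. If you prefer numbers over eyeballing, the object returned by FindTopicsNumber() is a plain data frame with one column per metric, so a rough shortlist can be computed directly. The averaging of ranks below is just my own ad-hoc heuristic, not something the package provides:
topic_number_lemma %>%
  transmute(topics,
            Griffiths2004 = rank(-Griffiths2004),   # higher raw value is better
            Deveaud2014   = rank(-Deveaud2014),     # higher raw value is better
            CaoJuan2009   = rank(CaoJuan2009),      # lower raw value is better
            Arun2010      = rank(Arun2010)) %>%     # lower raw value is better
  mutate(mean_rank = (Griffiths2004 + Deveaud2014 + CaoJuan2009 + Arun2010) / 4) %>%
  arrange(mean_rank)                                # most promising numbers of topics first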
To iterate over the different topic numbers we will use purrr::map().
para <- tibble(k = c(2, 3, 9, 12, 50))
system.time(
  lemma_tm <- para %>%
    mutate(lda = map(k,
                     function(k) LDA(
                       k = k,
                       x = dtm_lemma,
                       method = "Gibbs",
                       control = control_list_gibbs
                     )
    ))
)
#> user system elapsed
#> 337.60 0.02 338.87
Next, we extract the gamma matrices containing the topic probabilities per document.
lemma_tm <- lemma_tm %>%
  mutate(lda_gamma = map(.x = lda,
                         .f = tidytext::tidy,
                         matrix = "gamma"))
For each number of topics we plot the distribution of the maximum gamma per document, i.e., how strongly each Haiku is dominated by its single most likely topic. The dark red line marks 1/k, the value we would expect if a document were spread evenly over all k topics.
lemma_tm %>%
  unnest(lda_gamma) %>%
  group_by(k, document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%                   # keep only the most likely topic per document
  #top_n(1, gamma) %>%
  ungroup() %>%
  ggplot(aes(x = gamma, fill = factor(k))) +
  geom_histogram(bins = 20) +
  scale_fill_discrete(name = "Number of\nTopics") +
  xlab("maximum gamma per document") +
  facet_wrap(~k) +
  geom_vline(aes(xintercept = 1/k),
             tibble(k = lemma_tm %$% unique(k)),
             color = "darkred")
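The maximum gamma only tells us how decisively each Haiku is assigned to a topic, not what the topics are actually about. To judge the interpretability of a solution, one can extract the beta matrix (the per-topic word probabilities) in the same way and look at the most probable lemmas per topic. A sketch for a single, arbitrarily picked value of k:
lemma_tm %>%
  filter(k == 12) %>%                                     # pick one solution to inspect
  mutate(lda_beta = map(lda, tidytext::tidy, matrix = "beta")) %>%
  unnest(lda_beta) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%                                     # ten most probable lemmas per topic
  ungroup() %>%
  arrange(topic, desc(beta))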
Closing Remarks
Although my attempt to group Haiku into topics ultimately failed, I hope the example was illustrative and can help you to carry out your own topic model analyses. I wish you all the luck with more interpretable results - chances are certainly in your favor, because with their brevity and their limitation to images from the natural world, Haiku are a tough case for topic models. However, if your texts are similarly short, it might be a good idea to check out Biterm Topic Models (BTM). You can find a BTM implementation for Python 3 here.

If you have any questions or comments, please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 14393)
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gdtools_0.1.4 topicmodels_0.2-6 ldatuning_0.2.0
#> [4] tidytext_0.1.2 dplyr_0.5.0 purrr_0.2.2
#> [7] readr_1.1.0 tidyr_0.6.2 tibble_1.3.0
#> [10] ggplot2_2.2.1 tidyverse_1.1.1 magrittr_1.5
#> [13] kableExtra_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] modeltools_0.2-21 NLP_0.1-10 slam_0.1-40
#> [4] corrplot_0.77 reshape2_1.4.2 haven_1.0.0
#> [7] lattice_0.20-33 colorspace_1.3-2 htmltools_0.3.6
#> [10] SnowballC_0.5.1 stats4_3.3.1 yaml_2.1.14
#> [13] XML_3.98-1.7 foreign_0.8-68 DBI_0.6-1
#> [16] selectr_0.3-1 modelr_0.1.0 readxl_1.0.0
#> [19] plyr_1.8.4 stringr_1.2.0 munsell_0.4.3
#> [22] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
#> [25] psych_1.7.5 evaluate_0.10 labeling_0.3
#> [28] knitr_1.15.1 forcats_0.2.0 tm_0.7-1
#> [31] parallel_3.3.1 highr_0.6 broom_0.4.2
#> [34] tokenizers_0.1.4 Rcpp_0.12.10 scales_0.4.1
#> [37] backports_1.0.5 jsonlite_1.4 mnormt_1.5-5
#> [40] hms_0.3 digest_0.6.12 stringi_1.1.5
#> [43] grid_3.3.1 rprojroot_1.2 tools_3.3.1
#> [46] lazyeval_0.2.0 janeaustenr_0.1.4 Matrix_1.2-10
#> [49] xml2_1.1.1 lubridate_1.6.0 svglite_1.2.0
#> [52] assertthat_0.2.0 rmarkdown_1.5 httr_1.2.1
#> [55] R6_2.2.1 nlme_3.1-131
Hi Bernhard, thanks for this great post!
I was just wondering why you used 5,000 iterations in your analysis? If I got it right, Griffiths & Steyvers (2004, PNAS, https://www.pnas.org/content/101/suppl_1/5228) showed that Gibbs sampling stabilises around 250-300 iterations.
Given that LDA analysis can take a long time I was wondering if setting the number of iterations to, say, 300 is a good idea, in your opinion.
Hm.. I wasn't in a hurry, so it didn't matter too much to me. I guess I have used more iterations than necessary, but I am not sure that 300 would be enough. I would take a subsample of your data and test how stable the result is as iterations increase. Best, Bernhard
Thanks! I'll try that approach.