Motivation
I love the internet - all this information only a fingertip away. Unfortunately, most of it comes as unstructured text; ready-made tabular data, as needed for most analytic purposes, is a rare exception. For example, I enjoy reading Haiku and DailyHaiku hooks me up with my daily dosage. They even have a huge archive of all previously published Haiku. What interests me is what percentage of these Haiku are in the traditional form: 5-7-5 syllables.
Note: Haiku is a Japanese poem form, and Japanese syllables are shorter than their English counterparts. Hence, many believe 17 English syllables are too verbose and aim for something around 11 syllables instead. These shorter Haiku are called free form Haiku.
Counting them one by one would take ages and wouldn't be very reproducible, so this is a good job for web scraping, i.e., the automated extraction of information from a website. This could be done in various environments and programming languages. For this example I am using R, because it allows for a wide range of subsequent NLP analyses without having to implement them myself (Python would be a good alternative as well).
Preparations
Get the needed R packages:

library(tidyverse) # because I love tidy data and %>%
library(rvest) # for the actual webscraping
library(lubridate) # for date handling
library(stringr) # for working with strings
start_page <- "http://www.dailyhaiku.org/haiku/"
Step by Step
Get the CSS-selector/XPath
From the webpage we only want to extract the Haiku and the link to the previous archive page, so we need some way to identify these elements. The browser add-on SelectorGadget comes in handy for this task. By pointing and clicking, it extracts the CSS-selector and the XPath of the selected object(s), which represent ways to address elements within an HTML document. Let's start with the Haiku blocks.
Figure 1: Selection of the Haiku blocks
We see in Figure 1 that the first Haiku block, the one I clicked on, is highlighted in green. The others are highlighted in yellow, which means they will be selected as well. If we didn't want that, we could click on one of them to remove the others from the selection. In the present case, however, the default behavior is exactly what we want. At the bottom we see the CSS-selector for the Haiku blocks (if we prefer XPath, we can change the address format). Let's store it for later use.

css_haiku <- c(".haiku-archive-list")
Figure 2: Selection of the link to previous Haiku
This time the first click selected several links (e.g., to the authors), so I had to deselect them, as indicated by the red highlight (see Fig. 2). Let's store the resulting CSS-selector.

css_link <- c(".prev-next a")
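If you prefer XPath, rvest accepts that as well via the xpath argument of html_nodes(). The expression below is my own formulation based on the class name in the stored CSS-selector, not the exact string SelectorGadget would produce:

# equivalent XPath-based addressing (illustrative only)
xpath_haiku <- "//div[contains(@class, 'haiku-archive-list')]"
# later on, html_nodes(page, xpath = xpath_haiku) would address the same elements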
Extract the Information
Now that we have the CSS-selectors, we can try to extract the corresponding information from the webpage. Let's start by downloading the first page of the archive.

Extract Haiku Information
first_page <- read_html(start_page)
blocks <- first_page %>% html_nodes(css_haiku)
html_nodes() returns all elements that match the given CSS-selector.
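As a quick sanity check, we can count how many blocks were matched; this should equal the number of Haiku listed on the archive page (one per day).

# number of matched Haiku blocks on the first archive page
length(blocks)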
# peek into the first block
blocks[[1]]
#> {xml_node}
#> <div class="haiku-archive-list">
#> [1] <h3 class="date"><a href="http://www.dailyhaiku.org/haiku/2016-march ...
#> [2] <p class="haiku">balmy breeze<br>\r\nswarming bees circle<br>\r\nthe ...
#> [3] <p class="author"><a href="/contributors/Cycle-20-October-2015-to-Ma ...
Let's create functions that extract this information from a block, beginning with the date.
extract_haiku_date <- function(block) {
  # test whether OS is Windows
  windows <- Sys.info()["sysname"] == "Windows"
  # test whether you have an English locale-setting
  loc <- sessionInfo()[["locale"]]
  is_english <- str_detect(loc, "LC_TIME=English")
  # extract date with english locale
  if (is_english) {
    date <- block %>% html_node(".date") %>% xml_text() %>% mdy()
  } else if (windows) {
    date <- block %>% html_node(".date") %>% xml_text() %>% mdy(locale = "English_United States.1252")
  } else {
    date <- block %>% html_node(".date") %>% xml_text() %>% mdy(locale = "en_US.utf8")
  }
  return(date)
}
extract_haiku_text <- function(block) {
  block %>% html_node(".haiku") %>% xml_text()
}

extract_haiku_author <- function(block) {
  block %>% html_node(".author") %>% xml_text()
}

extract_haiku <- function(block) {
  list(date = extract_haiku_date(block),
       text = extract_haiku_text(block),
       author = extract_haiku_author(block))
}
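Before mapping over all blocks, a quick check on the first block shows whether the pieces fit together; based on the node printed above, this should return the 26 March 2016 entry ("balmy breeze ...").

# sanity check on the first block
extract_haiku(blocks[[1]])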
block_info <- blocks %>% map(extract_haiku) %>% map_df(~as.list(.))
knitr::kable(block_info, format = "html", booktabs = TRUE)
date | text | author |
---|---|---|
2016-03-26 | balmy breeze swarming bees circle the river bank | Polona Oblak |
2016-03-25 | whitecaps on the bay the overhead cries of migrating birds | Polona Oblak |
2016-03-24 | deep within the lettuce a slug shaped hole | Polona Oblak |
2016-03-23 | siskins’ trill sunlight floods the bare larch | Polona Oblak |
2016-03-22 | drifting fog an acquaintance i’d rather not meet | Polona Oblak |
2016-03-21 | ferry harbour customs area swifts dart through the heat | Polona Oblak |
2016-03-20 | frost lingers i clean brussels sprouts in the kitchen sink | Polona Oblak |
That looks good! If you look at block_info in an unformatted table, you might notice several "\r\n" sequences in the text: those are line breaks. Depending on the task at hand we might need to remove them.
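Removing them would be a one-liner with stringr - a minimal sketch that replaces each line break with a space (not applied below):

# replace the "\r\n" line breaks with spaces
block_info %>% mutate(text = str_replace_all(text, "[\r\n]+", " "))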
For the time being I just leave them untouched.

Extract Link Information
links <- first_page %>% html_nodes(css_link)
# get Link labeled 'Older Haiku'
older_nr <- which(str_detect(links, "Older"))
older_link <- links[older_nr] %>% html_attr("href")
print(older_link)
#> [1] "http://www.dailyhaiku.org/haiku/?pg=2"
OK, now we know where to continue our Haiku collection. I hope the step-by-step approach is clear by now, so we can move on to getting all the Haiku in the archive.

Get all Haiku in the Archive
We build a little wrapper to parse all pages. Go get yourself a coffee after you start this chunk of code - it will take a while. By reducing or removing the Sys.sleep() you can speed things up, but on the downside this might get you banned. Furthermore, IMHO it is common courtesy not to hog the server all for yourself. Alternatively, you can download the result here. After downloading the file into your working directory, just type load("haiku.RData") in your R console.

# run only if haiku data is not already present
if (!file.exists("haiku.RData")) {
  page_url <- start_page
  haiku_list <- list()
  counter <- 0
  while (page_url != "stop") {
    counter <- counter + 1
    page <- read_html(page_url)
    haiku_list[[counter]] <- page %>% html_nodes(css_haiku) %>% map(extract_haiku) %>%
      map_df(~as.list(.))
    # little break between page calls to avoid getting banned
    Sys.sleep(5)
    # get the next page's url
    links <- page %>% html_nodes(css_link)
    older_nr <- which(str_detect(links, "Older"))
    if (length(older_nr) == 0) {
      page_url <- "stop"
    } else {
      page_url <- links[older_nr] %>% html_attr("href")
    }
  }
  # combine list
  haiku <- bind_rows(haiku_list)
  # save for later use
  save(haiku, file = "haiku.RData")
} else load("haiku.RData")
To check how many Haiku we got overall, we can look at the number of rows in the resulting data-frame.
n <- nrow(haiku)
n
#> [1] 3633
Wow - so there were 3633 Haiku in the archive. That is more than I was expecting. But how many of them are in the traditional style? To classify the Haiku we need to count the number of syllables: 17 means traditional, fewer means free form. For the syllable counting - a non-trivial task - we make use of the syllable package.

haiku <- haiku %>% mutate(syllables = syllable::compute_syllable_counts(text))
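To get a rough feeling for whether these counts are plausible, one could cross-check them with a crude heuristic that counts vowel groups per word. This is a sanity-check sketch of my own, not the method used above, and it will miss many edge cases of English syllabification:

# crude heuristic: one syllable per vowel group, at least one per word, ignoring a trailing 'e'
count_syllables_rough <- function(text) {
  words <- str_split(str_to_lower(text), "\\s+")[[1]]
  words <- str_replace_all(words, "[^a-z]", "")
  words <- words[words != ""]
  groups <- str_count(str_replace(words, "e$", ""), "[aeiouy]+")
  sum(pmax(groups, 1L))
}
# compare the two counts for a few Haiku (syllables_rough is my own column name)
haiku %>% mutate(syllables_rough = map_dbl(text, count_syllables_rough)) %>%
  select(text, syllables, syllables_rough) %>% head()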
haiku %>% ggplot(aes(x = syllables)) + geom_histogram(color = "black", fill = "white")
This result was somewhat unexpected. Some entries are too short (I would consider 2 + 3 + 2 = 7 syllables the lower bound even for the free form) and some are too long (as mentioned earlier, even the traditional form has only 17 syllables). So what happened?
For the short ones, multiple explanations are possible:
1. something went wrong in the web scraping
2. errors on the website
3. the entries are not Haiku
4. errors in the syllable counting
5. digits and special characters were used as words (and were not recognized for syllable counting)
6. something else
Best we take a quick look at the shorter entries.
haiku %>% filter(syllables < 6) %>% knitr::kable(format = "html", booktabs = TRUE)
date | text | author | syllables |
---|---|---|---|
2015-05-23 | t h e puck !@# $%& *+“? drops | LeRoy Gorman | 3 |
2015-05-22 | snowinterubric | LeRoy Gorman | 5 |
2013-06-12 | goinggoinggoinggon e | Alan S. Bridges | 4 |
2013-04-12 | starving the darkness | LeRoy Gorman | 5 |
2012-03-17 | full bloom, full stop! | Rafal Zabratynski | 4 |
2012-01-31 | ki 3… 2… 1… 2012! ss | Rafal Zabratynski | 1 |
2009-04-08 | pond frogs moon bats | George Swede | 4 |
2008-12-30 | the smell of snow | Carol Pearce-Worthington | 4 |
2007-02-04 | Just Sitting Around | Bryak Webster | 5 |
Well, I'm anything but an expert and happy to be convinced otherwise, but these entries don't seem like Haiku to me.
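To probe explanation 5 from the list above, one can flag which of the short entries contain digits or special characters - a small diagnostic sketch (the column names are my own):

# which short entries contain digits or special characters the syllable counter may ignore?
haiku %>% filter(syllables < 6) %>%
  mutate(has_digits = str_detect(text, "[0-9]"),
         has_specials = str_detect(text, "[!@#$%&*+?]")) %>%
  select(date, author, has_digits, has_specials)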
The potential sources of error for the too long entries are similar. Best to check them anyway.
haiku %>% filter(syllables >= 20) %>% head() %>% knitr::kable(format = "html",
booktabs = TRUE)
date | text | author | syllables |
---|---|---|---|
2016-03-03 | another glass of champagne remembering what i choose to remember | Sondra J. Byrnes | 20 |
2014-11-15 | ice in the driveway she slips into something more comfortable | kjmunro | 20 |
2014-01-08 | altocumulus undulating above the town the chime of church bells | J. Zimmerman | 20 |
2013-06-11 | shuttle-bus-driver our daily conversation about the lottery numbers | Alan S. Bridges | 20 |
2011-12-21 | Dear Malvina,It’s been a long time since we It’s already autumn here . . . lonely evening | Rafal Zabratynski | 27 |
2009-08-12 | gallery opening the couple in the lobby kissing artistically | Megan Arkenberg | 20 |
Well, for my taste they are too long. For the given task I will exclude both the extremely short and those with more than 17 syllables.
Note: This is real-world data and therefore messy and irregular. Some kind of clean-up is necessary in most scenarios. In this case I simply remove the irregularities, but in other scenarios a different approach might be preferable. For example, if the too short Haiku were only too short because digits were used instead of written-out numbers, then replacing those digits with proper words might be a better choice.
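Such a replacement could look like the following sketch (only single digits are handled, multi-digit numbers become concatenated words, and the result is not used below):

# spell out digits so the syllable counter can handle them (illustrative only)
digit_words <- c("0" = "zero", "1" = "one", "2" = "two", "3" = "three", "4" = "four",
                 "5" = "five", "6" = "six", "7" = "seven", "8" = "eight", "9" = "nine")
haiku_spelled <- haiku %>% mutate(text = str_replace_all(text, digit_words))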
haiku_clean <- haiku %>% filter(syllables >= 7) %>% filter(syllables <= 17)
haiku_clean <- haiku_clean %>% mutate(form = ifelse(syllables == 17, "traditional",
"free"))
haiku_clean %>% group_by(form) %>% summarise(n = n()) %>% mutate(percentage = n/sum(n) *
100)
#> # A tibble: 2 × 3
#> form n percentage
#> <chr> <int> <dbl>
#> 1 free 3041 94.294574
#> 2 traditional 184 5.705426
Only 5.7% of the Haiku on DailyHaiku comply with the traditional form. Obviously, the contributors appreciate the English free form Haiku more.

Closing Remarks
Of course this was only a first peek into web scraping, but I hope it helps you start building your own solution. If you want to dive deeper into web scraping:
- Come back again later. I will surely do some more in the future.
- I can wholeheartedly recommend the book Automated Data Collection with R.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] stringr_1.2.0 lubridate_1.6.0 rvest_0.3.2 xml2_1.1.1
#> [5] dplyr_0.5.0 purrr_0.2.2 readr_1.1.0 tidyr_0.6.1
#> [9] tibble_1.2 ggplot2_2.2.1 tidyverse_1.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.9 formatR_1.4 highr_0.6 plyr_1.8.4
#> [5] forcats_0.2.0 tools_3.3.2 digest_0.6.12 jsonlite_1.2
#> [9] evaluate_0.10 nlme_3.1-131 gtable_0.2.0 lattice_0.20-34
#> [13] psych_1.6.12 DBI_0.5-1 yaml_2.1.14 parallel_3.3.2
#> [17] haven_1.0.0 httr_1.2.1 knitr_1.15.1 hms_0.3
#> [21] rprojroot_1.2 grid_3.3.2 R6_2.2.0 readxl_0.1.1
#> [25] foreign_0.8-67 rmarkdown_1.4 modelr_0.1.0 reshape2_1.4.2
#> [29] magrittr_1.5 codetools_0.2-15 backports_1.0.5 scales_0.4.1
#> [33] htmltools_0.3.5 assertthat_0.1 mnormt_1.5-5 colorspace_1.3-2
#> [37] stringi_1.1.2 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Comments

Comment: That was a great deal of information. On the same note, is it possible to extract information based on a selected author from the dropdown (without having to extract all archived data and then subset it by author)?

Reply: Sure - you just have to adjust the URL. Make your selection, look at the new address, and change your start URL accordingly.

Comment: That was a quick reply. Actually, I got stuck there: changing the selection does not change the URL - the URL stays the same but the data changes. In addition, the selection field is not within a form that would have allowed the use of rvest::html_form(). I think the solution lies within the html_session() context, but there is no guide on that.

Reply: Hm, that's strange. When I select the first author in the list, the URL changes to http://www.dailyhaiku.org/?s=haiku&c=Aaron-Marko. The parameter c=Name-Surname seems to be what you are looking for.
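Based on the URL pattern in the reply above, an author-specific scrape could look like the following hypothetical sketch; the author name is taken from the example URL, and it assumes the author page uses the same .haiku-archive-list layout as the archive (not verified here):

# hypothetical: scrape one author's Haiku via the c=Name-Surname parameter
author <- "Aaron-Marko"
author_url <- paste0("http://www.dailyhaiku.org/?s=haiku&c=", author)
read_html(author_url) %>%
  html_nodes(css_haiku) %>%  # assumes the same block layout as the archive pages
  map(extract_haiku) %>%
  map_df(~as.list(.))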
Comment: Sorry, I realized I was trying to apply the functions to the URL "https://tradingeconomics.com/" instead of the URL you applied, though with correct parameters.