The alt-right movement seems to be on the rise, but is it really an alternative right or just plain old Nazis? To get more information on this topic I’ve decided to take a look at their social media footprint. Or, to be more precise, I want to look at the tweets of people who follow Identitaere_B (the official Twitter account of the Identitäre Bewegung Österreich, Austria’s version of the alt-right movement).
But I don’t want to read all these tweets just to find one really shocking tweet that I then exploit as pars pro toto. Mainly because I really, really, really don’t want to read their tweets, but also because those guys are really good at spinning such a find as just a bedauerlicher Einzelfall [regrettable isolated case]. Instead, we will do what I like best: some machine learning. That way the result will be replicable and reproducible, so that nobody can reasonably claim it is fake news.
So, here is the plan:
- get followers of Identitaere_B
- get timelines of the followers
- use the R package text2vec to build a GloVe model
- use the GloVe model to look up which words in the Identitären-Tweet-Corpus are most similar (the nearest neighbors) to the German words
  - Hitler
  - Faschismus [fascism]
  - Nationalsozialismus [National Socialism]
  - Juden [Jews]
  - Muslime [Muslims]
Preparations
We load the packages we will need.
library(tidyverse)
library(magrittr)
library(twitteR)
library(text2vec)
Get the Tweets
Wire up R to connect to Twitter
To retrieve tweets you need to a) have a Twitter account and b) register as a Twitter developer and create an app.
After filling out some basic information about your app (name and how you plan to use it) you can get the needed OAuth credentials from Keys and Access Tokens.
Assign your credentials accordingly:
consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"
And use them to connect to Twitter:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#> [1] "Using direct authentication"
Get the Followers
We start by getting the official user account:
Identitaere_B <- getUser("Identitaere_B")
Next, we can retrieve all users who follow Identitaere_B:
i_followers <- Identitaere_B$getFollowers()
So how many followers does Identitaere_B have:
length(i_followers)
#> [1] 7694
Given that Twitter is quite a niche social medium in Austria and Austria has a population of only about 8.8 million, that is more than I would have expected.
Unfortunately, we will not be able to get the timelines of all followers, because most likely some of them have protected accounts. Presumably, tweets from those accounts would be very interesting for my endeavor, but we can’t see their tweets unless we get manually approved by the owners of those accounts. Making up a fake profile and trying to get approved could solve this problem, but that is too much of a hassle for me (if you do it, please let me know what you found).
Consequently, we have to filter out the protected accounts to avoid getting an error when trying to access them. To do so we put the user vector into a tibble, create a column indicating whether an account is protected or not, and filter the data based on this column. Further, we add a column containing each user’s ID and a column that numbers the users consecutively.
i_follow_data <- tibble(i_followers) %>%
  mutate(id = map_chr(i_followers, function(x) x$getId()),
         protected = map_lgl(i_followers, function(x) x$protected)) %>%
  filter(!protected) %>%
  mutate(nr = row_number())
Let’s check how many users remained:
length(i_follow_data[["i_followers"]])
#> [1] 6996
Get the Timelines
To get a user’s timeline we can use the function userTimeline(). The problem is that if we want to get the timelines of all users, we have to call it repeatedly, and that can cause problems: e.g., we might exceed the rate limits or run into hiccups with our internet connection. Either would throw an error, and errors normally stop a running R process.
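As an aside, twitteR can tell you how much of the rate limit your credentials have already used up; a quick look (just a sketch, assuming you are authenticated as above) is:
## inspect the current rate-limit status of the authenticated account
rate_info <- getCurRateLimitInfo()
head(rate_info)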
Since I am in no hurry (and hopefully neither are you) we can circumvent those problems with a little wrapper. It adds a pause between requests and repeats a request a certain number of times if it fails. Because, depending on the pause, this function might run for quite a long time to process all users, we also add a print statement to track the progress. You can probably come up with a more elegant solution, but for the time being it should work.
userTimeline_slow <- function(x,
                              index = NULL,
                              index_report = 250,
                              n = 100,
                              sleep = 1,
                              maxRetry = 3){
  #repeat until timeline was downloaded or maxRetry exceeded
  tweet <- NULL
  attempt <- 1
  while (is.null(tweet) && attempt <= maxRetry){
    attempt <- attempt + 1
    #pause
    Sys.sleep(sleep)
    #report progress
    if (!is.null(index) && index %% index_report == 0){
      print(paste("processing user nr.", index))
    }
    #try to get timeline
    try(
      tweet <- userTimeline(x, n = n),
      silent = TRUE
    )
  }
  if (is.null(tweet)) tweet <- list("error")
  return(tweet)
}
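Before unleashing this on thousands of accounts it cannot hurt to smoke-test the wrapper on a single user; the chosen user and parameters below are arbitrary, this is just a sketch:
## quick sanity check on the first follower with a small tweet limit
test_tweets <- userTimeline_slow(i_followers[[1]], index = 1, n = 10, sleep = 1)
length(test_tweets)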
So let’s run this function for all users. We limit the number of tweets to 100 per user so that a few heavy users cannot skew the overall result.
Warning: This will run for hours! If you don’t need the most up-to-date results, then you can download mine and load them with readr::read_rds().
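In that case you can skip the retrieval pipeline below entirely; assuming the downloaded file is named by_user.rds and sits in your working directory, loading it is a one-liner:
## skip the hours-long retrieval and load the pre-collected tweets instead
by_user <- readr::read_rds("by_user.rds")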
userTimeline() returns status objects. Like the user object, a status object contains much more information than we need. Information per se is great, but if we want to save and load an R data frame that consists of complex objects, it becomes a problem: serializing (large) hierarchical R objects requires a lot of memory and time. To avoid that hassle we drop the user object and keep only its id. Likewise, we won’t keep the whole status object, but only its text. To do so we write a small wrapper …
extract_text <- function(status_list){
  unlist(status_list) %>% map_chr(function(x) x$text)
}
… which we can plug into our tweets retrieval pipeline:
system.time(
  by_user <- i_follow_data %>%
    ##retrieve tweets
    mutate(tweets = map2(i_followers, nr, userTimeline_slow, index_report = 100, n = 100, sleep = 5)) %>%
    ##extract text from tweets
    mutate(text = map(tweets, possibly(extract_text, NA_character_))) %>%
    ##remove large objects
    select(-i_followers, -tweets)
)
#> [1] "processing user nr. 100"
#> [1] "processing user nr. 200"
#> [1] "processing user nr. 300"
#> [1] "processing user nr. 400"
#> [1] "processing user nr. 500"
#> [1] "processing user nr. 600"
#> [1] "processing user nr. 700"
#> [1] "processing user nr. 800"
#> [1] "processing user nr. 900"
#> [1] "processing user nr. 1000"
#> [1] "processing user nr. 1100"
#> [1] "processing user nr. 1200"
#> [1] "processing user nr. 1300"
#> [1] "processing user nr. 1400"
#> [1] "processing user nr. 1500"
#> [1] "processing user nr. 1600"
#> [1] "processing user nr. 1700"
#> [1] "processing user nr. 1800"
#> [1] "processing user nr. 1900"
#> [1] "processing user nr. 2000"
#> [1] "processing user nr. 2100"
#> [1] "processing user nr. 2200"
#> [1] "processing user nr. 2300"
#> [1] "processing user nr. 2400"
#> [1] "processing user nr. 2500"
#> [1] "processing user nr. 2600"
#> [1] "processing user nr. 2700"
#> [1] "processing user nr. 2800"
#> [1] "processing user nr. 2900"
#> [1] "processing user nr. 3000"
#> [1] "processing user nr. 3100"
#> [1] "processing user nr. 3200"
#> [1] "processing user nr. 3300"
#> [1] "processing user nr. 3400"
#> [1] "processing user nr. 3500"
#> [1] "processing user nr. 3600"
#> [1] "processing user nr. 3700"
#> [1] "processing user nr. 3800"
#> [1] "processing user nr. 3900"
#> [1] "processing user nr. 4000"
#> [1] "processing user nr. 4100"
#> [1] "processing user nr. 4200"
#> [1] "processing user nr. 4300"
#> [1] "processing user nr. 4400"
#> [1] "processing user nr. 4500"
#> [1] "processing user nr. 4600"
#> [1] "processing user nr. 4700"
#> [1] "processing user nr. 4800"
#> [1] "processing user nr. 4900"
#> [1] "processing user nr. 5000"
#> [1] "processing user nr. 5100"
#> [1] "processing user nr. 5200"
#> [1] "processing user nr. 5300"
#> [1] "processing user nr. 5400"
#> [1] "processing user nr. 5500"
#> [1] "processing user nr. 5600"
#> [1] "processing user nr. 5700"
#> [1] "processing user nr. 5800"
#> [1] "processing user nr. 5900"
#> [1] "processing user nr. 6000"
#> [1] "processing user nr. 6100"
#> [1] "processing user nr. 6200"
#> [1] "processing user nr. 6300"
#> [1] "processing user nr. 6400"
#> [1] "processing user nr. 6500"
#> [1] "processing user nr. 6600"
#> [1] "processing user nr. 6700"
#> [1] "processing user nr. 6800"
#> [1] "processing user nr. 6900"
#> user system elapsed
#> 3051.78 19.52 40538.52
Let’s check the size of by_user:
object.size(by_user)
#> 42460000 bytes
Well, although not tiny it is small enough. Let’s save it to disc:
write_rds(by_user, path="by_user.rds")
Analyze the Tweets
Get to know the data a little
At the moment each row in the data frame represents a user and the tweets cell contains not a single value but a list. Let’s convert this data structure into something a little more accessible.
tweets <- unnest(by_user, text)
Now the data is in a long format and looks like this. Note: the following code returns an HTML table, which the plain R console can’t display properly - but R Markdown can.
tweets %>%
  sample_n(20) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| id | protected | nr | text |
|---|---|---|---|
| 3779853317 | FALSE | 2131 | https://t.co/tNwbMO9nF7 |
| 771808488913825792 | FALSE | 4227 | Travel to Africa Best African Travel Company https://t.co/28CIeZUOPj |
| 816777136111493120 | FALSE | 3301 | #phnotizen US-Grenze-5) Am Ende wird die Nationalgarde an der Grenze stehen. Mit Schiessbefehl. |
| 4872861754 | FALSE | 5697 | “The Great Disorientation” https://t.co/J3clPJ0NVw |
| 2778549199 | FALSE | 5126 | Niin kauhia pumppi, että kädet meinaa räjähtää. Hiilaritankkaus tekee välillä hyvää 🤓💪 |
| 4228230554 | FALSE | 3490 | Einzellfälle?? https://t.co/rqZ0GIaUpl |
| 3613555641 | FALSE | 2059 | I support more (proper) referendums, including popularly initiated (and binding) ones. https://t.co/BXBw2pk1JM |
| 127316382 | FALSE | 3988 | Never Forget, #ZionismIsTerrorism: https://t.co/zcChjcuasF |
| 3254930815 | FALSE | 5123 | For another time. |
| 826485879334137856 | FALSE | 2389 | #WiegehtsDeutschland Ich brauche diese TV-Shows mit Politikern nicht, die vernebeln nur Massen von Menschen, lenken einfach ab von……..! |
| 3846087802 | FALSE | 1033 | Alle #TV_Anstalten helfen an der Gelddruckmaschine im #Fußballgeschäft weil sie selbst davon provitieren. https://t.co/3v3vEy3Dpa |
| 1846072760 | FALSE | 3789 | na super https://t.co/VcJwE4qmpr https://t.co/siZgM7Nw5D |
| 3364647988 | FALSE | 4843 | @ainyrockstar Respekt, meine Hochachtung für Ihren Schneid auch zu den “Schmuddelkindern” der Republik zu gehen… https://t.co/yJZCMS5eiA |
| 487488283 | FALSE | 3966 | Freedland talks about an ideology that killed 10s of millions?? He’s talking about Jewish Bolshevik Communism, make https://t.co/0bXe9PMKcu |
| 1068706021 | FALSE | 6603 | genau! https://t.co/DfhZPyHUiU |
| 1278184561 | FALSE | 6468 | @Miquwarchar mich hat der schon vor drei Wochen blockiert nachdem er auf Argumente nicht rausgeben konnte ! |
| 872001924534587392 | FALSE | 1890 | Für wen stimmen eigentlich Parteilose? Oder würden die aus Prinzip niemanden wählen? |
| 543294811 | FALSE | 3795 | @Junge_Freiheit Dabei wird IMMER vergessen, dass es um die Opfer geht, die betrauert werden sollen. Das sind in der https://t.co/M7W6naQLRS |
| 1389881041 | FALSE | 4483 | So geht man,gegen Terroristen vor. https://t.co/fKJDVstVVb |
| 387001916 | FALSE | 1552 | @business As a Finn I know this man will not even try in any case. |
Next, we remove all users who have not tweeted anything.
tweets <- tweets %>%
filter(!is.na(text))
That leaves us with …
tweets %$% unique(id) %>% length()
#> [1] 6257
… users.
Reading the tweet examples above we might realize that not all tweets are in German, but German tweets are the only ones we are interested in. To extract them we first have to classify the language of each tweet. We can do this with the help of the textcat package. Note: This will take some time. If you have a multicore processor (which you most likely have), then you can speed things up by running the classification in parallel, e.g. with multidplyr (a rough alternative sketch follows after the next code block).
tweets <- tweets %>%
  mutate(
    lang = textcat::textcat(text)
  )
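If the sequential run above takes too long on your machine, here is a rough sketch of one way to parallelise the language detection with the base parallel package instead of multidplyr; the number of cores and the chunking are assumptions you will want to adapt:
## sketch: split the texts into one chunk per worker and classify them in parallel
library(parallel)
n_cores <- 4                                    # adjust to your machine
cl <- makeCluster(n_cores)
chunks <- split(tweets$text,
                cut(seq_along(tweets$text), n_cores, labels = FALSE))
lang_chunks <- parLapply(cl, chunks, textcat::textcat)
stopCluster(cl)
tweets$lang <- unlist(lang_chunks, use.names = FALSE)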
Let’s take a look at what languages we have found and how often: Note: We transform the y-axis to log10 to see the variation among the less frequent languages.
tweets %>%
  group_by(lang) %>%
  count() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(language = factor(lang),
         language = forcats::fct_reorder(language, n, .desc = TRUE)) %>%
  ggplot(aes(x = language, y = n, fill = language)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(trans = "log10") +
  guides(fill = FALSE) +
  ggthemes::theme_tufte() +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 0.25))
For a nationalist movement its followers are quite international.
Only …
tweets %>%
  group_by(lang) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  filter(lang == "german") %$%
  freq * 100
#> [1] 59.57656
… percent of all collected tweets are German. That is less than I hoped for, but still a sufficient amount.
Next, we remove all non-German tweets:
tweets_de <- tweets %>%
  filter(lang == "german")
This leaves us with …
tweets_de %$% unique(id) %>% length()
#> [1] 5537
… users.
Let’s take a look at their tweet count distribution (remember: we set 100 as maximum per user):
tweets_de %>%
  group_by(id) %>%
  count() %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 25) +
  ggthemes::theme_tufte()
Well, that looks like an organic tweet distribution to me. Many tweet a little and few tweet a lot. Frequent tweeters will have a stronger impact on the model because they provide more data. However, the differences are not so large that the results would be unusably skewed.
Preprocess Tweets
The text2vec package needs the text data in a certain format, so we start by converting our data.
We begin by extracting the tokens. Choosing the right form of tokens is no easy feat. In many cases just extracting the words is a viable solution, but for tweets I am not so sure. Given the size limit of tweets, many people use non-letter characters (e.g., emojis) to convey their message with fewer characters. Those would be removed together with all other punctuation if we used words as tokens. To avoid that we put everything surrounded by whitespace into a token, but this leaves us with the problem that all punctuation characters are kept as well. To mitigate that we remove single punctuation characters that trail a run of letters (= words). Further, we lowercase everything to reduce the number of variants of each word.
Note: A further preprocessing step we could use to reduce the variability of words is to identify and lemmatize the words in text. However, in this post we will skip this step. If you want to learn more about lemmatization, then please take a look at my post Cleaning Words with R: Stemming, Lemmatization & Replacing with More Common Synonym
tweets_de <- tweets_de %>%
  #base::tolower does not like emojis --> use stringi
  mutate(clean = stringi::stri_trans_tolower(text),
         clean = stringr::str_replace_all(clean, "([[:alpha:]]+)[[:punct:]]\\s", "\\1 "))
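To make the effect of the cleaning step a little more concrete, here is a toy illustration on a made-up tweet (hypothetical input, not taken from the data):
## toy example: lowercase, then drop punctuation that trails a word
example <- stringi::stri_trans_tolower("Tolle Aktion! Mehr davon, bitte 🇦🇹")
stringr::str_replace_all(example, "([[:alpha:]]+)[[:punct:]]\\s", "\\1 ")
## punctuation trailing a word is removed, the emoji survives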
Next, we tokenize …
tokens <- space_tokenizer(tweets_de[["clean"]])
… and create an iterator over the tokens …
it = itoken(tokens, progressbar = FALSE)
… to create a vocabulary of all tokens (FYI: the text2vec_vocabulary object is actually a data.frame):
vocab <- create_vocabulary(it,
                           stopwords = tokenizers::stopwords("de"))
To calculate a meaningful word vector for a token, the token should not be too rare. Here we remove all tokens with a frequency below five.
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
Let’s take a look at a small sample of the vocabulary:
vocab %>%
  sample_n(20) %>%
  arrange(term_count) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| term | term_count | doc_count |
|---|---|---|
| | 5 | 5 |
| könnensist | 5 | 5 |
| versprechungen | 6 | 6 |
| läuft. | 7 | 7 |
| @z_eisberg | 7 | 7 |
| verschleiern | 7 | 7 |
| other | 8 | 8 |
| @marcboe | 8 | 8 |
| menschenverachtende | 10 | 10 |
| stock | 10 | 10 |
| cm | 10 | 10 |
| doe) | 11 | 11 |
| fragenswarum | 12 | 12 |
| zünden | 15 | 15 |
| ball | 19 | 19 |
| gipfel | 30 | 30 |
| verweigert | 31 | 30 |
| digitale | 32 | 32 |
| #aufschrei | 46 | 46 |
| damals | 197 | 195 |
Let’s check whether the words we are interested in exist in the vocabulary. First we create a tibble with those words …
interest <- tibble(term = c("hitler", "faschismus",
                            "nationalsozialismus",
                            "juden",
                            "muslime"),
                   term_english = c("hitler", "fascism",
                                    "National Socialism",
                                    "jews",
                                    "moslems"))
… and then use semi_join to filter the vocabulary.
vocab %>%
  semi_join(interest) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
#> Joining, by = "term"
| term | term_count | doc_count |
|---|---|---|
| hitler | 108 | 107 |
| faschismus | 52 | 52 |
| nationalsozialismus | 21 | 21 |
| juden | 145 | 139 |
| muslime | 329 | 328 |
OK, those frequencies should suffice. Let’s build the model!
Building a GloVe Model
We start by constructing a term-co-occurrence matrix (TCM). But first we declare how to transform our list of tokens into vector space …
vectorizer <- vocab_vectorizer(vocab)
… before we create the TCM:
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2^3)
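Just to get a feeling for the object we created: the TCM is a sparse, square matrix with one row and one column per vocabulary term, so its dimensions should match the size of the pruned vocabulary. A quick sanity check (sketch) is:
## the TCM should be |vocabulary| x |vocabulary|
dim(tcm)
nrow(vocab)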
Now we can fit the GloVe word-embedding model:
glove <- GlobalVectors$new(word_vectors_size = 2^3, vocabulary = vocab, x_max = 2^4)
word_vectors <- glove$fit_transform(tcm, n_iter = 2^5)
#> INFO [2017-09-15 10:55:12] 2017-09-15 10:55:12 - epoch 1, expected cost 0.0445
#> INFO [2017-09-15 10:55:13] 2017-09-15 10:55:13 - epoch 2, expected cost 0.0349
#> INFO [2017-09-15 10:55:13] 2017-09-15 10:55:13 - epoch 3, expected cost 0.0313
#> INFO [2017-09-15 10:55:14] 2017-09-15 10:55:14 - epoch 4, expected cost 0.0293
#> INFO [2017-09-15 10:55:14] 2017-09-15 10:55:14 - epoch 5, expected cost 0.0278
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 6, expected cost 0.0267
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 7, expected cost 0.0259
#> INFO [2017-09-15 10:55:15] 2017-09-15 10:55:15 - epoch 8, expected cost 0.0252
#> INFO [2017-09-15 10:55:16] 2017-09-15 10:55:16 - epoch 9, expected cost 0.0246
#> INFO [2017-09-15 10:55:16] 2017-09-15 10:55:16 - epoch 10, expected cost 0.0241
#> INFO [2017-09-15 10:55:17] 2017-09-15 10:55:17 - epoch 11, expected cost 0.0236
#> INFO [2017-09-15 10:55:17] 2017-09-15 10:55:17 - epoch 12, expected cost 0.0233
#> INFO [2017-09-15 10:55:18] 2017-09-15 10:55:18 - epoch 13, expected cost 0.0229
#> INFO [2017-09-15 10:55:18] 2017-09-15 10:55:18 - epoch 14, expected cost 0.0227
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 15, expected cost 0.0224
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 16, expected cost 0.0222
#> INFO [2017-09-15 10:55:19] 2017-09-15 10:55:19 - epoch 17, expected cost 0.0220
#> INFO [2017-09-15 10:55:20] 2017-09-15 10:55:20 - epoch 18, expected cost 0.0218
#> INFO [2017-09-15 10:55:20] 2017-09-15 10:55:20 - epoch 19, expected cost 0.0216
#> INFO [2017-09-15 10:55:21] 2017-09-15 10:55:21 - epoch 20, expected cost 0.0215
#> INFO [2017-09-15 10:55:21] 2017-09-15 10:55:21 - epoch 21, expected cost 0.0213
#> INFO [2017-09-15 10:55:22] 2017-09-15 10:55:22 - epoch 22, expected cost 0.0212
#> INFO [2017-09-15 10:55:22] 2017-09-15 10:55:22 - epoch 23, expected cost 0.0211
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 24, expected cost 0.0210
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 25, expected cost 0.0209
#> INFO [2017-09-15 10:55:23] 2017-09-15 10:55:23 - epoch 26, expected cost 0.0208
#> INFO [2017-09-15 10:55:24] 2017-09-15 10:55:24 - epoch 27, expected cost 0.0207
#> INFO [2017-09-15 10:55:24] 2017-09-15 10:55:24 - epoch 28, expected cost 0.0206
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 29, expected cost 0.0205
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 30, expected cost 0.0205
#> INFO [2017-09-15 10:55:25] 2017-09-15 10:55:25 - epoch 31, expected cost 0.0204
#> INFO [2017-09-15 10:55:26] 2017-09-15 10:55:26 - epoch 32, expected cost 0.0203
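A small optional tweak: text2vec also learns a second set of context vectors during the fit. Assuming the version used here exposes them as glove$components (an assumption on my part, so double-check), the package vignette suggests that adding main and context vectors can give slightly better embeddings:
## optional: combine main and context vectors into the final embeddings
## (assumes glove$components holds the context vectors)
word_vectors <- word_vectors + t(glove$components)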
Use the GloVe Model
Now we can use the word vectors to find their nearest neighbors in the text corpus of Identitären follower tweets:
result <- interest %>%
  mutate(
    #create similarity matrix
    cos_sim = map(term,
                  function(x) sim2(word_vectors, y = word_vectors[x, , drop = FALSE], method = "cosine", norm = "l2")),
    #convert matrix to tibble
    cos_sim_tbl = map(cos_sim, function(x) tibble(term = row.names(x), distance = x[, 1]) %>%
                        #sort distance
                        arrange(desc(distance))),
    #find neighbors
    n1 = map_chr(cos_sim_tbl, function(x) x[[2, 1]]),
    n2 = map_chr(cos_sim_tbl, function(x) x[[3, 1]]),
    n3 = map_chr(cos_sim_tbl, function(x) x[[4, 1]]),
    n4 = map_chr(cos_sim_tbl, function(x) x[[5, 1]]),
    n5 = map_chr(cos_sim_tbl, function(x) x[[6, 1]])
  )
#> Warning: package 'bindrcpp' was built under R version 3.3.3
And print them as a table:
result %>%
  select(-starts_with("cos")) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| term | term_english | n1 | n2 | n3 | n4 | n5 |
|---|---|---|---|---|---|---|
| hitler | hitler | kultur | werte | slowakei | leitkultur | männer |
| faschismus | fascism | fan | schatten | lassen! | werdensda | nichtsman |
| nationalsozialismus | National Socialism | sollt | reichs | dienst | galerieshttps://t.co/2bfblhn0tp | niederlage |
| juden | jews | 10% | gesucht | sicherheit | gefallenen | mitglieder |
| muslime | moslems | moslems | terror | gewalt | aktuelles | 2002 |
Note: The GloVe model is not fitted deterministically, hence your results might differ slightly. Unfortunately, set.seed() alone does not seem to help.
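One thing that might be worth trying, although I have not verified it, is to force the fit onto a single thread and fix the seed before training, since the non-determinism most likely stems from the multi-threaded updates (glove_repro and word_vectors_repro are just illustrative names):
## untested sketch: single worker thread plus fixed seed before fitting
RcppParallel::setThreadOptions(numThreads = 1)
set.seed(42)
glove_repro <- GlobalVectors$new(word_vectors_size = 2^3, vocabulary = vocab, x_max = 2^4)
word_vectors_repro <- glove_repro$fit_transform(tcm, n_iter = 2^5)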
For those of you who do not speak German we translate the neighbors to English with the Oxford Dictionaries API. To be able to use the API you have to register yourself and create an app. After confirming the terms of use you will get an application id and an application key, which we need to authenticate our queries to the API.
Next, we write a little wrapper function to get the English translation of a German word from the API …
de2en <- function(word, app_id, app_key){
  require(httr)
  require(jsonlite)
  ##create query
  url <- paste0("https://od-api.oxforddictionaries.com:443/api/v1/entries/de/",
                word,
                "/translations=en")
  ##query
  res <- GET(url,
             add_headers(app_key = app_key,
                         app_id = app_id))
  ##if no result return word
  if (res$status_code != 200){
    return(word)
  }
  ##dig into the resulting json for
  ##the information we need
  results <- fromJSON(content(res, "text"))$results
  entries <- results$lexicalEntries[[1]][,1][[1]]
  firstTranslation <- entries$senses[[1]]$translations[1][[1]]$text
  ##return result
  ##return only first word
  if (length(firstTranslation) >= 1){
    return(firstTranslation[1])
  } else {
    ##sometimes res$status is 200
    ##and there is still no result
    ##--> return original
    return(word)
  }
}
and apply it to all columns containing neighboring words (replace your_app_key and your_app_id with the respective values):
result_en <- result %>%
  select(-starts_with("cos")) %>%
  mutate_at(vars(starts_with("n")),
            ##do not translate words
            ##containing non-alphabetical
            ##characters
            function(x) ifelse(stringr::str_detect(x, "^[[:alpha:]]+$"),
                               map_chr(x, de2en, your_app_id, your_app_key),
                               x))
Let’s take a look at the result:
result_en %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| term | term_english | n1 | n2 | n3 | n4 | n5 |
|---|---|---|---|---|---|---|
| hitler | hitler | culture | werte | Slovakia | primary culture | männer |
| faschismus | fascism | fan | shadow | lassen! | werdensda | nichtsman |
| nationalsozialismus | National Socialism | sollt | reichs | work | galerieshttps://t.co/2bfblhn0tp | defeat |
| juden | jews | 10% | gesucht | safety | gefallenen | mitglieder |
| muslime | moslems | moslems | terrorism | power | aktuelles | 2002 |
Well, that worked … somehow, but not all words were translated. Among those, werte is maybe of interest, which translates to merits. Further, other translations of dienst, automatically rendered as work, are service and duty. Lastly, I would not translate Gewalt to power but to violence.
All in all the results are quite stunning to me. Leitkultur [primary culture] showing up as a neighbor of Hitler, and fan as a neighbor of Faschismus, does not support their claim of having nothing to do with Nazis. On the bright side, the neighbors of juden do not hint at antisemitism, but their view of Muslims as violent terrorists seems clear.
Closing Remarks
I hope the given example was illustrative and can help you to carry out your own GloVe model analyses.
Concerning the content there are a few things to keep in mind:
- The sample was not huge, so the resulting model will not be overly stable.
- We used tweets from all people following Identitaere_B - some of those probably do not agree with alt-right ideology.
- Many followers had protected accounts - their tweets were not included in the analysis although they are probably even more interesting.
Anyway, the results seem quite plausible to me, but maybe that is just confirmation bias. I am interested in your thoughts on the technique and the topic.
If you would like to learn more about getting data from Twitter, then I can recommend Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. The book focuses on data mining with Python, but the concepts are very well explained and can easily be translated to other platforms (such as R).
If something is not working as outlined here, please check the package versions you are using. The system I used was:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2 text2vec_0.5.0 twitteR_1.1.9 magrittr_1.5
#> [5] dplyr_0.7.2 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
#> [9] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 kableExtra_0.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] reshape2_1.4.2 haven_1.1.0 lattice_0.20-34
#> [4] colorspace_1.3-2 htmltools_0.3.5 yaml_2.1.14
#> [7] rlang_0.1.1 foreign_0.8-67 glue_1.1.1
#> [10] DBI_0.7 bit64_0.9-7 lambda.r_1.1.9
#> [13] modelr_0.1.1 readxl_1.0.0 foreach_1.4.3
#> [16] bindr_0.1 plyr_1.8.4 stringr_1.2.0
#> [19] futile.logger_1.4.3 munsell_0.4.3 gtable_0.2.0
#> [22] cellranger_1.1.0 rvest_0.3.2 codetools_0.2-15
#> [25] psych_1.6.12 evaluate_0.10 knitr_1.16
#> [28] forcats_0.2.0 parallel_3.3.2 highr_0.6
#> [31] broom_0.4.2 Rcpp_0.12.12 backports_1.1.0
#> [34] scales_0.4.1 RcppParallel_4.3.20 jsonlite_1.2
#> [37] bit_1.1-12 mnormt_1.5-5 rjson_0.2.15
#> [40] hms_0.3 digest_0.6.12 stringi_1.1.5
#> [43] grid_3.3.2 rprojroot_1.2 tools_3.3.2
#> [46] lazyeval_0.2.0 futile.options_1.0.0 pkgconfig_2.0.1
#> [49] Matrix_1.2-8 data.table_1.10.4 xml2_1.1.1
#> [52] lubridate_1.6.0 iterators_1.0.8 assertthat_0.2.0
#> [55] rmarkdown_1.6 httr_1.2.1 R6_2.2.0
#> [58] nlme_3.1-131