As a pars pro toto I will use gradient boosted trees (GBT), which are indeed quite powerful and often come up as a first-choice algorithm when I talk to colleagues and friends. There are multiple GBT implementations available for R; today I will rely on the gbm package. To make model formulation, training, tuning, and evaluation as easy as possible I will use the toolbox wrappers offered by the mlr package.
To assess how well a powerful algorithm without much human consideration performs, I first need a task to try it out on and benchmarks to compare it against. Both can be found at www.kaggle.com. Kaggle is a platform for predictive modelling and analytics competitions. Enterprises post predictive modelling problems and multiple teams around the world try to solve them as best they can. The contenders do it for fun, fame, the often non-negligible prize money, and job offers; the enterprises do it for cheap, high-quality solutions to their problems.
For the job at hand we will not take up the challenge of a real prize-money competition, but just try the introductory competition Titanic: Machine Learning from Disaster. This has the advantage that it will not close in the near future, so you can try it out on your own. Further, most people new to Kaggle will try it out, so there are lots of other solutions which we can use as a benchmark.
In Titanic: Machine Learning from Disaster we have to infer from data like age, gender, and cabin number who has survived and who has not - apart from the human tragedy this is a standard classification task.
Preparations
First, we will load the central libraries for this post. In some code blocks we will use functions from other libraries, or I want to emphasize the package - in those cases we will call the respective function directly via package_name::function_name().

library(tidyverse) # R is better when it is tidy
library(magrittr)  # I often use %$%
library(mlr)       # topic today
Note: If you are wondering why I am not loading the gbm package - mlr does this on its own.

Next, we need data. You can download the needed data from https://www.kaggle.com/c/titanic/data, but you have to be a registered Kaggle user (registration is super fast and easy). If registration is a no-go for you, but you still want to try out the example on your own, then you can use the data provided in the titanic package.
We load the data into our environment.
train <- readr::read_csv("train.csv")
test <- readr::read_csv("test.csv")
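If you went the titanic package route instead, something along the following lines should give you comparable objects - a sketch; the package ships the data as titanic_train and titanic_test with, to my knowledge, the same columns as the Kaggle files.

# install.packages("titanic")
train <- titanic::titanic_train %>% as_tibble()
test  <- titanic::titanic_test %>% as_tibble()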
Now that we have the data, the wise thing to do is to explore it - What are our variables? How are they scaled? Taking a look at summary statistics and uni- and bivariate distributions. Do we need to consider Heywood cases? Perform some simple analytics. Check for missing values. … - in general: we would try to get an understanding of our data before we try anything more sophisticated.

But for this demonstration we will not do that today! Instead we just pipe the data into the analysis.
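Just for reference, such a first look might start with something like the following sketch (we deliberately skip it in this post):

glimpse(train)          # variable types and a preview of the values
summary(train)          # univariate summaries
colSums(is.na(train))   # missing values per variable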
Hardly any Consideration
We will forgo any data cleaning and feature engineering, but some data wrangling is still required for gbm to accept the data. In the training data we convert the response variable (Survived) and all string variables to factors.

train <- train %>%
  mutate_each(funs(factor), Survived, Pclass, Name, Sex, Ticket, Cabin, Embarked)
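A side note: mutate_each() has since been deprecated in dplyr; with dplyr >= 1.0 the same conversion could be written with across() - a small sketch:

train <- train %>%
  mutate(across(c(Survived, Pclass, Name, Sex, Ticket, Cabin, Embarked), factor))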
Next, we define our task. We tell mlr that we want to classify the variable Survived in the train data set. From the task we drop the variables PassengerId and Name. Both have a unique value for each case, hence we cannot expect them to have any predictive value.

Note: I have used a wrapper for the mlr::makeClassifTask function to allow piping (the %>% operator), which requires the first argument of a function to be the piped data.

pipe_mCT <- function(data, ...){
  makeClassifTask(data = data, ...)
}

task <- train %>%
  pipe_mCT(target = "Survived", positive = "1")

task <- dropFeatures(task, c("PassengerId", "Name"))
Let’s take a quick peek at the data …

train
#> # A tibble: 891 x 12
#> PassengerId Survived Pclass Name Sex Age
#> <int> <fctr> <fctr> <fctr> <fctr> <dbl>
#> 1 1 0 3 Braund, Mr. Owen Harris male 22
#> 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38
#> 3 3 1 3 Heikkinen, Miss. Laina female 26
#> 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35
#> 5 5 0 3 Allen, Mr. William Henry male 35
#> 6 6 0 3 Moran, Mr. James male NA
#> 7 7 0 1 McCarthy, Mr. Timothy J male 54
#> 8 8 0 3 Palsson, Master. Gosta Leonard male 2
#> 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27
#> 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14
#> # ... with 881 more rows, and 6 more variables: SibSp <int>, Parch <int>, Ticket <fctr>,
#> # Fare <dbl>, Cabin <fctr>, Embarked <fctr>
… and at the task.

task
#> Supervised task: data
#> Type: classif
#> Target: Survived
#> Observations: 891
#> Features:
#> numerics factors ordered
#> 4 5 0
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 2
#> 0 1
#> 549 342
#> Positive class: 1
Next, we define our learner, i.e. we tell mlr that we want to classify using gbm. As prediction type we choose response, requesting a clear distinction of survived vs. not survived, in contrast to prob, which returns the probabilities of the individual classes.

lrn <- makeLearner("classif.gbm", predict.type = "response")
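If you preferred class probabilities instead (e.g., to pick your own cut-off later), the learner would simply be defined with the other prediction type - a sketch, not used further below:

lrn_prob <- makeLearner("classif.gbm", predict.type = "prob")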
Taking a look at the par.set of our learner, we see all its hyperparameters, their ranges, and whether they are tunable.

lrn$par.set
#> Type len Def Constr Req Tunable Trafo
#> distribution discrete - - bernoulli,adaboost,huberized,multinomial - TRUE -
#> n.trees integer - 100 1 to Inf - TRUE -
#> cv.folds integer - 0 -Inf to Inf - TRUE -
#> interaction.depth integer - 1 1 to Inf - TRUE -
#> n.minobsinnode integer - 10 1 to Inf - TRUE -
#> shrinkage numeric - 0.001 0 to Inf - TRUE -
#> bag.fraction numeric - 0.5 0 to 1 - TRUE -
#> train.fraction numeric - 1 0 to 1 - TRUE -
#> keep.data logical - TRUE - - FALSE -
#> verbose logical - FALSE - - FALSE -
Obviously, gradient boosted trees in the gbm package have quite a lot of hyperparameters. We will make use of mlr::makeParamSet() to define a sensible hyperparameter space.

ps = makeParamSet(
makeIntegerParam("n.trees", lower = 100, upper = 1000),
makeIntegerParam("interaction.depth", lower = 1, upper = 10),
makeIntegerParam("n.minobsinnode", lower = 2, upper = 20),
makeNumericParam("shrinkage", lower = -4, upper = -1,
trafo = function(x) 10^x),
makeNumericParam("bag.fraction", lower = 0.2, upper = 0.8)
)
Given that there are 5 dimensions, a standard grid search even with a low resolution of 10 steps per dimension would need 10^5 = 100,000 runs and would therefore take ages. Hence, we need an alternative strategy for hyperparameter optimization. To find a good configuration in acceptable time we use the model-based / Bayesian optimization provided by the mlrMBO package.

ctrl = makeTuneControlMBO()
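If you would rather not pull in mlrMBO, a plain random search over the same parameter space is a simple fallback - a sketch, where the budget of 50 iterations is an arbitrary choice on my part and the object is not used further below:

ctrl_random = makeTuneControlRandom(maxit = 50L)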
Now that we have defined the hyperparameter space and how we will maneuver through it, we need a strategy to evaluate which combination works well. In this case we will use 10-fold cross-validation with stratification of the target variable.

CV10 = makeResampleDesc("CV", iters = 10, stratify = TRUE)
To speed things up it is a good idea to activate parallel processing, telling mlr that it should use more than one core (I normally use all but one core, because I sometimes work on an i5 and I do not want to render it unusable during the processing. If you are working on an i7 or do not mind your PC being fully maxed out, then you better use all cores). To activate parallel processing, uncomment the following chunk of code. The downside is that (at the moment) seeds will not work on Windows systems. Hence, if you follow along on your own PC with activated parallel processing, then your results might differ slightly.

## (parallel::detectCores(logical = F) - 1) %>%
##   parallelMap::parallelStartSocket()
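If you do activate it, remember to release the workers again once the tuning has finished:

## parallelMap::parallelStop()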
Finally, we put everything together.

set.seed(0, "L'Ecuyer-CMRG")
res = tuneParams(lrn, task, CV10,
par.set = ps, control = ctrl,
measures = list(acc, setAggregation(acc, test.sd)),
show.info = FALSE)
Let’s take a look at the result and the best selection of hyperparameters.

res
#> Tune result:
#> Op. pars: n.trees=982; interaction.depth=9; n.minobsinnode=17; shrinkage=0.000121; bag.fraction=0.418
#> acc.test.mean=-0.616
Well, that does not look too good. Let's check the ratio of the majority class for comparison.

train %$%
table(Survived) %>%
max() %>%
(function(x) x/dim(train)[1])
#> [1] 0.6161616
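An equivalent and perhaps slightly more readable way to get the same number - just a sketch using dplyr's count():

train %>%
  count(Survived) %>%
  mutate(prop = n / sum(n))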
As expected - there is no difference in accuracy between our model and just predicting the most common category for every case. We do not need to submit predictions based on this model to Kaggle, but better move on.

A little Human Consideration
Just feeding everything to the algorithm and hoping for the best did not work. Let's try to think a little about the data (and I really only mean a little) and make some minor adjustments:

- make cabin useful - The cabin values have the structure <letter><digits>. I assume the letter represents the floor or deck and the digits were given in series to the cabins on that floor, so we will split them into two variables. Note: If we were trying more than a little, then it would be a good idea to look this up - maybe this would allow a more sensible clustering. Further, we could assume that people with the same ticket have cabins near to one another. If cabin data for one person is missing (those are missing quite a lot), then imputing it with the cabin data of another person with the same ticket might be a good guess.
- calculate family size - We engineer one feature representing the number of family members on board by summing SibSp and Parch (plus one for the passenger themselves). Note: With more effort we could be more clever about it and leverage other variables like Name and Ticket, too.
- trim family name - We trim the values of Name to the family names of the passengers. In contrast to the full names those are not unique and hence probably more informative.
## a little feature engineering
train_temp <- train %>%
##1
mutate(Cabin = stringr::str_replace(Cabin, "\\s.*", "")) %>%
separate(Cabin, into=c("cabin_chr", "cabin_nr"), sep=1) %>%
mutate(cabin_chr = factor(cabin_chr),
cabin_nr = as.numeric(cabin_nr)) %>%
##2
mutate(family_size = SibSp + Parch +1) %>%
##3
mutate(Name = stringr::str_replace(Name,"([[:alpha:]]+),.*", "\\1")) %>%
##remove Survived (= response)
##remove PassengerId (no prediction value)
select(-c(Survived, PassengerId))
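A quick sanity check that the wrangling produced what we intended - not part of the original flow, just for illustration:

train_temp %>%
  select(cabin_chr, cabin_nr, family_size, Name) %>%
  glimpse()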
Further, we will use vtreat to switch the dense factorial data into sparse one-hot encoding. Although gbm can deal with dense factor data, other packages like xgboost cannot, so I normally do this. Further and more importantly, the use of vtreat guarantees that the same data structure (e.g., factor levels) applies to new data by storing the encoding schema. The setting minFraction = 0.002 excludes factor categories with very few entries (in this case, those with only one) from the one-hot encoding. Above and beyond that, vtreat has several more cool features - so just check it out.

vars <- setdiff(names(train_temp), "Survived")
train_plan <- vtreat::designTreatmentsZ(train_temp, vars,
minFraction = 0.002,
verbose = F)
train_better <- bind_cols(train %>% select(Survived),
vtreat::prepare(train_plan, train_temp))
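If you want to see which one-hot variables vtreat actually derived, the treatment plan stores that information in its scoreFrame - a quick look:

head(train_plan$scoreFrame)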
With the amended data we create a new task.

task_better <- train_better %>%
pipe_mCT(target = "Survived", positive ="1")
#> Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, : Provided
#> data is not a pure data.frame but from class tbl_df, hence it will be converted.
For hyperparameter tuning we use the settings from before.

set.seed(0, "L'Ecuyer-CMRG")
res_better = tuneParams(lrn, task_better, CV10,
par.set = ps, control = ctrl,
measures = list(acc, setAggregation(acc, test.sd)),
show.info = FALSE)
Let’s take a look at the results.

res_better
#> Tune result:
#> Op. pars: n.trees=778; interaction.depth=10; n.minobsinnode=7; shrinkage=0.00272; bag.fraction=0.637
#> acc.test.mean=-0.835
This looks more promising. Let’s use the hyperparameters to define an optimized learner …

lrn_optim = setHyperPars(makeLearner("classif.gbm", predict.type = "response"),
par.vals = res_better$x)
… and train it.

model_optim = train(lrn_optim, task_better)
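Before moving on to the test data, a cross-validated estimate of the tuned learner's performance gives a more honest picture than the tuning result alone - a sketch reusing the resampling description from above:

resample(lrn_optim, task_better, CV10, measures = acc, show.info = FALSE)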
Predictions with new Data
To apply the optimized model to new data (e.g., the test data) it needs to be in the same format as the training data. We start by repeating all mutate() calls on the test data.

test_temp <- test %>%
mutate_each(funs(factor), Pclass, Name, Sex, Ticket, Cabin, Embarked) %>%
mutate(Cabin = stringr::str_replace(Cabin, "\\s.*", "")) %>%
separate(Cabin, into=c("cabin_chr", "cabin_nr"), sep=1) %>%
mutate(cabin_chr = factor(cabin_chr),
cabin_nr = as.numeric(cabin_nr)) %>%
mutate(family_size = SibSp + Parch +1) %>%
mutate(Name = stringr::str_replace(Name,"([[:alpha:]]+),.*", "\\1"))
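Since these are exactly the steps applied to the training data, one could also wrap them in a small helper to avoid the duplication - a sketch; prep_features is a made-up name and not part of the original code:

## shared feature-engineering steps for train and test data
prep_features <- function(data) {
  data %>%
    mutate(Cabin = stringr::str_replace(Cabin, "\\s.*", "")) %>%
    separate(Cabin, into = c("cabin_chr", "cabin_nr"), sep = 1) %>%
    mutate(cabin_chr = factor(cabin_chr),
           cabin_nr = as.numeric(cabin_nr),
           family_size = SibSp + Parch + 1,
           Name = stringr::str_replace(Name, "([[:alpha:]]+),.*", "\\1"))
}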
Next, we change all factors to one-hot encoding. vtreat takes care that the factor structures match, even when the new data has new or missing categories.

test_optim <- vtreat::prepare(train_plan, test_temp)
Finally, we make our predictions for the test data and prepare them in the format required for submission to Kaggle (2 columns: 1st = PassengerId, 2nd = Survived).

result <- bind_cols(
test %>% select(PassengerId),
predict(model_optim, newdata = test_optim)$data
) %>%
rename(Survived = response)
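A quick check that the result has the expected shape before writing it out (the Kaggle test set has 418 rows):

dim(result)     # expect 418 rows and 2 columns
head(result)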
The submission needs to be in a comma-separated file, so we save it accordingly.

readr::write_csv(result, "titanic_submission.csv")
Submitting it gives us:
rank 2991 out of 6935 teams
Closing Remarks
I believe the given example demonstrated quite nicely that just relying on algorithms and black-box machines is often not enough when dealing with real-world data problems.

Further, I hope you enjoyed this hands-on example of the mlr toolbox. Finally R has something like scikit-learn for Python, and I am really happy about it.
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I have used was:
sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 14393)
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252 LC_MONETARY=German_Austria.1252
#> [4] LC_NUMERIC=C LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mlr_2.11 ParamHelpers_1.10 magrittr_1.5 dplyr_0.5.0 purrr_0.2.2.2
#> [6] readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1
#> [11] kableExtra_0.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] gbm_2.1.3 reshape2_1.4.2 splines_3.3.1 haven_1.0.0 lattice_0.20-33
#> [6] colorspace_1.3-2 viridisLite_0.2.0 htmltools_0.3.6 yaml_2.1.14 plotly_4.7.0
#> [11] survival_2.41-3 rlang_0.1.1 foreign_0.8-68 DBI_0.6-1 plot3D_1.1
#> [16] RColorBrewer_1.1-2 lhs_0.14 mco_1.0-15.1 modelr_0.1.0 readxl_1.0.0
#> [21] plyr_1.8.4 stringr_1.2.0 munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0
#> [26] rvest_0.3.2 htmlwidgets_0.8 psych_1.7.5 evaluate_0.10 misc3d_0.8-4
#> [31] knitr_1.16 forcats_0.2.0 parallelMap_1.3 parallel_3.3.1 broom_0.4.2
#> [36] Rcpp_0.12.11 scales_0.4.1 backports_1.1.0 checkmate_1.8.2 jsonlite_1.4
#> [41] vtreat_0.5.31 mnormt_1.5-5 hms_0.3 digest_0.6.12 stringi_1.1.5
#> [46] BBmisc_1.11 RJSONIO_1.3-0 grid_3.3.1 rprojroot_1.2 tools_3.3.1
#> [51] lazyeval_0.2.0 mlrMBO_1.1.0 Matrix_1.2-10 data.table_1.10.4 xml2_1.1.1
#> [56] smoof_1.5 lubridate_1.6.0 assertthat_0.2.0 rmarkdown_1.5 httr_1.2.1
#> [61] R6_2.2.1 nlme_3.1-131