Hi folks! In this post I will continue the work from a previous post, starting with my rationale for using k-nearest neighbors. If you are only interested in a how-to of this method, you can skip the next few paragraphs.
OK, so you are here for the whole story - excellent. In my last post Finding Vienna’s House Price Regions. Regionalization with R and ClustGeo we created spatially constrained clusters of single-family and duplex houses sold in Vienna, Austria in 2016, based on their price per square meter. But for 20% of the homes we did not assign any cluster, because we put those aside as a test set. The test set will come in handy when we want to evaluate how well the models work on data that is new to them. To use the cluster membership as a feature/predictor in any of those future models, we first have to come up with a way to allocate new cases to those clusters.
If we had not used clustering based on soft spatial constraints, which allow a little overlap between clusters, but hard constraints with clear-cut borders, then we could have described the clusters as polygons and simply checked into which polygon a new case falls. In the absence of clear cluster boundaries, however, we have to come up with another way. Still, the clusters are quite spatially coherent with not too much overlap, so some kind of assignment based on proximity to other cluster members seems reasonable.
For example, we could calculate the distances between a new case and all homes that already have a cluster membership and use those to look up the shortest distances. Then we can assign to the new case the same membership as the nearest known home, or choose the majority membership among the k nearest known homes (maybe weighted by their distance). In a nutshell, this is what k-nearest neighbors (k-NN) classification does. Well, and it is what we will try to accomplish in today’s post.
Preparations
Before the fun begins some preparations are needed.
Packages
First, we will load the packages we will work with.
Load Basic Data-Wrangling Packages
As you probably already know I am a huge tidyverse fan-boy, so …
library(tidyverse)
library(magrittr)
Data
Next, we need data to work with.
Get the Data
We will use the result from my last post - if you missed that one, you can either download the data here and load it with readr::read_csv(), or use the code below for your convenience.
if (!exists("dat_raw")) {
  if (!file.exists("data_result.csv")) {
    res <- tryCatch(
      download.file(
        "https://dl.dropboxusercontent.com/s/4jq1vjbxyt8dy7t/data_result.csv?dl=0",
        "data_result.csv", mode = "wb"),
      error = function(e) 1)
  }
  dat_raw <- read_csv("data_result.csv")
}
Let’s take a quick look at the data to see whether loading and parsing were successful.
dat_raw
#> # A tibble: 6,749 x 48
#> KGCode Katastralgemeinde EZ PLZ Straße ON
#> <chr> <chr> <int> <int> <chr> <int>
#> 1 01205 Hietzing 290 1130 hietzinger hauptstrasse 31
#> 2 01512 Unterdöbling 355 1190 fuerfanggasse 5
#> 3 01512 Unterdöbling 33 1190 nusswaldgasse 7
#> 4 01502 Grinzing 1471 1190 schreiberweg 31
#> 5 01503 Heiligenstadt 928 1190 amalgergasse 7
#> 6 01511 Salmannsdorf 71 1190 salmannsdorfer strasse 26
#> 7 01215 Unter St.Veit 409 1130 larochegasse 33
#> 8 01209 Ober St.Veit 2462 1130 am meisenbuehel 5
#> 9 01502 Grinzing 1120 1190 hocheneggasse 5
#> 10 01514 Währing 2342 1180 hasenauerstrasse 63
#> # ... with 6,739 more rows, and 42 more variables: Gst <chr>, GstFl <int>,
#> # ErwArt <chr>, Widmung <chr>, Bauklasse <chr>, Gebäudehöhe <dbl>,
#> # Bauweise <chr>, Zusatz <chr>, Schutzzone <chr>, Wohnzone <chr>,
#> # öZ <chr>, Bausperre <chr>, seitbis <chr>, zuordnung <chr>,
#> # Geschoße <int>, parz <chr>, Erwerbsdatum <chr>, Anteile <chr>,
#> # Zähler <int>, Nenner <int>, BJ <int>, TZ <chr>, Kaufpreis <dbl>,
#> # mGfl <dbl>, Baureifgest <chr>, percentWidmung <int>, Baurecht <chr>,
#> # Bis <chr>, aufEZ <int>, Stammeinlage <chr>, sonstwid <chr>,
#> # sonstwidprz <chr>, Bauzins <chr>, address <chr>, one_more_time <lgl>,
#> # lon <dbl>, lat <dbl>, price_m2 <dbl>, price_log <dbl>, train <int>,
#> # c_fine <int>, c_coarse <int>
Seems fine.
We will add an ID to identify the individual cases.
dat_raw <- dat_raw %>%
mutate(id = row_number())
Select and Split the Data
Today we will only use longitude (lon) and latitude (lat) for k-NN, so we can reduce our data set a little. In addition to the coordinates we will need the cluster membership. For now we will use the coarser assignment in c_coarse and leave c_fine for individual training. Lastly, we will split the data into two sets: one with cluster assignments and one without.
dat_select <- dat_raw %>%
select(id, lon, lat, c_coarse)
dat_train <- dat_select %>%
filter(!is.na(c_coarse))
dat_new <- dat_select %>%
filter(is.na(c_coarse))
k-NN
From Scratch
k-NN (done naively, without optimizations to speed things up) is quite simple. To understand it a little better, we will implement it ourselves.
Calculate Distances
First, we have to calculate the distances between the new cases without class membership and the cases with class membership. For simplicity we will use Euclidean distance on the raw coordinates, although this is not entirely correct (one degree of longitude does not necessarily cover the same distance as one degree of latitude). There are several R packages that could do this for us (e.g., pdist), but we want to do it from scratch.
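To get a feeling for how much this simplification distorts things, here is a quick back-of-the-envelope check (48.2° is roughly Vienna’s latitude; 111.32 km is the approximate length of one degree along a meridian):

```r
# One degree of latitude is roughly constant everywhere ...
km_per_deg_lat <- 111.32
# ... but one degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon <- 111.32 * cos(48.2 * pi / 180)

round(km_per_deg_lat, 1)  # 111.3
round(km_per_deg_lon, 1)  # about 74.2
```

So around Vienna an east-west degree covers only about two thirds of a north-south degree. For a strictly local neighbor lookup this mainly stretches the neighborhoods a bit, but it is worth keeping in mind before reusing the code elsewhere.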
We start by expanding the data, creating all combinations of dat_train and dat_new:
dat_expand <- dat_train %>%
crossing(dat_new)
Let’s take a quick glance:
dat_expand
#> # A tibble: 7,300,788 x 8
#> id lon lat c_coarse id1 lon1 lat1 c_coarse1
#> <int> <dbl> <dbl> <int> <int> <dbl> <dbl> <int>
#> 1 1 16.29557 48.18678 1 5 16.38846 48.20699 NA
#> 2 1 16.29557 48.18678 1 14 16.34483 48.25785 NA
#> 3 1 16.29557 48.18678 1 16 16.34175 48.26051 NA
#> 4 1 16.29557 48.18678 1 26 16.34530 48.24970 NA
#> 5 1 16.29557 48.18678 1 28 16.29597 48.18167 NA
#> 6 1 16.29557 48.18678 1 29 16.26796 48.15112 NA
#> 7 1 16.29557 48.18678 1 39 16.26407 48.18614 NA
#> 8 1 16.29557 48.18678 1 40 16.26320 48.13987 NA
#> 9 1 16.29557 48.18678 1 60 16.30946 48.24223 NA
#> 10 1 16.29557 48.18678 1 61 16.33922 48.23340 NA
#> # ... with 7,300,778 more rows
Next, we calculate the Euclidean distances:
dat_dist <- dat_expand %>%
mutate(dist = sqrt((lat-lat1)^2 + (lon-lon1)^2))
Find k-NN
Let’s choose k = 5 and keep the 5 nearest homes with known cluster membership for each case without cluster membership. By passing -5 as the n argument of top_n() we get the 5 smallest values.
Notice: %T>% print() prints the tibble without affecting the remaining pipeline. For more information check out the marvellous magrittr package.
dat_5nn <- dat_dist %>%
group_by(id1) %>%
top_n(-5, dist) %>%
arrange(id1) %T>%
print()
#> # A tibble: 6,765 x 9
#> # Groups: id1 [1,353]
#> id lon lat c_coarse id1 lon1 lat1 c_coarse1
#> <int> <dbl> <dbl> <int> <int> <dbl> <dbl> <int>
#> 1 1223 16.38580 48.20733 3 5 16.38846 48.20699 NA
#> 2 2039 16.38585 48.20376 3 5 16.38846 48.20699 NA
#> 3 2603 16.38925 48.20900 3 5 16.38846 48.20699 NA
#> 4 5705 16.39098 48.20675 3 5 16.38846 48.20699 NA
#> 5 6564 16.38459 48.20334 3 5 16.38846 48.20699 NA
#> 6 62 16.34506 48.25786 2 14 16.34483 48.25785 NA
#> 7 146 16.34387 48.25823 2 14 16.34483 48.25785 NA
#> 8 230 16.34576 48.25751 2 14 16.34483 48.25785 NA
#> 9 420 16.34444 48.25850 2 14 16.34483 48.25785 NA
#> 10 551 16.34512 48.25760 2 14 16.34483 48.25785 NA
#> # ... with 6,755 more rows, and 1 more variables: dist <dbl>
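The negative-n behaviour of top_n() is easy to verify on a toy tibble (note that it keeps the input order rather than sorting):

```r
library(dplyr)

toy <- tibble(x = c(5, 2, 9, 1, 7))

# n = -3 keeps the rows with the 3 *smallest* values of x
toy %>% top_n(-3, x)
#> keeps the rows with x = 5, 2, 1 (input order preserved)
```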
Classify
Now we are ready for the actual classification. However, we will decide the class membership not by a simple majority vote, but by weighting the individual votes by distance. Let’s calculate the weight as 1/distance. Further, we will remove some variables that are no longer needed.
dat_weights <- dat_5nn %>%
select(-c(lon, lat, id)) %>%
mutate(dist_w = 1/dist)
To count the votes easily, we add c_coarse as an additional group_by() layer, sum the weights per cluster, and then choose the maximum. Let’s vote:
dat_classified <- dat_weights %>%
group_by(id1, c_coarse) %>%
summarize(votes = sum(dist_w)) %>%
top_n(1, votes) %T>%
print()
#> # A tibble: 1,353 x 3
#> # Groups: id1 [1,353]
#> id1 c_coarse votes
#> <int> <int> <dbl>
#> 1 5 3 1659.149
#> 2 14 2 10228.327
#> 3 16 2 3984.920
#> 4 26 2 3460.397
#> 5 28 1 3211.209
#> 6 29 1 2142.197
#> 7 39 1 3384.799
#> 8 40 1 3777.423
#> 9 60 2 7262.791
#> 10 61 2 3218.572
#> # ... with 1,343 more rows
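To see why the distance weighting matters, here is a tiny made-up example where class 2 holds the majority among the 5 neighbors, but the two class-1 neighbors sit much closer and win the weighted vote:

```r
library(dplyr)

toy_nn <- tibble(
  class = c(1, 1, 2, 2, 2),  # 3 of 5 neighbors belong to class 2 ...
  dist  = c(1, 2, 3, 4, 5)   # ... but the class-1 neighbors are closer
)

toy_nn %>%
  group_by(class) %>%
  summarize(votes = sum(1 / dist)) %>%
  top_n(1, votes)
#> class 1 wins with 1.5 weighted votes (class 2 only gets about 0.78)
```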
That’s about it. The only thing left to do is copy the new cluster assignments into the original data set:
dat_result <- dat_raw %>%
full_join(dat_classified %>%
select(-votes),
by=c("id" = "id1")) %>%
mutate(c_coarse = if_else(
is.na(c_coarse.x),
c_coarse.y,
c_coarse.x
)) %>%
select(-c(c_coarse.x, c_coarse.y))
Notice: I am not happy with the code above and would be more than happy to hear about a more beautiful pipeline solution from you.
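Since you asked: one slightly tidier pattern replaces the if_else() juggling with dplyr::coalesce(), which returns the first non-missing value. Here it is on a self-contained toy version of the join (the column name c_coarse_knn is just an illustrative choice via the suffix argument); applied to the post’s data the same pipeline would join dat_raw with dat_classified %>% select(-votes) by c("id" = "id1"):

```r
library(dplyr)

old <- tibble(id = 1:4, c_coarse = c(1, NA, 2, NA))   # known memberships
new <- tibble(id1 = c(2, 4), c_coarse = c(3, 3))      # k-NN assignments

old %>%
  left_join(new, by = c("id" = "id1"), suffix = c("", "_knn")) %>%
  # take the existing value where present, the k-NN value otherwise
  mutate(c_coarse = coalesce(c_coarse, c_coarse_knn)) %>%
  select(-c_coarse_knn)
#> c_coarse becomes 1, 3, 2, 3
```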
dat_result
#> # A tibble: 6,749 x 49
#> KGCode Katastralgemeinde EZ PLZ Straße ON
#> <chr> <chr> <int> <int> <chr> <int>
#> 1 01205 Hietzing 290 1130 hietzinger hauptstrasse 31
#> 2 01512 Unterdöbling 355 1190 fuerfanggasse 5
#> 3 01512 Unterdöbling 33 1190 nusswaldgasse 7
#> 4 01502 Grinzing 1471 1190 schreiberweg 31
#> 5 01503 Heiligenstadt 928 1190 amalgergasse 7
#> 6 01511 Salmannsdorf 71 1190 salmannsdorfer strasse 26
#> 7 01215 Unter St.Veit 409 1130 larochegasse 33
#> 8 01209 Ober St.Veit 2462 1130 am meisenbuehel 5
#> 9 01502 Grinzing 1120 1190 hocheneggasse 5
#> 10 01514 Währing 2342 1180 hasenauerstrasse 63
#> # ... with 6,739 more rows, and 43 more variables: Gst <chr>, GstFl <int>,
#> # ErwArt <chr>, Widmung <chr>, Bauklasse <chr>, Gebäudehöhe <dbl>,
#> # Bauweise <chr>, Zusatz <chr>, Schutzzone <chr>, Wohnzone <chr>,
#> # öZ <chr>, Bausperre <chr>, seitbis <chr>, zuordnung <chr>,
#> # Geschoße <int>, parz <chr>, Erwerbsdatum <chr>, Anteile <chr>,
#> # Zähler <int>, Nenner <int>, BJ <int>, TZ <chr>, Kaufpreis <dbl>,
#> # mGfl <dbl>, Baureifgest <chr>, percentWidmung <int>, Baurecht <chr>,
#> # Bis <chr>, aufEZ <int>, Stammeinlage <chr>, sonstwid <chr>,
#> # sonstwidprz <chr>, Bauzins <chr>, address <chr>, one_more_time <lgl>,
#> # lon <dbl>, lat <dbl>, price_m2 <dbl>, price_log <dbl>, train <int>,
#> # c_fine <int>, id <int>, c_coarse <int>
k-NN with a ready-made implementation
Well, doing it from scratch was not hard, but let’s try it again with one of the numerous k-NN packages out there. Today we will use kknn because it was the first that came up in my Google search.
Let’s load the package:
library(kknn)
And conduct the classification:
fit_kknn <- kknn(
as.factor(c_coarse) ~ lon + lat,
train = dat_train,
test = dat_new,
k = 5,
distance = 2,
kernel = "inv",
scale = FALSE
)
Notice 1: distance = 2 sets the parameter of the Minkowski distance to 2, which equals the previously used Euclidean distance, and kernel = "inv" should match our weighting by 1/distance.
Notice 2: We needed to wrap c_coarse in as.factor() to get a classification. Without it, a k-NN regression would have been conducted, treating c_coarse as a continuous variable.
Now, we can easily extract the fitted values:
dat_imp <- dat_new
dat_imp$c_coarse <- fit_kknn$fitted.values %>% as.numeric
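A small caution on this conversion: the fitted values come back as a factor, and as.numeric() on a factor yields its internal level codes, not the labels. Here the cluster labels are consecutive integers starting at 1, so codes and labels happen to coincide, but the safe idiom goes through as.character() first:

```r
f <- factor(c(2, 4, 4))

as.numeric(f)                # 1 2 2 - the level codes, not the labels!
as.numeric(as.character(f))  # 2 4 4 - what you usually want
```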
While we are at it, let’s get the assignments for c_fine, too.
fit_fine <- kknn(
as.factor(c_fine) ~ lon + lat,
train = dat_result %>% filter(train==1),
test = dat_result %>% filter(train==0),
k = 5,
distance = 2,
kernel = "inv",
scale = FALSE
)
And save them to dat_result, converting via as.character() first so we store the cluster labels rather than the factor’s internal level codes.
dat_result$c_fine[dat_result$train == 0] <- as.numeric(as.character(fit_fine$fitted.values))
Check the results
Compare From Scratch with Ready-Made Implementation
Let’s check whether the results match:
all(dat_classified$c_coarse == dat_imp$c_coarse)
#> [1] TRUE
That’s nice - we have not messed up.
Visualize the new cluster assignments
We will need the ggmap package. Let’s load it:
library(ggmap)
Next, we create a bounding box of our coordinates …
bbox <- make_bbox(lon, lat, dat_result)
… and download the required map data.
vienna <- get_stamenmap(bbox, zoom = 13, maptype = c("toner-lite"))
Now, we can create plots of the cluster membership for the known …
plot_train <- ggmap(vienna, extent = "device") +
coord_cartesian() +
geom_point(data = dat_result %>%
filter(train==1),
aes(x=lon, y=lat, colour = as.factor(c_coarse)),
alpha=.5) +
ggtitle("Train")+
theme(legend.position="none")
… and new cases.
plot_new <- ggmap(vienna, extent = "device") +
coord_cartesian() +
geom_point(data = dat_result %>%
filter(train==0),
aes(x=lon, y=lat, colour = as.factor(c_coarse)),
alpha=.5)+
ggtitle("New") +
theme(legend.position="none")
Let’s put both together into one plot:
gridExtra::grid.arrange(plot_train, plot_new, ncol=1)
That looks good to me - the color regions in both plots align, and therefore the new cases have been given sensible cluster memberships.
Closing Remarks
I hope you enjoyed our short digression on k-nearest neighbors. Of course you can use it for more than just coordinates, but I believe that its application to spatial data is especially illustrative.
Before we end for today, just a quick warning: k-NN does NOT scale well. Although there are some amazingly optimized packages out there, naive k-NN is not a good fit for large amounts of data.
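As a minimal sketch of what such an optimized lookup looks like - assuming the RANN package, whose nn2() function wraps the ANN kd-tree library and returns neighbor indices and distances:

```r
library(RANN)

# Three known points and one query point
train_pts <- rbind(c(0, 0), c(1, 1), c(10, 10))
query_pt  <- matrix(c(0.9, 0.9), ncol = 2)

# kd-tree lookup of the single nearest training point
res <- nn2(data = train_pts, query = query_pt, k = 1)
res$nn.idx    # 2 - the point at (1, 1)
res$nn.dists  # about 0.141
```

For our roughly 5,400 training points the brute-force crossing() above already produced a 7.3-million-row tibble; a tree-based lookup avoids materializing anything like that.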
If you have any questions or comments please post them in the comments section.
If something is not working as outlined here, please check the package versions you are using. The system I used was:
sessionInfo()
#> R version 3.4.2 (2017-09-28)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
#> [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Austria.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggmap_2.7 kknn_1.3.1 bindrcpp_0.2 magrittr_1.5
#> [5] forcats_0.2.0 stringr_1.2.0 dplyr_0.7.4 purrr_0.2.4
#> [9] readr_1.1.1 tidyr_0.7.2 tibble_1.3.4 ggplot2_2.2.1
#> [13] tidyverse_1.2.1 kableExtra_0.6.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.13 lubridate_1.7.1 lattice_0.20-35
#> [4] png_0.1-7 assertthat_0.2.0 rprojroot_1.2
#> [7] digest_0.6.12 psych_1.7.8 R6_2.2.2
#> [10] cellranger_1.1.0 plyr_1.8.4 backports_1.1.1
#> [13] evaluate_0.10.1 httr_1.3.1 RgoogleMaps_1.4.1
#> [16] rlang_0.1.4 lazyeval_0.2.1 readxl_1.0.0
#> [19] rstudioapi_0.7 geosphere_1.5-7 Matrix_1.2-11
#> [22] rmarkdown_1.7 proto_1.0.0 labeling_0.3
#> [25] foreign_0.8-69 igraph_1.1.2 munsell_0.4.3
#> [28] broom_0.4.2 compiler_3.4.2 modelr_0.1.1
#> [31] pkgconfig_2.0.1 mnormt_1.5-5 htmltools_0.3.6
#> [34] gridExtra_2.3 viridisLite_0.2.0 crayon_1.3.4
#> [37] bitops_1.0-6 grid_3.4.2 nlme_3.1-131
#> [40] jsonlite_1.5 gtable_0.2.0 scales_0.5.0
#> [43] cli_1.0.0 stringi_1.1.5 mapproj_1.2-5
#> [46] reshape2_1.4.2 sp_1.2-5 xml2_1.1.1
#> [49] rjson_0.2.15 tools_3.4.2 glue_1.2.0
#> [52] maps_3.2.0 hms_0.3 jpeg_0.1-8
#> [55] parallel_3.4.2 yaml_2.1.14 colorspace_1.3-2
#> [58] rvest_0.3.2 knitr_1.17 bindr_0.1
#> [61] haven_1.1.0