Metrics

library(rnndescent)

rnndescent implements many distance functions, which you can select with the metric parameter of any function that needs one. Technically, not all of these are metrics, but let’s let that slide. Typical choices are "euclidean" or "cosine", the latter being more common for document-based data. For binary data, "hamming" or "jaccard" might be a good place to start.
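To see why "cosine" is often preferred for document-like data, the following minimal sketch compares the two distances by hand in base R. The helper functions and the toy term-count vectors are illustrative assumptions, not part of rnndescent: the second "document" is the first repeated ten times, so it points in the same direction but is much longer.

```r
# Illustrative helpers written for this example; not rnndescent functions.
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical term-count vectors: doc_b is doc_a repeated ten times
doc_a <- c(1, 2, 0, 3)
doc_b <- 10 * doc_a

# Large: dominated by the difference in document length
euclidean_dist(doc_a, doc_b)

# Zero up to floating-point rounding: both vectors have the same direction
cosine_dist(doc_a, doc_b)
```

Euclidean distance treats the longer document as far away, while cosine distance ignores the difference in length.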

The metrics here are a subset of those offered by the PyNNDescent Python package, which in turn reproduces those in the scipy.spatial.distance module of SciPy. Many of the binary distances seem to share their definitions with those in (Choi et al. 2010), so you may want to consult that reference for exact definitions.

For non-sparse data, metric variants with preprocessing are available. Preprocessing trades memory for a potential speed-up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions.
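The general idea behind this kind of trade-off can be sketched in base R using cosine distance. This is an illustration under my own assumptions, not rnndescent's internal implementation: storing a unit-length copy of the data costs extra memory, but each subsequent distance then needs only a dot product. The function names are hypothetical.

```r
# Illustrative only: not rnndescent's internal code.
set.seed(42)
x <- matrix(rnorm(20), nrow = 4)  # 4 observations, 5 features

# Without preprocessing: the norms are recomputed for every pair
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# With preprocessing: keep a row-normalized copy of the data (extra memory) ...
x_unit <- x / sqrt(rowSums(x^2))
# ... so that each distance reduces to a dot product
cosine_dist_pre <- function(a_unit, b_unit) 1 - sum(a_unit * b_unit)

d_plain <- cosine_dist(x[1, ], x[2, ])
d_pre <- cosine_dist_pre(x_unit[1, ], x_unit[2, ])

# The two values agree up to floating-point rounding
all.equal(d_plain, d_pre)
```

Because the operations are carried out in a different order, the two results can differ in the last few digits, which is the kind of minor numerical difference to expect.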

Specialized Binary Metrics

Some metrics are intended for use with binary data. This means that:

The metrics you can use with binary data are:

Here’s an example of using binary data stored as 0s and 1s with the "hamming" metric:

set.seed(42)
binary_data <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
head(binary_data)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]    0    0    0    1    0    1    0    1    1     0
#> [2,]    0    1    0    1    0    0    1    0    0     1
#> [3,]    0    0    0    1    1    1    1    1    1     0
#> [4,]    0    1    0    1    1    1    1    1    0     1
#> [5,]    1    0    0    0    1    1    1    1    0     1
#> [6,]    1    0    1    1    1    1    1    1    1     1
nn <- brute_force_knn(binary_data, k = 4, metric = "hamming")

Now let’s convert it to a logical matrix:

logical_data <- binary_data == 1
head(logical_data)
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
#> [1,] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
#> [2,] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#> [4,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#> [5,]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#> [6,]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
logical_nn <- brute_force_knn(logical_data, k = 4, metric = "hamming")

The results will be the same:

all.equal(nn, logical_nn)
#> [1] TRUE

but on a real-world dataset, the logical version will be much faster.
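As a rough check on why the two representations agree, here is a minimal base R sketch of a Hamming-style distance computed by hand. The helper hamming_prop is hypothetical and follows the SciPy convention (the proportion of positions that differ); the exact scaling used by rnndescent may differ, so treat this as an illustration of the idea rather than a reimplementation.

```r
# Hypothetical helper, not part of rnndescent.
# SciPy-style convention: proportion of positions where the vectors differ.
hamming_prop <- function(a, b) mean(a != b)

# The same two observations stored as 0/1 numerics ...
a_num <- c(0, 1, 0, 1, 1)
b_num <- c(0, 0, 0, 1, 0)

# ... and as logicals
a_lgl <- a_num == 1
b_lgl <- b_num == 1

d_num <- hamming_prop(a_num, b_num)
d_lgl <- hamming_prop(a_lgl, b_lgl)

# Both representations disagree at the same positions, so the distances match
identical(d_num, d_lgl)
```

The distance depends only on which positions differ, not on how the two values are encoded, so switching to a logical matrix changes the storage but not the result.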

References

Choi, Seung-Seok, Sung-Hyuk Cha, Charles C Tappert, et al. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics and Informatics 8 (1): 43–48.
Heidarian, Arash, and Michael J. Dinneen. 2016. “A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering.” In 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 142–51. https://doi.org/10.1109/BigDataService.2016.14.