# Contents

This document clarifies the relations between dissimilarity-based and similarity-based methods for hierarchical agglomerative clustering (HAC) and explains implementation choices made in adjclust. Throughout, we suppose given $$n$$ objects, $$\{1, \ldots, n\}$$, to be clustered using adjacency-constrained HAC, that is, in such a way that only adjacent objects/clusters can be merged.

# 1 Basic implementation of constrained HAC in adjclust

The basic implementation of adjclust takes as input a kernel $$k$$, which is assumed to be symmetric and positive (in the kernel sense) and to satisfy: $k(i,i) = 1,\qquad \forall\, i=1,\ldots,n.$

If your data are in this format, then the constrained clustering can be performed with

fit <- adjClust(k, type = "similarity")

or with

fit <- adjClust(k, type = "similarity", h = h)

if, in addition, the kernel $$k$$ is assumed to have only null entries outside of a band of width $$h$$ around the diagonal.

The implementation is the one described in [1].
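To make the merging rule concrete, it can be sketched directly: at every step, only pairs of *adjacent* clusters are candidates, and the pair with the smallest Ward cost in the kernel's feature space is merged. Below is a minimal, naive Python illustration of this idea on a hypothetical toy kernel (a sketch only; it is not the package's implementation, which is far more efficient):

```python
def ward_cost(k, A, B):
    """Ward merging cost between clusters A and B (lists of indices),
    computed in the feature space of the kernel k:
    |A||B|/(|A|+|B|) * ||mu_A - mu_B||^2."""
    kAA = sum(k[i][j] for i in A for j in A) / len(A) ** 2
    kBB = sum(k[i][j] for i in B for j in B) / len(B) ** 2
    kAB = sum(k[i][j] for i in A for j in B) / (len(A) * len(B))
    return len(A) * len(B) / (len(A) + len(B)) * (kAA + kBB - 2 * kAB)

def constrained_hac(k):
    """Adjacency-constrained HAC: only neighboring clusters can merge.
    Returns the list of merges (left cluster, right cluster, level)."""
    clusters = [[i] for i in range(len(k))]
    merges = []
    while len(clusters) > 1:
        # candidate merges are pairs of *adjacent* clusters only
        costs = [ward_cost(k, clusters[p], clusters[p + 1])
                 for p in range(len(clusters) - 1)]
        p = costs.index(min(costs))
        merges.append((clusters[p], clusters[p + 1], costs[p]))
        clusters[p:p + 2] = [clusters[p] + clusters[p + 1]]
    return merges

# toy kernel with two well-separated diagonal blocks {0,1} and {2,3}
k = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.9],
     [0.1, 0.1, 0.9, 1.0]]
for left, right, level in constrained_hac(k):
    print(left, right, round(level, 4))
```

On this toy kernel, the two diagonal blocks are formed first (both at level 0.1) and are only joined at the final, much higher, merging level.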

# 2 More advanced uses for kernel or similarity matrices

## 2.1 Non-positive but normalized similarities

In this section, the available data set is a matrix $$s$$ that can either have only positive entries (in which case it is called a similarity) or both positive and non-positive entries. If, in addition, $$s(i,i) = 1$$ for all $$i=1,\ldots,n$$ and $$s(i,j) \leq 1$$ for all $$i,j=1,\ldots,n$$, then the algorithm implemented in adjclust can be applied directly, exactly as for a standard kernel (Section 1). This section explains why this is the case.

The interpretation is similar to the kernel case, under the assumption that pairs of objects with small or strongly negative similarity values are less likely to be clustered together than pairs with large similarity values. The application of the method is justified by the fact that, for a matrix $$s$$ as described above, we can find a $$\lambda > 0$$ such that the matrix $$k_\lambda$$ defined by $\forall\, i,j=1,\ldots,n,\qquad k_\lambda(i,j) = s(i,j) + \lambda \mathbb{1}_{\{i=j\}}$ is a kernel (i.e., the matrix $$k_\lambda = s + \lambda I$$ is positive definite); indeed, this is the case for any $$\lambda$$ larger than the opposite of the smallest (negative) eigenvalue of $$s$$. [2] shows that the HAC obtained from the distance induced by the kernel $$k_\lambda$$ in its feature space and the HAC obtained from the ad hoc dissimilarity defined by $\forall\, i,j=1,\ldots,n,\qquad d(i,j) = \sqrt{s(i,i) + s(j,j) - 2s(i,j)} = \sqrt{2(1-s(i,j))}$ are identical, except that all the merging levels are shifted by $$\lambda$$.
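The existence of such a $$\lambda$$ can be checked numerically. The following stdlib-Python snippet uses a hypothetical $$3 \times 3$$ example (not taken from the package): a symmetric matrix with unit diagonal and entries at most 1 that is not positive definite, together with a diagonal shift $$s + \lambda I$$ that is, as verified with Sylvester's criterion:

```python
def leading_minors(m):
    """Determinants of the leading principal submatrices. By Sylvester's
    criterion, all positive <=> the symmetric matrix is positive definite."""
    def det(a):
        # Laplace expansion along the first row (fine for tiny matrices)
        if len(a) == 1:
            return a[0][0]
        return sum((-1) ** j * a[0][j]
                   * det([row[:j] + row[j + 1:] for row in a[1:]])
                   for j in range(len(a)))
    return [det([row[:k] for row in m[:k]]) for k in range(1, len(m) + 1)]

# hypothetical normalized similarity with non-positive entries, not PSD
s = [[1.0, 0.9, -0.9],
     [0.9, 1.0, 0.9],
     [-0.9, 0.9, 1.0]]
print(leading_minors(s))       # last minor is negative: s is not a kernel

lam = 1.0                      # any lambda > -(smallest eigenvalue) works
k_lam = [[s[i][j] + (lam if i == j else 0.0) for j in range(3)]
         for i in range(3)]
print(leading_minors(k_lam))   # all positive: k_lam is a kernel
```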

In conclusion, to address this case, the command lines that have to be used are the ones described in section 1.

## 2.2 Non-normalized similarities

Suppose now that the data set is described by a matrix $$s$$ as in the previous section, except that it does not satisfy at least one of the two following conditions:

• [C1] $$\forall i \in\{1,\ldots,n\}$$, $$s(i,i)=1$$
• [C2] $$\forall i,j \in\{1,\ldots,n\}$$, $2s(i,j) \leq s(i,i) + s(j,j).$

These properties were implicitly assumed to hold in the original implementation of constrained HAC described in [1]. Specifically, this implementation relied on the implicit dissimilarity $\forall\,i,j=1,\ldots,n,\qquad d^2(i,j) = s(i,i) + s(j,j) - 2s(i,j),$ which has no meaning (as a squared dissimilarity) if [C2] does not hold. If either [C1] or [C2] is not met, the package performs the following pre-transformations:

1. a matrix $$s^{*}$$ is defined as $\forall\,i,j=1,\ldots,n,\qquad s^{*}(i,j) = s(i,j) + \lambda \mathbb{1}_{\{i=j\}}$ for a $$\lambda$$ large enough to ensure that [C2] holds for $$s^{*}$$. In the package, $$\lambda$$ is chosen as $\lambda := \epsilon + \max_{i,j} \left(s(i,j) - s(i,i)\right)_+$ for a small $$\epsilon > 0$$. This transformation is justified by the property described in Section 2.1 (Non-positive but normalized similarities): shifting the diagonal entries of a similarity matrix does not change the HAC result, so they can be shifted until they induce a proper ad-hoc dissimilarity matrix;

2. $$s^{*}$$ is then normalized to have unit diagonal ([C1]), in order to allow its use in the original implementation of [1]: $\forall\,i,j=1,\ldots,n,\qquad s^{**}(i,j) = s^{*}(i,j) + 1 - \frac{1}{2} \left(s^{*}(i,i) + s^{*}(j,j)\right).$

By definition, $$s^{**}$$ satisfies [C1] and [C2], and the ad-hoc dissimilarities induced by $$s^{*}$$ and $$s^{**}$$ are identical: $\forall\,i,j=1,\ldots,n,\qquad d^{**}(i,j) = d^{*}(i,j).$
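Both steps can be verified numerically. The following stdlib-Python snippet uses a hypothetical $$3 \times 3$$ matrix (not from the package) that violates [C1] and [C2], and checks that the diagonal shift enforces [C2] and that the normalization enforces [C1] while preserving the induced squared dissimilarities:

```python
# hypothetical similarity violating both [C1] and [C2]
n = 3
s = [[0.5, 2.0, 0.3],
     [2.0, 1.0, 1.5],
     [0.3, 1.5, 0.2]]

# step 1: diagonal shift with lambda = eps + max(s(i,j) - s(i,i))_+
eps = 1e-6
lam = eps + max(max(s[i][j] - s[i][i], 0.0)
                for i in range(n) for j in range(n))
s_star = [[s[i][j] + (lam if i == j else 0.0) for j in range(n)]
          for i in range(n)]
# [C2] now holds for s_star
assert all(2 * s_star[i][j] <= s_star[i][i] + s_star[j][j]
           for i in range(n) for j in range(n))

# step 2: normalization to unit diagonal
s_2star = [[s_star[i][j] + 1 - 0.5 * (s_star[i][i] + s_star[j][j])
            for j in range(n)] for i in range(n)]
# [C1] now holds, and the induced squared dissimilarities coincide
assert all(abs(s_2star[i][i] - 1) < 1e-12 for i in range(n))
d2 = lambda m, i, j: m[i][i] + m[j][j] - 2 * m[i][j]
assert all(abs(d2(s_star, i, j) - d2(s_2star, i, j)) < 1e-12
           for i in range(n) for j in range(n))
print("[C1] and [C2] hold for s**; d* and d** coincide")
```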

Important remark: The transformation described above affects all the elements of the similarity matrix (and not only its diagonal), contrary to what happens in Section 2.1. Hence, it is not (yet) compatible with the fast implementation, which requires an $$h$$-band centered on the diagonal outside of which all entries are assumed to be 0. The current behavior of the package is to forbid the use of the fast implementation when the provided matrix is not normalized (i.e., when not all diagonal entries are equal to 1). In that case, the constrained clustering from $$s$$, including the matrix transformations described above, remains available with the full matrix, i.e.:

fit <- adjClust(s, type = "similarity")

## 2.3 Case of dissimilarity data

The original implementation of (unconstrained) HAC in stats::hclust takes a dissimilarity matrix as input. However, the implementation of adjclust is based on a kernel/similarity approach. This section describes how the case of dissimilarity data is handled.

Suppose given a dissimilarity $$d$$ which satisfies:

• $$d$$ has non-negative entries: $$d(i,j) \geq 0$$ for all $$i,j=1,\ldots,n$$;

• $$d$$ is symmetric: $$d(i,j) = d(j,i)$$ for all $$i,j=1,\ldots,n$$;

• $$d$$ has a null diagonal: $$d(i,i) = 0$$ for all $$i=1,\ldots,n$$.

Any sequence of positive numbers $$(a_i)_{i=1,\ldots,n}$$ would provide a similarity $$s$$ for which $$d$$ is the ad-hoc dissimilarity by setting: $\left\{ \begin{array}{l} s(i,i) = a_i\\ s(i,j) = \frac{1}{2} (a_i + a_j - d^2(i,j)) \end{array} \right. .$ By definition, such an $$s$$ satisfies [C2], and any choice for $$(a_i)_{i=1,\ldots,n}$$ yields the same clustering (since they all correspond to the same ad-hoc dissimilarity). To satisfy [C1], the choice $$a_i = 1$$ for all $$i=1,\ldots,n$$ has been made.
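This construction can be checked numerically. The following stdlib-Python snippet uses a hypothetical 3-point dissimilarity (not from the package) and verifies that, with $$a_i = 1$$, the resulting $$s$$ satisfies [C1] and [C2] and that its ad-hoc dissimilarity recovers $$d$$ exactly:

```python
import math

# hypothetical dissimilarity: symmetric, non-negative, null diagonal
n = 3
d = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.5],
     [2.0, 1.5, 0.0]]

# build s with the choice a_i = 1 for all i
a = [1.0] * n
s = [[0.5 * (a[i] + a[j] - d[i][j] ** 2) for j in range(n)]
     for i in range(n)]

assert all(s[i][i] == 1.0 for i in range(n))              # [C1]
assert all(2 * s[i][j] <= s[i][i] + s[j][j]               # [C2]
           for i in range(n) for j in range(n))

# the ad-hoc dissimilarity of s recovers d
d_back = [[math.sqrt(s[i][i] + s[j][j] - 2 * s[i][j]) for j in range(n)]
          for i in range(n)]
assert all(abs(d_back[i][j] - d[i][j]) < 1e-12
           for i in range(n) for j in range(n))
print("d is recovered as the ad-hoc dissimilarity of s")
```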

The similarity $$s$$ is then processed as described in the previous sections. In short, since it satisfies [C1] and [C2] by definition, it is handled as described in Sections 1 and 2.1, and the basic and sparse implementations are both available with, respectively,

fit <- adjClust(d, type = "dissimilarity")

and

fit <- adjClust(d, type = "dissimilarity", h = h)

# 3 References

[1] Dehman A. (2015). Spatial clustering of linkage disequilibrium blocks for genome-wide association studies. PhD thesis, Université Paris-Saclay.

[2] Miyamoto S., Abe R., Endo Y., Takeshita J. (2015). Ward method of hierarchical clustering for non-Euclidean similarity measures. In: Proceedings of the VIIth International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015).