# Information theoretic clustering

Post-publication activity

Curator: Jose Carlos Principe

Renyi entropy-based information theoretic clustering is the process of grouping, or clustering, the items comprising a data set, according to a divergence measure between probability density functions based on Renyi's quadratic entropy (Renyi, 1976). Such an information theoretic divergence measure captures directly the statistical information contained in the data as expressed by the probability density function and can thus produce non-convex cluster boundaries. As is well known K-means clustering (MacQueen, 1967), or for that matter any other clustering method that utilizes second order data statistics, produce only convex clusters (Jain et al., 2000). The information divergence measure is estimated directly from the data with pairwise sample interactions using concepts from kernel density estimation. Renyi entropy-based information theoretic clustering is related to graph-theoretic clustering and Mercer kernel based learning.

## Problem specification Figure 1: Information theoretic clustering: A label $$l_k\ ,$$ $$k=1,\dots,N\ ,$$ out of $$C$$ possibilities, is assigned to each data point $$\mathbf{x}_k\ ,$$ $$k=1,\dots,N\ .$$ This creates $$C$$ clusters in the data set. The labels are assigned in such a way that a Renyi entropy-based divergence measure between the resulting clusters is maximized.

In data clustering, a label $$l_k \in \{1,\dots,C \}\ ,$$ $$k=1,\dots,N\ ,$$ is assigned to each $$d$$-dimensional data point $$\mathbf{x}_k\ ,$$ $$k=1,\dots,N\ ,$$ comprising a data set. The data points with the same label form a group in the data set. Each group should correspond to a "natural" cluster of data points, such that members of the same cluster are more similar, in some sense, than members of different clusters. In the current information theoretic clustering approach, similarity is quantified by a divergence measure, $$D(p_1,\dots,p_C),$$ between the probability density functions (pdfs), $$p_1(\mathbf{x}),\dots,p_C(\mathbf{x}),$$ of the clusters. The divergence measure is based on the quadratic Renyi entropy, which is an information theoretic measure of the uncertainty of a random variable. The goal is to assign labels such that an estimate, $$D(\hat{p_1},\dots,\hat{p}_c)\ ,$$ of the divergence is maximized, because this corresponds to clusters which are maximally distant in terms of the cluster pdfs. This process is illustrated in Figure 1.

## Renyi entropy-based divergence measures

The quadratic Renyi entropy is given by $\tag{1} H_2(p) = - \log \int p^2(\mathbf{x}) d\mathbf{x},$

where $$p(\mathbf{x})$$ is the pdf. It is a special case of the family of Renyi entropies, $$H_\alpha(p) = \frac{1}{1-\alpha} \log \int p^\alpha(\mathbf{x}) d\mathbf{x}\ ,$$ where $$\alpha$$ is the Renyi entropy parameter (Renyi, 1976). In the limit $$\alpha \rightarrow 1$$ the Shannon entropy is obtained.

Consider two pdfs, $$p_1(\mathbf{x})$$ and $$p_2(\mathbf{x})\ .$$ By the Cauchy-Schwarz (CS) inequality, the following holds $\tag{2} \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x} \leq \sqrt{\int p_1^2(\mathbf{x}) d\mathbf{x} \int p_2^2(\mathbf{x}) d\mathbf{x}}.$

Using (2), a CS divergence measure between these two pdfs may therefore be defined as (Principe et al., 2000) $\tag{3} D_{CS}(p_1,p_2) = -\log \frac{\int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x}}{\sqrt{\int p_1^2(\mathbf{x}) d\mathbf{x} \int p_2^2(\mathbf{x}) d\mathbf{x}}}.$

This measure is principled in the sense that it is additive under statistical independence (joint densities are products of marginal densities), a fundamental requirement for an information measure, it is symmetric and obeys $$D_{CS}(p_1,p_2) \in [0,\infty)\ .$$ It is zero only if $$p_1(\mathbf{x}) = p_2(\mathbf{x})\ .$$ It does not satisfy the triangle inequality (nor does e.g. the Kullback-Leibler divergence) and is for that reason not a distance metric.

The CS divergence is based on the quadratic Renyi entropy since $$D_{CS}(p_1,p_2) = - \log \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x} - \frac{1}{2} H_2(p_1) - \frac{1}{2} H_2(p_2),$$ where $$- \log \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x}$$ can be considered a "cross" Renyi entropy between $$p_1(\mathbf{x})$$ and $$p_2(\mathbf{x})\ .$$ As shown in the next section, the relation to the Renyi entropy provides a convenient estimator for this quantity.

The CS divergence may be extended to the $$C$$-cluster case as follows $D_{CS}(p_1,\dots,p_C) = -\log \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j > i} \frac{\int p_i(\mathbf{x})p_j(\mathbf{x}) d\mathbf{x}}{\sqrt{\int p_i^2(\mathbf{x}) d\mathbf{x} \int p_j^2(\mathbf{x}) d\mathbf{x}}},$ where $$\kappa = \sum_{c=1}^{C-1} c\ .$$

## Sample-based estimators via kernel density estimation

In order to be able to utilize the Renyi entropy-based CS divergence as a cost function for clustering, a sample-based estimator must be derived.

Notice that by (1), $$H_2(p) = - \log V(p)$$ where $$V(p) = \mathcal{E}_p(p)$$ and $$\mathcal{E}_p(\cdot)$$ denotes the expectation operator wrt. $$p(\mathbf{x})\ .$$ Hence, the Renyi quadratic entropy is a function of the expectation value of the density $$p(\mathbf{x})\ .$$ This expectation can be estimated via the kernel density estimator formalism, as follows:

Assume that a data set $$\mathbf{x}_k\ ,$$ $$k=1,\dots,N\ ,$$ is available, generated from $$p(\mathbf{x})\ .$$ A kernel density estimator of $$p(\mathbf{x})$$ based on these samples is given by (Parzen, 1962) $\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{k=1}^N W_\sigma(\mathbf{x},\mathbf{x}_k).$ There are many valid kernel functions $$W_\sigma(\cdot,\cdot)\ .$$ The Gaussian function is the most widely used $$W_\sigma(\mathbf{x},\mathbf{x}_k) = \frac{1}{2 \pi \sigma^d} \exp \left\{ - \frac{1}{2\sigma^2} \left\| \mathbf{x} - \mathbf{x}_k \right\|^2 \right\},$$ where $$\sigma$$ is the scale parameter.

A (biased) sample-based estimator of the quadratic Renyi entropy is obtained by approximating the expectation operator using the sample mean, i.e. $$\mathcal{E}_p(p) \approx \frac{1}{N} \sum_{k=1}^N p(\mathbf{x}_k)$$ and by replacing $$p(\mathbf{x})$$ by $$\hat{p}(\mathbf{x})\ ,$$ obtaining $\begin{array}{lcl} H_2(\hat{p}) &=& - \log \frac{1}{N} \sum_{k=1}^N \hat{p}(\mathbf{x}_k) \\ &=& - \log \frac{1}{N} \sum_{k=1}^N \frac{1}{N} \sum_{k^\prime = 1}^N W_{\sigma}(\mathbf{x}_k,\mathbf{x}_{k^\prime}) \\ &=& - \log \frac{1}{N^2} \sum_{k,k^\prime = 1}^{N,N} W_{\sigma}(\mathbf{x}_k,\mathbf{x}_{k^\prime}). \end{array}$

Assume furthermore that samples corresponding to two clusters are available, such that $$\mathbf{x}_i\ ,$$ $$i=1,\dots,N_1\ ,$$ are generated from $$p_1(\mathbf{x})$$ and $$\mathbf{x}_j\ ,$$ $$j=1,\dots,N_2\ ,$$ are generated from $$p_2(\mathbf{x})\ .$$ Following a similar derivation as above, replacing $$p_1(\mathbf{x})$$ and $$p_2(\mathbf{x})$$ by kernel density estimators $$\hat{p}_1(\mathbf{x})$$ and $$\hat{p}_2(\mathbf{x})\ ,$$ a (biased) sample-based estimator for the CS divergence (3) is obtained as follows $\tag{4} D_{CS}(\hat{p}_1,\hat{p}_2) = - \log \frac{\frac{1}{N_1N_2} \sum_{i,j=1}^{N_1,N_2} W_\sigma(\mathbf{x}_i,\mathbf{x}_j)}{\sqrt{\frac{1}{N_1^2} \sum_{i,i^\prime = 1}^{N_1,N_1} W_\sigma(\mathbf{x}_i,\mathbf{x}_{i^\prime} ) \frac{1}{N_2^2} \sum_{j,j^\prime = 1}^{N_2,N_2} W_\sigma(\mathbf{x}_j,\mathbf{x}_{j^\prime})}}.$

This estimator is easily extended to the $$C$$-cluster case. The use of kernel functions other than the Gaussian is also possible (Jenssen et al., 2007). Below we will present a graph clustering interpretation of this expression.

## Gradient descent for clustering optimization

To be useful as a clustering cost function, the sample-based estimator of the CS divergence must be expressed in terms of the labels of the data points. The goal of the current clustering approach is precisely to adjust the labels, such that the cost function used obtains its maximum value. This optimization procedure may be implemented in several different ways. Here, two different strategies are discussed.

Gokcay and Principe (2002) first devised an exhaustive search optimization. This means that every clustering possibility had to be examined by computing the cost function in each case, for then to pick the best result in the end. Since the computation of the cost function scales as $$O(N^2)\ ,$$ where $$N$$ is the number of data points, this method became prohibitive for anything but very small and manageable data sets. The results obtained were however very promising, and instigated research focused on more efficient optimization rules.

Jenssen et al. (2007) instead derived a fuzzy fixed point optimization procedure based on the technique of Lagrange multipliers by treating the labels as continuous (fuzzy) variables. This resulted in an efficient gradient descent-based clustering algorithm. Being a fuzzy algorithm, each data point may thus be assigned to several clusters at the same time during the optimization process, yielding a smooth cost function. This is similar in spirit to Fuzzy C-means cluster analysis.

This information theoretic clustering algorithm has several advantages.

• Firstly, the computational complexity of calculating gradients may be reduced by using only a random subset of size $$M \ll N$$ of the data points in the calculation. This creates an $$O(MN)$$ algorithm at each iteration step.
• Secondly, the problem of local optima of non-convex cost functions can to a large degree be avoided by incorporating kernel annealing as an optimization strategy. Hence, the kernel (window) width may start out large during the initial iterations, for then to be annealed, or reduced, as the number of iterations increases.
• Thirdly, the annealing procedure may help the algorithm cope with different data scales, by discovering large scale structures in the initial phase, followed by a "fine tuning" as the kernel width is decreased.

## Some examples

A few examples of Renyi entropy-based information theoretic clustering results are provided. See for example Jenssen et al. (2007) or Gokcay and Principe (2002) for more examples. Note that no parametric assumptions are made with respect to the shape of the cluster probability density functions. Therefore, clusters with elongated shapes can be clustered, as shown in Figure 2. In the left panel, the data set to be clustered is shown. In the right panel, the obtained clusters are shown using different symbols to indicate cluster membership. Non-convex cluster boundaries may also be discovered, as shown in Figure 3. Finally, Figure 4 shows an example of information theoretic clustering for segmentation of a textured image, where image features are created using a bank of Gabor filters. Figure 2: Information theoretic clustering of data set with elongated clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier. Figure 3: Information theoretic clustering of a data set with highly non-linear clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier. Figure 4: Information theoretic clustering of textured image. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.

## Connection to graph-theoretic clustering

A set of points, $$\mathbf{x}_k, k=1,\dots,N\ ,$$ in an arbitrary data space can be represented as a weighted undirected graph $$\mathcal{G}\ .$$ Each node in the graph corresponds to a data point. The edge formed between a pair of nodes, say $$i$$ and $$j\ ,$$ is weighted according to the similarity between the corresponding data points. The edge-weight is denoted $$v(\mathbf{x}_i,\mathbf{x}_j)\ ,$$ and is often computed using a radial basis function, i.e $$v(\mathbf{x}_i,\mathbf{x}_j)=\exp \left\{ - \frac{1}{2\sigma^2} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2 \right\}\ .$$ The graph cut provides a measure of the cost of partitioning a graph $$\mathcal{G}$$ into two subgraphs $$\mathcal{G}_1$$ and $$\mathcal{G}_2\ ,$$ and is defined as $\tag{5} Cut(\mathcal{G}_1,\mathcal{G}_2) = \sum_{i,j=1}^{N_1,N_2} v(\mathbf{x}_i,\mathbf{x}_j),$

where the index $$i=1,\dots,N_1\ ,$$ runs over the $$N_1$$ nodes of subgraph $$\mathcal{G}_1$$ and the index $$j=1,\dots,N_2\ ,$$ runs over the $$N_2$$ nodes of subgraph $$\mathcal{G}_2\ .$$ That is, the cut measures the weight of the edges which have to be removed in order to create the two subgraphs. Wu and Leahy (1993) first proposed to minimize the cut-cost as a means for clustering and image segmentation. Many attempts have been made at somehow normalizing the cut-cost, as it tends to produce a skewed partition. A prominent example is Shi and Malik (2000), who introduced the normalized cut, optimized using the spectral properties of an affinity, or kernel, matrix.

By comparing (4) to (5), it becomes clear that the CS divergence estimator may be related to the graph cut, since the numerator of (4) is exactly equivalent to (5) by considering $$W_\sigma(\mathbf{x}_i,\mathbf{x}_j)$$ an edge weight similar to $$v(\mathbf{x}_i,\mathbf{x}_j)\ .$$ Moreover, the volume of subgraph $$\mathcal{G}_1$$ is given by $$Vol(\mathcal{G}_1)= \sum_{i,i^\prime=1}^{N_1,N_1} v(\mathbf{x}_i,\mathbf{x}_{i^\prime})$$ and the volume of subgraph $$\mathcal{G}_2$$ is given by $$Vol(\mathcal{G}_2)= \sum_{j,j^\prime=1}^{N_2,N_2} v(\mathbf{x}_j,\mathbf{x}_{j^\prime}) \ .$$ Hence, the CS divergence estimator may be expressed as $D(\hat{p}_1,\hat{p}_2) = - \log \frac{Cut\mathcal{G}_1,\mathcal{G}_2}{\sqrt{Vol(\mathcal{G}_1) Vol(\mathcal{G}_2)}}.$ Thus, the CS divergence estimated using kernel density estimation yields a well-defined normalization of the graph cut. This version of the graph cut has been named the Information cut (Jenssen et al., 2007). Optimization the CS divergence for clustering is therefore equivalent to the optimization of a normalized version of the graph cut. In (Jenssen et al., 2006a), a graph spectral version of information theoretic clustering was proposed.

## Connection to Mercer kernel based learning

Mercer kernel based learning (Shawe-Taylor and Cristianini, 2004) is based on deriving non-linear learning algorithms by computing inner-products in a potentially infinite dimensional space obtained by the non-linear mapping $$\mathbf{x}_k \rightarrow \mathbf{\Phi}(\mathbf{x}_k), \ k=1,\dots,N.$$ Inner-products are computed by means of a Mercer kernel function $$v(\mathbf{x}_i,\mathbf{x}_j) = \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle.$$ A Mercer kernel is a positive semidefinite function. The most widely used such kernel is the radial basis function.

Jenssen et al., (2005, 2006b) noticed that the Gaussian kernel $$W_\sigma(\mathbf{x}_i,\mathbf{x}_j)$$ is a Mercer kernel. Thus, $$W_\sigma(\mathbf{x}_i,\mathbf{x}_j) = \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle.$$ Hence, based on (4), the CS divergence estimator may be rewritten as follows $\begin{array}{ccl} D_{CS}(\hat{p}_1,\hat{p}_2) & = & - \log \frac{\frac{1}{N_1 N_2} \sum_{i,j=1}^{N_1,N_2} \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle}{\sqrt{\frac{1}{N_1^2} \sum_{i,i^\prime=1}^{N_1,N_1} \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_{i^\prime}) \right \rangle \frac{1}{N_2^2} \sum_{j,j^\prime=1}^{N_2,N_2} \left \langle \mathbf{\Phi}(\mathbf{x}_j),\mathbf{\Phi}(\mathbf{x}_{j^\prime}) \right \rangle }} \\ &=& - \log \frac{\left \langle \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i), \frac{1}{N_2}\sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j)\right \rangle }{\sqrt{\left \langle \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i), \frac{1}{N_1}\sum_{i^\prime=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_{i^\prime})\right \rangle \left \langle \frac{1}{N_2} \sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j), \frac{1}{N_2}\sum_{j^\prime=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_{j^\prime})\right \rangle}} \\ &=& - \log \frac{ \left \langle \mathbf{m}_{1},\mathbf{m}_{2} \right \rangle }{\sqrt{ \left \langle \mathbf{m}_{1},\mathbf{m}_{1} \right \rangle \left \langle \mathbf{m}_{2},\mathbf{m}_{2} \right \rangle }} = - \log \cos \angle (\mathbf{m}_{1},\mathbf{m}_{2}), \end{array}$ where $$\mathbf{m}_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i)$$ and $$\mathbf{m}_2 = \frac{1}{N_2} \sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j)$$ can be considered mean vectors of feature space data clusters corresponding to the data points associated with $$p_1(\mathbf{x})$$ and $$p_2(\mathbf{x})\ ,$$ respectively. This means that maximization of the CS divergence for clustering is equivalent to minimization of the cosine of the angle between the Mercer kernel feature space cluster mean vectors.