Dunn index

The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms.^[1]^[2] This is part of a group of validity indices including the Davies–Bouldin index or Silhouette index, in that it is an internal evaluation scheme, where the result is based on the clustered data itself. As do all other such indices, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering. One of the drawbacks of using this is the computational cost as the number of clusters and dimensionality of the data increase.

Preliminaries

There are many ways to define the size or diameter of a cluster. It could be the distance between the farthest two points inside a cluster, it could be the mean of all the pairwise distances between data points inside the cluster, or it could as well be the distance of each data point from the cluster centroid. Each of these formulations are mathematically shown below:

Let C_i be a cluster of vectors. Let x and y be any two n dimensional feature vectors assigned to the same cluster C_i.

\Delta _{i}={\underset {x,y\in C_{i}}{\text{max}}}d(x,y)

, which calculates the maximum distance (the version proposed by Dunn).

\Delta _{i}={\dfrac {2}{|C_{i}|(|C_{i}|-1)}}{\underset {x,y\in C_{i},x\neq y}{\sum }}d(x,y)

, which calculates the mean distance between all pairs.

\Delta _{i}={\dfrac {{\underset {x\in C_{i}}{\sum }}d(x,\mu )}{|C_{i}|}},\mu ={\dfrac {{\underset {x\in C_{i}}{\sum }}x}{|C_{i}|}}

, calculates distance of all the points from the mean.

This can also be said about the intercluster distance, where similar formulations can be made, using either the closest two data points (used by Dunn), one in each cluster, or the farthest two, or the distance between the centroids and so on. The definition of the index includes any such formulation, and the family of indices so formed are called Dunn-like Indices. Let $\delta (C_{i},C_{j})$ be this intercluster distance metric, between clusters C_i and C_j.

Definition

With the above notation, if there are m clusters, then the Dunn Index for the set is defined as:

{\mathit {DI}}_{m}={\frac {{\underset {1\leqslant i<j\leqslant m}{\text{min}}}\left.\delta (C_{i},C_{j})\right.}{{\underset {1\leqslant k\leqslant m}{\text{max}}}\left.\Delta _{k}\right.}}

where $\delta (C_{i},C_{j})$ is the inter-cluster distance between the clusters $C_{i}$ and $C_{j}$ while $\Delta _{k}$ is the within cluster distance, e.g. the maximum distance within one cluster when following Dunn's original definition.

Explanation

Being defined in this way, the DI depends on m, the number of clusters in the set. If the number of clusters is not known apriori, the m for which the DI is the highest can be chosen as the number of clusters. There is also some flexibility when it comes to the definition of d(x,y) where any of the well known metrics can be used, like Manhattan distance or Euclidean distance based on the geometry of the clustering problem. This formulation has a peculiar problem, in that if one of the clusters is badly behaved, where the others are tightly packed, since the denominator contains a 'max' term instead of an average term, the Dunn Index for that set of clusters will be uncharacteristically low. This is thus a worst case indicator, and has to be kept in mind. There are ready implementations of the Dunn index in some vector based programming languages like MATLAB, R and Apache Mahout.^[3]^[4]^[5]

Notes and references

^ Dunn, J. C. (1973-09-17). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–57. doi:10.1080/01969727308546046. S2CID 120919314.
^ Dunn, J. C. (1973-09-01). "Well-Separated Clusters and Optimal Fuzzy Partitions". Journal of Cybernetics. 4 (1) (published 1974): 95–104. doi:10.1080/01969727408546059. ISSN 0022-0280.
^ "MATLAB implementation of the Dunn Index". Retrieved 5 December 2011.
^ Lukasz, Nieweglowski. "Package 'clv'" (PDF). R project. CRAN. Retrieved 2 April 2013.
^ "Apache Mahout". Apache Software Foundation. Retrieved 9 May 2013.

External links

Pakhira, Malay K.; Bandyopadhyay, Sanghamitra; Maulik, Ujjwal (2004). "Validity index for crisp and fuzzy clusters". Pattern Recognition. 37 (3): 487–501. Bibcode:2004PatRe..37..487P. doi:10.1016/j.patcog.2003.06.005.
Bezdek, J.C.; Pal, N.R. (1995). "Cluster validation with generalized Dunn's indices". Proceedings 1995 Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. IEEE Xplore. pp. 190–193. doi:10.1109/ANNES.1995.499469. ISBN 0-8186-7174-2. S2CID 7816379.

[1] Dunn, J. C. (1973-09-17). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–57. doi:10.1080/01969727308546046. S2CID 120919314.

[2] Dunn, J. C. (1973-09-01). "Well-Separated Clusters and Optimal Fuzzy Partitions". Journal of Cybernetics. 4 (1) (published 1974): 95–104. doi:10.1080/01969727408546059. ISSN 0022-0280.

[3] "MATLAB implementation of the Dunn Index". Retrieved 5 December 2011.

[4] Lukasz, Nieweglowski. "Package 'clv'" (PDF). R project. CRAN. Retrieved 2 April 2013.

[5] "Apache Mahout". Apache Software Foundation. Retrieved 9 May 2013.

[1]

[2]

[3]

[4]

[5]

v t e Machine learning evaluation metrics
Regression	MSE MAE sMAPE MAPE MASE MSPE RMS RMSE/RMSD R² MDA MAD
Classification	F-score P4 Accuracy Precision Recall Kappa MCC AUC ROC Sensitivity and specificity Logarithmic Loss
Clustering	Silhouette Calinski-Harabasz index Davies-Bouldin Dunn index Hopkins statistic Jaccard index Rand index Similarity measure SMC SimHash
Ranking	MRR NDCG AP
Computer Vision	PSNR SSIM IoU
NLP	Perplexity BLEU
Deep Learning Related Metrics	Inception score FID
Recommender system	Coverage Intra-list Similarity
Similarity	Cosine similarity Euclidean distance Pearson correlation coefficient
Confusion matrix