- Batch Self-Organizing Map (batchSOM) Clustering [Wehrens/Buydens, 2007]
- ADP Clustering (Clustering by fast search and find of density peaks) [Rodriguez/Laio, 2014]
- Affinity Propagation (AP) Clustering [Frey/Dueck, 2007]
- Databionic swarm (DBS) [Thrun, M. C., 2018]
- DBscan [Ester et al., 1996]
- Fuzzy Clustering (Fanny) [Rousseeuw/Kaufman, 1990]
- Markov Clustering [Van Dongen, 2000]
- Model-based Clustering (Mixture of Gaussians - MoG) [Fraley/Raftery, 2002, 2006]
- Spectral Clustering [Ng et al., 2002]
- Quality Threshold (QT) Clustering [Heyer et al., 1999; Scharl/Leisch, 2006]
- Self-organizing Tree Algorithm (SOTA) [Herrero et al., 2001]
- Clustering Large Applications (CLARA) [Rousseeuw/Kaufman, 1990]
- Convex Clustering:
- Neural Gas Clustering [Martinetz et al., 1993]
- On-line Update Hard Competitive learning (HCL Clustering) [Dimitriadou, 2002]
- Partitioning Around Medoids (PAM) [Rousseeuw/Kaufman, 1990]
- k-means Clustering (LBG algorithm) [Linde et al., 1980]
- Subspace Clustering:
- Orclus Clustering [Aggarwal/Yu, 2000]
- ProClus Clustering [Aggarwal et al., 1999]
- Ward.D [Murtagh/Legendre, 2014]
- Ward.D2 [Ward Jr, 1963]
- Single linkage [Florek et al., 1951; Sokal/Sneath, 1963]
- Complete linkage [Defays, 1977; G. N. Lance/Williams, 1967]
- Average linkage [Sokal/Michener, 1958]
- McQuitty linkage [McQuitty, 1966]
- Median linkage [Everitt et al., 2011; G. Lance/Williams, 1966]
- Centroid linkage [Sokal/Michener, 1958]
- Divisive Analysis Clustering (DIANA) [Rousseeuw/Kaufman, 1990]
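
Many of the listed algorithm families have open-source implementations. As a minimal sketch (using scikit-learn, which is an assumption for illustration only and not necessarily the implementation used in this benchmark), three of the families above can be compared on the same toy data:

```python
# Illustrative sketch: k-means, DBSCAN, and Ward linkage (three of the listed
# families) applied to the same toy data via scikit-learn.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# three well-separated Gaussian blobs with known labels
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

labels = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    "Ward": AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X),
}
for name, lab in labels.items():
    # adjusted Rand index against the generating labels
    print(name, round(adjusted_rand_score(y, lab), 3))
```

On such clearly separated data all three families should agree closely with the generating partition; the data sets below are designed precisely so that they do not.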

Benchmarking Cluster Analysis Methods

The 27 clustering algorithms listed above will be evaluated here on 9 data sets from the Fundamental Clustering Problems Suite (FCPS) [Ultsch, 2005a] and on 6 high-dimensional data sets.


Hepta

The Hepta data set [Ultsch, 2003a] is used to illustrate the general problems of quality
measures (QMs) and projections from the perspective of structure preservation. The three-dimensional
Hepta data set consists of seven clusters that are clearly separated by distance, one
of which has a much higher density. The data set consists of 212 points: seven
clusters of thirty points each plus two additional points in the central cluster. The centroids of
the clusters span the coordinate axes of R^{3}. The density of the central cluster is almost twice as
high as that of the other six clusters. The structure of the data set is clearly defined by
distances and is compact.
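
A Hepta-like structure can be imitated with a short script. This is a hedged sketch, not the original data: the cluster positions and sizes follow the description above, but the spreads are assumptions, and scikit-learn's Ward linkage stands in for the hierarchical methods listed earlier.

```python
# Hypothetical re-creation of a Hepta-like data set: seven Gaussian clusters
# whose centroids span the coordinate axes of R^3, with a denser central cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
centroids = np.vstack([np.zeros(3),             # central cluster at the origin
                       np.eye(3) * 4,           # +x, +y, +z
                       -np.eye(3) * 4])         # -x, -y, -z
sizes = [32] + [30] * 6                         # central cluster holds two extra points
stds = [0.3] + [0.6] * 6                        # central cluster is markedly denser

X = np.vstack([rng.normal(c, s, size=(n, 3))
               for c, s, n in zip(centroids, stds, sizes)])
y = np.repeat(np.arange(7), sizes)              # 212 points in total

# distance-defined, compact clusters: Ward linkage recovers them easily
labels = AgglomerativeClustering(n_clusters=7, linkage="ward").fit_predict(X)
print(round(adjusted_rand_score(y, labels), 3))
```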

Chainlink

The Chainlink data set [Ultsch, 1995; Ultsch et al., 1994] consists of two clusters in R^{3}. Together,
the two clusters form intricate links of a chain, and therefore, they cannot be separated
by linear decision boundaries [Herrmann, 2011, pp. 99-100]. The rings are cohesive in R^{3};
however, many projections are not. This data set serves as an excellent demonstration of several
challenges facing projection methods: The data lie on two well-separated manifolds such that
the global proximities contradict the local ones in the sense that the center of each ring is closer
to some elements of the other cluster than to elements of its own cluster.
The two rings are intertwined in R^{3} and have the same average distances and densities. Each cluster contains 500 points.
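
A Chainlink-like geometry is easy to approximate (the radii, offset, and noise level below are assumptions, not the original parameters). The sketch illustrates why linear decision boundaries fail: k-means mixes the rings, while spectral clustering on a nearest-neighbor graph separates them.

```python
# Two interlocked noisy circles in R^3: no plane can separate them.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
t1 = rng.uniform(0, 2 * np.pi, 500)
t2 = rng.uniform(0, 2 * np.pi, 500)
# ring 1 in the x-y plane; ring 2 in the x-z plane, shifted so they interlock
ring1 = np.column_stack([np.cos(t1), np.sin(t1), np.zeros(500)])
ring2 = np.column_stack([1 + np.cos(t2), np.zeros(500), np.sin(t2)])
X = np.vstack([ring1, ring2]) + rng.normal(0, 0.05, (1000, 3))
y = np.repeat([0, 1], 500)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sp = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)
# linear boundaries cut across both rings; the neighborhood graph follows them
print("k-means ARI:", round(adjusted_rand_score(y, km), 3))
print("spectral ARI:", round(adjusted_rand_score(y, sp), 3))
```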

Atom

The Atom data set [Ultsch, 2005c] consists of two clusters in R^{3}. The first cluster is completely enclosed by the
second one and therefore cannot be separated by linear decision boundaries. Additionally, the two clusters have
different densities and variances. The Atom data set consists of a dense core of 400 points surrounded by a well-separated
but sparse hull of 400 points. The clusters are not linearly separable, and many algorithms cannot
construct a cohesive projection. The core is located at the center of the hull, which makes the data set hard to
solve for some methods based on averaging. The density of the core is much higher than that of the hull. For data in
the hull, some of the inner-cluster distances are larger than the distances to the other cluster.
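
An Atom-like geometry can be sketched as a dense core inside a sparse shell (the radii, spreads, and the DBSCAN parameters below are illustrative assumptions). A density-based method such as DBSCAN can exploit the density gap that linear boundaries cannot:

```python
# Dense Gaussian core enclosed by a sparse spherical hull.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
core = rng.normal(0, 0.3, (400, 3))                        # dense core at the origin
v = rng.normal(size=(400, 3))
hull = 3.0 * v / np.linalg.norm(v, axis=1, keepdims=True)  # points on a sphere, radius 3
hull += rng.normal(0, 0.1, (400, 3))                       # thin, sparse shell
X = np.vstack([core, hull])
y = np.repeat([0, 1], 400)

# eps must bridge neighbor gaps inside the hull but not the core-hull gap
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
print(round(adjusted_rand_score(y, labels), 3))
```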

EngyTime

The EngyTime data set [Baggenstoss, 2002] contains 4,096 points belonging to two clusters in
R^{2}; the data set is typical for sonar applications, with the variables “Engy” and “Time” forming a two-dimensional
mixture of Gaussians. The clusters overlap, and cluster borders can be defined only
by using density information. There is no empty space between the clusters.
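
The overlapping-Gaussians situation can be sketched with a two-component mixture model (the means and covariances below are illustrative assumptions, not the EngyTime parameters). Model-based clustering (a mixture of Gaussians) defines the border through density rather than distance:

```python
# Two overlapping 2-D Gaussians; a Gaussian mixture recovers a density border.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], 2048)
b = rng.multivariate_normal([2.5, 2.5], [[1.0, -0.4], [-0.4, 1.0]], 2048)
X = np.vstack([a, b])                      # 4,096 points, like EngyTime
y = np.repeat([0, 1], 2048)

labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, labels)
print(round(ari, 3))                       # imperfect because the clusters overlap
```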

Lsun3D

The Lsun3D data set consists of three well-separated clusters and four outliers in R^{3}; it is based
on the two-dimensional Lsun data set of Moutarde and Ultsch [Moutarde/Ultsch, 2005]. Two
of the clusters contain 100 points each, and the third contains 200 points. “The inter-cluster
minimum distances, however, are in the same range as or even smaller than the inner-cluster
mean distances” [Moutarde/Ultsch, 2005, p. 28]. The data set consists of 404 data points and
was not preprocessed.

Target

The Target data set [Ultsch, 2005c] consists of two main clusters and four groups of three outliers
each. The first main cluster is a disc of 363 points, and the second cluster is a ring around
it consisting of 395 points. The data set as a whole consists of 770 points in R^{2}. The
main challenge of this data set is the four groups of outliers in the four corners.

Tetra

The Tetra data set, which is part of the FCPS, consists of 400 data points in four clusters in R^{3}
that have large intracluster distances [Ultsch, 2005c]. The clusters nearly touch each
other, resulting in small intercluster distances.

TwoDiamonds

“The data consists of two clusters of two-dimensional points. Inside each “diamond” the values for each data point were drawn independently from uniform distributions” [Ultsch, 2003c, p. 8]. The clusters contain 300 points each. “[In] [e]ach cluster[, the] points are uniformly distributed within a square, and at one point the two squares almost touch. This data set is critical for clustering algorithms using only distances” [Moutarde/Ultsch, 2005, p. 28].
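
The almost-touching squares can be sketched directly (the gap width below is an assumption). k-means separates them because the cluster means stay far apart, whereas a purely distance-based method such as single linkage depends on the narrow gap staying larger than the within-cluster nearest-neighbor distances:

```python
# Two uniform squares that almost touch at one edge.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
sq1 = rng.uniform(0, 1, (300, 2))
sq2 = rng.uniform(0, 1, (300, 2)) + [1.05, 0.0]   # shifted so the squares almost touch
X = np.vstack([sq1, sq2])
y = np.repeat([0, 1], 300)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sl = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
print("k-means ARI:", round(adjusted_rand_score(y, km), 3))
print("single-linkage ARI:", round(adjusted_rand_score(y, sl), 3))
```

The single-linkage result varies with the sample: once the gap shrinks toward the typical nearest-neighbor spacing, the chaining effect can merge the squares.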

WingNut

“The Wing Nut dataset […] consists [of] two symmetric data subsets of 500 points each. Each of these subsets is an overlay of equal[ly] spaced points with a lattice distance of 0.2 and random points with a growing density in one corner. The data sets are mirrored and shifted such that the gap between the subsets is larger than 0.3. Although there is a bigger distance in between the subsets than within the data of a subset, clustering algorithms like Kmeans parameterized with the right number of clusters (k=2) produce classification errors” [Moutarde/Ultsch, 2005, pp. 27-28].
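
The quoted construction can be approximated as follows; the exact density gradient, mirroring, and point counts are assumptions based only on the description above, so this is a rough sketch rather than the original WingNut data.

```python
# WingNut-style construction: a 0.2-spaced lattice overlaid with random points
# whose density grows toward one corner; the second subset is a mirrored,
# shifted copy with a gap larger than 0.3.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
g = np.arange(0.0, 2.01, 0.2)
lattice = np.array([(x, y) for x in g for y in g])        # 11 x 11 grid, spacing 0.2
# extra random points, denser toward one corner (rejection sampling)
cand = rng.uniform(0, 2, (3000, 2))
keep = cand[rng.uniform(size=3000) < (cand[:, 0] + 2 - cand[:, 1]) / 4]
subset1 = np.vstack([lattice, keep[:379]])                # 500 points in total
subset2 = subset1 * [1, -1] + [0, -0.35]                  # mirrored and shifted; gap 0.35
X = np.vstack([subset1, subset2])
y = np.repeat([0, 1], len(subset1))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
errors = min((labels != y).sum(), (labels == y).sum())    # label-permutation safe
print("misclassified points:", errors)
```

Whether k-means misclassifies here depends on how strongly the density gradient pulls the centroids; the original WingNut is tuned so that it does.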

Breast cancer

The Breast Cancer data set was taken from the CRAN package “mlbench”. It consists of 9 nominally scaled features for 699 cases having either benign or malignant tumor cells. The samples arrived periodically as Dr. Wolberg reported his clinical cases [41]. Each variable except for the first was converted into 11 primitive numerical attributes with values ranging from 0 through 10. There are 16 missing attribute values, which were KNN imputed with k=7. Unique IDs were chosen as cases, and a small amount of uniformly distributed noise was added to prevent distances from being zero. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] was applied. The resulting data set had 645 cases, of which 413 are benign and 232 are malignant.
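
The two preprocessing steps described above can be sketched on synthetic data: k-nearest-neighbor imputation with k=7 followed by a range-based normalization. The range variant, (x - min) / (max - min), is one reading of the Milligan/Cooper recommendation; the exact variant used in the benchmark may differ.

```python
# Sketch of the preprocessing: KNN imputation (k=7), then range normalization.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.integers(0, 11, size=(100, 9)).astype(float)   # 9 features scaled 0..10
X[rng.uniform(size=X.shape) < 0.02] = np.nan           # a few missing values

X_imp = KNNImputer(n_neighbors=7).fit_transform(X)     # KNN imputation with k=7
# range normalization: keeps any single feature from dominating the distances
X_norm = (X_imp - X_imp.min(axis=0)) / (X_imp.max(axis=0) - X_imp.min(axis=0))
print(X_norm.shape)
```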

Iris

“Anderson’s [Anderson, 1935] Iris data set was made famous by Fisher [Fisher, 1936], who used it to exemplify his linear discriminant analysis. It has since served to demonstrate the performance of many clustering algorithms” [G. Ritter, 2014, p. 220]. The Iris data set consists of 150 data points in R^{4} with a prior classification and describes the geographic variation of Iris flowers. The data set consists of 50 samples from each of three species of Iris flowers, namely, Iris setosa, Iris virginica and Iris versicolor. Four features were measured for each sample: the lengths and widths of the sepals and petals (see [Herrmann, 2011, pp. 99-100]). The observations have “only two digits of precision preventing general position of the data” [G. Ritter, 2014, p. 220], and “observations 102 and 142 are even equal” [G. Ritter, 2014, p. 220]. The I. setosa cluster is well separated, whereas the I. virginica and I. versicolor clusters slightly overlap (see [Herrmann, 2011, pp. 99-100]). This presents “a challenge for any sensitive classifier” [G. Ritter, 2014, p. 220].
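
The separability described above can be checked directly (here with scikit-learn's bundled copy of the data and k-means as a stand-in for the benchmarked algorithms): I. setosa is recovered cleanly, while the overlap between I. versicolor and I. virginica keeps the partition imperfect.

```python
# k-means (k=3) on the four raw Iris features.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, y = iris.data, iris.target                 # y: 0=setosa, 1=versicolor, 2=virginica
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, labels)
print(round(ari, 3))                          # below 1.0 due to the species overlap
```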

Leukemia

The anonymized leukemia data set consists of 12,692 gene expressions from 554 subjects and is available from a previous publication [Haferlach et al., 2010]. Each gene expression is a logarithmic luminance intensity (presence call), which was measured using Affymetrix technology. The presence calls are related to the number of specific RNAs in a cell, which signals how active a specific gene is. Of the subjects, 109 were healthy, 15 were diagnosed with acute promyelocytic leukemia (APL), 266 had chronic lymphocytic leukemia (CLL), and 164 had acute myeloid leukemia (AML). “The study design adhered to the tenets of the Declaration of Helsinki and was approved by the ethics committees of the participating institutions before its initiation” [Haferlach et al., 2010, p. 2530]. The leukemia data set was preprocessed, resulting in a high-dimensional data set with 7,747 variables and 554 data points separated into natural clusters, as determined by the illness status and defined by discontinuities (see chapter 2). Additionally, patient consent was obtained for the data set in accordance with the Declaration of Helsinki, and the Marburg local ethics board approved the study (No. 138/16).

The algorithms model-based clustering (MoG) and ORCLUS were not able to process this data set, and no valid radius could be estimated for QT clustering. Due to the high dimensionality, many clustering algorithms compute their results very slowly; thus, only ten trials per algorithm were computed.

Swiss banknotes

“The idea is to produce bills at a cost substantially lower than the imprinted number. This calls for a compromise and forgeries are not perfect” [G. Ritter, 2014, pp. 223-224]. “If a bank note is suspect but refined, then it is sent to a money-printing company, where it is carefully examined with regard to printing process, type of paper, water mark, colors, composition of inks, and more. Flury and Riedwyl [Flury/Riedwyl, 1988] had the idea to replace the features obtained from the sophisticated equipment needed for the analysis with simple linear dimensions” [G. Ritter, 2014, p. 224]. The Swiss Banknotes data set consists of six variables measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. The variables are the length of the bank note, the height of the bank note (measured on the left side), the height of the bank note (measured on the right side), the distance from the inner frame to the lower border, the distance from the inner frame to the upper border and the length of the diagonal. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances.

Pan Cancer

The Pan-Cancer data set consists of 801 subjects with 20,531 gene expressions extracted at random, and it is part of the RNA-Seq (HiSeq) PANCAN data set, which is maintained by The Cancer Genome Atlas Pan-Cancer analysis project [39]. The data set was taken from the UCI Machine Learning Repository [40]. An Illumina HiSeq platform measured the RNA-Seq gene expression levels. The subjects have different types of tumors: BRCA (300 subjects), KIRC (146 subjects), COAD (78 subjects), LUAD (141 subjects) and PRAD (136 subjects). Gene expressions that were zero for all subjects were disregarded. The data set was decorrelated and robustly z-transformed. After preprocessing, the high-dimensional data set had 18,617 dimensions for 801 cases.
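
Two of the preprocessing steps described above can be sketched on synthetic expression data: dropping genes that are zero for every subject, then robustly z-transforming each remaining gene. Median/MAD scaling is an assumption about the exact robust variant used, and the decorrelation step is omitted here.

```python
# Sketch of the Pan-Cancer preprocessing on synthetic expression values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 3.0, size=(801, 200))          # 801 subjects, 200 toy "genes"
X[:, :10] = 0.0                                   # ten genes that are zero everywhere

X = X[:, (X != 0).any(axis=0)]                    # drop all-zero genes
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0) * 1.4826 # MAD scaled to be sigma-consistent
X_z = (X - med) / mad                             # robust z-transform per gene
print(X_z.shape)
```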

The algorithms model-based clustering (MoG) and ORCLUS were not able to process this data set, and no valid radius could be estimated for QT clustering. Due to the high dimensionality, many clustering algorithms compute their results very slowly; thus, only ten trials per algorithm were computed.

Wine

The Wine data set [Aeberhard et al., 1992] is a 13-dimensional, real-valued data set. It consists of chemical measurements of wines grown in the same region in Italy but derived from three different cultivars. The robust normalization of Milligan and Cooper [Milligan/Cooper, 1988] is applied to prevent a few features from dominating the obtained distances.

References

[Ultsch et al., 1994] Ultsch, A., Guimaraes, G., Korus, D., & Li, H.: Knowledge extraction from artificial neural networks and applications, Parallele Datenverarbeitung mit dem Transputer, pp. 148-162, Springer, 1994.

[Ultsch, 1995] Ultsch, A.: Self organizing neural networks perform different from statistical k-means clustering, Proc. Society for Information and Classification (GfKl), Basel, 8th-10th March, 1995.

[Ultsch, 2003a] Ultsch, A.: Maps for the visualization of high-dimensional data spaces, Proc. Workshop on Self-Organizing Maps (WSOM), pp. 225-230, Kyushu, Japan, 2003.

[Ultsch, 2005a] Ultsch, A.: Clustering with SOM: U*C, Proceedings of the 5th Workshop on Self-Organizing Maps (WSOM), Vol. 2, pp. 75-82, 2005.

This research is currently an extension of earlier work; please cite it accordingly. The benchmarking will someday be published in a separate peer-reviewed publication.