Clustering techniques

[Figure 1. The standard k-means clustering algorithm.]

The two general categories of clustering techniques are pairwise clustering and central clustering.5 The former, also called similarity-based clustering, groups similar data instances together on the basis of a data-pairwise proximity measure. Graph-partitioning-based clustering and single-link hierarchical agglomerative clustering are two commonly used techniques in this category. Central clustering, also called centroid-based or model-based clustering, represents each cluster using a model: the cluster centroid. Regular k-means clustering, mixture-of-models clustering, and some more sophisticated variations belong to this category. Centroid-based clustering algorithms are often more efficient than similarity-based clustering algorithms, which is why we chose to use them. Similarity-based algorithms usually have a complexity of at least O(N^2) (for computing the data-pairwise proximity measures), where N is the number of data instances. In contrast, centroid-based algorithms are more scalable, with a complexity of O(NKM), where K is the number of clusters and M the number of batch iterations. We study two clustering techniques, k-means and Neural-Gas,6 for grouping unlabeled modules of a high-assurance software system.
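To make the complexity contrast concrete, the following NumPy sketch compares the N x N proximity matrix that similarity-based methods start from with the N x K distance computation a centroid-based method needs per iteration. The sizes and variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 8))                                 # N = 1,000 instances, 8 features
centroids = X[rng.choice(1000, size=10, replace=False)]   # K = 10 centroids

# Similarity-based (pairwise) clustering starts from an N x N proximity
# matrix, which already costs O(N^2) time and memory to build.
proximity = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Centroid-based clustering only needs N x K distances per iteration,
# giving O(NKM) overall for M batch iterations.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

print(proximity.shape, distances.shape)  # (1000, 1000) vs. (1000, 10)
```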
k-means

Figure 1 shows the widely used standard k-means algorithm. The objective function it minimizes is given by the mean squared error,

$$E = \sum_{x} \left\| x - \mu_{y(x)} \right\|^{2},$$

where $\mu_{y(x)}$ is the cluster centroid closest to data vector $x$.
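A minimal NumPy sketch of the batch k-means loop of Figure 1, assuming Euclidean distance and random initialization from the data; the function and variable names are ours, not from the paper.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Batch k-means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random init
    for _ in range(max_iters):
        # Assignment step: each instance joins its closest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable; E can decrease no further
        centroids = new_centroids
    mse = ((X - centroids[labels]) ** 2).sum()  # the objective E above
    return centroids, labels, mse
```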
Neural-Gas
The Neural-Gas clustering algorithm is a competitive learning technique that uses the SoftMax learning rule. That is, multiple centroids get updated whenever the algorithm visits a data instance, and the amount of update depends on the distance between the data instance and the cluster centroid. This contrasts with the Winner-Take-All rule used in the k-means algorithm, where only the closest centroid gets updated every time an instance is visited. The Neural-Gas algorithm is closely related to the self-organizing map (SOM) and maximum entropy clustering. In both, a deterministic annealing mechanism is built into the learning and optimization process to avoid the many locally optimal solutions.
Let $x$ be a data vector, $y$ the cluster index, and $\mu_y$ the centroid (mean) of cluster $y$. The Neural-Gas algorithm's batch version amounts to iterating between the following two steps:

$$P(y \mid x) = \frac{e^{-r(x,\,y)/\beta}}{\sum_{y'} e^{-r(x,\,y')/\beta}}, \qquad \mu_y = \frac{\sum_{x} P(y \mid x)\, x}{\sum_{x} P(y \mid x)},$$

where $\beta$ is an equivalent temperature parameter that controls the error surface's smoothness and $r(x, y)$ is a rank function that takes the value $k - 1$ if $y$ is the $k$th-closest cluster centroid to data vector $x$. So, clusters are ranked according to their distances to data instance $x$, and the assignment probability $P(y \mid x)$ decreases exponentially as cluster $y$'s rank worsens.
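The two steps above translate directly into a short batch implementation. This is a minimal sketch assuming Euclidean distance and, for brevity, a fixed temperature beta rather than an annealing schedule; all names are illustrative.

```python
import numpy as np

def neural_gas_batch(X, K, beta=1.0, max_iters=100, seed=0):
    """Batch Neural-Gas: rank-based SoftMax assignments update all centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # r(x, y): 0 for the closest centroid, 1 for the second closest, ...
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        ranks = dists.argsort(axis=1).argsort(axis=1)
        # Assignment step: P(y | x) decays exponentially with rank.
        P = np.exp(-ranks / beta)
        P /= P.sum(axis=1, keepdims=True)
        # Update step: each centroid is the P-weighted mean of all instances.
        # (beta is held fixed here; annealing would gradually lower it.)
        centroids = (P.T @ X) / P.sum(axis=0)[:, None]
    return centroids
```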
Both the k-means and Neural-Gas algorithms have online versions, in which the cluster centroids are updated incrementally as each data vector is supplied to the algorithm. The online update rules can be derived as a gradient-descent approach. Earlier work shows that an online version of the Neural-Gas algorithm converges faster and finds better local solutions than the SOM and maximum-entropy-clustering techniques on certain problems.6
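A sketch of the two online update rules, contrasting Winner-Take-All (k-means) with the rank-weighted SoftMax update (Neural-Gas). The learning-rate choices are illustrative assumptions, and centroids is assumed to be a float array.

```python
import numpy as np

def online_kmeans_step(x, centroids, counts):
    """Winner-Take-All: only the closest centroid moves toward x."""
    y = np.linalg.norm(centroids - x, axis=1).argmin()
    counts[y] += 1
    centroids[y] += (x - centroids[y]) / counts[y]  # gradient step on E
    return centroids

def online_neural_gas_step(x, centroids, lr=0.05, beta=1.0):
    """SoftMax: every centroid moves toward x, weighted by its rank."""
    ranks = np.linalg.norm(centroids - x, axis=1).argsort().argsort()
    weights = np.exp(-ranks / beta)
    centroids += lr * weights[:, None] * (x - centroids)
    return centroids
```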