Clustering techniques

[Figure 1. The standard k-means clustering algorithm.]

The two general categories of clustering techniques are pairwise clustering and central clustering.5 The former, also called similarity-based clustering, groups similar data instances together on the basis of a data-pairwise proximity measure. Graph-partitioning-based clustering and single-link hierarchical agglomerative clustering are two commonly used techniques in this category. Central clustering, also called centroid-based or model-based clustering, represents each cluster using a model: the cluster centroid. Regular k-means clustering, mixture-of-models clustering, and some more sophisticated variations belong to this category. Centroid-based clustering algorithms are often more efficient than similarity-based clustering algorithms, which is why we chose to use them. Similarity-based algorithms usually have a complexity of at least O(N^2) (for computing the data-pairwise proximity measures), where N is the number of data instances. In contrast, centroid-based algorithms are more scalable, with a complexity of O(NKM), where K is the number of clusters and M the number of batch iterations. We study two clustering techniques, k-means and Neural-Gas,6 for grouping unlabeled modules of a high-assurance software system.
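To make the complexity contrast concrete, the following NumPy sketch compares the N x N proximity matrix that similarity-based methods start from with the N x K distance computation a centroid-based method needs per iteration. The sizes and variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 8))                                 # N = 1,000 instances, 8 features
centroids = X[rng.choice(1000, size=10, replace=False)]   # K = 10 centroids

# Similarity-based (pairwise) clustering starts from an N x N proximity
# matrix, which already costs O(N^2) time and memory to build.
proximity = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Centroid-based clustering only needs N x K distances per iteration,
# giving O(NKM) overall for M batch iterations.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

print(proximity.shape, distances.shape)  # (1000, 1000) vs. (1000, 10)
```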
k-means

Figure 1 shows the widely used standard k-means algorithm. The objective function it minimizes is given by the mean squared error,

$$E = \sum_{x} \left\| x - \mu_{y(x)} \right\|^{2},$$

where $\mu_{y(x)}$ is the cluster centroid closest to data vector $x$.
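A minimal NumPy sketch of the batch k-means loop of Figure 1, assuming Euclidean distance and random initialization from the data; the function and variable names are ours, not from the paper.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Batch k-means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random init
    for _ in range(max_iters):
        # Assignment step: each instance joins its closest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable; E can decrease no further
        centroids = new_centroids
    mse = ((X - centroids[labels]) ** 2).sum()  # the objective E above
    return centroids, labels, mse
```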
Neural-Gas
The Neural-Gas clustering algorithm is a competitive learning technique that uses the SoftMax learning rule. That is, multiple centroids get updated whenever the algorithm visits a data instance, and the amount of update depends on the distance between the data instance and the cluster centroid. This contrasts with the Winner-Take-All rule used in the k-means algorithm, where only the closest centroid gets updated every time an instance is visited. The Neural-Gas algorithm is closely related to the self-organizing map (SOM) and maximum entropy clustering. In both, a deterministic annealing mechanism is built into the learning and optimization process to avoid the many locally optimal solutions.
Let $x$ be a data vector, $y$ the cluster index, and $\mu_y$ the centroid (mean) of cluster $y$. The Neural-Gas algorithm's batch version amounts to iterating between the following two steps:

$$P(y \mid x) = \frac{e^{-r(x,\,y)/\beta}}{\sum_{y'} e^{-r(x,\,y')/\beta}}, \qquad \mu_y = \frac{\sum_{x} P(y \mid x)\, x}{\sum_{x} P(y \mid x)},$$

where $\beta$ is an equivalent temperature parameter that controls the error surface's smoothness and $r(x, y)$ is a rank function that takes the value $k - 1$ if $y$ is the $k$th-closest cluster centroid to data vector $x$. So, clusters are ranked according to their distances to data instance $x$, and the assignment probability $P(y \mid x)$ decreases exponentially as cluster $y$'s rank worsens.
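The two steps above translate directly into a short batch implementation. This is a minimal sketch assuming Euclidean distance and, for brevity, a fixed temperature beta rather than an annealing schedule; all names are illustrative.

```python
import numpy as np

def neural_gas_batch(X, K, beta=1.0, max_iters=100, seed=0):
    """Batch Neural-Gas: rank-based SoftMax assignments update all centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # r(x, y): 0 for the closest centroid, 1 for the second closest, ...
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        ranks = dists.argsort(axis=1).argsort(axis=1)
        # Assignment step: P(y | x) decays exponentially with rank.
        P = np.exp(-ranks / beta)
        P /= P.sum(axis=1, keepdims=True)
        # Update step: each centroid is the P-weighted mean of all instances.
        # (beta is held fixed here; annealing would gradually lower it.)
        centroids = (P.T @ X) / P.sum(axis=0)[:, None]
    return centroids
```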
Both the k-means and Neural-Gas algorithms have online versions, in which the cluster centroids are updated incrementally as each data vector is supplied to the algorithm. The online update rules can be derived as a gradient-descent approach. Earlier work shows that an online version of the Neural-Gas algorithm converges faster and finds better local solutions than the SOM and maximum-entropy-clustering techniques on certain problems.6
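A sketch of the two online update rules, contrasting Winner-Take-All (k-means) with the rank-weighted SoftMax update (Neural-Gas). The learning-rate choices are illustrative assumptions, and centroids is assumed to be a float array.

```python
import numpy as np

def online_kmeans_step(x, centroids, counts):
    """Winner-Take-All: only the closest centroid moves toward x."""
    y = np.linalg.norm(centroids - x, axis=1).argmin()
    counts[y] += 1
    centroids[y] += (x - centroids[y]) / counts[y]  # gradient step on E
    return centroids

def online_neural_gas_step(x, centroids, lr=0.05, beta=1.0):
    """SoftMax: every centroid moves toward x, weighted by its rank."""
    ranks = np.linalg.norm(centroids - x, axis=1).argsort().argsort()
    weights = np.exp(-ranks / beta)
    centroids += lr * weights[:, None] * (x - centroids)
    return centroids
```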