Clustering and quality estimation results
Table 3 shows the clustering-quality results. The Neural-Gas algorithm performs significantly better in terms of mean squared error and comparably in terms of average purity. The k-means algorithm, however, runs much faster. The runtime results are recorded on a 3.06-GHz Pentium 4 PC with 1 Gbyte of memory running Windows XP.

Table 3. Clustering results for 30 JM1-8850 clusters and 20 KC2-520 clusters.

Data set  | Technique  | Mean squared error | Average purity | Time (seconds)
JM1-8850  | k-means    | 3,342.7            | 0.746          | 0.966
JM1-8850  | Neural-Gas | 1,574.4            | 0.727          | 9.928
KC2-520   | k-means    | 738.3              | 0.808          | 0.016
KC2-520   | Neural-Gas | 244.0              | 0.815          | 0.375

Figure 2. Clustering- and expert-based classification results for (a) JM1-8850 and (b) KC2-520.
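The two quality measures reported in Table 3 are straightforward to compute from a finished clustering. The sketch below is a minimal illustration, not the authors' code; it assumes NumPy arrays and the common size-weighted definition of average purity, which may differ slightly from the paper's exact definition.

```python
import numpy as np

def clustering_quality(X, labels, assignments, centroids):
    """Mean squared error (quantization error) and average purity.

    X           : (n, d) array of software-metric vectors
    labels      : (n,) true module labels (e.g., fault prone or not)
    assignments : (n,) cluster index assigned to each module
    centroids   : (k, d) array of cluster centers
    """
    # MSE: mean squared distance from each point to its cluster center.
    mse = np.mean(np.sum((X - centroids[assignments]) ** 2, axis=1))

    # Purity: each cluster contributes the count of its majority label;
    # dividing by n weights every cluster by its size.
    n = len(X)
    purity = 0.0
    for c in np.unique(assignments):
        member_labels = labels[assignments == c]
        _, counts = np.unique(member_labels, return_counts=True)
        purity += counts.max() / n
    return mse, purity

# Toy example: two tight clusters, one containing a mislabeled point.
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels = np.array([0, 0, 1, 0])
assignments = np.array([0, 0, 1, 1])
centroids = np.array([[0., 0.5], [10., 10.5]])
mse, purity = clustering_quality(X, labels, assignments, centroids)
# mse -> 0.25, purity -> 0.75
```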
Figure 2 reports the overall classification error, FPRs, and FNRs for both the k-means and Neural-Gas algorithms, compared to a decision-tree-based classifier, C4.5. We chose C4.5 for comparison because it's commonly used and known for its robust classification accuracy. Because the FPR and FNR are inversely related for a given classification technique, we obtained the C4.5 results by adjusting parameters (such as tree depth and pruning ratio) to achieve FPRs similar to those of the two clustering techniques.
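For concreteness, the three quantities plotted in Figure 2 follow directly from a confusion matrix. The sketch below is a generic illustration rather than the paper's evaluation code; the class names "fp" (fault prone) and "nfp" (not fault prone) are placeholders.

```python
def error_rates(y_true, y_pred, positive="fp"):
    """Overall error, false-positive rate, and false-negative rate.

    'positive' denotes the fault-prone class: an FP is a not-fault-prone
    module predicted fault prone; an FN is a fault-prone module predicted
    not fault prone.
    """
    pairs = list(zip(y_true, y_pred))
    pos = [(t, p) for t, p in pairs if t == positive]
    neg = [(t, p) for t, p in pairs if t != positive]
    fn = sum(1 for t, p in pos if p != positive)  # missed fault-prone modules
    fp = sum(1 for t, p in neg if p == positive)  # false alarms
    overall = (fn + fp) / len(pairs)
    fpr = fp / len(neg) if neg else 0.0
    fnr = fn / len(pos) if pos else 0.0
    return overall, fpr, fnr

# Small worked example: one miss and one false alarm out of five modules.
y_true = ["fp", "fp", "nfp", "nfp", "nfp"]
y_pred = ["fp", "nfp", "nfp", "fp", "nfp"]
overall, fpr, fnr = error_rates(y_true, y_pred)
# overall -> 0.4, fpr -> 1/3, fnr -> 0.5
```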
Neural-Gas performed slightly worse for JM1-8850 and only slightly better for KC2-520 than k-means in terms of the overall error rate, although Table 3 gives a significantly lower MSE value. This is likely due to noise in the data sets (wrong labels or insufficient attributes). The error numbers are comparable to the C4.5 results, suggesting that clustering- and expert-based software quality classification is a viable option (compared to supervised learning) when software quality data isn't available.
It’s worth mentioning that classifying the JM1-8850 data set is difficult, even for many state-of-the-art classifiers. For example, using the LIBSVM software package (available at www.csie.ntu.edu.tw/~cjlin/libsvm) with a twofold cross-validation setting, the support-vector-machine method achieves a 20 percent overall error rate, with an FPR of 0 but an FNR of 98 percent. That is, the support-vector-machine method classifies almost all data as not fault prone. We saw similar results for KC2-520. However, the difficulty in classifying KC2-520 was relatively lower than for JM1-8850, as Figure 2 shows. The promising accuracy results in Figure 2 warrant further investigation into building clustering- and expert-based software quality-prediction systems.
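The outline of such an experiment is easy to reproduce. The sketch below uses scikit-learn's SVC, which is built on LIBSVM, rather than the LIBSVM package directly, and runs on synthetic data standing in for the real JM1-8850 metrics; both substitutions are our assumptions. With a skewed class distribution and features carrying little class signal, the SVM tends to collapse to the majority class, which is the behavior described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for a skewed software-metrics data set: roughly
# 80 percent not-fault-prone (0) and 20 percent fault-prone (1) modules,
# with features that carry almost no class information.
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.2).astype(int)

# Twofold cross-validation, as in the LIBSVM experiment described above.
pred = cross_val_predict(SVC(), X, y, cv=StratifiedKFold(n_splits=2))

overall_error = np.mean(pred != y)
fnr = np.mean(pred[y == 1] == 0)  # fault-prone modules classified as clean
print(f"overall error = {overall_error:.2f}, FNR = {fnr:.2f}")
```

On data like this the classifier typically predicts the majority class for nearly every module, so the FNR approaches 1 while the overall error stays near the minority-class fraction.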
Feedback from our expert indicated that the Neural-Gas results seemed to be easier to label than the k-means results. We suspect the reason is that the Neural-Gas algorithm generates more coherent clusters. This is important for a real interactive data analysis system because the expert will have higher confidence explaining the clusters he or she derives from software metrics.