Data clustering in applied informatics and software engineering: international experience and foreign publications


Clustering and quality estimation results



Table 3 shows the clustering-quality results. The Neural-Gas algorithm performs significantly better in terms of mean squared error and comparably in terms of average purity. The k-means algorithm, however, runs much faster. The runtime results were recorded on a 3.06-GHz Pentium 4 PC with 1 Gbyte of memory running Windows XP.

Table 3. Clustering results for 30 JM1-8850 clusters and 20 KC2-520 clusters.

Data set   Technique    Mean squared error   Average purity   Time (seconds)
JM1-8850   k-means      3,342.7              0.746            0.966
           Neural-Gas   1,574.4              0.727            9.928
KC2-520    k-means      738.3                0.808            0.016
           Neural-Gas   244.0                0.815            0.375

Figure 2. Clustering- and expert-based classification results for (a) JM1-8850 and (b) KC2-520.
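As a sketch of how the two quality measures in Table 3 can be computed, the snippet below runs scikit-learn's k-means on a synthetic stand-in data set and derives the mean squared error (average squared distance to the assigned centroid) and the average purity (per-cluster majority-label fraction). The data, cluster count, and library choice are illustrative assumptions; the actual JM1-8850 and KC2-520 metrics are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for a software-metrics data set: two well-separated
# groups of modules with binary fault-proneness labels (1 = fault prone).
X = np.vstack([rng.normal(0.0, 1.0, (100, 4)),
               rng.normal(5.0, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Mean squared error: inertia_ is the sum of squared distances of points
# to their closest centroid, so dividing by the sample count gives the MSE.
mse = km.inertia_ / len(X)

# Average purity: for each cluster, the fraction of its points that share
# the cluster's majority label, averaged over clusters.
purities = []
for c in range(km.n_clusters):
    labels = y[km.labels_ == c]
    purities.append(np.bincount(labels).max() / len(labels))
avg_purity = float(np.mean(purities))
```

On this cleanly separated toy data the purity is near 1; on real metrics data (as in Table 3) it is typically well below that.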


Figure 2 reports the overall classification error, FPRs, and FNRs for both the k-means and Neural-Gas algorithms, compared to a decision-tree-based classifier, C4.5. We chose C4.5 for comparison because it's commonly used and known for robust classification accuracy. Because the FPR and FNR are inversely related for a given classification technique, we obtained the C4.5 classification results by adjusting some parameters (such as tree depth, pruning ratio, and so on) to achieve FPR errors similar to the two clustering techniques.
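Since the FPR/FNR trade-off drives the comparison above, it helps to be explicit about how the two rates are computed. The sketch below derives both rates and the overall error from a confusion-matrix count over hypothetical actual/predicted labels (1 = fault prone); the label vectors are invented purely for illustration.

```python
# Hypothetical module labels: 1 = fault prone, 0 = not fault prone.
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Confusion-matrix counts.
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
negatives = actual.count(0)
positives = actual.count(1)

fpr = fp / negatives                 # false-positive rate
fnr = fn / positives                 # false-negative rate
overall_error = (fp + fn) / len(actual)
```

Tightening a classifier's parameters to drive the FPR down generally pushes the FNR up, which is why the C4.5 parameters were tuned to match FPRs before comparing FNRs.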
Neural-Gas performed slightly worse for JM1-8850 and only slightly better for KC2-520 than k-means in terms of the overall error rate, although Table 3 gives a significantly lower MSE value. This is likely due to noise in the data sets (wrong labels or insufficient attributes). The error numbers are comparable to the C4.5 results, suggesting that clustering- and expert-based software quality classification is a viable option (compared to supervised learning) when software quality data isn't available.
It's worth mentioning that classifying the JM1-8850 data set is difficult, even for many state-of-the-art classifiers. For example, using the LIBSVM software package (available at www.csie.ntu.edu.tw/~cjlin/libsvm) with a twofold cross-validation setting, the support-vector-machine method achieves only 20 percent overall accuracy, with an FPR of 0 but an FNR of 98 percent. That is, the support-vector-machine method classifies almost all data as not fault prone. We saw similar results for KC2-520. However, the difficulty in classifying KC2-520 was relatively lower than for JM1-8850, as Figure 2 shows. The promising accuracy results in Figure 2 warrant further investigations into building clustering- and expert-based software quality-prediction systems.
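A twofold cross-validation run of the kind described above can be sketched with scikit-learn's SVC, which wraps LIBSVM. The data below is a synthetic stand-in for the NASA metrics with deliberately weak feature-label correlation, so it illustrates the procedure but will not reproduce the paper's exact error rates.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in: 8 software metrics per module, roughly 20 percent of
# modules labeled fault prone; labels nearly independent of the features,
# mimicking a noisy data set that is hard to classify.
X = rng.normal(0.0, 1.0, (300, 8))
y = (rng.random(300) < 0.2).astype(int)

# Twofold cross-validation with the default RBF kernel (SVC wraps LIBSVM).
scores = cross_val_score(SVC(), X, y, cv=2)
mean_accuracy = float(scores.mean())
```

On data like this the classifier tends to collapse toward the majority class, the same failure mode the text describes for JM1-8850.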
Feedback from our expert indicated that the Neural-Gas results seemed to be easier to label than the k-means results. We suspect the reason is that the Neural-Gas algorithm generates more coherent clusters. This is important for a real interactive data analysis system because the expert will have higher confidence explaining the clusters he or she derives from software metrics.

