Data clustering in applied informatics and software engineering, illustrated by foreign experience and foreign publications



Why clustering?
Unsupervised learning methods such as clustering techniques are a natural choice for analyzing software quality in the absence of fault-proneness labels. Clustering algorithms can group the software modules according to the values of their software metrics. The underlying software-engineering assumption is that fault-prone software modules will have similar software measurements and so will likely form clusters. Similarly, not-fault-prone modules will likely group together. When the clustering analysis is complete, a software engineering expert inspects each cluster and labels it fault prone or not fault prone.
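The grouping step described above can be sketched with a toy k-means over two metric values per module. The metric numbers, the two-cluster setup, and the naive initialization are illustrative assumptions, not data or choices from the study:

```python
import math

# Each module is a metric vector: (Total_Lines_of_Code, Cyclomatic_Complexity).
# Hypothetical values chosen to show two natural groups.
modules = [(12, 2), (15, 3), (20, 2),        # small, simple modules
           (310, 41), (290, 38), (350, 45)]  # large, complex modules

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

def kmeans(points, k=2, iters=20):
    centroids = list(points[:k])  # naive initialization for the sketch
    for _ in range(iters):
        # Assign each module to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        # Recompute centroids as the mean of each cluster.
        centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

clusters = kmeans(modules)
# An expert now inspects each cluster and labels it as a whole,
# e.g. the high-complexity group as fault prone.
for c in clusters:
    print(sorted(c))
```

In practice the study clusters on all 13 metrics; two dimensions are used here only so the grouping is easy to see.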

Table 1. Descriptive statistics of the 13 software metrics.

Software metric         Minimum  Maximum   Mean  Median   75%   80%   85%   90%
Branch_Count                  1      826  11.54       5    13    15    19    25
Total_Lines_of_Code           1    3,442  43.73      24    47    57    71    94
Executable_LOC                0    2,824  32.04      17    34    41    52    69
Comments_LOC                  0      344   3.36       0     3     4     6     9
Blank_LOC                     0      447   5.66       3     6     8     9    13
Code_And_Comments_LOC         0      108   0.45       0     0     0     0     1
Total_Operators               1    5,420  83.25      40    85   104   134   180
Total_Operands                0    3,021  56.73      27    59    72    92   126
Unique_Operators              1      411  13.40      12    17    18    20    22
Unique_Operands               0    1,026  20.35      14    24    27    33    41
Cyclomatic_Complexity         1      470   6.49       3     7     8    10    13
Essential_Complexity          1      165   3.33       1     3     4     5     7
Design_Complexity             1      402   4.14       2     4     5     6     8

A clustering approach offers practical benefits to the expert who must decide the labels. Instead of inspecting and labeling software modules one at a time, the expert can inspect and label a given cluster as a whole; he or she can assign all the modules in the cluster the same quality label. Such a strategy eases the tediousness of the labeling task, which is compounded when modules are numerous. For each cluster, the clustering algorithm can provide a representative software module, which the expert can inspect for labeling all modules in that cluster (aided by other descriptive data statistics). Moreover, when actual labels for the software modules are available, clustering analysis can provide the expert with valuable feedback for improving expert-based labeling in future releases of the given software project or other software projects.
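One common way to pick the representative module mentioned above is the cluster medoid: the member closest on average to all other members. The metric tuples below are hypothetical, and the medoid choice is an assumption for illustration (the paper does not specify how the representative is computed):

```python
import math

# A hypothetical cluster of modules: (Total_Lines_of_Code, Cyclomatic_Complexity).
cluster = [(12, 2), (15, 3), (20, 2), (14, 2)]

def medoid(points):
    # The medoid minimizes the summed Euclidean distance to all other members.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

rep = medoid(cluster)
print(rep)  # the expert inspects this one module to label the whole cluster
```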
The clustering approach is also attractive for addressing noise detection and removal for software quality classification. It can be a preprocessing step in analyzing training data quality before building a classifier. For example, in a given cluster, if a majority of software modules aren’t fault prone and a few modules are fault prone, the expert can make an educated assessment that the few fault prone modules in the cluster are likely noise and could be eliminated from the software measurement data.
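The noise-detection heuristic described above can be sketched directly: within one cluster, if one label holds a strong majority, flag the dissenting modules as likely mislabeled. The 0.8 majority threshold and the label strings are illustrative assumptions:

```python
from collections import Counter

def flag_noise(labels, majority=0.8):
    """Return indices of modules whose label disagrees with a strong cluster majority."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count / len(labels) < majority:
        return []  # no clear majority in this cluster, flag nothing
    return [i for i, lab in enumerate(labels) if lab != top_label]

# Hypothetical cluster: four not-fault-prone modules and one fault-prone outlier.
cluster_labels = ["nfp", "nfp", "nfp", "nfp", "fp"]
print(flag_noise(cluster_labels))  # -> [4]
```

The flagged modules are candidates for removal from the training data, subject to the expert's assessment.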


Software metrics
Software engineering has devoted much research to developing and studying effective software metrics that characterize a software system's complexity. Software engineering researchers and practitioners alike have proposed several software complexity metrics, most of which use the program code for expressing software complexity. The best-known software metrics are Maurice H. Halstead's software metrics and Thomas J. McCabe's complexity metrics. Halstead defined a range of metrics on the basis of the syntactic elements in a program (the operators and operands); McCabe's cyclomatic complexity metric is derived from a program's control flow graph and is equal to the number of linearly independent executable paths in the program. These software metrics have been widely and successfully used even though some issues related to their effectiveness remain open research problems.
Table 1 shows the 13 software metrics we used on the data sets in our experiments. These include (in the order in which they appear in Table 1) one Branch Count, five Line Counts, four Halstead metrics, and three McCabe metrics. Besides these software metrics, a software fault measurement (also called a software quality metric or measurement) is available as the error rate, which indicates the number of defects (or software faults) in each software module. You can use it (based on a threshold) to categorize each software module as fault prone or not fault prone. In our studies, we also refer to a module's class membership as its defect label.
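The thresholding step described above is simple; a minimal sketch follows. The threshold of zero defects is an assumption for illustration, since the paper does not state the value used:

```python
def defect_label(error_rate, threshold=0):
    """Categorize a module from its fault measurement (number of defects)."""
    return "fault-prone" if error_rate > threshold else "not-fault-prone"

print(defect_label(0))  # a defect-free module
print(defect_label(3))  # a module with three recorded faults
```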


Noise-handling techniques
In general, three different approaches to effective noise handling in data analysis exist: designing robust algorithms that are insensitive to noise, filtering out noise [3], and correcting noise [4]. Comparative studies in related research works have shown that which noise-handling approach works best depends on the specific application and the given data set.
Most robust algorithms have a complexity control mechanism so that the resulting models don’t overfit training data and generalize well to future unseen data. Cross-validation, minimum description length, and structural risk minimization are some commonly used model selection principles.
Noise-filtering or noise-removal techniques identify and eliminate potential outliers and mislabeled instances in the data set. One typical machine learning method in this category is to use an ensemble of multiple classifiers and treat as potential noise the data instances misclassified by a given majority of the classifiers. In our study, the modules misclassified by the clustering- and expert-based quality-estimation approach are compared to noisy modules identified by an ensemble of 25 classifiers.
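The ensemble filter described above can be sketched as a vote count over classifier predictions. The toy vote matrix below uses three classifiers rather than the study's 25, and the majority threshold is an illustrative parameter:

```python
def noisy_instances(predictions, true_labels, majority):
    """predictions[c][i] is the label classifier c assigns to instance i.
    An instance is potential noise if at least `majority` classifiers miss it."""
    noisy = []
    for i, true in enumerate(true_labels):
        misses = sum(1 for preds in predictions if preds[i] != true)
        if misses >= majority:
            noisy.append(i)
    return noisy

# Three classifiers voting on four instances (the study used 25 classifiers).
preds = [[0, 1, 1, 0],
         [0, 1, 1, 1],
         [0, 0, 1, 1]]
truth = [0, 1, 0, 1]
print(noisy_instances(preds, truth, majority=2))  # instance 2 is missed by all three
```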
Noise-correction or polishing methods assume that each attribute or feature in the data is correlated with the others and can be reliably predicted. The correction process starts by predicting the value of each feature for each data instance from the other features. Heuristics are then used to determine whether you should change (or correct) the original value of a feature for an instance to the predicted value. This approach is usually more computationally expensive than the first two.

Most earlier research works have focused on data noise-handling algorithms instead of real-world applications, using artificially injected noise to evaluate proposed algorithms' effectiveness. The difficulty with real-world data, as we encountered in our experiments, is that no ground truth exists for the identified noise instances. We justify our clustering-based approach's efficacy from several practical viewpoints, including clustering-based labeling accuracy and the benefits to real-world software development: detecting potentially mislabeled modules and reducing the amount of data necessary for decision making.
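The polishing process can be sketched as follows: predict each feature of each instance from the remaining features, and overwrite the stored value only when the prediction disagrees strongly. The 1-nearest-neighbour predictor, the tolerance, and the toy data are simplifying assumptions for illustration; published polishing methods use learned predictors per feature:

```python
import math

# Toy data: the second feature of the last row looks corrupted.
data = [[10.0, 2.0], [11.0, 2.1], [12.0, 2.2], [13.0, 9.9]]

def predict_feature(data, row, col):
    """Predict data[row][col] from the nearest other row on the remaining features."""
    rest = lambda r: [v for j, v in enumerate(r) if j != col]
    nearest = min((r for i, r in enumerate(data) if i != row),
                  key=lambda r: math.dist(rest(r), rest(data[row])))
    return nearest[col]

def polish(data, tolerance=2.0):
    out = [row[:] for row in data]  # corrections go in a copy
    for i, row in enumerate(data):
        for j in range(len(row)):
            pred = predict_feature(data, i, j)
            if abs(pred - row[j]) > tolerance:  # heuristic: correct only large disagreement
                out[i][j] = pred
    return out

print(polish(data))  # the 9.9 outlier is replaced by a value predicted from its neighbours
```

As the surrounding text notes, every cell requires a prediction, which is why polishing costs more than filtering or robust learning.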

