4. Parallel Data Mining 
Research and development work in the area of knowledge discovery 
and data mining concerns the study and definition of techniques, methods 
and tools for the extraction of novel, useful and not explicitly available 
patterns from large volumes of data. In particular, data mining is the main 
step of the knowledge discovery in databases (KDD) process, which consists 
of several steps ranging from data extraction to the interpretation of models 
and patterns (fig. 4) [1]. Data mining techniques originated from the use of 
statistical analysis and machine learning techniques for mining patterns 
from databases. In the last few years new techniques and algorithms have 
been designed. 
 
 
Fig. 4. Description of the KDD process 
 
Today information overload is as much a problem as the shortage of 
information. Knowledge discovery in large data repositories can find what 
is interesting in them and represent it in an understandable way [2]. Mining 
large data sets requires large computational resources: data mining 
algorithms working on very large data sets take very long times on 
conventional computers to produce results. One approach to reducing 
response time is sampling; however, in some cases reducing the data may 
result in inaccurate models, and in other cases it is not useful at all (e.g., 
outlier identification). The other approach is parallel computing. High-
performance computers and parallel data mining algorithms can offer the 
best way to mine very large data sets [3], [4]. It is not uncommon for 
sequential data mining applications to require several days or weeks to 
complete their task. Parallel computing systems can bring significant 
benefits to the implementation of data mining and knowledge discovery 
applications by exploiting the inherent parallelism of data mining 
algorithms. 


 
The main goals of the use of parallel computing technologies in the data 
mining field are: 
•  performance improvement of existing techniques, 
•  implementation of new (parallel) techniques and algorithms, and 
•  concurrent analysis with different data mining techniques and the 
integration of their results to get a better (that is, more accurate) model. 
There are three main strategies for exploiting parallelism in data 
mining algorithms: 
•  independent parallelism, 
•  task parallelism, 
•  SPMD parallelism. 
According to task parallelism (or control parallelism), each process 
executes different operations on (a different partition of) the data set. In 
SPMD parallelism a set of processes executes the same algorithm in parallel 
on different partitions of the data set, and the processes exchange partial 
results. Finally, independent parallelism is exploited when processes are 
executed in parallel in an independent way; generally, each process has 
access to the whole data set. These three strategies are not necessarily 
alternatives for parallelizing data mining algorithms: they can be combined 
to improve both performance and accuracy of results. In combination with 
these parallelization strategies, different data partitioning strategies can be 
used (a minimal sketch follows the list below): 
•  sequential partitioning: separate partitions are defined without 
overlapping among them; 
•  cover-based partitioning: some data can be replicated on different 
partitions; 
•  range-based query: partitions are defined on the basis of some queries 
that select data according to attribute values. 
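As an illustration of the three partitioning strategies above, here is a 
minimal Python sketch; it is our own example, not code from the paper, 
and treats the data set simply as a list of records: 

# Minimal sketch (our illustration, not from the paper) of the three
# data partitioning strategies described above.

def sequential_partition(data, n_parts):
    """Sequential partitioning: disjoint chunks with no overlap."""
    size = (len(data) + n_parts - 1) // n_parts  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_parts)]

def cover_based_partition(data, n_parts, overlap):
    """Cover-based partitioning: adjacent chunks replicate `overlap` records."""
    size = (len(data) + n_parts - 1) // n_parts
    return [data[max(0, i * size - overlap):(i + 1) * size + overlap]
            for i in range(n_parts)]

def range_based_partition(data, predicates):
    """Range-based query: one partition per attribute-value predicate."""
    return [[rec for rec in data if pred(rec)] for pred in predicates]

# Example: ten records split three ways under each strategy.
print(sequential_partition(list(range(10)), 3))
print(cover_based_partition(list(range(10)), 3, overlap=1))
print(range_based_partition(list(range(10)), [lambda r: r < 5, lambda r: r >= 5]))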
Parallelization strategies can be applied to different data mining (DM) 
techniques and algorithms such as classification, association rules, episode 
discovery, link analysis and clustering. Among the various DM techniques, 
one of the most useful is clustering. Clustering algorithms try to separate 
data into several groups so that similar items fall into the same group and 
the similarity between items in different groups is low [5]. This partitioning 
is performed without any user support, so this particular DM task is 
referred to as unsupervised learning. Most of the early cluster analysis 
algorithms were originally designed for relatively small data sets. In recent 
years, clustering algorithms have been extended to work efficiently on large 
databases, and some of them are able to deal with high-dimensional feature 
items [6]. When used to classify large data sets, clustering algorithms are 
very computationally demanding (the computation may require several 
days) and need high-performance machines to get results in reasonable 
time [7], [3]. 
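To make the clustering task concrete, the following short k-means sketch 
(our own illustration; it is not AutoClass, which is a Bayesian, model-based 
clusterer) groups items so that each one is close to the centre of its own 
group: 

import numpy as np

def kmeans(data, k, n_iter=20, seed=0):
    # Pick k random items as the initial group centres.
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each item to the nearest centre (low distance = high similarity).
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of the items assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centres[c] = data[labels == c].mean(axis=0)
    return labels, centres

labels, centres = kmeans(np.random.rand(1000, 2), k=3)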
A well-known clustering algorithm is AutoClass [8], designed by 
Cheeseman and colleagues at the NASA Ames Research Center. The 
execution of the sequential version of AutoClass on large data sets requires 
high computational times. For instance, sequential AutoClass runs on a data 
set of 14K tuples, each composed of a few hundred bytes, took more than 
3 hours on a Pentium-based PC. Considering that the execution time 
increases very quickly with the size of the data set, more than one day is 
necessary to analyze a data set of about 140K tuples, which is not a very 
large data set. For clustering a satellite image AutoClass took more than 
130 hours, and for the analysis of protein sequences its discovery process 
required from 300 to 400 hours [9]. 
These considerations and experiences suggested the need for a faster 
version of AutoClass able to handle very large data sets in reasonable time. 
This was achieved by Pizzuti and Talia at ISI-CNR, who exploited the 
inherent parallelism present in the AutoClass algorithm by implementing it 
in parallel on MIMD multicomputers [10]. The resulting P-AutoClass 
algorithm is based on the Single Program Multiple Data (SPMD) approach: 
the data are divided among the processors and each processor executes the 
same operations on its own partition of the data. At the end of each local 
computation, partial results (the weights for each instance and the 
parameters) are exchanged among the processors to obtain the global 
result. This strategy does not require replicating the entire data set on each 
processor. Furthermore, it does not produce load balancing problems, 
because each processor executes the same program on data of equal size. 
Finally, the amount of data exchanged among the processors is not large, 
since most operations are performed locally at each processor. 
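The exchange pattern just described can be sketched with MPI. The 
following Python fragment (our own schematic, not the P-AutoClass 
source; the weights here are random placeholders for the real class-
membership computation) shows each process summarizing its local 
partition and then combining the partial results with an all-reduce: 

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nproc = comm.Get_size()

N_ITEMS, N_CLASSES, N_FEATURES = 140_000, 10, 8

# Each process owns an equal-sized partition of the data set
# (no replication, and no load imbalance since all partitions match).
local_n = N_ITEMS // nproc
local_data = np.random.rand(local_n, N_FEATURES)  # stand-in for real tuples

# Local step: class-membership weights for the local items only
# (random placeholders here; AutoClass computes them from its model).
local_weights = np.random.rand(local_n, N_CLASSES)
local_weights /= local_weights.sum(axis=1, keepdims=True)

# Partial parameter statistics: weighted sums over the local partition.
partial_sums = local_weights.T @ local_data        # (classes x features)
partial_counts = local_weights.sum(axis=0)         # weight mass per class

# Exchange step: every process receives the combined global statistics.
global_sums = np.empty_like(partial_sums)
global_counts = np.empty_like(partial_counts)
comm.Allreduce(partial_sums, global_sums, op=MPI.SUM)
comm.Allreduce(partial_counts, global_counts, op=MPI.SUM)

# All processes now hold identical global parameters (e.g. class means)
# and can start the next iteration on their own partitions.
class_means = global_sums / global_counts[:, None]

Run with, for example, mpiexec -n 4 python spmd_sketch.py. Only the 
per-class summaries travel between processors, which is why the 
communication volume stays small. 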
The main part of the algorithm is devoted to the generation of 
classifications and the evaluation of each generated classification. This loop 
is composed of a set of sub-steps, among which the new-try-of-classification 
step is the most computationally intensive. In AutoClass, class membership 
is expressed by weights. This step computes the weights of each item for 
each class and then computes the parameters of the classification. These 
operations are executed by the function base_cycle, which calls the three 
functions up-
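Schematically, the weight computation inside this step can be pictured as 
the normalization of class priors times per-class likelihoods; the sketch 
below is our illustration with hypothetical names, not AutoClass code: 

import numpy as np

def compute_weights(likelihoods, priors):
    # likelihoods: (n_items, n_classes) array of p(item | class);
    # priors: (n_classes,) array of class probabilities.
    unnorm = likelihoods * priors                      # joint, up to a constant
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # each row sums to 1

The parameters of the classification are then re-estimated from these 
weights, as in the all-reduce sketch above. 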

