The main goals of using parallel computing technologies in the data
mining field are:
• performance improvement of existing techniques,
• implementation of new (parallel) techniques and algorithms, and
• concurrent analysis with different data mining techniques and integration of the results to obtain a better (more accurate) model.
There are three main strategies in the exploitation of parallelism in data
mining algorithms:
• independent parallelism,
• task parallelism,
• SPMD parallelism.
According to task parallelism (or control parallelism), each process executes different operations on (a different partition of) the data set. In SPMD parallelism, a set of processes executes the same algorithm in parallel on different partitions of the data set, and the processes exchange partial results. Finally, independent parallelism is exploited when processes are executed in parallel in an independent way; generally, each process has access to the whole data set. These three strategies are not necessarily alternatives for parallelizing data mining algorithms. They can be combined to improve both performance and accuracy of results. In combination with strategies for parallelization, different data partition strategies can be used:
• sequential partitioning: separate partitions are defined without overlapping among them;
• cover-based partitioning: some data can be replicated on different partitions;
• range-based query: partitions are defined on the basis of queries that select data according to attribute values.
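The three partitioning strategies above can be sketched as follows; this is an illustrative sketch in Python, and all function and parameter names are ours, not taken from the chapter:

```python
# Illustrative sketches of the three data-partitioning strategies
# (function names and signatures are our own, hypothetical choices).

def sequential_partition(data, n_parts):
    """Sequential partitioning: contiguous, non-overlapping chunks."""
    size, rem = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts

def cover_partition(data, n_parts, overlap):
    """Cover-based partitioning: each partition also replicates
    `overlap` items from the following partition."""
    base = sequential_partition(data, n_parts)
    covered = []
    for i, part in enumerate(base):
        extra = base[i + 1][:overlap] if i + 1 < len(base) else []
        covered.append(part + extra)
    return covered

def range_partition(records, attr, boundaries):
    """Range-based partitioning: partitions defined by queries on an
    attribute value, one partition per range between boundaries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        idx = sum(r[attr] >= b for b in boundaries)
        parts[idx].append(r)
    return parts
```

For example, `cover_partition(list(range(10)), 3, 1)` yields three chunks in which the last item of each chunk is replicated as the first item of the next, as cover-based partitioning allows.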
Parallelization strategies can be applied to different data mining (DM) techniques and algorithms such as classification, association rules, episodes discovery, link analysis, and clustering. Among the various DM techniques, one of the most useful is clustering. Clustering algorithms try to separate data into several groups so that similar items fall into the same group and similarity between items in different groups is low [5]. This partitioning is performed without any user support, so this particular DM task is referred to as unsupervised learning. Most of the early cluster analysis algorithms were originally designed for relatively small data sets. In recent years, clustering algorithms have been extended to work efficiently on large databases, and some of them are able to deal with high-dimensional feature items [6]. When used to classify large data sets, clustering algorithms are very computationally demanding (computation may require several days) and require high-performance machines to get results in reasonable time [7] [3].
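As a minimal illustration of the clustering task described above, the following sketch implements k-means, one classical clustering algorithm (AutoClass itself is based on a Bayesian mixture model, not k-means); all names here are our own:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: group points so that similar items fall
    into the same cluster and items in different clusters are far apart."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: recompute each center as the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return clusters, centers
```

Run on four points forming two obvious groups, the algorithm separates them into two clusters with no user-provided labels, which is what makes the task unsupervised.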
A well-known clustering algorithm is AutoClass [8], designed by Cheeseman and colleagues at NASA Ames Research Center. The execution of the sequential version of AutoClass on large data sets requires high computational times. For instance, sequential AutoClass runs on a data set of 14K tuples, each composed of a few hundred bytes, took more than 3 hours on a Pentium-based PC. Considering that the execution time increases very quickly with the size of the data set, more than 1 day is necessary to analyze a data set composed of about 140K tuples, which is not a very large data set. For clustering a satellite image, AutoClass took more than 130 hours, and for the analysis of protein sequences its discovery process required from 300 to 400 hours [9].
These considerations and experiences suggested the need for a faster version of AutoClass able to handle very large data sets in reasonable time. This was done by Pizzuti and Talia at ISI-CNR, who exploited the inherent parallelism present in the AutoClass algorithm by implementing it in parallel on MIMD multicomputers [10]. The resulting P-AutoClass algorithm is based on the Single Program Multiple Data (SPMD) approach, which works by dividing data among the processors and executing the same operations at each processor on its own partition of the data. At the end of each local computation, partial results (weights for each instance and parameters) are exchanged among the processors to obtain the global result. This strategy does not require replicating the entire data set on each processor. Furthermore, it does not produce load balancing problems, because each processor executes the same program on data of equal size. Finally, the amount of data exchanged among the processors is not large, since most operations are performed locally at each processor.
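The SPMD pattern described above, where every worker runs the same code on its own partition and only partial results are combined globally, can be sketched as follows. This is a simplified illustration, not the actual P-AutoClass code: threads stand in for the MIMD processors, and a partial sum stands in for the weights and parameters that P-AutoClass exchanges; all names are ours.

```python
from concurrent.futures import ThreadPoolExecutor

def local_step(partition):
    """The same code runs on every worker, each on its own partition;
    it returns only small partial results (here a sum and a count,
    standing in for P-AutoClass's per-instance weights and parameters)."""
    return sum(partition), len(partition)

def spmd_mean(data, n_procs):
    """Divide the data among the workers, execute the same local
    computation everywhere, then combine the partial results into a
    global one -- the data set is never replicated on any worker."""
    size = (len(data) + n_procs - 1) // n_procs
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_step, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Note that each worker ships back only two numbers regardless of partition size, which mirrors why the communication volume in this scheme stays small compared to the local computation.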
The main part of the algorithm is devoted to the generation and evaluation of classifications. This loop is composed of a set of sub-steps, among which the new classification try step is the most computationally intensive. In AutoClass, class membership is expressed by weights. This step computes the weights of each item for each class and then computes the parameters of the classification. These operations are executed by the function base_cycle, which calls the three functions up-