4. Parallel Data Mining
Research and development work in the area of knowledge discovery
and data mining concerns the study and definition of techniques, methods
and tools for the extraction of novel, useful and not explicitly available pat-
terns from large volumes of data. In particular, data mining is the main step
of the knowledge discovery in databases (KDD) process, which consists of several steps
from data extraction to the interpretation of models and patterns (fig. 4) [1]. Data
mining techniques originated from the use of statistical analysis and ma-
chine learning techniques in the mining of patterns from databases. In the
last few years, new techniques and algorithms have been designed.
Fig. 4. Description of the KDD process
Today information overload is as much a problem as the shortage of infor-
mation. Knowledge discovery in large data repositories can find what is
interesting in them and represent it in an understandable way [2]. Mining
large data sets requires large computational resources: data mining
algorithms working on very large data sets take a very long time on conven-
tional computers to produce results. One approach to reducing response time is
sampling; however, in some cases reducing the data may result in inaccurate mod-
els, while in other cases it is not useful at all (e.g. outlier identification). The other
approach is parallel computing. High performance computers and parallel
data mining algorithms can offer the best way to mine very large data sets
[3], [4]. It is not uncommon for sequential data mining applications to
require several days or weeks to complete their task. Parallel computing
systems can bring significant benefits in the implementation of data mining
and knowledge discovery applications by exploiting the in-
herent parallelism of data mining algorithms.
The main goals of using parallel computing technologies in the data
mining field are:
• performance improvements of existing techniques,
• implementation of new (parallel) techniques and algorithms, and
• concurrent analysis with different data mining techniques and integration
of the results to obtain a better (that is, more accurate) model.
There are three main strategies in the exploitation of parallelism in data
mining algorithms:
• independent parallelism,
• task parallelism,
• SPMD parallelism.
According to task parallelism (or control parallelism), each process exe-
cutes different operations on (a different partition of) the data set. In SPMD
parallelism a set of processes execute in parallel the same algorithm on dif-
ferent partitions of the data set and processes exchange partial results. Fi-
nally, independent parallelism is exploited when processes are executed in
parallel in an independent way; generally each process has access to the
whole data set. These three strategies are not necessarily alternatives for par-
allelizing data mining algorithms. They can be combined to improve both
performance and accuracy of results. In combination with strategies for par-
allelization, different data partition strategies can be used (see the sketch
after this list):
• sequential partitioning: separate partitions are defined without overlap-
ping among them;
• cover-based partitioning: some data can be replicated on different par-
titions;
• range-based query: partitions are defined on the basis of some queries
that select data according to attribute values.
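As a simple illustration, the short Python sketch below shows how the first two partitioning strategies might look in code. The function names, the block sizes and the overlap value are purely hypothetical and are not taken from any of the systems cited in this section.

# Illustrative sketch of two data-partitioning strategies; the helper
# functions and the fixed overlap value are hypothetical.

def sequential_partitioning(data, n_partitions):
    """Split the data set into disjoint blocks with no overlap."""
    size = len(data) // n_partitions
    parts = []
    for i in range(n_partitions):
        start = i * size
        # the last partition absorbs any remainder
        end = (i + 1) * size if i < n_partitions - 1 else len(data)
        parts.append(data[start:end])
    return parts

def cover_based_partitioning(data, n_partitions, overlap=10):
    """Split the data set into blocks that also replicate `overlap` items
    from each neighbouring block, so some data appear in several partitions."""
    size = len(data) // n_partitions
    parts = []
    for i in range(n_partitions):
        start = max(0, i * size - overlap)
        end = min(len(data), (i + 1) * size + overlap)
        parts.append(data[start:end])
    return parts

if __name__ == "__main__":
    items = list(range(100))
    print([len(p) for p in sequential_partitioning(items, 4)])   # disjoint blocks
    print([len(p) for p in cover_based_partitioning(items, 4)])  # overlapping blocks

Range-based query partitioning would instead select the data with attribute-level queries (for example, one partition per value range of a given attribute) rather than by position.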
Parallelization strategies can be applied to different data mining (DM) tech-
niques and algorithms such as classification, association rules, epi-
sodes discovery, link analysis and clustering. Among the various DM tech-
niques, one of the most useful is clustering. Clustering algorithms try to
separate data into several groups so that similar items fall into the same
group and the similarity between items in different groups is low [5]. This parti-
tion is performed without any user support, so this particular DM task is
referred to as unsupervised learning. Most of the early cluster analysis algo-
rithms were originally designed for relatively small data sets. In
recent years, clustering algorithms have been extended to work efficiently
on large databases and some of them are able to deal with high-dimensional
feature items [6]. When used to classify large data sets, clustering algo-
rithms are very computationally demanding (computation may require several
days) and require high-performance machines to get results in reasonable
time [7], [3].
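To make the idea of unsupervised grouping concrete, the following minimal Python sketch implements a very simple clustering procedure (a basic k-means on one-dimensional data). It only illustrates the grouping behaviour described above; it does not reflect the Bayesian, model-based approach of AutoClass discussed next, and all names in it are hypothetical.

# Minimal k-means-style clustering on 1-D data, used here only to
# illustrate unsupervised grouping; real systems such as AutoClass
# use far more sophisticated (Bayesian) models.
import random

def kmeans(points, k, iterations=20):
    """Group points into k clusters so that similar values end up together."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: each item joins the group of its closest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each centre moves to the mean of its group
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers

if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9, 15.1, 14.8]
    groups, centers = kmeans(data, k=3)
    print(groups)   # similar values fall into the same group

Even this toy procedure scans the whole data set at every iteration, which hints at why clustering becomes so expensive on very large data sets.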
A well-known clustering algorithm is AutoClass [8], designed by
Cheeseman and colleagues at NASA Ames Research Center. The execution
of the sequential version of AutoClass on large data sets requires high com-
putational times. For instance, sequential AutoClass runs on a data set of
14K tuples, each composed of a few hundred bytes, took more
than 3 hours on a Pentium-based PC. Considering that the execution time in-
creases very fast with the size of the data set, more than 1 day is necessary to
analyze a data set composed of about 140K tuples, which is not a very large
data set. For clustering a satellite image, AutoClass took more than 130
hours and for the analysis of protein sequences its discovery process re-
quired from 300 to 400 hours [9].
These considerations and experiences suggested the necessity of a faster
version of AutoClass to handle very large data sets in reasonable time. This
was done by Pizzuti and Talia at ISI-CNR, who exploited the inherent
parallelism of the AutoClass algorithm by implementing it in paral-
lel on MIMD multicomputers [10]. The resulting P-AutoClass algorithm is
based on the Single Program Multiple Data (SPMD) approach that works by
dividing data among the processors and executing the same operations at
each processor using a partition of data. At the end of each local computa-
tion partial results (weights for each instance and parameters) are exchanged
among the processors to obtain the global result. This strategy does not re-
quire replicating the entire data set on each processor. Furthermore, it
does not produce load-balancing problems because each processor executes
the same program on data of equal size. Finally, the amount of data ex-
changed among the processors is not so large since most operations are per-
formed locally at each processor.
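The sketch below illustrates this SPMD scheme under some simplifying assumptions: it uses the mpi4py library for message passing (an assumption, since the works cited above do not prescribe it) and replaces the actual P-AutoClass weight and parameter updates with a much simpler per-class statistic, so the formulas are illustrative only, not the real implementation.

# SPMD sketch: each process works on its own data partition, computes
# partial per-class statistics locally and exchanges only those partial
# results; the statistics are a drastic simplification of the weights
# and parameters actually used by P-AutoClass.  Assumes mpi4py and NumPy.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Each process loads (here: generates) only its own partition of the data set.
rng = np.random.default_rng(seed=rank)
local_data = rng.normal(loc=rank % 3, scale=1.0, size=(1000, 4))

n_classes = 3
centers = np.arange(n_classes, dtype=float).reshape(-1, 1) * np.ones((n_classes, 4))

for _ in range(10):                      # simplified classification loop
    # local step: soft class membership (weights) for the local items only
    dists = np.linalg.norm(local_data[:, None, :] - centers[None, :, :], axis=2)
    weights = np.exp(-dists)
    weights /= weights.sum(axis=1, keepdims=True)

    # partial sums computed locally ...
    local_wsum = weights.T @ local_data   # per-class weighted sums
    local_wtot = weights.sum(axis=0)      # per-class total weight

    # ... and exchanged among all processes to obtain the global parameters
    global_wsum = np.empty_like(local_wsum)
    global_wtot = np.empty_like(local_wtot)
    comm.Allreduce(local_wsum, global_wsum, op=MPI.SUM)
    comm.Allreduce(local_wtot, global_wtot, op=MPI.SUM)
    centers = global_wsum / global_wtot[:, None]

if rank == 0:
    print(centers)   # identical on every process after the exchange

Run, for instance, with mpirun -n 4 python spmd_sketch.py (the file name is arbitrary): only the small per-class sums travel between processes, while each partition of the data set stays on its own processor, which is exactly the property the text attributes to the SPMD strategy.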
The main part of the algorithm is a loop devoted to the generation
and evaluation of classifications. This loop is composed of a set of
sub-steps, among which the step that tries a new classification is the most com-
putationally intensive. In AutoClass class membership is expressed by
weights. This step computes the weights of each item for each class and
then it computes the parameters of classification. These operations are exe-
cuted by the function base_cycle which calls the three functions up-