5. Grid Computing
The term Grid refers to an infrastructure that enables the integrated,
collaborative use of high-end computers, networks, databases, and scientific
instruments owned and managed by multiple organizations. Grid applications
often involve large amounts of data and/or computing and often require
secure resource sharing across organizational boundaries, and are thus
not easily handled by today's Internet and Web infrastructures [11].
Grid computing has emerged as an important new field, distinguished
from conventional distributed computing by its focus on large-scale re-
source sharing, innovative applications, and, in some cases, high-
performance orientation. The real and specific problem that underlies the
Grid concept is coordinated resource sharing and problem solving in dy-
namic, multi-institutional virtual organizations. The sharing that we are
concerned with is not primarily file exchange but rather direct access to
computers, software, data, and other resources, as is required by a range of
collaborative problem-solving and resource-brokering strategies emerging
in industry, science, and engineering. This sharing is, necessarily, highly
controlled, with resource providers and consumers defining clearly and
carefully just what is shared, who is allowed to share, and the conditions
under which sharing occurs. Tools that implement grid services are emerg-
ing and some of these, such as Globus and Legion, are used by many re-
search teams in several countries [12].
Grid computing concepts were first explored in the 1995 I-WAY ex-
periment, in which high-speed networks were used to connect, for a short
time, high-end resources at 17 sites across North America. Out of this activ-
ity grew a number of Grid research projects that developed the core tech-
nologies for "production" Grids in various communities and scientific disci-
plines. For example, the US National Science Foundation's National Tech-
nology Grid and NASA's Information Power Grid are both creating Grid
infrastructures to serve university and NASA researchers, respectively.
Across Europe and the United States, the closely related European Data
Grid, Particle Physics Data Grid and Grid Physics Network (GriPhyN) pro-
jects plan to analyze data from frontier physics experiments. And outside
the specialized world of physics, the Network for Earthquake Engineering
Simulation Grid (NEESgrid) aims to connect US civil engineers with the
experimental facilities, data archives and computer simulation systems used
to engineer better buildings.
In the grid computing field, at ISI-CNR we are working on two specific
topics: grid programming and knowledge discovery on grids. In this section
we describe the research achievements in the latter area, and in particular
we discuss the Knowledge Grid architecture.
The Knowledge Grid architecture, designed by Cannataro and Talia
[13], is built on top of a computational grid that provides dependable, consistent,
and pervasive access to high-end computational resources. The pro-
posed architecture uses the basic grid services (i.e., the Globus services) and
defines a set of additional layers to implement the services of the distributed
knowledge discovery process on computers connected worldwide, where
each node can be a sequential or a parallel machine. The Knowledge Grid
enables collaboration between scientists who must mine data stored in
different research centers, as well as executive managers who must use a
knowledge management system operating on several data warehouses
located at different company sites.
The Knowledge Grid attempts to overcome the difficulties of wide area,
multi-site operation by exploiting the underlying grid infrastructure that
provides basic services such as communication, authentication, resource
management, and information. To this end, the Knowledge Grid architecture
is organized so that more specialized data mining tools are compatible with
lower-level grid mechanisms and also with the Data Grid services. This
approach benefits from "standard" grid services that are increasingly
used, and offers an open Parallel and Distributed Knowledge Discovery
(PDKD) architecture that can be configured on top of grid middleware in a
simple way.
Fig. 5. Layers and components of the Knowledge Grid architecture
The Knowledge Grid services (layers) are organized in two hierarchical
levels: the core K-grid layer and the high-level K-grid layer. The former
comprises services implemented directly on top of generic grid services;
the latter comprises services used to describe, develop, and execute PDKD
computations over the Knowledge Grid (see Fig. 5).
The core K-grid layer supports the definition, composition and execu-
tion of a PDKD computation over the grid. Its main goals are the manage-
ment of all metadata describing characteristics of data sources, third party
data mining tools, data management, and data visualization tools and algo-
rithms. Moreover, this layer has to coordinate the PDKD computation exe-
cution, attempting to match the application requirements and the available
grid resources. This layer comprises the following basic services:
• Knowledge Directory Service (KDS), responsible for maintaining a
description of all the data and tools used in the Knowledge Grid.
• Resource allocation and execution management (RAEM) services, used
to find a mapping between an execution plan and the available resources,
with the goal of satisfying requirements (computing power, storage,
memory, database, network bandwidth and latency) and constraints.
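The RAEM matching step described above can be sketched as a simple constraint filter. The following is a minimal illustration only; all class names, attributes, and the ranking heuristic are hypothetical and do not reproduce the actual Globus or Knowledge Grid API.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """A grid node as it might be advertised through the directory service."""
    host: str
    cpus: int
    memory_gb: int
    bandwidth_mbps: float

@dataclass
class TaskRequirements:
    """Requirements attached to one step of an execution plan."""
    min_cpus: int
    min_memory_gb: int
    min_bandwidth_mbps: float

def match_resources(task, resources):
    """Return the resources satisfying all of the task's constraints,
    with the best candidates (most CPUs) first."""
    feasible = [r for r in resources
                if r.cpus >= task.min_cpus
                and r.memory_gb >= task.min_memory_gb
                and r.bandwidth_mbps >= task.min_bandwidth_mbps]
    return sorted(feasible, key=lambda r: r.cpus, reverse=True)
```

A real RAEM must additionally weigh network latency and data location, but the core operation is this kind of requirement/resource matching.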
The high-level K-grid layer comprises the services used to compose, to
validate, and to execute a PDKD computation. Moreover, the layer offers
services to store and analyze the knowledge discovered by PDKD computa-
tions. Main services are:
• Data Access (DA) services that are responsible for the search, selection
(Data search services), extraction, transformation and delivery (Data
extraction services) of data to be mined.
• Tools and algorithms access (TAAS) services that are responsible for
the search, selection, and downloading of data mining tools and algorithms.
• Execution plan management (EPM) that handles execution plans as an
abstract description of a PDKD grid application. An execution plan is a
graph describing the interaction and data flows between data sources,
extraction tools, DM tools, visualization tools, and storing of knowl-
edge results in the Knowledge Base Repository.
• Results presentation service (RPS) that specifies how to generate, pre-
sent and visualize the PDKD results (rules, associations, models, classi-
fication, etc.). Moreover, it offers an API to store these results in
different formats in the Knowledge Base Repository.
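The interplay of the four high-level services can be sketched as a pipeline. The stub classes and method names below are purely illustrative (they are not the Knowledge Grid interfaces); they only show how a DA → TAAS → EPM → RPS chain could fit together for one mining task.

```python
class DataAccessService:
    def search_and_extract(self, source):
        # DA: search, select, and extract the data to be mined
        # (here: drop missing records as a stand-in for extraction).
        return [row for row in source if row is not None]

class ToolAccessService:
    def find_tool(self, task):
        # TAAS: look up a mining algorithm by task name
        # (a toy "classifier" that just counts its input rows).
        tools = {"classification": lambda rows: {"model": len(rows)}}
        return tools[task]

class ExecutionPlanManager:
    def build_plan(self, data, tool):
        # EPM: an execution plan pairing data with a tool.
        return (data, tool)

    def execute(self, plan):
        data, tool = plan
        return tool(data)

class ResultsPresentationService:
    def store(self, results, repository):
        # RPS: persist results in the Knowledge Base Repository.
        repository.append(results)

# Chain the services for one hypothetical classification task:
repo = []
da, taas = DataAccessService(), ToolAccessService()
epm, rps = ExecutionPlanManager(), ResultsPresentationService()
data = da.search_and_extract([1, None, 2])          # DA
tool = taas.find_tool("classification")             # TAAS
results = epm.execute(epm.build_plan(data, tool))   # EPM
rps.store(results, repo)                            # RPS
```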
This Knowledge Grid represents a first step in the process of studying
the unification of PDKD and computational grid technologies and defining
an integrating architecture for distributed data mining and knowledge dis-
covery based on grid services. We hope that the definition of such an archi-
tecture will accelerate progress on very large-scale geographically distrib-
uted data mining by enabling the integration of currently disjoint ap-
proaches and revealing technology gaps that require further research and
development. Currently, a first prototype of the system, built on top of
Globus is available. In particular, Cannataro, Talia and Trunfio have im-
plemented the Knowledge Directory Service and the Knowledge Metadata
Repository of the Core K-grid layer, and the Data Access Service of the
High level K-grid layer [14].
The metadata describing relevant objects for PDKD computations, such
as data sources and data mining software, are represented as XML documents
stored in a local repository (the KMR), and their availability is published
through entries in the Directory Information Tree maintained by an LDAP server,
which is provided by the Grid Information Service (GIS) of the Globus
Toolkit. The main attributes of the LDAP entries specify the location of the
repositories containing the XML metadata, whereas the XML documents
maintain more specific information for the effective use of resources. The
basic tools of the DA service have been implemented, allowing users to find,
retrieve, and select metadata about PDKD objects on the grid on the basis of
different search parameters and selection filters. Moreover, we are modeling
the representation of execution plans as graphs, where nodes represent
computational elements (data sources, software programs, results, etc.) and
arcs represent basic operations (data movements, data filtering, program
execution, etc.). We plan to consider different network parameters, such as
topology, bandwidth and latency, for PDKD program execution optimiza-
tion.
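As an illustration of the KMR metadata selection just described, the sketch below filters XML metadata documents by mining-task kind, mimicking a DA-service selection filter. The XML schema, element names, and hostname are invented for the example and do not reproduce the actual KMR format.

```python
import xml.etree.ElementTree as ET

# Hypothetical KMR metadata entry for a data mining tool; the real
# schema used by the Knowledge Grid prototype is not reproduced here.
KMR_ENTRY = """
<DataMiningSoftware>
  <Name>AprioriMiner</Name>
  <Kind>association-rules</Kind>
  <Host>grid-node.example.org</Host>
</DataMiningSoftware>
"""

def select_metadata(xml_docs, kind):
    """Return the names of the tools whose metadata matches the
    requested mining-task kind (a simple selection filter)."""
    selected = []
    for doc in xml_docs:
        root = ET.fromstring(doc)
        if root.findtext("Kind") == kind:
            selected.append(root.findtext("Name"))
    return selected
```

In the prototype, the LDAP entries would first be queried to locate the repositories; only then would the XML documents themselves be retrieved and filtered as above.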
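The execution-plan representation described above, with nodes as computational elements and arcs as basic operations, can be sketched as a labeled directed graph. All names below are illustrative, not the actual prototype's data model.

```python
class ExecutionPlan:
    """A labeled directed graph: nodes are computational elements,
    arcs are basic operations between them."""

    def __init__(self):
        self.nodes = {}   # node name -> element kind
        self.arcs = []    # (source, destination, operation)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_arc(self, src, dst, operation):
        self.arcs.append((src, dst, operation))

    def successors(self, name):
        """Nodes directly reachable from `name` (useful for scheduling)."""
        return [dst for src, dst, _ in self.arcs if src == name]

# A toy plan: move data from a source through a filter to a DM tool,
# then store the result in the Knowledge Base Repository.
plan = ExecutionPlan()
plan.add_node("db1", "data source")
plan.add_node("filter", "extraction tool")
plan.add_node("miner", "DM tool")
plan.add_node("kbr", "knowledge result")
plan.add_arc("db1", "filter", "data movement")
plan.add_arc("filter", "miner", "data filtering")
plan.add_arc("miner", "kbr", "program execution")
```

Annotating the arcs with the network parameters mentioned above (topology, bandwidth, latency) would then let the RAEM services optimize where each step of the plan is executed.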