Data Integration Scenarios in OGSA-DAI
Keke Qi
I, Keke Qi, confirm that this dissertation and the work presented in it are my own
achievement.
1. Where I have consulted the published work of others this is always clearly attributed;
2. Where I have quoted from the work of others the source is always given. With the
exception of such quotations this dissertation is entirely my own work;
3. I have acknowledged all main sources of help;
4. If my research follows on from previous work or is part of a larger collaborative
research project I have made clear exactly what was done by others and what I have
contributed myself;
5. I have read and understand the penalties associated with plagiarism.
Signed:
Date:
Matriculation no:
DATA INTEGRATION SCENARIOS IN
OGSA‐DAI
Keke Qi
10 September 2004
ABSTRACT
OGSA-DAI is middleware that provides data access and integration
capabilities to a Grid consistent with the OGSA vision. The role of the OGSA-DAI
middleware is to present a unified programming model for application writers
and mask out problems of heterogeneity and distribution. The data access
capabilities of OGSA-DAI have been well tested and demonstrated. In this work,
OGSA-DAI's data integration capabilities are investigated and evaluated.
The evaluation is mainly based on two data integration scenarios. In addition, a
proof-of-concept work that proposes a service driven data integration model
is introduced and discussed.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF CODES
ACKNOWLEDGMENTS
1 INTRODUCTION
  1.1 BACKGROUND
  1.2 OGSA-DAI
  1.3 DATA INTEGRATION IN OGSA-DAI
2 METHODOLOGIES
  2.1 SCENARIOS
  2.2 METHOD
  2.3 ENVIRONMENTS AND TOOLS
3 THE DATA COPY SCENARIO AND BENCHMARKING
  3.1 DATA COPY BETWEEN TWO RELATIONAL DATABASES
    3.1.1 Direct approach
    3.1.2 OGSA-DAI approach: client control
    3.1.3 Analysis and discussion
  3.2 PROFILING THE BLOCKAGGREGATOR ACTIVITY
  3.3 DATA COPY USING OGSA-DAI DELIVERY ACTIVITIES
    3.3.1 DeliverToGDT
    3.3.2 DeliverFromGDT
    3.3.3 Analysis and discussion
  3.4 DATA COPY FROM XML DATABASE TO RELATIONAL DATABASE
    3.4.1 Description
    3.4.2 Direct approach
    3.4.3 OGSA-DAI approach
    3.4.4 Analysing and discussing
  3.5 SUMMARY
4 A SERVICE DRIVEN MODEL
  4.1 INTRODUCTION
    4.1.1 Client and service driven data integration
    4.1.2 Definition
  4.2 SYSTEM OVERVIEW
    4.2.1 Introduction
    4.2.2 GDSActivity
    4.2.3 Sequence and flow activity
    4.2.4 Security
  4.3 DETAIL DESIGN
    4.3.1 Introduction
    4.3.2 GDSActivity
    4.3.3 Sequence activity
    4.3.4 Flow activity
5 DATA INTEGRATION USING THE SERVICE DRIVEN MODEL
  5.1 DATA COPY
    5.1.1 Client driven
    5.1.2 Service driven
  5.2 DISTRIBUTED JOIN
    5.2.1 Client driven
    5.2.2 Service driven
  5.3 PROFILING THE SERVICE DRIVEN MODEL
  5.4 SUMMARY
6 CONCLUSIONS
  6.1 FUTURE WORK
APPENDIX A
  1 XML SCHEMAS
    1.1 grid_data_services_type_ext.xsd
    1.2 gds_activity.xsd
    1.3 sequence_activity.xsd
    1.4 flow_activity.xsd
REFERENCES
LIST OF FIGURES

FIGURE 34 UML DIAGRAM OF PROCESSING A SEQUENCE ACTIVITY
FIGURE 35 AN EXAMPLE OF CONTAINED ACTIVITIES OF A SEQUENCE ACTIVITY CHAINED BY AN INNER IO
FIGURE 36 UML DIAGRAM OF PROCESSING A FLOW ACTIVITY
FIGURE 37 DATA COPY USING SERVICE DRIVEN
FIGURE 38 DISTRIBUTED JOIN
FIGURE 39 DISTRIBUTED JOIN USING CLIENT DRIVEN MODEL
FIGURE 40 DISTRIBUTED JOIN IMPLEMENTED USING THE SERVICE DRIVEN MODEL
FIGURE 41 PERFORMANCE OF THE SIMPLE DATA COPY SCENARIO USING THE SERVICE DRIVEN MODEL

LIST OF TABLES

LIST OF CODES
ACKNOWLEDGMENTS
I would like to gratefully acknowledge the enthusiastic supervision of Dr. Mario
Antonioletti, who gave me great support in my work and whose comments and
suggestions made this work possible, and Tom Sugden, who shared his great
knowledge and skills of OGSA-DAI, and wonderful music, with me.
I also wish to thank Alastair Hume (EPCC), Alex Woehrer (NeSC), Konstantinos
Karasavvas (NeSC) and Neil P. Chue Hong (EPCC) for their comments on and
discussions of my work. In addition, I thank the OGSA-DAI team and the EPCC
support team for their assistance with all kinds of technical problems.
Most importantly, I am grateful to all my friends in Edinburgh for their care and
attention.
Chapter 1: Introduction
1 Introduction
1.1 Background
Data plays a fundamental role in all kinds of cross-organisational research and
collaboration. These organisations can be collectively deemed to form virtual
organisations (VOs) [15]. Data will exist in a variety of different formats, such
as unstructured or multimedia files in file systems, or structured collections
stored in relational or XML databases [4], and can vary in volume and may be
geographically distributed over a VO. Moreover, some large projects ([40], [41])
such as the LHC [40] will generate terabytes or even petabytes of data. It is
impossible to handle such large amounts of data within a single organisation
or institute. Addressing data access and integration across organisations is
going to be one of the big challenges in setting up VOs.
At present, Grid technologies are being developed to facilitate "coordinated
resource sharing and problem solving in dynamic, multi-institutional virtual
organisations" [15]. The Grid is "a system that coordinates distributed resources
using standard, open, general-purpose protocols and interfaces to deliver nontrivial
qualities of service" [27]. The concept of VOs proposed in [15] can be regarded as
a framework in which data access and integration have to be addressed.
The Open Grid Service Architecture (OGSA) [9] effort within the Global Grid
Forum (GGF) [28] is trying to define a standard framework to address the key
concerns in the Grid, e.g. service registries and the discovery process, lifecycle
management, metadata services, etc. This work is based on the Web services
framework [16], "a distributed computing paradigm based on standard techniques for
describing interfaces to software components, methods for accessing these components
via interoperable protocols and discovery methods that enable the identification of
relevant service providers" [29]. XML [10] technologies are used as the core
foundation for Web services as they provide platform independence and
portability. XML schema [36] [37] defines a grammar that can be used to define
other XML languages. The Web services stack (shown in Figure 1) consists of
multiple layers. Each layer is dedicated to a particular functionality and defined
by an XML based specification.
[Figure 1: The Web services stack — Messaging (SOAP), Description (WSDL), Discovery (UDDI)]
• The Simple Object Access Protocol (SOAP) [30] is an XML protocol,
defined using XML Schema, which is used to specify the
communication messages.
• The Web Services Description Language (WSDL) [18] is defined using
an XML schema and describes Web service interfaces.
• The Universal Description, Discovery and Integration (UDDI) [31]
specification is used to define a directory model for Web services.
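For illustration, a minimal SOAP 1.1 envelope has the following shape (the payload is left as a placeholder; this is not an actual OGSA-DAI message):

```xml
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <!-- application-specific XML payload goes here -->
  </soap:Body>
</soap:Envelope>
```

The WSDL description of a service defines what such payloads may contain, and UDDI directories let clients find services that publish those descriptions.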
As a result, OGSA inherits the advantages of Web services – platform
independence and language neutrality. The OGSA services can thus be
implemented in different languages, deployed on different platforms and
interoperate with each other using XML messaging.
The Open Grid Service Infrastructure (OGSI) specification [3] attempted to
define a detailed set of interfaces, which could be used to realise the OGSA
vision based on extensions to the Web services technologies. OGSI extended the
concept of Web services [16] by defining a "Grid service". A Grid service can
simply be looked on as a Web service with state and lifecycle management
properties explicitly added. A Grid service is identified by a Grid Service Handle
(GSH) [9] and described using GWSDL [3], an extension of WSDL. The
operations of a Grid service are grouped into WSDL portTypes; a portType is a
collection of related operations. For a variety of reasons [38] OGSI has been
deprecated and will be replaced by the WS-Resource Framework (WSRF) [42].
At the current time, a public implementation of WSRF is not available.
1.2 OGSA-DAI
The Open Grid Service Architecture Data Access and Integration (OGSA-DAI)
[1] is middleware1 that provides data access and integration capabilities to a
Grid consistent with the OGSA vision. The role of the OGSA-DAI middleware is
to present a unified programming model for application writers and mask out
problems of heterogeneity and distribution [32]. OGSA-DAI enables various
homogeneous or heterogeneous data resources, such as relational databases,
XML databases [4] and even file systems, to be accessed through a uniform Web
service based data access interface. At present, OGSI is used by OGSA-DAI for
its infrastructure; this will be replaced by WSRF in a future release. OGSA-DAI
is currently implemented on top of the Globus Toolkit 3 core [20], which
also relies on OGSI, and is written in the Java language [21].
In OGSA-DAI, a data resource is exposed as a persistent Grid service [3], called
a Grid Data Service Factory (GDSF), which acts as a point of presence for a data
resource on the Grid and is identified by a GSH. Clients access a data resource
through a transient Grid service, called a Grid Data Service (GDS), created by the
corresponding GDSF. A GDS acts as a client session and is responsible for
managing access to a data resource. Any OGSA-DAI service can declare
its existence and expose its metadata by registering with a
DAIServiceGroupRegistry (DAISGR) service provided by OGSA-DAI. The basic
OGSA-DAI framework is illustrated in Figure 2.

1 Middleware is a distributed software layer, or "platform", which abstracts the
complexity and heterogeneity of the underlying distributed environment, with its
multitude of network technologies, machine architectures, operating systems and
programming languages.
[Figure 2: The basic OGSA-DAI framework — a client interacts with a GDSF and the GDS it creates, both registered with a DAISGR in a service container; the GDS manages access to the data resource]
Two portTypes defined by the GGF Database Access and Integration Services
(DAIS) work group [22] are supported by OGSA‐DAI. At this time, OGSA‐DAI
is based on the GGF 7 version [23] of the DAIS specification. The
GridDataService portType defines a perform method which consumes an XML
document composed by the client. The document specifies the operations that
the GDS needs to execute. The GridDataTransport portType defines a set of
operations that allow a GDS to push or pull data to/from a third party. By this
means, data can be transferred between two GDSs or between a client and a
GDS.
4
Chapter 1: Introduction
OGSA-DAI has adopted a document based framework [39]. An XML document,
called a perform document, is used to describe the functionality that an
OGSA-DAI service should undertake. A basic task to be executed in a perform
document is called an activity. An OGSA-DAI activity defines a capability or
action that can be executed by a GDS. The activities2 discussed in this work are
listed in Table 1.
Activity            Description
sqlQueryStatement   Run an SQL [24] query statement.
sqlBulkLoadRowSet   Bulk load data into a table.
xPathStatement      Run an XPath [25] statement against an XML database.
inputStream         Receive data through the GridDataTransport portType.
outputStream        Deliver data through the GridDataTransport portType.
deliverFromGDT      Pull data from a service implementing the GridDataTransport portType.
deliverToGDT        Push data to a service implementing the GridDataTransport portType.
deliverFromURL      Retrieve data from a URL.
xsltTransform       Transform data using an XSLT [14].
blockAggregator     Aggregate multiple blocks into a single block.

Table 1 Activities supported by the current OGSA-DAI release3
A perform document can contain one or more activities. Activities can be
chained to form a pipeline by linking their inputs and outputs together. By this
means, they can interoperate with each other. By combining various activities,
very complex scenarios may be enacted using OGSA‐DAI.
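Schematically, such a pipeline might be expressed in a perform document as follows (activity names are taken from Table 1, but the attributes and input/output wiring shown here are simplified placeholders, not the exact OGSA-DAI schema):

```xml
<gridDataServicePerform>
  <!-- Query the data resource; its output feeds the next activity -->
  <sqlQueryStatement name="query">
    <!-- SQL text goes here -->
  </sqlQueryStatement>
  <!-- Transform the query results with a stylesheet -->
  <xsltTransform name="transform" input="query"/>
  <!-- Expose the transformed data on a GDT port -->
  <outputStream name="output" input="transform"/>
</gridDataServicePerform>
```

Each activity's output is named as the input of its successor, which is how the engine knows to connect them with a pipe.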
The OGSA‐DAI engine, the heart of a GDS, is responsible for validating and
analysing perform documents and constructing the activity instances specified
in these perform documents. A pipe is constructed by the engine to link any two
activities for which a data flow has been specified in a perform document. The
data flow describes the motion of data from the output of one activity to the
input of another activity.

2 In the following work, except where otherwise noted, the term "activity" is used
as a replacement for the more accurate term "OGSA-DAI activity".
3 This table is replicated from the OGSA-DAI release 4 documentation.
OGSA-DAI provides a Java API, the client toolkit, to facilitate the composition
of perform documents by clients. The basic building block of the client toolkit is
the client toolkit activity: a Java representation of a server-side activity, which
serialises into the XML fragment for the server-side activity that is to be
executed by a GDS.
1.3 Data integration in OGSA-DAI
Data integration is the problem of "combining data residing at
different sources, and providing the user with a unified view of these data". The
evaluation and discussion of data integration in OGSA-DAI in this work is
based on the investigation of the data integration scenarios
described later. Three key advantages of data integration from [8] are that it
frees users from:
• locating the data residing at various data resources;
• interacting each data resources independently; and
• combining results from different data resources manually.
The purpose of this work is to evaluate and extend OGSA-DAI's data
integration capabilities. It tries to identify and assess the following issues,
which relate to the data integration capabilities of OGSA-DAI.
• What is the overhead of using OGSA-DAI to implement a data
integration scenario?
• How easy is it for OGSA-DAI to implement a data integration
scenario?
• How well does OGSA-DAI scale for these data integration scenarios?
• Is OGSA-DAI robust and stable?
In addition, experiences of using and developing with OGSA-DAI are discussed.
The remaining chapters are organised as follows. Chapter 2 introduces the
scenarios and methodologies for the evaluation. A simple data copy data
integration scenario is investigated in chapter 3. Based on the results obtained in
chapter 3, a service driven model for data integration is considered in chapter 4,
and a detailed design is given using activities as a proof of concept. The use of this
service driven model for data integration is discussed in chapter 5. In chapter 6
the conclusions of this work are outlined and possible future directions for data
integration in OGSA-DAI are discussed.
Chapter 2: Methodologies
2 Methodologies
2.1 Scenarios
Two data integration scenarios are introduced and used to evaluate
OGSA‐DAI’s data integration capabilities.
The simple data copy scenario (shown in Figure 3) is discussed and investigated
in chapter 3 and section 5.1. In this scenario data is copied from a source data
resource (DR) to a sink data resource. A data resource in this scenario can be
either a relational or an XML database. This scenario is implemented both using
OGSA-DAI and using a direct approach that relies on the APIs provided by the
Java programming language to connect to data resources: mainly JDBC [11] for
relational databases and the XMLDB API [4] for XML databases.
A more complex data integration scenario (shown in Figure 4) implements a
distributed join case. In this case, the data is distributed between two
geographically separated data resources. When a client wants to query these
distributed data resources for information the data resources have to be queried
independently and the results from these queries need to be merged by the
client to form the final result. These data resources may or may not have the
same data schemas.
[Figure 4: The distributed join scenario — the client queries DR1 (steps 1, 2) and DR2 (steps 3, 4), inserts the results into DR3, and runs the final join on DR3 (step 5)]
A very simple method is used to implement this scenario, described as follows.
• Steps 1, 2: The client executes a select query on data resource 1 and
inserts the retrieved data into data resource 3.
• Steps 3, 4: The client executes a select query on data resource 2 and
inserts the retrieved data into data resource 3.
• Step 5: The client executes a join select on data resource 3 and retrieves
the final result.
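The join performed in step 5 is logically an equi-join of the two copied tables. The following pure-Java sketch (an illustration only, not part of any implementation used in this work) shows what the final join select computes:

```java
import java.util.Map;
import java.util.TreeMap;

class EquiJoin {
    // Mimics SELECT a.val, b.val FROM a JOIN b ON a.key = b.key:
    // for every key present in both inputs, pair up the two values.
    static Map<Integer, String[]> join(Map<Integer, String> a, Map<Integer, String> b) {
        Map<Integer, String[]> result = new TreeMap<>();
        for (Map.Entry<Integer, String> row : a.entrySet()) {
            String match = b.get(row.getKey());
            if (match != null) {
                result.put(row.getKey(), new String[] { row.getValue(), match });
            }
        }
        return result;
    }
}
```

Rows whose key appears in only one resource are dropped, exactly as an inner join on data resource 3 would drop them.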
2.2 Method
A simple benchmarking framework was developed which used the following
tools to measure the performance:
• Java's System.currentTimeMillis method returns
the current system time in milliseconds.
• Apache Log4J [12] is a logging toolkit used to output the system log to the
console or to a log file.
Each benchmark experiment was run 10 times. The mean of these runs was then
computed excluding the maximum and minimum values, in order to reduce the
influence of outliers on the mean. The standard deviation was calculated to indicate
how much the results spread out from the mean; it is presented as error bars in the
graphs later on.
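The trimming rule just described can be sketched as follows. This is a minimal illustration, and it assumes the standard deviation is taken over the retained runs:

```java
import java.util.Arrays;

class Benchmark {
    // Mean of the runs after discarding the single maximum and minimum values.
    static double trimmedMean(long[] runs) {
        long[] sorted = runs.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    // Sample standard deviation of the retained runs about the trimmed mean.
    static double trimmedStdDev(long[] runs) {
        long[] sorted = runs.clone();
        Arrays.sort(sorted);
        double mean = trimmedMean(runs);
        double sumSq = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            double d = sorted[i] - mean;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / (sorted.length - 3)); // n - 1 over the retained runs
    }
}
```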
2.3 Environments and tools
The experiments were conducted on three machines: brucite, coal
and diabase4. The specifications of these machines are listed in Table 2.
The operating systems and JDK versions installed on each of these machines are
listed in Table 3.
4 Diabase is the server that hosts the big commercial databases. The version of the JDK it
has is irrelevant as this machine could not be accessed directly.
OGSA‐DAI R4, Tomcat 5.0.25 and Globus Toolkit core 3.2 were installed on
brucite and coal. Figure 5 illustrates the deployment of the experiment
environment.
[Figure 5: Deployment of the experiment environment — OGSA-DAI on the GT core installed on both brucite and coal, with the commercial databases hosted on diabase]
Five relational databases and one XML database, those currently officially
supported by OGSA-DAI, were used in the experiments. Their
specifications are presented in Table 4.
Database    Type   Details
DB2         RDB    Host: diabase.epcc.ed.ac.uk; Version: 8.1.0.36; JDBC driver: 1.0
Oracle      RDB    Host: diabase.epcc.ed.ac.uk; Version: 9.2.0.1.0; JDBC driver: 1.0
SQLServer   RDB    Host: diabase.epcc.ed.ac.uk; Version: SQL Server 2000; JDBC driver: 2.2
Xindice     XMLDB  Host: coal.epcc.ed.ac.uk; Version: 1.0; XMLDB driver: Xindice 1.0

Table 4 Specification of data resources
Databases with the same data schema were deployed on each of the above
relational database systems. A one-million-row table was inserted into all the
relational databases, and 50,000 documents, each 250 bytes in size, were created
in the XML database.
Chapter 3: The data copy scenario and benchmarking
3 The data copy scenario and benchmarking
3.1 Data copy between two relational databases
In this case, the data copy scenario involved two relational databases, one acted
as a source and the other as a sink. The data was copied from the source to the
sink under the client’s control.
3.1.1 Direct approach
In the direct approach, the client executes an N-ROW SQL query6
on the source database to obtain a ResultSet instance.
each row read from the ResultSet into the sink database. The data size, N, was
selected from {100, 500, 1000, 5000, 10000, 50000, 100000}.
The whole data copy procedure consisted of four sub-operations.
1. Creating two JDBC connections
2. Executing the SQL query
3. Pulling and pushing data
4. Releasing resources
The time measurements taken to benchmark this scenario when implemented
using a direct approach started just before point 2 and ended just after point 3.
6 N‐ROW select query: A select query returns N rows of data.
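Steps 2 and 3 of this procedure can be sketched in JDBC terms as below (a simplified illustration; table names, column handling and connection management differ from the actual benchmark code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class DirectCopy {
    // Build "INSERT INTO table VALUES (?, ?, ...)" with one placeholder per column.
    static String insertSql(String table, int columnCount) {
        StringBuilder sb = new StringBuilder("INSERT INTO " + table + " VALUES (");
        for (int i = 0; i < columnCount; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")").toString();
    }

    // Run the query on the source and push each row into the sink.
    // (The two connections are created and closed by the caller, step 1 and 4.)
    static int copy(Connection source, Connection sink,
                    String query, String sinkTable) throws SQLException {
        int rows = 0;
        try (Statement st = source.createStatement();
             ResultSet rs = st.executeQuery(query)) {
            int cols = rs.getMetaData().getColumnCount();
            try (PreparedStatement ins =
                         sink.prepareStatement(insertSql(sinkTable, cols))) {
                while (rs.next()) {
                    for (int c = 1; c <= cols; c++) {
                        ins.setObject(c, rs.getObject(c));
                    }
                    ins.executeUpdate();
                    rows++;
                }
            }
        }
        return rows;
    }
}
```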
3.1.2 OGSA-DAI approach: client control
Figure 6 shows how an OGSA-DAI client is used to control the
pulling and pushing of data. Two perform documents are needed for this
scenario in OGSA-DAI.
[Figure 6: Client-controlled data copy — the client pulls data from the source GDS (fronting the source DR) over GDT and pushes it to the sink GDS (fronting the sink DR)]
The perform document sent to the source GDS is schematically described by
Code 1.
<gridDataServicePerform>
<sqlQueryStatement/>
<blockAggregator/>
<outputStream/>
</gridDataServicePerform>
Code 1 Perform document to source GDS of data copy
scenario
• The sqlQueryStatement activity carries an SQL query expression.
The data size N is varied over {100, 500, 1000, 5000, 10000, 50000}.
• The blockAggregator activity is used to aggregate a large number of
small blocks of data generated by the sqlQueryStatement activity into a
smaller number of large blocks of data. In this case, the block size is
fixed at 100 rows.
• The outputStream activity opens a GDT port on the source GDS from
which users can pull data to the client side.
When processing a large amount of data, the blockAggregator and
outputStream activities were used together to avoid pulling lots of small chunks
of data, which can cause the client or the service to fail with a Java
OutOfMemoryError.
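The re-blocking performed by blockAggregator can be illustrated with a generic sketch (this shows the idea only, not OGSA-DAI's implementation; the element type stands in for a block of data):

```java
import java.util.ArrayList;
import java.util.List;

class BlockAggregator {
    // Re-group many small blocks (here, individual elements) into fewer
    // blocks holding at most blockSize elements each.
    static <T> List<List<T>> aggregate(List<T> blocks, int blockSize) {
        List<List<T>> aggregated = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i += blockSize) {
            int end = Math.min(i + blockSize, blocks.size());
            aggregated.add(new ArrayList<>(blocks.subList(i, end)));
        }
        return aggregated;
    }
}
```

Fewer, larger blocks mean fewer GDT round trips and smaller per-block overheads when the client pulls the data.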
Another perform document (shown in Code 2) was sent to the sink GDS, which
consisted of one inputStream activity and one sqlBulkLoadRowSet activity.
<gridDataServicePerform>
<inputStream/>
<sqlBulkLoadRowSet/>
</gridDataServicePerform>
Code 2 Perform document to the sink GDS
• The inputStream activity opens a GDT port on the sink GDS through
which clients can push data to the sink GDS.
• The sqlBulkLoadRowSet activity bulk loads the data into a table. The
input of the activity must take the form of an XML document
formatted using the WebRowSet standard [26]. It is the client's
responsibility to translate the data pulled from the source data resource
into a WebRowSet formatted document.
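A WebRowSet document has roughly the following shape (an abbreviated sketch: the contents of the properties and metadata sections required by the standard [26] are elided, and the row values are hypothetical):

```xml
<webRowSet xmlns="http://java.sun.com/xml/ns/jdbc">
  <properties><!-- rowset properties --></properties>
  <metadata><!-- column metadata --></metadata>
  <data>
    <currentRow>
      <columnValue>1</columnValue>
      <columnValue>example</columnValue>
    </currentRow>
  </data>
</webRowSet>
```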
Since the perform method on the GridDataService of the OGSA-DAI client
toolkit is synchronous (it does not return until its operation has completed), a
thread has to be created on the client side for this scenario using
OGSA-DAI to enable data transfer to be performed asynchronously between
these two GDSs.
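The asynchronous transfer can be arranged with an ordinary Java thread, sketched below (Transfer is a hypothetical stand-in for the blocking transfer loop, not a type from the client toolkit):

```java
class AsyncTransfer {
    // Stand-in for the blocking pull/push loop between the two GDT ports.
    interface Transfer {
        void run() throws Exception;
    }

    // Run the transfer on its own thread so the client can carry on and send
    // the perform document to the sink GDS while data is still flowing.
    static Thread start(Transfer transfer) {
        Thread worker = new Thread(() -> {
            try {
                transfer.run();
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        worker.start();
        return worker;
    }

    // Wait for the transfer to finish without surfacing InterruptedException.
    static void awaitQuietly(Thread worker) {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```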
The whole procedure of the data copy scenario using OGSA‐DAI consisted of
the following operations.
1. Create two GDS instances.
2. Compose two perform documents.
3. Send the perform document (see Code 1) to the source GDS and get the
ResultSet instance which is used to retrieve data from the source database.
4. Create a new thread to transfer the data read from the ResultSet to the
sink GDS through its GDTPortType.
5. Send the perform document (see Code 2) to the sink GDS. Once the
GDTPortType is initiated on the sink GDS, the data transfer can be started
by the thread.
6. Release the resources allocated during the processing of this scenario on
the client side.
To be consistent with the direct approach scenario, the time measurements
taken to benchmark this scenario when implemented using an OGSA‐DAI client
started just before point 2 and ended just after point 5.
3.1.3 Analysis and discussion
The performance results for this scenario are shown in Figure 7 to
Figure 11. The fact that OGSA-DAI performs worse than JDBC is not surprising
because of the overheads introduced by using OGSA-DAI.
Figure 7 shows the performance results when the MySQL database was used as the
source database. Both the fastest and slowest performances occur in the copies
made from the MySQL database to the SQLServer database.
[Figure 7: mean time (ms) for 10 runs against data size (rows); series: [JDBC] MySQL to Oracle, [JDBC] MySQL to PSQL, [JDBC] MySQL to SQL, [OGSA-DAI] MySQL to Oracle, [OGSA-DAI] MySQL to PSQL, [OGSA-DAI] MySQL to SQL]
As the data type retrieved from the Oracle database was incompatible with the
data types required by the other databases, the sqlBulkLoadRowSet activity could
not insert the data read from the Oracle database into any database other than
MySQL. As a result, Figure 8 only shows the performance of the copy
made from the Oracle database to the MySQL database. Using the
OGSA-DAI client control approach, where the data is copied back to the client
and then pushed on to the sink GDS, was about 9 times slower than
using the direct approach.
[Figure: mean time (ms) for 10 runs against data size (rows); series: [JDBC] Oracle to MySQL, [OGSA-DAI] Oracle to MySQL]
Figure 8 Data copy: JDBC vs. OGSA-DAI (2)
The performance results when PostgreSQL was used as the source database
are shown in Figure 9. In the two cases where the OGSA-DAI client control
approach was used, the results were very similar, hence the lines in the graph
overlap. The times taken by the OGSA-DAI client control approach to copy data
from the PostgreSQL database to the MySQL database and to the SQLServer
database are almost the same. However, the times required by the direct approach
to perform these two copies are significantly different. A possible reason for this is
that both the MySQL database and the PostgreSQL database are installed on the
same machine (coal), while the SQLServer database is installed on another
machine (brucite). The network traffic may cause the observed disparity in
performance.
[Figure 9: mean time (ms) for 10 runs against data size (rows); series: [JDBC] PSQL to MySQL, [JDBC] PSQL to SQL, [OGSA-DAI] PSQL to MySQL, [OGSA-DAI] PSQL to SQL]
The performance results when the DB2 database was used as the source database are shown in Figure 10. Again, it is observed that both the fastest and slowest performances occurred in the copies made between the same pair of databases (DB2 and PostgreSQL).
[Figure 10: mean time (ms) for 10 runs vs. data size (rows), for the [JDBC] DB2 to Oracle/PSQL/SQL and [OGSA-DAI] DB2 to Oracle/PSQL/SQL copies.]
Figure 11 shows the performance results where the SQLServer database was
used as the source database.
[Figure 11: mean time (ms) for 10 runs vs. data size (rows), for the [JDBC] SQL to Oracle/PSQL and [OGSA-DAI] SQL to MySQL/Oracle/PSQL copies.]
It is observed from these graphs that the performance advantage of JDBC becomes more and more obvious as the data size increases. When the data size is 50000 rows, the OGSA-DAI approach is 3 to 13 times slower than the direct approach. This shows that the overheads introduced by OGSA-DAI are in proportion to the data size it processes.
Additional complexity is introduced by using the OGSA-DAI client control approach. The additional work done by OGSA-DAI is mainly engaged in the operations that compose the two perform documents and push the data.
• Composing perform documents
In the simple data copy scenario, composing a perform document in OGSA-DAI looked relatively complex in comparison with the JDBC code. However, OGSA-DAI also allows clients to construct a perform document by loading a predefined XML document, which might be a simpler alternative.
• Pushing data
In the current OGSA-DAI release, the perform method on the GDS of the OGSA-DAI client toolkit has to be synchronous. Thus, a new thread spawned by the client is required to transfer data asynchronously between the two GDSs. An asynchronous version of the perform method would be a good enhancement to the OGSA-DAI client toolkit. The observer pattern [13] could be adopted to notify the client when the perform method completes.
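The idea can be sketched as follows. The GridDataService and PerformListener interfaces below are simplified stand-ins, not the real OGSA-DAI client toolkit API: the client wraps the synchronous perform call in a new thread, and the listener (observer) is notified on completion or failure.

```java
// Sketch of an asynchronous perform wrapper using the observer pattern.
// GridDataService and PerformListener are simplified stand-ins for the
// OGSA-DAI client toolkit interfaces, not the real API.
public class AsyncPerform {

    /** Simplified stand-in for the synchronous perform call on a GDS. */
    public interface GridDataService {
        String perform(String performDocument) throws Exception;
    }

    /** Observer notified when the perform call finishes. */
    public interface PerformListener {
        void completed(String response);
        void failed(Exception cause);
    }

    /** Runs the synchronous perform on a new thread, notifying the listener. */
    public static Thread performAsync(final GridDataService gds,
                                      final String doc,
                                      final PerformListener listener) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    listener.completed(gds.perform(doc));
                } catch (Exception e) {
                    listener.failed(e);
                }
            }
        });
        worker.start();
        return worker;
    }
}
```

The client can then send perform documents to several GDSs concurrently and react in the listener, instead of blocking on each perform call in turn.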
3.1.3.3 Robustness
Some OGSA‐DAI bugs and flaws were detected during the investigation of this
scenario.
Connection releasing
When the number of activities that a perform document contains goes over a certain value, an exception indicating that “Can not connect the database” is thrown. This happens frequently. The possible reasons for this are:
• Each end point activity is executed by an individual thread in an OGSA-DAI service.
• The OGSA-DAI service allocates a JDBC connection for each SQL activity.
• However, these connections allocated through JDBC are not released until the Java objects that contain them are finalised by the JVM.
Thus, all the connections of a relational database are consumed quickly, and new requests cannot be served.
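One way to avoid this exhaustion is to return each connection deterministically in a finally block rather than relying on finalisation. The sketch below uses a hypothetical fixed-size pool of plain marker objects (not java.sql.Connection) purely to illustrate the difference:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a tiny fixed-size connection pool. "Connections" are plain
// marker objects, not java.sql.Connection; the point is that each one is
// returned in a finally block instead of waiting for JVM finalisation.
public class PoolSketch {

    public static class Pool {
        private final Deque<Object> free = new ArrayDeque<Object>();

        public Pool(int size) {
            for (int i = 0; i < size; i++) free.push(new Object());
        }

        public synchronized Object acquire() {
            if (free.isEmpty())
                throw new IllegalStateException("Can not connect the database");
            return free.pop();
        }

        public synchronized void release(Object conn) {
            free.push(conn);
        }

        public synchronized int available() {
            return free.size();
        }
    }

    /** Executes n activities, releasing the connection after each one. */
    public static void runActivities(Pool pool, int n) {
        for (int i = 0; i < n; i++) {
            Object conn = pool.acquire();
            try {
                // ... execute one SQL activity with conn ...
            } finally {
                pool.release(conn);   // deterministic release
            }
        }
    }
}
```

With the release in a finally block, many activities can share a small pool; if the release were left to finalisation, the pool would be exhausted after the first few activities.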
Parameterised SQL queries
An exception was thrown when a parameterised SQL statement was executed on the MySQL database using OGSA-DAI. The reason for this was identified after it was reported as a bug to the OGSA-DAI team. As the clients cannot specify type information for the parameters of a parameterised SQL query in the perform document, the OGSA-DAI engine cannot determine whether the parameter data parsed from the perform document should be converted into an integer or a string. Currently, the engine uses the PreparedStatement.setObject method to set the parameters; therefore, all parameters are set as Java Strings. As a result, when an unexpected string parameter is sent to MySQL, an exception is inevitable.
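If type information were carried in the perform document, the engine could convert each raw parameter before setting it. The following sketch (illustrative only, not the OGSA-DAI engine) uses the standard java.sql.Types constants to pick a conversion:

```java
import java.sql.Types;

// Sketch: convert a raw string parameter according to a declared SQL type,
// instead of passing everything to PreparedStatement.setObject as a String.
// Illustrative code, not the OGSA-DAI engine implementation.
public class ParamConverter {

    public static Object convert(String raw, int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:
            case Types.SMALLINT:
                return Integer.valueOf(raw);
            case Types.BIGINT:
                return Long.valueOf(raw);
            case Types.DOUBLE:
            case Types.FLOAT:
                return Double.valueOf(raw);
            default:
                return raw;   // fall back to a String
        }
    }
}
```

With such a conversion, setObject would receive a correctly typed value and MySQL would no longer be handed a string where an integer is expected.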
Large data sizes
When the data size in the data copy scenario was increased to 100,000 rows, the OGSA-DAI service gradually became unstable, but the JDBC version worked fine even when the data size was increased to one million7. The reason causing this problem is not known at this time.
Copying data from Oracle
When copying data from a GDS accessing an Oracle database onto a table residing on any database other than MySQL, an exception stating “unable to copy the data” was thrown. The reason for this was:
• In a ResultSet from an Oracle DB, all numbers are defined as SQL type NUMERIC.
• OGSA-DAI converts the NUMERIC type into a Java double type, which causes the problem.
A bug report of this was submitted to the OGSA-DAI team and the problem has been fixed.
7 It has been reported that the one million rows data copy case had already been done by the
OGSA‐DAI team.
3.2 Profiling the blockAggregator activity
The previous experiments show that OGSA-DAI does not perform as well as a direct approach. However, they do not indicate where the overheads are being introduced by OGSA-DAI. To better understand this, this section conducts a detailed investigation of the blockAggregator activity.
The blockAggregator activity aggregates small pieces of data read from the output of another activity into a larger chunk for transportation. It is expected that by using a blockAggregator activity together with an outputStream activity the data transfer traffic can be reduced.
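The aggregation step can be sketched as follows, assuming the blocks arrive as strings; the StringBuffer concatenation mirrors the behaviour described later in this section, but this is a sketch, not the OGSA-DAI implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of block aggregation: small blocks (rows) are concatenated with a
// StringBuffer until the requested block size is reached, so fewer, larger
// chunks are sent over the wire.
public class BlockAggregatorSketch {

    /** Aggregates rows into chunks of at most blockSize rows each. */
    public static List<String> aggregate(List<String> rows, int blockSize) {
        List<String> chunks = new ArrayList<String>();
        StringBuffer current = new StringBuffer();
        int rowsInChunk = 0;
        for (String row : rows) {
            current.append(row).append('\n');
            if (++rowsInChunk == blockSize) {
                chunks.add(current.toString());
                current = new StringBuffer();
                rowsInChunk = 0;
            }
        }
        if (rowsInChunk > 0) chunks.add(current.toString());
        return chunks;
    }
}
```

The number of transport operations is ceil(rows / blockSize), so a larger block size means fewer transfers at the cost of more buffer copying.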
The experiment used the perform document shown in Code 1 and only involved one GDS. The detailed procedure is as follows.
1. Create a GDS instance
2. Send the perform document to the GDS and then obtain an instance of a
ResultSet from the outputStream activity
3. Loop over the ResultSet and read in all the data
4. Release the GDS instance
The MySQL database was used as the target data resource.
The total time included the cost of points 1, 2, 3 above and the results are shown
in Figure 12. For each line in the graph, the data size was fixed and the block
size was varied from {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2500, 5000,
7500, 10000}. The graph shows that:
• The total time is proportional to the size of the data to be pulled.
• As the block size increases, the total time of each case increases linearly.
[Figure 12: total time, mean time (ms) for 10 runs vs. block size, for data sizes of 10000, 20000, 30000, 50000, 75000 and 100000 rows.]
The performance of creating a GDS instance is shown in Figure 13. A zoomed
view of this graph is shown in Figure 14, where the x‐axis in the 0 to 1000 region
is examined. The fact that the time to create a GDS is related to the data size is
unexpected. The reason for this is not known at this time.
[Chart: mean time (ms) for 10 runs vs. block size, for data sizes of 10000, 20000, 30000, 50000, 75000 and 100000 rows.]
Figure 13 Time to create a GDS
[Figure 14: zoomed view of Figure 13, block size 0 to 1500 rows.]
Obtaining a ResultSet consists of two sub-operations:
• Sending a perform document to the target GDS and
• Obtaining a ResultSet from the outputStream activity.
Inside the OGSA-DAI engine, the three activities described in Code 1 are chained together because of the dependency set between them. None of them is executed by the engine until the GDT opened by the outputStream activity receives a request sent by the client. Once a request is received, the SQL select query is executed on the database by the sqlQueryStatement activity, and the first chunk of data is retrieved, aggregated by the blockAggregator activity and returned to the client.
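This demand-driven chaining can be sketched with plain iterators: nothing upstream runs until the client pulls the first chunk. The activity names are borrowed from the perform document; the code is an illustration, not the OGSA-DAI engine.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of a demand-driven activity chain: query -> aggregator.
// The query does work only when a chunk is actually pulled.
public class LazyChain {

    public static boolean queryExecuted = false;

    /** Stand-in for sqlQueryStatement: yields rows, marking when it starts. */
    public static Iterator<String> query(final List<String> table) {
        final Iterator<String> it = table.iterator();
        return new Iterator<String>() {
            public boolean hasNext() { return it.hasNext(); }
            public String next() {
                queryExecuted = true;      // work happens only on demand
                return it.next();
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    /** Stand-in for blockAggregator: pulls up to blockSize rows per chunk. */
    public static String nextChunk(Iterator<String> rows, int blockSize) {
        StringBuffer chunk = new StringBuffer();
        for (int i = 0; i < blockSize && rows.hasNext(); i++) {
            chunk.append(rows.next()).append('\n');
        }
        return chunk.toString();
    }
}
```

Building the chain is cheap; the cost of the query and aggregation is only paid when the client requests data, which matches the behaviour described above.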
Therefore, it is no surprise to see that the time required to obtain a ResultSet, presented in Figure 15, is proportional to both the block size and the data size
used. For each line in the graph, the data size was fixed, and the block size was
varied from {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2500, 5000, 7500,
10000}.
[Figure 15: time to obtain a ResultSet, mean time (ms) for 10 runs vs. block size, for data sizes of 10000, 20000, 30000, 50000, 75000 and 100000 rows.]
In order to see clearly what is happening in the region where the block size was smaller than 1000 rows, a new set of experiments was undertaken that zoomed into this region. The corresponding results are presented in Figure 16.
[Figure 16: zoomed view, mean time (ms) vs. block size (0 to 1200 rows), for data sizes of 10000, 20000, 30000 and 50000 rows.]
The block size determines the size of each communication and thus the number of transport operations required. The blockAggregator activity uses a StringBuffer to concatenate multiple small blocks into a larger one. Therefore, a larger block size introduces more overhead through the use of the StringBuffer while, on the other hand, a smaller block size requires more transport operations.
Figure 17 shows the performance of pulling the data results from the MySQL database. It is observed that almost 90% of the total time (see Figure 12) was spent on the data transfer. From the results shown in the graph one can see that when the block size was in the 1000 to 7500 range, increasing the block size led to a reduction in execution time. However, when the data size was larger than 30000, a larger block size, i.e. 10000, can degrade the performance. This seems to indicate that the StringBuffer operations introduce a bigger overhead than the transport cost.
[Figure 17: time to pull the data, mean time (ms) for 10 runs vs. block size.]
Figure 18 shows the results of pulling data from a GDS, zooming into the region
where the block size range was varied from 50 rows to 1000 rows in fixed steps
of 50. The five lines presented in the graph show similar behaviour regardless of
the size of the data being retrieved. The jump when the block size is somewhere
between 600 rows and 650 rows is unexpected. Furthermore, the maximum
amount of time required to pull the data from the GDS happened when the
block size was set to 650. The reason for this behaviour is not known at this
time.
[Figure 18: time to pull the data, block size 50 to 1000 rows in steps of 50, for data sizes of 10000 to 100000 rows.]
Instead of the data being pulled back to the client and the client pushing it to a sink GDS, the deliverToGDT and deliverFromGDT activities supported by OGSA-DAI are used to enable the two GDSs to transfer data directly between themselves through the GDT ports (shown in Figure 19).
[Figure 19: the source and sink GDSs, each with its data resource (DR), transfer data directly through a GDT; the client issues only control flow to the two GDSs.]
3.3.1 DeliverToGDT
Two perform documents were needed to compose this scenario. The perform
document sent to the source GDS consisted of a sqlQueryStatement activity and
a deliverToGDT activity, which is shown in Code 3. The sqlQueryStatement
activity performs a select SQL query on the source database. The deliverToGDT
activity enables the source GDS to push data to the sink GDS via a GDT port
opened on the sink GDS. The perform document sent to the sink GDS is the
same as Code 2.
<gridDataServicePerform>
<sqlQueryStatement/>
<deliverToGDT/>
</gridDataServicePerform>
Code 3 Perform document to the source GDS
3.3.2 DeliverFromGDT
Two perform documents needed to be composed for this scenario. The perform document sent to the source GDS, shown in Code 4, consisted of two activities. The sqlQueryStatement activity performs a select SQL query on the source database. The outputStream activity opens a GDT port on the source GDS and serves the request.
<gridDataServicePerform>
<sqlQueryStatement/>
<outputStream/>
</gridDataServicePerform>
Code 4 Perform document sent to the source GDS
using deliverFromGDT activity
The perform document sent to the sink GDS is shown in Code 5. The
deliverFromGDT activity pulls data from the GDT opened on the source GDS.
The sqlBulkLoadRowSet activity is used to bulk load the data coming from the output of the deliverFromGDT activity into a corresponding table on the sink database.
<gridDataServicePerform>
<deliverFromGDT/>
<sqlBulkLoadRowSet/>
</gridDataServicePerform>
Code 5 Perform document sent to the sink GDS using
deliverFromGDT activity
The benchmarking results obtained when using the deliverToGDT activity to implement the simple data copy scenario are shown in the following graphs, from Figure 20 to Figure 24. These results show that the performance of using the deliverToGDT activity is similar to that of the OGSA-DAI client control approach (see section 3.1.2). The results for all pairs of databases used in this work are quite similar. The slowest result occurs when data is copied from the MySQL database to the PostgreSQL database; these two databases are installed on the same machine (coal).
Figure 20 shows the performance results where the source database was a
MySQL database. The three results obtained when copies are made from
MySQL to Oracle, PostgreSQL and SQLServer are all very close.
[Figure 20: mean time (ms) for 10 runs vs. data size (rows), MySQL as source.]
Figure 21 shows the performance results when the source database was the
Oracle database. For the same reason described in section 3.1.3.1, only the result
made from the Oracle database to the MySQL database is presented in the
graph.
[Figure 21: mean time (ms) for 10 runs vs. data size (rows), Oracle to MySQL.]
Figure 22 shows the performance results where the source database was the PostgreSQL database. The three results of data copies made from PostgreSQL to MySQL, Oracle and SQLServer overlap.
[Figure 22: mean time (ms) for 10 runs vs. data size (rows), PostgreSQL as source.]
Figure 23 shows the performance results where the source database was the
DB2 database. The four results made from DB2 to MySQL, Oracle, PostgreSQL
and SQLServer are all very close.
[Figure 23: mean time (ms) for 10 runs vs. data size (rows), for the DB2 to MySQL/Oracle/PSQL/SQL copies.]
Figure 24 shows the performance results when the source database was a
SQLServer database. The three results obtained for data copies from SQLServer
to Oracle, PostgreSQL and MySQL are very close.
[Figure 24: mean time (ms) for 10 runs vs. data size (rows), SQLServer as source.]
The benchmarking results obtained when using the deliverFromGDT activity to implement the simple data copy scenario are shown in the following graphs, Figure 25 and Figure 26. The deliverFromGDT approach is slower than the deliverToGDT approach. Copying data from the MySQL database to the Oracle database is the slowest (shown in Figure 25) amongst all these results. Compared to the performance results obtained using the deliverToGDT activity to implement the data copy scenario, the performance results of each case are relatively dissimilar.
[Figure 25: mean time (ms) for 10 runs vs. data size (rows), deliverFromGDT results.]
The best performance of this scenario implemented with the deliverFromGDT activity occurred when the data was copied from the SQLServer database to the MySQL database (see Figure 26).
[Figure 26: mean time (ms) for 10 runs vs. data size (rows), for the DB2 to SQL, SQL to MySQL and SQL to Oracle copies.]
Because the data transfer was handled by the corresponding delivery activity, the client did not need to handle the data transfer explicitly as was the case with the JDBC or OGSA-DAI client control approaches. By specifying the data flow using the delivery activities, OGSA-DAI can handle the data transfer of the data integration scenario automatically. What the client needs to do is to compose two perform documents and send them to the corresponding GDSs. Thus, some complexity on the client side is shifted to the server side.
A drawback of this approach is that the client has to compose several perform documents and manage their execution on the corresponding GDSs. As mentioned before, each perform operation has to be synchronous. As a result, the client has to explicitly use threads to deal with this.
This scenario differs from the previous ones in that it is performed between an XML database, which acts as the source data resource, and a relational database, which acts as the sink data resource. The case where the XML database acts as the sink data resource was not investigated because the current OGSA-DAI release does not support bulk loading data into an XML database.
The data read from the XML database has to be transformed from its XML format to an appropriate tabular format for entry into a relational data resource. The XMLDB API [4] was used to access the XML database. The XPath [25] query used to select N XML documents was
/entry[@id<=N]
The size N was selected from {100, 500, 1000, 5000, 10000}. The corresponding data extracted from each XML document was composed into a SQL insert query to be executed at the sink database.
The procedure using the direct approach in this scenario is described below.
1. Create a JDBC connection and an XML collection. An XML collection is similar to a JDBC connection, which is mainly used to manage an access session.
2. Select N XML documents from the XML collection.
3. Extract the corresponding data from each selected XML document and compose a SQL insert query to be executed on the sink database.
4. Release the resources.
The time taken for this benchmark started just before point 2 and ended just after point 3.
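Step 3 can be sketched with the standard DOM API. The element and attribute names (entry, id, name) are a hypothetical schema, since the actual document structure is not shown here:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch of composing a SQL insert query from one XML document.
// The entry/id/name schema is hypothetical.
public class XmlToInsert {

    public static String toInsert(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        org.w3c.dom.Element entry = doc.getDocumentElement();
        String id = entry.getAttribute("id");
        String name = entry.getElementsByTagName("name")
                .item(0).getTextContent();
        // NOTE: string concatenation is for clarity only; real code should
        // use PreparedStatement parameters to avoid SQL injection.
        return "INSERT INTO entry (id, name) VALUES (" + id + ", '" + name + "')";
    }
}
```

In the benchmark this composition is repeated for each of the N selected documents and the resulting statements are executed on the sink database.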
The perform document (Code 6) sent to the source GDS consists of two activities, xPathQueryStatement and deliverToGDT. The xPathQueryStatement activity performs an XPath query on the source XML database. The deliverToGDT activity delivers the data retrieved from the output of the xPathQueryStatement activity to the sink GDS through a GDT port opened on the sink GDS.
<gridDataServicePerform>
<xPathQueryStatement/>
<deliverToGDT/>
</gridDataServicePerform>
Code 6 Perform document sent to the source GDS
The perform document sent to the sink GDS is described in Code 7.
<gridDataServicePerform>
<inputStream/>
<deliverFromUrl/>
<xsltTransform/>
<sqlBulkLoadRowSet/>
</gridDataServicePerform>
Code 7 Perform document sent to the sink GDS
• The inputStream activity opened a GDT port on the sink GDS.
• The deliverFromUrl activity was used to load the XSLT definition from a specified URL.
• The xsltTransform activity was used to translate the XML data into WebRowSet format. Two inputs are required by an xsltTransform activity: the XML data was read from the inputStream activity and the XSLT definition was read from the deliverFromUrl activity. According to the XSLT definition, the xsltTransform activity can transform XML data between arbitrary schemas.
• The sqlBulkLoadRowSet activity bulk loaded the data into a corresponding table on the sink database.
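The server-side transformation step can be sketched with the standard JAXP API. The stylesheet below maps a hypothetical entries/entry schema to a row-oriented form; the real xsltTransform activity loads its XSLT from the deliverFromUrl activity and emits WebRowSet XML:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch of an XSLT transformation step using JAXP. The stylesheet and the
// entries/entry/row element names are hypothetical.
public class XsltSketch {

    private static final String XSLT =
        "<xsl:stylesheet version='1.0' "
      + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/entries'>"
      + "<rows><xsl:apply-templates select='entry'/></rows>"
      + "</xsl:template>"
      + "<xsl:template match='entry'>"
      + "<row><xsl:value-of select='name'/></row>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }
}
```

Because the stylesheet is just data, swapping it changes the target format without touching any client code, which is the loose coupling exploited by the xsltTransform activity.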
The detailed procedure for this scenario using OGSA-DAI is described below.
1. Create two GDS instances.
2. Compose two perform documents.
3. Asynchronously perform the perform document (described in Code 7) on the sink GDS. A new thread needs to be created on the client to achieve this.
4. Send the perform document presented in Code 6 to the source GDS and wait until both operations complete.
5. Release resources.
The time taken to benchmark this scenario when implemented using an
OGSA‐DAI approach started just before point 2 and ended just after point 4.
The performance results for this scenario are shown in Figure 27. The line marked as “XPath” represents the time required to perform the XPath query on the XML database using the direct approach. The gap between the two approaches is not as big as that found in the data copy scenario between two relational databases. One possible reason for this might be that the performance is mainly dependent on the amount of time taken to process the XPath query.
The performance of the direct approach mainly depended on the time required
to process an XPath query. The reason why the time required to process the
XPath query decreased with the increasing data size is not known.
The cost of using OGSA-DAI increased with the data size, which is different from the behaviour encountered with the direct approach. This may be caused by the overheads introduced by using the xsltTransform activity and the deliverFromUrl activity. The time required to process each of these two activities is dominated by the size of the data it processes.
[Figure 27: mean time (ms) vs. data size, OGSA-DAI approach vs. XPath (direct approach).]
In the direct approach the transformation has to be done by the client, using the DOM [43] API to extract data directly from the XML document or using an XSLT processor.
The xsltTransform activity enables OGSA-DAI to transform arbitrary XML documents to arbitrary document types on the server side. By embedding it in the activity chain, the data can flow smoothly from the inputStream activity to the sqlBulkLoadRowSet activity, and the transformation is done automatically. This mechanism reduces the complexity of the client side.
Comparing this scenario with the data copy scenario between relational data resources, the heterogeneity of using relational and XML data resources introduces more difficulties when trying to do data integration using a direct approach. Moreover, this becomes more difficult as more data resources are introduced.
• The client has to use two different sets of APIs.
• The client has to address the difference in data formats between the different data resources.
• The scalability of the direct approach is poor because the XML transformation operation has to be explicitly performed on the client.
• The client has to handle the data transfer between the two data resources.
These difficulties can be addressed by OGSA-DAI in this scenario.
• OGSA-DAI itself provides a uniform interface to heterogeneous data resources.
• The format transformation can be addressed by defining an XSLT document, which is loosely coupled to the code. Furthermore, an XSLT document can also be looked upon as a data source and integrated into the whole scenario.
• The data transfer is handled by the OGSA-DAI service.
3.5 Summary
From the results of the above investigation, it appears that the performance of OGSA-DAI is about 3 to 13 times slower for the biggest data sizes than when using a JDBC direct approach. The major overhead of OGSA-DAI identified in the above investigations is introduced by the GDT when transferring data.
Robustness is also an issue for OGSA-DAI. OGSA-DAI was not found to be fully compatible with the five relational databases it officially supports. Some OGSA-DAI bugs and flaws were found during the experiments. When the data sizes in the data copy scenario were increased OGSA-DAI always failed for a variety of different reasons (see section 3.1.3.3).
In terms of the programming experience of this scenario, using the OGSA-DAI client toolkit and the direct approach were quite similar. Both approaches used the same query languages, e.g. SQL and XPath, and programming interfaces, such as ResultSet in JDBC and Collection in the XMLDB API, to implement the data copy scenario. Composing a perform document using the OGSA-DAI client toolkit introduces some complexity on the client side; however, it is a simple job and can be avoided by using a pre-defined perform document instead.
The advantages of using OGSA‐DAI to perform data integration identified in
this data copy scenario can be summarised as follows:
• Uniform access interface
The uniform access interface of OGSA-DAI can greatly simplify the complexity of data integration. All data resources, whether homogeneous or heterogeneous, provided by different vendors, are exposed by their GDSFs identified through GSHs. The client does not need to be concerned as to whether JDBC or XMLDB drivers need to be used or with the details of the data resource configuration. Furthermore, only one interface, the GDS, is needed by the clients to access all kinds of data resources. The instructions as to how the data integration scenario is to be processed by a GDS are described in a perform document. By this means, OGSA-DAI users can focus on the data integration logic instead of how to implement it.
• GDT
OGSA-DAI enables data to be transported between two GDSs directly through a GDT instead of going through the client side back and forth. This basic service coordination functionality frees the client from handling the data transportation explicitly.
• Data flow
In the data integration scenario above, the client can specify the data flows in a perform document by linking the input and output of two related activities together. Moreover, these data flows can be recognised and handled by OGSA-DAI. This reduces the complexity at the client side, and enables the whole data integration to be processed automatically and independently of the client.
It is observed from the data copy scenario that the complexity on the client side
can be reduced due to functionality, such as data transformation and data
transport, being implemented at the server side. Two GDSs can be coupled
together in OGSA‐DAI using the deliverToGDT or the deliverFromGDT
activities to enable data to be transported directly between them. This frees the
clients from having to handle the data transfer directly.
However, what is observed from the above scenario is that the client still has to interact with every data resource independently and synchronously. Each GDS needs to be initialised and the perform documents need to be composed independently and sent to each GDS, and the perform method on the GridDataService in the OGSA-DAI client toolkit is synchronous. When the number of data resources participating in a data integration scenario is increased, the client still needs to interact with each GDS. A high level OGSA-DAI service driven data integration framework, which tries to remove this restriction, is proposed in the next chapter.
Chapter 4: A service driven model
Consider the role of the analyst. An analyst is mainly responsible for analysing the data integration scenario and identifying the data integration logic for that scenario. Ideally, OGSA-DAI could directly apply the logic deduced by the analyst as the instructions required to direct the whole scenario; the data integration could then proceed smoothly and automatically.
However, this cannot be done easily with the current OGSA-DAI release. The main reason for this is that the interaction between two GDSs has only limited support; only the data transfer between services is supported. Also, OGSA-DAI does not have support in its document framework to express the data integration logic properly. The perform document can only express very limited data integration logic, such as the actions and a limited expression of data flow. As a result, some logic has to be implemented on the client side and the whole scenario has to be explicitly controlled by the client.
In the current release the client has to drive data integration scenarios, such as the one shown in Figure 28. In this case a user, or application, would have to create both GDSs and then compose perform documents, using the OGSA-DAI client toolkit, which would then be sent to each GDS. However, when it comes to coordinating more than one GDS in a data integration scenario it is up to the user to compose the corresponding number of perform documents and manage the GDSs explicitly.
[Figure 28: the client driven model; the client sends control flow to each GDS and the data flows between the GDSs and their data resources (DRs) through a GDT.]
In this chapter, a service driven model is proposed where the service‐to‐service
interactions are mediated through a single service, shown schematically in
Figure 29. This would free users from a lot of the explicit service management.
In the service driven case a user would only need to describe one perform
document and send this to the coordinating service. A GDS supporting an extra
set of activities would then be used as a pivot point to coordinate the other
GDSs. This preliminary work thus designs and implements a service driven model by extending the existing OGSA-DAI activity model. This work should be regarded as a proof of concept which, if successful, could see some of the functionality suggested here being integrated into OGSA-DAI components such as the GDS engine.
[Figure 29: the service driven model; the client sends a single perform document to a coordinating GDS, which issues control flow to the other GDSs, while the data flows directly between the GDSs through GDTs.]
4.1.2 Definition
Some new terminology is introduced here, base and composite activities, which
are used throughout the remainder of this document to mean:
• base activities: all activities supported in the current OGSA-DAI release are referred to as base activities. These do not contain other activities and usually perform a single operation, such as executing a SQL or XMLDB query on a target data resource, transforming an XML document using an XSLT document or transferring data.
• composite activities: act as containers for other activities. The activities enclosed in a composite activity are called contained activities. A composite activity may manage and control these contained activities in a similar manner to the OGSA-DAI engine.
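This distinction can be sketched with the composite pattern [13]. The class names below follow the terminology above but are illustrative only, not OGSA-DAI code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of base vs. composite activities using the composite pattern.
// Names mirror the terminology in the text; this is not OGSA-DAI code.
public class ActivityModel {

    public interface Activity {
        String process();
    }

    /** A base activity performs a single operation. */
    public static class BaseActivity implements Activity {
        private final String operation;
        public BaseActivity(String operation) { this.operation = operation; }
        public String process() { return operation; }
    }

    /** A composite activity contains and drives other activities. */
    public static class CompositeActivity implements Activity {
        private final List<Activity> contained = new ArrayList<Activity>();
        public CompositeActivity add(Activity a) { contained.add(a); return this; }
        public String process() {
            StringBuffer result = new StringBuffer();
            for (Activity a : contained) result.append(a.process()).append(';');
            return result.toString();
        }
    }
}
```

Because a composite activity is itself an Activity, composites can be nested, which is the property the control-flow activities in the next sections rely on.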
4.2 System overview
4.2.1 Introduction
The current OGSA-DAI release, as it stands, is not entirely sufficient to support the service-driven model, for the following reasons:
1. Limited interaction functionality between GDSs
In the current OGSA-DAI release interaction between GDSs is only supported through the GDT. Moreover, this interaction is limited to data transfer. No function is currently supported by OGSA-DAI to enable a GDS to control or manage other GDSs.
2. Local data flow
The users can specify the data flow between two activities in a perform
document. However, in a data integration scenario involving multiple
GDSs, each GDS will only be able to see its local contribution to the
global data flow. This is a problem for a service driven model. Hence,
the coordination among multiple GDSs has to adopt the client‐driven
model.
3. No control flow syntax or control flow support
In the service driven model, the coordinating GDS needs to know the particular order in which activities are to execute. However, control flow syntax and management are not supported in the current OGSA-DAI distribution.
In order to be able to use a service‐driven model the extensions discussed in the
following sections are necessary.
4.2.1.1 Interaction between GDSs
In a service‐driven model a GDS needs to coordinate other GDSs to implement
a given data integration scenario. The GDSActivity proposed in this document
is used to meet this requirement. A GDSActivity is a composite activity
executed by a GDS to send a collection of base activities to another GDS and
receive the corresponding results. By this means, a GDS is able to use or manage
other GDSs in OGSA‐DAI.
The coordinating GDS also needs to manage the sequence in which GDSActivities are executed on GDSs to enable service coordination. There is no support for the specification and management of control flow8 in the current OGSA-DAI release. Hence, two new composite activities, sequence and flow, are used to prescribe two types of simple control flow for any activities they contain.
Because this document mainly proposes an approach to support service-driven data integration in OGSA-DAI, discussion of control flow within a GDS is out of scope. As a result of this restriction, the activities contained in sequence and flow activities must not be base activities.
The IO between activities can take the following two forms:
• Internal IO: both the producer and the consumer of an internal IO are activities located in the same GDS.

8 Control flow: an abstract representation of all possible sequences of events in a program's execution.
48
Chapter 4: A service driven model
y External IO: The producer and the consumer of an external IO are
activities located in different GDSs. For example, the output of an
outputStream activity is an external output.
Two new types of IO are identified for composite activities; their definitions
are limited to the scope of a composite activity:
• Inner IO: Both the producer and the consumer of an IO are activities
inside a composite activity.
• Outer IO: Either the producer or the consumer of an IO is an activity
outside of a composite activity but located on the same GDS.
Composite activities need to explicitly handle the inner and outer types of IO.
The details of composite activities are discussed in the following sections.
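The four IO categories above can be made concrete with a small classification sketch. This is illustrative only, not OGSA-DAI code; the `Endpoint` fields and the `classify_io` function are assumed names introduced here:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    gds: str        # the GDS on which the producing/consuming activity runs
    composite: str  # name of the enclosing composite activity, "" if none

def classify_io(producer: Endpoint, consumer: Endpoint, scope: str) -> str:
    """Classify one IO link from the point of view of composite `scope`."""
    if producer.gds != consumer.gds:
        return "external"   # endpoints on different GDSs
    if producer.composite == scope and consumer.composite == scope:
        return "inner"      # both endpoints inside the composite
    if producer.composite == scope or consumer.composite == scope:
        return "outer"      # exactly one endpoint inside the composite
    return "internal"       # same GDS, outside the composite

inside = Endpoint(gds="gds1", composite="seq1")
outside = Endpoint(gds="gds1", composite="")
remote = Endpoint(gds="gds2", composite="")
assert classify_io(inside, outside, "seq1") == "outer"
assert classify_io(inside, remote, "seq1") == "external"
assert classify_io(inside, inside, "seq1") == "inner"
```

Note that inner and outer IOs are refinements of internal IO within one composite's scope, which is why the GDS check comes first.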
4.2.2 GDSActivity
A GDSActivity is a composite activity which is used to perform a collection of
activities on a target GDS and receive the corresponding results. The client must
specify the GSH of either a GDS or a GDSF. The GDSActivity decides how to
obtain the GDS instance according to the name of the element containing the
GSH. For simplicity, all activities contained in a GDSActivity must be base
activities.
More complex control flow can be defined and processed by nesting composite
activities.
A sequence activity prescribes a sequential execution order. The activities
contained in a sequence activity must be executed in the order they are specified
by the coordinating GDS. A flow activity enables the contained activities to be
executed in parallel.
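The intended semantics of nested sequence and flow activities can be sketched with a tiny interpreter over activity trees. This is a hedged illustration of the proposed control-flow rules, not the GDS engine's actual implementation; the node shapes and names are hypothetical:

```python
import threading

def run(node, log, lock=None):
    """Execute a nested tree of ('seq', children), ('flow', children) and
    leaf activity names, recording each leaf's completion in `log`."""
    lock = lock or threading.Lock()
    if isinstance(node, tuple) and node[0] == "seq":
        for child in node[1]:          # strictly one after another
            run(child, log, lock)
    elif isinstance(node, tuple) and node[0] == "flow":
        threads = [threading.Thread(target=run, args=(c, log, lock))
                   for c in node[1]]
        for t in threads:
            t.start()                  # all children run concurrently
        for t in threads:
            t.join()                   # flow completes when all have completed
    else:
        with lock:
            log.append(node)           # a leaf "activity" completes

log = []
# A sequence containing a flow: "a" and "b" may finish in either order,
# but "c" always completes after both.
run(("seq", [("flow", ["a", "b"]), "c"]), log)
assert log[-1] == "c" and set(log) == {"a", "b", "c"}
```

The example mirrors the restriction above: the leaves stand for composite activities, and ordering guarantees come only from the enclosing sequence.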
4.2.4 Security
As this document only presents a preliminary proposal for service driven data
integration in OGSA-DAI, security is not a main concern here.
Three composite activities are implemented to support the service-driven
model. The abstract class CompositeActivity implements the Activity interface
and provides common functions for all three inheritors, such as identifying
the inputs and outputs of each activity.
4.3.2 GDSActivity
The GDSActivity XML schema is shown in appendix 1.2. The activities
contained in a GDSActivity can be one or more base activities. They are
enclosed in a <content> element so as to facilitate the XML processing. Either a
<gdsf> or <gds> element is used by a client to specify the target GDS. Code 8
presents an example of a GDSActivity.
The <gdsf> element indicates that a GDS needs to be created from a
GridDataServiceFactory which is specified by the url attribute. The GDSActivity
(gdsActivity) contains one sqlUpdateStatement (statement) activity. The output
of the GDSActivity is established and mapped to the output (results) of the
contained sqlUpdateStatement activity. After the GDSActivity receives the
result of the sqlUpdateStatement, it writes the result to its output. The data
may then be composed into a response document or consumed by another activity.
<gdsActivity name="gdsActivity">
<gdsf url="urlOfGDSF"/>
<content>
<sqlUpdateStatement name="statement">
<!-- value of first parameter -->
<sqlParameter position="1">
12
</sqlParameter>
<!-- value of second parameter -->
<sqlParameter position="2">
321
</sqlParameter>
<expression>
insert into littleblackbook values (?,?)
</expression>
<resultStream name="results"/>
</sqlUpdateStatement>
</content>
<output activity="statement" name="results" type="0"/>
</gdsActivity>
Code 8 Example of a GDSActivity
4.3.2.1 Processing
A GDSActivity acts as a client to the coordinated GDS in order to perform the
activities it contains.
When a GDSActivity is executed by the GDS engine, it acts as a client to the
GDS specified in its description. The processing steps of a GDSActivity are
illustrated in Figure 32.
1. If a <gdsf> element is specified in the XML description of a GDSActivity,
the GDSActivity creates a new instance of GDS from the given GDSF's GSH. If a
<gds> element is specified, the GDSActivity uses that existing instance.
2. The GDSActivity constructs the activities it contains into a
DocumentRequest object. Because these activities are executed by the
coordinated GDS, it is not necessary for a GDSActivity to check whether
they are supported by the coordinating GDS.
3. The DocumentRequest is sent to the coordinated GDS using the
OGSA‐DAI client toolkit.
4. If any outer input is identified in this GDSActivity, it transfers data
from the input to the coordinated GDS using a GDT portType.
5. A response document is received.
6. The GDSActivity is responsible for transferring data from the
coordinated GDS to any outer output identified in the GDSActivity using a
GDT portType.
7. If a GDSActivity creates the coordinated GDS from a
GridDataServiceFactory, it destroys the GDS instance when processing of
the base activities in that GDS has finished.
Any exception that arises from these steps causes an ActivityUserException to
be thrown by the ActivityEngine.
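The seven steps can be sketched as a single routine. Everything here is hypothetical scaffolding: the `toolkit` object and its method names are stand-ins for the OGSA-DAI client toolkit, and `ActivityUserError` stands in for `ActivityUserException`; only the ordering of the steps follows the text:

```python
class ActivityUserError(Exception):
    """Stand-in for OGSA-DAI's ActivityUserException."""

def process_gds_activity(desc, toolkit):
    """Run the seven processing steps for one GDSActivity description."""
    # Step 1: create a GDS from the given GDSF, or reuse an existing GDS.
    created = "gdsf" in desc
    gds = (toolkit.create_gds(desc["gdsf"]) if created
           else toolkit.get_gds(desc["gds"]))
    try:
        request = toolkit.build_request(desc["content"])   # step 2
        toolkit.send(gds, request)                         # step 3
        for src in desc.get("outer_inputs", []):           # step 4: push via GDT
            toolkit.push_via_gdt(src, gds)
        response = toolkit.receive_response(gds)           # step 5
        for sink in desc.get("outer_outputs", []):         # step 6: pull via GDT
            toolkit.pull_via_gdt(gds, sink)
        return response
    except Exception as exc:
        # Any failure in steps 2-6 surfaces as one user-level error.
        raise ActivityUserError(str(exc)) from exc
    finally:
        if created:
            toolkit.destroy(gds)                           # step 7

class FakeToolkit:
    """Records the order of calls; stands in for the client toolkit."""
    def __init__(self):
        self.calls = []
    def create_gds(self, url):
        self.calls.append("create"); return "gds-instance"
    def get_gds(self, url):
        self.calls.append("get"); return "gds-instance"
    def build_request(self, content):
        return content
    def send(self, gds, request):
        self.calls.append("send")
    def push_via_gdt(self, src, gds):
        self.calls.append("push")
    def receive_response(self, gds):
        self.calls.append("receive"); return "response"
    def pull_via_gdt(self, gds, sink):
        self.calls.append("pull")
    def destroy(self, gds):
        self.calls.append("destroy")

tk = FakeToolkit()
process_gds_activity({"gdsf": "urlOfGDSF", "content": []}, tk)
assert tk.calls == ["create", "send", "receive", "destroy"]
```

The `finally` clause reflects step 7: a GDS created from a factory is destroyed whether the contained activities succeed or fail.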
4.3.2.2 Data flow
A GDSActivity needs to link its outer inputs and outputs to the inputs and
outputs of the activities it contains. Thus, if a base activity contained in a
GDSActivity needs to read data from an outer input of the GDSActivity, an
input link is established by the GDSActivity, which is responsible for
transferring data between these two activities. If an activity contained in a
GDSActivity needs to write data to an outer output of the GDSActivity, an
output link is established by the GDSActivity, which again performs the
transfer. The inner IOs of a GDSActivity are managed by the coordinated GDS.
Currently, in order to transfer data between two GDSs a GDT has to be
explicitly specified. Because of this, if a GDSActivity takes an outer input,
the consumer of this outer input must be an inputStream activity.
An example of data flow of a GDSActivity is shown in Figure 33.
Figure 33 An example of GDSActivity data flow
The GDSActivity in Figure 33 contains three base activities. Dashed boxes
are used to represent the activities contained in the GDSActivity, as they are
actually executed by the coordinated GDS. The data flow is illustrated by the
arrowed lines. Activity3 needs to read data from an outer input of this
GDSActivity, so an input (the top rectangle in the GDSActivity) is set up by
the GDSActivity and the data is transferred from the outside to Activity3 by
the GDSActivity. As with the input processing, an output (the bottom rectangle
in the GDSActivity) is established by the GDSActivity and is used to write
data generated by Activity3 to the outside of the GDSActivity.
4.3.3 Sequence activity
A sequence activity contains a collection of composite activities that are
performed sequentially, in the order in which they are specified. The XML
schema for a sequence activity is shown in appendix 1.3. The sequence
activity completes when the last activity contained in the sequence finishes,
or when any error occurs during execution.
4.3.3.1 Processing
A sequence activity first parses the activity elements it contains, then
constructs them into corresponding composite activity implementation
instances, and finally executes them in turn. The processing is illustrated
in Figure 34.
1. A sequence activity needs to construct the activity elements it contains
into activity instances. If an unsupported activity exists, an exception is
thrown and the sequence activity is terminated.
2. If an inner IO operation is identified between two activities contained in
a sequence activity, a pipe is created to link these two activities. Thus an
activity chain is formed. An example is shown in Figure 35.
Figure 35 An example of an activity chain: Activity 1 and Activity 3 linked
by an inner IO within a sequence activity
3. The sequence activity processes the end point activity of each activity
chain in turn. If any exception arises during this step, it causes the
sequence activity to terminate and an exception is thrown up to the GDS
engine.
4. The sequence activity completes when there are no sub-activity instances
left.
4.3.3.2 Data flow
In the same way as for a GDSActivity (section 4.3.2.2), a sequence activity also
needs to link inputs and outputs that come from outside to the inputs and
outputs of the activities it contains. For an inner IO of a sequence activity, a pipe
is used to link the two corresponding activities involved in the IO. By this
means, an activity chain is formed. For an outer IO in a sequence activity, the
sequence activity needs to handle the data transfer explicitly.
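The pipe-linking step that turns inner IOs into activity chains can be sketched as follows. The tuple representation and the `build_chains` name are assumptions made for illustration; OGSA-DAI itself represents activities and pipes differently:

```python
def build_chains(activities):
    """Pipe matching inner IOs to group an ordered list of
    (name, consumes, produces) activities into chains; `consumes` and
    `produces` are lists of stream names."""
    stream_chain = {}   # stream name -> index of the chain producing it
    chains = []
    for name, consumes, produces in activities:
        linked = sorted({stream_chain[s] for s in consumes
                         if s in stream_chain})
        if linked:
            cid = linked[0]          # join the producing activity's chain
        else:
            cid = len(chains)        # no inner IO: start a new chain
            chains.append([])
        chains[cid].append(name)
        for s in produces:
            stream_chain[s] = cid    # later consumers of s join this chain
    return chains

acts = [
    ("query1",  [],       ["rows"]),   # produces the stream "rows"
    ("update1", [],       []),         # independent activity
    ("load1",   ["rows"], []),         # consumes "rows": same chain as query1
]
assert build_chains(acts) == [["query1", "load1"], ["update1"]]
```

The sequence activity would then process the end point of each resulting chain in turn, as described in the processing steps above.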
4.3.3.3 Example
There are two GDSActivities in the sequence activity presented in Code 9,
gdsActivity1 and gdsActivity2. These must be executed sequentially. The
sequence activity completes when gdsActivity2 completes. No input is required
in this case; the two outputs (results1 and statementResponse) exposed by the
sequence activity are linked to the innermost outputs, which belong to the
activities update1 and query1 respectively.
<sequence name="sequence">
  <content>
    <!-- gdsActivity1 -->
    <gdsActivity name="gdsActivity1">
      <gdsf url="someurl1"/>
      <content>
        <sqlUpdateStatement name="update1">
          <expression>
            insert into littleblackbook values (12,33)
          </expression>
          <resultStream name="results1"/>
        </sqlUpdateStatement>
      </content>
      <output activity="update1" name="results1" type="0"/>
    </gdsActivity>
    <!-- gdsActivity2 -->
    <gdsActivity name="gdsActivity2">
      <gdsf url="someurl2"/>
      <content>
        <sqlQueryStatement name="query1">
          <expression>
            select * from littleblackbook where id<=321 and id>=12
          </expression>
          <webRowSetStream name="statementResponse"/>
        </sqlQueryStatement>
      </content>
      <output activity="query1" name="statementResponse" type="0"/>
    </gdsActivity>
  </content>
  <!-- output mapping -->
  <output activity="gdsActivity1" name="results1" type="0"/>
  <output activity="gdsActivity2" name="statementResponse" type="0"/>
</sequence>
Code 9 Example of a sequence activity
4.3.4 Flow activity
A flow activity contains a collection of activities that can be run in
parallel. The activities contained in a flow activity have to be composite
activities. A flow activity completes when all the
activities it contains have completed. The XML schema for a flow activity is
presented in appendix 1.4.
4.3.4.1 Processing
A flow activity first parses the activity elements it contains, constructs
them into corresponding activity implementation instances, and then executes
them concurrently. Figure 36 presents the sequence for processing a flow
activity.
1. A flow activity needs to construct the activity elements it contains into
activity instances. If an unsupported activity exists, an exception is thrown
and the flow activity is terminated.
2. For each activity instance contained in the flow activity, it spawns a
new thread and lets the thread execute the activity. Thus, all activities can
be processed in parallel.
3. Data flow between two sub activities in a flow activity is not currently
allowed, because of possible hidden dependencies.
4. A flow activity waits for all the activities it contains to complete or
fail. The flow activity scans all activities to see whether any errors have
occurred. If no error is detected, the flow activity completes as usual;
otherwise it terminates with an exception indicating which activities failed.
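Steps 2 and 4 can be sketched with Python threads: one thread per contained activity, wait for all of them, then report every failed activity in a single error. This is an illustrative sketch of the described behaviour, not the actual GDS engine code:

```python
import threading

def run_flow(activities):
    """Run each (name, fn) activity on its own thread; after all complete,
    raise one error naming every activity that failed."""
    errors = {}
    lock = threading.Lock()

    def worker(name, fn):
        try:
            fn()
        except Exception as exc:
            with lock:
                errors[name] = exc     # record, but let the others finish

    threads = [threading.Thread(target=worker, args=(n, f))
               for n, f in activities]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # wait for all to complete or fail
    if errors:
        raise RuntimeError("failed activities: " + ", ".join(sorted(errors)))

def ok():
    pass

def boom():
    raise ValueError("database unavailable")

failed = ""
try:
    run_flow([("gdsActivity1", ok), ("gdsActivity2", boom)])
except RuntimeError as exc:
    failed = str(exc)
assert "gdsActivity2" in failed
```

Note that a failing activity does not interrupt its siblings; the error scan happens only after every thread has been joined, matching step 4.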
4.3.4.2 Data flow
As with a sequence activity, a flow activity also needs to link external
inputs and outputs to the inputs and outputs of the activities it contains.
4.3.4.3 Example
An example of a flow activity is shown in Code 10. Two GDSActivities,
gdsActivity1 and gdsActivity2, can be executed in parallel. The flow activity
completes when both GDSActivities have completed. No input is required in
this case; two outputs (result1 and statementResponse) are exposed by the
flow activity.
<flow name="flow">
<content><!-- gdsActivity 1-->
<gdsActivity name="gdsActivity1">
<gdsf url="someurl1"/>
<content>
<sqlUpdateStatement name="update1">
<expression>
insert into littleblackbook values (12,33)
</expression>
<resultStream name="result1"/>
</sqlUpdateStatement>
</content>
<output activity="update1" name="result1" type="0"/>
</gdsActivity>
<!-- gdsActivity2-->
<gdsActivity name="gdsActivity2">
<gdsf url="someurl2"/>
<content>
<sqlQueryStatement name="query1">
<expression>
select * from littleblackbook where id<=321 and
id>=12
</expression>
<webRowSetStream name="statementResponse"/>
</sqlQueryStatement>
</content>
<output activity="query1" name="statementResponse"
type="0"/>
</gdsActivity>
</content>
<!-- output mapping-->
<output activity="gdsActivity1" name="result1" type="0"/>
<output activity="gdsActivity2" name="statementResponse"
type="0"/>
</flow>
Code 10 Example of a flow activity
5 Data integration using the service driven model
In this section, a more complex data integration scenario is introduced: the
distributed join scenario (described in section 5.2). In addition to the data
copy scenario (introduced in section 3.1), these data integration scenarios
are implemented using both the client and service driven models.
5.1 Data copy
5.1.1 Client driven
In the client driven model, two perform documents need to be composed by the
client and sent to the corresponding GDSs in turn.
<gridDataServicePerform>
<sqlQueryStatement name="query1">
<expression>
select * from littleblackbook
</expression>
<webRowSetStream name="statementResponse"/>
</sqlQueryStatement>
<deliverToGDT name="delivery1">
<fromLocal from="statementResponse"/>
<toGDT streamId="inputStream1" mode="full">
someURL2
</toGDT>
</deliverToGDT>
</gridDataServicePerform>
Code 11 Complete perform document sent to the
source GDS in the client driven model
<gridDataServicePerform>
<inputStream name="inputStream1">
<toLocal name="dataSource"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad1">
<webRowSetStream from="dataSource"/>
<loadIntoTable tableName="littleblackbook"
transactionally="false"/>
<resultStream name="results"/>
</sqlBulkLoadRowSet>
</gridDataServicePerform>
Code 12 Complete perform document sent to the sink
GDS in the client driven model
The sink GDS needs to be set up beforehand to achieve this scenario using the
client driven model. Thus a thread needs to be created by the client, as the
perform operation in the OGSA-DAI client toolkit is synchronous.
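The need for the extra client thread can be illustrated with a minimal sketch in which a queue stands in for the GDT stream between the two GDSs. The function names here are hypothetical; the point is only that the blocking sink perform must be issued before the source perform delivers data:

```python
import threading
import queue

# A queue stands in for the GDT stream linking the source and sink GDSs.
channel = queue.Queue()

def sink_perform():
    """Blocks until data arrives, like the sink's inputStream activity."""
    return channel.get(timeout=5)

def source_perform():
    """Selects data and 'delivers' it to the sink, like deliverToGDT."""
    channel.put("webRowSet data")

# The sink perform blocks, so it must be issued first on its own thread;
# issuing both synchronously from one thread would deadlock the client.
result = {}
sink = threading.Thread(
    target=lambda: result.setdefault("rows", sink_perform()))
sink.start()
source_perform()   # the source perform can then run on the main thread
sink.join()
assert result["rows"] == "webRowSet data"
```

This is exactly the bookkeeping that the service driven model moves from the client into the coordinating GDS.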
5.1.2 Service driven
In the service driven model, only one perform document containing a flow
(flow1) activity (shown in Code 13) is needed. The above two perform
documents (Code 11 and Code 12) used in the client driven model are now
wrapped into two GDSActivities (gdsActivity1 and gdsActivity2) and added to
a flow activity (flow1).
<gridDataServicePerform>
<flow name="flow1">
<content>
<gdsActivity name="gdsActivity1">
<gdsf url="sinkURL"/>
<content>
<inputStream name="inputStream1">
<toLocal name="dataSource"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad1">
<webRowSetStream from="dataSource"/>
<loadIntoTable tableName="littleblackbook"
transactionally="false"/>
<resultStream name="results"/>
</sqlBulkLoadRowSet>
</content>
<output activity="bulkLoad1" name="results" type="0"/>
</gdsActivity>
<gdsActivity name="gdsActivity2">
<gdsf url="sourceURL"/>
<content>
<sqlQueryStatement name="query1">
<expression>
select * from littleblackbook
</expression>
<webRowSetStream name="statementResponse"/>
</sqlQueryStatement>
<deliverToGDT name="delivery1">
<fromLocal from="statementResponse"/>
<toGDT streamId="inputStream1" mode="full">
sinkURL
</toGDT>
</deliverToGDT>
</content>
</gdsActivity>
</content>
<output activity="gdsActivity1" name="results" type="0"/>
</flow>
</gridDataServicePerform>
Code 13 Complete perform document sent to a
coordinating GDS in the service driven model
Once the flow activity (flow1) is executed by the coordinating GDS, the two
GDSActivities contained in this document are run concurrently. Figure 37
schematically illustrates the data copy scenario using a service driven model.
Source
Source GDS
DR
2
1
Client
3
Coordinating GDS Sink
Sink GDS
DR
1. The client sends the perform document (Code 13) to the coordinating
GDS first.
2. The coordinating GDS sends a perform document containing the
GDSActivity (gdsActivity1) to the sink GDS.
3. The coordinating GDS sends a perform document containing the
GDSActivity (gdsActivity2) to the source GDS.
Steps 2 and 3 are executed in parallel by the coordinating GDS.
5.2 Distributed Join
A simple example, schematically illustrated in Figure 38, can be described as
follows. A data resource DS1 has a table called table1, which has two columns
name and price used to describe the name and price of a product respectively. A
data resource DS2 has a table called table2, which has two columns name and
itemNumber. Correspondingly, they represent the name and number of items
stored in a warehouse.
Figure 38 The distributed join scenario
A client needs to know the number of items as well as the name and price of
the products stored in a warehouse. To achieve this, three GDSs are required.
GDS1 and GDS2 handle requests for DS1 and DS2 respectively. GDS3 is used to
join the data coming from GDS1 and GDS2 and provide the result.
5.2.1 Client driven
Use of the client driven model to implement the distributed join scenario is
depicted in Figure 39. The solid arrow lines represent the control flow in this
scenario. The dashed arrow lines indicate the data flow.
Figure 39 The distributed join scenario implemented using the client driven
model
Five perform documents are required by the client driven model to implement
this scenario. The client needs to interact with each GDS individually.
1. The perform documents (described in Code 14 and Code 15) are sent to
GDS1 and GDS3 respectively. The data selected from GDS1 is delivered to
GDS3 through the GDT.
<gridDataServicePerform>
<sqlQueryStatement name="query1">
<expression>
select name as name1,price from table1
</expression>
<webRowSetStream name="statementResponse1"/>
</sqlQueryStatement>
<deliverToGDT name="deliverToGDT1">
<fromLocal from="statementResponse1"/>
<toGDT streamId="inputStream1" mode="full">
urlOfGDS3
</toGDT>
</deliverToGDT>
</gridDataServicePerform>
Code 14 A perform document sent to the GDS1 to
select data and deliver the data to GDS3
<gridDataServicePerform>
<inputStream name="inputStream1">
<toLocal name="dataSource1"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad1">
<webRowSetStream from="dataSource1"/>
<loadIntoTable tableName="table1" transactionally="false"/>
<resultStream name="results1"/>
</sqlBulkLoadRowSet>
</gridDataServicePerform>
Code 15 A perform document sent to the GDS3 to
insert the data delivered from GDS1
2. The perform documents (described in Code 16 and Code 17) are sent to
GDS2 and GDS3 respectively. The data selected from GDS2 is delivered to
GDS3 through the GDT.
<gridDataServicePerform>
<sqlQueryStatement name="query2">
<expression>
select name as name2,itemNumber from table2
</expression>
<webRowSetStream name="statementResponse2"/>
</sqlQueryStatement>
<deliverToGDT name="deliverToGDT2">
<fromLocal from="statementResponse2"/>
<toGDT streamId="inputStream2" mode="full">
urlOfGDS3
</toGDT>
</deliverToGDT>
</gridDataServicePerform>
Code 16 A perform document sent to the GDS2 to
select data and deliver the data to GDS3
<gridDataServicePerform>
<inputStream name="inputStream2">
<toLocal name="dataSource2"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad2">
<webRowSetStream from="dataSource2"/>
<loadIntoTable tableName="table1" transactionally="false"/>
<resultStream name="results2"/>
</sqlBulkLoadRowSet>
</gridDataServicePerform>
Code 17 A perform document sent to the GDS3 to
insert the data delivered from GDS2
3. Once steps 1 and 2 are completed, the client sends a perform document
(see Code 18) to GDS3 to perform a join select query and retrieve the
results.
<gridDataServicePerform>
<sqlQueryStatement name="query3">
<expression>
select name1 as name,itemNumber,price from table1 where
name1=name2
</expression>
<webRowSetStream name="statementResponse3"/>
</sqlQueryStatement>
</gridDataServicePerform>
Code 18 A perform document sent to the GDS3 to
perform a join select query and retrieve the result
5.2.2 Service driven
Use of the service driven model to implement the distributed join scenario is
illustrated in Figure 40. GDS3 is responsible for coordinating the two GDSs
(GDS1 and GDS2) and driving the data integration scenario.
Figure 40 The distributed join scenario implemented using the service driven
model
By combining flow and sequence activities, only one perform document, shown
in Code 19, is necessary for this scenario.
<gridDataServicePerform>
<sequence name="sequence1">
<content>
<flow name="flow1">
<content>
<flow name="flow2">
<content>
<gdsActivity name="gdsActivity1">
<gdsf url="urlOfGDS3"/>
<content>
<inputStream name="inputStream1">
<toLocal name="dataSource1"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad1">
<webRowSetStream from="dataSource1"/>
<loadIntoTable tableName="table1"
transactionally="false"/>
<resultStream name="results1"/>
</sqlBulkLoadRowSet>
</content>
<output activity="bulkLoad1" name="results1"
type="0"/>
</gdsActivity>
<gdsActivity name="gdsActivity2">
<gdsf url="urlOfGDS1"/>
<content>
<sqlQueryStatement name="query1">
<expression>
select name as name1,price from table1
</expression>
<webRowSetStream name="statementResponse1"/>
</sqlQueryStatement>
<deliverToGDT name="deliverToGDT1">
<fromLocal from="statementResponse1"/>
<toGDT streamId="inputStream1" mode="full">
urlOfGDS3
</toGDT>
</deliverToGDT>
</content>
</gdsActivity>
</content>
<output activity="gdsActivity1" name="results1"
type="0"/>
</flow>
<flow name="flow3">
<content>
<gdsActivity name="gdsActivity3">
<gdsf url="urlOfGDS3"/>
<content>
<inputStream name="inputStream2">
<toLocal name="dataSource2"/>
</inputStream>
<sqlBulkLoadRowSet name="bulkLoad2">
<webRowSetStream from="dataSource2"/>
<loadIntoTable tableName="table1"
transactionally="false"/>
<resultStream name="results2"/>
</sqlBulkLoadRowSet>
</content>
<output activity="bulkLoad2" name="results2"
type="0"/>
</gdsActivity>
<gdsActivity name="gdsActivity4">
<gdsf url="urlOfGDS2"/>
<content>
<sqlQueryStatement name="query2">
<expression>
select name as name2,itemNumber from table2
</expression>
<webRowSetStream name="statementResponse2"/>
</sqlQueryStatement>
<deliverToGDT name="deliverToGDT2">
<fromLocal from="statementResponse2"/>
<toGDT streamId="inputStream2" mode="full">
urlOfGDS3
</toGDT>
</deliverToGDT>
</content>
</gdsActivity>
</content>
<output activity="gdsActivity3" name="results2"
type="0"/>
</flow>
</content>
<output activity="flow2" name="results1" type="0"/>
<output activity="flow3" name="results2" type="0"/>
</flow>
<gdsActivity name="gdsActivity5">
<gdsf url="urlOfGDS3"/>
<content>
<sqlQueryStatement name="query3">
<expression>
select name1 as name,itemNumber,price from table1 where
name1=name2
</expression>
<webRowSetStream name="statementResponse3"/>
</sqlQueryStatement>
</content>
<output activity="query3" name="statementResponse3"
type="0"/>
</gdsActivity>
</content>
<output activity="flow1" name="results1" type="0"/>
<output activity="flow1" name="results2" type="0"/>
<output activity="gdsActivity5" name="statementResponse3"
type="0"/>
</sequence>
</gridDataServicePerform>
Code 19 A perform document used in the service
driven model to implement the distributed join scenario
• Two activities (flow1 and gdsActivity5) are contained in the sequence
activity (sequence1). They are executed by the coordinating GDS (GDS3) in
the order in which they are specified in the perform document.
• The flow activity (flow1) includes two other flow activities (flow2 and
flow3). Since no dependency is specified between the two flow activities,
they can be executed concurrently.
• The two sub flow activities (flow2 and flow3) are used to deliver data
selected from the source GDSs to the sink GDS. They can be regarded as two
instances of the data copy scenario described in section 5.1.2.
• Once the flow activity (flow1) is completed, the GDSActivity
(gdsActivity5) is executed to perform a join select query on GDS3, which
returns the results.
5.3 Benchmarking
To evaluate its performance, the service driven model was used to implement
the above two data integration scenarios (see sections 5.1.2 and 5.2.2).
Only the simple data copy scenario was benchmarked because of the limited
time available. The results are shown in Figure 41.
Figure 41 Mean time (ms) over 10 runs against data size (rows) for the
service driven, client driven I, client driven II, and direct approaches
Client driven I shows the performance of the scenario when implemented using
the deliverToGDT activity (introduced in section 3.3.1). Client driven II
indicates the performance of the scenario when implemented using a client
controlled data transfer approach (mentioned in section 3.1.2). The data copy
was made from a PostgreSQL database to a MySQL database. The Oracle
database was used as the temporary database in the service driven model.
The times required to perform the data copy scenario using the three
OGSA-DAI approaches (service driven, client driven I, and client driven II)
are quite similar. No significant overhead is introduced by this
implementation of the service driven model.
5.4 Summary
In the client driven model, the client needs to compose multiple perform
documents, in proportion to the complexity of the data integration scenario
(as has been shown: two for the data copy scenario and five for the
distributed join scenario), and interact with all the GDSs independently. In
addition, multiple threads need to be created by the client to enable these
perform operations to be executed asynchronously. The complexity of the
client in the client driven model will therefore increase as more data
resources are added to the scenario.
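Generalizing the two reported counts (two documents for the data copy, five for the distributed join), the client's composition burden can be estimated as follows. The formula is an extrapolation from these two scenarios only, not something stated by OGSA-DAI:

```python
def client_driven_docs(sources: int, needs_final_query: bool) -> int:
    """Perform documents the client composes in the client driven model:
    one source-side and one sink-side document per source GDS, plus one
    final document when a combining query is needed."""
    return 2 * sources + (1 if needs_final_query else 0)

def service_driven_docs(sources: int, needs_final_query: bool) -> int:
    # A single perform document is sent to the coordinating GDS.
    return 1

assert client_driven_docs(1, False) == 2   # data copy: two documents
assert client_driven_docs(2, True) == 5    # distributed join: five documents
assert service_driven_docs(2, True) == 1
```

Under this extrapolation, client-side effort grows linearly with the number of source GDSs, while the service driven model keeps it constant.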
In the service driven model, only one perform document is needed. By
specifying the control flow in the perform document, the coordinating GDS can
handle the whole data integration automatically. The major work on the client
side is to identify the control flow and data flow of the data integration
scenario and compose the perform document. Because the major tasks of the
data integration are embedded on the service side, the code complexity of the
client side remains low even when more data resources are introduced into
the scenario.
When more data resources are added to a data integration scenario, the
service driven model scales better than the client driven model. In the
direct approach and in the OGSA-DAI client controlled approach, the client
has to handle the data
transfer between two data resources explicitly. Thus, the complexity of code
increases as the number of participating data resources increases. The client is
the initiator and controller of the whole data integration process. Using the
service driven model, the data integration is managed by the OGSA‐DAI service
automatically according to the data flow and the control flow definition
specified in the perform documents.
6 Conclusions
In this work, the data integration capabilities of OGSA-DAI have been
evaluated based on the investigation of two data integration scenarios. Some
potential advantages and disadvantages of using OGSA-DAI for data
integration have been presented.
The results of the benchmarking have shown that OGSA-DAI performs 3 to 13
times slower in the simple data copy scenario than the direct approach
implemented using JDBC. The blockAggregator activity was examined in more
detail to better understand its behaviour. The result shows that the main
overhead introduced by OGSA-DAI in the simple data copy scenario comes from
the GDT when transferring data.
Some bugs and flaws were detected during these experiments. It was found that
OGSA-DAI does not fully support the five databases it officially claims to
support. Robustness is another problem of OGSA-DAI: when the data size is
larger than 100,000 rows, the data copy scenario implemented using OGSA-DAI
fails.
Performance and robustness are regarded as the two factors of greatest
concern and require attention in future OGSA-DAI releases. On the other hand,
the performance can be acceptable when the data sizes involved in a data
integration scenario are not very large; the client can still retrieve the
data in an acceptable time.
However, compared to a direct approach, OGSA-DAI does ease data integration
through the provision of functionality such as the uniform access interface,
the GDT, and data flow support.
Furthermore, a preliminary approach that aims to improve the data integration
capabilities of OGSA‐DAI was introduced and discussed in this work. The
service‐driven model proposed in this document would allow OGSA‐DAI to
orchestrate multiple OGSA‐DAI services.
The GDSActivity introduced here allows a GDS to use and control other GDSs.
This is a complement to the GDT that could be supported in future OGSA‐DAI
releases.
The sequence and flow activities allow OGSA-DAI to support control flow. The
expectation is that such control flow features would be better supported by
the GDS engine instead of taking the form of activities. For example, a new
<sequence> element may be added to perform documents to tell the OGSA-DAI
engine that the activities contained in this element should be executed
sequentially.
By combining GDSActivities with sequence and flow activities, a GDS can
orchestrate multiple GDSs to accomplish a data integration operation
automatically, according to the description specified by the clients. The
client is freed from explicitly creating GDSs and transferring data through
the OGSA-DAI framework.
A further evaluation could compare the service driven model with other
service approaches using the scenarios described in this work. However, this
has not been done as there was not enough time; it should be investigated in
the future.
Performance and robustness are the two most important issues that need to be
addressed in future OGSA-DAI releases.
OGSA-DAI needs a better framework to manage the resources allocated when a
GDS processes a perform document. A resource pool might be a better
replacement for the resource management framework currently adopted in
OGSA-DAI.
OGSA-DAI needs to support a bulk load function for XML databases to
facilitate data integration.
It is observed that data flow is one of the most important factors in data
integration. However, data flow is not sufficiently supported in the current
OGSA-DAI release. It is hoped that in a future OGSA-DAI release, the complete
data flow of a complex data integration scenario involving multiple data
resources could be specified in the perform document and handled by the GDSs
cooperatively.
Other work that might be considered to improve the data integration
capabilities of OGSA‐DAI include:
• User-definable exception handling would allow more complex workflow
patterns to be expressed within a perform document. Furthermore, this could
allow exceptions to be handled independently of the client's control.
• Reuse of activity and process definitions, allowing more rapid development
of data integration patterns, akin to using stored procedures.
• At present, two different query languages are used to query different
types of data resources: SQL to query relational databases and XPath to
query XML databases. A uniform query language is regarded as an important
part of the uniform access interface provided by OGSA-DAI. OGSA-DAI is
expected to adopt XQuery [34] as its query language in the future, as XQuery
is going to be supported by both relational databases and XML databases.
• A higher level workflow framework could describe the relationships between
GDSs and data consumers, how they cooperate with each other, and how data
transportation is defined. This would also allow better interoperability
with other services, e.g. job submission services.
• Embeddable code within perform documents could help reduce data transfer
and increase the scalability of OGSA-DAI. An approach similar to the Java
code snippets described in BPELJ [35] could be taken, with a code sandbox
within the GDS allowing access to activities running on the service.
• More complex data integration scenarios require OGSA‐DAI users to
compose more complex perform documents. A GUI tool with
drag‐and‐drop functionality could facilitate the composition of perform
documents on the client side.
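To illustrate the embeddable‐code idea above, a perform document might carry a short Java snippet in a hypothetical <javaSnippet> activity. The element name, its attributes, and the sandbox contract sketched here are illustrative assumptions, not part of any OGSA‐DAI release:

```xml
<gridDataServicePerform>
  <!-- a normal query activity producing a WebRowSet stream -->
  <sqlQueryStatement name="statement">
    <expression>SELECT id, total FROM orders</expression>
    <webRowSetStream name="rows"/>
  </sqlQueryStatement>
  <!-- hypothetical activity: filters rows inside a sandbox on the GDS,
       so only matching rows are transferred back to the client -->
  <javaSnippet name="filter" input="rows" output="filteredRows">
    <code><![CDATA[
      if (row.getDouble("total") > 1000.0) { emit(row); }
    ]]></code>
  </javaSnippet>
</gridDataServicePerform>
```

The point of the sketch is that the filtering predicate executes next to the data, rather than after a full result set has been shipped to the client.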
Appendix A
1 XML SCHEMAS
1.1 grid_data_service_types_ext.xsd
The XML schema shown in Code 20 extends the grid_data_service_types.xsd
provided by OGSA‐DAI and defines a new type CompositeActivityType.
<xsd:schema
targetNamespace="http://ogsadai.org.uk/namespaces/2003/07/gds/
types"
xmlns:tns="http://ogsadai.org.uk/namespaces/2003/07/gds/types"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
<xsd:include schemaLocation="grid_data_service_types.xsd"/>
1.2 gds_activity.xsd
The GDSActivityType extends the CompositeActivityType defined in Code 20.
Clients can use either a <gdsf> or a <gds> element to identify the coordinating
GDS. The <content> element is a container for all sub‐activities contained
in the GDSActivity. A GDSActivityType has zero or more <input>
elements and one or more <output> elements. The XML
schema is shown in Code 21.
<xsd:include
schemaLocation="../../types/grid_data_service_types_ext.xsd"
/>
<xsd:complexType name="GDSActivityType">
<xsd:complexContent>
<xsd:extension base="gdstypes:CompositeActivityType">
<xsd:sequence>
<!--url of target GDSF or GDS-->
<xsd:choice>
<xsd:element name="gdsf" type="tns:urlType"/>
<xsd:element name="gds" type="tns:urlType"/>
</xsd:choice>
<xsd:element name="input" minOccurs="0"
maxOccurs="unbounded">
<xsd:complexType mixed="true">
<xsd:complexContent>
<xsd:extension base="gdstypes:ActivityInputType">
<xsd:attribute name="activity" type="xsd:string"
use="required"/>
<xsd:attribute name="type" type="xsd:integer"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
<xsd:element name="output" maxOccurs="unbounded">
<xsd:complexType mixed="true">
<xsd:complexContent>
<xsd:extension base="gdstypes:ActivityOutputType">
<xsd:attribute name="activity" type="xsd:string"
use="required"/>
<xsd:attribute name="type" type="xsd:integer"
use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="urlType">
<xsd:attribute name="url" type="xsd:string" use="required"/>
</xsd:complexType>
<!-- Define the name the activity will take on in the perform
documents -->
<xsd:element name="gdsActivity" type="tns:GDSActivityType"
substitutionGroup="gdstypes:compositeActivity"/>
</xsd:schema>
Code 21 gds_activity.xsd
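A perform document fragment using the gdsActivity defined above might look like the following sketch; the service URL, the sqlQueryStatement sub‐activity, and the attribute values are illustrative assumptions rather than part of the schema itself:

```xml
<gdsActivity name="remoteQuery">
  <!-- the coordinating GDS that will execute the nested activities -->
  <gds url="http://example.host:8080/ogsa/services/ogsadai/GDS1"/>
  <content>
    <sqlQueryStatement name="statement">
      <expression>SELECT * FROM customers</expression>
      <webRowSetStream name="rows"/>
    </sqlQueryStatement>
  </content>
  <!-- expose the sub-activity's output stream to the enclosing workflow -->
  <output name="rows" activity="statement" type="0"/>
</gdsActivity>
```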
1.3 sequence_activity.xsd
Code 22 presents the XML schema of the sequence activity. A
SequenceActivityType is defined in this schema. It can have zero or more
<input> elements and one or more <output> elements. The <content> element
can contain one or more CompositeActivities.
<xsd:include
schemaLocation="../../types/grid_data_service_types_ext.xsd"
/>
<xsd:complexType name="SequenceActivityType">
<xsd:complexContent>
<xsd:extension base="gdstypes:CompositeActivityType">
<xsd:sequence>
<!--url of target GDSF or GDS-->
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<!-- Define the name the activity will take on in the perform
documents -->
<xsd:element name="sequence" type="tns:SequenceActivityType"
substitutionGroup="gdstypes:compositeActivity"/>
</xsd:schema>
Code 22 sequence_activity.xsd
1.4 flow_activity.xsd
The XML schema of the flow activity is described in Code 23. The flow activity
has the same structure as the sequence activity.
<xsd:include
schemaLocation="../../types/grid_data_service_types_ext.xsd
"/>
<xsd:complexType name="FlowActivityType">
<xsd:complexContent>
<xsd:extension base="gdstypes:CompositeActivityType">
<xsd:sequence>
<!--url of target GDSF or GDS-->
<xsd:element name="input" minOccurs="0"
maxOccurs="unbounded">
<xsd:complexType mixed="true">
<xsd:complexContent>
<xsd:extension base="gdstypes:ActivityInputType">
<xsd:attribute name="activity" type="xsd:string"
use="required"/>
<xsd:attribute name="type" type="xsd:integer"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
<xsd:element name="output" maxOccurs="unbounded">
<xsd:complexType mixed="true">
<xsd:complexContent>
<xsd:extension base="gdstypes:ActivityOutputType">
<xsd:attribute name="activity" type="xsd:string"
use="required"/>
<xsd:attribute name="type" type="xsd:integer"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<!-- Define the name the activity will take on in the perform
documents -->
<xsd:element name="flow" type="tns:FlowActivityType"
substitutionGroup="gdstypes:compositeActivity"/>
</xsd:schema>
Code 23 flow_activity.xsd
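Putting the three composite activities together, a data integration workflow might be sketched as follows. Assuming flow executes its children concurrently and sequence executes them in order (as in BPEL), the two remote queries run in parallel before their results are combined; all names and URLs below are illustrative assumptions:

```xml
<flow name="integrate">
  <content>
    <!-- the two branches may execute concurrently -->
    <gdsActivity name="queryA">
      <gds url="http://hostA:8080/ogsa/services/ogsadai/GDSA"/>
      <content><!-- sub-activities querying data resource A --></content>
      <output name="rowsA" activity="queryA" type="0"/>
    </gdsActivity>
    <gdsActivity name="queryB">
      <gds url="http://hostB:8080/ogsa/services/ogsadai/GDSB"/>
      <content><!-- sub-activities querying data resource B --></content>
      <output name="rowsB" activity="queryB" type="0"/>
    </gdsActivity>
    <!-- an activity joining rowsA and rowsB would follow here -->
  </content>
</flow>
```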
References
[1] Ian Foster, Carl Kesselman, The Grid: Blueprint for a New Computing
Infrastructure (Second Edition) 2004, Morgan Kaufmann
[2] OGSA‐DAI http://www.ogsadai.org.uk
[3] Tuecke, S., Czajkowski, K., Foster, I., Frey, J., Graham, S., Kesselman, C.
and Nick, J., Open Grid Service Infrastructure.
http://www.gridforum.org/ogsi‐wg/.
[4] The XML:DB Project, "XML:DB Database API Working Draft", Technical
report, 2001. http://xmldb‐org.sourceforge.net/
[5] OGSA‐DAI projects list, http://www.ogsadai.org.uk/projects/
[6] Mario Antonioletti, et al., OGSA‐DAI Usage Scenarios and Behaviour:
Determining Good Practice, All Hands Meeting 2004
[7] M. Lenzerini, Data Integration: A Theoretical Perspective,
Proceedings of the 21st ACM SIGACT‐SIGMOD‐SIGART Symposium on
Principles of Database Systems (PODS 2002), Madison, WI, USA, 2002
[8] Z. Ives, D. Florescu, M. Friedman, A. Levy, D. S. Weld. An Adaptive Query
Execution System for Data Integration. To appear in ACM SIGMOD Conf.,
Philadelphia, PA, 1999.
[9] I. Foster, C. Kesselman, J. Nick, S. Tuecke. The Physiology of the Grid: An
Open Grid Services Architecture for Distributed Systems Integration. Open
Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002.
http://forge.gridforum.org/projects/ogsa‐wg
[10] Tim Bray, Jean Paoli, and C. M. Sperberg‐McQueen, Extensible Markup
Language (XML) 1.0 (Third Edition), W3C Recommendation, 4 February
2004. http://www.w3.org/TR/REC‐xml
[11] JDBC: Java Database Connectivity, http://java.sun.com/products/jdbc/
[12] Apache Log4J, http://logging.apache.org/log4j/
[13] Gamma, E., Helm, R., Johnson, R., Vlissides, J., Design Patterns,
Addison‐Wesley, 1995
[14] James Clark (ed), XSL Transformations (Working Draft), W3C,
April 1999. http://www.w3.org/TR/WD‐xslt
[15] Foster, I., Kesselman, C., and Tuecke, S., The Anatomy of the Grid: Enabling
Scalable Virtual Organizations, International Journal of Supercomputer
Applications 15(3), 200‐222, 2001
[16] Web Services, http://www.w3.org/2002/ws/
[17] Web Services Glossary, http://www.w3.org/TR/ws‐gloss/
[18] WSDL: Web Services Description Language (WSDL) 1.1,
http://www.w3.org/TR/wsdl
[19] Berners‐Lee, T., Masinter, L., and McCahill, M. (eds), "Uniform Resource
Locators (URL)", RFC 1738, CERN, Xerox Corporation, University of
Minnesota, December 1994
[20] Globus project, http://www.globus.org
[21] Java programming language, http://java.sun.com
[22] DAIS working group, https://forge.gridforum.org/projects/dais‐wg
[23] Neil P. Chue Hong, Amy Krause, Simon Laws, Susan Malaika, Gavin
McCance, James Magowan, Norman W. Paton and Greg Riccardi, Grid
Database Service Specification, GGF7, 2003.
http://www.cs.man.ac.uk/grid‐db/papers/DAIS_GGF7StatementSpec.pdf
[24] Date, C., Darwen, H., A Guide to the SQL Standard, Addison‐Wesley, 4th
edition, 1997
[25] XPath: XML Path Language (Version 1.0),
http://www.w3.org/TR/xpath
[26] The WebRowSet XML Schema definition,
http://java.sun.com/xml/ns/jdbc/webrowset.xsd
[27] Ian Foster, Carl Kesselman, "Concepts and Architecture", The Grid:
Blueprint for a New Computing Infrastructure (Second Edition), Morgan
Kaufmann, 2004, p. 46
[28] Global Grid Forum, http://www.ggf.org/
[29] Ian Foster, Carl Kesselman and Steven Tuecke, "The Open Grid Services
Architecture", The Grid: Blueprint for a New Computing Infrastructure
(Second Edition), Morgan Kaufmann, 2004, p. 221
[30] SOAP: Simple Object Access Protocol (SOAP) 1.1,
http://www.w3.org/TR/2000/NOTE‐SOAP‐20000508/
[31] UDDI: Universal Description, Discovery and Integration,
2004. http://www.uddi.org
[32] Blair, G.S., Coulson, G., Robin, P. and Papathomas, M., An Architecture for
Next Generation Middleware, Proc. Middleware '98, The Lake District,
England, November 1998.
http://citeseer.ist.psu.edu/blair98architecture.html
[33] Mario Antonioletti, et al., OGSA‐DAI: Two Years On
[34] XQuery: An XML Query Language, http://www.w3.org/TR/xquery/
[35] BPELJ: BPEL for Java Technology Whitepaper,
http://www‐106.ibm.com/developerworks/webservices/library/ws‐bpelj/
[36] Fallside, D.C., XML Schema Part 0: Primer, W3C Recommendation, 2 May
2001. http://www.w3.org/TR/xmlschema‐0
[37] Biron, P.V., Malhotra, A., XML Schema Part 2: Datatypes, W3C
Recommendation, 2 May 2001. http://www.w3.org/TR/xmlschema‐2
[38] K. Czajkowski, D. Ferguson, I. Foster, J. Frey, S. Graham, T. Maguire, D.
Snelling, S. Tuecke, From Open Grid Services Infrastructure to WS‐Resource
Framework: Refactoring & Evolution, March 5, 2004
[39] Todd Hodes and Randy Katz, A Document‐based Framework for Internet
Application Control, 2nd USENIX Symposium on Internet Technologies
and Systems, October 1999
[40] The Large Hadron Collider project,
http://lhc‐new‐homepage.web.cern.ch/lhc‐new‐homepage/
[41] Ian Foster, Carl Kesselman and Steven Tuecke, "Predictive Maintenance:
Distributed Aircraft Engine Diagnostics", The Grid: Blueprint for a New
Computing Infrastructure (Second Edition), Morgan Kaufmann, 2004, p. 69
[42] WSRF: WS‐Resource Framework, http://www.globus.org/wsrf/
[43] DOM: Document Object Model, http://www.w3.org/DOM/