
Online System for Grid Resource Monitoring and Machine Learning-Based Prediction
Liang Hu, Xi-Long Che, Member, IEEE, and Si-Qing Zheng, Senior Member, IEEE
Abstract: Resource allocation and job scheduling are the core functions of grid computing. These functions are based on adequate information about available resources. Timely acquisition of resource status information is of great importance for ensuring the overall performance of grid computing. This work aims at building a distributed system for grid resource monitoring and prediction. In this paper, we present the design and evaluation of a system architecture for grid resource monitoring and prediction. We discuss the key issues for system implementation, including machine learning-based methodologies for modeling and optimization of resource prediction models. Evaluations are performed on a prototype system. Our experimental results indicate that the efficiency and accuracy of our system meet the demands of an online system for grid resource monitoring and prediction.

Index Terms: Grid resource, monitoring and prediction, neural network, support vector machine, genetic algorithm, particle swarm optimization.

L. Hu and X.-L. Che are with the College of Computer Science and Technology, Jilin University, No. 2699, QianJin Street, Changchun 130012, China. E-mail: {hul, chexilong}@jlu.edu.cn. S.-Q. Zheng is with the Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083. E-mail: sizheng@utdallas.edu.

Manuscript received 16 May 2010; revised 4 Jan. 2011; accepted 15 Feb. 2011; published online 17 Mar. 2011. Recommended for acceptance by K. Li. For information on obtaining reprints of this article, please send e-mail to tpds@computer.org, and reference IEEECS Log Number TPDS-2010-05-0295. Digital Object Identifier no. 10.1109/TPDS.2011.108.

1 INTRODUCTION
Grid computing removes the limitations that exist in traditional shared computing environments, and has become a leading trend in distributed computing systems. It aggregates heterogeneous resources distributed across the Internet, regardless of differences between resources such as platform, hardware, software, architecture, language, and geographical location. Such resources, which include computing, storage, data, and communication bandwidth resources, among others, are combined dynamically to form a high-performance computing capability for solving problems in large-scale applications. Dynamically sharing resources gives rise to resource contention. One of the challenging problems is deciding the destination nodes where the tasks of a grid application are to be executed. From the perspective of system architecture, resource allocation and job scheduling are the most crucial functions of grid computing. These functions are based on adequate information about available resources. Thus, timely acquisition of resource status information is of great importance for ensuring the overall performance of grid computing [1].

There are mainly two mechanisms for acquiring information about grid resources: grid resource monitoring and grid resource prediction. Grid resource state monitoring cares about the running state, distribution, load, and malfunction of resources in a grid system by means of monitoring strategies. Grid resource state prediction focuses on the variation trend and running track of resources in a grid system by means of modeling and analyzing historical monitoring data. Historical information generated by monitoring and future variation generated by prediction are combined to feed a grid system for analyzing performance, eliminating bottlenecks, diagnosing faults, and maintaining dynamic load balancing, thus helping grid users obtain the desired computing results by efficiently utilizing system resources in terms of minimized cost, maximized performance, or trade-offs between cost and performance. To reduce overhead, the goal of designing a grid resource monitoring and prediction system is to achieve seamless fusion between grid technologies and efficient resource monitoring and prediction strategies.

Resource monitoring is a basic function in most computing systems. Along with grid development, monitoring tools have been evolving to support grid computing, such as those developed in the PAPI project [2], the Iperf project [3], the Hawkeye project [4], and the Ganglia project [5]. In addition, some projects have designed a distributed monitoring module of their own, such as the Grid Monitoring Architecture (GMA) project [6] and the Autopilot project [7]. The monitoring techniques employed by such projects are partly compatible with the grid environment and, thus, fit for achieving grid resource monitoring. Resource monitoring alone, however, can only support instantaneous resource information acquisition. It cannot generalize the dynamic variation of resources. Resource state prediction is indispensable to fill this gap. Typical previous prediction systems, such as NWS [8] and RPS [9], can provide both monitoring and prediction functions. However, dynamic features of grid resources were not taken into consideration in these design frameworks. Some efforts, like the Collectors of Resource Information (CORI) project [10] and Dinda's research [11], were devoted to integrating a prediction tool into a system as a patching component, but the integration of component systems was realized by building a message passing interface. Grid middlewares, such as the ATOP-Grid (Adaptive Time/Space Sharing through Over Partitioning) project [12] and the Grid Harvest Service (GHS) project [13], were developed to include a prediction component. Nevertheless, these projects are usually restricted to certain applications.




In summary, previous approaches have the limitation of being unable to achieve seamless fusion of various components and overall simplification of the system structure using a universal scheme.

This paper reports our effort aimed at building a distributed system for grid resource monitoring and prediction. We first outline the main design principles of our system. Then, we present our overall system architecture design, which seamlessly integrates various cooperating components to achieve high performance. We discuss the key issues in machine learning-based prediction, and justify our decisions by comparative studies through extensive simulations. We present a new optimization algorithm called Parallel Hybrid Particle Swarm Optimization (PH-PSO), and show its effectiveness. We then discuss the implementation and performance evaluation of a prototype system.

The rest of this paper is organized as follows: Section 2 gives the problem statement of grid resource monitoring and prediction. Section 3 provides the overall system architecture based on the design principles defined. Section 4 discusses the key issues in building the prediction components. Section 5 gives a description of the proposed optimization algorithm. Section 6 explains the prototype system and evaluates its performance and overhead. Section 7 closes the paper with conclusions as well as directions for future work.

2 PROBLEM STATEMENT

Suppose that a computing grid consists of n nodes. Without loss of generality, assume that each node i has k resource elements r_{i,j}, 1 ≤ j ≤ k, 1 ≤ i ≤ n, which could be host load, bandwidth/latency to a certain destination, available memory, etc. The state of r_{i,j} at time t is denoted by s_{i,j}(t). The state GS(t) of the entire grid is represented by a matrix as follows:

GS(t) = \begin{bmatrix} s_{1,1}(t) & \cdots & s_{1,k}(t) \\ \vdots & \ddots & \vdots \\ s_{n,1}(t) & \cdots & s_{n,k}(t) \end{bmatrix}. \tag{1}

The monitoring and prediction of grid resources are realized by monitoring and predicting the state of each concerned resource element in the matrix GS(t). A program that generates resource performance data s_{i,j}(t) with timestamps is called a resource sensor. A program that predicts resource performance data s_{i,j}(t) with timestamps is called a prediction model. Let S = {s(1), s(2), ..., s(t)} represent the history set generated by a resource sensor, and S' = {s(t+1), s(t+2), ...} represent the future set generated by a prediction model. Then, any mapping from S to S' is a prediction function, and grid resource state prediction is a kind of regression procedure [14]. What should be emphasized in our research is that we focus on multi-step-ahead instead of one-step-ahead prediction, as in

f : s(t+q) = f\bigl(s(t), s(t-1), s(t-2), \ldots, s(t-m+1)\bigr), \tag{2}

where m is the input feature number of the prediction model and q is the number of steps ahead.

The prediction pattern is schematically shown in Fig. 1. A historical data set is divided into three parts: training, validation, and test sets. The training set is used to build the prediction model, which is optimized using the validation set and evaluated using the test set. The model takes historical data as input and generates predictions of future variation.

Fig. 1. Prediction pattern.

Our research goal is to design a distributed system that seamlessly integrates various cooperating components to achieve high performance in grid resource monitoring and prediction.
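To make the multi-step-ahead notation concrete, the following small instance of (2) is our own illustration; the values m = 3 and q = 2 are chosen only for the example and do not come from the paper:

\hat{s}(t+2) = f\bigl(s(t),\, s(t-1),\, s(t-2)\bigr),

that is, the three most recent observations form one input sample whose target lies two steps ahead; sliding this window along the history produces the sample set used later in Section 4.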

3 SYSTEM ARCHITECTURE DESIGN

In this section, we present the architecture of our grid resource monitoring and prediction system. We first introduce the principles used in our design. Then, we illustrate the service distribution and work flow of our system in detail.

3.1 System Design Principles

A computing grid is a complex distributed system. Embedded within the grid, its resource monitoring and prediction system is also a distributed system that dynamically processes grid resource state information. In what follows, we enumerate several features of such a system. Attaining each of these features serves as a design principle for our system architecture.

Responsiveness and robustness. Since grid resource states vary dynamically, the information monitored or predicted has to be updated in a timely manner to guarantee an online reflection of resource conditions. In our system, resource sensors and prediction models are periodically executed to generate up-to-date information for users. Function independence and a starlike distribution are introduced in our system; thus, the monitoring and prediction components can work well even if some of the nodes are down.

Modularity and extensibility. Modularity and extensibility are closely related to each other. To achieve tight cohesion as well as loose coupling, the components embedded in our system have independent functions and can be integrated into most grid computing environments as an independent subsystem. Besides, information generated or passed is designed in XML format in order to support new resource types and to interact with other components.

Efficiency. Executing jobs is the fundamental function of a grid system, so embedded monitoring or prediction components should minimize overhead to guarantee the grid's normal service. We deploy resource sensors on the computing nodes since this is inevitable, and they run and sleep dynamically to reduce overhead, while we deploy the other components outside the computing nodes to avoid extra overhead.

Transparency. Grid users should not need to traverse all the nodes or possess grid expertise to get information. We design a uniform and friendly interface component for accessing the information monitored or predicted.

3.2 Service Distribution


Fig. 2. Overall system structure designed.

Considering the heterogeneous and dynamic characteristics of a computing grid, the resource monitoring and prediction system has a distributed service structure. Based on the intended system features, we propose to build the whole system from two subsystems: a resource monitoring subsystem inside the computing environment, and a resource state prediction subsystem outside the computing environment. Most computing grid systems maintain a service container for taking grid jobs; such a container should be reused for seamless fusion between a grid environment and our system. Therefore, we design a series of supporting services: monitoring service, prediction service, evaluation service, and information service. These services are deployed in the service containers of distributed nodes, and all the functions are realized through dynamic collaboration among them. Besides, the resource information is managed using a hierarchical structure. Fig. 2 presents the overall structure designed. In what follows, we explain each service in detail.

Monitoring service. The monitoring service is deployed on each computing resource node. It manages resource sensors and generates resource monitoring data. Following a monitoring request customized by a grid user, the monitoring service enables or disables a certain resource sensor dynamically.

Prediction service and evaluation service. In order to ensure the responsiveness and robustness of the prediction subsystem, a symmetrical starlike structure is adopted. A prediction service and an evaluation service are deployed on each prediction node. Corresponding to a prediction request customized by a grid user, one prediction service takes charge of the whole prediction procedure and manages the resource prediction models, and then all the evaluation services work as the prediction service's assistants for evaluating the accuracy and efficiency of candidate models.

Information service. The information service is deployed on the information node; it interacts with grid users and handles the storage, query, and publication of resource state information. Two types of mechanism are defined for information acquisition: local register and group register. The local register collects information in a timely manner from resource sensors to the monitoring service and from prediction models to the prediction service, while the group register collects information from both services and aggregates it for storage or publication. In order to provide a friendly interface to grid users, a web server is set up on the information node for customizing requests and publishing information. Therefore, a grid user needs nothing but a browser. In our system, information generated or passed is designed in XML format in order to support new resource types and to interact with other components. Fig. 3 shows monitoring information generated by a resource sensor, as an excerpt of a sample XML document.

Fig. 3. A sample XML document.

3.3 System Work Flow


Fig. 4. Sequence diagram of monitoring and prediction work flow.

Monitoring is the precondition of prediction; thus, we enclose the monitoring work flow within the prediction work flow for a more compact description. The sequence diagram of the system work flow is illustrated in Fig. 4. In what follows, we describe it in detail.

1. The grid user logs on to the information node and customizes three terms before sending a monitoring/prediction request: which node, which resource type, and how long the prediction will last. A prediction request is then created accordingly and sent to the information service.
2. The information service launches the monitoring work flow by sending a monitoring request to the monitoring service; a resource sensor is activated as requested, and timely updated monitoring data are then sent back to the information service through the local and group registers; historical records are stored in the database for achieving prediction.
3. The information service chooses a prediction service and sends a customized prediction request. The chosen prediction service then takes charge of the whole prediction procedure.
4. The prediction service acquires historical monitoring data from the information service, and builds a set of candidate prediction models of different types and parameters. It combines each candidate model with the historical data as an evaluation subtask.
5. The prediction service sends the subtasks to evaluation services for feedback, and then fixes the model with the best performance based on a comparison of the evaluation results. If an evaluation service times out, the prediction service redirects the subtask to another one.
6. The prediction service feeds the fixed prediction model with timely updated monitoring data for prediction, and then timely updated prediction data are sent to the information service through the local and group registers; historical prediction records are stored in the database for checking the prediction error.
7. The grid user gets the resource information monitored or predicted from a browser. Upon request, or if the prediction error exceeds a certain threshold, the prediction service reloads the latest historical data and goes over step 4 again for model optimization.
8. The information service terminates the monitoring or prediction procedure when the customized time is used up.

4 MACHINE LEARNING-BASED PREDICTION

In this section, we discuss key issues in realizing resource prediction using machine learning strategies. First, we propose a universal procedure for building and optimizing a prediction model. Then, we conduct comparative studies and make decisions on selecting appropriate strategies for building the prediction components.


4.1 Universal Procedure

This research focuses on q-step-ahead prediction; its equation was defined previously in (2), and we augment it as

f : s(t+q) = f\bigl(s(t), s(t-1), s(t-2), \ldots, s(t-m+1)\bigr) \;\Rightarrow\; f : y = f(x_1, x_2, \ldots, x_{m-1}, x_m), \quad \text{s.t. } y = s(t+q), \; x_i = s(t-i+1), \; i = 1, \ldots, m. \tag{3}

This equation leads directly to a basic prediction model, as shown in Fig. 5.

Fig. 5. Basic prediction model.

The historical time series S = {s(1), s(2), ..., s(t)} is a spot set; it cannot feed such a model directly. Therefore, we transform it into a sample set based on overlapped segmentation of the time series. Table 1 shows the transformation results.
TABLE 1 Sample Set for a Prediction Model
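As a concrete illustration of the overlapped segmentation behind Table 1, the sketch below is our own code, not taken from the paper; the class and method names are hypothetical. It turns a raw time series s(1), ..., s(t) into (input, target) samples for m input features and a q-step-ahead target:

```java
/** Minimal sketch of overlapped segmentation of a time series into samples. */
public final class SampleSetBuilder {

    /** One training sample: m lagged inputs and the q-step-ahead target. */
    public static final class Sample {
        public final double[] inputs;  // s(i), s(i-1), ..., s(i-m+1)
        public final double target;    // s(i+q)
        Sample(double[] inputs, double target) {
            this.inputs = inputs;
            this.target = target;
        }
    }

    /** Slides a window of width m over the series; consecutive windows overlap by m-1 points. */
    public static java.util.List<Sample> build(double[] series, int m, int q) {
        java.util.List<Sample> samples = new java.util.ArrayList<>();
        // i is the index of the most recent observation used as input.
        for (int i = m - 1; i + q < series.length; i++) {
            double[] x = new double[m];
            for (int j = 0; j < m; j++) {
                x[j] = series[i - j];                      // x_{j+1} = s(i - j)
            }
            samples.add(new Sample(x, series[i + q]));     // target y = s(i + q)
        }
        return samples;
    }
}
```

For example, a series of 200 points with m = 10 and q = 1 yields 190 overlapping samples, which can then be split into training, validation, and test sets as described in Section 4.2.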

Resource state prediction should work and evolve in a self-learning way; therefore, machine learning strategies are applicable for achieving automodeling and auto-optimization of prediction models. We propose a universal procedure for this purpose, with its structure given in Fig. 6.

Fig. 6. Universal procedure for resource prediction.

1. Fix a machine learning algorithm for the prediction model, and set its default hyperparameters.
2. Separate the sample set into three parts: training set, validation set, and test set.
3. Feed the learning algorithm with the samples of the training set one by one until all samples are used. For some algorithms, the training procedure runs only once; for others, iterations are needed.
4. Feed the trained model with all samples of the validation set, and record the errors between the true data and the predicted values. Fix an optimization algorithm, which evolves the hyperparameters of the prediction model toward better fitness (performance).
5. When a termination condition is met, the optimized prediction model is obtained and then tested using the test set.

4.2 Experimental Setups

Considering that a computing grid is loosely distributed in the Internet environment, the host load of a computing node on the Internet and the bandwidth between two nodes across the Internet are the most representative resource information that needs to be monitored and predicted. Moreover, we prefer using public data rather than historical data generated by ourselves, for the purpose of giving comparable and reproducible results. For the available bandwidth data set, we believe that the data set iepm-bw.bnl.gov.iperf2 [24] can reflect the true variation between two nodes across the Internet. For the host load data set, we choose mystere10000.dat [25], a trace of a workstation node, for the reason that a workstation is a most typical computing node. After transformation of the original data as in Table 1, the latest 200 samples were sequentially chosen to form the experiment data set. The data set was then divided into training set, validation set, and test set, with a proportion of 100:50:50. Summary statistics for the data sets are listed in Table 2.

The experiments were run on a single Intel Pentium IV 3.0 GHz CPU under the Fedora Core Linux 9.0 system; all the algorithms are coded in Java. We recorded the training CPU time to measure efficiency, and used the mean absolute error (MAE) to measure accuracy, as in (4), where l counts the number of samples, and s(t+q) and ŝ(t+q) denote the true value and the predicted value, respectively.

MAE = \frac{1}{l} \sum_{i=1}^{l} \bigl| s(t+q) - \hat{s}(t+q) \bigr|. \tag{4}
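For reference, (4) transcribes directly into code; the following is a sketch under our own naming, as the paper does not give this routine:

```java
/** Mean absolute error between true values and predictions, as in (4). */
public static double meanAbsoluteError(double[] trueValues, double[] predicted) {
    if (trueValues.length != predicted.length || trueValues.length == 0) {
        throw new IllegalArgumentException("arrays must be non-empty and of equal length");
    }
    double sum = 0.0;
    for (int i = 0; i < trueValues.length; i++) {
        sum += Math.abs(trueValues[i] - predicted[i]);   // |s(t+q) - s_hat(t+q)|
    }
    return sum / trueValues.length;                      // divide by l
}
```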


TABLE 2 Statistics of Data Sets


TABLE 3 Parameters for ANNs

TABLE 4 Parameters for SVRs

4.3 Comparative Study on Modeling Methods

Artificial neural network (ANN) and support vector machine (SVM) are two typical machine learning strategies in the category of regression computation. These two methods can be employed for modeling resource state prediction. ANN is a powerful tool for self-learning, and it can generalize the characteristics of resource variations through proper training. ANN is inherently a distributed architecture with high robustness. It is suitable for multi-information fusion, and competent for quantitative and qualitative analysis. ANNs have been used in resource state prediction in the past. It was indicated in [15] that ANN prediction outperforms the NWS methods [8]. However, ANN's learning process is quite complex and inefficient for modeling. Furthermore, the choices of model structures and parameters lack a standard theory, so it usually suffers from overfitting or underfitting with ill-chosen parameters.

As a promising solution to nonlinear regression problems, SVM [16] has recently been winning popularity due to its remarkable characteristics such as good generalization performance, absence of local minima, and sparse solution representation. SVM is based on the structural risk minimization (SRM) principle, which tries to control model complexity as well as the upper bound of the generalization risk. On the contrary, traditional regression techniques, including neural networks, are based on the empirical risk minimization (ERM) principle, which tries to minimize the training error only. Therefore, SVM is expected to achieve better performance than traditional methods. Prem and Raghavan [17] have explored the possibility of applying SVM to forecast resource measures. They indicated that SVM-based forecasts outperform the NWS methods, including autoregressive and mean/median-based methods.

This study aims at comparing the efficiency and accuracy of different models for multi-step-ahead prediction of grid resources by simulations. The modeling methods considered are variations of ANN, including the back propagation neural network (BPNN) [20], the radial basis function neural network (RBFNN) [21], and the generalized regression neural network (GRNN) [22], which hybridizes RBFNN and BPNN, plus variations of SVM, including Epsilon-support vector regression (ESVR) [16] and Nu-support vector regression (NSVR) [23]. The model parameters are initialized with commonly used values, as given in Tables 3 and 4. For the input feature number, since we are predicting up to 5 steps ahead, it should be bigger than 5.

The MAE results of the different models are shown in Figs. 7a and 7b. From both figures, we find that GRNN achieves better accuracy than BPNN and RBFNN, while NSVR and ESVR achieve the best performance for all q values considered. As the prediction step q increases, the prediction errors of GRNN, NSVR, and ESVR do not exceed the tolerance interval, with the bandwidth MAE below 40 Mbps and the host load MAE below 0.12. This means that these three methods are suitable for both one-step-ahead and multi-step-ahead resource state prediction. A remarkable characteristic of SVR is its sparse solution representation, namely, a model with fewer support vectors is better at achieving the same accuracy. We can see from Figs. 7c and 7d that the comparison results between the two SVR variants are data set dependent. In this case, the two SVR methods achieve similar accuracy and complexity.
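For concreteness, the sketch below shows how an NSVR model of this kind could be trained on the windowed samples using the LIBSVM Java API that the prototype employs later (Section 6). This is our own illustration: the class name and the parameter values are placeholders, not the settings of Table 4.

```java
import libsvm.*;

/** Sketch: train a Nu-SVR on samples with m features and predict one target. */
public final class NsvrSketch {

    public static svm_model train(double[][] inputs, double[] targets,
                                  double c, double nu, double gamma) {
        svm_problem prob = new svm_problem();
        prob.l = targets.length;
        prob.y = targets;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = toNodes(inputs[i]);
        }
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.NU_SVR;   // Nu-support vector regression
        param.kernel_type = svm_parameter.RBF;   // radial basis function kernel
        param.C = c;
        param.nu = nu;
        param.gamma = gamma;
        param.eps = 1e-3;                        // stopping tolerance
        param.cache_size = 100;                  // kernel cache in MB
        return svm.svm_train(prob, param);
    }

    public static double predict(svm_model model, double[] input) {
        return svm.svm_predict(model, toNodes(input));
    }

    private static svm_node[] toNodes(double[] x) {
        svm_node[] nodes = new svm_node[x.length];
        for (int j = 0; j < x.length; j++) {
            nodes[j] = new svm_node();
            nodes[j].index = j + 1;              // LIBSVM feature indices are 1-based
            nodes[j].value = x[j];
        }
        return nodes;
    }
}
```

Switching `svm_type` to `svm_parameter.EPSILON_SVR` and setting `param.p` (the ε of the loss) would give the ESVR variant under the same interface.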

Fig. 7. Comparative results on modeling methods.


The training CPU times of the models under consideration are compared in Figs. 7e and 7f. From each subfigure, we can see that the training time does not show a remarkable tendency as the step q increases. The SVRs cost less time than the ANNs, namely within 120 ms on both data sets. Based on the comparative results on accuracy and efficiency, SVR is selected by our system as the prediction strategy for modeling resource variations.

TABLE 5 Parameters for GA-SVR

TABLE 6 Parameters for PSO-SVR

4.4 Comparative Study on Optimization Methods

Genetic algorithm (GA) and particle swarm optimization (PSO) are two typical machine learning strategies in the category of evolutionary computation. These two methods can be employed to optimize the prediction model, in the expectation of achieving higher performance. GA was proposed by John Holland and his students in 1975 [18], inspired by the theory of natural selection and evolution. GA uses a set of chromosomes to represent solutions. The chromosomes from one population are taken and used to form a new population, which is called the offspring. The chromosomes with better fitness have more chances for reproduction, and consequently, the new population will be better than the old one. PSO was proposed by Kennedy and Eberhart [19], inspired by the social behavior of natural systems, such as bird flocking or fish schooling. The system initializes a population of random particles and searches a multidimensional solution space for optima by updating particle generations. Each particle moves based on the direction of the local best solution discovered by itself, and the global best solution shared by the swarm.

This study aims at comparing the optimization performance of GA and PSO by simulation. We concentrate on hyperparameter selection using the host load data set.

Parameters are initialized with commonly used values: the acceleration constants c1 and c2 are selected according to [19], the inertia weight w decreases linearly with time as proposed in [26], and SVR's hyperparameters (C, ε, γ) are changed exponentially during optimization [27]. The initialized parameters of GA and PSO and the optimized hyperparameters of the SVRs are given in Tables 5 and 6. If the input feature number is too small, we cannot tell the difference in optimizing time between the two, so we set it to 10. MAE was used to measure the accuracy of the optimized model, and the optimizing time was recorded to measure its efficiency. From Figs. 8a and 8b we find that PSO achieves lower error than GA for most of the q values considered, and costs less optimizing time. Based on these comparative results, PSO is selected by our system as the optimization strategy for prediction models.

Fig. 8. Comparative results on optimization methods.
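As an illustration of the exponential treatment of the hyperparameters described above, the following sketch is our own; the search ranges themselves are those of Tables 5 and 6 and are not reproduced here. A particle coordinate x is decoded into a hyperparameter value as 2^x, so that the optimizer effectively moves in log space, in the spirit of the exponentially growing sequences recommended in [27]:

```java
/** Sketch: decode search-space coordinates into SVR hyperparameters via 2^x scaling. */
public final class HyperparameterDecoder {

    /** Returns {C, epsilon, gamma} decoded from three particle coordinates. */
    public static double[] decode(double xC, double xEps, double xGamma) {
        return new double[] {
            Math.pow(2.0, xC),      // C       = 2^xC
            Math.pow(2.0, xEps),    // epsilon = 2^xEps
            Math.pow(2.0, xGamma)   // gamma   = 2^xGamma
        };
    }
}
```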

5 PROPOSED OPTIMIZATION ALGORITHM

According to the previous comparisons, SVR is selected as the automodeling strategy, and PSO is selected as the auto-optimization strategy.



Generally, a prediction model relies directly on the choice of its hyperparameters. In addition, irrelevant input features in the resource samples will also spoil the accuracy and efficiency of the model. Moreover, hyperparameter selection and feature selection correlate with each other. Besides, the prediction subsystem has a starlike distributed structure; such a topology should be utilized to accelerate the modeling and optimizing procedure. In this section, we define a combined criterion for fitness evaluation, and propose a Parallel Hybrid Particle Swarm Optimization (PH-PSO) algorithm for the resource prediction subsystem. PH-PSO takes both hyperparameter selection and feature selection into consideration and, thus, is expected to enhance the accuracy and efficiency of the subsystem.


5.1 Optimization Problem Definition

Concerning hyperparameter selection and feature selection jointly, we code this combinational optimization problem with a hybrid vector PR, which consists of real numbers and binary numbers. The real numbers p_1, p_2, ... represent the hyperparameters of the model, and the binary numbers bf_1, bf_2, ... represent the choice of sample features. The value 1 or 0 of bf_s stands for whether or not the corresponding feature in the samples is selected, and m is the full input feature number of the samples. The optimization can be defined as

\max \; Fit(PR) \quad \text{s.t.} \quad \begin{cases} PR = \{p_1, p_2, \ldots, bf_1, bf_2, \ldots, bf_m\}, \\ p_1, p_2, \ldots \in \mathbb{R}, \\ bf_s \in \{0, 1\}, \; s = 1, 2, \ldots, m. \end{cases} \tag{5}
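To fix ideas, the hybrid vector PR of (5) can be held in a structure like the following. This is a sketch with hypothetical names; the real part is shown generically rather than tied to specific hyperparameters:

```java
/** Sketch: hybrid particle holding real-coded hyperparameters and binary feature flags. */
public final class HybridParticle {
    public final double[] hyperParams;   // p_1, p_2, ...  (real part of PR)
    public final boolean[] featureMask;  // bf_1 ... bf_m  (binary part of PR)

    public HybridParticle(int numHyperParams, int numFeatures, java.util.Random rnd) {
        hyperParams = new double[numHyperParams];
        featureMask = new boolean[numFeatures];
        for (int i = 0; i < numHyperParams; i++) {
            hyperParams[i] = rnd.nextDouble();       // later rescaled to the search interval
        }
        for (int s = 0; s < numFeatures; s++) {
            featureMask[s] = rnd.nextBoolean();      // bf_s = 1 means feature s is selected
        }
    }

    /** Number of selected input features, i.e., the effective m of the candidate model. */
    public int selectedFeatureCount() {
        int count = 0;
        for (boolean b : featureMask) if (b) count++;
        return count;
    }
}
```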

The definition of the fitness function Fit is crucial in that it determines what the algorithm should optimize. Accuracy and efficiency are both of concern in evaluating the fitness of the prediction model. In other words, a model is better (has larger fitness) only if it has lower prediction error as well as less training time; this leads to a relationship of symmetrical inverse proportion. Moreover, when the training time is acceptable, accuracy is given priority. Accordingly, we define the fitness function as

Fitness = \frac{h}{MSE_t \cdot \ln T_t}, \qquad MSE_t = \frac{1}{l} \sum_{i=1}^{l} \bigl( s(t+q) - \hat{s}(t+q) \bigr)^2, \tag{6}

where MSE_t is the training mean squared error of 5-fold cross validation [28], h is a constant controlling the bound of the fitness, and T_t denotes the model's training time. l counts the number of samples, and s(t+q) and ŝ(t+q) denote the true value and the predicted value, respectively.
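Under the definitions in (6), the fitness evaluation amounts to the sketch below. It is illustrative only; the constant h, the cross-validation routine producing the errors, and the time unit of T_t are assumptions on our part:

```java
/** Sketch: fitness of a candidate model, as in (6); higher is better. */
public final class FitnessEvaluator {

    /**
     * @param h           constant bounding the fitness value
     * @param trueValues  targets s(t+q) collected during 5-fold cross validation
     * @param predicted   corresponding predictions s_hat(t+q)
     * @param trainTime   training time T_t (must exceed 1 so that ln T_t > 0)
     */
    public static double fitness(double h, double[] trueValues, double[] predicted,
                                 double trainTime) {
        double mse = 0.0;
        for (int i = 0; i < trueValues.length; i++) {
            double diff = trueValues[i] - predicted[i];
            mse += diff * diff;
        }
        mse /= trueValues.length;                  // MSE_t
        return h / (mse * Math.log(trainTime));    // Fitness = h / (MSE_t * ln T_t)
    }
}
```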

5.2 Parallel Hybrid Particle Swarm Optimization

There are mainly two types of PSO, distinguished by different updating rules for calculating the positions and velocities of particles: the continuous version [19], [26] and the binary version [29]. Hyperparameter selection is a continuous optimization problem, and feature selection is a binary optimization problem. Concerning our optimization problem definition, this study proposes a parallel optimization algorithm which hybridizes continuous PSO and binary PSO together, namely PH-PSO. The algorithm is initialized with a population of random particles and searches a multidimensional solution space for optima by updating particle generations. Each particle moves based on the direction of the local best solution discovered by itself and the global best solution discovered by the swarm. Each particle calculates its own velocity and updates its position in each iteration until the termination condition is met. Suppose there are P particles in a D-dimensional search space, with p = 1, 2, ..., P and d = 1, 2, ..., D.

1. A_{P×D} denotes the position matrix of all particles; the row vector a_p in A denotes the position of the pth particle, recorded as a_p = {a_{p1}, a_{p2}, ..., a_{pD}}.
2. V_{P×D} denotes the velocity matrix of all particles; the row vector v_p in V denotes the velocity of the pth particle, recorded as v_p = {v_{p1}, v_{p2}, ..., v_{pD}}.
3. LB_{P×D} denotes the local best positions of all particles; the row vector lb_p in LB denotes the local best position of the pth particle, recorded as lb_p = {lb_{p1}, lb_{p2}, ..., lb_{pD}}.
4. The row vector gb = {gb_1, gb_2, ..., gb_D} denotes the global best position shared by all particles.

The particle is represented by the hybrid vector PR, and during each iteration the real and binary parts of PR are updated jointly using different rules, namely (7) and (8), respectively. rdm(0,1), rdm1(0,1), and rdm2(0,1) are random numbers evenly distributed in [0,1], and t denotes the iteration step. The inertia weight w plays the role of balancing global search and local search; it can be a positive constant or even a positive linear/nonlinear function of time. The acceleration constants c1 and c2 represent the personal learning factor and the social learning factor, respectively. Sg is a sigmoid limiting transformation. For the real part of PR,

v_{pd}(t+1) = w \, v_{pd}(t) + c_1 \, rdm1(0,1) \, \bigl(lb_{pd}(t) - a_{pd}(t)\bigr) + c_2 \, rdm2(0,1) \, \bigl(gb_d(t) - a_{pd}(t)\bigr), \qquad a_{pd}(t+1) = a_{pd}(t) + v_{pd}(t+1). \tag{7}

For the binary part of PR,

v_{pd}(t+1) = w \, v_{pd}(t) + c_1 \, rdm1(0,1) \, \bigl(lb_{pd}(t) - a_{pd}(t)\bigr) + c_2 \, rdm2(0,1) \, \bigl(gb_d(t) - a_{pd}(t)\bigr); \quad \text{if } rdm(0,1) < Sg\bigl(v_{pd}(t+1)\bigr) \text{ then } a_{pd}(t+1) = 1, \text{ else } a_{pd}(t+1) = 0, \quad \text{where } Sg(v) = \frac{1}{1 + e^{-v}}. \tag{8}



Fig. 9. Flowchart of the PH-PSO Algorithm.


The flowchart of PH-PSO is shown in Fig. 9, with its major steps explained as follows:

1. Initialize the system: set parameters for the PSO system, including population, iteration number, and dimensional search intervals; set parameters for particles, such as inertia weight, personal learning factor, and social learning factor; randomly generate a position and velocity for each particle, which is coded using the hybrid vector PR.
2. Preprocessing in parallel: prepare the sample set with the corresponding features as well as the candidate model with the corresponding hyperparameters according to the particle representation.
3. Fitness evaluation in parallel: use the validation set to evaluate the candidate model, and then calculate the fitness of the particle according to (6).
4. Update the local best/global best: if a particle's fitness is better than its local best fitness, update the corresponding local best fitness and local best position; if a particle's fitness is better than the global best fitness, update the global best fitness and global best position.
5. Termination judgement: if the termination condition is met, go to step 7.
6. Update the velocity and position of each particle, and then go to step 2 for the next iteration.
7. Finish: output the global best position, and prepare the sample set with the selected features and the prediction model with the selected hyperparameters according to the representation of the global best position.
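The per-dimension update rules (7) and (8) above translate into the following sketch. This is our own code; the array layout and parameter names are assumptions, with the first realDims dimensions treated as the real part of PR and the remaining ones as the binary part:

```java
import java.util.Random;

/** Sketch: one PH-PSO iteration step for a single particle, following (7) and (8). */
public final class PhPsoUpdate {
    private static final Random RND = new Random();

    /**
     * Updates velocity v and position a of one particle in place.
     * Dimensions [0, realDims) follow the continuous rule (7); the rest follow the binary rule (8).
     */
    public static void step(double[] a, double[] v, double[] localBest, double[] globalBest,
                            double w, double c1, double c2, int realDims) {
        for (int d = 0; d < a.length; d++) {
            // Velocity update common to (7) and (8).
            v[d] = w * v[d]
                 + c1 * RND.nextDouble() * (localBest[d] - a[d])
                 + c2 * RND.nextDouble() * (globalBest[d] - a[d]);
            if (d < realDims) {
                a[d] = a[d] + v[d];                          // (7): continuous position update
            } else {
                double sg = 1.0 / (1.0 + Math.exp(-v[d]));   // sigmoid limiting transformation Sg
                a[d] = (RND.nextDouble() < sg) ? 1.0 : 0.0;  // (8): binary position update
            }
        }
    }
}
```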


6 EVALUATION ON PROTOTYPE SYSTEM

The prototype system's nodes run under the Fedora Core Linux 9.0 system and are connected by a 100 Mbps LAN; each node is equipped with a single Intel Pentium IV 3.0 GHz CPU. The Globus environment is built based on Globus Toolkit 4.2.0 [30], and the Libsvm toolkit [31] is employed to solve the QP problem of the SVR algorithm. The supporting services are coded in Java and deployed in the Globus container [32]. The whole system falls into two subsystems; the two are evaluated individually because of their different evaluation criteria. For the monitoring subsystem, we made a comparative study on overhead between our system and the Condor Hawkeye system [4], which is a well-known monitoring system that can work with Globus Toolkit. For the prediction subsystem, we compared the accuracy and efficiency of different prediction models.

6.1 Evaluation on Monitoring Subsystem

Following our design, the monitoring subsystem is built by coding the monitoring service and the resource sensors. Table 7 lists the resource sensors implemented in our system and the techniques used in their code. In the table, O indicates that the sensor executes certain operations (e.g., I/O operations) and calculates the running performance of the resource as monitoring data, such as latency and bandwidth; V means that the sensor gets information by parsing the /proc virtual file system.

Low overhead is the primary design purpose of the monitoring subsystem, which means that monitoring activities should not bring obvious influence on the computing nodes. Condor version 7.2.5 [33] and Hawkeye version 1.0.0 [4] are used in our experiments. For both our monitoring subsystem and the Condor Hawkeye system, we recorded the CPU and memory usage with monitoring up and down so as to evaluate their overhead. The sampling frequency is set to once per minute, and the recording process lasted for an hour. The performance data are given in Fig. 10. Table 8 lists the statistics of the performance data. Both monitoring systems occupy similar CPU usage, namely ours 11 percent and Hawkeye 12 percent. The memory used by our monitoring subsystem is about 21 MB, while the memory used by Condor and Hawkeye is about 50 MB; part of it is consumed by Condor, since Hawkeye needs Condor to achieve monitoring. It can be inferred that our monitoring subsystem does not bring obvious influence on computing nodes. Hence, the design of our monitoring subsystem is acceptable.

TABLE 7 Resource Sensors and Monitoring Techniques Used
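As an example of the "V"-type sensors in Table 7, a host-load sensor can read the one-minute load average directly from /proc/loadavg. The sketch below is our own illustration, not the project's code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: a "V"-type resource sensor that parses the /proc virtual file system. */
public final class HostLoadSensor {

    /** Returns the 1-minute load average reported by the kernel. */
    public static double readOneMinuteLoad() throws IOException {
        String line = Files.readAllLines(Paths.get("/proc/loadavg"),
                                         StandardCharsets.UTF_8).get(0);
        // /proc/loadavg looks like: "0.42 0.37 0.30 1/123 4567"
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("host load (1 min): " + readOneMinuteLoad());
    }
}
```

A real sensor would additionally attach a timestamp to each reading and run periodically, as described in Section 3.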

Fig. 10. Comparative results on monitoring overhead.

TABLE 8 Statistical Results of Overhead

6.2 Evaluation on Prediction Subsystem

The ESVR model has three hyperparameters, that is, C, ε, and γ [16]. However, the NSVR model is able to select ε by itself [23]; only C and γ are considered hyperparameters, which means that the order of complexity for model optimization decreases from O(n^3) to O(n^2). Therefore, NSVR is chosen to build the resource state prediction model in realizing the prototype system. PH-PSO is used to optimize the prediction model in the prototype system. The parameters of PH-PSO are initialized with commonly used values: the acceleration constants c1 and c2 are selected according to [19], the inertia weight w decreases linearly with time as proposed in [26], and the hyperparameters C and γ are changed exponentially during optimization [27], as listed in Table 9. One prediction service controls the overall prediction procedure, and evaluation services are used for evaluating model fitness in parallel. The number of evaluation services used for fitness evaluation is equal to the number of particles in the PH-PSO algorithm. All the tests are implemented through dynamic collaboration of the system services.

High accuracy and efficiency are the primary design goals of the prediction subsystem. We present the prediction and optimization results on the bandwidth and host load data sets. In q-step-ahead prediction, q = 1, 2, 3, 4, 5 are considered. In model optimization, we implemented four different strategies: feature selection with hyperparameter selection on NSVR (FH), feature selection without hyperparameter selection on NSVR (F0), hyperparameter selection without feature selection on NSVR (0H), and parameters given directly to SVR (00) without any optimization mechanism, as in [17]. The test data sets used are the same as in Section 4. We recorded the parallel CPU time in optimizing the models, and used MAE to measure the prediction accuracy.

The MAE results of the different models are shown in Figs. 11a and 11c. The MAE of bandwidth stays below 17.9 Mbps, and there is no remarkable difference between one-step-ahead and multi-step-ahead prediction. The MAE of host load stays below 0.08, with one-step-ahead prediction slightly better than multi-step-ahead prediction.
TABLE 9 PH-PSO Parameter Initialization

As the prediction step q increases, there is no obvious ascending trend in MAE on either data set. This implies that our modeling method is suitable for both one-step-ahead and multi-step-ahead resource state prediction. Furthermore, comparing the four strategies, the introduction of an optimizing mechanism helps to enhance the accuracy of the prediction model. This is especially true for the combinational optimization FH, since it achieves lower errors for most of the q values considered. We can see from Figs. 11a and 11b that the optimized NSVR models have higher accuracy than the SVR model without an optimizing strategy. The NSVR models' SV numbers are over 50, whereas the SVR's SV number is less than 10. Similar phenomena can be observed in Figs. 11c and 11d, except that the SVR's SV number is around 25. Clearly, there is a trade-off between model accuracy and solution sparseness: a model with more SVs is more complicated as well as more capable in characterization.

The parallel CPU times of combinational/individual optimization are compared in Figs. 12a and 12b. From each subfigure, we can see that the optimization time does not show a remarkable tendency as the step q increases. 0H costs more time than FH and F0. This indicates that the model's training time can be obviously reduced by feature selection rather than by hyperparameter selection. It is clear that the optimizing time of the combinational optimization FH is rather short by means of parallelization, namely within 3 seconds on both data sets.

Fig. 11. Comparative results on prediction performance.


The global best fitness of the combinational optimization FH during each iteration was recorded to evaluate the convergence performance. Landscape comparisons among different q values are shown in Figs. 12c and 12d. A trend is obvious on the host load data set: the global best fitness decreases clearly as the prediction step q increases, while such a trend is not found on the bandwidth data set, which implies that the bandwidth variation contains more noise than the host load. It is also implied in these subfigures that the combinational optimization converges within a proper number of iterations for most of the q values considered.

Fig. 12. Comparative results on optimization performance.

7 CONCLUSIONS AND FUTURE WORKS

We proposed a distributed resource monitoring and prediction architecture that seamlessly combines grid technologies, resource monitoring, and machine learning-based resource state prediction. This system consists of a set of distributed services that accomplish all required resource monitoring, data gathering, and resource state prediction functions. We defined a universal procedure for the modeling and optimization of resource state prediction. In building a multi-step-ahead prediction model, ANNs and SVRs were compared with respect to both efficiency and accuracy criteria. Comparative simulations indicate that Epsilon-Support Vector Regression and Nu-Support Vector Regression achieve better performance than the Back Propagation Neural Network, the Radial Basis Function Neural Network, and the Generalized Regression Neural Network. In the expectation of achieving higher performance, we compared the Genetic Algorithm and Particle Swarm Optimization for the prediction model's hyperparameter selection. Comparative simulations indicate that PSO achieves lower error and costs less optimizing time than GA.

In the prototype system, we implemented a series of sensors that cover most resource measures. Overhead evaluation shows that the monitoring subsystem does not bring obvious influence on computing nodes. A Parallel Hybrid Particle Swarm Optimization algorithm was proposed which combines discrete PSO and continuous PSO for the purpose of combinational optimization of the prediction model. Evaluation results indicate that the combination of PH-PSO and NSVR meets the accuracy and efficiency demands of an online system.

The results of this paper will contribute to the building and advancement of computing grid infrastructure. We plan to carry our research further in the following aspects: monitoring and prediction of grid tasks, classification and evaluation of grid resources, classification and evaluation of grid tasks, etc. It is believed that machine learning strategies are applicable tools for modeling and optimizing, and they will play a more important role by virtue of their potential in distributed computing environments.

ACKNOWLEDGMENTS

This research work is funded by the National Natural Science Foundation of China under Grants No. 61073009, 60873235, and 60473099, by the Science-Technology Development Key Project of Jilin Province of China under Grant No. 20080318, and by the Program of New Century Excellent Talents in University of China under Grant No. NCET-06-0300.

REFERENCES
[1] L.F. Bittencourt and E.R.M. Madeira, "A Performance-Oriented Adaptive Scheduler for Dependent Tasks on Grids," Concurrency and Computation: Practice and Experience, vol. 20, no. 9, pp. 1029-1049, June 2008.
[2] F. Wolf and B. Mohr, "Hardware-Counter Based Automatic Performance Analysis of Parallel Programs," Proc. Conf. Parallel Computing (ParCo '03), pp. 753-760, Sept. 2003.
[3] J. Dugan et al., "Iperf Project," http://iperf.sourceforge.net/, Mar. 2008.
[4] M. Livny et al., "Condor Hawkeye Project," Univ. of Wisconsin-Madison, http://www.cs.wisc.edu/condor/hawkeye/, Sept. 2009.
[5] M.L. Massie, B.N. Chun, and D.E. Culler, "The Ganglia Distributed Monitoring System: Design, Implementation, and Experience," Parallel Computing, vol. 30, no. 7, pp. 817-840, July 2004.
[6] A. Waheed et al., "An Infrastructure for Monitoring and Management in Computational Grids," Proc. Fifth Int'l Workshop Languages, Compilers and Run-Time Systems for Scalable Computers, vol. 1915, pp. 235-245, Mar. 2000.
[7] J.S. Vetter and D.A. Reed, "Real-Time Performance Monitoring, Adaptive Control, and Interactive Steering of Computational Grids," Int'l J. High Performance Computing Applications, vol. 14, no. 4, pp. 357-366, 2000.
[8] D.M. Swany and R. Wolski, "Multivariate Resource Performance Forecasting in the Network Weather Service," Proc. ACM/IEEE Conf. Supercomputing, pp. 1-10, Nov. 2002.
[9] P.A. Dinda and D.R. O'Hallaron, "Host Load Prediction Using Linear Models," Cluster Computing, vol. 3, no. 4, pp. 265-280, 2000.
[10] E. Caron, A. Chis, F. Desprez, and A. Su, "Design of Plug-in Schedulers for a GRIDRPC Environment," Future Generation Computer Systems, vol. 24, no. 1, pp. 46-57, 2008.
[11] P.A. Dinda, "Design, Implementation, and Performance of an Extensible Toolkit for Resource Prediction in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 2, pp. 160-173, Feb. 2006.
[12] A.C. Sodan, G. Gupta, L. Han, L. Liu, and B. Lafreniere, "Time and Space Adaptation for Computational Grids with the ATOP-Grid Middleware," Future Generation Computer Systems, vol. 24, no. 6, pp. 561-581, 2008.



[13] M. Wu and X.H. Sun, "Grid Harvest Service: A Performance System of Grid Computing," J. Parallel and Distributed Computing, vol. 66, no. 10, pp. 1322-1337, 2006.
[14] L.T. Lee, D.F. Tao, and C. Tsao, "An Adaptive Scheme for Predicting the Usage of Grid Resources," Computers and Electrical Eng., vol. 33, no. 1, pp. 1-11, 2007.
[15] A. Eswaradass, X.H. Sun, and M. Wu, "A Neural Network Based Predictive Mechanism for Available Bandwidth," Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '05), p. 33a, 2005.
[16] V.N. Vapnik, The Nature of Statistical Learning Theory, second ed. Springer-Verlag, 1999.
[17] H. Prem and N.R.S. Raghavan, "A Support Vector Machine Based Approach for Forecasting of Network Weather Services," J. Grid Computing, vol. 4, no. 1, pp. 89-114, 2006.
[18] J.H. Holland, Adaptation in Natural and Artificial Systems. MIT Press, 1975.
[19] J. Kennedy and R.C. Eberhart, "Particle Swarm Optimization," Proc. IEEE Int'l Conf. Neural Networks, pp. 1942-1948, 1995.
[20] L. Fausett, Fundamentals of Neural Networks. Prentice-Hall, 1994.
[21] S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan Publishing, 1994.
[22] D. Patterson, Artificial Neural Networks. Prentice-Hall, 1996.
[23] A.J. Smola and B. Scholkopf, "A Tutorial on Support Vector Regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[24] "Bandwidth Data Set," http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/combinedata/, Feb. 2009.
[25] "Host Load Data Set," http://people.cs.uchicago.edu/~lyang/Load/, Feb. 2009.
[26] Y. Shi and R.C. Eberhart, "A Modified Particle Swarm Optimizer," Proc. IEEE Int'l Conf. Evolutionary Computation, pp. 69-73, 2000.
[27] C.W. Hsu, C.C. Chang, and C.J. Lin, "A Practical Guide to Support Vector Classification," Dept. of Computer Science and Information Eng., Nat'l Taiwan Univ., http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2003.
[28] M.W. Browne, "Cross-Validation Methods," J. Math. Psychology, vol. 44, no. 1, pp. 108-132, 2000.
[29] J. Kennedy and R.C. Eberhart, "A Discrete Binary Version of the Particle Swarm Optimization," Proc. IEEE Int'l Conf. Neural Networks, pp. 4104-4108, 1997.
[30] Globus Alliance, "Globus Project," http://www.globus.org/toolkit/downloads/4.2.0/, 2008.
[31] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/, May 2008.
[32] S. Borja, "The Globus Toolkit 4 Programmer's Tutorial," http://gdp.globus.org/gt4-tutorial/, Nov. 2005.
[33] M. Livny et al., "Condor Project," Univ. of Wisconsin-Madison, http://www.cs.wisc.edu/condor/, Sept. 2009.

Liang Hu received the MS and PhD degrees in computer science from Jilin University, in 1993 and 1999, respectively. Currently, he is a professor and doctoral supervisor at the College of Computer Science and Technology, Jilin University, China. His research areas are network security and distributed computing, including related theories, models, and algorithms of PKI/IBE, IDS/IPS, and Grid Computing. He is a member of the China Computer Federation.

Xi-Long Che received the MS and PhD degrees in computer science from Jilin University, in 2006 and 2009, respectively. Currently, he is a lecturer at the College of Computer Science and Technology, Jilin University, China. His current research areas are machine learning and parallel computing, including related theories, models, and algorithms of ANN, SVC/SVR, GA/ACO/PSO, and their combinations with parallel computing. He is a member of the IEEE. He is the corresponding author of this paper.

Si-Qing Zheng received the PhD degree in electrical and computer engineering from the University of California, Santa Barbara, in 1987. He is currently a professor of computer science, computer engineering, and telecommunications engineering. He served as the head of the Computer Engineering Program and Telecommunications Engineering Program, and associate head of the Computer Science Department and Electrical Engineering Department, all at the University of Texas, Dallas. His research interests include algorithms, computer architectures, networks, parallel and distributed processing, performance evaluation, circuits and systems, hardware/software codesign, real-time and embedded systems, and telecommunications. He has published in these areas extensively. He is a senior member of the IEEE.
