
Sandeep Tayal / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES, Vol. No. 5, Issue No. 2, 111-115

Task Scheduling Optimization for Cloud Computing Systems


University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi-110006, India
Abstract—Cloud computing is a new computing paradigm in which applications, data, and IT services are provided over the Internet. Task management plays a key role in cloud computing systems, and task scheduling problems are of prime importance because they relate to the efficiency of the whole cloud computing facility. Task scheduling is an NP-complete problem. In Hadoop, the open-source implementation of MapReduce, scheduling policies such as FIFO are used by the master node to distribute waiting tasks to computing nodes, so there is still room for improvement in the task scheduling process. In this paper, we therefore propose an optimized algorithm based on fuzzy-GA optimization that makes a scheduling decision by evaluating the entire group of tasks in the job queue.

Keywords—cloud computing, fuzzy, genetic algorithm (GA), task scheduler

Sandeep Tayal

I. INTRODUCTION

Cloud computing is the next generation of computation; possibly, people can have everything they need on the cloud. It is the next natural step in the evolution of on-demand information technology services and products, and an emerging technology that is rapidly consolidating itself as the next big step in the development and deployment of an increasing number of distributed applications [1, 2]. As cloud computing emerges for a variety of Internet businesses, many computing frameworks have been proposed for huge data stores and highly parallel computing needs, such as Google MapReduce [2]. Hadoop MapReduce, running on top of the Hadoop Distributed File System (HDFS), is inspired by Google MapReduce. Hadoop breaks a job, defined by a Map function and a Reduce function, into map tasks and reduce tasks. Job scheduling in Hadoop is performed by a master node, which receives heartbeats sent by the slaves every few seconds. Each slave has a fixed number of map slots and reduce slots in which to execute map or reduce tasks. These tasks are processed in parallel on the nodes of the cluster under a policy that strives to keep the work as close to the data as possible.

Task scheduling problems are of paramount importance, as they relate to the efficiency of the whole cloud computing facility. Scheduling algorithms in distributed systems usually have the goals of spreading the load over the processors and maximizing their utilization while minimizing the total task execution time [3]. Task scheduling, one of the most famous combinatorial optimization problems, plays a key role in building flexible and reliable systems.
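The map/reduce split described above can be sketched in miniature with the classic word-count job. This is our own illustrative Python, not Hadoop code; all function names here are assumptions for the sketch:

```python
from collections import defaultdict

# Map phase: each map task turns an input chunk into (key, value) pairs.
def map_task(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group intermediate pairs by key, as the framework would.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reduce task aggregates the values for one key.
def reduce_task(key, values):
    return key, sum(values)

chunks = ["cloud task scheduling", "cloud computing"]
intermediate = [pair for c in chunks for pair in map_task(c)]
counts = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'cloud': 2, 'task': 1, 'scheduling': 1, 'computing': 1}
```

In a real cluster, each map_task call would run on the node holding its input chunk, which is exactly the data-locality policy the scheduler tries to preserve.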

ISSN: 2230-7818

© 2011 http://www.ijaest.iserp.org. All rights reserved.

The main purpose is to schedule tasks onto the adaptable resources in accordance with adaptable time, which involves finding a proper sequence in which tasks can be executed under transaction-logic constraints [5]. Based on the system information used by the scheduling approach, there are two main categories, namely static and dynamic, and both have their own limitations. A dynamic load-balancing mechanism usually performs better than a static one, but it has higher overhead, since the schedule needs to be determined on the fly and system information must be kept up to date. Current Hadoop schedulers include the FIFO scheduler, the Fair scheduler [4], and the Capacity scheduler [9]. In this paper, we use fuzzy sets to model imprecise scheduling parameters and to represent the satisfaction grades of each objective. Genetic algorithms with different components are developed as the base technique for task-level scheduling in Hadoop MapReduce. To achieve a better-balanced load across all the nodes in the cloud environment, we revise the scheduler by predicting the execution time of tasks assigned to certain processors and making an optimal decision over the entire group of tasks.

II. RELATED WORK

A. Cloud Computing

Cloud computing is Internet-based development and use of computer technology. The cloud is a metaphor for the Internet and an abstraction for the complex infrastructure it conceals. Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service-provider interaction. It is delivered in three models: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) [6]. Figure 1 shows the architecture of cloud computing. A cloud computing system scales applications by maximizing concurrency and using computing resources more efficiently: one must optimize locking duration and statelessness, share pooled resources such as task threads and network connections, cache reference data, and partition large databases in order to scale services to a large number of users. IT companies with innovative ideas for new application services are no longer required to make large capital outlays on hardware and software infrastructure. By using clouds as the application hosting platform, IT companies are freed from the trivial task of setting up basic hardware and software infrastructure, and can thus focus more on



innovation and the creation of business value for their application services [10]. Traditional and emerging cloud-based application services include social networking, web hosting, content delivery, and real-time instrumented data processing. Each of these application types has different composition, configuration, and deployment requirements. Quantifying the performance of provisioning (scheduling and allocation) policies in a real cloud computing environment (Amazon EC2 [13], Microsoft Azure [15], Google App Engine [14]) is challenging for different application models.

Figure 1. Cloud Computing Architecture

Cloud computing also describes applications that are extended to be accessible through the Internet. These cloud applications use large data centers and powerful servers that host Web applications and Web services. Anyone with a suitable Internet connection and a standard browser can access a cloud application.

B. Task Scheduling and Load-Balancing Techniques


A task is a (sequential) activity that uses a set of inputs to produce a set of outputs. In static partitioning, a fixed set of processes is assigned to processors either at compile time or at start-up, which avoids the run-time overhead of load balancing. In grid computing, load-balancing algorithms can be broadly categorized as centralized or decentralized, dynamic or static [7], or, in the latest trend, hybrid policies. A centralized load-balancing approach can support a larger system, and the Hadoop system adopts a centralized scheduler architecture. In static load balancing, all information is known in advance; tasks are allocated according to this prior knowledge, and the allocation is not affected by the state of the system. A dynamic load-balancing mechanism has to allocate tasks to the processors dynamically as they arrive, and tasks must be redistributed when some processors become overloaded [4].
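The dynamic policy described above can be sketched as a greedy assignment of each arriving task to the currently least-loaded processor. This is a minimal illustration in Python under our own assumptions (task costs known on arrival; the function and variable names are ours):

```python
import heapq

def dynamic_balance(task_costs, num_procs):
    """Assign each arriving task to the currently least-loaded processor."""
    # Min-heap of (current_load, processor_id); the root is the least-loaded unit.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    for task, cost in task_costs:
        load, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (load + cost, proc))
    return assignment

# A static policy would fix this mapping at start-up; here it adapts per arrival.
tasks = [("t1", 4.0), ("t2", 2.0), ("t3", 1.0), ("t4", 3.0)]
print(dynamic_balance(tasks, 2))  # {'t1': 0, 't2': 1, 't3': 1, 't4': 1}
```

The heap lookup is the extra bookkeeping overhead the text mentions: the schedule is recomputed on every arrival instead of once at start-up.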

In cloud computing, each user application runs on a virtual operating system, and the cloud system distributes resources among these virtual operating systems. Every application is completely different and independent, with no link between applications whatsoever; for example, some require more CPU time to compute complex tasks, while others may need more memory to store data. Resources are consumed by the activities performed on each individual unit of service. In order to measure the direct costs of applications, every individual use of a resource (CPU cost, memory cost, I/O cost, etc.) must be measured. Once the direct cost of each individual resource has been measured, more accurate cost and profit analysis becomes possible [12].
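The per-resource cost measurement described above amounts to weighting each measured resource usage by a unit price and summing. A minimal sketch follows; the unit prices and resource names are invented for illustration:

```python
# Hypothetical unit prices for each measured resource (assumed values).
UNIT_COST = {"cpu_seconds": 0.02, "memory_mb_hours": 0.001, "io_mb": 0.0005}

def direct_cost(usage):
    """Sum the measured usage of each resource weighted by its unit price."""
    return sum(UNIT_COST[res] * amount for res, amount in usage.items())

# Measured usage for one application's unit of service.
app_usage = {"cpu_seconds": 120, "memory_mb_hours": 512, "io_mb": 40}
print(round(direct_cost(app_usage), 4))  # 2.932
```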


C. Overview of Genetic Algorithms

GAs are stochastic techniques based on the mechanisms of natural selection and genetics. Genetic algorithms are a particular category of evolutionary algorithms that aim at finding exact or approximate solutions to optimization problems; candidate solutions are encoded, often in binary strings, and mutation and crossover modify the population through generations. In mathematics, an optimization problem seeks to minimize or maximize a function by choosing appropriate values for its variables. Genetic algorithms are inspired by biological evolution and based on genetic operations on genes, such as mutation, which changes the current value of a gene, and crossover, which creates a new chromosome inheriting characteristics from two parent chromosomes [11]. These operations are applied together with Darwin's theory of evolution, which states that individuals better fitted to the environment survive and reproduce, propagating their genetic code to offspring that are born with characteristics similar to their parents' and are consequently equally well or better fitted to the same environment [10].

The genetic paradigm is a flexible approach enabling, for the same problem, different individual representations and different algorithm implementations for selecting individuals and performing mutation. However, an appropriate representation of potential solutions is crucial to ensure that the mutation or crossover of any pair of individuals (i.e., chromosomes) results in new valid and meaningful individuals for the problem. Conversely, the choice of the fitness function that discriminates between solutions and converts them back to the Boolean space should be carefully studied [8].

D. Fuzzy Setting of GA Parameters

We design an algorithm for the fuzzy setting of GA parameters. Our idea is to adapt the GA operator values (selection, crossover, mutation) during the run of the GA; the fuzzy control is applied if the condition for fuzzy adaptation is true. In many problems, especially in complex industrial systems, there exist many kinds of fuzzy nonlinear scheduling and production-planning problems that cannot be described and solved by traditional production planning and scheduling models. Research on optimization methods for nonlinear programming under a fuzzy environment is therefore important in fuzzy optimization theory, and it has great and wide value in application to production planning and scheduling problems. In general, the process of solving a fuzzy optimization problem consists of the following steps:
- Depending on the problem type, formalize it as a linear programming problem, a multiobjective problem, or a non-linear programming problem.
- If the objectives are to be fuzzified, define the goal for those objectives.
- Select the membership functions used to fuzzify any parameters and constraints, such as triangular, sinusoidal, or trapezoidal functions.
- Define the membership functions with the necessary parameters for the preference values and tolerances.
- Define thresholds for the allowable degree of deviation/violation in constraint satisfaction.



- Define the aggregation operator used to combine the constraints (and goals), e.g., the minimum t-norm operator.
- Solve the problem (in this paper, with a genetic algorithm).

III. PROPOSED MODEL

The inspiration of our work is to give the centralized scheduler (master node) a choice by referring to a global view of the whole system. The framework of our proposed model is shown in Figure 2. The System Model describes the information related to processors, including slot information, data replication information, and processor workload information. The Task Model includes the job and task information to be processed in the queue. The Predicted Execution Time Model is the basis for the later schedule optimization; it can be obtained by statistical techniques with tolerable deviation. Using the information of the Task Model, System Model, Predicted Execution Time Model, and Objective Function as input, fuzzification of the parameters is performed, the GA is run, and an optimal schedule is generated. When new jobs arrive or a rescheduling condition is met, such as a processor failure, a reschedule needs to be done. The details of each part of the model are given in the following sections.

Figure 2. Proposed Model

A. Task Model and Related Properties

MapReduce tasks, or tasks in general, have dependences on each other and on computing-related factors. Task-related information is attached to each task, including runtime demands, e.g., CPU and disk requirements, and this could be extended to cover a more comprehensive situation. Suppose that a cloud computing system consists of m (m > 1) heterogeneous processing units. The tasks in this system have the following features:
a) Tasks are aperiodic, i.e., task arrival times are not known a priori. Every task has the attributes arrival time (ai), worst-case computation time (ci), and deadline (di). The ready time of a task is equal to its arrival time.
b) Tasks are nonpreemptive, and each of them is independent.
c) Each task has two modes of access to a processing unit: (1) exclusive access, in which case no other task can use the resource with it, or (2) shared access, in which case it can share the resource with another task (provided the other task is also willing to share) [2].

B. Predicted Execution Time Model

Some prior knowledge is needed for a holistic evaluation of a schedule, so an estimate of the execution time of a certain task assigned to a certain processor is required. MapReduce frameworks make an explicit assumption that storage is local to computation, so we need a model of data-communication time when we estimate computation time. One possible way is to calculate Tcomp and Tcomm separately and sum them, but we found it very difficult to model network communication, so we refer to a model that combines the two. We adapt this model to our matrix E[t][p], in which each item represents the predicted execution time of task t on processor p. One similar model is the Expected Time to Compute (ETC) model [7] in grid computing. It is claimed to be reliable, since it is easy to know the computing capacity of each resource, and the computational requirements of the tasks can be known from specifications provided by the user, from historical data, or from predictions. It assumes that 1) the computing capacity of each resource, 2) a prediction of the computational needs (workload) of each job, and 3) the load of prior work of each resource are all known. In the cloud environment, attempts have been made at predicting Hadoop job performance. One such effort is a statistical modeling approach, Kernel Canonical Correlation Analysis (KCCA) [16], which represents each Hadoop job as a feature vector of job characteristics and a corresponding vector of performance metrics. That work adapts a previously proposed framework to predict Hadoop job performance, supporting the possibility of obtaining these data before real task execution. From the Hadoop job logs, performance feature vectors such as map time, reduce time, total execution time, map output bytes, HDFS bytes written, and locally written bytes can be extracted. These results convincingly demonstrate the effectiveness of this approach, which can also be used for our mechanism of MapReduce scheduling optimization.

C. System Model

The system model describes the data store and the computing cluster to which jobs can be assigned. The cluster includes machines arranged in a general tree-shaped switched network as in Figure 1. The nodes are commodity PCs, and data are distributed across these nodes. There are several replicas of each data block in the distributed file system; by default, the number of replicas is set to 3 in Hadoop. Map tasks generate intermediate data stored on the same node. We assume that communication overhead exists when the data are not located on the same node as the computing node. The network rate between two nodes in the same rack is faster than the communication between nodes in different racks when traffic on the main backbone network is heavy. Usually each rack contains 30-40 nodes. The links between racks are 1 Gbps, while rack-internal links are 1 Gbps and local disk reads are 2 Gbps. Each node can contain several processors, and for each node there are several map slots and reduce slots; usually there is one slot per processor.
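The task attributes (ai, ci, di) and the predicted-execution-time matrix E[t][p] described above can be sketched together as follows. This is our own minimal Python illustration: the prediction values are made up, whereas a real system would fill E from historical logs or a KCCA-style predictor.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    arrival: float          # a_i: arrival time (ready time equals arrival time)
    wcet: float             # c_i: worst-case computation time
    deadline: float         # d_i
    exclusive: bool = True  # exclusive vs. shared access to a processing unit

tasks = [Task("t0", 0.0, 5.0, 20.0), Task("t1", 1.0, 3.0, 15.0)]

# E[t][p]: predicted execution time of task t on processor p (assumed values).
E = [
    [4.0, 6.0],  # task t0 on processors p0, p1
    [3.5, 2.5],  # task t1 on processors p0, p1
]

# Given a schedule (task index -> processor), estimate each processor's finish time.
def finish_times(schedule, num_procs):
    finish = [0.0] * num_procs
    for t, p in enumerate(schedule):
        finish[p] += E[t][p]
    return finish

print(finish_times([0, 1], 2))  # [4.0, 2.5]
```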



Although the name node of Hadoop is itself aware of data locations, we use our own models for simulation purposes. The data location model, a table containing the data block locations for each task, is used by the Predicted Execution Time Model (Section B) to predict task execution times.

D. Reschedule

When a task cannot be completed due to a disk failure, processor failure, or other problems, the job may never complete. A timeout Tout can be set for this situation, and the uncompleted tasks can be rescheduled in the next calculation. Apart from failures, newly arriving jobs also need to be scheduled together with the uncompleted tasks.

E. Objective Function

The objective function for our algorithm is the latest completion time of the task schedule, referred to as the Makespan: Makespan(S) = max over processors i of ( Fi + Σ over t in Si of E[t][i] ), where Fi represents the time at which processor i will have finished its previously assigned jobs, Si is the set of tasks that schedule S assigns to processor i, and E[t][i] is the predicted execution time of task t on processor i.

F. Scheduling Optimizer

This paper also assumes a centralized scheduling scheme; i.e., a master processing unit in the cloud, collecting all tasks, takes charge of dispatching them to the other processing units. Each processing unit has its own dispatch queue (DQ), and the master unit communicates with the other processing units through these dispatch queues. This organization ensures that processing units always find some tasks in their dispatch queue when they finish executing their current task. The master unit works in parallel with the other units, scheduling newly arrived tasks and periodically updating the dispatch queues. Tasks are sorted in ascending order of deadline. The reasons for choosing a GA as the optimization algorithm are its simplicity of operation and its effectiveness; it is well suited to NP-hard problems.

1) Fuzzification

Fuzzification is the process of transforming values into grades of membership for the linguistic terms of fuzzy sets. A membership function is used to associate a grade with each linguistic term. For each selected input and output variable, we define two or more membership functions (MFs), normally three but possibly more, and assign each a qualitative category, for example low, normal, or high. The shape of these functions can vary, but we usually work with triangles, which require at least three parameters. We use the execution time, the workload, and the objective function as the three input variables for fuzzification. After fuzzifying all these parameters with triangular MFs, we apply the GA to obtain an optimized task schedule.

2) Genome Representation

We use a direct representation. For the job scheduling problem, a direct representation is obtained as follows: feasible solutions are encoded in a vector of size n (the number of tasks), called a schedule, where the number at position i indicates the slot to which task i is assigned. Thus, the values of this vector are natural numbers in the range [1, m], where m is the number of slots.

3) Genetic Operators

In the initialization process, the slot number (node number) that holds the data for a task is chosen as the initial assignment; if several slots hold the data for one task, the slot number is selected randomly among them. We believe this leads us faster to an optimal schedule. The selection operator used in this paper is based on the roulette wheel method: selection probabilities are calculated by dividing the fitness value of each individual by the sum of all fitness values. The crossover operator is simple one-point crossover, and the mutation operator is a flip mutator.

4) Our Overall Algorithm

Our whole algorithm, considering the aspects discussed above, is as follows:
I. Get the new tasks to be scheduled. These include the uncompleted tasks and newly arrived jobs. If jobs arrive dynamically and too many are waiting to be assigned at one time, the sliding-window technique [3] can be used as an option: the window size is fixed, and the tasks that fall into the sliding window get scheduled.
II. Generate the E matrix for the job, using the KCCA technique to predict the execution time of each individual task on every node.
III. Get the current state of the system.
IV. Fuzzify all the above parameters to obtain an optimized task schedule.
V. Map the fuzzified parameters into the GA:
   a. Generate an initial population of chromosomes randomly.
   b. Evaluate the fitness of each chromosome in the population according to the information in E.
   c. Create a new population by repeating the following steps until the new population is complete:
      - Selection: select two parent chromosomes from the population according to their fitness (the better the fitness, the higher the chance of being selected).
      - Crossover: with a crossover probability, perform crossover on the parents to form new offspring; if no crossover is performed, the offspring are exact copies of the parents.
      - Mutation: with a mutation probability, mutate the new offspring at each locus (position in the chromosome).
      - Acceptance: place the new offspring in the new population.
   d. Use the newly generated population for a further run of the algorithm.
   e. If the test condition is satisfied, stop and return the best solution in the current population.
   f. Otherwise, repeat step c until the target is met.
VI. Finally, obtain the optimal solution.

The task scheduling using the genetic algorithm is then complete.
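The pieces above — the direct genome, roulette-wheel selection, one-point crossover, flip mutation, the makespan objective over E[t][p], and a triangular membership function for the fuzzification step — can be sketched together as follows. This is our own illustrative implementation under simplified assumptions (a random E matrix, 0-based slot indices, and fixed operator rates rather than fuzzy-adapted ones), not the authors' code:

```python
import random

random.seed(1)

NUM_TASKS, NUM_SLOTS = 8, 3
# E[t][p]: predicted execution time of task t on slot p (random stand-in values).
E = [[random.uniform(1.0, 10.0) for _ in range(NUM_SLOTS)] for _ in range(NUM_TASKS)]

def makespan(schedule):
    """Latest completion time over all slots for a direct-encoded schedule."""
    finish = [0.0] * NUM_SLOTS
    for t, p in enumerate(schedule):
        finish[p] += E[t][p]
    return max(finish)

def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    r = random.uniform(0.0, sum(fitnesses))
    acc = 0.0
    for ind, f in zip(population, fitnesses):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def one_point_crossover(p1, p2):
    """Simple one-point crossover of two direct-encoded schedules."""
    cut = random.randrange(1, NUM_TASKS)
    return p1[:cut] + p2[cut:]

def flip_mutate(ind, rate):
    """Flip each locus to a random slot with the given mutation probability."""
    return [random.randrange(NUM_SLOTS) if random.random() < rate else g
            for g in ind]

def evolve(pop_size=20, generations=40, cx_rate=0.8, mut_rate=0.05):
    pop = [[random.randrange(NUM_SLOTS) for _ in range(NUM_TASKS)]
           for _ in range(pop_size)]
    best = min(pop, key=makespan)
    for _ in range(generations):
        # A shorter makespan means a fitter schedule; invert so roulette favours it.
        fit = [1.0 / makespan(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = roulette_select(pop, fit), roulette_select(pop, fit)
            child = one_point_crossover(a, b) if random.random() < cx_rate else a[:]
            new_pop.append(flip_mutate(child, mut_rate))
        pop = new_pop
        best = min(pop + [best], key=makespan)
    return best

best = evolve()
# The evolved schedule should at least beat the degenerate all-on-one-slot schedule.
print(makespan(best) <= makespan([0] * NUM_TASKS))  # True
```

In the paper's full scheme, the fuzzy control would adapt cx_rate and mut_rate during the run using membership grades such as triangular(...) over execution time, workload, and the objective value; here they are left constant for brevity.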

IV. CONCLUSION

With the rapid advancement of cloud technologies, there is a new need for tools to study and analyse the benefits of the technology and how best to apply it to large-scale applications. An efficient task scheduling mechanism can meet users' requirements and improve resource utilization, thereby enhancing the overall performance of the cloud computing environment. Task scheduling in grid computing, however, is often based on static task requirements, and the resource utilization rate is also low. In view of the new features of cloud computing, such as flexibility and virtualization, this paper discusses a two-level task scheduling mechanism based on load balancing in cloud computing. This task scheduling mechanism can not only meet users' requirements but also achieve high resource utilization. It still needs improvement, however, as the whole algorithm depends on the accuracy of the predicted execution time of each task. Although the KCCA method [16] has proved effective for some Hadoop jobs, e.g., Hive jobs and Extract-Transform-Load jobs, it has not been explored thoroughly in task scheduling situations. Second, the efficiency of prediction with the KCCA method is highly affected by the choice of task vector. Therefore, more research work needs to be done on this topic.

References

[1] Message Passing Interface (MPI), www.mcs.anl.gov/mpi/ (accessed 10.01.2011).
[2] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Sixth Symposium on Operating System Design and Implementation (OSDI '04), Dec. 2004, pp. 137-150.
[3] A. Y. Zomaya and Y.-H. Teh, "Observations on using genetic algorithms for dynamic load-balancing," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 9, 2001, pp. 899-911.
[4] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.
[5] Xiao Zhi-Jiao, Chang Hui-You, and Yi Yang, "An optimization method of workflow dynamic scheduling based on heuristic GA," Computer Science, vol. 34, no. 2, 2007.
[6] Cloud computing and distributed computing, http://www.cncloudcomputing.com/.
[7] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, and B. Yao, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," Journal of Parallel and Distributed Computing, vol. 61, no. 6, 2001, pp. 810-837.
[8] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, third ed., Springer-Verlag, 1998.
[9] Hadoop Capacity Scheduler, http://hadoop.apache.org/common/docs/current/capacity_scheduler.html (accessed 18.02.2011).
[10] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[11] J. F. Saray Villamizar, Y. Badr, and A. Abraham, "An enhanced fuzzy-genetic algorithm to solve satisfiability problems," IEEE Computer Society, Washington, DC, USA, 2009, pp. 77-82.
[12] J. A. Brimson, Activity Accounting: An Activity-Based Costing Approach, John Wiley & Sons, 1991.
[13] Amazon EC2, http://aws.amazon.com/ (accessed 09.01.2011).
[14] Google App Engine, http://code.google.com/appengine/ (accessed 14.01.2011).
[15] Microsoft Azure, http://www.microsoft.com/windowsazure (accessed 15.01.2011).
[16] A. Ganapathi, Y. Chen, A. Fox, R. Katz, and D. Patterson, "Statistics-driven workload modeling for the cloud," University of California, Berkeley, Tech. Rep., Nov. 2009.