
NANYANG TECHNOLOGICAL UNIVERSITY

CLOUD COMPUTING: APPLICATION ON DATA FARMING

Yong Yong Cheng

School of Computer Engineering 2010

NANYANG TECHNOLOGICAL UNIVERSITY

SCE09-0445 CLOUD COMPUTING: APPLICATION ON DATA FARMING

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Engineering (Computer Science) of the Nanyang Technological University By

Yong Yong Cheng

School of Computer Engineering 2010

Abstract
Objective-Based Data Farming requires a massive amount of computing power to run thousands or millions of simulations. Traditionally, acquiring this computing power meant owning the infrastructure: a cluster or grid of many inexpensive computers, or an expensive supercomputer. Either option amounts to an exorbitant sum of money over time as the need keeps growing. With the introduction of Amazon Elastic Compute Cloud (Amazon EC2) and the MapReduce programming model, owning the infrastructure is no longer a prerequisite for gaining access to this computing power. The term Cloud Computing has been introduced and grows more popular with each passing day. Cloud Computing makes massive amounts of computing power available as a utility and at a low cost. It also offers other benefits such as easy real-time scalability, high availability and fault tolerance. This project implements a robust private Cloud to address the security concerns of using military applications. It also implements a public Cloud to exhibit the feasibility of using a public infrastructure. In this project, a Web Service and six MapReduce applications, which are used to distribute workloads within a Cloud, are designed and implemented. They allow conventional Objective-Based Data Farming frameworks to take full advantage of Cloud Computing.

Acknowledgements
I would like to express my thanks to Asst. Prof. Malcolm Low Yoke Hean, Dr. James Decraene and Mr. Zeng Fanchao. Their guidance provided me with insight into various ideas and concepts that were useful to this project.

Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 Objectives
    1.2 Background
        1.2.1 Data Farming
        1.2.2 Computing Environments
        1.2.3 Objective-Based Data Farming
    1.3 Scope
    1.4 Report Organization
Chapter 2: Concepts & Frameworks
    2.1 Apache Hadoop
        2.1.1 Hadoop Distributed File System
        2.1.2 MapReduce
        2.1.3 Condor
    2.2 Web Service
    2.3 Complex Adaptive System Evolver
        2.3.1 Map Aware Non-Uniform Automata
        2.3.2 Evolutionary Algorithm
        2.3.3 Island Model
Chapter 3: Implementing Cloud Infrastructure & Applications
    3.1 Hadoop Cluster @NTU
        3.1.1 Development Cluster
        3.1.2 Auto-Updating & Reporting Tool
    3.2 Hadoop Cluster @EC2
    3.3 Hadoop Service
        3.3.1 Web Service Clients
Chapter 4: Preliminary Work
    4.1 Automated Red Teaming Framework
    4.2 MapReduce MANA
        4.2.1 Problems & Solutions
        4.2.2 Demonstrating Apache Hadoop-Compliant ART
            4.2.2.1 Results & Analysis
Chapter 5: Implementing Apache Hadoop-Compliant CASE
    5.1 Apache Hadoop-Compliant CASE
    5.2 Replication Model
        5.2.1 Testing Cluster Robustness
            5.2.1.1 Results & Analysis
    5.3 Standard Model
        5.3.1 Demonstrating Scalability
            5.3.1.1 Results & Analysis
        5.3.2 Evaluating Hadoop Cluster @EC2
            5.3.2.1 Results & Analysis
    5.4 Island MapReduce
        5.4.1 Island MapReduce 1
        5.4.2 Island MapReduce 2
        5.4.3 Evaluating Island MapReduce
            5.4.3.1 Results & Analysis
Chapter 6: Conclusion
    6.1 Summary
    6.2 Limitations
    6.3 Future Enhancements
References
Appendix
    Appendix A: List of Nodes in Hadoop Cluster @NTU
    Appendix B: Past Problems in MRMANA & Solutions
    Appendix C: Evolving Agent-Based Simulations in the Clouds

List of Tables
Table 1: Pros & Cons of Computing Environments for Data Farming
Table 2: Comparison between Condor & Apache Hadoop
Table 3: 32-Bit Instance Types on Amazon Elastic Compute Cloud (Amazon EC2)
Table 4: Web API (Application Programming Interface) of Hadoop Service
Table 5: Solutions to Problems in MRMANA
Table 6: Times Taken By MRMANA & MOMANA in Executing Simple & Complex Scenarios
Table 7: 5 Entry Points in Apache Hadoop-Compliant CASE
Table 8: Execution Times Demonstrating Robustness
Table 9: Execution Times Using 5 & K Excursions in a Map Task
Table 10: Execution Times Demonstrating Scalability
Table 11: Benchmarks for Virtual Machine, Small & Medium Instances
Table 12: Execution Times on Hadoop Cluster @EC2 & Hadoop Cluster @NTU
Table 13: Differences between Configuration 1, 2, 3 & 4 for Island MapReduce
Table 14: Execution Times for Island MapReduce with Configuration 1, 2, 3 & 4

List of Figures
Figure 1: Search Popularity on Google Since 2007
Figure 2: Data Farming Iterative Process
Figure 3: Cloud Computing Stack
Figure 4: Component Stacks of Apache Hadoop & Google MapReduce Framework
Figure 5: MapReduce Execution
Figure 6: Complex Adaptive System Evolver (CASE)
Figure 7: General Structure of an Evolutionary Algorithm (EA)
Figure 8: Differences between Single Population Model & Island Model
Figure 9: 37-Nodes Production Cluster
Figure 10: Screen Capture of the Installer
Figure 11: 4-Nodes Development Cluster
Figure 12: Steps in Getting an Update to Be Applied on Each Slave Node
Figure 13: Hadoop Cluster @EC2
Figure 14: Steps for Executing a MapReduce Job on Hadoop Cluster @NTU
Figure 15: Strategy Formulated To Solve a Problem in Log Recording
Figure 16: Solution to Solve Performance Degradation Due To Increasing Number of Queries
Figure 17: Screen Capture of Hadoop Service
Figure 18: Package Diagram for Hadoop Service & Java-Based Web Service Client
Figure 19: Screen Capture of Java-Based Web Service Client
Figure 20: Screen Capture of Web-Based Web Service Client
Figure 21: Screen Capture of CASE GUI
Figure 22: Cloud Computing Architecture
Figure 23: Architecture of ART Framework
Figure 24: Screen Capture of ART Framework
Figure 25: Steps for ART Framework to Submit & Run a MapReduce Job
Figure 26: Application Workflows of MRMANA & MOMANA
Figure 27: Differences between Apache Hadoop-Compliant ART Framework & CASE
Figure 28: 5 Entry Points in Apache Hadoop-Compliant CASE
Figure 29: Application Workflow of Replication Model
Figure 30: Application Workflow of Standard Model
Figure 31: Execution Times Using 5 & K Excursions in a Map Task
Figure 32: Execution Times Demonstrating Scalability Using Simple Scenario
Figure 33: Execution Times Demonstrating Scalability Using Complex Scenario
Figure 34: Comparison of Solutions between Standard Model & CASE
Figure 35: Application Workflow of Island MapReduce 1
Figure 36: Application Workflow of Island MapReduce 2
Figure 37: Comparison of Solutions between Island MapReduce 1 & Island MapReduce 2
Figure 38: Comparison of Solutions between Configuration 2, 3 & 4 for Island MapReduce 1


Chapter 1: Introduction
Merrill Lynch, one of the world's leading financial management and advisory companies, issued a research note titled "The Cloud Wars: $100+ Billion at Stake" on 7th May 2008 [1]. The analysts estimated that by 2011 the Cloud Computing market opportunity would amount to $160 billion. Today, Cloud Computing is rapidly gaining widespread acceptance with both the public and industry. Figure 1 shows the search popularity on Google, a popular search engine, since 2007 for three terms: Cloud Computing, Grid Computing and Cluster Computing. From the figure, it can be observed that the popularity of Cloud Computing is on the rise.
[Line chart: Search Volume Index, January 2007 to October 2010, for the terms Cloud Computing, Grid Computing and Cluster Computing]
Figure 1: Search Popularity on Google Since 2007

Many large corporations, such as Activision, British Telecom (BT) and Yahoo, have jumped on the Cloud Computing bandwagon and obtained the benefits that Cloud Computing promises, including reduced cost, scalability, high availability and fault tolerance. Gartner, a well-known information technology research and advisory company, has even identified Cloud Computing as one of the top 10 strategic technologies for 2011 [2].

1.1 Objectives
This project aims to:
- Explore the paradigm of Cloud Computing through Apache Hadoop, a popular Cloud Computing software framework; and
- Incorporate Cloud Computing into Data Farming so that the latter can be carried out at a lower cost, on a larger scale and in a shorter time.

1.2 Background
1.2.1 Data Farming
Data Farming [3] [4] [5] is a technique in which simulation models are executed thousands or millions of times to reveal the complexities in a problem landscape. It combines a set of enabling technologies and processes into a single integrated task that automates the above scientific method. This set of technologies and processes includes distributed and high-performance computing, agent-based simulations and rapid model development, knowledge discovery methods, high-dimensional data visualization techniques, design-of-experiments methods, human-computer interfaces, teamwork and collaborative environments, and heuristic search techniques. Data Farming is not intended for predicting an outcome; it is employed instead to aid intuition and to gain insight into a problem scenario. Data Farming is a collaborative and iterative process. The steps, as shown in Figure 2, are essential in the process and may be repeated until sufficient insight into a problem is gained.


Figure 2: Data Farming Iterative Process

The results obtained may be incorporated into other modeling and operational analysis activities, while the insight gained may be used to provide input to deterministic models or equations, or to build more realistic simulations and models.

1.2.2 Computing Environments


During Data Farming, each simulation model has to be executed thousands or millions of times, either for model testing or for parameter space exploration. These executions are performed using a single computer, a cluster of computers (Cluster Computing [6]) or a grid of computers (Grid Computing [7]). The pros and cons of each type of computing environment are summarized in Table 1.


Table 1: Pros & Cons of Computing Environments for Data Farming

A Single Computer
  Pros:
  - Simple to deploy simulation models.
  Cons:
  - Time-consuming to obtain a large amount of results.
  - May not be able to handle large and complex computing tasks.

A Cluster of Computers (Cluster Computing)
  Pros:
  - Reduces the time required to obtain a large amount of results.
  - Able to handle large and complex computing tasks.
  - Low network latency, as homogeneous computers are geographically located near to each other and linked together within a dedicated network.
  Cons:
  - High cost involved in managing, maintaining and upgrading the computers.
  - Limited size, as computers have to be located within the same organization.
  - Complex to deploy simulation models, since executions have to be split across multiple computers.

A Grid of Computers (Grid Computing)
  Pros:
  - Reduces the time required to obtain a large amount of results.
  - Able to handle large and complex computing tasks.
  - Unlimited size, as computers can be geographically dispersed and located across multiple organizations.
  Cons:
  - High cost involved in managing, maintaining and upgrading the computers.
  - Expensive, as computers may be left unused most of the time if computing tasks are not large and complex enough to utilize them.
  - Has to deal with more management issues when computers are managed by unrelated organizations.
  - More complex to deploy simulation models, since executions also have to be performed on heterogeneous computers.

Cloud Computing [8] [9] [10] represents a technological advancement that makes Grid Computing more user-friendly and attractive. One of its most outstanding characteristics is the ability to provision resources on demand, which removes the need to over-provision in order to meet the demands of large and complex computing tasks. These resources include computing power, storage space and network bandwidth. Cloud Computing is a computing concept in which resources in distributed computing systems are provided as a service, allowing users to consume them via the Internet on a utility computing basis. The users do not require any knowledge of, expertise with, or control over the technology infrastructure that supplies them with the resources.


A typical Cloud Computing architecture is composed of six layers, as shown in Figure 3.

Figure 3: Cloud Computing Stack [11]

- Clients: Computer hardware and/or software that are designed solely to deliver Cloud services and are essentially useless without them.
- Application (also known as Software-as-a-Service, SaaS): Eliminates the need to install and run applications on the users' computers, mitigating the burden of maintenance, upgrades and support. An example is Facebook.
- Platform (also known as Platform-as-a-Service, PaaS): Provides a platform as a service, consuming Cloud infrastructure and supporting Cloud applications. It aids in the development and deployment of Cloud applications, and eliminates the cost and complexity of buying and managing the underlying infrastructure. An example is Apache Hadoop.
- Infrastructure (also known as Infrastructure-as-a-Service, IaaS): Offers a computing infrastructure as a service. It allows users to purchase resources, such as computing power and storage space, on a utility computing basis. Two such examples are the High Performance Computing Centre (HPCC) in Nanyang Technological University (NTU) and Amazon Elastic Compute Cloud (Amazon EC2).
- Servers: Computer hardware and/or software used to support the delivery of Cloud services.

The benefits of Cloud Computing are:


- Cost: Users benefit from lower up-front capital cost, as they do not need to purchase their own infrastructure and pay only for the resources that they use;
- Scalability: Users do not need to over-provision resources to meet peak demands, as resources are provisioned dynamically, on a fine-grained and self-service basis, in near real-time; and
- Mobility: Users are able to access the resources via the Internet regardless of their location and device.

The drawbacks of Cloud Computing include privacy and security concerns about handing over confidential data to third-party providers, and the inability of users to do anything when the third-party providers suffer outages. However, it is in the best interest of the third-party providers to employ the most sophisticated high-availability and security strategies available for the users' data, and these strategies are likely to be far more stringent than any company's in-house policies.

1.2.3 Objective-Based Data Farming


Data Farming explores the entire parameter space. The size of this parameter space is proportional to the complexity of the problem landscape. As the problem landscape becomes more complex, the parameter space grows even larger, so the number of times each simulation model has to be executed, as well as the time involved in doing so, increases tremendously. Objective-Based Data Farming uses Evolutionary Algorithms (EAs) and objectives to direct a search within the parameter space. This search reduces the number of times each simulation model is executed, as points in the parameter space that have proved worthwhile with respect to the objectives are used to discover other points that may be worth evaluating. Each point in the parameter space corresponds to one execution of the simulation model. However, there is a limit to the reduction that the search can perform: too large a reduction may hide the complexities in the problem landscape and thus hinder the gaining of insight into a problem scenario. Objective-Based Data Farming still requires a massive amount of resources to execute a simulation model thousands or millions of times across a large parameter and value space. By adopting Cloud Computing, huge amounts of resources can be purchased inexpensively and only when needed, so that Data Farming can be performed at a lower cost, on a larger scale and in a shorter time. At the same time, the benefits of Cluster Computing and Grid Computing are preserved while their drawbacks are eliminated.

1.3 Scope
Many Cloud Computing offerings and platforms are available in the market. Due to time constraints, it is impossible to evaluate all of them in this project. Hence, this project explores the paradigm of Cloud Computing through Apache Hadoop, a popular Cloud Computing software platform inspired by the Google MapReduce framework. The development of Apache Hadoop was initiated and led by Yahoo, and has spawned many startups. One of these startups is Cloudera, which has raised $36 million in funding [12]. Besides exploring a private Cloud implemented using Apache Hadoop, this project also considers performing the exploration on a public Cloud. Amazon Elastic Compute Cloud (Amazon EC2) is chosen as the public Cloud, as it supports the Windows operating system; it has been in operation since 25th August 2006. Due to time constraints, two Objective-Based Data Farming frameworks are chosen for incorporation with Cloud Computing: the Automated Red Teaming (ART) Framework and the Complex Adaptive System Evolver (CASE).

1.4 Report Organization


This report is organized as follows:
- Chapter 2 illustrates the concepts and the software frameworks that are employed in this project;
- Chapter 3 describes the implementation of the Cloud infrastructure and applications;
- Chapter 4 describes the preliminary work of incorporating Cloud Computing into an Objective-Based Data Farming framework;
- Chapter 5 describes Apache Hadoop-compliant CASE, and presents the experiments that were performed and their results; and
- Chapter 6 summarizes this project, and describes its limitations and possible future enhancements.


Chapter 2: Concepts & Frameworks


This chapter describes the concepts and the software frameworks that are employed in this project. The software frameworks are Apache Hadoop and the Complex Adaptive System Evolver (CASE). The MapReduce programming model and the concepts of Web Services, Map Aware Non-Uniform Automata (MANA), Evolutionary Algorithms and the Island Model are also reviewed in this chapter.

2.1 Apache Hadoop


Apache Hadoop [13] is an open-source Java software framework for developing and deploying applications that run on large clusters built of commodity computers. A popular software platform for realizing Cloud Computing, it is largely inspired by the Google MapReduce framework, which is implemented in C++ and processes more than 20,000 terabytes of data per day across Google's massive computing clusters. The framework transparently furnishes applications with scalable and reliable distributed computing capabilities through two core components: the Hadoop Distributed File System (HDFS) and MapReduce. Figure 4 shows the component stacks of Apache Hadoop and the Google MapReduce framework. Each component in Apache Hadoop has an equivalent component in the Google MapReduce framework.


Figure 4: Component Stacks of Apache Hadoop & Google MapReduce Framework

The advantages of using Apache Hadoop are:
- Common tasks in distributed computing systems, such as scheduling, input partitioning, failover, replication and sorting of intermediate results, are automatically taken care of by the framework;
- Massive computing clusters have become increasingly easy to utilize because of the simplified MapReduce programming model; and
- The simplified MapReduce programming model also allows users to concentrate on designing the workflows of their applications.

The design of Apache Hadoop assumes that it is much more efficient to move the computation closer to where the required data is located and execute it there. This is especially true when the data is huge. This design aims to minimize network congestion and increase the overall throughput of the computing system.

2.1.1 Hadoop Distributed File System


Hadoop Distributed File System (HDFS) [14] is a distributed file system designed to be deployed on low-cost commodity computers. It is highly fault-tolerant, provides high throughput access to application data and is especially suitable for applications that handle large data-sets.


Although HDFS allows users to view and store their data as files, each file is internally split into one or more blocks. These blocks are replicated and then stored among multiple computers within the computing system. A typical HDFS cluster consists of a single NameNode and one or more DataNodes. The NameNode manages the namespace of the file system and executes only namespace operations, such as renaming files and directories. It also regulates access to files by clients and determines the mapping of blocks to DataNodes. Each DataNode serves read/write requests from clients and performs block creation, deletion and replication upon receiving instructions from the NameNode. The NameNode is a single point of failure for a typical HDFS cluster: if it fails, manual intervention is required to recover the namespace. Work is in progress to recover the namespace of a failed NameNode automatically [15].
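As an illustration of how an application interacts with HDFS programmatically, the sketch below writes and then reads back a small file through the org.apache.hadoop.fs.FileSystem API. It is a minimal example for this report, not code from the project; the NameNode address and the file path are placeholder values.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings are normally picked up from core-site.xml;
        // the NameNode address below is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them.
        Path path = new Path("/demo/hello.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.write("Hello HDFS".getBytes("UTF-8"));
        out.close();

        // Read the file back through the same API.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
    }
}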

2.1.2 MapReduce
MapReduce [16] was introduced to the world by Google in 2004. It is a programming model and an associated implementation for generating and processing large data-sets in a scalable, reliable and fault-tolerant manner. In the MapReduce programming model, a computation is expressed as two functions: Map and Reduce. The computation consumes a set of input key-value pairs and produces a set of output key-value pairs, conceivably of different types. The Map function processes a key-value pair to generate a set of intermediate key-value pairs, while the Reduce function merges and processes all intermediate values associated with the same intermediate key to produce one or more output key-value pairs. A MapReduce implementation distributes both Map and Reduce invocations across multiple computers within the computing system. It automatically partitions the input data into a set of input splits, which are then processed in parallel by the Map invocations on different computers. Using a partitioning function, the intermediate key space is partitioned into R pieces and each Reduce invocation processes one or more pieces. After successful completion, the output of the MapReduce execution, as shown in Figure 5, is available in R output files. A typical MapReduce cluster consists of a single JobTracker and one or more TaskTrackers. The JobTracker schedules Map/Reduce invocations to be executed across multiple nodes, monitors them and re-schedules failed invocations for execution. The TaskTrackers execute the invocations as instructed by the JobTracker. In this project, a MapReduce execution is referred to as a MapReduce job and a Map/Reduce invocation as a Map/Reduce task.

Figure 5: MapReduce Execution
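To make the Map and Reduce functions concrete, the sketch below shows the canonical word-count job written against the classic org.apache.hadoop.mapred API of that era. It is an illustrative example only, not one of the MapReduce models developed in this project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Map: emit (word, 1) for every word in a line of input.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, java.util.Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}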

2.1.3 Condor
Condor [17] is an open-source high-throughput computing software framework for distributing and executing computationally intensive tasks in parallel across multiple computers in large clusters. It runs on multiple operating systems: Linux, UNIX, Mac OS X, FreeBSD, and Windows.


Condor is able to seamlessly integrate both dedicated and non-dedicated resources into one computing environment. One outstanding feature of Condor is its ability to identify idle computers and distribute tasks to them for execution. Although Condor has been successfully deployed as a Cloud platform in place of Apache Hadoop [18], it is still not suitable for large numbers of tasks that are short-running, data-intensive or both. Work is in progress to use Condor to manage the clusters that support Apache Hadoop [19]. Table 2 shows a comparison between Condor and Apache Hadoop.
Table 2: Comparison between Condor & Apache Hadoop

Task Type
  Condor: Computation-intensive.
  Apache Hadoop: Data-intensive.

Application Structure
  Condor: Sequential, MW (Master Worker), MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).
  Apache Hadoop: MapReduce.

Checkpoint Mechanism
  Condor: Yes. Only supported on UNIX-like operating systems.
  Apache Hadoop: No.

Scheduling Awareness
  Condor: Idle computers are identified and tasks are distributed to these computers for execution.
  Apache Hadoop: Memory usage of each computer is monitored and a task is executed on a computer only if the memory requirements specified by the task are met on that computer (only supported on the Linux operating system). Tasks are executed on computers which store the required data or are close to the location of the required data.

Data Transfer
  Condor: No shared file system is required. During the execution of a task on a remote computer, input/output data is transferred automatically from/to the user's computer.
  Apache Hadoop: A shared file system is required; a distributed file system is preferred. Large clusters are partitioned into multiple racks, each consisting of computers in close proximity and holding a replica of the required data. During the execution of a task in a rack, a large portion of the data transfer occurs within that particular rack.


2.2 Web Service


A Web Service [20] is a software system that supports interoperable computer-to-computer interaction over a network and has an interface described in WSDL (Web Service Definition Language). Other software systems interact with the Web Service using SOAP (Simple Object Access Protocol) messages. These messages are typically transmitted using HTTP (Hypertext Transfer Protocol) with an XML (Extensible Markup Language) serialization. A Web Service can be engaged and used in several ways. In general, the following broad steps are required.
1. The requester and provider become known to each other, or at least one of them becomes known to the other.
2. The requester and provider agree on the service description, which governs the mechanism of interacting with the service, and the semantics, which dictate the meaning and purpose of the interaction. Both the service description and the semantics govern the interaction between the requester's and provider's software systems.
3. The service description and semantics are realized by the requester's and provider's software systems.
4. The requester's and provider's software systems exchange messages, thus performing some tasks on behalf of the requester and provider. The exchange of messages with the provider's software system represents the concrete manifestation of interacting with the provider's Web Service.
Web Services have many advantages. They provide interoperability between various software systems running on heterogeneous platforms, and allow software applications and services from different companies and locations to be combined easily into an integrated service. By utilizing HTTP, they are able to work through many firewall security measures without requiring any changes to the firewall filtering rules.

2.3 Complex Adaptive System Evolver


Complex Adaptive System Evolver (CASE) [21] is a framework developed by the EVOSIM project in the Parallel & Distributed Computing Centre (PDCC) of Nanyang Technological University (NTU).


It is developed under a project funded by the Singapore Defence Science & Technology Agency (DSTA). CASE is designed to simulate and evolve complex war game scenarios using Evolutionary Algorithms (EAs). A scenario represents a military operation that is modeled using Map Aware Non-Uniform Automata (MANA), while an excursion represents a change to one or more parameters in the scenario. The CASE framework, as illustrated in Figure 6, is constructed in a modular fashion using the Ruby programming language. It is composed of three main components.
- Excursion Generator: Takes in a scenario XML file and a set of excursion specification text files as inputs. Using these inputs, a set of excursion XML files is generated and sent to the Simulation Engine.
- Simulation Engine: Receives the set of excursion XML files and executes MANA using them as inputs. A set of result text files detailing the outcomes of the simulations is generated and used by the Evolutionary Algorithm to direct the search.
- Evolutionary Algorithm: Receives the set of result text files. Paired with the associated set of excursion specification text files, the result text files are processed to generate a new set of excursion specification text files. This new set may then be sent to the Excursion Generator to begin another round of evolution.

Figure 6: Complex Adaptive System Evolver (CASE)


2.3.1 Map Aware Non-Uniform Automata


Map Aware Non-Uniform Automata (MANA) [22] is a proprietary agent-based simulation model designed by the Defence Technology Agency (DTA) in New Zealand. It is developed using Delphi as the programming language and runs only on the Microsoft Windows operating system. MANA is used largely to model military operations, such as civil violence management, maritime surveillance and coastal patrols, owing to its easy representation of the more chaotic and intangible aspects of military conflicts. By leaving out detailed physical attributes of the military subjects concerned, scenarios can be run relatively fast and over many excursions with MANA, so that unique situations or tactics, where friendly forces can achieve dominance over an enemy, can be discovered. In this project, a large amount of the experiment time is spent using MANA to simulate scenarios over thousands or millions of excursions.

2.3.2 Evolutionary Algorithm


Evolutionary Algorithms (EAs) [23] are stochastic search methods that imitate natural biological evolution. By applying the principle of survival of the fittest, EAs operate on a population of potential solutions to generate better and better approximations to a solution. At each generation, a new set of approximations is created by selecting individuals based on their fitness level in the problem domain and breeding them together using operators borrowed from natural adaptation. Eventually, this process leads to an evolution of individuals that are better suited to their environment than the individuals they were created from. Figure 7 shows the general structure of an EA. A population is initially created at random, and then a loop, which consists of evaluation, selection, crossover and/or mutation, is executed a certain number of times. Each loop iteration is called a generation, and the termination criterion of the loop can be either a predefined maximum number of generations or another condition, such as stagnation in the population or the existence of an individual of sufficient quality. Finally, the individuals in the last population represent the best outcomes of the EA.

Figure 7: General Structure of an Evolutionary Algorithm (EA)
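As a minimal sketch of this general structure (not the EA implementation used in CASE or the ART Framework), the outline below shows the evaluate-select-breed loop for a generic individual type. The fitness function, selection scheme and variation operators are placeholders to be filled in for a concrete problem.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public abstract class SimpleEA<I> {
    protected final Random rng = new Random();

    // Problem-specific operators; these are the placeholders to implement.
    protected abstract I randomIndividual();
    protected abstract double fitness(I individual);
    protected abstract I crossover(I parentA, I parentB);
    protected abstract I mutate(I individual);

    // Sorts individuals so that the fittest come first.
    private final Comparator<I> byFitnessDesc = new Comparator<I>() {
        public int compare(I a, I b) {
            return Double.compare(fitness(b), fitness(a));
        }
    };

    public I run(int populationSize, int maxGenerations) {
        // The initial population is created at random.
        List<I> population = new ArrayList<I>();
        for (int i = 0; i < populationSize; i++) {
            population.add(randomIndividual());
        }

        for (int generation = 0; generation < maxGenerations; generation++) {
            // Evaluation and selection: keep the fitter half as parents (truncation selection).
            Collections.sort(population, byFitnessDesc);
            List<I> parents = new ArrayList<I>(population.subList(0, populationSize / 2));

            // Crossover and mutation: refill the population with offspring of random parent pairs.
            List<I> next = new ArrayList<I>(parents);
            while (next.size() < populationSize) {
                I a = parents.get(rng.nextInt(parents.size()));
                I b = parents.get(rng.nextInt(parents.size()));
                next.add(mutate(crossover(a, b)));
            }
            population = next;
        }

        // The fittest individual of the final population is the outcome of the EA.
        Collections.sort(population, byFitnessDesc);
        return population.get(0);
    }
}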

2.3.3 Island Model


The Island Model [24] [25] is an efficient parallelization technique for implementing an EA. It consists of several islands, each executing an EA and maintaining its own subpopulation for the search. The islands work together by periodically exchanging a portion of their subpopulations in a process called migration. Figure 8 shows the differences between the single population model and the Island Model.


Figure 8: Differences between Single Population Model & Island Model

The Island Model has often been reported to display better search performance than the single population model, in terms of the amount of computation time required, the quality of the solutions found, and the effort measured as the total number of evaluations of individuals sampled in the search space [26] [27]. One reason for this improvement is that the various islands maintain some degree of independence and thus explore different regions of the search space, while at the same time sharing information by means of migration. This can be seen as a means of sustaining genetic diversity. However, the Island Model introduces more parameters into the process. Four parameters, which always need to be fine-tuned when using the Island Model, are described below and illustrated in the sketch that follows the list.
- Migration Interval: The number of generations or evaluations of individuals before a migration occurs;
- Migration Size: The number of individuals on an island to migrate;
- Migration Policy: The type of individuals on the source island to migrate and those on the destination island to be replaced; and
- Migration Topology: The destination island to which the individuals on a source island are migrated.
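The sketch below shows how these four parameters might appear in code, using a ring migration topology in which each island sends its best individuals to the next island every few generations and replaces the worst individuals on the destination. It is a simplified illustration under assumed parameter values, not the migration scheme used later in Island MapReduce.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RingMigration {
    // The four Island Model parameters (example values only).
    static final int MIGRATION_INTERVAL = 10; // generations between migrations
    static final int MIGRATION_SIZE = 2;      // individuals sent per migration
    // Migration policy: best individuals replace the worst on the destination.
    // Migration topology: ring, island i sends to island (i + 1) % islands.size().

    /** Performs one migration step across all islands; each island is a list of
     *  individuals and byFitnessDesc sorts better individuals first. */
    static <I> void migrate(List<List<I>> islands, int generation, Comparator<I> byFitnessDesc) {
        if (generation % MIGRATION_INTERVAL != 0) {
            return; // not a migration generation
        }
        int n = islands.size();
        // Collect emigrants first so one migration step does not cascade between islands.
        List<List<I>> emigrants = new ArrayList<List<I>>();
        for (List<I> island : islands) {
            Collections.sort(island, byFitnessDesc);
            emigrants.add(new ArrayList<I>(island.subList(0, MIGRATION_SIZE)));
        }
        for (int i = 0; i < n; i++) {
            List<I> destination = islands.get((i + 1) % n);
            Collections.sort(destination, byFitnessDesc);
            // Replace the worst individuals on the destination island.
            for (int k = 0; k < MIGRATION_SIZE; k++) {
                destination.set(destination.size() - 1 - k, emigrants.get(i).get(k));
            }
        }
    }
}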


Chapter 3: Implementing Cloud Infrastructure & Applications


This chapter describes the implementation of a private Cloud and a public Cloud, called Hadoop Cluster @NTU and Hadoop Cluster @EC2 respectively. Since this project involves the use of military applications, such as MANA, the private Cloud is necessary to ensure the confidentiality of both the applications and the data. A Web Service, which makes an Apache Hadoop cluster available via the Internet, and its Web Service clients are also presented in this chapter. This Web Service is called Hadoop Service.

3.1 Hadoop Cluster @NTU


Hadoop Cluster @NTU is located in the Parallel & Distributed Computing Centre (PDCC) of Nanyang Technological University (NTU). Figure 9 illustrates the 37-node production cluster. The master node in this production cluster is a dedicated physical computer, and the other 36 slave nodes are made up of 30 non-dedicated physical computers and 6 dedicated virtual machines. No user intervention is required to bring up the production cluster: the NameNode and the JobTracker run automatically after the boot-up of the master node, while the DataNode and the TaskTracker on each slave node run automatically after the boot-up of that node.


Figure 9: 37-Nodes Production Cluster

Each slave node performs two roles: DataNode and TaskTracker. With this configuration, the MapReduce implementation is able to effectively schedule tasks on the slave nodes where the data is located, and high bandwidth is achieved throughout the production cluster. The following tasks have been performed to complement the production cluster; the last two have also been implemented for the development cluster.
- The master node has been moved to the server room, which is off-limits to the public. It can only be accessed remotely.
- 6 dedicated virtual machines have been added to the production cluster. This ensures that no MapReduce job will be terminated prematurely due to the inability of a task to execute, even if all 30 non-dedicated physical machines are offline.
- Rack-awareness has been enabled by a tool (named topology.7z) implemented in Python. It keeps network traffic within the same rack/place, which is much more desirable than network traffic moving across racks/places. High bandwidth is achieved throughout the production cluster and fault tolerance is improved, as the NameNode places block replicas on multiple racks/places.
- A shell script (named hadoopservice-maintenance.sh) and a batch file (named hadoopservice-maintenance.bat) have been written to remove any temporary files created by the Web Service and to delete any application data that has been stored with the Web Service for more than three months. The shell script and the batch file are scheduled to run automatically every night.
- A batch file (named hadoop-maintenance.bat) has been written to keep the NameNode and the JobTracker in a good working state. This batch file has to be executed manually and can also be used to restart the NameNode when it fails.

A physical computer can be added as a new node to the production cluster by running an installer (named HadoopUpdater.7z), which is created using NSIS (Nullsoft Scriptable Install System) [28]. Figure 10 shows a screen capture of the installer.

Figure 10: Screen Capture of the Installer

The production cluster is equivalent to having Cloud Computing on a private network and can be used to address the privacy, security and reliability concerns. However, it does not have the two benefits of Cloud Computing: lower up-front capital cost and less hands-on management. For a list of nodes in the production cluster, please refer to Appendix A.

3.1.1 Development Cluster


The development cluster consists of 4 nodes, all of which are dedicated virtual machines. Figure 11 illustrates the 4-node development cluster, which is used mainly for development and testing purposes.

31

Figure 11: 4-Nodes Development Cluster

For a list of nodes in the development cluster, please refer to Appendix A.

3.1.2 Auto-Updating & Reporting Tool


An auto-updating and reporting tool (named HadoopNodeService.7z) is installed on every slave node in both the production and development clusters. It is developed in C# and makes use of an open-source updater known as GUP (Generic Updater for Win32) [29]. GUP has been modified to download and install updates (named HadoopUpdater.7z) automatically in silent mode whenever they are available. Figure 12 shows the steps involved in getting an update applied on each slave node. An update, which can be created using NSIS (Nullsoft Scriptable Install System), must be applied on all slave nodes in both clusters before another update can be applied. This tool also supplies the job scheduler in Apache Hadoop with information about each slave node in the production and development clusters, including memory usage, processor utilization rate and idleness of the particular slave node.

32

Figure 12: Steps in Getting an Update to Be Applied on Each Slave Node

3.2 Hadoop Cluster @EC2


Amazon Elastic Compute Cloud (Amazon EC2) [30] is a Web Service offered by Amazon to supply its users with resizable resources in the Cloud. It allows users to rent virtual computers, which represent resources such as computing power, storage space and network bandwidth. Each virtual computer is called an instance. Amazon EC2 makes it easy to deploy multiple instances: an Amazon Machine Image (AMI) can be created from an instance on which the essential software has already been installed, and many more instances can then be spawned from this AMI. Amazon EC2 also offers many instance types; a 32-bit AMI can be used to create instances of any 32-bit instance type. Table 3 shows the 32-bit instance types available on Amazon EC2. One EC2 Compute Unit (ECU) represents the CPU capacity of a 1.0-1.2 GHz (gigahertz) 2007 Opteron or 2007 Xeon processor.

33

Table 3: 32-Bit Instance Types on Amazon Elastic Compute Cloud (Amazon EC2)

Micro: Micro category; 613 MB memory; up to 2 ECUs (for short periodic bursts).
Small: Standard category; 1.7 GB memory; 1 ECU (1 virtual core with 1 ECU).
Medium: High-CPU category; 1.7 GB memory; 5 ECUs (2 virtual cores with 2.5 ECUs each).

Housed on Amazon EC2, Hadoop Cluster @EC2 is made up of 4 instances, each of which can be treated as a node. Figure 13 illustrates this 4-node cluster. The master node has a permanent IP (Internet Protocol) address.

Figure 13: Hadoop Cluster @EC2

3.3 Hadoop Service


Hadoop Service (named HadoopService.7z) is a Web Service that allows MapReduce jobs to be submitted and run on an Apache Hadoop cluster via the Internet. It is an improved version of the Web Service implemented during the author's IA (Industrial Attachment) and URECA (Undergraduate Research Experience on Campus) projects. It is implemented in Java and runs on an Oracle GlassFish Server 2.1.1, which uses the Common class loader to load the Apache Hadoop JAR (Java Archive) files.

34

The following features have been implemented to enhance the Web Service:
- Killing a running MapReduce job on an Apache Hadoop cluster;
- Viewing all logs generated by a MapReduce job while it is running on an Apache Hadoop cluster; and
- Uploading/downloading a file using MTOM/XOP (Message Transmission Optimization Mechanism/XML-Binary Optimized Packaging).

For a MapReduce job to run on an Apache Hadoop cluster via the Web Service, two sets of files have to be submitted to the Web Service first: (i) a MapReduce model and (ii) an input data-set. A MapReduce model is an application written according to the MapReduce programming model to process the input data-set on an Apache Hadoop cluster. Figure 14 presents the steps for executing a MapReduce job on Hadoop Cluster @NTU.

Figure 14: Steps for Executing a MapReduce Job on Hadoop Cluster @NTU

Table 4 describes the Web API (Application Programming Interface) of Hadoop Service.

35

Table 4: Web API (Application Programming Interface) of Hadoop Service

getVersion(): Gets the version number of the Apache Hadoop JAR files that Hadoop Service is using.
isClusterAvailable(): Checks whether the Apache Hadoop cluster is available.
listModels(): Lists all MapReduce models stored in Hadoop Service.
getModel(): Retrieves the status information of a MapReduce model.
addModel(), removeModel(): Adds/Removes a MapReduce model to/from Hadoop Service.
listInputs(): Lists all input data-sets stored in Hadoop Service.
getInput(): Retrieves the status information of an input data-set.
addInput(), removeInput(): Adds/Removes an input data-set to/from Hadoop Service.
listOutputs(): Lists all output data-sets stored in Hadoop Service.
removeOutput(): Removes an output data-set from Hadoop Service.
prepareJob(), runPreparedJob(): Prepares and runs a MapReduce job.
runJob(), killJob(): Runs/Kills a MapReduce job.
getOutput(): Retrieves the status information of an output data-set.
getCompressedOutput(): Retrieves the file information of a compressed output data-set.
getCompressedWCOutput(): Retrieves the file information of a compressed output data-set. This data-set may be a portion of the original data-set.
uploadFile(), downloadFile(): Uploads/Downloads a file in multiple segments.
putFile(), getFile(): Uploads/Downloads a file using MTOM/XOP.
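As an illustration of how a client might drive this Web API from Java, the sketch below uses the standard JAX-WS javax.xml.ws.Service class to obtain a proxy and submit a job. The endpoint URL, the namespace, the HadoopServicePort interface and the exact method signatures are assumptions for illustration only; the WSDL published by the deployed Hadoop Service defines the real names and parameters, and the service endpoint interface is normally generated from it (for example with the wsimport tool).

import java.net.URL;

import javax.jws.WebService;
import javax.xml.namespace.QName;
import javax.xml.ws.Service;

public class HadoopServiceClientSketch {

    // Hand-written stand-in for the generated service endpoint interface;
    // its operations mirror the Web API in Table 4 (signatures are assumed).
    @WebService
    public interface HadoopServicePort {
        String getVersion();
        boolean isClusterAvailable();
        String runJob(String modelName, String inputName);
        void killJob(String jobId);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and namespace; the deployed WSDL defines the real values.
        URL wsdl = new URL("http://example-server:8080/HadoopService/HadoopService?wsdl");
        QName serviceName = new QName("http://example.org/hadoopservice", "HadoopService");

        Service service = Service.create(wsdl, serviceName);
        HadoopServicePort port = service.getPort(HadoopServicePort.class);

        if (port.isClusterAvailable()) {
            // Submit a previously uploaded MapReduce model against an uploaded input data-set.
            String jobId = port.runJob("MRMANA", "simple-scenario-inputs");
            System.out.println("Submitted job " + jobId
                    + " via Hadoop Service (Hadoop " + port.getVersion() + ")");
        }
    }
}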

Hadoop Service allows multiple MapReduce jobs to be submitted and run concurrently. Running these MapReduce jobs in parallel caused problems in log recording and service performance. These problems and their solutions are presented below.
1. The logging information produced by a running MapReduce job became garbled with that generated by other running MapReduce jobs. Figure 15 presents the strategy formulated to solve this problem.

Figure 15: Strategy Formulated To Solve a Problem in Log Recording

36

2. The status of each running MapReduce job is queried frequently. The increasing number of queries, due to multiple running MapReduce jobs, degraded the performance of the Web Service. Figure 16 shows the solution to this problem.

Figure 16: Solution to Solve Performance Degradation Due To Increasing Number of Queries

3. When the status of a running MapReduce job was queried, the logging information produced by that MapReduce job was also sent at the same time, which further degraded the performance of the Web Service. Since it was observed that logging information is viewed only when a MapReduce job fails, the logging information is now sent only when a MapReduce job completes.
Note that a MapReduce job, from the perspective of Hadoop Service, corresponds to one or more MapReduce jobs on an Apache Hadoop cluster, all spawned by a single instance of a MapReduce model. Hadoop Service also allows the status information of the cluster to be viewed via the Internet. Figure 17 shows a screen capture of the webpage presenting the status information of the cluster.

37

Figure 17: Screen Capture of Hadoop Service

3.3.1 Web Service Clients


Two Web Service clients have been implemented to allow MapReduce jobs to be submitted and run on an Apache Hadoop cluster via the above Web Service. The first Web Service client (named HadoopWSClient.7z) is implemented in Java. It serves as a GUI (Graphical User Interface)/CLI (Command-Line Interface) application for users, or as an add-on application for any PHP-enabled Web Server. Figure 18 shows the package diagram for Hadoop Service and this first Web Service client.

38

Figure 18: Package Diagram for Hadoop Service & Java-Based Web Service Client

Figure 19 shows the screen capture of the first Web Service client.

Figure 19: Screen Capture of Java-Based Web Service Client

The second Web Service client (named HadoopWebClient.7z) is a Web-based application implemented in PHP and JavaScript. It provides an interactive and personalized experience for users by employing AJAX (Asynchronous JavaScript & XML) and client-side local storage. It runs on any Web Server that supports PHP 5.2.4 and relies on the first Web Service client, which must be running in the background on the Web Server. Figure 20 shows a screen capture of this second Web Service client, which provides an option to choose which Apache Hadoop cluster a MapReduce job is submitted to and run on.

39

Figure 20: Screen Capture of Web-Based Web Service Client

Figure 21 shows a screen capture of another Web Service client that uses Hadoop Service. It was implemented by another student specifically to run CASE (Complex Adaptive System Evolver) on an Apache Hadoop cluster and is called CASE GUI.

40

Figure 21: Screen Capture of CASE GUI

Figure 22 illustrates the Cloud Computing architecture that has been implemented.

Figure 22: Cloud Computing Architecture

41

Chapter 4: Preliminary Work


This chapter describes the preliminary work of incorporating Cloud Computing into the Automated Red Teaming (ART) Framework. Two MapReduce models, which were implemented to enable the ART Framework to execute on an Apache Hadoop cluster, are also presented in this chapter: MapReduce MANA (MRMANA) and Map-Only MANA (MOMANA).

4.1 Automated Red Teaming Framework


Red Teaming is a technique often utilized to uncover vulnerabilities and breaches in operational concepts, with the ultimate goal of improving them. However, it demands close collaboration from a group of subject matter experts, whose knowledge and experience greatly influence the success of the technique. This is especially so given the complicated and multifaceted nature of military operational concepts. Automated Red Teaming (ART) is a concept that enhances the Manual Red Teaming (MRT) effort with the automated discovery of vulnerabilities and breaches in the targeted system. The technique works by accessing the targeted system using a series of rigorous strategies and keeping track of those strategies that perform exceedingly well against the operational concepts of the Blue team. These well-performing strategies provide the subject matter experts with alternative views of the various vulnerabilities and breaches in the operational concepts of the Blue team. The ART Framework realizes the ART concept by leveraging advanced technologies such as high-performance computing, EAs and agent-based simulations. It is developed by the DSO National Laboratories (DSO) using the Visual C++ programming language. The architecture of the ART Framework, as shown in Figure 23, is composed of the following components:
- ART Parameters Setting Interface: Allows the selection of the parameters that are to be varied;
- Simulation Model Dependent Modules: Add a layer of data flow between the ART Framework and the simulation models. Data flowing into the simulation models are the parameters to be executed, and data flowing out are the results from the simulation runs. These data are translated to the ART Framework data structures with wrappers that follow the simulation format;
- EA Module: Stores the EA library from which the user can choose. It also prepares the parameters for each individual simulation, analyses the results and distills the desired Red Teaming objectives;
- Condor Controller: Submits the run of each individual simulation to the Condor cluster. It also monitors the completion of each individual run and signals the ART Controller for further processing;
- ART Output Module: Provides feedback on the whole process, and updates the user on the selected parameters and the run results; and
- ART Controller: Coordinates the whole process.

Figure 23: Architecture of ART Framework [31]

Figure 24 shows a screen capture of the ART Framework.


Figure 24: Screen Capture of ART Framework

The Java-based Web Service client has been incorporated into the ART Framework. Figure 25 presents the steps that the ART Framework performs, via the Java-based Web Service client, to submit and run a MapReduce job on an Apache Hadoop cluster.

Figure 25: Steps for ART Framework to Submit & Run a MapReduce Job


4.2 MapReduce MANA


MapReduce MANA (MRMANA) is a MapReduce application that was initially implemented by the author during his Industrial Attachment at the DSO National Laboratories and was incrementally improved during this project. In MRMANA, each Map task executes MANA for one replication of an excursion, while each Reduce task gathers the results of all replications that belong to an excursion. After gathering the results, the Reduce task calculates the means and standard deviations, and generates an output file for the excursion. If the execution time for one replication of an excursion is negligible, it is very expensive to execute MANA for only one replication in a Map task because of the overhead of creating the Map task. In that case, Map-Only MANA (MOMANA) is preferred over MRMANA. In MOMANA, each Map task executes MANA for all replications of an excursion, calculates the means and standard deviations, and generates an output file for the excursion; there is no Reduce task. Figure 26 illustrates the application workflows of MRMANA and MOMANA. The number of Reduce tasks spawned in MRMANA and the number of Map tasks spawned in MOMANA are both equal to the number of excursions in the particular MapReduce job.

Figure 26: Application Workflows of MRMANA & MOMANA
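A minimal sketch of an MRMANA-style Map task is shown below. It assumes each input record describes one replication of one excursion and that MANA is launched as an external Windows executable; the class name, input layout, command-line options and helper method are illustrative assumptions, not the project's actual source.

```java
// Hedged sketch of an MRMANA-style Map task: run MANA once for one replication of an excursion
// and emit the raw result keyed by excursion id, so the Reduce task can aggregate replications.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ManaMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String MANA_EXE = "C:\\MANA\\mana.exe"; // assumed install path

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is assumed to hold "<excursionId> <replicationSeed> <modelFile>".
        String[] parts = line.toString().trim().split("\\s+");
        String excursionId = parts[0];
        String seed = parts[1];
        String modelFile = parts[2];

        // Execute MANA once for this replication of the excursion (flags are assumptions).
        Process mana = new ProcessBuilder(MANA_EXE, modelFile, "-seed", seed).start();
        if (mana.waitFor() != 0) {
            throw new IOException("MANA failed for excursion " + excursionId);
        }

        // The Reduce task later gathers all replications of one excursion and computes
        // the means and standard deviations.
        String result = readResultFile(modelFile);
        context.write(new Text(excursionId), new Text(result));
    }

    private String readResultFile(String modelFile) throws IOException {
        // Placeholder: a real implementation parses MANA's per-replication output here.
        return "";
    }
}
```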


4.2.1 Problems & Solutions


Table 5 lists the problems that arose in the execution of MRMANA and their respective solutions. These problems are commonly encountered in MapReduce applications, and their solutions are also used in the subsequent MapReduce models.
Table 5: Solutions to Problems in MRMANA

Problem: Empty result file (due to a race condition)
Solution: Each task attempt produces a result file that is named uniquely from other attempts. When a task attempt completes successfully, it renames its result file to the intended filename.

Problem: Missing result file (due to offline DataNodes)
Solution: Increase the replication factor for the result files and/or enable rack-awareness for the clusters to improve fault tolerance.

For a more detailed description of the above problems and their solutions, please refer to Appendix B.
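The first of these solutions (write to a uniquely named file, then rename on success) can be sketched with the standard HDFS API as follows. This is a minimal illustration under assumptions of my own (output directory layout, attempt-id string and a replication factor of 3), not the project's actual MRMANA code. The point of the pattern is that only the attempt that completes successfully ever claims the final filename, which removes the race between concurrent or re-executed attempts.

```java
// Sketch of the "write to a uniquely named file, rename on success" fix.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResultWriter {
    public static void writeResult(Configuration conf, String outputDir,
                                   String excursionId, String attemptId,
                                   byte[] resultBytes) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Each attempt writes to its own temporary file, so attempts never race
        // on the final file name.
        Path tmp = new Path(outputDir, excursionId + "." + attemptId + ".tmp");
        Path dst = new Path(outputDir, excursionId + ".result");

        FSDataOutputStream out = fs.create(tmp, true);
        try {
            out.write(resultBytes);
        } finally {
            out.close();
        }
        // Raise the replication factor so offline DataNodes are less likely to
        // make the result file unavailable.
        fs.setReplication(tmp, (short) 3);

        // Only a successfully completed attempt claims the intended file name.
        if (!fs.rename(tmp, dst)) {
            throw new IOException("Could not rename " + tmp + " to " + dst);
        }
    }
}
```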

4.2.2 Demonstrating Apache Hadoop-Compliant ART


This experiment was performed to benchmark the time taken by MRMANA and MOMANA in executing two types of scenarios. The ART Framework was utilized to submit and run MapReduce jobs on Hadoop Cluster @NTU. The experiment was performed on the production cluster and thus might be affected by the performance of nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and these hazards also test the robustness of the cluster. Both types of scenarios were benchmarked with 30 replications on the slowest node, which is representative of most of the nodes in the cluster: the simple scenario executed in 0.05 second, while the complex scenario took 1 minute to execute. Other common settings are described below.

- Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II) [32]
- Generations: 100
- Population Size: 100
- Replications for each Excursion: 30

4.2.2.1 Results & Analysis


All the results shown here are the averages of 2 replications. Table 6 shows the times taken by MRMANA and MOMANA in executing both types of scenarios.
Table 6: Times Taken By MRMANA & MOMANA in Executing Simple & Complex Scenarios

                                            Execution Times (minutes)
                                            Simple Scenario    Complex Scenario
MRMANA                                      23.11              321.06
MOMANA                                      5.76               395.90
ART Framework (Using A Single Computer)     15.83              10366.67

The results show that MRMANA executed faster than MOMANA for the complex scenario, while MOMANA executed faster for the simple scenario. Compared to running the ART Framework on a single computer, MRMANA took longer for the simple scenario. This was mainly due to the overhead of executing only one replication of an excursion in each Map task when the execution time of a single replication is negligible.


Chapter 5: Implementing Apache Hadoop-Compliant CASE


This chapter describes Apache Hadoop-compliant CASE and presents the four MapReduce models that were implemented to enable CASE to execute on an Apache Hadoop cluster. These four MapReduce models are the Replication Model, the Standard Model, Island MapReduce 1 and Island MapReduce 2. Experiments that use these MapReduce models and their results are also presented in this chapter.

5.1 Apache Hadoop-Compliant CASE


Apache Hadoop-compliant CASE has been modified in such a way that each component in the original CASE can be run within a Map/Reduce task. These modifications differ greatly from those made to the ART Framework to make it Apache Hadoop-compliant. Figure 27 illustrates the differences between the Apache Hadoop-compliant ART Framework and CASE. In Apache Hadoop-compliant CASE, each MapReduce job executes the EA and passes the generated result-sets to the next MapReduce job.

Figure 27: Differences between Apache Hadoop-Compliant ART Framework & CASE


Apache Hadoop-compliant CASE has 5 entry points. These entry points are used by MapReduce models to execute the components in CASE. Table 7 describes each of the 5 entry points.
Table 7: 5 Entry Points in Apache Hadoop-Compliant CASE

Entry Points      Descriptions
execute           Executes the simulation model.
evolve            Executes EA.
replicate         Executes CASE.
evolve_migrate    Executes EA and generates migrating individuals.
exec_migrate      Executes CASE and generates migrating individuals.

Each entry point uses methods that are almost the same as the original methods in CASE; these methods are modified to run effectively and efficiently within a Map/Reduce task. They are written in a single file (named hadoop.rb) to form a module. Figure 28 shows the entry points and the methods that they use. Each entry point writes its results to the Standard Output Stream.

Figure 28: 5 Entry Points in Apache Hadoop-Compliant CASE
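For illustration only, a Map/Reduce task could invoke one of these entry points as an external Ruby process and capture the result-set it writes to the Standard Output Stream, along the lines of the sketch below. The command line and argument layout are assumptions rather than the project's exact interface.

```java
// Illustrative only: how a Map/Reduce task might call a hadoop.rb entry point
// (e.g. "evolve") and capture the results it prints to standard output.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class CaseEntryPointRunner {
    public static String run(String entryPoint, String... args)
            throws IOException, InterruptedException {
        String[] cmd = new String[args.length + 3];
        cmd[0] = "ruby";
        cmd[1] = "hadoop.rb";
        cmd[2] = entryPoint; // execute, evolve, replicate, evolve_migrate or exec_migrate
        System.arraycopy(args, 0, cmd, 3, args.length);

        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        StringBuilder output = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            output.append(line).append('\n'); // collect the result-set printed by the entry point
        }
        if (p.waitFor() != 0) {
            throw new IOException(entryPoint + " exited with a non-zero status");
        }
        return output.toString();
    }
}
```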


5.2 Replication Model


In Replication Model, each Map task executes an instance of CASE and produces a result-set, while a single Reduce task gathers all the obtained result-sets and combines them into a single result file. Each instance of CASE is run independently. Figure 29 illustrates the application workflow of the Replication Model. The number of Map tasks spawned is specified by the user. The Replication Model executes multiple instances of CASE on multiple computers simultaneously. All instances normally have the same setting and execute the same EA within the same parameter space. It can also be configured for each instance to have different settings, execute different EAs and/or execute within different regions of the parameter space. As each Map task may execute for a long period of time, the Replication Model is thus not the recommended way to make use of Apache Hadoop.

Figure 29: Application Workflow of Replication Model
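A driver for the Replication Model might be configured along the following lines, assuming the input file holds one line per CASE instance (an input format such as NLineInputFormat would give each line its own Map task). The nested classes are simplified stand-ins for the real Map/Reduce implementations; this is a sketch under those assumptions, not the project's code.

```java
// Sketch of a Replication Model driver: many independent CASE instances in the Map tasks,
// one Reduce task that merges every result-set into a single output file.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReplicationModelDriver {

    /** Stand-in Map task: the real one launches a complete CASE run and emits its result-set. */
    public static class CaseInstanceMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text configLine, Context context)
                throws IOException, InterruptedException {
            // A real implementation would start CASE here with the settings on this line.
            context.write(new Text("result"), configLine);
        }
    }

    /** Single Reduce task that concatenates all result-sets into one output file. */
    public static class CombineReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> resultSets, Context context)
                throws IOException, InterruptedException {
            for (Text resultSet : resultSets) {
                context.write(key, resultSet);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "CASE Replication Model");
        job.setJarByClass(ReplicationModelDriver.class);
        job.setMapperClass(CaseInstanceMapper.class);
        job.setReducerClass(CombineReducer.class);
        job.setNumReduceTasks(1);  // exactly one Reduce task combines all result-sets
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // one line per CASE instance
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```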


5.2.1 Testing Cluster Robustness


In this experiment, the Replication Model was utilized to test the robustness and fault tolerance of the production cluster. The experiment was performed on the production cluster and thus might be affected by the performance of nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and these hazards also test the robustness of the cluster. A simple scenario was used in this experiment; it took 5 seconds to execute for 30 replications on the slowest node, which is representative of most of the nodes. Other common settings are described below.

- Map Tasks: 10
- Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)
- Generations: 10
- Population Size: 100
- Replications for each Excursion: 30

5.2.1.1 Results & Analysis


All the results shown here are the averages of 10 replications. Table 8 shows the execution times for the Replication Model when a given set of nodes was switched off at the start, in the midst or at the end of its execution. All nodes within this set were being utilized by the Replication Model at that point in time.
Table 8: Execution Times Demonstrating Robustness

Number of Offline Nodes    When Nodes Were Switched Off    Execution Times (minutes)
0                          -                               90.53
5                          Start of Execution              110.83
5                          Midst of Execution              135.05
5                          End of Execution                210.33
10                         Start of Execution              100.67
10                         Midst of Execution              130.08
10                         End of Execution                190.17


During this experiment, all MapReduce jobs completed successfully. This experiment has shown that the production cluster is robust and fault tolerant.

5.3 Standard Model


To adapt to the MapReduce programming model, the Standard Model splits CASE into two portions, each of which can be executed independently. The first portion, which executes in a Map task, is responsible for the first two components, the Excursion Generator and the Simulation Engine, while the second portion, which runs in a Reduce task, takes care of the last component, the Evolutionary Algorithm. The Standard Model consists of G MapReduce jobs, which are executed sequentially. In each MapReduce job, each spawned Map task executes MANA R times for each of its excursions, while a single Reduce task gathers all the results and executes the EA with them as inputs. The output of the MapReduce job is then supplied as the input to the next MapReduce job. Figure 30 illustrates the application workflow of the Standard Model. The number of generations (G), the number of excursions to be executed by each Map task (E) and the number of replications per excursion (R) can be specified by the user.

Figure 30: Application Workflow of Standard Model
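The chaining of the G MapReduce jobs can be sketched as a simple driver loop in which the output directory of one generation becomes the input directory of the next. The identity Mapper/Reducer below are stand-ins for the classes that run MANA and the EA, and all paths and settings are illustrative assumptions, not the project's code.

```java
// Minimal sketch of a Standard Model driver loop: G jobs back to back, each feeding the next.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StandardModelDriver {
    public static void main(String[] args) throws Exception {
        int generations = 100;                        // G
        Path input = new Path("/case/generation-0");  // initial population of excursions

        for (int g = 1; g <= generations; g++) {
            Configuration conf = new Configuration();
            conf.setInt("case.generation", g);

            Job job = new Job(conf, "CASE Standard Model, generation " + g);
            job.setJarByClass(StandardModelDriver.class);
            job.setMapperClass(Mapper.class);   // stand-in: the real Mapper runs MANA R times per excursion
            job.setReducerClass(Reducer.class); // stand-in: the real single Reducer runs the EA
            job.setNumReduceTasks(1);           // one Reduce task gathers all results and evolves them
            FileInputFormat.addInputPath(job, input);
            Path output = new Path("/case/generation-" + g);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("Generation " + g + " failed");
            }
            input = output;                     // chain: this generation's output feeds the next job
        }
    }
}
```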


Each Map task must be able to finish its execution within two hours; if a Map task cannot, the MapReduce job fails, which in turn terminates the Standard Model execution. The Standard Model may therefore require the user to split the workload among more Map tasks by using a smaller E.

5.3.1 Demonstrating Scalability


This set of experiments was divided into two parts and utilized the Standard Model. The first part observed the effect of varying the number of excursions per Map task on the execution time, while the second part investigated the degree of scalability of the Standard Model. The experiment was performed on the production cluster and thus might be affected by the performance of nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and these hazards also test the robustness of the cluster.

The first part of the experiment utilized a simple scenario, which executed in 2 and 5 seconds for 30 replications on the fastest and slowest nodes in the cluster respectively. Other common settings are described below.

- Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)
- Generations: 50
- Population Size: 100
- Replications for each Excursion: 30

The second part of the experiment used two scenarios. Both scenarios were benchmarked with 30 replications on the slowest node, which is representative of most of the nodes in the cluster: the simple scenario executed in 5 seconds, while the complex scenario took 60 seconds. Other common settings are described below.

- Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)
- Generations: 100
- Population Size: 100
- Replications for each Excursion: 30

The experiment also compared the quality of the solutions, in terms of hyper-volume, obtained by the Standard Model with those produced by CASE and by the MapReduce application of [33]. That MapReduce application was proposed by the Illinois Genetic Algorithms Laboratory in 2009 to scale genetic algorithms; the version used here was implemented by modifying the Standard Model. It takes the parameter values of each excursion as the key for that excursion and has more than one Reduce task. A customized partitioning function splits the excursions equally among the Reduce tasks, as sketched below.
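A partitioning function of this kind might look like the following sketch; the key layout (a numeric excursion index prefix) is an illustrative assumption, not the cited application's actual key format.

```java
// Hedged sketch of a custom partitioning function that spreads excursions evenly
// across the Reduce tasks.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ExcursionPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Assume the key starts with a numeric excursion index, e.g. "42|<parameter values>".
        int excursionIndex = Integer.parseInt(key.toString().split("\\|", 2)[0]);
        // Round-robin over the Reduce tasks so each receives roughly the same number of excursions.
        return excursionIndex % numReduceTasks;
    }
}
```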

5.3.1.1 Results & Analysis


All the results shown here are the averages of 10 replications. Table 9 shows the execution times for the Standard Model utilizing a given number of nodes in the production cluster. For each given set of nodes, the Standard Model was configured to execute either 5 or K excursions in each of its Map tasks, where K is the population size divided by the number of nodes involved, rounded up:

K = ⌈Population Size / Number of Nodes Involved⌉

For example, with 3 nodes and a population size of 100, K = ⌈100/3⌉ = 34.

Table 9: Execution Times Using 5 & K Excursions in a Map Task

Number of Nodes Involved    Number of Excursions per Map Task    Execution Times (minutes)
1                           5                                    522.95
1                           100 (K)                              424.72
2                           5                                    273.80
2                           50 (K)                               235.82
3                           5                                    201.47
3                           34 (K)                               166.20
4                           5                                    162.00
4                           25 (K)                               130.15
5                           5                                    122.97
5                           20 (K)                               104.55
10                          5                                    80.92
10                          10 (K)                               70.02
15                          5                                    69.77
15                          7 (K)                                62.75
20                          5                                    48.18
25                          5                                    48.05
25                          4 (K)                                51.07

Figure 31 presents the above table visually. Using K excursions per Map task gave a faster execution time for the Standard Model with 15 or fewer nodes in the cluster. It was unable to achieve a faster execution time with 20 nodes and beyond, as the execution time for K excursions in a Map task was negligible compared with the overheads involved in using the Standard Model.
[Chart: execution time (minutes) versus number of nodes involved, for 5 and K excursions per Map task]

Figure 31: Execution Times Using 5 & K Excursions in a Map Task

Table 10 shows the execution times for the Standard Model when used on the simple and complex scenarios. To investigate the degree of scalability of the Standard Model, the number of nodes was increased incrementally. The number of excursions to execute in each Map task was adjusted according to the formula for K given above.


Table 10: Execution Times Demonstrating Scalability

                                                        Execution Times (minutes)
Number of Nodes Involved    Excursions per Map Task     Simple Scenario    Complex Scenario
1                           100                         854.13             10517.86
2                           50                          476.08             5875.90
4                           25                          266.52             3163.48
5                           20                          219.80             2353.27
10                          10                          136.73             1309.33
20                          5                           97.42              632.13
25                          4                           92.75              483.73

Figure 32 visually shows the execution times for the Standard Model using the simple scenario. As the number of nodes involved increases, the execution time for the Standard Model decreases accordingly. For this simple scenario, the observation holds while the number of nodes involved is less than 20.
[Chart: execution time (minutes) versus number of nodes involved, simple scenario]

Figure 32: Execution Times Demonstrating Scalability Using Simple Scenario

Figure 33 shows the execution times for the Standard Model when using the complex scenario. The same observation applies to this complex scenario. It can therefore be concluded that an optimal number of nodes exists and depends on the complexity of the scenario being used. The following issues were identified in this experiment.


- A MapReduce job may take a longer time to complete. The delay is caused by nodes that are either equipped with a slower processor or experiencing a higher computational load due to external factors, such as being utilized by other users.
- Network traffic may delay the start of the Map tasks, so each Map task begins at a different time. This may in turn aggravate the previous issue.
[Chart: execution time (minutes) versus number of nodes involved, complex scenario]

Figure 33: Execution Times Demonstrating Scalability Using Complex Scenario

Figure 34 shows the quality of the solutions, in terms of hyper-volume, produced by the Standard Model, CASE and the MapReduce application. The solutions produced by the Standard Model had a hyper-volume similar to that obtained by CASE. The solutions produced by the MapReduce application were less optimal because each of its Reduce tasks executes the EA with a set of similar excursions. This set of experiments and its findings are documented in a published paper [21]; please refer to Appendix C for this paper.


[Chart: hyper-volume versus generations for the Standard Model, CASE and the MapReduce application]

Figure 34: Comparison of Solutions between Standard Model & CASE

5.3.2 Evaluating Hadoop Cluster @EC2


An experiment was performed to demonstrate the feasibility of using Amazon EC2 within this project. In this experiment, the Standard Model was executed on Hadoop Cluster @EC2 and on the development cluster in Hadoop Cluster @NTU. The common settings are described below.

- Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)
- Generations: 50
- Population Size: 40
- Excursions per Map Task: 10
- Replications for each Excursion: 30

Benchmarks were performed on a small instance and a medium instance in Hadoop Cluster @EC2, and on a virtual machine in the development cluster. Table 11 shows the benchmarks that were obtained. A medium instance has 5 EC2 Compute Units (ECUs), while a small instance has only 1 ECU.


Table 11: Benchmarks for Virtual Machine, Small & Medium Instances

Benchmarks                                                               Small Instance    Medium Instance    Virtual Machine
Execution time for 30 replications of the simple scenario
used in this experiment (seconds)                                        5.0               2.1                5.0
Execution time for 30 replications of the complex scenario (seconds)     138.0             58.4               69.5
Whetstone iSSE3 (GFLOPS)                                                 3.65              14.87              21.10
CPU Mark                                                                 363.4             1490.6             1272.7

5.3.2.1 Results & Analysis


The execution times shown in Table 12 are the averages of 10 replications. The calculated costs for the respective execution times take into account only the hourly charge for each instance involved. The cost of data transfer is indicated by a plus symbol (+); it could not be accounted for in this experiment because of the small scale at which the experiment was performed. However, this cost may escalate if a more complex scenario is used and/or an experiment is performed at a larger scale.
Table 12: Execution Times on Hadoop Cluster @EC2 & Hadoop Cluster @NTU

                                                                                      Execution Times (minutes)
Clusters               Descriptions                                       Costs      Mean      Standard Deviation
Hadoop Cluster @EC2    4 Small Instances, Public IP for Master Node       $0.96+     89.32     1.80
Hadoop Cluster @EC2    4 Medium Instances, Public IP for Master Node      $1.16+     48.70     1.07
Hadoop Cluster @EC2    4 Small Instances, Private IP for Master Node      $0.96      88.53     1.23
Hadoop Cluster @EC2    4 Medium Instances, Private IP for Master Node     $1.16      48.40     0.95
Hadoop Cluster @NTU    4 Virtual Machines                                 -          90.83     2.43


Based on the obtained results, it is feasible to make use of Amazon EC2 within this project. The cost can be further reduced by using medium instances to form the cluster and a private IP address for the master node.

5.4 Island MapReduce


Island MapReduce is designed to run CASE in a scalable fashion by employing two techniques: the MapReduce programming model and the Island Model. It exists in two versions, Island MapReduce 1 and Island MapReduce 2, which use the same migration topology and migration policy. The islands are arranged in a ring topology and each island is tagged with a different index number. Random excursions in a particular island are sent to the next island in the ring and replace random excursions in that island, as sketched below.
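The following is a minimal sketch of this ring migration policy; excursions are represented abstractly as strings and the migration size is left to the caller. It is an illustration of the policy described above, not the project's implementation.

```java
// Ring migration: island i sends a few randomly chosen excursions to island (i + 1) mod N,
// where they replace randomly chosen excursions.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RingMigration {
    /** Index of the island that island 'islandIndex' migrates to in a ring of 'islands' islands. */
    public static int nextIsland(int islandIndex, int islands) {
        return (islandIndex + 1) % islands;
    }

    /** Remove 'migrationSize' random excursions from the source island's population. */
    public static List<String> selectEmigrants(List<String> population, int migrationSize, Random rng) {
        List<String> emigrants = new ArrayList<String>();
        for (int i = 0; i < migrationSize && !population.isEmpty(); i++) {
            emigrants.add(population.remove(rng.nextInt(population.size())));
        }
        return emigrants;
    }

    /** Replace random excursions in the destination island with the incoming emigrants. */
    public static void acceptImmigrants(List<String> population, List<String> immigrants, Random rng) {
        for (String immigrant : immigrants) {
            population.set(rng.nextInt(population.size()), immigrant);
        }
    }
}
```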

5.4.1 Island MapReduce 1


Island MapReduce 1 consists of one or more MapReduce jobs. In a MapReduce job, each pair of Map and Reduce tasks corresponds to an island. Each Map task executes an instance of CASE for T generations and produces two result-sets, which contain excursions that are remaining in the particular island and excursions that are migrating from the particular island to another island respectively. The associated Reduce task gathers all the excursions for the particular island and groups them. The output of the MapReduce job is then supplied as the input to the subsequent MapReduce job. Figure 35 illustrates the application workflow of Island MapReduce 1. The number of islands (N) and the migration interval (T) can be specified by the user.


Figure 35: Application Workflow of Island MapReduce 1

5.4.2 Island MapReduce 2


Like the Standard Model, Island MapReduce 2 also splits CASE into two portions in order to adapt to the MapReduce programming model, and each portion can be executed independently. The first portion, which executes in a Map task, is responsible for the first two components, the Excursion Generator and the Simulation Engine, while the second portion, which runs in a Reduce task, takes care of the last component, the Evolutionary Algorithm. Island MapReduce 2 consists of G MapReduce jobs, which are executed sequentially. In each MapReduce job, each Map task executes MANA R times for each of its excursions, while each Reduce task corresponds to an island. A Reduce task gathers all the results for the excursions in its island, executes the EA with them as inputs and generates at least one result-set, which contains the excursions remaining in that island. Whenever the migration interval T is reached, it produces an additional result-set, which contains the excursions migrating from that island. The output of the MapReduce job is then supplied as the input to the subsequent MapReduce job. Figure 36 illustrates the application workflow of Island MapReduce 2. The number of islands (N), the number of generations (G), the number of excursions to be executed by each Map task (E), the number of replications per excursion (R) and the migration interval (T) can be specified by the user. Island MapReduce 2 allows more scalability and parallelism than Island MapReduce 1: by using a smaller E, the workload can be split among more Map tasks. However, it is more efficient and effective to employ Island MapReduce 1 whenever simple scenarios are used.

Figure 36: Application Workflow of Island MapReduce 2
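As an illustration of the per-island Reduce step, the sketch below emits a "remaining" result-set every generation and an additional "migrating" result-set whenever the migration interval T is reached. The output tagging scheme, configuration keys and migration size are assumptions, and the EA invocation itself is elided; this is not the project's actual Reduce implementation.

```java
// Hedged sketch of an Island MapReduce 2 Reduce task: one island per Reduce task,
// splitting its evaluated excursions into remaining and (periodically) migrating sets.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IslandReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text islandKey, Iterable<Text> excursionResults, Context context)
            throws IOException, InterruptedException {
        int generation = context.getConfiguration().getInt("case.generation", 0);
        int migrationInterval = context.getConfiguration().getInt("case.migration.interval", 5);

        // Collect this island's evaluated excursions; the real implementation would run the EA
        // here (e.g. via the evolve_migrate entry point) before emitting the evolved excursions.
        List<String> excursions = new ArrayList<String>();
        for (Text result : excursionResults) {
            excursions.add(result.toString());
        }

        boolean migrate = generation % migrationInterval == 0;
        int emigrants = migrate ? Math.min(2, excursions.size()) : 0; // assumed migration size of 2

        for (int i = 0; i < excursions.size(); i++) {
            String tag = (i < emigrants) ? "MIGRATE" : "REMAIN";
            context.write(new Text(islandKey + ":" + tag), new Text(excursions.get(i)));
        }
    }
}
```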

5.4.3 Evaluating Island MapReduce


In this experiment, the timings and the quality of the solutions in terms of hyper-volume were compared between Island MapReduce 1 and Island MapReduce 2. It also investigated the possible effect of two parameters on the timings and the quality of the solutions. These two parameters, which are commonly tuned in the Island Model, are the population size and the migration interval.


This experiment was performed on the production cluster and thus might be affected by the performance of nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and these hazards also test the robustness of the cluster. A simple scenario was used in this experiment; it took 5 seconds to execute for 30 replications on the slowest node, which is representative of most of the nodes. Other common settings are described below.

- Islands: 20
- Generations: 100
- Replications for each Excursion: 30

Table 13 shows the differences between the 4 configurations that were used in this experiment. The EA used in this experiment was either DE (Differential Evolution) [34] or NSGA-II (Non-Dominated Sorting Genetic Algorithm II).
Table 13: Differences between Configuration 1, 2, 3 & 4 for Island MapReduce

Characteristics                            Configuration 1    Configuration 2    Configuration 3    Configuration 4
Evolutionary Algorithm                     DE                 NSGA-II            NSGA-II            NSGA-II
Excursions per Generation on an Island     50                 50                 20                 20
Migration Interval                         5                  10                 10                 5
Migration Size                             2                  2                  2                  2

5.4.3.1 Results & Analysis


All the results shown here are the averages of 10 replications. Table 14 shows the execution times for Island MapReduce with Configurations 1, 2, 3 and 4. Island MapReduce 1 and Island MapReduce 2 were both executed using Configuration 1. As expected, Island MapReduce 2 achieved a faster execution time than Island MapReduce 1.


Table 14: Execution Times for Island MapReduce with Configuration 1, 2, 3 & 4

Configurations     Island MapReduce    Number of Excursions per Map Task    Execution Times (minutes)
Configuration 1    1                   250                                  433.38
Configuration 1    2                   10                                   390.23
Configuration 2    1                   250                                  435.02
Configuration 3    1                   200                                  204.25
Configuration 4    1                   100                                  198.02

Figure 37 presents the quality of the solutions in terms of hyper-volume that were obtained by both Island MapReduce 1 and Island MapReduce 2 when executed with Configuration 1. Both Island MapReduce 1 and Island MapReduce 2 were able to obtain solutions that generate similar hyper-volumes.
[Chart: hyper-volume versus generations for Island MapReduce 1 and Island MapReduce 2]

Figure 37: Comparison of Solutions between Island MapReduce 1 & Island MapReduce 2

Figure 38 shows the quality of the solutions in terms of hyper-volume that were obtained by Island MapReduce 1 when applied with Configuration 2, 3 and 4. Using a smaller number of excursions in Configuration 3 produced solutions that were less optimal. However, this can be improved by having a smaller migration interval, as in Configuration 4.


[Chart: hyper-volume versus generations for Configurations 2, 3 and 4]

Figure 38: Comparison of Solutions between Configuration 2, 3 & 4 for Island MapReduce 1


Chapter 6: Conclusion
6.1 Summary
The popularity of Cloud Computing is on the rise, and it has delivered benefits to its adopters, including high availability, scalability, reduced cost and fault tolerance. This project explores the Cloud Computing paradigm through Apache Hadoop. Because this project involves the use of military applications, such as MANA, a private Cloud called Hadoop Cluster @NTU was implemented. Despite operating in a non-dedicated and heterogeneous computing environment, this Cloud cluster has demonstrated its robustness and high fault tolerance. Besides the private Cloud, a public Cloud, called Hadoop Cluster @EC2, was also deployed and evaluated. This public Cloud is used mainly for comparison purposes and to exhibit the feasibility of using a public Cloud as the infrastructure for this project.

Cloud Computing is not considered complete without Web Services. In this project, the Hadoop Service was implemented; it allows the submission and running of MapReduce jobs on an Apache Hadoop cluster via the Internet. This system, which consists of Hadoop Cluster @NTU and the Hadoop Service, was used in the recent IDFW (International Data Farming Workshop) 21 held in Lisbon, Portugal on 19th-24th September 2010. Three Web Service clients were implemented. The CASE GUI, a Web-based Web Service client, was implemented specifically to run Apache Hadoop-compliant CASE on an Apache Hadoop cluster.

To investigate the integration of Cloud Computing and Data Farming, the Apache Hadoop-compliant ART Framework and CASE were implemented, and six different MapReduce models were implemented and evaluated. They enable the Apache Hadoop-compliant ART Framework and CASE to execute on an Apache Hadoop cluster. Although two MapReduce models, such as MRMANA/MOMANA or Island MapReduce 1/Island MapReduce 2, may serve the same purpose, they differ greatly in implementation complexity and scalability: increasing the scalability of a MapReduce model escalates its implementation complexity.


6.2 Limitations
Because MANA is a proprietary agent-based simulation model that runs only on the Windows operating system, all nodes within Hadoop Cluster @NTU must run Windows. This limitation has posed a few issues.

The first issue is that Apache Hadoop is unable to monitor the memory usage of each node in the clusters, and thus it is unable to schedule tasks based on their memory requirements.

Although Apache Hadoop is implemented in Java, Windows and UNIX-like operating systems differ in how the child processes spawned by a Map/Reduce task on a node are terminated. When a Map/Reduce task fails or is killed, all its child processes are automatically terminated on UNIX-like operating systems. On Windows, these child processes are left running in the background instead. Careful programming and a mechanism built into each MapReduce model can only minimize the chance of child processes being left running in the background; the mechanism may still fail because Windows reuses process identifiers when creating new processes (one possible cleanup step is sketched at the end of this section).

As Apache Hadoop is not supported as a production platform on Windows, any patches available in the Hadoop community require tremendous testing effort before they can be deployed on nodes within Hadoop Cluster @NTU.

The last issue is that Windows is not widely supported by Cloud providers. Currently, Amazon is the only major Cloud provider that allows its customers to run their Cloud applications on the Windows operating system.

Within Hadoop Cluster @NTU, there are 30 non-dedicated physical computers. A Map/Reduce task fails when the computer executing it is shut down by another user; if this happens too frequently, the MapReduce job takes a longer time to complete. The presence of non-dedicated physical computers also poses a serious problem for tools that are required by certain MapReduce jobs and have to be installed on each node: these tools can be moved, overwritten or even deleted by another user, causing the affected MapReduce jobs to fail.
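One way such a cleanup mechanism could be written, assuming the process id of the spawned MANA process is known, is to invoke the Windows taskkill utility with the /T (process tree) and /F (force) switches. This is a sketch, not the project's actual mechanism.

```java
// Hedged sketch: forcibly terminating a spawned process and its children on Windows.
import java.io.IOException;

public class WindowsProcessCleanup {
    public static void killProcessTree(int pid) throws IOException, InterruptedException {
        Process killer = new ProcessBuilder("taskkill", "/F", "/T", "/PID", Integer.toString(pid))
                .redirectErrorStream(true)
                .start();
        killer.waitFor();
        // As noted above, Windows reuses process ids, so the PID should be checked against the
        // expected executable before killing to avoid terminating an unrelated process.
    }
}
```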

6.3 Future Enhancements


The possible future enhancements are:

- The current method used to identify bottlenecks within Hadoop Cluster @NTU requires tremendous effort. This can be facilitated greatly by installing Chukwa [35] to monitor the clusters. Chukwa automatically collects all logs generated by Apache Hadoop and uses these logs for debugging, performance measurement and operational monitoring. It will also aid considerably in identifying problematic nodes within the clusters.
- Cascading [36] is an application programming interface (API) that greatly simplifies the creation of MapReduce applications for Apache Hadoop. However, it may not be effective for the applications in this project, as they are normally composed of multiple iterations of similar operations. Twister [37], an API supporting iterative MapReduce computations, may be a more worthwhile alternative to explore.
- Because this project involves military applications, such as MANA, security is always a topmost concern. Yahoo has recently launched a version of Hadoop that integrates with Kerberos, a mature open-source authentication standard [38]. This version of Hadoop is worth exploring since it supports security.


References
Markus Klem, Merill Lynch Estimates Cloud Computing To Be $100 Billion Market, SYSCON Media [Online], (21st August 2008). Available: http://www.sys-con.com/node/604936 [2] Christy Pettey, Gartner Identifies the Top 10 Strategic Technologies for 2011, Gartner Newsroom [Online], (19th October 2010). Available: http://www.gartner.com/it/page.jsp?id=1454221 [3] Alfred G. Brandstein, and Gary E. Horne, Data Farming: A Meta-Technique for Research in the 21st Century, in Maneuver Warfare Science 1998, Quantico, VA, USA, 1998, Page 93 99. [4] Gary E. Horne, and Ted E. Meyer, Data Farming: Discovering Surprise, in Proceedings of the 36th Conference on Winter Simulation (WSC 2004), Washington, DC, USA, 2004, Page 807 813. [5] Philip S. Barry, Jianping Zhang, and Mary McDonald, Architecting a Knowledge Discovery Engine for Military Commanders Utilizing Massive Runs of Simulations, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), Washington, DC, USA, 2003, Page 699 704. [6] Mark Baker, and Rajkumar Buyya, Cluster Computing At a Glance, in High Performance Cluster Computing: Architectures and Systems, Volume 1, 1st Edition. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1999, Chapter 1, Page 3 46. [7] Bart Jacob, Michael Brown, Kentaro Fukui, and Nihar Trivedi, What Grid Computing is, in Introduction to Grid Computing, 1st Edition. Austin, TX, USA: IBMs International Technical Support Organization, 2005, Chapter 1, Page 3 6. [8] L. Youseff, M. Butrico, and D. Da Silva, Towards a Unified Ontology of Cloud Computing, in Proceedings of the 2008 Grid Computing Environments Workshop (GCE 2008), Austin, TX, USA, 2008, Page 1 10. [9] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner, A Break in the Clouds: Towards a Cloud Definition, in ACM SIGCOMM Computer Communication Review, Volume 39, Issue 1. New York, NY, USA: ACM, 2008, Page 50 55. [10] Mladen A. Vouk, Cloud Computing Issues, Research and Implementation, in Proceedings of the 30th International Conference on Information Technology Interfaces (ITI 2008), Cavtat/Dubrovnik, Croatia, 2008, Page 31 40. [11] Shlomo Swidler, The OGF Open Cloud Computing Interface, presented at IGT 2009 World Summit of Cloud Computing, Shefayim, Israel, 2009. [12] Cari Tuna, Cloudera Raises Hefty Funding Round, The Wall Street Journal [Online], (26th October 2010). Available: http://blogs.wsj.com/digits/2010/10/26/cloudera-raises-heftyfunding-round/?mod=google_news_blog [1]


[13] Tom White, Meet Hadoop, in Hadoop: The Definitive Guide, 1st Edition. Sebastopol, CA, USA: OReilly Media, 2009, Chapter 1, Page 1 13. [14] Tom White, The Hadoop Distributed Filesystem, in Hadoop: The Definitive Guide, 1st Edition. Sebastopol, CA, USA: OReilly Media, 2009, Chapter 3, Page 41 74. [15] Konstantin Shvachko, Automatic Namespace Recovery from the Secondary Image, The Apache Software Foundation [Online], (8th July 2009). Available: https://issues.apache.org/jira/browse/HADOOP-2585 [16] Jeffrey Dean, and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in Communications of the ACM, Volume 51, Issue 1. New York, NY, USA: ACM, 2008, Page 107 113. [17] Douglas Thain, Todd Tannenbaum, and Miron Livny, Condor and the Grid, in Grid Computing: Making the Global Infrastructure a Reality, 1st Edition. Chichester, UK: John Wiley & Sons, 2003, Chapter 11, Page 299 335. [18] Douglas Thain, and Christopher Moretti, Abstractions for Cloud Computing with Condor, in Cloud Computing and Software Services: Theory and Techniques, 1st Edition. Boca Raton, FL, USA: CRC Press, 2009, Chapter 7, Page 153 171. [19] Miron Livny, Condor and the Cloud The Challenges and the Roadmap of Condor, presented at Condor & the Cloud with Professor Miron Livny & a FaceBook IT Case Study, Hertzelia, Israel, 2009. [20] Web Services Architecture, W3C Working Group Note, David Booth, Hugo Haas, and Francis McCabe, 11th February 2004. [21] James Decraene, Yong Yong Cheng, Malcolm Low Yoke Hean, Suiping Zhou, Wentong Cai, Choo Chwee Seng, Evolving Agent-Based Simulations in the Clouds, in Proceedings of the 3rd International Workshop on Advanced Computational Intelligence (IWACI 2010), Suzhou, China, 2010, Page 244 249. [22] Gregory C. Mclntosh, David P. Galligan, Mark A. Anderson, and Michael K. Lauren, Recent Developments in the MANA Agent-Based Model, in Scythe Issue 1, Scheveningen, Netherlands, 2006, Page 38 39. [23] Eckart Zitzler, Marco Laumanns, and Stefan Bleuler, A Tutorial on Evolutionary Multiobjective Optimization, in Proceedings of the Workshop on Multiple Objective Metaheuristics (MOMH 2004), Heidelberg, Germany, 2004, page 3 38. [24] Darrell Whitley, Soraya Rana, and Robert B. Heckendorn, The Island Model Genetic Algorithm: On Separability, Population Size and Convergence, in Journal of Computing and Information Technology, Volume 7, Issue 1. Zagreb, Croatia: University Computing Centre, 1999, Page 33 47. [25] Zbigniew Skolicki, and Kenneth De Jong, The Influence of Migration Sizes and Intervals on Island Models, in Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO 2005), Washington, DC, USA, 2005, Page 1295 1302.


[26] Heinz Muhlenbein, Evolution in Time and Space The Parallel Genetic Algorithm, in Proceedings of the 1st Workshop on Foundations of Genetic Algorithms (FOGA 1990), Indiana, IN, USA, 1990, Page 316 337. [27] Darrell Whitley, and Timothy Starkweather, GENITOR II: A Distributed Genetic Algorithm, in Journal of Experimental & Theoretical Artificial Intelligence, Volume 2, Issue 3. 1990, Page 189 214. [28] Nullsoft Scriptable Install System [Online], (2009). Available: http://nsis.sourceforge.net/Main_Page [29] GUP for Win32 [Online], (2010). Available: http://gup-win32.tuxfamily.org/ [30] Shufen Zhang, Shuai Zhang, Xuebin Chen, and Shangzhuo Wu, Analysis and Research of Cloud Computing System Interface, in Proceedings of the 2nd International Conference on Future Networks (ICFN 2010), Sanya, China, 2010, Page 88 92. [31] C.L. Chua, W.C. Sim, C.S. Choo, and Victor Tay, Automated Red Teaming: An ObjectiveBased Data Farming Approach For Red Teaming, in Proceedings of the 40th Conference on Winter Simulation (WSC 2008), Austin, TX, USA, 2008, Page 1456 1462. [32] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan, A Fast Elitist MultiObjective Genetic Algorithm: NSGA-II, in IEEE Transactions on Evolutionary Computation, Volume 6, Issue 2. 2002, Page 182 197. [33] Abhishek Verma, Xavier Llora, Roy H. Campbell, and David E. Goldberg, Scaling Genetic Algorithm Using MapReduce, in Proceedings of the 9th International Conference on Intelligent Systems Design and Applications (ISDA 2009), Pisa, Italy, 2009, Page 13 18. [34] R. Storn, and K. Price, Differential Evolution A Simple and Efficient Heuristic for Global Optimization over Continuous Space, in Journal of Global Optimization, Volume 11, Issue 4. 1997, Page 341 359. [35] Jerome Boulon, Andy Konwinski, Runping Qi, Ariel Rabkin, Eric Yang, and Mac Yang, Chukwa: A Large-Scale Monitoring System, in Proceedings of the Cloud Computing & its Applications 2008 (CCA 2008), Chicago, IL, USA, 2008. [36] Cascading [Online], (2010). Available: http://www.cascading.org/ [37] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, IL, USA, 2010, Page 810 818. [38] Yahoo Distribution of Hadoop [Online], (2010). Available: http://yahoo.github.com/hadoop-common/


Appendix
Appendix A: List of Nodes in Hadoop Cluster @NTU
Clusters         Host Names                      IP Addresses
Both Clusters    Web Server (pdcc.ntu.edu.sg)    155.69.148.19

Production Cluster - Dedicated Physical Computer: Core 2 Duo CPU 2.66 GHz + 2 GB RAM
pdccrs-01    155.69.145.248

Production Cluster - Non-Dedicated Physical Computers: Pentium D CPU 3.20 GHz + 2 GB RAM
pdc1     155.69.151.85
pdc2     155.69.151.86
pdc3     155.69.151.87
pdc4     155.69.151.88
pdc5     155.69.151.89
pdc6     155.69.151.90
pdc7     155.69.151.91
pdc8     155.69.151.92
pdc9     155.69.151.93
pdc10    155.69.151.94
pdc11    155.69.151.95
pdc12    155.69.151.96
pdc13    155.69.151.97
pdc14    155.69.151.98
pdc15    155.69.151.99
pdc16    155.69.151.100
pdc17    155.69.151.101
pdc18    155.69.151.102
pdc19    155.69.151.103
pdc20    155.69.151.104
pdc21    155.69.151.105
pdc22    155.69.151.106
pdc23    155.69.151.107
pdc24    155.69.151.108

Production Cluster - Non-Dedicated Physical Computers: Core 2 Quad CPU 2.66 GHz + 3 GB RAM
pdc25    155.69.145.211
pdc26    155.69.145.212
pdc27    155.69.145.213
pdc28    155.69.145.214
pdc29    155.69.145.215
pdc33    155.69.145.228

Production and Development Clusters - Dedicated Virtual Machines: Xeon X5560 CPU 2.80 GHz + 3 GB RAM
fypyyc1     155.69.102.245
fypyyc2     155.69.102.246
fypyyc3     155.69.102.247
fypyyc4     155.69.102.248
fypyyc5     155.69.102.249
fypyyc6     155.69.102.250
fypyyc7     155.69.102.251
fypyyc8     155.69.102.252
fypyyc9     155.69.102.253
fypyyc10    155.69.102.254

Appendix B: Past Problems in MRMANA & Solutions


Appendix C: Evolving Agent-Based Simulations in the Clouds


This paper was published in the Proceedings of the 3rd International Workshop on Advanced Computational Intelligence (IWACI 2010).


Evolving Agent-based Simulations in the Clouds


James Decraene, Yong Yong Cheng, Malcolm Yoke Hean Low Suiping Zhou, Wentong Cai and Chwee Seng Choo
Abstract: Evolving agent-based simulations enables one to automate the difficult iterative process of modeling complex adaptive systems to exhibit pre-specified/desired behaviors. Nevertheless this emerging technology, combining research advances in agent-based modeling/simulation and evolutionary computation, requires significant computing resources (i.e., high performance computing facilities) to evaluate simulation models across a large search space. Moreover, such experiments are typically conducted in an infrequent fashion and may occur when the computing facilities are not fully available. The user may thus be confronted with a computing budget limiting the use of these evolvable simulation techniques. We propose the use of the cloud computing paradigm to address these budget and flexibility issues. To assist this research, we utilize a modular evolutionary framework coined CASE (for complex adaptive system evolver) which is capable of evolving agent-based models using nature-inspired search algorithms. In this paper, we present an adaptation of this framework which supports the cloud computing paradigm. An example evolutionary experiment, which examines a simplified military scenario modeled with the agent-based simulation platform MANA, is presented. This experiment refers to Automated Red Teaming: a vulnerability assessment tool employed by defense analysts to study combat operations (which are regarded here as complex adaptive systems). The experimental results suggest promising research potential in exploiting the cloud computing paradigm to support computing intensive evolvable simulation experiments. Finally, we discuss an additional extension to our cloud computing compliant CASE in which we propose to incorporate a distributed evolutionary approach, e.g., the island-based model to further optimize the evolutionary search.

I. INTRODUCTION

XAMINING complex adaptive systems (CAS) remains problematic as the traditional analytical and statistical modeling methods appear to limit the study of CAS [1]. To overcome these issues, Holland proposed the use of evolutionary agent-based simulations to examine the emergent and complicated phenomena characterizing CAS. In evolutionary agent-based simulations, multiple and interacting evolvable agents (e.g., neurones, traders, soldiers, etc.) determine, as a whole, the behavior of the system (e.g., brain, nancial market, warfare, etc.). The evolution of agents is conducted through the use of evolutionary computation techniques (e.g., learning classier systems, genetic programming, evolution strategies, etc.). The evolution of CAS can be
James Decraene, Yong Yong Cheng, Malcolm Yoke Hean Low, Suiping Zhou and Wentong Cai are with the Parallel and Distributed Computing Center at the School of Computer Engineering, Nanyang Technological University, Singapore (email: jdecraene@ntu.edu.sg). Chwee Seng Choo is with DSO National Laboratories, 20 Science Park Drive, Singapore. This R&D work was supported by the Defence Research and Technology Ofce, Ministry of Defence, Singapore under the EVOSIM Project (Evolutionary Computing Based Methodologies for Modeling, Simulation and Analysis).

driven to exhibit pre-specied and desired system behaviors (e.g., to identify critical conditions leading to the emergence of specic system-level phenomena such as a nancial crisis or battleeld outcomes). Although this method appears to be satisfactory for studying CAS, it is limited by the requirement of signicant computational resources. Indeed in evolvable simulation experiments, many simulation models are iteratively generated and evaluated. Due to the stochastic nature of both evolutionary algorithms and agent-based simulations, experiment replications are also required to account for statistical uctuations. As a result, the experimental process is computationally highly demanding. Moreover, such experiments are typically conducted occasionally when the computing facilities may not be fully available. To address these computing budget issues, involving both scalability and exibility constraints, we examine the cloud computing paradigm [2]. This distributed computing paradigm has recently been introduced to specically address such computing budget issues where large dataset and considerable computational requirements are dealt with. To assist this research, we propose to modify a modular evolutionary framework, coined CASE for complex adaptive system evolver to support cloud computing facilities. In the remainder of this paper, we rst provide introductions to both evolutionary agent-based simulations and cloud computing. Following this, we present the CASE framework. The latter is then extended to support the cloud computing paradigm. A series of experiments is described to evaluate our cloud computing compliant framework in terms of scalability. The experiments involve a simplied military simulation which is modeled with the agent-based simulation platform MANA [3]. Finally we discuss an additional extension to CASE which would incorporate a distributed evolutionary approach [4] to further optimize the search process. II. E VOLUTIONARY AGENT- BASED S IMULATIONS Agent-based systems (ABSs) are computational methods which can model the intricate and non-linear dynamics of complex adaptive systems. ABSs are commonly implemented with object-oriented programming environments in which agents are instantiations of object classes. ABSs typically involve a large number of autonomous agents which are executed in a concurrent or pseudo-concurrent manner (i.e., using a time-slicing algorithm). Each agent possesses its own distinct state variables, can be dynamically deleted and is capable of interacting with the other agents. The agents

computational methods may include stochastic processes resulting in a stochastic behavior at the system level. To study ABS, the data-farming method was proposed as a means to identify the landscape of possibilities [5], i.e., the spectrum of possible simulation outcomes. In data farming experiments, specic simulation model parameters are selected and varied (according to pre-specied boundary values). This exploratory analysis of parameters enables one to examine the effects of the parameters over the simulation outcomes. Several techniques [6] have been introduced to reduce the search space where each solution/design point is a distinct simulation model. The search space can be reduced even further when one is interested in a single (or target) system behavior. Evolutionary computation (EC) techniques can here be used to drive the generation/evaluation of simulation models. In this paper, we examine such an objectivebased data farming approach using evolutionary agent-based simulations [7]. In evolutionary ABS, EC techniques are utilized to evolve simulation models to exhibit a desirable output/behavior. This method differs from simulation optimization techniques [8] as it relies on the simulation of autonomous and concurrent agents whose (inter)actions may include stochastic elements. Therefore the evaluation of the simulation models is also stochastic by nature. III. C LOUD C OMPUTING Cloud computing [2] is a novel high performance computing (HPC) paradigm which has recently attracted considerable attention. The computing capabilities (i.e., compute and storage clouds) are typically provided as a service via the Internet. This web approach enables users to access HPC services without requiring expertise in the technology that supports them. In other words, the user does not need expertise in mainframe administration and maintenance, distributed systems, networking, etc. The key benets of cloud computing are identied as follows: Reduced Cost: Cloud computing infrastructures are provided by a third-party and do not need to be purchased for potentially infrequent computing tasks. Users pay for the resources on a utility computing basis. This enables users with limited nancial and computing resources to exploit high performance computing facilities (e.g., the Amazon Elastic Compute Cloud, the Sun Grid) without having to invest into personal and expensive computing facilities. Scalability: Multiple computing clouds (which can be distant from each other) can be aggregated to form a single virtual entity enabling users to conduct very large scale experiments. The computing resources are dynamically provided and self-managed by the cloud computing server. Cloud computing is a HPC paradigm, in others words, it aims at enabling users to exploit large amounts of computing power in a short period of time (in minutes or hours). Thus, cloud computing differs from High Throughput Computing approaches, such

as Condor [9]1 , which aim at provisioning large amounts of computing power over longer periods of time (in days or weeks). One of the core technology underlying cloud computing, enabling the above benets, is the MapReduce programming model [11]. This model is composed of two distinct phases: Map: The input data is partitioned into subsets and distributed across multiple compute nodes. The data subsets are processed in parallel by the different nodes. A set of intermediate les results from the Map phase and is processed during the Reduce phase. Reduce: Multiple compute nodes process the intermediate les which are then collated to produce the output les. Similarly to the Map processes, the Reduce operations are distributed (and executed in parallel) over multiple compute nodes. The relative simplicity of the MapReduce programming model facilitates the efcient parallel distribution of computationally expensive jobs. This parallelism also enables the recovery from failure during the operations (this is particularly relevant when considering a distributed environment where some nodes may fail during a run). Map/Reduce operations may be replicated (if a distinct operation fails, its replication is retrieved). Also, failed operations may automatically be rescheduled. These faulttolerant features are inherent properties of cloud computing frameworks such as the Apache Hadoop. Thus the user is not required to handle such issues. We suggest that evolutionary agent-based simulations can be expressed as MapReduce computations, and consequently, may exploit the benets provided by the cloud computing paradigm. In the next section we briey present some related studies which examined the combination of the MapReduce programming model with evolutionary algorithms. IV. R ELATED STUDIES Recent studies have combined evolutionary computation and the MapReduce programming model. In [12], Jin et al. claimed that, as devised, the MapReduce model cannot directly support the implementation of parallel genetic algorithms (i.e., a specic island-based model). As a result, MapReduce was extended and included an additional Reduce process. The iterative cycle is as follows. During the Map phase, multiple instances of the genetic algorithms are executed in parallel. The local optimal solutions of each population are collected during the rst Reduce phase. An additional collection and sorting of the local optimal solutions is conducted during the second Reduce phase. The resulting set of global optimal solutions is then utilized to initiate the next generation. Llora et al. [13] presented a different approach where several evolutionary algorithms were adapted to support the MapReduce model (in contrast with Jin et al. who adapted the MapReduce model and not the evolutionary algorithm
1 Note

that Condor is being adapted to support cloud computing [10].

itself). The parallelization of the evolutionary algorithms was here conducted using a decentralized and distributed selection approach [14]. This method avoided the requirement of a second Reduce process (i.e., a single selection operation is conducted over the aggregation of the different pools of solutions). The above studies provide guidance for translating evolutionary algorithms for MapReduce operations. The approach proposed by Llora et al. is further examined in Section VI. Note that in contrast with Jin et al. and Llora et al.s approaches, the objective function is here the simulation of stochastic agent-based models. The resolution (i.e., level of abstraction) of the simulations is the key factor (i.e., the bulk of the work) determining the computational requirements of the evolutionary experiments. In the next section, a description of the CASE framework is provided. V. T HE CASE FRAMEWORK CASE is a recently developed framework which enables one to evolve simulation models using nature-inspired search algorithms. This system was constructed in a modular manner (using the Ruby programming language to accommodate the users specic requirements (e.g., use of different simulation engines or evolutionary algorithms, etc.). This framework can be regarded as a simplication of the Automated Red Teaming framework [15] which was developed by the DSO National Laboratories of Singapore. CASE is composed of three main components which are distinguished as follows: 1) The model generator: This component takes as inputs a base simulation model specied in the eXtended Markup Language and a set of model specication text les. According to these inputs, novel XML simulation models are generated and sent to the simulation engine for evaluation. Thus, as currently devised, CASE only supports simulation models specied in XML. Moreover, the model generator may consider constraints over the evolvable parameters (this feature is optional). These constraints are specied in a text le by the user. These constraints (due for instance to interactions between evolvable simulation parameters) aim at increasing the plausibility of generated simulation models (e.g., through introducing cost trade-off for specic parameter values). 2) The simulation engine: The set of XML simulation models is received and executed by the stochastic simulation engine. Each simulation model is replicated a number of times to account for statistical uctuations. A set of result les detailing the outcomes of the simulations (in the form of numerical values for instance) are generated. These measurements are used to evaluate the generated models, i.e., these gures are the tness (or cost) values utilized by the evolutionary algorithm (EA) to direct the search. 3) The evolutionary algorithm: The set of simulation results and associated model specication les are

received by the evolutionary algorithm, which in turns, processes the results and produce a new generation of model specication les. The generation of these new model specications is driven by the userspecied (multi)objectives (e.g., maximize/minimize some quantitative values capturing the target system behavior). The algorithm iteratively generates models which would incrementally, through the evolutionary search, best exhibit the desired outcome behavior. The model specication les are sent back to the model generator; this completes the search iteration. This component is the key module responsible for the automated analysis and modeling of simulations. Communications between the three components are conducted via text les for simplicity and exibility. Note that the exible nature of CASE allows one to develop and integrate different simulation platforms (using models specied in XML), and search algorithms. In the next section, we propose a cloud computing compliant version of CASE. VI. M AP R EDUCE CASE We present our adaptation of the CASE framework to support the MapReduce programming model. This adaptation is conducted using the Apache Hadoop framework which relies on the Map and Reduce functions devised in functional programming languages such as Lisp. During initialization, the CASE modules (simple Ruby scripts and the simulation engine executable) are sent to the compute nodes. Then, at each search iteration, only the model specication les are transmitted to the compute nodes, where, locally the generation and evaluation of simulation models are conducted. The motivation of this approach is to decrease the network trafc and distribute the computational effort (moving computation is cheaper than moving data). Also, note that only a single Reduce process is conducted to retrieve the intermediate result les. Future work will consider exploiting the Reduce phase through analyzing intermediate result les (to assist the evolutionary algorithm) using multiple compute nodes. This relatively straightforward implementation illustrates the simplicity of the MapReduce programming model. VII. E XPERIMENT We present an example experiment in which the CASE framework is utilized for Automated Red Teaming (ART), a simulation-based military methodology utilized to uncover weaknesses of operation plans. Here, combat is conceptually regarded as a complex adaptive system which outcomes result from complex non-linear dynamics [16]. The agent-based simulation platform MANA [3], developed by the New Zealand Defense and Technology Agency, is employed to model and perform the simulations. A. Automated Red Teaming Automated Red Teaming (ART) was originally proposed by the defense research community as a vulnerability assessment tool to automatically uncover critical weaknesses

of operational plans [7]. Using this computer/simulationbased approach, defense analysts may subsequently resolve the identied tactical plan loopholes. A stochastic agent-based simulation is typically used to model and simulate the behavioral and dynamical features of the environment/agents. The agents are specied with a set of properties which denes their intrinsic capabilities and personality such as sensor range, re range, movement range, communications range, aggressiveness, response to injured teammates and cohesion. A review of ABS systems applied to various military applications is provided by Cioppa et al. [17]. In ART experiments, a defensive Blue team (a set of agents) is subjected to repeated attacks, where multiple scenarios may be examined, from a belligerent Red team. Thus, ART aims at anticipating the adversary behaviour through the simulation of various potential scenarios. B. Setting A maritime anchorage protection scenario is examined. In this scenario, a Blue Team (composed of 7 vessels) conducts patrols to protect an anchorage (in which 10 Green commercial vessels are anchored) against threats. Red forces (5 vessels) attempt to break Blues defense strategy and inict damages to anchored vessels. The aim of the study is to discover Reds strategies that are able to breach through Blues defensive tactic. We detail the model, evolutionary algorithm and cloud computing facilities utilized in the experiments: The model: Figure 1 depicts the scenario which was modeled using the ABS platform MANA.

Fig. 1. MANA model of the maritime anchorage protection scenario, adapted from [18]. The map covers an area of 100 by 50 nautical miles (1 nm = 1.852 km). The dashed lines depict the patrolling paths of the different Blue vessels.

The Blue patrolling strategy is composed of two layers: an outer patrol (with respect to the anchorage area, which measures 30 by 10 nm) and an inner patrol. The outer patrol consists of four smaller but faster boats; they provide the first layer of defence, whereas the larger and heavily armored ships inside the anchorage form the second defensive layer.

In CASE, each candidate solution is represented by a vector of real values defining the different evolvable Red behavioral parameters (Table I). As the number of decision variables increases, the search space becomes significantly larger. According to the number of evolvable properties and their associated ranges in this experiment, the search space contains 1.007 distinct candidate solutions (i.e., variants of the original simulation model).

TABLE I
EVOLVABLE RED PARAMETERS

Red property                      Min        Max
Team 1 initial position (x,y)     (0,0)      (399,39)
Team 2 initial position (x,y)     (0,160)    (399,199)
Intermediate waypoints (x,y)      (0,40)     (399,159)
Team 1 final position (x,y)       (0,160)    (399,199)
Team 2 final position (x,y)       (0,0)      (399,39)
Aggressiveness                    -100       100
Cohesiveness                      -100       100
Determination                     20         100

The home and final positions, together with the intermediate waypoint, define the trajectory of each distinct Red vessel. Three of the Red craft (Team 1) were set up to initiate their attack from the north, while the remaining two (Team 2) attack from the south. This allows Red to perform a multi-directional attack on the anchorage. In addition, the final positions of the Red craft are constrained to the region opposite their initial area, to simulate escapes from the anchorage following successful attacks. Psychological elements are included in the decision variables to capture their potential effects on the Red force. The aggressiveness determines the reaction of individual Red craft upon detecting a Blue patrol, cohesiveness influences the propensity of Red to maneuver as a group, and determination stands for Red's willingness to follow the defined trajectories. The Red craft's aggressiveness towards the Blue force is varied from unaggressive (-100) to very aggressive (100). Likewise, the cohesiveness of the Red craft is varied from independent (-100) to very cohesive (100). Finally, a minimum value of 20 is set for determination to prevent inaction from occurring.

The evolutionary algorithm: The Non-dominated Sorting Genetic Algorithm II (NSGA-II) [19] is employed to conduct the evolutionary search using the parameter values listed in Table II.

TABLE II
EVOLUTIONARY ALGORITHM SETTINGS

Parameter                      Value
Population size                100
Number of search iterations    50
Mutation probability           0.1
Mutation index                 20
Crossover rate                 0.9
Crossover index                20

The NSGA-II population size and number of search iterations indicate that 5000 distinct MANA simulation models are generated and evaluated for each experimental run. Each individual simulation model is executed/replicated 30 times to account for statistical fluctuations. The efficiency of the algorithm is measured by the number of Green casualties with respect to the number of Red casualties. In other words, the objectives are:
- to minimize the number of Green (commercial) vessels alive;
- to minimize the number of Red casualties.
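As an illustration of this setting, the sketch below shows one way the 13 evolvable values of Table I could be decoded and the two minimization objectives aggregated over the 30 replications. The class, the gene layout and the use of a plain average are assumptions made for illustration only; they do not reproduce the project's code.

// Hypothetical sketch: maps a real-valued NSGA-II candidate onto the
// evolvable Red parameters of Table I and derives the two objectives.
public class RedCandidate {

    // Assumed layout: five (x, y) pairs followed by aggressiveness,
    // cohesiveness and determination (13 real values in total).
    private final double[] genes;

    public RedCandidate(double[] genes) {
        if (genes.length != 13) {
            throw new IllegalArgumentException("expected 13 decision variables");
        }
        this.genes = genes.clone();
    }

    // Clamp a gene into the [min, max] range given in Table I.
    private static double clamp(double v, double min, double max) {
        return Math.max(min, Math.min(max, v));
    }

    public double aggressiveness() { return clamp(genes[10], -100, 100); }
    public double cohesiveness()   { return clamp(genes[11], -100, 100); }
    public double determination()  { return clamp(genes[12],   20, 100); }

    // Both objectives are minimized by NSGA-II: Green vessels left alive and
    // Red casualties, aggregated here (as one plausible choice) by averaging
    // over the 30 replications of the model.
    public static double[] objectives(int[] greenAlivePerRun, int[] redCasualtiesPerRun) {
        return new double[] {
            average(greenAlivePerRun),
            average(redCasualtiesPerRun)
        };
    }

    private static double average(int[] values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}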


The cloud computing facilities: The cloud computing cluster is composed of 30 laboratory workstations located at the Parallel and Distributed Computing Center, Nanyang Technological University. Note that the hardware of the workstations may vary from one machine to another; a heterogeneous environment is therefore considered. Moreover, as these workstations may also occasionally be used by students, their performance may be affected during experiments. This exemplifies the hazards (e.g., a student may reboot a compute node) that may occur in a distributed environment. We purposely utilize such a computing environment to test the fault-tolerance features of Hadoop.

C. Results

Figure 2 presents the running times of two experiments in which we incrementally increased the number of available compute nodes. In the first experimental run, a relatively fast version of the simulation model is employed (it requires 5 seconds to execute 30 replications on a compute node). In the second run, the model execution time is increased from 5 to 90 seconds to reflect real-life military simulation models, which typically require such an amount of time. It can be observed that as the number of available compute nodes increases, the time required to perform the experiment decreases accordingly. Nevertheless, for the first model this relationship (i.e., number of nodes versus time) does not scale linearly, most noticeably when the number of compute nodes exceeds 10, whereas in the second experimental run the running time scales with the number of utilized compute nodes. The results suggest that, depending on the execution time of the simulation model, an optimal (from a computing cost point of view) number of compute nodes exists. A number of issues causing overheads were identified:
1) The iterative nature of the evolutionary algorithm requires the synchronization of the search iterations. As a result, compute nodes equipped with a relatively slower CPU (or having a higher computational load due to external factors such as students using the computer) may cause a delay.
2) Delays may also occur due to network traffic. The latter may cause the model evaluations to start at differing times (this issue may thus aggravate the previous one).

Fig. 2. Running times of MapReduce CASE experiments with an increasing number of distributed compute nodes (1, 2, 4, 5, 10, 20 and 25), using the fast (top) and slow (bottom) variants of the base simulation model.

Future work will consider the utilization of an asynchronous model, taking the heterogeneous computing environment into account, to resolve the above issues. Also, note that some experiments were conducted while laboratory demonstrations were taking place. Nevertheless, no significant deterioration of the experiments was observed (apart from the occasional slowdown of some model evaluations). All experiments were thus successfully completed using this heterogeneous and relatively hazardous computing environment. This supports the robustness qualities of the cloud computing paradigm. In the next section we discuss the integration of distributed evolutionary computation techniques within our CASE MapReduce model.

VIII. FUTURE WORK

Our simplistic adaptation of CASE did not exploit some features of the MapReduce model (e.g., the shuffling process and multiple Reduce processes). We discuss future directions, examining distributed evolutionary computation, which may potentially address this deficit:

Island-based model: The island-based model [4] is a popular and efficient way to implement evolutionary algorithms on distributed systems. In this model, each compute node executes an independent evolutionary algorithm over its own sub-population. The nodes work in concert by periodically exchanging solutions in a process called migration. It has been reported that such models often exhibit better search performance in terms of both accuracy and speed. This approach may thus further optimize the evolutionary search given a limited computing budget. We may, for instance, devise Reduce processes that would carry out the computations required during the migrations (e.g., the selection of the most promising solutions to be transferred), as sketched below.
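Purely as an illustration of how such a migration step could be expressed, the sketch below ranks an island's sub-population by a single fitness value and keeps the best individuals. The class, the fitness field and the migration size are assumptions for illustration; they are not part of CASE or of the studies cited here.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of an island-model migration step.
public class IslandMigration {

    public static class Individual {
        final double[] genes;
        final double fitness; // lower is better in this sketch

        Individual(double[] genes, double fitness) {
            this.genes = genes;
            this.fitness = fitness;
        }
    }

    // Select the migrationSize best individuals from an island's sub-population;
    // these would be transferred to a neighbouring island at the next iteration.
    public static List<Individual> selectMigrants(List<Individual> subPopulation,
                                                  int migrationSize) {
        List<Individual> sorted = new ArrayList<>(subPopulation);
        sorted.sort(Comparator.comparingDouble(ind -> ind.fitness));
        return new ArrayList<>(sorted.subList(0, Math.min(migrationSize, sorted.size())));
    }
}

A Reduce task keyed by an island identifier could, for instance, call selectMigrants on each sub-population and emit the chosen individuals to the neighbouring island.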

Self-adaptive mechanisms: As with the parameter setting of evolutionary algorithms, the performance of distributed evolutionary approaches may vary according to the specific migration scheme employed. Numerous parameters (as mentioned above) are to be pre-specified by the user and ultimately determine the efficiency of the distributed evolutionary search. This parameter tuning process is thus a critical step which typically requires a series of preliminary experiments to identify a satisfactory set of parameter values. Consequently, running such preliminary experiments conflicts with our intention to resolve computing budget issues. Recent studies [20], [21] have addressed this issue, using self-adaptive methods to automate the parameter tuning process. We suggest that these computations may also be expressed as Reduce processes.

The above directions are currently being investigated using our initial work on combining CASE and the MapReduce model.

IX. CONCLUSION

We first briefly presented the fields of evolutionary agent-based simulations and cloud computing. To date, the work reported here is among the very first attempts to combine evolutionary agent-based simulations with the MapReduce programming model. To assist this research, we utilized the modular evolutionary framework CASE, which was adapted to support the MapReduce model. To test our novel framework, we presented an evolutionary experiment involving Automated Red Teaming, a method originating from the defense research community in which warfare is conceptually regarded as a complex adaptive system. The experimental results demonstrated the benefits of the MapReduce approach in terms of both scalability and robustness. Finally, we discussed a future research direction in which self-adaptive distributed evolutionary algorithms are considered to further optimize the evolutionary search.

ACKNOWLEDGMENTS

We would like to thank the following organizations that helped make this R&D work possible:
- Defence Research and Technology Office, Ministry of Defence, Singapore, for sponsoring the Evolutionary Computing Based Methodologies for Modeling, Simulation and Analysis project, which is part of the Defence Innovative Research Programme FY08.
- Defence Technology Agency, New Zealand Defence Force, for sharing the agent-based model MANA.

- Parallel and Distributed Computing Center, School of Computer Engineering, Nanyang Technological University, Singapore.
- DSO National Laboratories, Singapore.

REFERENCES
[1] J. Holland, "Studying complex adaptive systems," Journal of Systems Science and Complexity, vol. 19, no. 1, pp. 1–8, 2006.
[2] A. Weiss, "Computing in the Clouds," netWorker, vol. 11, no. 4, pp. 16–25, 2007.
[3] M. Lauren and R. Stephen, "Map-aware Non-uniform Automata (MANA) - A New Zealand Approach to Scenario Modelling," Journal of Battlefield Technology, vol. 5, pp. 27–31, 2002.
[4] E. Cantu-Paz, Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers, 2000.
[5] P. Barry and M. Koehler, "Simulation in Context: Using Data Farming for Decision Support," in Proceedings of the 36th Winter Simulation Conference, 2004, pp. 814–819.
[6] T. Cioppa and T. Lucas, "Efficient Nearly Orthogonal and Space-filling Latin Hypercubes," Technometrics, vol. 49, no. 1, pp. 45–55, 2007.
[7] C. Chua, C. Sim, C. Choo, and V. Tay, "Automated Red Teaming: An Objective-based Data Farming Approach for Red Teaming," in Proceedings of the 40th Winter Simulation Conference, 2008, pp. 1456–1462.
[8] S. Olafsson and J. Kim, "Simulation Optimization," in Proceedings of the 34th Winter Simulation Conference, vol. 1, 2002, pp. 79–84.
[9] M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, vol. 43, 1988, pp. 104–111.
[10] D. Thain and C. Moretti, "Abstractions for Cloud Computing with Condor," in Cloud Computing and Software Services, S. Ahson and M. Ilyas, Eds. CRC Press, 2010, to appear.
[11] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[12] C. Jin, C. Vecchiola, and R. Buyya, "MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms," in ESCIENCE '08: Proceedings of the 2008 Fourth IEEE International Conference on eScience. Washington, DC, USA: IEEE Computer Society, 2008, pp. 214–221.
[13] X. Llora, A. Verma, R. Campbell, and D. Goldberg, "When Huge Is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing," in Parallel and Distributed Computational Intelligence, pp. 11–41, 2010.
[14] K. De Jong and J. Sarma, "On Decentralizing Selection Algorithms," in Proceedings of the Sixth International Conference on Genetic Algorithms, 1995, pp. 17–23.
[15] C. S. Choo, C. L. Chua, and S.-H. V. Tay, "Automated Red Teaming: A Proposed Framework for Military Application," in Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2007, pp. 1936–1942.
[16] A. Ilachinski, Artificial War: Multiagent-based Simulation of Combat. World Scientific Publishing Co., 2004.
[17] T. Cioppa, T. Lucas, and S. Sanchez, "Military Applications of Agent-based Simulations," in Proceedings of the 36th Winter Simulation Conference, 2004, pp. 171–180.
[18] M. Low, M. Chandramohan, and C. Choo, "Multi-Objective Bee Colony Optimization Algorithm to Automated Red Teaming," in Proceedings of the 41st Winter Simulation Conference, 2009, pp. 1798–1808.
[19] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, "A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II," Lecture Notes in Computer Science, pp. 849–858, 2000.
[20] K. Srinivasa, K. Venugopal, and L. Patnaik, "A Self-adaptive Migration Model Genetic Algorithm for Data Mining Applications," Information Sciences, vol. 177, no. 20, pp. 4295–4313, 2007.
[21] C. Leon, G. Miranda, and C. Segura, "A Memetic Algorithm and a Parallel Hyperheuristic Island-based Model for a 2D Packing Problem," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2009, pp. 1371–1378.
