
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 6, Issue 3, March (2015), pp. 12-23
IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2015): 8.9958 (Calculated by GISI)
www.jifactor.com
WORK LOAD ANALYSIS SECURITY ASPECTS AND OPTIMIZATION OF WORKLOAD IN HADOOP CLUSTERS
Atul U. Patil1, T.I. Bagban2, B.S. Patil3, R.U. Patil4, S.A. Gondil5

1 M.E. CSE, ADCET / Shivaji University, Kolhapur, India
2 Asso. Prof., DKTE, Ichalkarnji / Shivaji University, Kolhapur, India
3 Asso. Prof., PVPIT, Budhgaon / Shivaji University, Kolhapur, India
4 Asst. Prof., BVCOE, Kolhapur / Shivaji University, Kolhapur, India
5 Asst. Prof., Bharthi vidhyapit palus, Pune, India

ABSTRACT
This paper discusses a proposed cloud system that combines on-demand allocation of resources with improved utilization through opportunistic provisioning of cycles from idle cloud nodes to other processes. Making every demanded service available to cloud customers is extremely difficult, and meeting cloud consumers' needs is a significant issue. Hence an on-demand cloud infrastructure using a Hadoop configuration, with improved CPU utilization and improved storage hierarchy utilization, is proposed using the Fair4s job scheduling algorithm. All cloud nodes that would otherwise remain idle are put to use, security challenges are addressed, load balancing is achieved, and large data is processed in less time for all kinds of jobs, whether large or small. We compare the GFS read/write algorithm and the Fair4s job scheduling algorithm for file uploading and file downloading, and improve CPU utilization and storage utilization. Cloud computing moves application software and databases to large data centres, where the management of data and services may not be fully trustworthy. This security problem is addressed by encrypting the data using an encryption/decryption algorithm, while the Fair4s job scheduling algorithm solves the problem of utilizing all idle cloud nodes for larger data.
Keywords: CPU Utilization, Encryption/decryption algorithm, Fair4s Job scheduling algorithm, GFS, Storage utilization.

I. INTRODUCTION

Cloud computing is considered a rapidly emerging technology for delivering computing as a utility. In cloud computing, customers demand a variety of services according to their dynamically changing needs, and it is the job of the cloud to make all the demanded services available to them. Because only a limited number of resources is available, however, it is very difficult for cloud providers to supply every demanded service. From the cloud providers' perspective, cloud resources should be allocated in a fair manner, so meeting cloud consumers' quality-of-service needs and satisfaction is an important issue. To ensure on-demand availability, a provider has to overprovision: keep a large proportion of nodes idle so that they can be used to satisfy an on-demand request that might arrive at any time. The need to keep these nodes idle results in low utilization. The only way to improve utilization is to keep fewer nodes idle, but this means potentially rejecting a higher proportion of requests, to the point at which a provider no longer provides on-demand computing [2]. Several trends are opening up the era of cloud computing, which is an Internet-based development and use of computer technology. Ever cheaper and more powerful processors, together with the software as a service (SaaS) computing architecture, are transforming data centres into pools of computing service on a large scale.
Meanwhile, increasing network bandwidth and reliable yet flexible network connections make it possible for clients to subscribe to high-quality services from data and software that reside solely on remote data centres. In recent years, Infrastructure-as-a-Service (IaaS) cloud computing has emerged as an attractive alternative to the acquisition and management of physical resources. An important feature of IaaS clouds is providing users on-demand access to resources. However, to provide on-demand access, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources with low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users require truly on-demand access to resources [3]. Many applications and workflows are designed for recoverable systems where interruptions in service are expected.
Here we propose a cloud infrastructure with a Hadoop configuration that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes. The objective is to handle larger data in less time and to keep all idle cloud nodes utilized by splitting larger files into smaller ones using the Fair4s job scheduling algorithm, and also to increase the utilization of the CPU and the storage hierarchy when uploading and downloading files. To keep data and services trustworthy, security is maintained using the RSA algorithm, which is widely used for secure data transmission. We also compare the GFS read/write algorithm with the Fair4s job scheduling algorithm. Fair4s yields improved utilization because of features such as setting slot quotas for pools, setting slot quotas for individual users, assigning slots based on pool weight, and extending job priorities; these features allow job allocation and load balancing to take place efficiently.
II. LITERATURE SURVEY

There has been much research in the field of cloud computing over the past decades, and some of it is discussed here. One work studied cloud computing architecture and its security and proposed a new cloud computing architecture in which the SaaS model is used to deploy the related software on the cloud platform, so that resource utilization and the quality of scientific computing tasks are improved [17]. Workload characterization studies are useful for helping Hadoop operators identify system bottlenecks and figure out solutions for optimizing performance. Many previous efforts have been made in various areas, including network systems [6]. Another work proposed a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. A model for securing Map/Reduce computation in the cloud uses a language-based security approach to enforce information flow policies that vary dynamically because of a restricted revocable delegation of access rights between principals; the decentralized label model (DLM) is used to express these policies [18]. A new security architecture, Split Clouds, protects the data stored in a cloud while letting each organization hold direct security controls over its data instead of leaving them to cloud providers. The core of the model consists of real-time data summaries, an in-line security gateway and a third-party auditor. Through the combination of these three solutions, the architecture can prevent malicious activities performed even by the security administrators within the cloud providers [20]. Several studies [19], [20], [21] have been conducted on workload analysis in grid environments and parallel computer systems. They proposed various methods for analysing and modelling workload traces. However, the job characteristics and scheduling policies in grids are much different from those in a Hadoop system.
III. THE PROPOSED SYSTEM

Fig.1 System Architecture


Cloud computing has become a viable, mainstream solution for data processing, storage and distribution, but moving large amounts of data in and out of the cloud presented an insurmountable challenge [4]. Cloud computing is a very successful paradigm of service-oriented computing and has revolutionized the way computing infrastructure is abstracted and used. The three most popular cloud paradigms are:
1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)
The concept can also be extended to Database as a Service or Storage as a Service. Scalable database management systems (DBMS), both for update-intensive application workloads and for decision support systems, are an important part of the cloud infrastructure. Initial designs include distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Changes in the data access patterns of applications and the need to scale out to thousands of commodity machines led to the birth of a new class of systems referred to as key-value stores [11]. In the domain of data analysis, we adopt the MapReduce paradigm and its open-source implementation, Hadoop, for their usability and performance.
The system has six modules:
1. Hadoop Configuration (Cloud Server Setup)
2. Login & Registration
3. Cloud Service Provider (CSP)
4. Fair4s Job Scheduling Algorithm
5. Encryption/decryption module
6. Administration files (Third Party Auditor)
3.1 Hadoop Configuration (Cloud Server Setup)
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to several thousand nodes, providing massive computation and storage capacity. Rather than relying on the underlying hardware to deliver high availability, the infrastructure itself is designed to handle failures at the application layer, thus delivering a highly available service on top of a cluster of nodes, each of which may be prone to failures [6]. Hadoop implements MapReduce using HDFS. The Hadoop Distributed File System allows users to have a single namespace, spread across many hundreds or thousands of servers, creating one large file system. Hadoop has been demonstrated on clusters with more than two thousand nodes. The current design target is ten-thousand-node clusters.
Hadoop was inspired by MapReduce, a framework in which an application is broken down into numerous small parts. Any of these parts (also referred to as fragments or blocks) may be run on any node in the cluster. The present Hadoop system consists of the Hadoop architecture, MapReduce, and the Hadoop Distributed File System (HDFS).


Fig.2 Architecture of Hadoop


JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, in its own JVM process; in a typical production cluster it runs on a separate machine. Every slave node is configured with the location of the JobTracker node. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. Client applications submit jobs to the JobTracker, which schedules the work on the TaskTrackers [9].
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, in its own JVM process. Each TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this ensures that a process failure does not take down the TaskTracker [10].
The NameNode stores the entire file system namespace. Information such as last modified time, creation time, file size, owner and permissions is stored in the NameNode [10]. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS)
HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale processing workloads where quality, flexibility and throughput are important, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond [8].
3.2 Login and Registration
This module provides an interface for login. A client can upload files to and download files from the cloud and obtain a detailed summary of his or her account. Security is provided by giving each client a user name and password, which are stored in a database at the main server. Every upload and download is captured in a log record, which can be used for later audit trails. This facility provides adequate security to the client, and data stored on the cloud servers can be modified only by the client.

3.3 Cloud Service Provider (Administrator)


This module handles the administration of users and data. The cloud service provider has the authority to add and remove clients, and it ensures adequate security for clients' data stored on the cloud servers. Only registered and authorized clients can access the cloud services, and a log record is kept for each of them. These per-client log records help to improve security.
3.4 Job Scheduling Algorithm
MapReduce is a distributed processing model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Clients specify the computation in terms of a map and a reduce function: the map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This enables programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7].
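To make the map and reduce roles concrete, the following is a minimal single-process sketch in Python. It is not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the run-time system, and the word-count task and input lines are chosen purely for illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the run-time system: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


# Illustrative input: (line number, line text) pairs.
lines = [(0, "big data on hadoop"), (1, "hadoop splits big files")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping step is the shuffle, and the map and reduce calls run in parallel on different nodes; the sketch keeps only the data flow.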
Our implementation of the Fair4s job scheduling algorithm runs on a large cluster of commodity machines and is highly scalable. MapReduce was popularized by the open-source Hadoop project. Our Fair4s job scheduling algorithm processes large files by dividing them into a number of chunks and assigning the tasks to the cluster nodes in a Hadoop multi-node configuration. In this way, the proposed Fair4s job scheduling algorithm improves the utilization of the cluster nodes in terms of time, CPU, and storage.
3.4.1 Features of Fair4s
The extended functionality of the Fair4S scheduling algorithm, which makes it more workload-efficient than the GFS read/write algorithm, is listed below. These features allow the algorithm to deliver efficient performance when processing heavy workloads from different clients.
1. Setting Slots Quota for Pools- All jobs are divided into several pools, and every job belongs to one of these pools. In Fair4S, every pool is configured with a maximum slot occupancy. All jobs belonging to the same pool share the slots quota, and the number of slots used by these jobs at any time is limited to the maximum slot occupancy of their pool. This upper limit on the slot occupancy of user groups makes slot assignment more flexible and adjustable, and ensures slot occupancy isolation across different user groups. Even if some slots are occupied by large jobs, the influence is restricted to the local pool.
2. Setting Slot Quota for Individual Users- In Fair4S, every user is configured with a maximum slot occupancy. No matter how many jobs a given user submits, the total number of occupied slots will not exceed the quota. This constraint prevents a user from submitting too many jobs that together occupy too many slots.
3. Assigning Slots based on Pool Weight- In Fair4S, every pool is configured with a weight. All pools that are waiting for more slots form a queue of pools. The number of times a given pool occurs in the queue is linear in the weight of the pool, so a pool with a high weight is allocated more slots. Because the pool weight is configurable, the weight-based slot assignment policy effectively decreases the waiting time (for slots) of small jobs.
4. Extending Job Priorities- Fair4S introduces a detailed, quantified priority for every job. The job priority is described by an integer ranging from 0 to 1000. Generally, within a pool, a job with a higher priority can preempt the slots used by another job with a lower priority. A quantified job priority helps differentiate the priorities of small jobs in different user groups.
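The pool and per-user quotas described in features 1 and 2 can be sketched as a simple admission check. This is an illustrative Python sketch, not the Fair4S implementation: the `Pool` and `User` classes, the `grant_slot` helper and the quota values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    max_slots: int       # pool-level slot quota (assumed value in the example)
    used_slots: int = 0


@dataclass
class User:
    name: str
    max_slots: int       # per-user slot quota (assumed value in the example)
    used_slots: int = 0


def can_grant_slot(pool, user):
    """A slot may be granted only if neither quota would be exceeded."""
    return pool.used_slots < pool.max_slots and user.used_slots < user.max_slots


def grant_slot(pool, user):
    """Grant one slot and update both counters; return False if a quota blocks it."""
    if not can_grant_slot(pool, user):
        return False
    pool.used_slots += 1
    user.used_slots += 1
    return True


pool = Pool("analytics", max_slots=3)
alice = User("alice", max_slots=2)
bob = User("bob", max_slots=2)

# Alice is stopped by her own quota; Bob is later stopped by the pool quota.
alice_grants = [grant_slot(pool, alice) for _ in range(3)]
bob_grants = [grant_slot(pool, bob) for _ in range(2)]
print(alice_grants, bob_grants)
```

The two checks are independent, which is what isolates pools from one another while still preventing a single user from monopolizing a pool.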
3.4.2 Fair4s Job Scheduling Algorithm
Fair4S is a job scheduling algorithm designed to be biased towards small jobs. In a variety of workloads, small jobs account for the majority of the workload, and many of them require instant responses, which is an important factor in production Hadoop systems. The inefficiency of the Hadoop fair scheduler and the GFS read/write algorithm in handling small jobs motivates us to use and analyze Fair4S, which introduces pool weights and extends job priorities to guarantee rapid responses for small jobs [1]. In this scenario, clients upload files to or download files from the main server, where the Fair4s job scheduling algorithm executes. On the main server, the mapper function provides the list of available cluster IP addresses to which tasks are assigned, so that the task of file splitting is distributed across all live cluster nodes. The Fair4s job scheduling algorithm splits a file according to its size and the available cluster nodes.
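A minimal sketch of splitting a file according to its size and the number of live nodes is given below. The paper does not state the exact splitting rule, so near-equal chunks with one chunk per node are an assumption; the function names and IP addresses are hypothetical.

```python
def split_for_nodes(data, nodes):
    """Split data into near-equal chunks, one per live node (assumed policy)."""
    if not nodes:
        raise ValueError("no live cluster nodes available")
    base, extra = divmod(len(data), len(nodes))
    chunks, start = {}, 0
    for index, node in enumerate(nodes):
        # The first `extra` nodes take one additional byte each.
        size = base + (1 if index < extra else 0)
        chunks[node] = data[start:start + size]
        start += size
    return chunks


def join_chunks(chunks, nodes):
    """Reassemble the file by concatenating chunks in node order."""
    return b"".join(chunks[node] for node in nodes)


# Hypothetical list of live cluster node addresses.
live_nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
payload = bytes(range(10))
chunks = split_for_nodes(payload, live_nodes)
print({node: len(chunk) for node, chunk in chunks.items()})
```

Keeping the node order fixed is what lets the download path recombine the parts without extra metadata in this simplified setting.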
3.4.3 Procedure of Slots Allocation
1. The first step is to allocate slots to job pools. Every job pool is configured with two parameters: a maximum slots quota and a pool weight. In any case, the number of slots allocated to a job pool will not exceed its maximum slots quota. If the slot demand of a job pool changes, the maximum slots quota can be adjusted manually by Hadoop operators. If a job pool requests additional slots, the scheduler first judges whether the slot occupancy of the pool would exceed the quota. If not, the pool is appended to the queue and waits for slot allocation. The scheduler allocates the slots by a round-robin algorithm. Probabilistically, a pool with a high allocation weight is more likely to be allocated slots.
2. The second step is to allocate slots to individual jobs. Every job is configured with a job priority parameter, which is a value between 0 and 1000. The job priority and deficit are normalized and combined into a weight for the job. Within a job pool, idle slots are allocated to the jobs with the highest weight.
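The two steps above can be sketched as follows. This is a simplified Python illustration under stated assumptions: each pool appears in the queue as many times as its integer weight, and the job weight is taken as the sum of the normalized priority and normalized deficit. The paper does not give the exact combination formula, so `job_weight` is hypothetical, as are the pool names and numbers.

```python
from itertools import cycle


def build_pool_queue(pools):
    """Each pool occurs in the queue a number of times equal to its weight."""
    queue = []
    for name, cfg in pools.items():
        queue.extend([name] * cfg["weight"])
    return queue


def allocate_slots(pools, idle_slots):
    """Step 1: round-robin over the weighted queue, never exceeding a quota."""
    granted = {name: 0 for name in pools}
    queue = build_pool_queue(pools)
    misses = 0  # consecutive queue entries that could not take a slot
    for name in cycle(queue):
        if idle_slots == 0 or misses >= len(queue):
            break
        cfg = pools[name]
        wants_more = granted[name] < cfg["demand"]
        under_quota = cfg["used"] + granted[name] < cfg["max_slots"]
        if wants_more and under_quota:
            granted[name] += 1
            idle_slots -= 1
            misses = 0
        else:
            misses += 1
    return granted


def job_weight(priority, deficit, max_deficit):
    """Step 2 (assumed formula): normalized priority plus normalized deficit."""
    norm_priority = priority / 1000
    norm_deficit = deficit / max_deficit if max_deficit else 0.0
    return norm_priority + norm_deficit


def pick_job(jobs):
    """Within a pool, the idle slot goes to the job with the highest weight."""
    max_deficit = max(job["deficit"] for job in jobs)
    return max(jobs, key=lambda j: job_weight(j["priority"], j["deficit"], max_deficit))


pools = {
    "small_jobs": {"weight": 3, "max_slots": 6, "used": 0, "demand": 6},
    "batch": {"weight": 1, "max_slots": 4, "used": 0, "demand": 4},
}
granted = allocate_slots(pools, idle_slots=8)

jobs = [
    {"id": "a", "priority": 900, "deficit": 1.0},
    {"id": "b", "priority": 200, "deficit": 4.0},
    {"id": "c", "priority": 500, "deficit": 2.0},
]
chosen = pick_job(jobs)
print(granted, chosen["id"])
```

With eight idle slots, the heavier pool is served three times for every turn of the lighter one, which is how the weight biases allocation towards small jobs.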
3.5 Encryption/decryption
In this module, files are encrypted and decrypted using the RSA encryption/decryption algorithm, which uses a public key and a private key for the encryption and decryption of data. The client uploads the file together with a secret/public key, from which a private key is generated and the file is encrypted. In the reverse process, the file is decrypted using the public key/private key pair and downloaded. For example, a client uploads a file with the public key, and the file name is used to generate the unique private key that is used for encrypting the file. The uploaded file is thus encrypted and stored on the main server, and is then split using the Fair4s scheduling algorithm, which provides an additional security feature for cloud data. In the reverse process of downloading data from the cloud servers, the file name and public key are used to generate the secret key, all parts of the file are combined, and the data is decrypted and downloaded, which strengthens the security of cloud data.
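For readers unfamiliar with RSA, the following is a toy textbook sketch in Python. It follows the standard convention (encrypt with the public key, decrypt with the private key) and does not reproduce the file-name-based key generation described above. The primes 61 and 53 and the exponent 17 are small illustrative values, and the byte-by-byte scheme without padding is insecure; a real system should use a vetted cryptographic library.

```python
def make_keypair(p, q, e=17):
    """Build a toy RSA key pair from two primes (illustrative sizes only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; requires Python 3.8+
    return (e, n), (d, n)


def encrypt(data, public_key):
    """Encrypt each byte separately: c = m^e mod n (no padding, toy only)."""
    e, n = public_key
    return [pow(byte, e, n) for byte in data]


def decrypt(blocks, private_key):
    """Decrypt each block: m = c^d mod n, then rebuild the byte string."""
    d, n = private_key
    return bytes(pow(block, d, n) for block in blocks)


public_key, private_key = make_keypair(61, 53)
ciphertext = encrypt(b"hadoop", public_key)
plaintext = decrypt(ciphertext, private_key)
print(public_key, private_key, plaintext)
```

The round trip works because e and d are inverses modulo (p-1)(q-1), so raising to e and then to d returns the original value modulo n.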


Fig.3 RSA encryption/decryption


3.6 Administration of Client Files (Third Party Auditor)


This module provides a facility for auditing all client files, since clients perform various activities. Log records are created and stored on the main server. For every registered client, a log record is created that captures the operations (upload/download) performed by the client, together with the date and time at which they were carried out. The log records support both the safety and security of client data and auditing. A log record facility is also provided for the administrator, which records the log information of all registered clients so that the administrator can control all the data stored on the cloud servers. The administrator can view client-wise log records, which helps to detect fraudulent data access if a fake user attempts to access the data stored on the cloud servers.
Registered client log records:

Fig.4 List of Log records of clients.
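A log-record facility of this kind can be sketched in a few lines of Python. The `AuditLog` class, its field names and the client names are hypothetical; the sketch only illustrates recording each operation with a timestamp and flagging access by unregistered users.

```python
from datetime import datetime, timezone


class AuditLog:
    """Hypothetical in-memory log of client operations on the main server."""

    def __init__(self, registered_clients):
        self.registered = set(registered_clients)
        self.records = []

    def record(self, client, operation, filename, when=None):
        """Store who did what to which file, and when."""
        entry = {
            "client": client,
            "operation": operation,  # "upload" or "download"
            "file": filename,
            "timestamp": when or datetime.now(timezone.utc),
            "registered": client in self.registered,
        }
        self.records.append(entry)
        return entry

    def for_client(self, client):
        """Client-wise view, as seen by the administrator."""
        return [r for r in self.records if r["client"] == client]

    def suspicious(self):
        """Entries created by users who are not registered clients."""
        return [r for r in self.records if not r["registered"]]


log = AuditLog(["alice"])
log.record("alice", "upload", "report.txt")
log.record("mallory", "download", "report.txt")
print(len(log.for_client("alice")), [r["client"] for r in log.suspicious()])
```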


IV. RESULTS

The results of the project are based on experiments with a number of clients, one main server, and three to five secondary servers. They are reported for three parameters:
1) Time
2) CPU Utilization
3) Storage Utilization.
Our evaluation examines the improved utilization of the cluster nodes, i.e. the secondary servers, when uploading and downloading files using the Fair4s scheduling algorithm versus the GFS read/write algorithm, from three perspectives: time utilization, CPU utilization, and storage utilization.
4.1 Results for time utilization

Fig.5 Time Utilization Graph For Uploading Files


Fig. 5 shows time utilization for the GFS and Fair4s algorithms for uploading files.
The values are:
Uploading File Size (in KB)    Time (in millisec) for GFS    Time (in millisec) for Fair4s
1742936                        1720                          107
4734113                        928                           170
6938669                        1473                          117
11527296                       1857                          704
3057917                        253                           38
17385800                       1859                          839


Fig.06 Time Utilization Graph for Download Files


Fig. 06 shows time utilization for GFS and Fair4s for downloading files.
The values are:
Number of Files    Time (in millisec) for GFS    Time (in millisec) for Fair4s
5                  840                           795
7                  1937                          1852
9                  4814                          3698
11                 5143                          4111
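The relative improvement implied by the reported timings can be computed with a short Python sketch. The figures are copied from the upload and download tables; pairing the upload values row by row (first file size with first GFS time and first Fair4s time, and so on) is an assumption, since the extracted table lists the three columns separately.

```python
# (file size in KB, GFS time in ms, Fair4s time in ms); row pairing is assumed.
upload = [
    (1742936, 1720, 107),
    (4734113, 928, 170),
    (6938669, 1473, 117),
    (11527296, 1857, 704),
    (3057917, 253, 38),
    (17385800, 1859, 839),
]

# (number of files, GFS time in ms, Fair4s time in ms)
download = [
    (5, 840, 795),
    (7, 1937, 1852),
    (9, 4814, 3698),
    (11, 5143, 4111),
]


def speedup(gfs_ms, fair4s_ms):
    """How many times faster Fair4s is than GFS for one measurement."""
    return gfs_ms / fair4s_ms


def total_speedup(rows):
    """Ratio of total GFS time to total Fair4s time over all rows."""
    return sum(row[1] for row in rows) / sum(row[2] for row in rows)


for label, rows in (("upload", upload), ("download", download)):
    per_row = [round(speedup(gfs, fair), 2) for _, gfs, fair in rows]
    print(label, per_row, round(total_speedup(rows), 2))
```

The aggregate ratio weights each measurement by its duration, so it is dominated by the slowest transfers rather than by the average of the per-row ratios.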
4.2 Results for CPU utilization

Fig.07 CPU Utilization Graph for GFS Files


Fig.07 describes the CPU utilization for GFS files on number of cluster nodes.

Fig.08 CPU utilization graph for the Fair4s algorithm on number of cluster nodes in Hadoop.
V. CONCLUSION

We have proposed an improved cloud architecture that combines on-demand scheduling of infrastructure resources with optimized utilization through opportunistic provisioning of cycles from idle nodes to other processes. A cloud infrastructure using a Hadoop configuration with improved processor utilization and storage space utilization is proposed using the Fair4s job scheduling algorithm. All nodes that would otherwise remain idle are utilized, security problems are largely addressed, load balancing is achieved, and large data is processed in less time. We compare the GFS read/write algorithm and the Fair4s MapReduce algorithm for file uploading and file downloading, and optimize processor utilization and storage space use. In this paper we also describe some of the techniques implemented to protect data and propose an architecture to protect data in the cloud. The model stores data in the cloud in encrypted form using the RSA technique, which is based on encryption and decryption of data. In many previously proposed works there is a Hadoop configuration for cloud infrastructure, yet the cloud nodes still remain idle, and to our knowledge no such work compares CPU utilization and storage utilization for the GFS read/write algorithm versus the Fair4s scheduling algorithm.
We provide a solution to the backfill problem using an on-demand user workload on a cloud structure using Hadoop. We contribute an improvement in processor utilization and time utilization of Fair4s over GFS. In our work all cloud nodes are fully utilized and none remains idle; files are also processed at a faster rate, so tasks complete in less time, which is a significant advantage and improves utilization. We also implement the RSA algorithm to secure the data, which improves security.
VI. REFERENCES
1. Zujie Ren, Jian Wan, Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao, IEEE Transactions on Services Computing, vol. 7, no. 2, April-June 2014.
2. M. Zaharia, D. Borthakur, J.S. Sarma, S. Shenker, and I. Stoica, Job Scheduling for Multi-User Mapreduce Clusters, Univ. California, Berkeley, CA, USA, Tech. Rep. No. UCB/EECS-2009-55, Apr. 2009.
3. Y. Chen, S. Alspaugh, and R.H. Katz, Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of Mapreduce Workloads, Proc. VLDB Endowment, vol. 5, no. 12, Aug. 2012.
4. Divyakant Agrawal et al., Big Data and Cloud Computing: Current State and Future Opportunities, EDBT, pp. 22-24, March 2011.
5. Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao, in Proc. IEEE IISWC, 2012, pp. 3-13.
6. Jeffrey Dean et al., MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1, pp. 107-113, January 2008.
7. Y. Chen, S. Alspaugh, D. Borthakur, and R.H. Katz, Energy Efficiency for Large-Scale Mapreduce Workloads with Significant Interactive Analysis, in Proc. EuroSys, 2012, pp. 43-56.
8. Stackoverflow (2014, 07, 14). Hadoop Architecture Internals: use of job and task trackers [English]. Available: http://stackoverflow.com/questions/11263187/hadoop-architecture-internals-use-of-job-and-task-trackers
9. S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, An Analysis of Traces from a Production Mapreduce Cluster, in Proc. CCGRID, 2010, pp. 94-103.
10. J. Dean et al., MapReduce: a flexible data processing tool, in CACM, Jan. 2010.
11. M. Stonebraker et al., MapReduce and parallel DBMSs: friends or foes?, in CACM, Jan. 2010.
12. X. Liu, J. Han, Y. Zhong, C. Han, and X. He, Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS, in Proc. CLUSTER, 2009, pp. 1-8.
13. A. Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, in VLDB, 2009.
14. S. Das et al., Ricardo: Integrating R and Hadoop, in SIGMOD, 2010.
15. J. Cohen et al., MAD Skills: New Analysis Practices for Big Data, in VLDB, 2009.
16. Gaizhen Yang et al., The Application of SaaS-Based Cloud Computing in the University Research and Teaching Platform, ISIE, pp. 210-213, 2011.
17. Paul Marshall et al., Improving Utilization of Infrastructure Clouds, IEEE/ACM International Symposium, pp. 205-214, 2011.
18. F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. Mclarty, File System Workload Analysis for Large Scale Scientific Computing Applications, in Proc. MSST, 2004, pp. 139-152.
19. M. Zaharia, D. Borthakur, J.S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, in Proc. EuroSys, 2010, pp. 265-278.
20. E. Medernach, Workload Analysis of a Cluster in a Grid Environment, in Proc. Job Scheduling Strategies Parallel Process., 2005, pp. 36-61.
21. K. Christodoulopoulos, V. Gkamas, and E.A. Varvarigos, Statistical Analysis and Modeling of Jobs in a Grid Environment, J. Grid Computing, vol. 6, no. 1, 2008.
22. Gandhali Upadhye and Astt. Prof. Trupti Dange, Nephele: Efficient Data Processing Using Hadoop, International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 7, 2014, pp. 11-16, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
23. Suhas V. Ambade and Prof. Priya Deshpande, Hadoop Block Placement Policy For Different File Formats, International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 12, 2014, pp. 249-256, ISSN Print: 0976-6367, ISSN Online: 0976-6375.