RESEARCH ARTICLE
OPEN ACCESS
----------------------------------------************************ ----------------------------------
Abstract:
Big data refers to massive, fast-changing, heterogeneous data. Such diverse and huge volumes of data are very difficult to handle with present technologies and infrastructures. Hadoop overcomes this drawback with a powerful programming model called MapReduce. The implementation of MapReduce needs storage and parallel processing; HDFS is commonly used for storing and processing the data. In this paper we survey various scheduling algorithms for MapReduce that help increase resource utilization and speed up the response time of the system. The algorithms are compared, and the nature and drawbacks of each are discussed.
Keywords: MapReduce, Hadoop, Big Data.
----------------------------------------************************ ----------------------------------
I. INTRODUCTION
In this age of digital data, the speed at which data is generated is remarkable. In 2005 digital data on a global scale amounted to 150 exabytes, which grew to 1200 exabytes by 2010; this was expected to grow a further 44% by 2020 [1]. This data is not necessarily in a structured format: it is a mix of tweets, status updates, blogs, videos, etc., which are unstructured. Advanced technologies like Hadoop are necessary to process such massive data. Hadoop is not just a massive data storage engine; it is more than that, combining data processing with data storage. It is robust, scalable, reliable, and inexpensive [2]. Hadoop uses HDFS for storing the data and MapReduce to allocate the tasks.
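To make the model concrete, the following is a minimal single-machine sketch of the MapReduce idea (the canonical word-count example, not the real Hadoop Java API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs from each input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Word count: the map output is shuffled by key and reduced to totals.
result = reduce_phase(map_phase(["big data", "big hadoop"]))
# result == {"big": 2, "data": 1, "hadoop": 1}
```

In a real cluster, many map tasks and reduce tasks run in parallel on different nodes, which is precisely why the scheduling discussed below matters.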
HDFS: The Hadoop Distributed File System is designed to store very large amounts of data, up to terabytes and petabytes, and provides easy access to all the data stored in it. Its interface is similar to the Unix file system [3]. Metadata and application data are stored separately to increase performance. HDFS has a master/slave architecture in which the namespace of files and directories is maintained by the NameNode, which records permissions, access times, space usage, etc.
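The metadata/data split described above can be illustrated with a toy model (the dictionary layout and names here are purely illustrative, not the real HDFS data structures):

```python
# Toy model of HDFS's master/slave separation.
namenode = {  # master: file namespace -> block ids + attributes
    "/logs/app.log": {"blocks": ["blk_1", "blk_2"], "perm": "rw-r--r--"},
}
datanodes = {  # slaves: block id -> nodes holding a replica
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def read_file(path):
    """A client asks the NameNode only for block locations, then
    fetches each block directly from a DataNode; file contents never
    pass through the master."""
    meta = namenode[path]
    return [(blk, datanodes[blk][0]) for blk in meta["blocks"]]

plan = read_file("/logs/app.log")
# plan == [("blk_1", "dn1"), ("blk_2", "dn2")]
```

Keeping only small metadata on the NameNode is what lets a single master coordinate petabytes of block data spread across many DataNodes.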
ISSN: 2394-2231
http://www.ijctjournal.org
II. SCHEDULING ALGORITHMS
The MapReduce programming model processes tasks in parallel, so scheduling is essential. If scheduling is inappropriate, the whole system can break down and its performance degrades. The scheduling algorithm is chosen according to factors such as the type of task and the nature of the environment. The main goal of any chosen algorithm is to minimize the time taken to complete the job, and the scheduler must be chosen so that overall performance increases. MapReduce scheduling problems are categorized as NP-hard.
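Because exact optimization is NP-hard, practical schedulers rely on heuristics. A classic example is the Longest-Processing-Time-first greedy rule for spreading tasks over workers to minimize completion time (a general illustration, not a scheduler described in this paper):

```python
import heapq

def lpt_schedule(task_times, n_workers):
    """Longest-Processing-Time first: assign each task, longest first,
    to the currently least-loaded worker. A simple greedy heuristic
    for the NP-hard makespan-minimization problem."""
    loads = [(0.0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}
    for i, t in sorted(enumerate(task_times), key=lambda p: -p[1]):
        load, w = heapq.heappop(loads)            # least-loaded worker
        assignment[w].append(i)
        heapq.heappush(loads, (load + t, w))
    return assignment, max(l for l, _ in loads)

# Five tasks on two workers: the makespan is 10 instead of the
# worst-case 20 a naive assignment could produce.
assignment, makespan = lpt_schedule([7, 5, 4, 3, 1], 2)
```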
If an algorithm focuses on priority, it may have to compromise on data locality, and so on; this affects job performance. Different jobs may have different priorities, and all of these factors must be considered and analysed before selecting an algorithm.
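The priority-versus-locality trade-off can be sketched as a weighted scoring rule (the weights and task fields below are hypothetical, chosen only to show the tension):

```python
def pick_task(tasks, node, locality_weight=0.5):
    """Score each runnable task as a weighted blend of its priority and
    whether its input data is local to this node, then pick the best.
    The linear blend and the weight are illustrative assumptions."""
    def score(t):
        local = 1.0 if node in t["replica_nodes"] else 0.0
        return (1 - locality_weight) * t["priority"] + locality_weight * local
    return max(tasks, key=score)

tasks = [
    {"id": "t1", "priority": 0.9, "replica_nodes": {"n2"}},  # high priority, remote data
    {"id": "t2", "priority": 0.4, "replica_nodes": {"n1"}},  # low priority, local data
]
# With locality weighted heavily, node n1 prefers its local task t2
# even though t1 has the higher priority.
best = pick_task(tasks, "n1", locality_weight=0.7)
```

Shifting `locality_weight` toward 0 recovers a pure priority scheduler; shifting it toward 1 recovers a pure locality-driven one, which is exactly the compromise the text describes.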
MapReduce schedulers exhibit two types of run-time behaviour: adaptive and non-adaptive. Adaptive algorithms use observed parameter values to make their scheduling decisions. Non-adaptive algorithms do not take any such values into consideration before making scheduling decisions; jobs are processed in a predefined order.
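The adaptive/non-adaptive distinction can be captured in two lines (the runtime estimates here are invented illustration data):

```python
def non_adaptive_order(jobs):
    """Non-adaptive: run jobs in their predefined submission order,
    ignoring any runtime measurements."""
    return list(jobs)

def adaptive_order(jobs, runtime_estimates):
    """Adaptive: consult observed parameter values (here, estimated
    runtimes) before deciding the order -- shortest job first."""
    return sorted(jobs, key=lambda j: runtime_estimates[j])

jobs = ["j1", "j2", "j3"]
est = {"j1": 30, "j2": 5, "j3": 12}
# non_adaptive_order(jobs) -> ["j1", "j2", "j3"]
# adaptive_order(jobs, est) -> ["j2", "j3", "j1"]
```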
A. FIFO
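FIFO is Hadoop's default scheduler: jobs run strictly in arrival order, regardless of size or priority. A minimal sketch of that behaviour (not the actual Hadoop implementation):

```python
from collections import deque

class FIFOScheduler:
    """First-in, first-out: whichever job was submitted earliest runs
    next. Simple and fair by arrival time, but a huge early job can
    starve small jobs behind it -- FIFO's well-known drawback."""
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def next_job(self):
        return self.queue.popleft() if self.queue else None

s = FIFOScheduler()
for j in ("small", "huge", "medium"):
    s.submit(j)
# Jobs come back in exactly the order submitted: small, huge, medium.
```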
CREST (Combination Re-Execution Scheduling Technology) was proposed by Lei et al. [16]. It is estimated that CREST reduces the running time of map tasks.
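Re-execution schedulers of this family speculatively relaunch map tasks that lag far behind their peers. The sketch below shows only the generic straggler-detection idea; the threshold and progress model are illustrative assumptions, not CREST's actual cost model:

```python
def find_stragglers(progress, threshold=0.5):
    """Flag tasks whose progress is well below the average as
    candidates for speculative re-execution on another node.
    `progress` maps task id -> fraction complete in [0, 1]."""
    avg = sum(progress.values()) / len(progress)
    return sorted(t for t, p in progress.items() if p < threshold * avg)

# m3 is far behind the other map tasks, so it would be re-executed.
progress = {"m1": 0.9, "m2": 0.85, "m3": 0.1}
stragglers = find_stragglers(progress)
```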
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]