Professional Documents
Culture Documents
After the Map phase and before the beginning condition and forms fixed shape while DBSCAN has
of the Reduce phase is a handoff process, known as an issue of processing time and it is more complex
shuffle and sort. Here, data from the mapped tasks is than K-Means. Comprehensive analysis of these
prepared and moved to the nodes where the reducer existing techniques has been carried out and
tasks will be run. When the mapped task is complete, appropriate clustering algorithm is provided. A
the results are sorted by key, partitioned if there are hybrid approach based on parallel K-Means and
multiple reducers, and then written to disk. The parallel DBSCAN is proposed to overcome the
MapReduce programming model simplifies large- drawbacks of both these algorithms. This approach
scale data processing on commodity cluster by allows combining the benefits of both the clustering
exploiting parallel map tasks and reduces tasks. techniques. Further, the proposed technique is
Although many efforts have been made to improve evaluated on the MapReduce framework of Hadoop
the performance of MapReduce jobs, they ignore the Platform. The results show that the proposed
network traffic generated in the shuffle phase, which approach is an improved version of parallel K-Means
plays a critical role in performance enhancement. clustering algorithm. This algorithm also performs
5. Aggregation Methodology on Map better than parallel DBSCAN while handling clusters
of circularly distributed data points and slightly
Reduce for Big Data Applications by using
APPLICATIONS TO ENHANCE THE cloud for efficient processing. Big Data introduces to
NETWORK TRAFFIC PERFORMANCE. datasets whose sizes are beyond the capability of
typical database software tools to capture,
International Journal For Technological
accumulate, maintain and examined. The application
Research In Engineering Volume 4, Issue 4,
of Big Data differs across verticals since of the
December-2016
several challenges that bring about the various use
MapReduce is a scheme for processing and
cases. With the increasing amount of data and the
managing large scale data sets during a distributed
availability of high performance and relatively low-
cluster, which has been used for applications such as
cost hardware, database systems have been extended
document clustering, generating search indexes,
and parallelized to run on multiple hardware
access log analysis, and numerous other forms of data
platforms to manage scalability. Recently, a new
analytic. In existing system, a hash function is used to
distributed data processing framework called Map
partition intermediate data among reduce tasks.
Reduce was proposed whose fundamental idea is to
During this project the system proposed a
simplify the parallel processing using a distributed
decomposition-based distributed algorithm to deal
computing platform that offers only two interfaces.
with the large-scale optimization problem for large
To further reduce network traffic within a Map
data application and an online algorithm is
Reduce job, we consider to aggregate data with the
additionally designed to adjust data partition and
same keys before sending them to remote reduce
aggregation in a dynamic manner. Network traffic
tasks.
price under both offline and on-line cases is
OBECTIVES:
significantly reduced as demonstrated by the
Map-Reduce algorithm: MapReduce is
extensive stimulation results by the various proposals
a programming model and an associated
considered and used.
implementation for processing and generating large
7. OPTIMIZATION OF MAP REDUCE
data sets with a parallel, distributed algorithm on
APPLICATIONS USING PARTITION AND
a cluster. A MapReduce program is composed of
AGGREGATION IN BIG DATA a Map() procedure (method) that performs filtering
APPLICATION. wjert, 2016, Vol. 2, Issue 3, and sorting (such as sorting students by first name
179 -189 Research Article into queues, one queue for each name) and