
SURVEY ON VARIOUS SMALL FILE HANDLING STRATEGIES ON HADOOP

B. Santhosh Kumar, AP(SG)/CSE, SNS College of Technology, b.santhoshkumar@gmail.com
P. Kanaga Ranjitham, UG Scholar, SNS College of Technology, ranjup1995@gmail.com
K. R. Karthekk, UG Scholar, SNS College of Technology, krk1232@gmail.com
J. Gokila, UG Scholar, SNS College of Technology, vijayalakshmigokila@gmail.com

Abstract - Hadoop is inefficient at processing small files, so we aim to improve its performance using the various strategies proposed in earlier survey papers. The strategies proposed so far each have disadvantages, so we intend to select the most appropriate strategy for improving Hadoop's performance. At present the merge model and SIFM strategies serve as the better ways of processing small files in Hadoop, and the New HAR approach is also a strategy worth considering for optimizing Hadoop. The proposed strategies are still not fully efficient at improving how Hadoop processes small files, and new strategies may or may not be introduced in the future for handling small files in Hadoop.

Keywords: Namespace, NameNode, SIFM, DataNode, HDFS
INTRODUCTION:

Today's world is becoming increasingly automated through the Internet, which provides most of the facilities needed today. Internet access generates a large amount of data day by day, and this collection of data is referred to as big data. Storage for it was initially provided by the cloud, in the form of cloud computing, a technology that stores data in the cloud and solves the storage problem.

Due to the sudden increase in Internet users within a short period, the storage problem arises again. Various distributed file systems such as the Google File System, the Hadoop Distributed File System, Lustre, PVFS, etc.[7] are available for storing large amounts of data. These large-scale distributed file systems handle the data emerging from servers such as Facebook, Google, Yahoo, WhatsApp, etc. These servers produce both very large and small files in huge quantities, and they drive the growth of Internet usage by letting people share information, mostly for communication.

In recent years Hadoop has delivered the highest performance and has become the emerging trend for storing large files. Large files are handled by the Hadoop Distributed File System, which provides very efficient file access. The most common problem arises in the storage and access of a large number of small files in the Hadoop Distributed File System. HDFS can readily handle files whose size is in the range of gigabytes and terabytes; the problem with small file storage is that HDFS uses only a single NameNode to accommodate the metadata of all files.

In HDFS, the input data is divided into data blocks of 64 MB[7]; the NameNode stores the metadata of the data blocks and the DataNodes store the data blocks themselves. A file much smaller than 64 MB is difficult to place efficiently in the HDFS architecture. Many applications produce only small files, such as Word documents, PowerPoint presentations, Flash files, images, MP3s, video files, etc., and files whose size is measured in kilobytes are very difficult to handle. Some changes in the HDFS architecture can therefore make small file storage and access effective.

HADOOP DISTRIBUTED FILE SYSTEM:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault tolerant and is deployed on low-cost hardware. HDFS provides high-throughput access to data and is very efficient at handling very large files. Hadoop partitions data and computation across thousands of hosts and executes applications in parallel[7]. It stores file system metadata and application data separately. HDFS does not use data protection mechanisms such as RAID; it uses the write once read many (WORM) concept[6]. The server that stores the metadata is termed the NameNode, and HDFS keeps the entire namespace in RAM. HDFS works on a master/slave concept. Application data are stored on other servers named DataNodes. TCP-based protocols provide connection and communication between the servers. Such a system stores as much as 25 petabytes of data. HDFS has two main components, namely the NameNode and the DataNode[7]; the Hadoop architecture has a single NameNode and a larger number of DataNodes.
NameNode:

An HDFS cluster consists of a single NameNode acting as a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations such as opening, closing and renaming files and directories, and it also determines the mapping of blocks to DataNodes[6]. The NameNode keeps track of how files are broken into blocks and where those blocks sit in the namespace. These data are maintained in the NameNode's main memory so as to serve client requests quickly.

DataNode:

The cluster also has a number of DataNodes, usually one per node in the cluster. DataNodes manage the storage attached to the nodes on which they run[6]. DataNodes also perform block creation, replication and deletion in response to instructions from the NameNode.
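For concreteness, a small client-side sketch of writing a file into HDFS through the FileSystem API is given below; the destination path is hypothetical and the snippet assumes a reachable cluster configured via core-site.xml. The NameNode records the file's metadata while the DataNodes receive its blocks.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);                   // client entry point; talks to the NameNode
            Path target = new Path("/data/sample/small-file.txt");  // hypothetical destination path
            try (FSDataOutputStream out = fs.create(target)) {      // metadata goes to the NameNode, blocks to DataNodes
                out.write("example content\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }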
SMALL FILES:

A small file is a file that is smaller than the HDFS block size, which is usually set to 64, 128 or 256 MB. More precisely, a small file is one that is smaller than about 75% of the configured HDFS block. Even when a file is larger than the configured block size, the remainder left over after filling whole blocks (the larger file size minus a multiple of the HDFS block size) can itself be small, and such a file is also said to produce a small file[12]. Small files arise from the drive of companies to produce real-time data. New sources that produce data in large amounts are being discovered day by day using various technologies[7]. Mostly these data items are small in size but enormous in number. Some sources produce each image as a single file that is normally only a few KB in size; in this way, large numbers of small files are produced.
STRATEGIES:

SIFM:

Structured Index File Merging (SIFM) is an optimization scheme for managing large numbers of small files in HDFS. The technique rests on three main ideas. The first is merging files while considering file correlations, which reduces seek time and the delay in reading files. The second is that metadata files are stored in a structured, distributed architecture, which reduces the seek time for requested files[7]. The third is a prefetching and caching strategy on the DataNode that reduces the access time when reading a large number of small files. In file merging, a small file uploaded to HDFS is merged into a big file; a metadata file for the merged file is created and a structured merge file is constructed. The metadata is loaded into the NameNode and the merged file is stored on the DataNodes. A Chord architecture is adopted to store and manage the metadata in the NameNode[7]. The prefetching and caching strategy caches the metadata and index files in order to retrieve the small files. The SIFM technique is very helpful in reducing communication cost and improves the I/O performance of file reads[7].
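To make the merging idea concrete, the sketch below appends a set of small files into one large HDFS file and records each file's name, offset and length. It is an illustrative simplification (the class and helper names are invented), since the actual SIFM scheme in [7] writes the index out as a structured index file rather than keeping an in-memory list.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallFileMerger {
        /** One index entry per small file: where it starts and how long it is inside the merged file. */
        static class IndexEntry {
            final String name;
            final long offset;
            final long length;
            IndexEntry(String name, long offset, long length) {
                this.name = name; this.offset = offset; this.length = length;
            }
        }

        public static List<IndexEntry> merge(Configuration conf, Path[] smallFiles, Path mergedFile) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            List<IndexEntry> index = new ArrayList<>();
            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (Path small : smallFiles) {
                    long offset = out.getPos();                  // current position = offset of this small file
                    try (FSDataInputStream in = fs.open(small)) {
                        IOUtils.copyBytes(in, out, 4096, false); // append the small file's bytes
                    }
                    index.add(new IndexEntry(small.getName(), offset, out.getPos() - offset));
                }
            }
            return index;  // in SIFM this information would be written out as the structured index file
        }
    }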
In the file merging strategy there are three parts, namely File Filtering Criteria, Structured Index File Creation and the File Merging Operation[2]. In File Filtering Criteria, the major concern is to store and access small files effectively, since many applications consist of a large number of small files, and the main idea is to use a cut-off point[1]. One survey investigated the storage of several applications: climatology has about 450,000 files with an average size of 61 MB, while in biology the human genome generates up to 30 million files with an average size of 190 KB, etc.[7]. If the cut-off point is fixed at 1 MB, then small files below 1 MB are stored using the native HDFS method and files above 1 MB are stored through the file merging operation. In Structured Index File Creation, the NameNode maintains only the metadata; in the structured index file there are separate indexes for the small files and for the merged files, and these index files are also maintained by the NameNode. The indexes consume far less memory than the full metadata, so when small files are merged with big files only the indexes need to be stored, which reduces the metadata size[7] and hence the NameNode memory needed for small files. In the File Merging Operation, the files selected by the File Filtering Criteria and described by Structured Index File Creation are merged according to their size, for example a big file together with small files; this step is called the File Merging Operation.
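A minimal sketch of the filtering step under the 1 MB cut-off mentioned above follows; the class name and threshold constant are illustrative assumptions, not the exact criteria of [2] or [7].

    import org.apache.hadoop.fs.FileStatus;

    public class FileFilteringCriteria {
        // Cut-off point of 1 MB, as in the example above.
        private static final long CUT_OFF_BYTES = 1L * 1024 * 1024;

        /** Classifies a file relative to the cut-off point; files on either side of it are routed
         *  to native HDFS storage or to the file merging operation, as described above. */
        public static boolean isBelowCutOff(FileStatus status) {
            return status.getLen() < CUT_OFF_BYTES;
        }
    }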
In metadata file storage, the small file store holds the mapping information between each small file and the large file that contains it. The metadata are stored in the NameNode as key-value pairs, and with the help of the key the file can be accessed from memory[7]. SF_name indicates the original small file, MF_name indicates the name of the merged large file, offset is the offset of the small file within it, MF_Flag validates the merged file, and MF_length indicates the length of the merged file. Metadata file storage uses these parameters to store and access the data files, and the performance is thereby increased[7].
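A minimal sketch of such a key-value record is shown below, with field names taken from the description above; the Java class itself and the in-memory map are illustrative assumptions, not code from [7].

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative key-value metadata record (SF_name, MF_name, offset, MF_Flag, MF_length). */
    public class SmallFileMetadata {
        final String sfName;   // SF_name: original small file name, used as the key
        final String mfName;   // MF_name: name of the merged large file that contains it
        final long offset;     // offset of the small file inside the merged file
        final boolean mfFlag;  // MF_Flag: whether the merged file is valid
        final long mfLength;   // MF_length: length of the merged file

        SmallFileMetadata(String sfName, String mfName, long offset, boolean mfFlag, long mfLength) {
            this.sfName = sfName; this.mfName = mfName; this.offset = offset;
            this.mfFlag = mfFlag; this.mfLength = mfLength;
        }

        /** Hypothetical NameNode-side mapping: look a small file up by its name (the key). */
        static final Map<String, SmallFileMetadata> MAPPING = new HashMap<>();

        static SmallFileMetadata locate(String smallFileName) {
            return MAPPING.get(smallFileName);
        }
    }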
Prefetching and caching of files increases the efficiency of file access. Prefetching also reduces the I/O cost and the response time of data served from memory. The three main mechanisms used to improve this efficiency are metadata caching, index file caching and merged data file prefetching[7]. The first uses the metadata mapping file to process the client request; the second uses the obtained metadata to identify the block that holds the requested file; the third returns the accessed file to the client while the related files from the same merged file are cached[7].
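As one way to picture the caching side, the sketch below keeps recently used index or metadata entries in client memory with least-recently-used eviction; this is a generic illustration, not the caching algorithm of [7].

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** A minimal LRU cache for prefetched index or metadata entries. */
    public class PrefetchCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public PrefetchCache(int capacity) {
            super(16, 0.75f, true);   // access-order iteration gives least-recently-used behaviour
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least recently used entry once the cache is full
        }
    }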
HAR files:

HAR stands for Hadoop Archives, and its purpose is to reduce the number of files. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the archived files into a small number of HDFS files[1], so that the original files can still be accessed in parallel. Accessing the files in parallel increases optimization and efficiency. Using a HAR file changes nothing for applications: the original files remain visible and accessible[4]. However, a HAR file is no more efficient than plain HDFS, because two index files must be consulted for each access[4]. The major disadvantage of HAR files is therefore that reading files is slower than in HDFS, owing to this double use of the index[4].
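As a rough illustration of reading back through an archive, the sketch below lists the contents of a hypothetical HAR through the har filesystem scheme; the archive would first have been built with something like hadoop archive -archiveName files.har -p /data/small /data/archived. The paths and class name here are invented, and this is a generic listing sketch rather than code from the cited papers.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarListingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical archive location; the har scheme lets clients read archived files in place.
            Path archive = new Path("har:///data/archived/files.har");
            FileSystem harFs = archive.getFileSystem(conf);   // resolves to the HAR filesystem for this scheme
            for (FileStatus status : harFs.listStatus(archive)) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }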
Sequence Files:

The idea behind sequence files is to use the file name as the key and the file content as the value[1]. The individual small files are collected and organized as a group, and it is inefficient to work with them by accessing or storing them one by one[1], so a parallel access mechanism is applied to improve efficiency[11]. Converting existing data into sequence files can be slow, but it is possible to create sequence files in parallel, or a data pipeline can be applied to solve this problem[1].
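A minimal sketch of packing local small files into a sequence file follows, with the file name as the key and the raw bytes as the value; the output path and class name are illustrative, and the option-style createWriter call assumes a Hadoop 2.x or later client.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/data/packed/small-files.seq");        // hypothetical output path in HDFS
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (String localFile : args) {                          // local small files given on the command line
                    byte[] bytes = Files.readAllBytes(Paths.get(localFile));
                    writer.append(new Text(localFile), new BytesWritable(bytes)); // file name as key, content as value
                }
            }
        }
    }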
Consolidator:

The small files are first divided into their respective categories; the files of each category are then organized and merged into a single large file[1]. However, this process can produce a very large amount of data, even terabytes, so it is not a successful method for fast access[1]. A parameter termed Desired File Size allows the target file size to be fixed and the size of the merged file to be limited. Even with the Desired File Size parameter, the merged file can become very large, so access to the file can still be very slow[1].
HBase:

HBase provides a different kind of storage for small files, based on how they need to be accessed. It stores the data in MapFiles (indexed sequence files), so by applying MapReduce technology the files can be accessed, and access is faster as well[1].

File Mapping:

Small files are merged into large files so that the small data can be stored efficiently, and file mapping is then used to improve access[2]. File mapping maintains a mapping on the basis of an index, so that access becomes easier: when searching by the name of a small file, the file is accessed through the index location mapped to that file name.
New HAR:

New HAR, normally called NHAR, is an advance on HAR. NHAR reduces the number of files when merging small files with a large file[3]. Whereas HAR maintains an index that is accessed twice, NHAR reduces this difficulty by accessing only one index. Moreover, to add an additional file to an existing HAR archive, a new HAR file has to be created[5]; in NHAR, an additional file can be added without creating a new archive.
PERFORMANCE MEASURES:

A comparison has been made between the original and the optimized Hadoop, with a weather condition report taken as the data set for analysing the factors. Two parameters are used to evaluate the performance: one is the time taken to move files from the local file system into HDFS, and the other is the memory used by the NameNode to store metadata. These measures are reported through a case study of weather report data[8].

Time Taken to Move Files into HDFS:

Files are moved into both the original and the optimized Hadoop. The optimized Hadoop uses the merge model strategy, which merges all the small files into a single large file and moves it into HDFS for processing. The file sizes and the time taken to move the files into HDFS were observed and recorded. Storing a 2 GB input with the original Hadoop takes about 162 seconds, which is reduced to 71 seconds with the optimized Hadoop. This shows that the optimized Hadoop is more efficient at moving files into HDFS[8].

Memory Usage of the NameNode to Store Metadata:

Only the metadata content is stored in the NameNode. The input data is divided into blocks of 64 MB each; the data is stored in the data blocks and the metadata is stored in the NameNode. In the NameNode, the metadata of each block consumes about 150 bytes of memory[8].

The weather report contains 1018 small files, each ranging from 250 KB to 5000 KB, with a total size of 2 GB. In the original Hadoop, HDFS created 1018 data blocks because the input consists of 1018 small files, each smaller than 64 MB[8]. The memory used to store the metadata of these 1018 data blocks was 152,700 bytes. In the optimized Hadoop, the small files are merged into a single large file of 2 GB, which occupies 32 blocks; the memory used to store the metadata of these 32 data blocks was 4,800 bytes. This improves the performance efficiency by up to 80%[8].
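The arithmetic behind these figures can be reproduced directly; the trivial sketch below uses only the per-block metadata size and block counts quoted above from [8].

    public class NameNodeMemoryEstimate {
        static final long BYTES_PER_BLOCK_METADATA = 150; // approximate NameNode memory per block [8]
        static final long ORIGINAL_BLOCKS = 1018;         // one block per small file in the original layout
        static final long MERGED_BLOCKS = 32;             // 2 GB merged file / 64 MB block size

        public static void main(String[] args) {
            long original = ORIGINAL_BLOCKS * BYTES_PER_BLOCK_METADATA; // 152,700 bytes
            long merged = MERGED_BLOCKS * BYTES_PER_BLOCK_METADATA;     // 4,800 bytes
            System.out.println("Original Hadoop : " + original + " bytes of block metadata");
            System.out.println("Optimized Hadoop: " + merged + " bytes of block metadata");
        }
    }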
CONCLUSION:

Hadoop is designed to process large files but is inefficient at handling large numbers of small files. Several strategies for storing small files efficiently have been surveyed here. In the file merging strategy, NameNode memory is decreased by maintaining index files; the use of index files reduces the memory needed in the NameNode. The metadata file storage strategy uses the mapping function efficiently through its parameters. In the prefetching and caching strategy, while one file is being fetched the related files are cached in memory for quick access. HAR archives files to decrease the number of files, but its disadvantage is that the index is used twice; NHAR arises as an improvement on HAR by using the index only once. In the sequence file strategy, the collection and merging of files are handled in parallel so that access becomes efficient. The consolidator strategy collects files and organizes them for merging into a single large file; this increases the size of the resulting file, so file access is not efficient. HBase uses the MapFile strategy to handle small files efficiently. In file mapping, small files are merged with large files on the basis of an index; the index is keyed by the small file name, so access to the small file and to the merged file becomes easier.

The time required to move many small files is high, and because there are so many of them, the metadata created in the NameNode requires more memory; retrieving small files from HDFS is likewise difficult. To improve Hadoop's performance in storing and retrieving such data from HDFS, several of these strategies are applied; this process is called optimizing Hadoop, and the result can be called Optimized Hadoop. Processing small files using the merge model shows performance improvements of up to 90.83%. These strategies are all meant to improve the performance of Hadoop. As the surveys show, the strategies proposed so far still have some disadvantages, and we intend to pick the right strategy and improve the performance of Hadoop as far as possible.
REFERENCES:

[1] Parth Gohil and Bakul Panchal, "Efficient Ways to Improve the Performance of HDFS for Small Files", ISSN: 2222-2863.
[2] K. P. Jayakar and Y. B. Gurav, "Managing Small Size Files through Indexing in Extended Hadoop File System", International Journal of Advance Research in Computer Science and Management Studies, ISSN: 2321-7782.
[3] Chethan R., Chandan Kumar, Jayanth Kumar S., Girish H. J. and Prof. Mangala C. N., "A Selective Approach for Storing Small Files in Respective Blocks of Hadoop", ISSN: 0975-0282.
[4] Vaibhav Gopal Korat and Kumar Swamy Pamu, "Reduction of Data at Namenode in HDFS using Harballing Technique", ISSN: 2278-323.
[5] Sachin Bende and Rajashree Shedge, "Dealing with Small Files Problem in Hadoop Distributed File System".
[6] K. Kiruthika and E. Gothai, "An Efficient Approach for Storing and Accessing Small Files in HDFS".
[7] Yingchi Mao, Bicong Jia, Wei Min and Jiulong Wang, "Optimization Scheme for Small Files Storage Based on Hadoop Distributed File System".
[8] Guru Prasad M., Nagesh H. R. and Deepthi M., "Improving the Performance of Processing for Small Files in Hadoop: A Case Study of Weather Data Analytics", ISSN: 0975-9646.
[9] Shrikrishna Utpat, K. A. Dehamane and Srinivasa Kini, "An Optimized Storing and Accessing Mechanism for Small Files on HDFS", ISSN: 2277-128X.
[10] "The Small Files Problem", Cloudera Engineering Blog.
[11] "Hadoop - How to Manage Huge Numbers of Small Files in HDFS", Bodhtree Blog.
[12] "Working with Small Files in Hadoop - Part 1", Inquidia.
[13] X. Liu, J. Han, Y. Zhong, C. Han and X. He, "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS", Proceedings of the IEEE International Conference on Cluster Computing, New Orleans, USA, August 31 - September 4, 2009.
[14] C. Shen, W. Lu, J. Wu and B. Wei, "A Digital Library Architecture Supporting Massive Small Files and Efficient Replica Maintenance", Proceedings of the 10th Annual Joint Conference on Digital Libraries, ACM Press, QLD, Australia, June 21-25, 2010.
[15] Apache Software Foundation, Official Apache Hadoop website, http://hadoop.apache.org/, Aug. 2012.
[16] "The Hadoop Architecture and Design", http://hadoop.apache.org/common/docs/r0.16.4/hdfs_design.html, Aug. 2012.
[17] Tom White, "The Small Files Problem", http://www.cloudera.com/blog/2009/02/the-small-files-problem, 2009.
[18] B. Dong, "An Optimized Approach for Storing and Accessing Small Files on Cloud Storage", Journal of Network and Computer Applications, vol. 35, 2012.
[19] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System", Proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies, Incline Village, NV, USA, May 3-7, 2010.
[20] Yang Zhang and Dan Liu, "Improving the Efficiency of Storing for Small Files in HDFS", 2012 International Conference on Computer Science and Service System.
