1. What is Hadoop?
Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search engine projects. It is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called MapReduce.
Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures.
2. Hadoop, Why?
Suppose you need to process a 100 TB dataset.
- On 1 node, scanning at 50 MB/s: 100 TB / 50 MB/s = 2,000,000 s, about 23 days.
- On a 1000-node cluster, scanning at 50 MB/s per node: about 33 minutes.
This calls for an efficient, reliable, and usable framework.
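To make the arithmetic explicit, here is a quick check of those figures; a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly linear scaling across nodes:

// Back-of-the-envelope scan-time check for the figures above.
public class ScanTime {
    public static void main(String[] args) {
        double bytes = 100e12;              // 100 TB dataset
        double rate  = 50e6;                // 50 MB/s per node
        double oneNodeSecs = bytes / rate;  // 2,000,000 s
        System.out.printf("1 node:      %.1f days%n", oneNodeSecs / 86400);      // ~23.1 days
        System.out.printf("1000 nodes:  %.1f min%n", oneNodeSecs / 1000 / 60);   // ~33.3 min
    }
}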
3. Where and When Hadoop
- Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling)
- Highly parallel, data-intensive distributed applications
- Very large production deployments (GRID)
- Processing lots of unstructured data
- When your processing can easily be made parallel
- When running batch jobs is acceptable
- When you have access to lots of cheap hardware
4. Benefits of Hadoop
Hadoop is designed to run on cheap commodity hardware It automatically handles d
ata replication and node failure It does the hard work you can focus on processi
ng data Cost Saving and efficient and reliable data processing
5. How Hadoop Works
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
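As an illustration of this paradigm, here is the classic word-count job written against the original org.apache.hadoop.mapred API; a minimal sketch, not a production job:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Each map task is one small fragment of work over a single input split.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);  // emit (word, 1)
            }
        }
    }

    // The reducer sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}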
6. Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop consists of:
- Hadoop Common: the common utilities that support the other Hadoop subprojects.
- HDFS: a distributed file system that provides high-throughput access to application data.
- MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers. This presentation primarily focuses on the Hadoop architecture and related subprojects.
7. Data Flow
This is the architecture of our backend data warehouing system. This system prov
ides important information on the usage of our website, including but not limite
d to the number page views of each page, the number of active users in each coun
try, etc. We generate 3TB of compressed log data every day. All these data are s
tored and processed by the hadoop cluster which consists of over 600 machines. T
he summary of the log data is then copied to Oracle and MySQL databases, to make
sure it is easy for people to access.
8. Hadoop Common
Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
9. HDFS
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Its two key ideas are replication and data locality.
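A minimal sketch of talking to HDFS through the FileSystem API; the path and file contents are illustrative, and the configuration is picked up from the cluster's site files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS replicates its blocks (dfs.replication, default 3).
        Path file = new Path("/tmp/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello, hdfs\n");
        out.close();

        // Read it back; block locations are resolved via the NameNode,
        // and the data is read directly from the DataNodes.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
    }
}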
10. MapReduce Tips
1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it's worth thinking up-front about which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
- Java: good for speed; control; binary data; working with existing Java or MapReduce libraries.
- Pipes: good for working with existing C++ libraries.
- Streaming: good for writing MapReduce programs in scripting languages.
- Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): good for Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction.
- Pig, Hive, Cascading: good for higher-level abstractions; joins; nested data.
While there are no hard and fast rules, in general we recommend using pure Java for large, recurring jobs, Hive for SQL-style analysis and data warehousing, and Pig or Streaming for ad hoc analysis.
2. Consider your input data chunk size
Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs?
Answers to these questions determine how you store and process data using HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.
3. Use SequenceFile and MapFile containers
SequenceFiles are a very useful tool. They are:
- Splittable, so they work well with MapReduce: each map gets an independent split to work on.
- Compressible. By using block compression you get the benefits of compression (less disk space, faster to read and write) while keeping the file splittable.
- Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.
However, both are Java-centric, so you can't read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro's DataFileWriter, although there is no MapReduce integration yet.)
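A minimal sketch of writing and re-reading a block-compressed SequenceFile; the path and record contents are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/data.seq");

        // Block compression keeps the file compact and still splittable.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        for (int i = 0; i < 100; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();

        // Read the records back in order.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}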
4. Implement the Tool interface
If you are writing a Java driver, then consider implementing the Tool interface to get the following options for free:
- -D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)
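A minimal driver sketch; the class name MyJobDriver is illustrative, and since no mapper or reducer is set, the old API's identity defaults apply. ToolRunner parses the generic options (-D and friends) before run() is invoked, so getConf() already reflects them:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already includes any -D options from the command line.
        JobConf conf = new JobConf(getConf(), MyJobDriver.class);
        conf.setJobName("my-job");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar myjob.jar MyJobDriver -D mapred.reduce.tasks=7 in out
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}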
7. Report progress
Some tasks do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:
- Call setStatus() on Reporter to set a human-readable description of the task's progress.
- Call incrCounter() on Reporter to increment a user counter.
- Call progress() on Reporter to tell Hadoop that your task is still there (and making progress).
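Putting the three calls together, here is a sketch of a CPU-intensive mapper that keeps the framework informed; the class and counter names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    enum MyCounters { RECORDS_SEEN }  // illustrative counter name

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        reporter.incrCounter(MyCounters.RECORDS_SEEN, 1);  // aggregated centrally
        reporter.setStatus("processing offset " + key);    // shown in the web UI
        // ... some CPU-intensive work per record ...
        reporter.progress();  // tells the framework the task is still alive
        output.collect(new Text("k"), value);
    }
}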
8. Debug with status and counters
Using the Reporter's setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred.
Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error, which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file). You can do more advanced debugging with debug scripts.
9. Tune at the job level before the task level
Before you start profiling tasks there are a number of job-level checks to run through:
- Have you set the optimal number of mappers and reducers?
  - The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
  - The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
- Have you set a combiner (if your algorithm allows it)?
- Have you enabled intermediate compression? (See JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output.)
- If using custom Writables, have you provided a RawComparator?
- Finally, there are a number of low-level MapReduce shuffle parameters that you can tune to get improved performance.
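A sketch of those job-level settings applied to an old-API JobConf; the reducer count of 28 is illustrative, and Hadoop's LongSumReducer stands in for whatever combiner your algorithm allows:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class TuningExample {
    public static JobConf tune(JobConf conf) {
        conf.setNumReduceTasks(28);                  // ~reduce slots in the cluster, minus a few
        conf.setCombinerClass(LongSumReducer.class); // combiner, if the algorithm allows it
        conf.setCompressMapOutput(true);             // same as mapred.compress.map.output=true
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        return conf;
    }
}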
10. Let someone else do the cluster administration
Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.
Hadoop interview questions
Q1. Name the most common InputFormats defined in Hadoop. Which one is the default?
The following three are the most common InputFormats defined in Hadoop:
- TextInputFormat
- KeyValueTextInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.
KeyValueTextInputFormat reads text files and parses lines into key-value pairs. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value.
Q9. If no custom partitioner is defined in Hadoop, how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
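That default is HashPartitioner. Its logic is essentially the following sketch (old mapred API; the class name here is illustrative): mask off the sign bit, then take the hash modulo the number of reducers.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }

    public int getPartition(K key, V value, int numPartitions) {
        // Keep the result non-negative, then spread keys across reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}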
Q16. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
Q17. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution.
Q21. What characteristic of the Streaming API makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk, etc.?
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job, by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
Q23. What is the benefit of the distributed cache? Why not just have the file in HDFS and have the application read it?
The distributed cache is much faster. It copies the file to all TaskTrackers at the start of the job. Now if a TaskTracker runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache. On the other hand, if you write code to read the file from HDFS in the MR job, then every mapper will try to access it from HDFS; if a TaskTracker runs 100 map tasks, it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
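A sketch of the two sides of that pattern with the old-API DistributedCache (the file path is illustrative): the driver registers the file, and each task picks up its local copy.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    // In the driver: copied once to each TaskTracker at job start.
    public static void addLookupFile(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/data/lookup.txt"), conf);
    }

    // Inside a Mapper/Reducer's configure(JobConf): get the local copies.
    public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}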
Q24. What mechanism does the Hadoop framework provide to synchronize changes made in the distributed cache during runtime of the application?
This is a trick question. There is no such mechanism. The distributed cache is by design read-only during job execution.
Q25. Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters.
Q26. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?
Yes. The FileInputFormat class provides methods to add multiple directories as input to a Hadoop job.
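For example, with the old-API FileInputFormat (the directory names are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MultipleInputsSetup {
    public static void setInputs(JobConf conf) {
        // Add directories one at a time...
        FileInputFormat.addInputPath(conf, new Path("/logs/2011-01-01"));
        FileInputFormat.addInputPath(conf, new Path("/logs/2011-01-02"));
        // ...or all at once, as a comma-separated list:
        // FileInputFormat.setInputPaths(conf, "/logs/2011-01-01,/logs/2011-01-02");
    }
}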
Q28. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The Hadoop job will throw an exception and exit.
Q29. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
This is a trick question. You cannot set it directly: the number of mappers is determined by the number of input splits (mapred.map.tasks is only a hint to the framework).
Q30. How can you set an arbitrary number of reducers to be created for a job in Hadoop?
You can either pass it on the command line with -D mapred.reduce.tasks=N, or set it programmatically with JobConf's setNumReduceTasks() method.
Q31. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner, you have to do at minimum the following three things:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either
  - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
  - add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
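A sketch of those steps against the newer org.apache.hadoop.mapreduce API, where Partitioner is a class you extend (in the older mapred API it is an interface you implement); the routing rule and class name are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys that start with "US-" to partition 0, everything else by hash.
public class CountryPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().startsWith("US-")) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver: job.setPartitionerClass(CountryPartitioner.class);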
Q33. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for some reason?
This is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alerting mechanism, like an email being sent or the monitoring system raising an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, since unexpected data can very easily break the job.
Q34. Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, how did you handle it?
This is an open-ended question, but a candidate who claims to be an intermediate developer and has worked on large data sets (10-20 GB minimum) should have run into this problem. There can be many ways to handle it, but the most common is to alter your algorithm and break the job into more MapReduce phases, or to use a combiner if possible.
NO.2 MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
A. Health states checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B,C
NO.3 For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: C
NO.4 In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. It depends on the number of reduces in the job.
NO.5 In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
A. m × n (i.e., m multiplied by n)
B. n
C. m
D. m + n (i.e., m plus n)
E. m^n (i.e., m to the power of n)
Answer: A
NO.6 Which process describes the lifecycle of a Mapper?
A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: B
NO.7 Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
NO.8 When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
NO.9 Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.
A. 64
B. 100
C. 200
D. 640
Answer: C (each 100 MB file spans two 64 MB blocks, so 100 files × 2 splits = 200 mappers)
10. Identify which best defines a SequenceFile?
2. Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A
3. You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer: D
4. Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
Answer: Hadoop Streaming
5. A MapReduce job often produces a large amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F