
Hadoop basic information with Q&A

1. What is Hadoop?
Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search engine projects. It is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called MapReduce.
Hadoop runs large-scale, high-performance processing jobs to completion in spite of system changes or failures.
2. Hadoop, Why?
Suppose you need to process a 100 TB dataset.
On 1 node, scanning at 50 MB/s takes about 2 million seconds, roughly 23 days.
On a 1000-node cluster, scanning at 50 MB/s per node takes about 33 minutes.
You need an efficient, reliable and usable framework to coordinate work at that scale.
3. Where and When Hadoop
Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling)
Highly parallel, data-intensive distributed applications
Very large production deployments (GRID)
Processing lots of unstructured data
When your processing can easily be made parallel
When running batch jobs is acceptable
When you have access to lots of cheap hardware
4. Benefits of Hadoop
Hadoop is designed to run on cheap commodity hardware.
It automatically handles data replication and node failure.
It does the hard work, so you can focus on processing data.
Cost-saving, efficient and reliable data processing.
5. How Hadoop Works
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
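As a concrete illustration of the Map/Reduce paradigm, a minimal word-count job might look like the following sketch, written against the classic org.apache.hadoop.mapred API used elsewhere in this document; the class names are illustrative. Each map task processes one input split and can be re-executed on another node if its node fails; the reducers then sum the counts per word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map phase: each map task reads one input split and emits a (word, 1) pair per token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups the pairs by key, so each call sums the counts for one word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}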
6. Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop consists of:
Hadoop Common*: The common utilities that support the other Hadoop subprojects.
HDFS*: A distributed file system that provides high-throughput access to application data.
MapReduce*: A software framework for distributed processing of large data sets on compute clusters.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
* This presentation primarily focuses on the Hadoop architecture and related subprojects.
7. Data Flow
This is the architecture of our backend data warehousing system. This system provides important information on the usage of our website, including but not limited to the number of page views of each page, the number of active users in each country, etc. We generate 3 TB of compressed log data every day. All these data are stored and processed by the Hadoop cluster, which consists of over 600 machines. A summary of the log data is then copied to Oracle and MySQL databases to make sure it is easy for people to access.
8. Hadoop Common

Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
9. HDFS
Hadoop Distributed File System (HDFS) is the primary storage system used by Hado
op applications. HDFS creates multiple replicas of data blocks and distributes t
hem on compute nodes throughout a cluster to enable reliable, extremely rapid co
mputations. Replication and locality
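As a small sketch of how an application talks to HDFS, the FileSystem API below opens a file and prints its replication factor; the file path is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name, replication and block-size settings from the cluster config files.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path, used for illustration only.
    Path file = new Path("/user/data/sample.txt");

    // The client asks the NameNode for block locations, then streams the data from DataNodes.
    FSDataInputStream in = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println("First line: " + reader.readLine());
    reader.close();

    // Per-file metadata such as the replication factor is also available.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());
  }
}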
10 MapReduce Tips
1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it's worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
Java: Good for: speed; control; binary data; working with existing Java or MapReduce libraries.
Pipes: Good for: working with existing C++ libraries.
Streaming: Good for: writing MapReduce programs in scripting languages.
Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): Good for: Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction.
Pig, Hive, Cascading: Good for: higher-level abstractions; joins; nested data.
While there are no hard and fast rules, in general, we recommend using pure Java for large, recurring jobs, Hive for SQL-style analysis and data warehousing, and Pig or Streaming for ad-hoc analysis.
2. Consider your input data chunk size
Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs?
Answers to these questions determine how you store and process data using HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.
3. Use SequenceFile and MapFile containers
SequenceFiles are a very useful tool. They are:
Splittable. So they work well with MapReduce: each map gets an independent split to work on.
Compressible. By using block compression you get the benefits of compression (use less disk space, faster to read and write), while still keeping the file splittable.
Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.
However, both are Java-centric, so you can't read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro's DataFileWriter, although there is no MapReduce integration yet.)
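A minimal sketch of writing and then reading a SequenceFile of (Text, IntWritable) records is shown below; the output path is illustrative, and block compression would be enabled through the createWriter overloads that take a CompressionType.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");   // hypothetical output path

    // Write a few (Text, IntWritable) records.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, IntWritable.class);
    try {
      for (int i = 0; i < 5; i++) {
        writer.append(new Text("key-" + i), new IntWritable(i));
      }
    } finally {
      writer.close();
    }

    // Read the records back in the order they were written.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      IntWritable value = new IntWritable();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}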
4. Implement the Tool interface
If you are writing a Java driver, then consider implementing the Tool interface
to get the following options for free:
-D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)

-files to put files into the distributed cache


-archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
-libjars to put JAR files on the task classpath
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), MyJob.class);
    // run job ...
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}
By taking this step you also make your driver more testable, since you can inject arbitrary configurations using Configured's setConf() method.
5. Chain your jobs
It's often natural to split a problem into multiple MapReduce jobs. The benefits are a better decomposition of the problem into smaller, more easily understood (and more easily tested) steps. It can also boost re-usability. Also, by using the Fair Scheduler, you can run a small job promptly, and not worry that it will be stuck in a long queue of (other people's) jobs.
ChainMapper and ChainReducer (in 0.20.0) are worth checking out too, as they allow you to use smaller units within one job, effectively allowing multiple mappers before and after the (single) reducer: M+RM*.
Pig and Hive do this kind of thing all the time, and it can be instructive to understand what they are doing behind the scenes by using EXPLAIN, or even by reading their source code, to make you a better MapReduce programmer. Of course, you could always use Pig or Hive in the first place.
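The simplest form of chaining is to run two dependent jobs back to back from one driver, feeding the first job's output directory to the second job as its input. The sketch below assumes hypothetical mapper and reducer classes (left as comments) and uses the blocking JobClient.runJob() call so the second job only starts when the first has finished.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // First job: e.g. extract and normalise records (mapper/reducer classes are hypothetical).
    JobConf first = new JobConf(ChainedJobsDriver.class);
    first.setJobName("first-pass");
    // first.setMapperClass(FirstPassMapper.class);
    // first.setReducerClass(FirstPassReducer.class);
    FileInputFormat.setInputPaths(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    JobClient.runJob(first);                 // blocks until the first job completes

    // Second job: consumes the partitioned output of the first job directly.
    JobConf second = new JobConf(ChainedJobsDriver.class);
    second.setJobName("second-pass");
    // second.setMapperClass(SecondPassMapper.class);
    // second.setReducerClass(SecondPassReducer.class);
    FileInputFormat.setInputPaths(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    JobClient.runJob(second);
  }
}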
6. Favor multiple partitions
We're used to thinking that the output data is contained in one file. This is OK for small datasets, but if the output is large (more than a few tens of gigabytes, say) then it's normally better to have a partitioned file, so you take advantage of the cluster parallelism for the reducer tasks. Conceptually, you should think of your output/part-* files as a single file: the fact that it is broken up is an implementation detail. Often, the output forms the input to another MapReduce job, so it is naturally processed as a partitioned output by specifying output as the input path to the second job.
In some cases the partitioning can be exploited. CompositeInputFormat, for example, uses the partitioning to do joins efficiently on the map-side. Another example: if your output is a MapFile, you can use MapFileOutputFormat's getReaders() method to do lookups on the partitioned output.
For small outputs you can merge the partitions into a single file, either by setting the number of reducers to 1 (the default), or by using the handy -getmerge option on the filesystem shell:
% hadoop fs -getmerge hdfs-output-dir local-file
This concatenates the HDFS files hdfs-output-dir/part-* into a single local file.
7. Report progress
If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don't encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don't process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:
Call setStatus() on Reporter to set a human-readable description of the task's progress
Call incrCounter() on Reporter to increment a user counter
Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
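For example, a CPU-bound simulation mapper might report status, a user counter and progress every few thousand iterations, as in the sketch below; the class name, counter name and amount of work are all illustrative. The counter call also demonstrates the debugging technique discussed in the next tip.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// A long-running, CPU-bound map that reports progress so the framework
// does not kill it after mapred.task.timeout (10 minutes by default).
public class SimulationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  enum SimulationCounters { ITERATIONS_DONE }   // user counter for debugging/monitoring

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    int totalIterations = 1000000;              // hypothetical amount of work per record
    for (int i = 0; i < totalIterations; i++) {
      // ... expensive computation for iteration i ...

      if (i % 10000 == 0) {
        reporter.setStatus("iteration " + i + " of " + totalIterations);
        reporter.incrCounter(SimulationCounters.ITERATIONS_DONE, 10000);
        reporter.progress();                    // tell Hadoop the task is still alive
      }
    }
    output.collect(value, new Text("done"));
  }
}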
8. Debug with status and counters
Using Reporter's setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred.
Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file). You can do more advanced debugging with debug scripts.
9. Tune at the job level before the task level
Before you start profiling tasks there are a number of job-level checks to run through:
Have you set the optimal number of mappers and reducers?
The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
Have you set a combiner (if your algorithm allows it)?
Have you enabled intermediate compression? (See JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output.)
If using custom Writables, have you provided a RawComparator?
Finally, there are a number of low-level MapReduce shuffle parameters that you can tune to get improved performance. Several of these job-level settings are sketched below.
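A rough sketch of applying these checks on a JobConf is shown below; the reducer count and the choice of combiner are illustrative rather than recommendations for any particular cluster.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class JobLevelTuning {
  public static void apply(JobConf job) {
    // Reducers: roughly the number of reduce slots in the cluster, minus a few to
    // allow for failures, so all reducers finish in a single wave. The value 27 is illustrative.
    job.setNumReduceTasks(27);

    // Combiner: if the algorithm allows it, reuse the reducer class, e.g.
    // job.setCombinerClass(WordCount.Reduce.class);   // hypothetical class from the earlier sketch

    // Intermediate (map output) compression, equivalent to mapred.compress.map.output=true.
    job.setCompressMapOutput(true);

    // Declaring map output types explicitly keeps the shuffle settings unambiguous.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
  }
}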
10. Let someone else do the cluster administration
Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.
Hadoop interview questions
Q1. Name the most common InputFormats defined in Hadoop. Which one is the default?
The following three are the most common InputFormats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.

Q2. What is the difference between the TextInputFormat and KeyValueInputFormat classes?

TextInputFormat: Reads lines of text files and provides the byte offset of the line as the key to the Mapper and the actual line as the value.

KeyValueInputFormat: Reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value.

Q3. What is an InputSplit in Hadoop?

When a Hadoop job is run, it splits input files into chunks and assigns each split to a mapper to process. Such a chunk is called an InputSplit.

Q4. How is the splitting of files invoked in the Hadoop framework?

It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.

Q5. Consider this scenario: in an M/R system,

- the HDFS block size is 64 MB
- the input format is FileInputFormat
- we have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
Q6. What is the purpose of the RecordReader in Hadoop?
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
Q7. After the Map phase finishes, the Hadoop framework performs "Partitioning, Shuffle and Sort". Explain what happens in this phase.
- Partitioning
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

Q9. If no custom partitioner is defined in Hadoop, then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.

Q10. What is a Combiner?

The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
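As a sketch: when the reduce operation is a simple per-key sum, Hadoop's built-in LongSumReducer (from org.apache.hadoop.mapred.lib) can act as both the combiner and the reducer, so each node pre-aggregates its mappers' output before anything crosses the network.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class CombinerSetup {
  public static void configure(JobConf job) {
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);

    // LongSumReducer ships with Hadoop and simply sums the LongWritable values per key.
    // Because summation is associative and commutative, the same class can serve as both
    // combiner and reducer: the combiner pre-aggregates each node's map output locally,
    // so far fewer (key, value) pairs are shuffled across the network to the reducers.
    job.setReducerClass(LongSumReducer.class);
    job.setCombinerClass(LongSumReducer.class);
  }
}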
Q12. What is the JobTracker?
The JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Q13. What are some typical functions of the JobTracker?
The following are some typical tasks of the JobTracker:
- Accepts jobs from clients
- Talks to the NameNode to determine the location of the data
- Locates TaskTracker nodes with available slots at or near the data
- Submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers

Q14. What is the TaskTracker?

The TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Q15. What is the relationship between Jobs and Tasks in Hadoop?

One job is broken down into one or many tasks in Hadoop.

Q16. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than 4 times (the default setting, which can be changed) will it kill the job.
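The retry limit is a per-job setting. A small sketch of changing it through JobConf follows; the value 6 is purely illustrative, and the corresponding configuration properties are noted in the comments.

import org.apache.hadoop.mapred.JobConf;

public class RetrySettings {
  public static void configure(JobConf job) {
    // By default a failed task is retried up to 4 times (on other TaskTrackers where
    // possible) before the whole job is declared failed. Both limits can be raised
    // for flaky workloads.
    job.setMaxMapAttempts(6);        // same as mapred.map.max.attempts
    job.setMaxReduceAttempts(6);     // same as mapred.reduce.max.attempts
  }
}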

Q17. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism does Hadoop provide to combat this?
Speculative execution.

Q18. How does speculative execution work in Hadoop?

The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Q19. Using the command line in Linux, how will you
- see all jobs running in the Hadoop cluster
- kill a job
Answer:
- hadoop job -list
- hadoop job -kill jobid

Q20. What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

Q21. What is the characteristic of the Streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Q22. What is the Distributed Cache in Hadoop?

The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Q23. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
The Distributed Cache is much faster. It copies the file to all TaskTrackers at the start of the job. If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy of the cached file. On the other hand, if you write code in the MR job to read the file from HDFS, then every mapper will try to access it from HDFS, so a TaskTracker running 100 map tasks will try to read the file 100 times from HDFS. HDFS is also not very efficient when used like this. (A sketch of the pattern follows below.)
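A sketch of the pattern described above: the driver registers the file with the DistributedCache once, and each task loads it from local disk in configure() before any records are processed. The lookup-file path and its tab-separated format are assumptions made for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  // In the driver (not shown), the lookup file is registered once per job, e.g.:
  // DistributedCache.addCacheFile(new java.net.URI("/user/data/lookup.txt"), job);  // hypothetical path

  @Override
  public void configure(JobConf job) {
    try {
      // The framework has already copied the cached files to this node's local disk.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);   // assumes tab-separated key/value lines
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to load distributed cache file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Enrich each record with the locally cached lookup data (a simple map-side join).
    String enriched = lookup.get(value.toString());
    output.collect(value, new Text(enriched == null ? "UNKNOWN" : enriched));
  }
}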

Q24. What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache during the runtime of the application?
This is a trick question. There is no such mechanism. The Distributed Cache is, by design, read-only during job execution.

Q25. Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters.

Q26. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to a Hadoop job?
Yes. The input format class (for example, FileInputFormat) provides methods to add multiple directories as input to a Hadoop job.

Q27. Is it possible to have a Hadoop job write its output to multiple directories? If yes, then how?
Yes, by using the MultipleOutputs class.

Q28. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The Hadoop job will throw an exception and exit.
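A common workaround in driver code is therefore to delete a stale output directory before submitting the job, as in the sketch below; it uses the standard FileSystem API, and the recursive delete should obviously be used with care.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class OutputDirCleanup {
  public static void prepareOutput(JobConf job, Path output) throws Exception {
    // Remove a stale output directory so a re-run does not fail with
    // "output directory already exists". The second argument deletes recursively.
    FileSystem fs = FileSystem.get(job);
    if (fs.exists(output)) {
      fs.delete(output, true);
    }
    FileOutputFormat.setOutputPath(job, output);
  }
}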

Q29. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
This is a trick question. You cannot set it directly; the number of map tasks is determined by the number of input splits.

Q30. How can you set an arbitrary number of reducers to be created for a job in Hadoop?

You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.

Q31. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you have to do, at minimum, the following three things (a sketch follows below):
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either
  - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
  - add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
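Below is a hedged sketch of such a partitioner using the org.apache.hadoop.mapreduce API; the routing rule (keys starting with a digit go to partition 0, everything else is hashed over the remaining partitions) is invented purely for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by a domain-specific rule instead of the default hash.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions <= 1) {
      return 0;
    }
    String k = key.toString();
    char first = k.isEmpty() ? ' ' : k.charAt(0);
    if (Character.isDigit(first)) {
      return 0;                                   // all numeric keys go to partition 0
    }
    // Hash the rest across partitions 1..numPartitions-1.
    return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

// In the driver, register it on the Job object:
//   job.setPartitionerClass(FirstCharPartitioner.class);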

Q32. How did you debug your Hadoop code?

There can be several ways of doing this, but the most common ways are
- by using counters
- through the web interface provided by the Hadoop framework

Q33. Did you ever build a production process in Hadoop? If yes, then what was the process when your Hadoop job failed for any reason?
This is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, e.g. an email is sent or the monitoring system raises an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, since unexpected data can very easily break the job.

Q34. Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, then how did you handle it?
This is an open-ended question, but a candidate who claims to be an intermediate developer and has worked on large data sets (10-20 GB minimum) should have run into this problem. There can be many ways to handle it, but the most common way is to alter your algorithm and break the job down into more MapReduce phases, or to use a combiner if possible.

CCD-410 Free Demo Download: http://www.itcertmaster.com/CCD-410.html


NO.1 You need to move a file titled weblogs into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
A. Increase the block size on all current files in HDFS.
B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.
Answer: C

NO.2 MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
A. Health state checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B,D
NO.3 For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: E
NO.4 In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. It depends on the number of reduces in the job.
B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.
Answer: D

NO.5 In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
A. m x n (i.e., m multiplied by n)
B. n
C. m
D. m + n (i.e., m plus n)
E. m^n (i.e., m to the power of n)
Answer: A
NO.6 Which process describes the lifecycle of a Mapper?
A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: C
NO.7 Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
NO.8 When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
NO.9 Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.
A. 64
B. 100
C. 200
D. 640
Answer: C
10. Identify which best defines a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D

1. When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
2. Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: C
3. You are developing a combiner that takes as input Text keys and IntWritable values, and emits Text keys and IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer: D
4. Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred
Answer: D
5. How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Answer: A
6. Assuming default settings, which best describes the order of data provided to a reducer's reduce method?
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order, but the values associated with each key are in no predictable order.
Answer: D
7. You wrote a map function that throws a runtime exception when it encounters a control character in the input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts
Answer: E
8. You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.
A. combine
B. map
C. init
D. configure
Answer: D
9. You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F
10. Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
