Hortonworks
https://www.pass4sures.com/
Question 1
Which Hadoop component is responsible for managing the distributed file system metadata?
A. NameNode
B. Metanode
C. DataNode
D. NameSpaceManager
Answer: A
Question 2
Answer: B
Question 3
Which TWO of the following statements are true about HDFS? Choose 2 answers
Answer: A, B
Question 4
In Hadoop 2.2, which one of the following statements is true about a standby NameNode?
The Standby NameNode:
A. Communicates directly with the active NameNode to maintain the state of the active NameNode.
B. Receives the same block reports as the active NameNode.
C. Runs on the same machine and shares the memory of the active NameNode.
D. Processes all client requests and block reports from the appropriate DataNodes.
Answer: B
Question 5
Which HDFS command uploads a local file X into an existing HDFS directory Y?
A. hadoop scp X Y
B. hadoop fs -localPut X Y
C. hadoop fs -put X Y
D. hadoop fs -get X Y
Answer: C
Question 6
Answer: B
Question 7
In Hadoop 2.2, which TWO of the following processes work together to provide automatic failover of the NameNode?
Choose 2 answers
A. ZKFailoverController
B. ZooKeeper
C. QuorumManager
D. JournalNode
Answer: A, D
Question 8
Which one of the following statements is FALSE regarding the communication between DataNodes and a federation of
NameNodes in Hadoop 2.2?
Answer: A
Question 9
What is the term for the process of moving map outputs to the reducers?
A. Reducing
B. Combining
C. Partitioning
D. Shuffling and sorting
Answer: D
Question 10
A. The job's Partitioner shuffles and sorts all (key, value) pairs and sends the output to all reducers
B. The default Hash Partitioner sends key-value pairs with the same key to the same Reducer
C. The reduce method is invoked once for each unique value
D. The Mapper must sort its output of (key, value) pairs in descending order based on value
Answer: A
Question 11
What are the TWO main components of the YARN ResourceManager process? Choose 2 answers
A. Job Tracker
B. Task Tracker
C. Scheduler
D. Applications Manager
Answer: C, D
Question 12
Which one of the following statements describes the relationship between the NodeManager and the
ApplicationMaster?
Answer: D
Question 13
Which one of the following statements describes the relationship between the ResourceManager and the
ApplicationMaster?
Answer: A
Question 14
Which one of the following statements regarding the components of YARN is FALSE?
Answer: D
Question 15
Which YARN component is responsible for monitoring the success or failure of a Container?
A. ResourceManager
B. ApplicationMaster
C. NodeManager
D. JobTracker
Answer: A
Question 16
Answer: B
Question 17
Which one of the following statements describes a Pig bag, tuple, and map, respectively?
A. Unordered collection of maps, ordered collection of tuples, ordered set of key:value pairs
B. Unordered collection of tuples, ordered set of fields, set of key-value pairs
Answer: B
Question 18
Answer: C
Question 19
Which Pig statement combines A by its first field and B by its second field?
Answer: B
Question 20
A Pig JOIN statement that combined relations A by its first field and B by its second field would produce what output?
A. 2 Jim Chris 2
3 Terry 3
4 Brian 4
B. 2 cherry
2 cherry
3 orange
4 peach
C. 2 cherry Jim, Chris
3 orange Terry
4 peach Brian
D. 2 cherry Jim 2
2 cherry Chris 2
3 orange Terry 3
4 peach Brian 4
Answer: D
Question 21
A. Option A
B. Option B
C. Option C
D. Option D
Answer: D
Question 22
Answer: D
Question 23
Which two of the following statements are true about Pig's approach toward data? Choose 2 answers
Answer: B, E
Question 24
Answer: D
Question 25
What command to define B would produce the output (M,62,95l02) when invoking the DUMP operator on B?
Answer: A
Question 26
Which two of the following are true about this trivial Pig program? (Choose two)
Answer: A, D
Question 27
A. The logevents relation represents the data from the my.log file, using a comma as the parsing delimiter
B. The logevents relation represents the data from the my.log file, using a tab as the parsing delimiter
C. The first field of logevents must be a properly-formatted date string or the table returns an error
D. The statement is not a valid Pig command
Answer: B
Question 28
Answer: A
Question 29
Answer: B
Question 30
To use a Java user-defined function (UDF) with Pig, what must you do?
Answer: C
Question 31
Which TWO of the following statements are true regarding Hive? Choose 2 answers
A. Useful for data analysts familiar with SQL who need to do ad-hoc queries
B. Offers real-time queries and row level updates
C. Allows you to define a structure for your unstructured Big Data
D. Is a relational database
Answer: A, C
Question 32
A. Records can only be added to the table using the Hive INSERT command.
B. When the table is dropped, the underlying folder in HDFS is deleted.
C. Hive dynamically defines the schema of the table based on the FROM clause of a SELECT query.
D. Hive dynamically defines the schema of the table based on the format of the underlying data.
Answer: B
Question 33
Assuming the statements above execute successfully, which one of the following statements is true?
Answer: A
Question 34
Answer: C
Question 35
Which one of the following Hive commands uses an HCatalog table named x?
A. SELECT * FROM x;
B. SELECT x.* FROM org.apache.hcatalog.hive.HCatLoader('x');
C. SELECT * FROM org.apache.hcatalog.hive.HCatLoader('x');
D. Hive commands cannot reference an HCatalog table
Answer: C
Question 36
Which one of the following classes would a Pig command use to store data in a table defined in HCatalog?
A. org.apache.hcatalog.pig.HCatOutputFormat
B. org.apache.hcatalog.pig.HCatStorer
C. No special class is needed for a Pig script to store data in an HCatalog table
D. Pig scripts cannot use an HCatalog table
Answer: B
Question 37
Assuming the statements above execute successfully, which one of the following statements is true?
A. Hive reformats File1 into a structure that Hive can access and moves it into /user/joe/x/
B. The file named File1 is moved to /user/joe/x/
C. The contents of File1 are parsed as comma-delimited rows and loaded into /user/joe/x/
D. The contents of File1 are parsed as comma-delimited rows and stored in a database
Answer: B
Question 38
Which one of the following statements describes a Hive user-defined aggregate function?
Answer: A
Question 39
Answer: A
Question 40
Answer: D
Question 41
Answer: B
Question 42
A. A bigram of the top 80 sentences that contain the substring "you are" in the lines column of the inputdata table.
B. An 80-value ngram of sentences that contain the words "you" or "are" in the lines column of the inputdata table.
C. A trigram of the top 80 sentences that contain "you are" followed by a null space in the lines column of the inputdata table.
D. A frequency distribution of the top 80 words that follow the subsequence "you are" in the lines column of the inputdata table.
Answer: D
Question 43
Which one of the following files is required in every Oozie Workflow application?
A. job.properties
B. Config-default.xml
C. Workflow.xml
D. Oozie.xml
Answer: C
Question 44
A. mapreduce
B. pig
C. hive
D. mrunit
Answer: D
Question 45
When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.
Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data. Another job that starts later and would actually use the reduce slots cannot use them.
You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis.
Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "When are the reducers started in a MapReduce job?"
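As a cluster-wide setting, the property above lives in mapred-site.xml; a sketch of such a fragment might look like this (the 0.90 value simply illustrates the "above 0.9" recommendation, not a required default):

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.90</value>
  <description>Start reducers only after 90% of the map tasks have completed.</description>
</property>
```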
Question 46
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How does the Client communicate with HDFS?"
Question 47
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
Answer: D
Question 48
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred
Answer: D
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)
Question 49
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Answer: A
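A minimal Python sketch of this default behavior (hypothetical pairs; keys are sorted by the framework, while values stay in arrival order):

```python
from itertools import groupby

# Intermediate (key, value) pairs as they might arrive from several mappers.
pairs = [("the", 3), ("fox", 1), ("the", 1), ("dog", 2), ("the", 2)]

# The framework sorts by key only; a stable sort leaves each key's values
# in their arrival order, i.e. values for a given key are NOT sorted.
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}

print(sorted(grouped))   # keys reach the reducer in sorted order
print(grouped["the"])    # values unsorted: [3, 1, 2]
```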
Question 50
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order
Answer: D
Question 51
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts
Answer: E
There will be four failed task attempts for each of the five file splits.
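The arithmetic behind answer E, sketched in Python (assuming a split fails on its first control character, so each of the five splits exhausts all four attempts):

```python
max_map_attempts = 4   # mapred.max.map.attempts
failing_splits = 5     # every split contains at least one control character

# Each failing split is retried until the attempt limit is reached,
# and every attempt on it fails.
failed_attempts = max_map_attempts * failing_splits
print(failed_attempts)  # 20
```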
Question 52
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
A. combine
B. map
C. init
D. configure
Answer: D
Question 53
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "What are combiners? When should I use a combiner in my MapReduce Job?"
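A toy Python simulation of the idea (hypothetical word-count pairs; the local pre-aggregation stands in for a Combiner and shrinks what crosses the network):

```python
from collections import Counter

# Intermediate output of one mapper for a word-count job.
mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1), ("dog", 1)]

# Without a combiner, all 5 pairs are shipped to the reducers.
shipped_without = len(mapper_output)

# A combiner applies the (commutative, associative) sum locally first,
# so only one pair per distinct key leaves the mapper.
combined = Counter()
for key, value in mapper_output:
    combined[key] += value
shipped_with = len(combined)

print(shipped_without, shipped_with)  # 5 3
```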
Question 54
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Note:
* Join Algorithms in MapReduce
A) Reduce-side join
B) Map-side join
C) In-memory join
- Striped variant
- Memcached variant
Question 55
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
A. Intermediate data is streamed across the network from Mapper to the Reduce and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Answer: C
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "Where is the Mapper Output (intermediate key-value data) stored?"
Question 56
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
Answer: A
Question 57
Answer: A, B
Question 58
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
Answer: C
Question 59
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.
A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
B. The data types specified in the HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.
Answer: D
The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
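A rough Python imitation of what TextInputFormat hands to each map call (plain int/str stand in for LongWritable/Text; the helper name is illustrative, not Hadoop's API):

```python
def text_input_format(data: bytes):
    """Yield (byte_offset, line) pairs, mimicking TextInputFormat."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # Key: byte offset of the line; value: line content without the newline.
        yield offset, raw.rstrip(b"\r\n").decode()
        offset += len(raw)

records = list(text_input_format(b"first line\nsecond\n"))
print(records)  # [(0, 'first line'), (11, 'second')]
```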
Question 60
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Answer: B
Question 61
Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
Answer: A
Reference: How Map and Reduce operations are actually carried out
Question 62
A. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
Answer: E
Question 63
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer's reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
Only one key-value pair will be passed from the two (the, 1) key-value pairs.
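Sketched in Python (the shuffle/sort groups values by key, so reduce is called once per distinct key):

```python
pairs = [("the", 1), ("fox", 1), ("faster", 1), ("than", 1), ("the", 1), ("dog", 1)]

# Group values by key, as the shuffle/sort phase does before reduce().
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

print(len(groups))    # 5 distinct keys reach the reduce method
print(groups["the"])  # the two (the, 1) pairs collapse into one key: [1, 1]
```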
Question 64
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming
Answer: C
Question 65
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
Answer: C
Question 66
Given a directory of files with the following structure: line number, tab character, string:
Example:
1 abialkjfkaoasdfksdlkjhqweroij
2 kadfhuwqounahagtnbiaswslmnbfgy
3 kjfeiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line:
conf.setInputFormat(____.class); ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat
Answer: C
http://stackoverflow.com/questions/9s21s54/how-to-parse-customwritable-from-text-in-hadoop
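The key/value line parsing that this InputFormat performs can be imitated in Python (illustrative helper name and sample data, not Hadoop's actual implementation; each record splits at the first tab):

```python
def key_value_records(text: str):
    """Split each line at the first tab into a (key, value) record."""
    for line in text.splitlines():
        key, _, value = line.partition("\t")
        yield key, value

# Lines shaped like the question's input: line number, tab, string.
sample = "1\tabialkjf\n2\tkadfhuwq\n"
print(list(key_value_records(sample)))
```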
Question 67
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip
Answer: C
Question 68
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and the task that finishes first is used. This is called:
A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution
Answer: E
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks by its own decision:
a) Task does not report progress during timeout (default is 10 minutes)
b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
c) Speculative execution causes results of a task not to be needed since it has completed elsewhere.
Reference: Difference between failed tasks and killed tasks
Question 69
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: C
Question 70
D. All data for a given value, regardless of which mapper(s) produced it.
Answer: C
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
All values with the same key are presented to a single reduce task.
Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce
Question 71
Answer: C
The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Reference: MapReduce Tutorial
Question 72
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot.
What determines how the JobTracker assigns each map task to a TaskTracker?
Answer: E
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How does the JobTracker schedule a task?"
Question 73
Answer: D
Question 74
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?
A. The file will be marked as corrupted if data node B fails during the creation of the file.
B. Each data node locks the local file to prohibit concurrent readers and writers of the file.
C. Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
D. The file can be accessed if at least one of the data nodes storing the file is available.
Answer: D
HDFS keeps three copies of a block on three different datanodes to protect against true data corruption. HDFS also tries to distribute these three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid corrupted files.
Note:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS; 2 copies are stored on datanodes on the same rack and the 3rd copy on a different rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How are HDFS blocks replicated?"
Question 75
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.
Answer: D
FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
Reference: org.apache.hadoop.mapreduce.lib.input, Class FileInputFormat<K,V>
Question 76
A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: B
For each map task that runs, the TaskTracker creates a new instance of your Mapper.
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of Extraction and Transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different, Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper
Question 77
Determine which best describes when the reduce method is first called in a MapReduce job?
A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
Answer: B
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When are the reducers started in a MapReduce job?
Question 78
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer's reduce method be invoked?
A. 6
B. 3
C. 1
D. 0
E. 5
Answer: B
reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
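The grouping behavior can be checked with a small stand-alone simulation: group the emitted pairs by key, then count the groups. This is plain Java standing in for Hadoop's shuffle-and-sort step (`group` is an illustrative helper, not a Hadoop API):

```java
import java.util.*;

public class ReduceCallCount {
    // Groups (key, value) pairs by key, mimicking the shuffle/sort phase:
    // reduce() is invoked once per distinct key, with all of that key's values.
    public static Map<String, List<String>> group(String[][] pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] emitted = {
            {"Apple", "Red"}, {"Banana", "Yellow"}, {"Apple", "Yellow"},
            {"Cherry", "Red"}, {"Apple", "Green"}
        };
        Map<String, List<String>> grouped = group(emitted);
        // One reduce() call per distinct key: Apple, Banana, Cherry -> 3 calls.
        System.out.println("reduce() invocations: " + grouped.size());
        System.out.println("Apple values: " + grouped.get("Apple"));
    }
}
```

Five collect() calls but only three distinct keys, hence answer B.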
Question 79
To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?
A. Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
Answer: D
The DistributedCache distributes read-only files needed by tasks to every worker node; reading the file once in the mapper's configure() method avoids re-reading it for every input record. (Hadoop has no "DataCache".)
Question 80
In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?
Answer: B
Note:
* Input to the Reducer is the sorted output of the mappers.
* The framework calls the application's Reduce function once for each unique key in the sorted order.
* Example:
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
Question 81
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
A. Processor and network I/O
B. Disk I/O and network I/O
Answer: B
Question 82
You want to count the number of occurrences for each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case, and why or why not?
A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
C. No, because the Reducer and Combiner are separate interfaces.
D. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value.
E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.
Answer: A
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
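The "commutative and associative" condition can be demonstrated with a self-contained sketch (plain Java, not the Hadoop Reducer API; `Pair`, `sumByKey` and `combine` are illustrative names): running the reduce logic locally on each mapper's output shrinks the data shuffled, while the final totals come out identical however often the combine step runs.

```java
import java.util.*;

public class CombinerInvariant {
    // (word, count) pair emitted by a mapper; standing in for (Text, IntWritable).
    public static class Pair {
        final String word; final int count;
        public Pair(String word, int count) { this.word = word; this.count = count; }
    }

    // The reduce operation: sum counts per word. Reused unchanged as the combiner.
    public static Map<String, Integer> sumByKey(List<Pair> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Pair p : pairs) totals.merge(p.word, p.count, Integer::sum);
        return totals;
    }

    // Apply the reduce logic locally to one mapper's output: fewer records
    // cross the network, totals unchanged because + is commutative and associative.
    public static List<Pair> combine(List<Pair> mapperOutput) {
        List<Pair> out = new ArrayList<>();
        sumByKey(mapperOutput).forEach((w, c) -> out.add(new Pair(w, c)));
        return out;
    }

    public static void main(String[] args) {
        List<Pair> mapper1 = List.of(new Pair("the", 1), new Pair("cat", 1), new Pair("the", 1));
        List<Pair> mapper2 = List.of(new Pair("the", 1), new Pair("dog", 1));

        List<Pair> raw = new ArrayList<>(mapper1);
        raw.addAll(mapper2);
        List<Pair> combined = new ArrayList<>(combine(mapper1));
        combined.addAll(combine(mapper2));

        // 5 records shuffled without the combiner, 4 with it.
        System.out.println("records shuffled: " + raw.size() + " vs " + combined.size());
        // Final totals are identical either way.
        System.out.println(sumByKey(raw).equals(sumByKey(combined)));
    }
}
```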
Question 83
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
A. TaskTracker
B. NameNode
C. DataNode
D. JobTracker
E. Secondary NameNode
Answer: D
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process run on any Hadoop cluster. JobTracker runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. JobTracker in Hadoop performs the following actions (from the Hadoop Wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
Question 84
Which project gives you a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data?
A. HBase
B. Hue
C. Pig
D. Hive
E. Oozie
F. Flume
G. Sqoop
Answer: A
Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
Reference: http://hbase.apache.org/ (When would I use HBase? First sentence)
Question 85
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
A. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
B. They would see the current state of the file, up to the last bit written by the command.
C. They would see the current state of the file through the last completed block.
D. They would see no content until the whole file is written and closed.
Answer: C
Question 86
Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
A. Oozie
B. Flume
C. Pig
D. Hue
E. Hive
F. Sqoop
G. fuse-dfs
Answer: F
Question 87
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a path object representing this directory?
Answer: C
Files starting with '_' are considered 'hidden', like unix files starting with '.'.
'#' characters are allowed in HDFS file names.
Question 88
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
Answer: D
* It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
Note:
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
Question 89
A combiner reduces:
A. The number of values across different keys in the iterator supplied to a single reduce method call.
B. The amount of intermediate data that must be transferred between the mapper and reducer.
C. The number of input files a mapper must process.
D. The number of output files a reducer must produce.
Answer: B
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
Question 90
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
Answer: D
Explanation:
From Cloudera Training Course:
A task attempt is a particular instance of an attempt to execute a task
– There will be at least as many task attempts as there are tasks
– If a task attempt fails, another will be started by the JobTracker
– Speculative execution can also result in more task attempts than completed tasks
Question 91
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
Answer: B, C
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
Note:
The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) the JobTracker:
* Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
* Managing the parallel execution of any specific job. Under YARN, this will be done separately for each job.
Reference: Apache Hadoop YARN – Concepts & Applications
Question 92
A. Algorithms that require applying the same mathematical function to large numbers of individual binary records.
B. Relational operations on large amounts of structured and semi-structured data.
C. Algorithms that require global, shared state.
D. Large-scale graph algorithms that require one-step link traversal.
E. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Answer: C
See 3) below.
Limitations of MapReduce – where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the summation of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map reduce process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional, and the Reduce step must not generate a large amount of data or there will be a serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).
Reference: Limitations of MapReduce – where not to use MapReduce
Question 93
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
D. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Answer: C
Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.
Reference: manipulating iterators in mapreduce
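Why this matters in practice: if you save the references handed out by such an iterator instead of copying the values, every saved reference ends up pointing at the last value. A minimal stand-alone imitation of the behavior (a mutable `Holder` class standing in for IntWritable; this is not Hadoop code):

```java
import java.util.*;

public class ReusedIteratorSketch {
    // Mutable box standing in for Hadoop's IntWritable.
    public static class Holder {
        private int value;
        public int get() { return value; }
        public void set(int v) { value = v; }
    }

    // Iterator that reuses ONE Holder instance, overwriting its contents on
    // every next() call - the same trick Hadoop uses to avoid allocation.
    public static Iterator<Holder> reusingIterator(int[] values) {
        Holder shared = new Holder();
        return new Iterator<Holder>() {
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public Holder next() { shared.set(values[i++]); return shared; }
        };
    }

    // Copying the primitive out of the holder is safe; storing the
    // reference is not, because the holder is mutated on each next().
    public static List<Integer> copyValues(int[] values) {
        List<Integer> copied = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(values);
        while (it.hasNext()) copied.add(it.next().get());
        return copied;
    }

    public static void main(String[] args) {
        int[] vals = {10, 20, 30};
        List<Holder> refs = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(vals);
        while (it.hasNext()) refs.add(it.next());
        // All three saved references point at the same object, now holding 30.
        System.out.println(refs.get(0).get() + " " + refs.get(1).get()); // 30 30
        System.out.println(copyValues(vals)); // [10, 20, 30]
    }
}
```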
Question 94
Answer: C
By default, Hive uses an embedded Derby database to store metadata information. The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa. They chose this approach as opposed to storing this information in HDFS because they need the Metastore to be very low latency. The DataNucleus layer allows them to plug in many different RDBMS technologies.
Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
* Features of Hive include:
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Reference: Store Hive Metadata into RDBMS
Question 95
Analyze each scenario below and identify which best describes the behavior of the default partitioner?
A. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
C. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
D. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
Answer: D
The default partitioner computes a hash value for the key and assigns the partition based on this result.
The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the total number of partitions to determine which partition to send a given (key, value) pair to.
In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs in. The number of partitions is then equal to the number of reduce tasks for the job.
Reference: Getting Started With (Customized) Partitioning
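The logic is small enough to sketch directly. HashPartitioner's getPartition is essentially `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the mask clears the sign bit so negative hash codes cannot produce a negative partition. A stand-alone re-implementation of that arithmetic (a sketch, not the Hadoop class itself):

```java
public class HashPartitionSketch {
    // Mirrors the default HashPartitioner logic: mask off the sign bit,
    // then take the hash modulo the number of reducers.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[]{"Apple", "Banana", "Cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
        // The same key always maps to the same reducer, which is what
        // guarantees all values for a key meet in a single reduce() call.
        System.out.println(getPartition("Apple", reducers) == getPartition("Apple", reducers));
    }
}
```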
Question 96
You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
Answer: C
Question 97
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
Answer: A
A MapReduce job with m mappers and n reducers involves up to m × n distinct copy operations, since each mapper may have intermediate output going to every reducer.
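A quick sanity check of that bound: each of the m map tasks writes one output partition per reducer, and each of the n reducers fetches its partition from every map task, giving m × n fetches in total. Sketched with hypothetical job sizes:

```java
public class ShuffleCopyCount {
    // Each of the n reducers fetches one map-output partition from each of
    // the m mappers, so the shuffle performs up to m * n distinct copies.
    public static long copyOperations(long mappers, long reducers) {
        return mappers * reducers;
    }

    public static void main(String[] args) {
        // Hypothetical job sizes, just to illustrate the growth.
        System.out.println(copyOperations(100, 10));  // 1000
        System.out.println(copyOperations(500, 20));  // 10000
    }
}
```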
Question 98
A. Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
C. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
D. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Answer: A
An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Direct Acyclic Graph), specifying a sequence of actions to execute. This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).
Note: Oozie is a Java Web-Application that runs in a Java servlet-container - Tomcat - and uses a database to store:
Workflow definitions
Currently running workflow instances, including instance states and variables
Reference: Introduction to Oozie
Question 99
Which best describes what the map method accepts and emits?
A. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Answer: D
Question 100
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
A. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value and when the reduce operation is both commutative and associative.
B. When the signature of the reduce method matches the signature of the combine method.
C. Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E. Never. Combiners and reducers must be implemented separately because they serve different purposes.
Answer: A
You can use your reducer code as a combiner if the operation performed is commutative and associative.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
Question 101
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce, but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
A. SequenceFiles
B. Avro
C. JSON
D. HTML
E. XML
F. CSV
Answer: B
Question 102
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
A. Run all the nodes in your production cluster as virtual machines on your development workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
D. Run simldooop, the Apache open-source software for simulating Hadoop clusters.
Answer: C
Question 103
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C
Each file would be split into two as the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.
Note:
If you're not compressing the files then Hadoop will process your large files (say 10G) with a number of mappers related to the block size of the file.
Say your block size is 64M, then you will have ~160 mappers processing this 10G file (160*64 ≈ 10G). Depending on how CPU intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, 512M - the actual size depends on how you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size (first answer, second paragraph)
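The mapper count can be checked mechanically: for uncompressed text, TextInputFormat creates roughly ceil(fileSize / blockSize) splits per file, and the framework runs one map task per split. A small sketch using the numbers from this question (`splitsPerFile` and `totalMappers` are illustrative helpers, not Hadoop APIs):

```java
public class MapperCount {
    // One map task per input split; TextInputFormat makes roughly
    // ceil(fileSize / blockSize) splits for each uncompressed file.
    public static long splitsPerFile(long fileSizeMB, long blockSizeMB) {
        return (fileSizeMB + blockSizeMB - 1) / blockSizeMB; // ceiling division
    }

    public static long totalMappers(int numFiles, long fileSizeMB, long blockSizeMB) {
        return numFiles * splitsPerFile(fileSizeMB, blockSizeMB);
    }

    public static void main(String[] args) {
        // 100 files of 100 MB with a 64 MB block size: 2 splits each -> 200 mappers.
        System.out.println(totalMappers(100, 100, 64)); // 200
    }
}
```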
Question 104
What is a SequenceFile?
Answer: D
Question 105
Which HDFS command displays the contents of the file x in the user's HDFS home directory?
A. hadoop fs -ls x
B. hdfs fs -get x
C. hadoop fs -cat x
D. hadoop fs -cp x
Answer: C
Question 106
Which HDFS command copies an HDFS file named foo to the local filesystem as localFoo?
Answer: A
Question 107
Which of the following tools was designed to import data from a relational database into HDFS?
A. HCatalog
B. Sqoop
C. Flume
D. Ambari
Answer: B
Question 108
You want to ingest log files into HDFS. Which tool would you use?
A. HCatalog
B. Flume
C. Sqoop
D. Ambari
Answer: B