
BIG DATA Fall 2016

ASSIGNMENT 2
Total Marks 40 (Weightage 5)
For all your questions
Run your program over the given input file in standalone mode as well as in pseudo-distributed mode. Explore the log files and output statements. Upload the source code and the output file on Slate.
Input File
You are given an input text file named citation.txt. It contains information regarding the research papers published in various journals. The complete file, Citation-network V1, can be found at https://cn.aminer.org/citation. The format of the file is as follows:
#* --- paperTitle
#@ --- Authors
#t ---- Year
#c --- publication venue
#index 00---- index id of this paper
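
The record format above can be read with a few prefix checks. The following is only a minimal sketch in plain Java: the class name CitationParser, the Paper fields, and the assumption that author names on the #@ line are comma-separated are illustrative and are not given in the assignment.

```java
import java.util.ArrayList;
import java.util.List;

class CitationParser {
    /** One paper record; the field names are illustrative, not from the assignment. */
    static final class Paper {
        String title = "";
        List<String> authors = new ArrayList<>();
        int year = -1;      // -1 means the #t line was missing or not a number
        String venue = "";
        String index = "";
    }

    /** Parses the lines that belong to a single paper into a Paper object. */
    static Paper parse(List<String> lines) {
        Paper p = new Paper();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("#index")) {
                p.index = line.substring(6).trim();
            } else if (line.startsWith("#*")) {
                p.title = line.substring(2).trim();
            } else if (line.startsWith("#@")) {
                // Assumption: author names are separated by commas
                for (String a : line.substring(2).split(",")) {
                    if (!a.trim().isEmpty()) {
                        p.authors.add(a.trim());
                    }
                }
            } else if (line.startsWith("#t")) {
                try {
                    p.year = Integer.parseInt(line.substring(2).trim());
                } catch (NumberFormatException e) {
                    p.year = -1; // keep the "unknown year" marker
                }
            } else if (line.startsWith("#c")) {
                p.venue = line.substring(2).trim();
            }
        }
        return p;
    }
}
```

Unknown prefixes are skipped silently, so extra fields in the full data set would not break this sketch.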
Question 1: Map Only MapReduce Job and COUNTERs (5 marks)
Your task is to develop a MapReduce program that processes the citation.txt input file and counts the number of papers published during four time periods: 1980-1990, 1991-2000, 2001-2010 and 2011-today.
A stub is provided for this question to help you with counters.
Hint
1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0.
2. Upload the given input file to HDFS.
3. Use a counter group to count the number of papers published in each time period.
4. In your main function, retrieve the values of the counters after the job has completed and report them using System.out.println.
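
The bucketing behind the counter group can be sketched without Hadoop. This is a plain-Java stand-in, not the provided stub: the enum Period, the class PeriodCounters, and the extra OTHER bucket are assumptions made for illustration. In an actual job the driver would call job.setNumReduceTasks(0), the mapper would call context.getCounter(period).increment(1), and the driver would read the values back with job.getCounters().findCounter(period).getValue().

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

class PeriodCounters {
    /** In Hadoop, an enum like this would act as one counter group. */
    enum Period { Y1980_1990, Y1991_2000, Y2001_2010, Y2011_TODAY, OTHER }

    /** Maps a publication year to its time period. */
    static Period periodOf(int year) {
        if (year >= 1980 && year <= 1990) return Period.Y1980_1990;
        if (year >= 1991 && year <= 2000) return Period.Y1991_2000;
        if (year >= 2001 && year <= 2010) return Period.Y2001_2010;
        if (year >= 2011) return Period.Y2011_TODAY;
        return Period.OTHER; // years before 1980
    }

    /** Plays the role of the map function: one call per input line. */
    static Map<Period, Long> count(Iterable<String> lines) {
        Map<Period, Long> counters = new EnumMap<>(Period.class);
        for (Period p : Period.values()) {
            counters.put(p, 0L);
        }
        for (String line : lines) {
            if (!line.startsWith("#t")) {
                continue; // only the year line contributes to the counters
            }
            try {
                int year = Integer.parseInt(line.substring(2).trim());
                // In a real mapper: context.getCounter(periodOf(year)).increment(1);
                counters.merge(periodOf(year), 1L, Long::sum);
            } catch (NumberFormatException e) {
                counters.merge(Period.OTHER, 1L, Long::sum); // empty or malformed year
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<Period, Long> c = count(Arrays.asList("#t1985", "#t1999", "#t2005", "#t2015", "#t1999"));
        for (Map.Entry<Period, Long> e : c.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Because the counters carry the whole result, the mapper does not need to emit any key-value pairs, which is why a Map-only job is enough here.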
Question 2: List of Co-Authors (10 marks)
In this question, your task is to design a MapReduce algorithm that produces a list of co-authors for each author in the given input file. You are free to use the pairs or the stripes approach. To make your algorithm efficient, you should use combiners or the in-mapper aggregation technique.
Sample Output (Author -> List of Co-authors)
David Jones -> Sam Nick, Ali Javed, Daniel Brown
Sam Nick -> David Jones, Zan Jao, Ali Javed
Ali Javed -> David Jones, Sam Nick
Zan Jao -> Sam Nick
Daniel Brown -> David Jones
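
One possible shape for the stripes approach with in-mapper aggregation is sketched below in plain Java, without the Hadoop API. The class and method names are hypothetical, and each paper is assumed to be already parsed into a list of author names. The mapper keeps one stripe (a set of co-authors) per author in memory, and the reducer takes the union of the stripes it receives for the same author.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CoAuthorStripes {
    /** Mapper side with in-mapper aggregation: one stripe (set) per author. */
    static Map<String, Set<String>> mapPapers(List<List<String>> papers) {
        Map<String, Set<String>> stripes = new TreeMap<>();
        for (List<String> authors : papers) {
            for (String a : authors) {
                Set<String> stripe = stripes.computeIfAbsent(a, k -> new TreeSet<>());
                for (String b : authors) {
                    if (!a.equals(b)) {
                        stripe.add(b); // b co-authored this paper with a
                    }
                }
            }
        }
        // In Hadoop the stripes would be emitted once, from the mapper's cleanup()
        return stripes;
    }

    /** Reducer side: union of all stripes that share the same author key. */
    static Map<String, Set<String>> reduce(List<Map<String, Set<String>>> mapperOutputs) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> output : mapperOutputs) {
            for (Map.Entry<String, Set<String>> e : output.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }
}
```

Since set union is associative and commutative, the same merge logic could also serve as a combiner. An author with no co-authors ends up with an empty stripe in this sketch.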
Question 3: Common Co-authors (10 marks)
Develop an efficient MapReduce program to find the common co-authors between any two researchers.
Sample Output (author1 - author2 -> author3, author4)
David Jones - Sam Nick -> Ali Javed
David Jones - Ali Javed -> Sam Nick
Sam Nick - Ali Javed -> David Jones
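
One way to obtain output of this shape is to start from the co-author lists of Question 2: each author sends their own list to every pair they form with a co-author, and the reducer intersects the two lists that arrive for a pair. The plain-Java sketch below simulates that idea. It is an assumption about a workable design, not the required solution, and the names are hypothetical. Pair keys are put in alphabetical order so that both authors of a pair produce the same key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CommonCoAuthors {
    /** coAuthors: author -> set of co-authors (for example, the output of Question 2). */
    static Map<String, Set<String>> common(Map<String, Set<String>> coAuthors) {
        // "Map" phase: for author a and each co-author b, emit (sorted pair, list of a)
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : coAuthors.entrySet()) {
            String a = e.getKey();
            for (String b : e.getValue()) {
                String key = a.compareTo(b) < 0 ? a + " - " + b : b + " - " + a;
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getValue());
            }
        }
        // "Reduce" phase: each pair receives two lists; their intersection is the answer
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Set<String>>> e : grouped.entrySet()) {
            List<Set<String>> lists = e.getValue();
            if (lists.size() != 2) {
                continue; // input was not symmetric for this pair
            }
            Set<String> shared = new TreeSet<>(lists.get(0));
            shared.retainAll(lists.get(1));
            if (!shared.isEmpty()) {
                result.put(e.getKey(), shared);
            }
        }
        return result;
    }
}
```

Under this design only pairs of researchers who have written a paper together receive a key, and pairs with no common co-author are dropped.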
Question 4: (15 marks)
Build a MapReduce program that outputs the percentage of papers published by each author in each year. Compute it using the pairs method.
Sample Output (author, year -> percentage of the papers published)
David Jones, 2006 -> 30%
David Jones, 2008 -> 50%
David Jones, 2010 -> 20%
Hint:

Percentage of the papers published in the year 2006 by David = (No. of papers published in 2006 by David / Total no. of papers published by David) x 100
This problem is somewhat similar to the relative frequency word co-occurrence problem discussed in class. You will have to override the partitioner and the comparator.
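
The hint can be illustrated with a plain-Java simulation of the pairs method. This is a sketch only: the special key (author, *) that carries the author's total, the class name, and the method names are assumptions, and no Hadoop classes are used. The comparator places (author, *) ahead of every (author, year) key, and the partition function looks at the author alone, so a reducer always sees an author's total before the per-year counts.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AuthorYearPercentage {
    static final String TOTAL = "*"; // marker for the per-author total

    /** Sort by author; within one author the "*" key comes first. */
    static final Comparator<String[]> KEY_ORDER = (x, y) -> {
        int c = x[0].compareTo(y[0]);
        if (c != 0) return c;
        if (x[1].equals(TOTAL)) return y[1].equals(TOTAL) ? 0 : -1;
        if (y[1].equals(TOTAL)) return 1;
        return x[1].compareTo(y[1]);
    };

    /** Partition on the author only, so all keys of one author reach the same reducer. */
    static int partition(String[] key, int numReducers) {
        return (key[0].hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** papers: one {author, year} entry per paper and author. */
    static Map<String, Double> percentages(List<String[]> papers) {
        // Map and shuffle: emit (author, *) and (author, year), summing the 1s per key
        TreeMap<String[], Integer> shuffled = new TreeMap<>(KEY_ORDER);
        for (String[] p : papers) {
            shuffled.merge(new String[] {p[0], TOTAL}, 1, Integer::sum);
            shuffled.merge(new String[] {p[0], p[1]}, 1, Integer::sum);
        }
        // Reduce: remember the total from the "*" key, then divide each year count by it
        Map<String, Double> out = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String[], Integer> e : shuffled.entrySet()) {
            String[] key = e.getKey();
            if (key[1].equals(TOTAL)) {
                total = e.getValue();
            } else {
                out.put(key[0] + ", " + key[1], 100.0 * e.getValue() / total);
            }
        }
        return out;
    }
}
```

Without the custom partitioner, the keys (David Jones, *) and (David Jones, 2006) could be sent to different reducers, and the total would not be available when the year counts arrive.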
