

Table of Contents

Chapter Title

1. Introduction
1.1 Introduction to Project
1.2 Purpose of System
1.3 Problems in Existing System
2. System Analysis
2.1 Study of the System
2.2 System Requirement Specifications
2.3 Proposed System
2.4 Process Modules Used
3. Feasibility Study
3.1 Technical Feasibility
3.2 Operational Feasibility
3.3 Economical Feasibility
4. Requirement Specification
4.1 Software Requirement Specifications
4.2 Hardware Requirement Specifications
4.3 Functional Requirements
4.4 Performance Requirements
5. System Design
5.1 Introduction
5.2 Normalization
5.3 E-R Diagram
5.4 Data Flow Diagram/UML diagram
6. Coding and Implementation
6.1 Method or Algorithm Used.

6.2 Results / Output
7. System Testing
7.1 Unit Testing
7.2 Integration Testing
7.3 Acceptance Testing
8. Issues Faced
9. Conclusion
10. Future Scope
11. References

CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION
Recommendation systems belong to the fields of Information Retrieval, Data Mining and Machine
Learning. A recommendation system is a type of information filtering system which attempts to
predict the preferences of a user and make suggestions based on these preferences. There is a wide
variety of applications for recommendation systems. These have become increasingly popular over
the last few years and are now utilized in most of the online platforms that we use. The content of such
platforms varies from movies, music, books and videos, to friends and stories on social media
platforms, to products on e-commerce websites, to people on professional and dating websites, to
search results returned on Google.

Often, these systems are able to collect information about a user's choices and can use this
information to improve their suggestions in the future. For example, Facebook can monitor your
interaction with various stories on your feed in order to learn what types of stories appeal to you.
Sometimes, recommender systems can make improvements based on the activities of a large number
of people. For example, if Amazon observes that a large number of customers who buy the latest
Apple MacBook also buy a USB-C-to-USB adapter, it can recommend the adapter to a new user who
has just added a MacBook to his cart.

Due to the advances in recommender systems, users constantly expect good recommendations and
have a low tolerance for services that are not able to make appropriate suggestions. If a music
streaming app is not able to predict and play music that the user likes, the user will simply stop using
it. This has led to a high emphasis by tech companies on improving their recommendation systems.
However, the problem is more complex than it seems. Every user has different preferences and likes.
In addition, even the taste of a single user can vary depending on a large number of factors, such as
mood, season, or the type of activity the user is doing. For example, the type of music one would like
to hear while exercising differs greatly from the type of music one would listen to while cooking
dinner. Another issue that recommendation systems have to solve is the exploration vs. exploitation
problem: they must explore new domains to discover more about the user, while still making the most
of what is already known about the user.

Two main approaches are widely used for recommender systems. One is content-based filtering,
where we try to profile the user's interests using the information collected and recommend items
based on that profile. The other is collaborative filtering, where we try to group similar users together
and use information about the group to make recommendations to the user. Both approaches are
discussed in greater detail in later chapters.

Benchmarking
Benchmarking involves looking outward (outside a particular business, organisation, industry, region
or country) to examine how others achieve their performance levels, and to understand the processes
they use.

In this way, benchmarking helps explain the processes behind excellent performance. When
lessons learned from a benchmarking exercise are applied appropriately, they facilitate
improved performance in critical functions within an organisation or in key areas of the
business.

If a company is to be successful, it needs to evaluate its performance in a consistent manner.

In order to do so, businesses need to set standards for themselves and measure their
processes and performance against recognized industry leaders or against best practices
from other industries, which operate in a similar environment.

This is commonly referred to as benchmarking in management parlance.

The benchmarking process is relatively uncomplicated. Some knowledge and a practical bent
are all that is needed to make such a process a success.

A Step-by-Step Approach to Benchmarking


Following are the steps involved in benchmarking process:

(1) Planning
Prior to engaging in benchmarking, it is imperative that corporate stakeholders identify the
activities that need to be benchmarked.

For instance, the processes that merit such consideration would generally be core activities
that have the potential to give the business in question a competitive edge.

Such processes would generally command a high cost, volume or value. For the optimal
results of benchmarking to be reaped, the inputs and outputs need to be defined; the
activities chosen should be measurable and thereby easily comparable, and thus the
benchmarking metrics need to be arrived at.

Prior to engaging in the benchmarking process, the total process flow needs to be given due
consideration. For instance, improving one core competency to the detriment of another
proves to be of little use.

Therefore, many choose to document such processes in detail (a process flow chart is
deemed to be ideal for this purpose), so that omissions and errors are minimized; thus
enabling the company to obtain a clearer idea of its strategic goals, its primary business
processes, customer expectations and critical success factors.

The next step in the planning process would be for the company to choose an appropriate
benchmark against which their performance can be measured.

The benchmark can be a single entity or a collective group of companies, which operate at
optimal efficiency.

As stated before, if such a company operates in a similar environment, or if it adopts a
comparable strategic approach to reach its goals, its relevance would, indeed, be greater.

Measures and practices used in such companies should be identified, so that business
process alternatives can be examined.

Also, it is always prudent for a company to ascertain its objectives, prior to commencement
of the benchmarking process.

The methodology adopted and the way in which output is documented should be given due
consideration too. In such instances, a capable team should be selected to carry out
the benchmarking process, with a leader or leaders duly appointed, so as to ensure
the smooth, timely implementation of the project.

(2) Collection of Information :


Information can be broadly classified under the headings of primary data and secondary
data.

To clarify further, primary data refers to data collected directly from the benchmarked
company or companies themselves, while secondary data refers to information garnered from
the press, publications or websites.

Exploratory research, market research, quantitative research, informal conversations,
interviews and questionnaires are still some of the most popular methods of collecting
information.

When engaging in primary research, the company that is due to undertake the
benchmarking process needs to define its data collection methodology.

Drafting a questionnaire or a standardized interview format, carrying out primary research
via the telephone, e-mail or face-to-face interviews, making on-site observations, and
documenting such data in a systematic manner are vital if the benchmarking process is to be
a success.

(3) Analysis of Data :


Once sufficient data is collected, the proper analysis of such information is of foremost
importance.

Data analysis, data presentation (preferably in graphical format, for easy reference), results
projection, classifying the performance gaps in processes, and identifying the root cause that
leads to the creation of such gaps (commonly referred to as enablers), need to be then
carried out.

(4) Implementation :
This is the stage in the benchmarking process where it becomes mandatory to walk the talk.
This generally means that far-reaching changes need to be made, so that the performance
gap between the ideal and the actual is narrowed and eliminated wherever possible.

A formal action plan that promotes change should ideally be formulated keeping the
organization's culture in mind, so that the resistance that usually accompanies change is
minimized.

Ensuring that the management and staff are fully committed to the process and that
sufficient resources are in place to facilitate the necessary improvements is critical in
making the benchmarking process a success.

(5) Monitoring : As with most projects, in order to reap the maximum benefits of the
benchmarking process, a systematic evaluation should be carried out on a regular basis.
Assimilating the required information, evaluating the progress made, re-iterating the impact
of the changes and making any necessary adjustments, are all part of the monitoring
process.

1.2 PURPOSE OF THE SYSTEM
Our project is “MOVIES RECOMMENDATION SYSTEM AND BENCHMARKING USING
BIG DATA HADOOP”. Recommender systems play a major role in today's e-commerce
industry. They recommend items to users such as books, movies, videos, electronic products
and many other products in general. Recommender systems help users get personalized
recommendations and take correct decisions in their online transactions, and they increase
sales, improve the web browsing experience, retain customers and enhance their shopping
experience. The information overload problem is solved by search engines, but search
engines do not provide personalization of data; recommendation engines do. There are
different types of recommender systems, such as content-based, collaborative filtering,
hybrid, demographic and keyword-based recommender systems. A variety of algorithms is
used by researchers in each type of recommendation system. A lot of work has been done on
this topic, yet it remains a favourite topic among data scientists and falls within the domain of
data science.

1.3 PROBLEM WITH EXISTING SYSTEM

Recommendation systems are purely based on the analysis of data. Data analysis can be done
using Python, Java, HADOOP, etc. The problem with the old techniques of analysing data is
that, since data is increasing day by day, storing and analysing it with those methods becomes
a headache. Relational database management systems, desktop statistics and visualization
software packages often have difficulty handling big data; the work may require "massively
parallel software running on tens, hundreds, or even thousands of servers".

HADOOP is a distributed storage and processing framework which allows users to save and
process Big Data in a fault-tolerant ecosystem using simple programming models. HADOOP
has also recently developed into an ecosystem of technologies and tools that complement Big
Data processing.

CHAPTER 2
SYSTEM ANALYSIS

2.1 STUDY OF THE SYSTEM
The Simple Recommender offers generalized recommendations to every user based on
movie popularity and (sometimes) genre. The basic idea behind this recommender is that
movies that are more popular and more critically acclaimed will have a higher probability of
being liked by the average audience. This model does not give personalized recommendations
based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our movies
based on ratings and popularity and display the top movies of our list. As an added step, we can
pass in a genre argument to get the top movies of a particular genre.
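To make the idea concrete, a minimal Python/pandas sketch of this baseline is shown below. The file name and column names such as rating_avg and rating_count are assumptions for illustration, not the project's actual schema:

import pandas as pd

# Hypothetical flat export of the movies table joined with rating statistics.
movies = pd.read_csv("movies_with_ratings.csv")  # movieID, title, genres, rating_count, rating_avg

def top_movies(df, genre=None, min_votes=5, n=10):
    # Optionally keep only movies whose genre string contains the requested genre.
    if genre is not None:
        df = df[df["genres"].str.contains(genre, na=False)]
    # Ignore movies with too few votes, then rank by average rating and popularity.
    df = df[df["rating_count"] >= min_votes]
    return df.sort_values(["rating_avg", "rating_count"], ascending=False).head(n)

print(top_movies(movies, genre="Comedy"))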

The system is built using HADOOP and its components HDFS, SQOOP and HIVE, along with
MySQL and the Ubuntu terminal.

Beyond this simple baseline, we are building a Collaborative Filtering recommendation system.
Collaborative filtering is a method of making automatic predictions (filtering) about the interests
of a user by collecting preferences or taste information from many users (collaborating). The
underlying assumption of the collaborative filtering approach is that if a person A has the same
opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than
that of a randomly chosen person.

For example, a collaborative filtering recommendation system for movie tastes could make
predictions about which movie a user should like, given a partial list of that user's tastes (likes
or dislikes).

Note that these predictions are specific to the user, but use information gleaned from many
users.

We are using the MovieLens dataset, which contains 105,339 ratings and 6,138 tag applications
across 10,329 movies.

2.1.1 BIG DATA


Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not merely data; it has become a complete subject, which involves various tools,
techniques and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.

• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the
voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.


• Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.

• Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions made by customers on shares of different companies.

• Power Grid Data: The power grid data holds information consumed by a particular
node with respect to a base station.

• Transport Data: Transport data includes model, capacity, distance and availability of
a vehicle.

• Search Engine Data: Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data
in it will be of three types:

• Structured data: Relational data.

• Semi Structured data: XML data.

• Unstructured data: Word, PDF, Text, Media Logs.

Traditional Approach

In this approach, an enterprise will have a computer to store and process big data. Here, data
will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated
software can be written to interact with the database, process the required data and present it
to the users for analysis purposes.

Fig 2.1.1.a Traditional approach of storing data

Limitation

This approach works well where we have a small volume of data that can be accommodated by
standard database servers, or up to the limit of the processor which is processing the data. But

when it comes to dealing with huge amounts of data, it is really a tedious task to process such
data through a traditional database server.

2.1.2 HADOOP
Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license. It was developed based on Google's MapReduce paper and applies concepts
of functional programming. Hadoop is written in the Java programming language and is among
the top-level Apache projects. It was developed by Doug Cutting and Michael J. Cafarella.

PROBLEM:

• The first problem is storing huge amounts of data.

• The next problem is storing a wide variety of data.

• The third challenge is processing the data faster.

HADOOP came up as the solution:

• HDFS provides a distributed way to store Big Data.

• In HDFS you can store all kinds of data whether it is structured, semi-structured or
unstructured. In HDFS, there is no pre-dumping schema validation.

• The processing logic is sent to the nodes where the data is stored, so that each node can
process a part of the data in parallel. Finally, all of the intermediary output produced by
each node is merged together and the final response is sent back to the client.

HADOOP FEATURES:

Fig 2.1.2.a HADOOP features

Reliability:

When machines are working in tandem, if one of the machines fails, another machine will take
over the responsibility and work in a reliable and fault tolerant fashion. Hadoop infrastructure
has inbuilt fault tolerance features and hence, Hadoop is highly reliable.

Economical:

Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB
hard disks and Xeon processors, whereas using hardware-based RAID with Oracle for the same
purpose would cost at least five times more. So, the cost of ownership of a Hadoop-based
project is quite low. It is easier to maintain a Hadoop environment and it is economical as well.
Also, Hadoop is open-source software, so there is no licensing cost.

Scalability:

Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor
because you can go ahead and procure more hardware and expand your setup within minutes
whenever required.

Flexibility:

Hadoop is very flexible in terms of its ability to deal with all kinds of data. As noted under the
variety of Big Data above, data can be of any kind, and Hadoop can store and process it all,
whether it is structured, semi-structured or unstructured.

2.1.3 Hadoop Core Components


While setting up a Hadoop cluster, you have an option of choosing a lot of services as part of
your Hadoop platform, but there are two services which are always mandatory for setting
up Hadoop. One is HDFS (storage) and the other is MAP REDUCE (processing).

1) HDFS

Fig 2.1.3.a HDFS components

The main components of HDFS are: NameNode and DataNode. Let us talk about the
roles of these two components in detail.

a) NAMENODE

i) It is the master daemon that maintains and manages the DataNodes (slave nodes).

ii) Records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.

iii) If a file is deleted in HDFS, the NameNode will immediately record this in the
EditLog.

iv) It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are alive.

b) DATANODE

i) It is the slave daemon which runs on each slave machine.

ii) Actual data is stored on DataNodes.

iii) Responsible for serving read and write requests from the clients.

iv) Responsible for creating blocks, deleting blocks and replicating the same based on
the decisions taken by the NameNode

v) It sends heartbeats to the NameNode periodically to report the overall health of HDFS;
by default, this frequency is set to 3 seconds.

2.1.4 HADOOP Ecosystem


The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.

Below are the HADOOP components that together form the HADOOP ecosystem:

• HDFS -> HADOOP Distributed File System

• YARN -> Yet Another Resource Negotiator

• MapReduce -> Data processing using programming

• Spark -> In-memory Data Processing

• PIG, HIVE-> Data Processing Services using Query (SQL-like)

• HBase -> NoSQL Database

• Apache Drill -> SQL on HADOOP

• Zookeeper -> Managing Cluster

• Oozie -> Job Scheduling

• Flume, Sqoop -> Data Ingesting Services

Fig 2.1.4.a HADOOP Ecosystem

1. YARN:

Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.

It has two major components, i.e. ResourceManager and NodeManager.

a. The ResourceManager is the main node of the processing department.

b. It receives the processing requests and then passes the parts of the requests to the
corresponding NodeManagers, where the actual processing takes place.

c. NodeManagers are installed on every DataNode and are responsible for the execution
of tasks on every single DataNode.

2. MAPREDUCE

It is the core component of processing in a Hadoop Ecosystem, as it provides the logic of
processing. In other words, MapReduce is a software framework which helps in writing
applications that process large data sets using distributed and parallel algorithms inside the
Hadoop environment.

• In a MapReduce program, Map() and Reduce() are two functions.

1. The Map function performs actions like filtering, grouping and sorting.

2. The Reduce function aggregates and summarizes the results produced by the Map
function.

3. The result generated by the Map function is a key-value pair (K, V) which acts as
the input for the Reduce function.
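To illustrate the Map and Reduce roles, a small Hadoop Streaming sketch in Python follows. This is only an assumption for this report (the project drives MapReduce indirectly through Sqoop and Hive rather than hand-written jobs); the mapper emits (movieID, rating) pairs and the reducer aggregates them into an average rating per movie:

#!/usr/bin/env python
# ---- mapper.py ----
# Reads lines of the form "userID,movieID,rating,timestamp" and emits "movieID<TAB>rating".
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 3:
        continue  # skip malformed lines
    movie_id, rating = fields[1], fields[2]
    print("%s\t%s" % (movie_id, rating))

#!/usr/bin/env python
# ---- reducer.py (saved as a separate file) ----
# Input arrives sorted by movieID, so all ratings for one movie are contiguous.
import sys

current_movie, total, count = None, 0.0, 0

def emit(movie, total, count):
    if movie is not None and count > 0:
        print("%s\t%.4f\t%d" % (movie, total / count, count))

for line in sys.stdin:
    movie_id, rating = line.strip().split("\t")
    if movie_id != current_movie:
        emit(current_movie, total, count)
        current_movie, total, count = movie_id, 0.0, 0
    total += float(rating)
    count += 1

emit(current_movie, total, count)

The two scripts would typically be submitted together with the Hadoop Streaming jar; the shuffle phase between them is what groups the emitted ratings by movieID before the reducer runs.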

3. APACHE PIG

• PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment. You can think of them as analogous to Java and the JVM.

• It supports the Pig Latin language, which has an SQL-like command structure.

Not everyone comes from a programming background, so Apache PIG relieves them of writing
low-level code. An interesting rule of thumb:

10 lines of Pig Latin = approx. 200 lines of MapReduce Java code

Behind the scenes, however, every Pig job executes as a MapReduce job.

• The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of
MapReduce jobs as an abstraction (which works like a black box).

• PIG was initially developed by Yahoo.

• It gives you a platform for building data flow for ETL (Extract, Transform
and Load), processing and analyzing huge data sets.

4. APACHE HIVE

• Facebook created HIVE for people who are fluent with SQL. Thus, HIVE
makes them feel at home while working in a Hadoop Ecosystem.

• Basically, HIVE is a data warehousing component which performs reading, writing and
managing of large data sets in a distributed environment using an SQL-like interface.

HQL = SQL + HIVE

• The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.

• It has 2 basic components: Hive Command Line and JDBC/ODBC driver.

• The Hive Command line interface is used to execute HQL commands.

• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers
are used to establish connections from client applications to the data store.

• It supports all primitive data types of SQL.

• You can use predefined functions, or write tailored user-defined functions (UDFs), to
accomplish your specific needs.
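As a small illustration of how HQL is issued in practice, the sketch below assumes the PyHive client library is available and that a HiveServer2 instance is running locally; the host, port, database and table names are hypothetical:

from pyhive import hive  # assumption: the PyHive package is installed

# Connect to a local HiveServer2 instance (connection details are assumptions).
conn = hive.Connection(host="localhost", port=10000, database="movielens")
cursor = conn.cursor()

# HQL looks almost identical to SQL: average rating and vote count per movie.
cursor.execute("""
    SELECT movie_id, AVG(rating) AS rating_avg, COUNT(*) AS rating_count
    FROM ratings
    GROUP BY movie_id
""")

for movie_id, rating_avg, rating_count in cursor.fetchall():
    print(movie_id, rating_avg, rating_count)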

5. APACHE OOZIE

Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Hadoop
jobs, Oozie acts as a scheduler: it schedules Hadoop jobs and binds them together as one logical
unit of work.

There are two kinds of Oozie jobs:

• Oozie workflow: a sequential set of actions to be executed. You can think of it as a
relay race, where each athlete waits for the previous one to complete his part.

• Oozie Coordinator: Oozie jobs which are triggered when data is made available to them.
Think of this as the response-stimulus system in our body: in the same manner as we
respond to an external stimulus, an Oozie coordinator responds to the availability of
data and rests otherwise.

6. APACHE FLUME

Ingesting data is an important part of our Hadoop Ecosystem.

• Flume is a service which helps in ingesting unstructured and semi-structured data into
HDFS.

• It gives us a solution which is reliable and distributed and helps us in collecting,
aggregating and moving large amounts of data.

• It helps us to ingest online streaming data from various sources like network traffic,
social media, email messages, log files, etc. into HDFS.

Fig 2.1.4.b Flume agents

The flume agent has 3 components: source, sink and channel.

1. Source: it accepts the data from the incoming stream and stores the data in the channel.

2. Channel: it acts as the local, or primary, storage; a channel is a temporary store between
the source of the data and the persistent data in HDFS.

3. Sink: our last component, the sink, collects the data from the channel and commits or
writes the data to HDFS permanently.

7. APACHE SQOOP

The major difference between Flume and Sqoop is that:

• Flume only ingests unstructured data or semi-structured data into HDFS.

• Sqoop, on the other hand, can import as well as export structured data between an
RDBMS or enterprise data warehouse and HDFS.

Fig 2.1.4.c SQOOP

When we submit a Sqoop command, our main task gets divided into sub-tasks, which are
handled internally by individual Map Tasks. Each Map Task imports a part of the data into the
Hadoop Ecosystem; collectively, all the Map Tasks import the whole data set.

Fig 2.1.4.d SQOOP working
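For illustration, a typical Sqoop import invoked from Python is sketched below (the JDBC URL, credentials and target directory are assumptions, not the project's actual values); the -m flag controls how many parallel map tasks split the import:

import subprocess

# Each map task imports one slice of the table; Sqoop writes the slices into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/movielens",  # assumed source database
    "--table", "ratings",
    "--username", "root",
    "--password", "secret",
    "--target-dir", "/user/hadoop/ratings",           # assumed HDFS directory
    "-m", "4",                                        # number of parallel map tasks
], check=True)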

2.2 SYSTEM REQUIREMENT SPECIFICATION
1. System configuration - 64-bit system with a minimum of 8 GB RAM.
2. Operating system - Ubuntu
3. Software used - HADOOP version 1.1
4. Database - MYSQL
5. SQOOP
6. HIVE

2.3 PROPOSED SYSTEM


We are proposing a movie recommendation system using BIG DATA HADOOP which is based
on the collaborative filtering algorithm, alongside a simple popularity-based baseline. The basic
idea behind the baseline recommender is that movies that are more popular and more critically
acclaimed will have a higher probability of being liked by the average audience; this baseline
does not give personalized recommendations based on the user. Its implementation is extremely
trivial: all we have to do is sort our movies based on ratings and popularity and display the top
movies of our list. As an added step, we can pass in a genre argument to get the top movies of a
particular genre.

For the collaborative filtering part, we build a model from a user's past behaviour (items
previously purchased or selected and/or numerical ratings given to those items) as well as
similar decisions made by other users. This model is then used to predict items (or ratings for
items) that the user may have an interest in. By contrast, content-based approaches utilize a
series of discrete characteristics of an item in order to recommend additional items with similar
properties.

Fig 2.3.1 collaborative filtering recommendation concept

2.4 PROCESS MODULES USED
The process modules which are used in this project are:

➢ HDFS
➢ SQOOP
➢ MYSQL
➢ HIVE
2.4.1 HDFS (HADOOP Distributed File System)

• Primary data storage system used by HADOOP applications.

• It employs a NameNode and DataNode architecture to implement a distributed file system

• Specially designed to be highly fault-tolerant

2.4.2 SQOOP

• Apache Sqoop (SQL-to-HADOOP)

• Designed to support bulk import of data into HDFS from structured data stores such as relational
databases etc.

2.4.3 MYSQL

➢ We used MySQL for managing and processing our source database.

2.4.4 HIVE

• HIVE is a data warehousing tool on top of HADOOP.

• It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

• It is applied only to structured data.

• It uses HQL (Hive Query Language).

CHAPTER 4
REQUIREMENT SPECIFICATION

4.1 SOFTWARE REQUIREMENT SPECIFICATION
We are working with the MovieLens dataset, which contains 105,339 ratings and 6,138 tag
applications across 10,329 movies.

The software used in building our project are:

➢ HADOOP version 1.1


➢ HDFS
➢ SQOOP
➢ HIVE
➢ MYSQL

4.2 HARDWARE REQUIREMENT SPECIFICATION


The hardware used for this project is:

➢ A computer system with a minimum configuration of a 64-bit processor and 8 GB RAM.


➢ Operating system : Ubuntu version 10

4.3 FUNCTIONAL REQUIREMENT

Functional requirements may involve calculations, technical details, data manipulation and
processing, and other specific functionality that defines what a system is supposed to
accomplish. They should be contrasted with non-functional requirements, which specify
overall characteristics such as cost and reliability.

Our project has minimal functional requirements; it only requires a system with sufficient
memory (a minimum of 8 GB RAM).

CHAPTER 5
SYSTEM DESIGN

5.1 INTRODUCTION
The project is designed around the concept of collaborative filtering, whose underlying
assumption is that if a person A has the same opinion as a person B on an issue, A is more
likely to have B's opinion on a different issue than that of a randomly chosen person.

In other words, if two users shared the same interests in the past, e.g. they liked the same book
or the same movie, they will also have similar tastes in the future. If, for example, user A and
user B have a similar purchase history and user A recently bought a book that user B has not
yet seen, the basic idea is to propose this book to user B.

We are using the MovieLens dataset, which contains 105,339 ratings and 6,138 tag applications
across 10,329 movies.

The basic designing concept of our project is that:

➢ First, the users, ratings and movies tables are ingested onto the HADOOP cluster using Sqoop.
➢ Then the data from these tables is pre-processed using Hive, and the good ratings are
obtained by removing all ratings less than 3.
➢ Finally, the good ratings are processed to produce movie recommendations as output.

5.1.1 DATABASE

Tables in MovieLens database:

1. genres
2. genres_movies
3. movies
4. occupations
5. users
6. ratings
7. users_demo

Fig 5.1.1 database of our project

Fig 5.1.2 fields in table users, ratings and movies

5.2 RELATIONAL MODEL

Fig 5.2.1 Relational model of the database

CHAPTER 6
CODING & IMPLEMENTATION

6.1 METHOD:

There are 3 main steps:

Step 1: Ingestion
Step 2: Association Rule Mining
Step 3: Recommendations

1) Ingestion:
Ingestion is about moving data from where it originates into a system where it can be
stored and analyzed, such as HADOOP.

a) Sqoop imports the users, ratings, and movies tables into HADOOP.


b) Create External Hive tables on top of SQOOP Result.
2) Association Rule Mining :
Association Rule Mining is about pre-processing the external Hive tables with a set of rules
in order to keep only those movies with good ratings which can be recommended to the user.
Pre-processing: (i/p: Ratings, o/p: Good Ratings)

a. Remove all ratings whose value is less than 3.


b. Remove all movies whose votes < Threshold (5).
3) Recommendation : (i/p: Good ratings, o/p: movies recommendation)
a) Find all the similar users
b) Find the movies which were watched by similar users but not watched by the base user.

Fig 6.1.1 Method of processing
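In the project these steps run through Sqoop and Hive on the cluster; the Python/pandas sketch below only mirrors the pre-processing and recommendation logic locally to make the rules concrete (file names, column names and the notion of "similar users" are simplifying assumptions):

import pandas as pd

# Assumed local exports of the ratings and movies tables.
ratings = pd.read_csv("ratings.csv")   # user_id, movie_id, rating
movies = pd.read_csv("movies.csv")     # movie_id, title

# Step 2 (pre-processing): keep only good ratings (value >= 3) ...
good_ratings = ratings[ratings["rating"] >= 3]
# ... and drop movies with fewer votes than the threshold (5).
votes = good_ratings.groupby("movie_id")["rating"].count()
good_movies = votes[votes >= 5].index
good_ratings = good_ratings[good_ratings["movie_id"].isin(good_movies)]

# Step 3 (recommendation): suggest good movies watched by similar users
# ("similar" is simplified here to users who share at least one good movie).
def recommend(base_user, top_n=10):
    seen = set(good_ratings.loc[good_ratings["user_id"] == base_user, "movie_id"])
    peers = good_ratings[good_ratings["movie_id"].isin(seen)]["user_id"].unique()
    candidates = good_ratings[good_ratings["user_id"].isin(peers)
                              & ~good_ratings["movie_id"].isin(seen)]
    ranked = candidates.groupby("movie_id")["rating"].mean().sort_values(ascending=False)
    return movies.set_index("movie_id").loc[ranked.head(top_n).index, "title"]

print(recommend(base_user=1))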

6.1.2 PROCESS SCREENSHOTS FROM OUR PROJECT

Fig 6.1.2.a Running script file

Fig 6.1.2.b Fetching data from RDBMS for Recommendation using SQOOP

Fig 6.1.2.c Running MAP-REDUCE job for loading the data into cluster

Fig 6.1.2.d Data on the HADOOP cluster

Fig 6.1.2.e Data of ratings table on HADOOP cluster

Fig 6.1.2.f Creating External Hive tables on top of SQOOP Result

Fig 6.1.2.g Pre-processing the Hive tables

Fig 6.1.2.h Pre-processed data on HIVE tables

6.2 ALGORITHM
For our project, we focused on one main algorithm for recommendation: Collaborative
filtering.
6.2.1 Collaborative Filtering
Collaborative filtering techniques make recommendations for a user based on the ratings and
preference data of many users. The main underlying idea is that if two users have both liked
certain common items, then the items that one user has liked but the other user has not yet
tried can be recommended to him. We see collaborative filtering techniques in action on
various Internet platforms such as Amazon.com, Netflix, and Facebook, where we are
recommended items based on the ratings and purchase data that these platforms collect from
their user base. We explore two algorithms for collaborative filtering: the Nearest Neighbours
algorithm and the Latent Factors algorithm.

a) Nearest Neighbours Collaborative Filtering: This approach relies on the idea that users who
have exhibited similar rating behaviour so far share the same tastes and will likely exhibit
similar rating behaviour going forward. The algorithm first computes the similarity between
users, using the row vector of the ratings matrix corresponding to a user as a representation
for that user. The similarity is computed using either cosine similarity or Pearson correlation.
In order to predict the rating of a particular user for a given movie j, we find the top k users
most similar to this user and then take a weighted average of the ratings of those k similar
users, with the weights being the similarity values.

b) Latent Factor Methods: The latent factor algorithm decomposes the ratings matrix R into
two tall and thin matrices Q and P, with Q having dimensions num_users x k and P having
dimensions num_items x k, where k is the number of latent factors. The decomposition of R
into Q and P is such that

R = Q . P^T

Any rating r_ij in the ratings matrix can then be computed by taking the dot product of row
q_i of matrix Q and row p_j of matrix P. The matrices Q and P are initialized randomly or by
performing SVD on the ratings matrix. The algorithm then minimizes the error between the
actual rating value r_ij and the value given by the dot product of rows q_i and p_j, performing
stochastic gradient descent to find the matrices Q and P with minimum error starting from the
initial matrices.
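A toy Python sketch of this latent factor approach is given below; the ratings matrix, hyper-parameters and variable names are illustrative assumptions, not the project's code:

import numpy as np

# Toy ratings matrix R: rows are users, columns are movies, 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

num_users, num_items = R.shape
k = 2                                            # number of latent factors
rng = np.random.default_rng(0)
Q = rng.normal(scale=0.1, size=(num_users, k))   # user factor matrix (num_users x k)
P = rng.normal(scale=0.1, size=(num_items, k))   # item factor matrix (num_items x k)
lr, reg = 0.01, 0.05                             # learning rate and regularization

# Stochastic gradient descent over the observed entries only.
observed = [(u, i) for u in range(num_users) for i in range(num_items) if R[u, i] > 0]
for epoch in range(500):
    for u, i in observed:
        err = R[u, i] - Q[u].dot(P[i])           # e_ui = r_ui - q_u . p_i
        Q[u] += lr * (err * P[i] - reg * Q[u])
        P[i] += lr * (err * Q[u] - reg * P[i])

# Predict a rating the user has not given, e.g. user 0 on movie index 2.
print(round(float(Q[0].dot(P[2])), 2))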

Fig 6.2.1 Calculation of user similarity

Fig 6.2.2 Movie Recommendation System

Benchmarking is a method of comparison with related and comparable organizations.
It is the measurement of an organization's internal process performance data and a
comparison with that of related and comparable organizations. Preferably, these
comparisons are made with businesses from the same sector, but it is possible to use
benchmarking between businesses from other sectors as well. These comparisons
mainly concern the dimensions of quality, time and cost of organizations that are about
the same size and that have more or less the same outlet. In addition, it is about how
certain features can be realized better, faster and cheaper.

6.3 RESULTS / OUTPUT

Table 6.3.1 shows a sample of the processed movie data (rating counts, average ratings, release
years and genre vectors) computed on the MovieLens 100k (small) data.

Fields: movieID, Title, Genres, rating_count, rating_avg, year, genresMatrix

movieID 1: Toy Story (1995)
  Genres: Adventure|Animation|Children|Comedy|Fantasy
  rating_count: 2569, rating_avg: 3.959323, year: 1995
  genresMatrix: [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...]

movieID 2: Jumanji (1995)
  Genres: Adventure|Children|Fantasy
  rating_count: 1155, rating_avg: 3.268398, year: 1995
  genresMatrix: [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...]

movieID 3: Grumpier Old Men (1995)
  Genres: Comedy|Romance
  rating_count: 685, rating_avg: 3.186861, year: 1995
  genresMatrix: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...]

movieID 4: Waiting to Exhale (1995)
  Genres: Comedy|Drama|Romance
  rating_count: 138, rating_avg: 3.000000, year: 1995
  genresMatrix: [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ...]

movieID 5: Father of the Bride Part II (1995)
  Genres: Comedy
  rating_count: 657, rating_avg: 3.143836, year: 1995
  genresMatrix: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

Table 6.3.1 output for calculating similarity

Fig 6.3.1 good_movies

Fig 6.3.2 good_ratings

Fig 6.3.3 final result after execution

Result: After the successful execution of the scripts, we got the following output:

If the user watches the movie Birdcage, he is recommended the movies Star Wars, Fargo, etc.

Fig 6.3.4 recommendation result 1

Fig 6.3.5 recommendation output b

Fig 6.3.6 recommendation output c

APPLICATION:
Benchmarking is used and applied within the management of organizations. Several aspects
of processes are evaluated against the best performance of other companies. It is, however,
necessary that these comparisons are made between companies with common features.
Through this approach, organizations acquire a better understanding of how they can tackle
developments and improvements in the best possible way. The approach can be a non-
recurring event, but it is increasingly used as a continuous process to improve the
performance of the organization.

CHAPTER 7
SYSTEM TESTING

7.1 Unit Testing
Unit Testing of software applications is done during the development (coding) of an
application. The objective of Unit Testing is to isolate a section of code and verify its
correctness. In procedural programming a unit may be an individual function or procedure.

The goal of Unit Testing is to isolate each part of the program and show that the individual parts
are correct. Unit Testing is usually performed by the developer.

7.2 Integration Testing


INTEGRATION TESTING is a level of software testing where individual units are combined
and tested as a group. The purpose of this level of testing is to expose faults in the interaction
between integrated units. Test drivers and test stubs are used to assist in Integration Testing.

7.3 Acceptance testing


ACCEPTANCE TESTING is a level of software testing where a system is tested for
acceptability. The purpose of this test is to evaluate the system's compliance with the business
requirements and assess whether it is acceptable for delivery (definition by ISTQB).

7.4 System Testing


System Testing (ST) is a black box testing technique performed to evaluate the complete
system's compliance against specified requirements. In system testing, the functionalities
of the system are tested from an end-to-end perspective.

System Testing is usually carried out by a team that is independent of the development team,
in order to measure the quality of the system in an unbiased way. It includes both functional
and non-functional testing.

CHAPTER 8
ISSUES FACED

8.1 ISSUES FACED
8.1.1 Scalability Issues :

One of the major challenges in working with a 20-million-rating dataset is memory constraints.
The data cannot be stored as a dense matrix due to its huge size, so we have to make use of
sparse matrix representations in order for the program to work without memory issues. Further,
intermediate results such as the user-user similarity matrix cannot be computed and stored due
to their huge memory footprint; we had to think of ways to compute the similarity values as and
when needed. The 20-million-rating dataset also needed a lot of time to run. We were able to
overcome the time requirements by writing parallelized implementations of the algorithms using
Apache Spark.
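A brief sketch of this sparse-representation idea, assuming SciPy is available (the arrays and names below are illustrative, not the project's code):

import numpy as np
from scipy.sparse import csr_matrix

# Assume three parallel arrays parsed from the ratings file: user ids, movie ids, ratings.
users = np.array([0, 0, 1, 2])
items = np.array([10, 42, 10, 7])
vals = np.array([4.0, 5.0, 3.0, 2.0])

# Only the non-zero entries are stored, so even a very large ratings matrix fits in memory.
R = csr_matrix((vals, (users, items)))

def user_similarity(u, v):
    # Compute the cosine similarity of two users on demand instead of
    # materializing the full user-user similarity matrix.
    a, b = R.getrow(u), R.getrow(v)
    denom = np.sqrt(a.multiply(a).sum()) * np.sqrt(b.multiply(b).sum())
    return float(a.multiply(b).sum()) / denom if denom else 0.0

print(user_similarity(0, 1))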

8.1.2 Broken links

As mentioned earlier, meta-data about the movies was collected by scraping details from the
IMDb site. The smaller dataset provided auto-generated links to each movie's URL based on
the movie's title and release year. This caused a large portion of the links to be broken. Some
titles were ambiguous, leading to a search page with suggestions rather than the movie page.
For others, there was some sort of error in the reference to the link. Examples of these errors
include the usage of a secondary foreign title instead of the English one, the usage of a former
title, and an incorrect year for the movie. As a result, almost a third of the links were broken
and had to be corrected before the data could be used. To fix this, we decided to use a different
dataset where the IMDb movie id was provided instead, which was easier to use.

8.1.3 Data sparsity

In practice, many commercial recommender systems are based on large datasets. As a result,
the user-item matrix used for collaborative filtering can be extremely large and sparse, which
brings challenges for the performance of the recommendations.

One typical problem caused by data sparsity is the cold start problem. As collaborative
filtering methods recommend items based on users' past preferences, new users need to rate
a sufficient number of items to enable the system to capture their preferences accurately and
thus provide reliable recommendations.

Similarly, new items have the same problem. When new items are added to the system, they
need to be rated by a substantial number of users before they can be recommended to users
who have similar tastes to the ones who rated them. The new item problem does not affect
content-based recommendations, because the recommendation of an item is based on its
discrete set of descriptive qualities rather than on its ratings.

CHAPTER 9
CONCLUSION

9.1 CONCLUSION
The Simple Recommender offers generalized recommendations to every user based on
movie popularity and (sometimes) genre. The basic idea behind this recommender is
that movies that are more popular and more critically acclaimed will have a higher
probability of being liked by the average audience. This model does not give
personalized recommendations based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our
movies based on ratings and popularity and display the top movies of our list. As an
added step, we can pass in a genre argument to get the top movies of a particular genre.

CHAPTER 10
FUTURE SCOPE

10.1 FUTURE SCOPE
There are plenty of ways to expand on the work done in this project. Firstly, the content-based
method can be expanded to include more criteria to help categorize the movies. The most
obvious idea is to add features to suggest movies with common actors, directors or writers.
In addition, movies released within the same time period could also receive a boost in
likelihood of recommendation. Similarly, a movie's total gross could be used to identify a
user's taste in terms of whether he/she prefers large-release blockbusters or smaller indie
films. However, the above ideas may lead to overfitting, given that a user's taste can be highly
varied, and we only have a guarantee that 20 movies (less than 0.2%) have been reviewed by
the user. In addition, we could try to develop hybrid methods that combine the advantages of
both content-based methods and collaborative filtering into one recommendation system.

CHAPTER 11
REFERENCES

11.1 REFERENCES

1. https://www.kaggle.com/rounakbanik/movie-recommender-systems

2. https://www.kaggle.com/bakostamas/movie-recommendation-algorithm

3. https://www.toptal.com/algorithms/predicting-likes-inside-a-simple-recommendation-engine

4. https://pdfs.semanticscholar.org/767e/ed55d61e3aba4e1d0e175d61f65ec0dd6c08.pdf

5. https://en.wikipedia.org/wiki/Recommender_system

6. https://en.wikipedia.org/wiki/Collaborative_filtering

7. Edureka tutorial

8. Cloudera tutorial

