
Post Graduate Diploma in Business Analytics, 2015-2017

22 July 2016
Data Source: www.boxofficeindia.com, Twitter, Wikipedia

PROJECT SCOPE:
The project aims to predict the success of a Bollywood movie. Success is estimated on three
levels: Hit, Average and Flop. The study can be used from three broad perspectives:
a) Before a movie is made, estimate its success based on the given parameters
b) For actors, identify which movies they should work in to improve their hit ratio
c) For producers, identify the controllable variables and adapt the movie launch to
maximize the chances of the movie being a hit
DATA PREPARATION:
The data set consists of 842 movies released from 2012 to 2016. The 726 movies released
from 2012 to 2015 form the training data set, while the 116 movies released in 2016 form
the test data set. The response variable is Total Movie Grossing divided by Total Movie
Budget (i.e. Revenue/Budget). Movies with a ratio less than 1 are considered flops, i.e.
their gross revenue is less than their budget. Movies with a ratio in the range 1 to 2 are
considered hits, while movies with a ratio of more than 2 are considered super hits.
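The labelling rule and the year-based split described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual script: the function names, the record fields (`year`, `revenue`, `budget`) and the sample rows are assumptions made for the example, and a ratio of exactly 1 or exactly 2 is treated as a hit.

```python
def success_label(revenue, budget):
    """Map Revenue/Budget to one of the three classes used in the report."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = revenue / budget
    if ratio < 1:
        return "flop"       # revenue is below the budget
    if ratio <= 2:
        return "hit"        # ratio in the range 1 to 2 (bounds included: an assumption)
    return "super hit"      # ratio above 2


def split_by_year(movies, test_year=2016):
    """Earlier releases form the training set; test_year releases form the test set."""
    train = [m for m in movies if m["year"] < test_year]
    test = [m for m in movies if m["year"] == test_year]
    return train, test


# Hypothetical records, for illustration only (not taken from the data set).
sample_movies = [
    {"year": 2013, "revenue": 40.0, "budget": 60.0},
    {"year": 2015, "revenue": 90.0, "budget": 50.0},
    {"year": 2016, "revenue": 300.0, "budget": 100.0},
]

train_set, test_set = split_by_year(sample_movies)
labels = [success_label(m["revenue"], m["budget"]) for m in sample_movies]
```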
There are six explanatory variables: month of movie launch, length of movie, number of
screens on which the movie was launched, genre of the movie, actor popularity and director
popularity (refer to Exhibit 1 for data types). Month refers to the month of the movie's
release, while length captures the duration of the movie in minutes. In total, 13 different
genres are captured. A movie can fall into multiple genres (refer to Exhibit 2 for the
different genres).
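Because a movie can belong to several genres at once, one possible way to feed genre into a model is one indicator column per genre. The sketch below assumes that encoding; the report does not state how genre was actually encoded, and the three genre names are placeholders rather than the 13 genres of Exhibit 2.

```python
# Placeholder genre list (hypothetical; the report's 13 genres are in Exhibit 2).
GENRES = ["Action", "Comedy", "Drama"]


def encode_genres(movie_genres, all_genres=GENRES):
    """Return one 0/1 indicator per known genre for a single movie."""
    return {genre: int(genre in movie_genres) for genre in all_genres}


# A movie tagged with two genres gets two indicators set to 1.
encoded = encode_genres({"Action", "Drama"})
```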

Number of screens refers to the number of screens on which the movie was released; a
multiplex can have more than one screen. The actor and director popularity indices are
calculated from the number of Twitter followers. For the actor popularity index, actors
whose follower counts fall in the bottom 50 percent have low popularity, those in the top
25 percent have high popularity, and the rest fall in the medium category. For the director
popularity index, the 80th percentile is the threshold: directors with Twitter followers in
the top 20 percent have high popularity, while all others have low popularity. (Refer to
Exhibit 3 for the distribution of Twitter followers of actors and directors.)
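The percentile rules above can be sketched with Python's standard library. This is an illustration under stated assumptions: the function names are invented for the example, the follower counts are made up, the "inclusive" interpolation method is one of several reasonable percentile definitions, and a count that sits exactly on a cut point is assigned to the lower category.

```python
from statistics import quantiles


def percentile_cut(values, pct):
    """Return the pct-th percentile (1 to 99) of values, with linear interpolation."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def actor_popularity(followers, all_actor_followers):
    """Bottom 50 percent -> low, top 25 percent -> high, otherwise medium."""
    p50 = percentile_cut(all_actor_followers, 50)
    p75 = percentile_cut(all_actor_followers, 75)
    if followers <= p50:
        return "low"
    if followers > p75:
        return "high"
    return "medium"


def director_popularity(followers, all_director_followers):
    """Top 20 percent -> high, everyone else -> low."""
    p80 = percentile_cut(all_director_followers, 80)
    return "high" if followers > p80 else "low"


# Hypothetical follower counts, for illustration only.
sample_followers = [10, 20, 30, 40, 50, 60, 70, 80]
actor_labels = [actor_popularity(f, sample_followers) for f in sample_followers]
director_labels = [director_popularity(f, sample_followers) for f in sample_followers]
```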
DATA EXTRACTION:
The data was crawled from the website www.boxofficeindia.com using a Python script. In
addition, the Twitter API was accessed from Python (module Tweepy) to find the number of
Twitter followers of actors and directors. The number of followers was used as a proxy to
classify the popularity of an actor on three levels (high, medium and low) and the
popularity of a director on two levels (high and low).
TOOLS USED:
AAA
METHODOLOGY:
Logistic
Odds Ratio
Decision Tree
A decision tree was built using the rpart function in R. The model suggested that the most
important variable was Screens with value ___, then length ____, and so on (see the graph
below). Prediction was then carried out and the results are shown.
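To illustrate how a classification tree arrives at a split such as "Screens with value ___", the sketch below searches for the threshold on one numeric variable that gives the lowest weighted Gini impurity, which is the default splitting criterion rpart uses for classification. It is a pure-Python illustration of that single step, not a reimplementation of rpart (no pruning, surrogate splits or variable importance scores), and the screen counts and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value < threshold' split.

    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, gini(labels)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between two identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical data, for illustration only.
screens = [200, 350, 500, 1200, 1800, 2500]
outcome = ["flop", "flop", "flop", "hit", "hit", "super hit"]
threshold, score = best_split(screens, outcome)
```

Running the same search on each explanatory variable and keeping the one with the lowest score gives the variable a tree would choose for its first split, which is one common reading of "most important variable".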

RESULTS:
RECOMMENDATIONS:
For Movie Success:
For Actor:
For Producer:

APPENDIX:
Exhibit 1:

Exhibit 2:

Exhibit 3:
