Cover Page for Academic Tasks

Course Code: MGN619 Course Title: Business Analytics

Course Instructor: Kriti Bedi

Academic Task No.: 02 Academic Task Title: Predictive Ays.

Date of Allotment: 5/10/2018 Date of Submission: 23/10/2018

Student’s Name: Chitransh singh Student’s Reg. No :11706236

Evaluation Parameters: (Parameters on which student is to be evaluated- To be mentioned by

students as specified at the time of assigning the task by the instructor)

Learning Outcomes: Learned about machine learning and predictive analysis using various models.

Declaration:

I declare that this Assignment is my individual work. I have not copied it from any other student’s work or
from any other source except where due acknowledgement is made explicitly in the text, nor has any part
been written for me by any other person.

Student’s Signature:

Evaluator’s comments (For Instructor’s use only)

General Observations Suggestions for Improvement Best part of assignment

Evaluator’s Signature and Date:

Marks Obtained:

Max. Marks
Introduction
In this study, a decision tree classifier (a machine learning algorithm) is applied for prediction.
The data comes from the Bank Marketing dataset and includes 45211 instances (rows) on 17
attributes (columns), based on Data Mining for Bank Direct Marketing.

Dataset(s)
The data for the project was accessed from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/bankmarketing). The data was compiled by Paulo Cortez
and Sérgio Moro for Data Mining for Bank Direct Marketing, October 2011. The data set includes
45211 observations and 17 attributes. The data relates to direct marketing campaigns of a
Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more
than one contact with the same client was required in order to assess whether the product (bank
term deposit) would be subscribed or not. The variables in the data set include age, job type
(such as admin, services, blue-collar, etc.), and marital status (single, married, or divorced),
among others. The 17 attributes are a mix of nominal, numeric, and binary variables.

Data Preparation and Cleaning


The following data preparation tasks are conducted to make the data suitable for running
the machine learning model (decision tree classifier):

Rows with unknown values are not removed.

The file type is changed from xls to csv, or else the file is renamed.

Either the data is subset to the columns used for the analysis, or the attributes are
manually labelled on the columns, to ease the selection of data in the model and to
prevent errors.
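
The preparation steps above could be sketched in R roughly as follows. This is only a minimal illustration on a small made-up data frame, not the actual bank file; the descriptive column names (age, job, marital) are assumptions about what V1, V2, and V3 hold and do not come from the assignment.

```r
# Toy stand-in for the imported bank data (values are made up for illustration)
bank_raw <- data.frame(
  V1 = c(58, 44, 33, 47),
  V2 = c("management", "technician", "unknown", "blue-collar"),
  V3 = c("married", "single", "married", "divorced"),
  stringsAsFactors = FALSE
)

# Manually label the columns so they are easier to select in a model
# (these names are assumed, not taken from the assignment)
names(bank_raw) <- c("age", "job", "marital")

# Keep only the columns needed for the analysis
bank_subset <- bank_raw[, c("age", "marital")]

# Rows containing "unknown" are kept, but it helps to know how many there are
n_unknown <- sum(bank_raw$job == "unknown")

# Save as CSV so that it can be re-read with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(bank_subset, csv_path, row.names = FALSE)
bank_csv <- read.csv(csv_path)
```

Labelling the columns once, right after import, means later model formulas can use readable names instead of V1, V2, and so on.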

Methods
A supervised machine learning approach, the decision tree classifier, is used for the study. The
tree classifier is chosen because the outcome (target) variable is binary, so classification
algorithms are better suited than regression algorithms. With a target that takes only the
values 0 and 1, regression algorithms perform poorly because there is little variation in the
target variable.
#classification
In R, a library is a collection of installed packages of R functions; library() loads a package so that its functions can be used on the data.
> library(tree)
> library(ISLR)
> attach(bank)
> View(bank)
> range(V1)
[1] 18 95

#converting the numeric variable V1 into a categorical variable (High/Low)


High=ifelse(V1>=40,"High","Low")
View(High)
Bank=data.frame(bank,High)
View(Bank)
#partitioning of data(Bank) into train and test

ind=sample(2,nrow(Bank),replace=TRUE,prob=c(0.7,0.3))

train=Bank[ind==1,]

test=Bank[ind==2,]
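
Because sample() draws a label for each row independently, the split above is only approximately 70/30 and changes on every run. A minimal sketch of a repeatable version, on a made-up data frame (toy, train_toy, and test_toy are illustrative names, and the seed value is arbitrary):

```r
set.seed(123)  # arbitrary seed, used only so that the split is repeatable

# Made-up data standing in for the bank data
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Label each row 1 (train) or 2 (test) with probabilities 0.7 and 0.3
ind_toy <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train_toy <- toy[ind_toy == 1, ]
test_toy <- toy[ind_toy == 2, ]

# Share of rows that ended up in the training set (close to, not exactly, 0.7)
train_share <- nrow(train_toy) / nrow(toy)
```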

library(party) # classes for representing decision trees and corresponding accessor functions; provides ctree()

model<-ctree(V2~V3+V4+V7,data=train,controls=ctree_control(mincriterion = 0.9,minsplit = 5000))


# Since the data set is large, control parameters are used to limit the number of nodes
model

plot(model)
As the plot shows, for the bank marketing dataset the tree splits on the variables V4, V3, and V7,
which are the most important factors.

From the tree model above we can interpret that the areas most affected in the bank marketing
decision are primary, tertiary, secondary, and unknown, which are broken down further below.

Within the V4 categories (area), such as primary, secondary, and unknown, there are different
classes of people, namely single, married, and divorced, which are represented by V3.

Under V3 the tree forms two groups:

1) Single
2) Divorced and married

Single: p<0.001

If the person is single and V7 is “yes”, the observation falls in Node 6 (n=291),
whereas if V7 is “no” it falls in Node 7 (n=211).

If the person is divorced or married, the tree creates four further groups under V3.
Of these four, the largest is married people, at Node 13 (n=591), compared with the
other groups.
Where V7 is “yes”, these people fall under the secondary and unknown categories of V4.

Finally, the most important category is tertiary under V4, which is strongly affected by the
split on V7. These people are generally divorced or married, and Node 21 (n=3067) covers the
largest number of observations.
#error in train data
tab<-table(predict(model),train$V2)
print(tab)
1-sum(diag(tab))/sum(tab)

#error in test data
testpred<-predict(model,newdata=test)
tab2<-table(testpred,test$V2)
print(tab2)
1-sum(diag(tab2))/sum(tab2)
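
The error expression used above is one minus the share of observations on the diagonal of the confusion table, that is, the misclassification rate. A small worked sketch with made-up labels (actual_toy and predicted_toy are illustrative and are not results from the bank data):

```r
# Made-up true and predicted classes; both factors share the same levels so
# that the diagonal of the table lines up with correct predictions
actual_toy <- factor(c("yes", "no", "no", "yes", "no", "no"),
                     levels = c("no", "yes"))
predicted_toy <- factor(c("yes", "no", "yes", "yes", "no", "no"),
                        levels = c("no", "yes"))

# Confusion table: rows are predictions, columns are the true classes
tab_toy <- table(predicted_toy, actual_toy)

# Misclassification rate = 1 - (correct predictions / all predictions)
error_rate <- 1 - sum(diag(tab_toy)) / sum(tab_toy)
```

Here one of the six predictions is wrong, so error_rate equals 1/6.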
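# Regression

# read the bank marketing data
bank<-read.csv("C:/Users/ABC/Desktop/banking marketing.csv")
# split into training (80%) and validation (20%) data
ind=sample(2,nrow(bank),replace=TRUE,prob=c(0.8,0.2))
tdata=bank[ind==1,]
vdata=bank[ind==2,]
head(tdata)
head(vdata)
# fit a linear regression of V1 on V6 using the training data
results=lm(V1~V6,tdata)
# predict V1 for the validation data
prediction=predict(results,vdata)
head(prediction)
summary(results)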
> results

Call:
lm(formula = V1 ~ V6, data = tdata)

Coefficients:
(Intercept)           V6
  4.047e+01    3.343e-04
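
The coefficients printed above describe the fitted line V1 = 40.47 + 0.0003343 * V6. A minimal sketch of how a prediction follows from them (predict_v1 is a hypothetical helper written for illustration, and the input value 1000 is made up):

```r
# Coefficients copied from the printed lm() output above
intercept <- 4.047e+01
slope <- 3.343e-04

# Hypothetical helper: predicted V1 for a given value of V6
predict_v1 <- function(v6) {
  intercept + slope * v6
}

# Example: predicted V1 when V6 is 1000 (40.47 + 0.3343 = 40.8043)
pred_1000 <- predict_v1(1000)
```

The slope is very small, so a change of 1000 units in V6 moves the predicted V1 by only about 0.33.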
#Clustering
Clustering is an unsupervised learning technique. It is the task of grouping together a set of
objects in such a way that objects in the same cluster are more similar to each other than to
objects in other clusters.

A cluster plot could not be produced from the Bank marketing dataset, so instead of this dataset
I used the built-in dataset “USArrests”, which is included with R.

The commands below use the “USArrests” data to plot the clusters, which show the most affected
areas in terms of Murder, Assault, Rape, etc.
#Full script
> library(factoextra) # provides fviz_cluster()
> attach(USArrests)
> View(USArrests)
> df=USArrests
> df=na.omit(df)
> df=scale(df)
> k2=kmeans(df,centers = 2,nstart = 20)
> fviz_cluster(k2,data=df)
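
Besides plotting, the k-means result can be inspected directly in base R. A minimal sketch on the same built-in data (the seed is arbitrary, and df_toy, k2_toy, and explained are illustrative names):

```r
set.seed(1)  # arbitrary seed so that the random starts are repeatable

# Same preparation as above: drop missing rows, then standardise each column
df_toy <- scale(na.omit(USArrests))

# k-means with 2 clusters and 20 random starts
k2_toy <- kmeans(df_toy, centers = 2, nstart = 20)

# Number of states in each cluster and the (standardised) cluster centres
k2_toy$size
k2_toy$centers

# Share of total variation explained by the clustering
explained <- k2_toy$betweenss / k2_toy$totss
```

Comparing the cluster centres shows which group has the higher standardised values for each variable.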
