You are on page 1of 26

9/1/2018 Prospective Insurance Customers

Prospective Insurance Customers:


In [25]: import warnings
warnings.filterwarnings('ignore')

In [2]: import numpy as np


import pandas as pd

import matplotlib
matplotlib.use('nbagg')
import matplotlib.pyplot as plt

import seaborn as sns


import math
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

1.Collecting the data:


In [3]: # Loading the Spread sheet
sheet = pd.ExcelFile('Prospective Insurance Customers.xlsx')

In [4]: #Loading the Spread Sheet into DataFrame


df = sheet.parse()

In [5]: #shape of data


df.shape

Out[5]: (5000, 11)

observation : This dataset is consisting of 5000 rows and 11 columns/attributes.

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 1/26
9/1/2018 Prospective Insurance Customers

In [6]: #seeing how the data looks like


df.head()

Out[6]:

Number Nu
Customer Avg Size Relationship Household Education House
of Age o
ID Household Status Type Level Type
Houses Pol

Without Home
0 529229 1 2 50 Married Low 1
Children Owner

With Rented
1 926757 1 3 46 Married Low 0
Children House

With Rented
2 475463 1 3 45 Married Low 0
Children House

Without Rented
3 900971 1 2 40 Married Low 1
Children House

Other Without Rented


4 628437 1 1 50 Medium 0
Relationship Children House

2.Analyzing th data:
In [7]: #Giving the information about the attributes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
Customer ID 5000 non-null int64
Number of Houses 5000 non-null int64
Avg Size Household 5000 non-null int64
Age 5000 non-null int64
Relationship Status 5000 non-null object
Household Type 5000 non-null object
Education Level 5000 non-null object
House Type 5000 non-null object
Number of Car Policies 5000 non-null int64
Number of Life Insurance Policies 5000 non-null int64
Customer Type 5000 non-null object
dtypes: int64(6), object(5)
memory usage: 429.8+ KB

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 2/26
9/1/2018 Prospective Insurance Customers

In [8]: df.describe()

Out[8]:
Number of
Number of Avg Size Number of Life
Customer ID Age
Houses Household Car Policies Insurance
Policies

count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000

mean 500123.188600 1.111600 2.672000 44.892600 0.560400 0.074800

std 289073.995447 0.415668 0.788759 8.714531 0.606651 0.373675

min 175.000000 1.000000 1.000000 20.000000 0.000000 0.000000

25% 248324.750000 1.000000 2.000000 40.000000 0.000000 0.000000

50% 501487.500000 1.000000 3.000000 44.000000 1.000000 0.000000

75% 747474.750000 1.000000 3.000000 50.000000 1.000000 0.000000

max 999399.000000 10.000000 5.000000 80.000000 7.000000 8.000000

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 3/26
9/1/2018 Prospective Insurance Customers

In [9]: sns.countplot('Customer Type', data=df)


plt.show()
df['Customer Type'].value_counts()

Out[9]: Prospective 4689


Current 311
Name: Customer Type, dtype: int64

observation: Given dataset has imbalanced data, which means majority class is Prospective and minority class
is Current.

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 4/26
9/1/2018 Prospective Insurance Customers

In [10]: sns.countplot(x='Customer Type',hue='Relationship Status', data=df)


plt.show()
print(df[df['Customer Type']=='Current']['Relationship Status'].value_counts
())

df[df['Customer Type']=='Prospective']['Relationship Status'].value_counts()

Married 294
Other Relationship 13
Singles 3
Living Together 1
Name: Relationship Status, dtype: int64

Out[10]: Married 4061


Other Relationship 487
Singles 90
Living Together 51
Name: Relationship Status, dtype: int64

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 5/26
9/1/2018 Prospective Insurance Customers

In [11]: sns.countplot(x='Customer Type',hue='Household Type', data=df)


plt.show()

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 6/26
9/1/2018 Prospective Insurance Customers

In [12]: sns.countplot(x='Customer Type',hue='Education Level', data=df)


plt.show()

In [13]: #Finding the numerical features

cols = df.columns

num_cols = df._get_numeric_data().columns
print("Numerical features are:")
print(list(num_cols))

Numerical features are:


['Customer ID', 'Number of Houses', 'Avg Size Household', 'Age', 'Number of C
ar Policies', 'Number of Life Insurance Policies']

In [14]: #categorical_features = cols - num_cols


ct = list(set(cols) - set(num_cols))
print("Categocal Features are:")
print(ct)

Categocal Features are:


['Relationship Status', 'House Type', 'Education Level', 'Household Type', 'C
ustomer Type']

3.Data Cleaning :

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 7/26
9/1/2018 Prospective Insurance Customers

In [15]: #Find if there are any Nan values


df.isnull().sum()

Out[15]: Customer ID 0
Number of Houses 0
Avg Size Household 0
Age 0
Relationship Status 0
Household Type 0
Education Level 0
House Type 0
Number of Car Policies 0
Number of Life Insurance Policies 0
Customer Type 0
dtype: int64

In [17]: df['Customer Type'].value_counts()

Out[17]: Prospective 4689


Current 311
Name: Customer Type, dtype: int64

In [16]: #Customer Type has two categories, so making Current category to 1 and Prospec
tive category to 0

def customer_type(x):
if(x == 'Current'):
return 1
return 0

df['Customer Type'] = df['Customer Type'].map(customer_type)

In [17]: #Converting categorial features into numerical features

df_final = pd.get_dummies(df, columns=['Household Type', 'House Type',


'Education Level', 'Relationship Status'])

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 8/26
9/1/2018 Prospective Insurance Customers

In [18]: print(df_final.shape)
df_final.head()

(5000, 18)
Out[18]:
Number
Number Number Household Ho
Customer Avg Size of Life Customer
of Age of Car Type_With Type_
ID Household Insurance Type
Houses Policies Children
Policies

0 529229 1 2 50 1 0 1 0 1

1 926757 1 3 46 0 0 0 1 0

2 475463 1 3 45 0 0 0 1 0

3 900971 1 2 40 1 0 0 0 1

4 628437 1 1 50 0 0 0 0 1

4.Splitting the data:


In [280]: #Storing independent variables into X and dependent variable into y
X = df_final.drop(['Customer Type', 'Customer ID'] , axis=1)
y = df_final['Customer Type']

In [20]: X_cols = X.columns

X_cols.tolist()

Out[20]: ['Number of Houses',


'Avg Size Household',
'Age',
'Number of Car Policies',
'Number of Life Insurance Policies',
'Household Type_With Children',
'Household Type_Without Children',
'House Type_Home Owner',
'House Type_Rented House',
'Education Level_High',
'Education Level_Low',
'Education Level_Medium',
'Relationship Status_Living Together',
'Relationship Status_Married',
'Relationship Status_Other Relationship',
'Relationship Status_Singles']

In [21]: #normalizing the data


from sklearn import preprocessing
X1 = preprocessing.normalize(X)

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 9/26
9/1/2018 Prospective Insurance Customers

In [22]: print("shape of X:", X1.shape)


print("shape of Y:", y.shape)

shape of X: (5000, 16)


shape of Y: (5000,)

In [26]: from sklearn.cross_validation import train_test_split

#splitting the data into train as 70% and test as 30%


X_tr, X_test, Y_tr, Y_test = train_test_split(X1,y,test_size=0.3,random_state=
1)

In [27]: Y_test = Y_test.values

In [28]: print("shape of train data:", X_tr.shape)


print("shape of test data:", X_test.shape)

shape of train data: (3500, 16)


shape of test data: (1500, 16)

5.Building the model:

Utility functions for model evaluations

In [53]: def print_results(y_true, y_pred):

cm = confusion_matrix(Y_test, Y_pred)

sns.heatmap(cm, annot=True,annot_kws={"size": 20}, fmt='g', vmin=0, vmax=1


500)

print("Accuracy on test data:",accuracy_score(Y_test, Y_pred) * 100)


print("Precision on test data:",precision_score(Y_test, Y_pred) * 100)
print("Recall on test data:",recall_score(Y_test, Y_pred) * 100)
print("F1_score on test data:",f1_score(Y_test, Y_pred) * 100)
plt.show()

In [54]: from sklearn.linear_model import LogisticRegression

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 10/26
9/1/2018 Prospective Insurance Customers

In [55]: #fitting the model with logistic regression


log_model = LogisticRegression()
log_model.fit(X_tr, Y_tr)

Out[55]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

In [56]: Y_pred = log_model.predict(X_test)

print_results(Y_test, Y_pred)

Accuracy on test data: 93.0


Precision on test data: 0.0
Recall on test data: 0.0
F1_score on test data: 0.0

By comparing with the above metrics, this model is predicting only the dominating class which means
majority class because imbalanced data.

To overcome that point, we can use parameter called class weight can be made as balanced.

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 11/26
9/1/2018 Prospective Insurance Customers

Use class weights to make it right


We will do cross validation to check over fitting / under fitting

In [57]: from sklearn.cross_validation import cross_val_score, cross_val_predict

#fitting the model using the class_weight parameter with balanced


log_model = LogisticRegression(class_weight='balanced')

predicted = cross_val_predict(log_model, X_tr, Y_tr, cv=5)

#finding the train accuracy


print("Train Accuracy on data:",accuracy_score(Y_tr, predicted) * 100)

Train Accuracy on data: 64.25714285714285

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 12/26
9/1/2018 Prospective Insurance Customers

In [58]: #testing the model


log_model.fit(X_tr, Y_tr)
Y_pred = log_model.predict(X_test) # Predicted class labels from test features

print_results(Y_test, Y_pred)

Accuracy on test data: 64.13333333333333


Precision on test data: 11.408199643493761
Recall on test data: 60.952380952380956
F1_score on test data: 19.219219219219223

observation:

According to problem, we have to predict current customer as current and enen if prospective customer
is predicted as current customer its fine. Because that is what our final objective is.

So, we have to increase true positives(predicting Current customer as current customer) which
means decease the false negative(ie., we don't want our model to predict current customers as
prospective customers), for that reason have to choose our metric as recall.

In [59]: scores = cross_val_score(log_model, X_tr, Y_tr, cv=5,scoring='accuracy')


print(scores)

[0.67902996 0.62714286 0.61142857 0.65142857 0.64377682]

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 13/26
9/1/2018 Prospective Insurance Customers

From the above list, the cross validation scores are not changing much across different cv sets. So we
can be sure that our model is clearly not overfitting

=======================================

Picking right threshold as it is imbalanced data

1. Getting the probability values instead of label, so that we can


choose threshold
In [60]: probs = log_model.predict_proba(X_test) #Predicted probabilities from test fea
tures,array of shape = [n_samples, n_classes]

In [61]: probs[0]

Out[61]: array([0.47050864, 0.52949136])

In [73]: # we wil take probability values of a customer being current customer from the
given probailities
probs = probs[:,1]
probs[0]

Out[73]: 0.5294913551020631

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 14/26
9/1/2018 Prospective Insurance Customers

In [64]: def threshold_picker(start, end, step, y_true, pred_probs, print_cm = False):

# create thresholds in specified range


ths = np.arange(start , end, step)

# list that stores threshold values, precisions and recall values


precision = []
recall = []
ths_new = []

for th in ths:
th = np.round(th,3)
act_y = [] # predicted values for current threshhold
pre_y = [] # true values for current threshold
for i, pred_prob in enumerate(pred_probs):
act_y.append(y_true[i]);
pre_y.append(1 if pred_prob > th else 0);

# finding precision ad recall values with current threshold


p = precision_score(act_y, pre_y) * 100
r = recall_score(act_y, pre_y) * 100
if(p != 0 and r!= 0):
precision.append(p)
recall.append(r)
ths_new.append(th)
# print confusion matrix if needed
if print_cm:
cm = pd.DataFrame(confusion_matrix(act_y, pre_y))
print(cm)
print(f'(threshold, Precision, recall) : ({th}, {p}, {r})')
print('*'*50)

#Precision and Recall Scores of the decision threshold


plt.plot(ths_new, precision, 'b--', marker='o', label='precision')
plt.plot(ths_new, recall, 'r--', marker='o', label='recall')
plt.legend()
plt.show()

2. Checking for right threshold ( 0.3 <= threshold <= 0.8 )

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 15/26
9/1/2018 Prospective Insurance Customers

In [74]: threshold_picker(0.3, 0.8, 0.01, Y_test, probs)

As we already know recall should be high and precsion should not be low, ie., if the precision is too low,
the model is trying to predict every customer as prospective, which we should avoid.

so we will pick the range where precision is starts increasig and recall is also reasonably high.(0.45 to
0.5)

3. Checking for right threshold ( 0.45 <= threshold <= 0.5 )

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 16/26
9/1/2018 Prospective Insurance Customers

In [75]: threshold_picker(0.45, 0.5, 0.002, Y_test, probs)

Around 0.48, precision remains same and recall decreases suddenly, we will take that as our optimal
threshold.

Final optimal threshold : 0.48

4. Comparing results of optimal threshold(0.48) with default


threshold(0.5) with balanced class_weights

4.1 Default Threshold

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 17/26
9/1/2018 Prospective Insurance Customers

In [165]: # using defalut threshold


th = 0.5
# converting probabilities into 1/0 based on threshold
Y_pred_default = [1 if prob>th else 0 for prob in probs]

# find confusion matrix


cm_default = confusion_matrix(Y_test, Y_pred_default)

4.2 Optimal Threshold

In [167]: th = 0.48
# conerting probabilities into 1/0 based on threshold
Y_pred_optimal = [1 if prob>th else 0 for prob in probs]
# find confusion matrix
cm_optimal = confusion_matrix(Y_test, Y_pred_optimal)

4.3 plotting both confusion matrices

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 18/26
9/1/2018 Prospective Insurance Customers

In [181]: fig = plt.figure(figsize=plt.figaspect(.5))

ax1 = plt.subplot(121)
ax1.set_adjustable('box')
sns.heatmap(cm_default, annot=True,annot_kws={"size": 20}, fmt='g', cbar=False
, vmin=0, vmax=1500, ax=ax1)
plt.title("threshold=0.5")

ax2 = plt.subplot(122)

sns.heatmap(cm_optimal, annot=True,annot_kws={"size": 20}, fmt='g', vmin=0, vm


ax=1500, ax=ax2)
plt.title("threshold=0.48")
plt.show()

# *******************************************
print('\n\nDefault threshold')
print('*'*25)
print("Precision:",precision_score(Y_test, Y_pred_default) * 100)
print("Recall :",recall_score(Y_test, Y_pred_default) * 100)

print('\n\nOptimal threshold')
print('*'*25)
print("Precision :",precision_score(Y_test, Y_pred_optimal) * 100)
print("Recall :",recall_score(Y_test, Y_pred_optimal) * 100)

Default threshold
*************************
Precision: 11.408199643493761
Recall : 60.952380952380956

Optimal threshold
*************************
Precision : 9.83425414364641
Recall : 84.76190476190476

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 19/26
9/1/2018 Prospective Insurance Customers

Questions

1. Which attributes are most predictive of a customer’s likelihood


to purchase and why?

In [234]: #Finding the weights of all features

weights = log_model.coef_
weights

Out[234]: array([[-1.22123796, 0.41864355, -0.04224982, 6.00272688, -0.31417063,


-0.33972745, -0.17507081, 1.62442449, -2.13922276, 2.034943 ,
-3.61282045, 1.06307918, -0.13138883, 0.90629245, -1.30562076,
0.01591887]])

In [330]: #Finding the column names


X_cols = X.columns

a = X_cols.tolist()

In [273]: #Making the dictionary of key should be feature names and value should be corr
esponding feature weight

d = {}
n = len(a) #finding the length of features

for i in range(n):
x = a[i]
d[a[i]] = weights[0][i]

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 20/26
9/1/2018 Prospective Insurance Customers

In [279]: import operator


result = sorted(d.items(), key=operator.itemgetter(1),reverse=True)
result

Out[279]: [('Number of Car Policies', 6.002726879338931),


('Education Level_High', 2.0349430012371137),
('House Type_Home Owner', 1.624424493865725),
('Education Level_Medium', 1.063079184955882),
('Relationship Status_Married', 0.906292450000115),
('Avg Size Household', 0.4186435502184599),
('Relationship Status_Singles', 0.01591887361090185),
('Age', -0.04224982386024744),
('Relationship Status_Living Together', -0.13138882682397124),
('Household Type_Without Children', -0.17507081222574708),
('Number of Life Insurance Policies', -0.31417062839689325),
('Household Type_With Children', -0.339727452196972),
('Number of Houses', -1.2212379620488776),
('Relationship Status_Other Relationship', -1.3056207612097648),
('House Type_Rented House', -2.13922275828844),
('Education Level_Low', -3.612820450615726)]

In [369]: result[2][0]

Out[369]: 'House Type_Home Owner'

In [394]: n = len(result)
keys = []
values = []
for i in range(n):
keys.append(result[i][0])
values.append(result[i][1])

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 21/26
9/1/2018 Prospective Insurance Customers

In [397]: sns.barplot(values, keys)


plt.show()

print(keys)

['Number of Car Policies', 'Education Level_High', 'House Type_Home Owner',


'Education Level_Medium', 'Relationship Status_Married', 'Avg Size Househol
d', 'Relationship Status_Singles', 'Age', 'Relationship Status_Living Togethe
r', 'Household Type_Without Children', 'Number of Life Insurance Policies',
'Household Type_With Children', 'Number of Houses', 'Relationship Status_Othe
r Relationship', 'House Type_Rented House', 'Education Level_Low']

2. Among the customers who have not purchased a health


insurance plan(ie., prospective customers), who are the 50 most
likely to purchase a plan?

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 22/26
9/1/2018 Prospective Insurance Customers

In [284]: df_final.head()

Out[284]:
Number
Number Number Household Ho
Customer Avg Size of Life Customer
of Age of Car Type_With Type_
ID Household Insurance Type
Houses Policies Children
Policies

0 529229 1 2 50 1 0 1 0 1

1 926757 1 3 46 0 0 0 1 0

2 475463 1 3 45 0 0 0 1 0

3 900971 1 2 40 1 0 0 0 1

4 628437 1 1 50 0 0 0 0 1

In [289]: #Taking only the perspective customers from the whole data
df_final1 = df_final[df_final['Customer Type'] == 0]

df_final1['Customer Type'].value_counts()

Out[289]: 0 4689
Name: Customer Type, dtype: int64

In [318]: #storing the Customer Id of prospective customer into an one variable


cus_ids = df_final1['Customer ID']
#Converting pandas to list
a = cus_ids.tolist()

In [294]: #Splitting the data into independent variables and dependent variables
X = df_final1.drop(['Customer ID', 'Customer Type'], axis=1)
y = df_final1['Customer Type']

In [296]: #normalizing the data


from sklearn import preprocessing
X_pre = preprocessing.normalize(X)

In [297]: Y_pred1 = log_model.predict(X_pre)

In [298]: probs1 = log_model.predict_proba(X)

In [304]: probs1[][1]

Out[304]: 0.028084328873035802

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 23/26
9/1/2018 Prospective Insurance Customers

In [320]: th = 0.48
n = len(Y_pred1)
dis = {}
for i in range(n):

if(probs1[i][1] > th):


dis[a[i]] = probs1[i][1]

In [321]: result2 = sorted(dis.items(), key=operator.itemgetter(1),reverse=True)

In [329]: cus_50 = []
for i in range(50):
cus_50.append(result2[i][0])
print(cus_50)

[565836, 349052, 983279, 849524, 67544, 942610, 100040, 139962, 274774, 33491
8, 956126, 786112, 594755, 280767, 958667, 558277, 16719, 385947, 428279, 216
740, 966356, 67752, 285831, 99666, 921421, 628626, 667416, 268739, 304136, 67
8156, 188763, 328206, 167788, 856187, 260768, 512893, 205384, 343236, 962491,
91323, 109910, 498509, 862950, 87493, 596017, 922856, 468284, 465313, 458265,
589181]

Getting top 50 customers details who will most likely to purchase a plan

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 24/26
9/1/2018 Prospective Insurance Customers

In [360]: # get all the customer ids from the dataframe


total_cus_ids = df['Customer ID'].tolist()

# get the index of top_50 cusstomers.


top_cus_indices = [total_cus_ids.index(cus_id) for cus_id in cus_50]

# get top 50 customer details from the


top_50_cus_details = df.iloc[top_cus_indices]

top_50_cus_details.head()

Out[360]:

Number
Customer Avg Size Relationship Household Education House
of Age
ID Household Status Type Level Type
Houses

Without Home
4616 565836 1 3 49 Married Medium
Children Owner

Other With Rented


2338 349052 1 3 40 Medium
Relationship Children House

With Home
4852 983279 1 3 56 Married Medium
Children Owner

Without Home
3659 849524 1 2 52 Married Low
Children Owner

Without Home
2653 67544 1 2 47 Married High
Children Owner

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 25/26
9/1/2018 Prospective Insurance Customers

http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 26/26

You might also like