Prospective Insurance Customers PDF

9/1/2018 Prospective Insurance Customers
Prospective Insurance Customers:

In [25]: import warnings
warnings.filterwarnings('ignore')
In [2]: import numpy as np

import pandas as pd
import matplotlib
matplotlib.use('nbagg')
import matplotlib.pyplot as plt
import seaborn as sns

import math
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
1.Collecting the data:

In [3]: # Loading the Spread sheet
sheet = pd.ExcelFile('Prospective Insurance Customers.xlsx')
In [4]: #Loading the Spread Sheet into DataFrame

df = sheet.parse()
In [5]: #shape of data

df.shape
Out[5]: (5000, 11)
observation : This dataset is consisting of 5000 rows and 11 columns/attributes.
http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 1/26
In [6]: #seeing how the data looks like

df.head()
Out[6]:
Number Nu
Customer Avg Size Relationship Household Education House
of Age o
ID Household Status Type Level Type
Houses Pol
Without Home
0 529229 1 2 50 Married Low 1
Children Owner
With Rented
1 926757 1 3 46 Married Low 0
Children House
With Rented
2 475463 1 3 45 Married Low 0
Children House
Without Rented
3 900971 1 2 40 Married Low 1
Children House
Other Without Rented

4 628437 1 1 50 Medium 0
Relationship Children House
2.Analyzing th data:
In [7]: #Giving the information about the attributes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
Customer ID 5000 non-null int64
Number of Houses 5000 non-null int64
Avg Size Household 5000 non-null int64
Age 5000 non-null int64
Relationship Status 5000 non-null object
Household Type 5000 non-null object
Education Level 5000 non-null object
House Type 5000 non-null object
Number of Car Policies 5000 non-null int64
Number of Life Insurance Policies 5000 non-null int64
Customer Type 5000 non-null object
dtypes: int64(6), object(5)
memory usage: 429.8+ KB
In [8]: df.describe()
Out[8]:
Number of
Number of Avg Size Number of Life
Customer ID Age
Houses Household Car Policies Insurance
Policies
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean 500123.188600 1.111600 2.672000 44.892600 0.560400 0.074800
std 289073.995447 0.415668 0.788759 8.714531 0.606651 0.373675
min 175.000000 1.000000 1.000000 20.000000 0.000000 0.000000
25% 248324.750000 1.000000 2.000000 40.000000 0.000000 0.000000
50% 501487.500000 1.000000 3.000000 44.000000 1.000000 0.000000
75% 747474.750000 1.000000 3.000000 50.000000 1.000000 0.000000
max 999399.000000 10.000000 5.000000 80.000000 7.000000 8.000000
In [9]: sns.countplot('Customer Type', data=df)

plt.show()
df['Customer Type'].value_counts()
Out[9]: Prospective 4689

Current 311
Name: Customer Type, dtype: int64
observation: Given dataset has imbalanced data, which means majority class is Prospective and minority class
is Current.
In [10]: sns.countplot(x='Customer Type',hue='Relationship Status', data=df)

plt.show()
print(df[df['Customer Type']=='Current']['Relationship Status'].value_counts
())
df[df['Customer Type']=='Prospective']['Relationship Status'].value_counts()
Married 294
Other Relationship 13
Singles 3
Living Together 1
Name: Relationship Status, dtype: int64
Out[10]: Married 4061

Other Relationship 487
Singles 90
Living Together 51
Name: Relationship Status, dtype: int64
In [11]: sns.countplot(x='Customer Type',hue='Household Type', data=df)

plt.show()
In [12]: sns.countplot(x='Customer Type',hue='Education Level', data=df)

plt.show()
In [13]: #Finding the numerical features
cols = df.columns
num_cols = df._get_numeric_data().columns
print("Numerical features are:")
print(list(num_cols))
Numerical features are:

['Customer ID', 'Number of Houses', 'Avg Size Household', 'Age', 'Number of C
ar Policies', 'Number of Life Insurance Policies']
In [14]: #categorical_features = cols - num_cols

ct = list(set(cols) - set(num_cols))
print("Categocal Features are:")
print(ct)
Categocal Features are:

['Relationship Status', 'House Type', 'Education Level', 'Household Type', 'C
ustomer Type']
3.Data Cleaning :
In [15]: #Find if there are any Nan values

df.isnull().sum()
Out[15]: Customer ID 0
Number of Houses 0
Avg Size Household 0
Age 0
Relationship Status 0
Household Type 0
Education Level 0
House Type 0
Number of Car Policies 0
Number of Life Insurance Policies 0
Customer Type 0
dtype: int64
In [17]: df['Customer Type'].value_counts()
Out[17]: Prospective 4689

Current 311
In [16]: #Customer Type has two categories, so making Current category to 1 and Prospec
tive category to 0
def customer_type(x):
if(x == 'Current'):
return 1
return 0
df['Customer Type'] = df['Customer Type'].map(customer_type)
In [17]: #Converting categorial features into numerical features
df_final = pd.get_dummies(df, columns=['Household Type', 'House Type',

'Education Level', 'Relationship Status'])
In [18]: print(df_final.shape)
df_final.head()
(5000, 18)
Out[18]:
Number
Number Number Household Ho
Customer Avg Size of Life Customer
of Age of Car Type_With Type_
ID Household Insurance Type
Houses Policies Children
Policies
0 529229 1 2 50 1 0 1 0 1
1 926757 1 3 46 0 0 0 1 0
2 475463 1 3 45 0 0 0 1 0
3 900971 1 2 40 1 0 0 0 1
4 628437 1 1 50 0 0 0 0 1
4.Splitting the data:

In [280]: #Storing independent variables into X and dependent variable into y
X = df_final.drop(['Customer Type', 'Customer ID'] , axis=1)
y = df_final['Customer Type']
In [20]: X_cols = X.columns
X_cols.tolist()
Out[20]: ['Number of Houses',

'Avg Size Household',
'Age',
'Number of Car Policies',
'Number of Life Insurance Policies',
'Household Type_With Children',
'Household Type_Without Children',
'House Type_Home Owner',
'House Type_Rented House',
'Education Level_High',
'Education Level_Low',
'Education Level_Medium',
'Relationship Status_Living Together',
'Relationship Status_Married',
'Relationship Status_Other Relationship',
'Relationship Status_Singles']
In [21]: #normalizing the data

from sklearn import preprocessing
X1 = preprocessing.normalize(X)
In [22]: print("shape of X:", X1.shape)

print("shape of Y:", y.shape)
shape of X: (5000, 16)

shape of Y: (5000,)
In [26]: from sklearn.cross_validation import train_test_split
#splitting the data into train as 70% and test as 30%

X_tr, X_test, Y_tr, Y_test = train_test_split(X1,y,test_size=0.3,random_state=
1)
In [27]: Y_test = Y_test.values
In [28]: print("shape of train data:", X_tr.shape)

print("shape of test data:", X_test.shape)
shape of train data: (3500, 16)

shape of test data: (1500, 16)
5.Building the model:
Utility functions for model evaluations
In [53]: def print_results(y_true, y_pred):
cm = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot=True,annot_kws={"size": 20}, fmt='g', vmin=0, vmax=1

500)
print("Accuracy on test data:",accuracy_score(Y_test, Y_pred) * 100)

print("Precision on test data:",precision_score(Y_test, Y_pred) * 100)
print("Recall on test data:",recall_score(Y_test, Y_pred) * 100)
print("F1_score on test data:",f1_score(Y_test, Y_pred) * 100)
plt.show()
In [54]: from sklearn.linear_model import LogisticRegression
In [55]: #fitting the model with logistic regression

log_model = LogisticRegression()
log_model.fit(X_tr, Y_tr)
Out[55]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [56]: Y_pred = log_model.predict(X_test)
print_results(Y_test, Y_pred)
Accuracy on test data: 93.0

Precision on test data: 0.0
Recall on test data: 0.0
F1_score on test data: 0.0
By comparing with the above metrics, this model is predicting only the dominating class which means
majority class because imbalanced data.
To overcome that point, we can use parameter called class weight can be made as balanced.
Use class weights to make it right

We will do cross validation to check over fitting / under fitting
In [57]: from sklearn.cross_validation import cross_val_score, cross_val_predict
#fitting the model using the class_weight parameter with balanced

log_model = LogisticRegression(class_weight='balanced')
predicted = cross_val_predict(log_model, X_tr, Y_tr, cv=5)
#finding the train accuracy

print("Train Accuracy on data:",accuracy_score(Y_tr, predicted) * 100)
Train Accuracy on data: 64.25714285714285
In [58]: #testing the model

log_model.fit(X_tr, Y_tr)
Y_pred = log_model.predict(X_test) # Predicted class labels from test features
print_results(Y_test, Y_pred)
Accuracy on test data: 64.13333333333333

Precision on test data: 11.408199643493761
Recall on test data: 60.952380952380956
F1_score on test data: 19.219219219219223
observation:
According to problem, we have to predict current customer as current and enen if prospective customer
is predicted as current customer its fine. Because that is what our final objective is.
So, we have to increase true positives(predicting Current customer as current customer) which
means decease the false negative(ie., we don't want our model to predict current customers as
prospective customers), for that reason have to choose our metric as recall.
In [59]: scores = cross_val_score(log_model, X_tr, Y_tr, cv=5,scoring='accuracy')

print(scores)
[0.67902996 0.62714286 0.61142857 0.65142857 0.64377682]
From the above list, the cross validation scores are not changing much across different cv sets. So we
can be sure that our model is clearly not overfitting
=======================================
Picking right threshold as it is imbalanced data
1. Getting the probability values instead of label, so that we can

choose threshold
In [60]: probs = log_model.predict_proba(X_test) #Predicted probabilities from test fea
tures,array of shape = [n_samples, n_classes]
In [61]: probs[0]
Out[61]: array([0.47050864, 0.52949136])
In [73]: # we wil take probability values of a customer being current customer from the
given probailities
probs = probs[:,1]
probs[0]
Out[73]: 0.5294913551020631
In [64]: def threshold_picker(start, end, step, y_true, pred_probs, print_cm = False):
# create thresholds in specified range

ths = np.arange(start , end, step)
# list that stores threshold values, precisions and recall values

precision = []
recall = []
ths_new = []
for th in ths:
th = np.round(th,3)
act_y = [] # predicted values for current threshhold
pre_y = [] # true values for current threshold
for i, pred_prob in enumerate(pred_probs):
act_y.append(y_true[i]);
pre_y.append(1 if pred_prob > th else 0);
# finding precision ad recall values with current threshold

p = precision_score(act_y, pre_y) * 100
r = recall_score(act_y, pre_y) * 100
if(p != 0 and r!= 0):
precision.append(p)
recall.append(r)
ths_new.append(th)
# print confusion matrix if needed
if print_cm:
cm = pd.DataFrame(confusion_matrix(act_y, pre_y))
print(cm)
print(f'(threshold, Precision, recall) : ({th}, {p}, {r})')
print('*'*50)
#Precision and Recall Scores of the decision threshold

plt.plot(ths_new, precision, 'b--', marker='o', label='precision')
plt.plot(ths_new, recall, 'r--', marker='o', label='recall')
plt.legend()
plt.show()
2. Checking for right threshold ( 0.3 <= threshold <= 0.8 )
In [74]: threshold_picker(0.3, 0.8, 0.01, Y_test, probs)
As we already know recall should be high and precsion should not be low, ie., if the precision is too low,
the model is trying to predict every customer as prospective, which we should avoid.
so we will pick the range where precision is starts increasig and recall is also reasonably high.(0.45 to
0.5)
3. Checking for right threshold ( 0.45 <= threshold <= 0.5 )
In [75]: threshold_picker(0.45, 0.5, 0.002, Y_test, probs)
Around 0.48, precision remains same and recall decreases suddenly, we will take that as our optimal
threshold.
Final optimal threshold : 0.48
4. Comparing results of optimal threshold(0.48) with default

threshold(0.5) with balanced class_weights
4.1 Default Threshold
In [165]: # using defalut threshold

th = 0.5
# converting probabilities into 1/0 based on threshold
Y_pred_default = [1 if prob>th else 0 for prob in probs]
# find confusion matrix

cm_default = confusion_matrix(Y_test, Y_pred_default)
4.2 Optimal Threshold
In [167]: th = 0.48
# conerting probabilities into 1/0 based on threshold
Y_pred_optimal = [1 if prob>th else 0 for prob in probs]
# find confusion matrix
cm_optimal = confusion_matrix(Y_test, Y_pred_optimal)
4.3 plotting both confusion matrices
In [181]: fig = plt.figure(figsize=plt.figaspect(.5))
ax1 = plt.subplot(121)
ax1.set_adjustable('box')
sns.heatmap(cm_default, annot=True,annot_kws={"size": 20}, fmt='g', cbar=False
, vmin=0, vmax=1500, ax=ax1)
plt.title("threshold=0.5")
ax2 = plt.subplot(122)
sns.heatmap(cm_optimal, annot=True,annot_kws={"size": 20}, fmt='g', vmin=0, vm

ax=1500, ax=ax2)
plt.title("threshold=0.48")
plt.show()
# *******************************************
print('\n\nDefault threshold')
print('*'*25)
print("Precision:",precision_score(Y_test, Y_pred_default) * 100)
print("Recall :",recall_score(Y_test, Y_pred_default) * 100)
print('\n\nOptimal threshold')
print('*'*25)
print("Precision :",precision_score(Y_test, Y_pred_optimal) * 100)
print("Recall :",recall_score(Y_test, Y_pred_optimal) * 100)
Default threshold
*************************
Precision: 11.408199643493761
Recall : 60.952380952380956
Optimal threshold
*************************
Precision : 9.83425414364641
Recall : 84.76190476190476
Questions
1. Which attributes are most predictive of a customer’s likelihood

to purchase and why?
In [234]: #Finding the weights of all features
weights = log_model.coef_
weights
Out[234]: array([[-1.22123796, 0.41864355, -0.04224982, 6.00272688, -0.31417063,

-0.33972745, -0.17507081, 1.62442449, -2.13922276, 2.034943 ,
-3.61282045, 1.06307918, -0.13138883, 0.90629245, -1.30562076,
0.01591887]])
In [330]: #Finding the column names

X_cols = X.columns
a = X_cols.tolist()
In [273]: #Making the dictionary of key should be feature names and value should be corr
esponding feature weight
d = {}
n = len(a) #finding the length of features
for i in range(n):
x = a[i]
d[a[i]] = weights[0][i]
In [279]: import operator

result = sorted(d.items(), key=operator.itemgetter(1),reverse=True)
result
Out[279]: [('Number of Car Policies', 6.002726879338931),

('Education Level_High', 2.0349430012371137),
('House Type_Home Owner', 1.624424493865725),
('Education Level_Medium', 1.063079184955882),
('Relationship Status_Married', 0.906292450000115),
('Avg Size Household', 0.4186435502184599),
('Relationship Status_Singles', 0.01591887361090185),
('Age', -0.04224982386024744),
('Relationship Status_Living Together', -0.13138882682397124),
('Household Type_Without Children', -0.17507081222574708),
('Number of Life Insurance Policies', -0.31417062839689325),
('Household Type_With Children', -0.339727452196972),
('Number of Houses', -1.2212379620488776),
('Relationship Status_Other Relationship', -1.3056207612097648),
('House Type_Rented House', -2.13922275828844),
('Education Level_Low', -3.612820450615726)]
In [369]: result[2][0]
Out[369]: 'House Type_Home Owner'
In [394]: n = len(result)
keys = []
values = []
for i in range(n):
keys.append(result[i][0])
values.append(result[i][1])
In [397]: sns.barplot(values, keys)

plt.show()
print(keys)
['Number of Car Policies', 'Education Level_High', 'House Type_Home Owner',

'Education Level_Medium', 'Relationship Status_Married', 'Avg Size Househol
d', 'Relationship Status_Singles', 'Age', 'Relationship Status_Living Togethe
r', 'Household Type_Without Children', 'Number of Life Insurance Policies',
'Household Type_With Children', 'Number of Houses', 'Relationship Status_Othe
r Relationship', 'House Type_Rented House', 'Education Level_Low']
2. Among the customers who have not purchased a health

insurance plan(ie., prospective customers), who are the 50 most
likely to purchase a plan?
In [284]: df_final.head()
Out[284]:
Number
Number Number Household Ho
Customer Avg Size of Life Customer
of Age of Car Type_With Type_
ID Household Insurance Type
Houses Policies Children
Policies
0 529229 1 2 50 1 0 1 0 1
1 926757 1 3 46 0 0 0 1 0
2 475463 1 3 45 0 0 0 1 0
3 900971 1 2 40 1 0 0 0 1
4 628437 1 1 50 0 0 0 0 1
In [289]: #Taking only the perspective customers from the whole data
df_final1 = df_final[df_final['Customer Type'] == 0]
df_final1['Customer Type'].value_counts()
Out[289]: 0 4689
In [318]: #storing the Customer Id of prospective customer into an one variable

cus_ids = df_final1['Customer ID']
#Converting pandas to list
a = cus_ids.tolist()
In [294]: #Splitting the data into independent variables and dependent variables
X = df_final1.drop(['Customer ID', 'Customer Type'], axis=1)
y = df_final1['Customer Type']
In [296]: #normalizing the data

from sklearn import preprocessing
X_pre = preprocessing.normalize(X)
In [297]: Y_pred1 = log_model.predict(X_pre)
In [298]: probs1 = log_model.predict_proba(X)
In [304]: probs1[][1]
Out[304]: 0.028084328873035802
In [320]: th = 0.48
n = len(Y_pred1)
dis = {}
for i in range(n):
if(probs1[i][1] > th):

dis[a[i]] = probs1[i][1]
In [321]: result2 = sorted(dis.items(), key=operator.itemgetter(1),reverse=True)
In [329]: cus_50 = []
for i in range(50):
cus_50.append(result2[i][0])
print(cus_50)
[565836, 349052, 983279, 849524, 67544, 942610, 100040, 139962, 274774, 33491
8, 956126, 786112, 594755, 280767, 958667, 558277, 16719, 385947, 428279, 216
740, 966356, 67752, 285831, 99666, 921421, 628626, 667416, 268739, 304136, 67
8156, 188763, 328206, 167788, 856187, 260768, 512893, 205384, 343236, 962491,
91323, 109910, 498509, 862950, 87493, 596017, 922856, 468284, 465313, 458265,
589181]
Getting top 50 customers details who will most likely to purchase a plan
In [360]: # get all the customer ids from the dataframe

total_cus_ids = df['Customer ID'].tolist()
# get the index of top_50 cusstomers.

top_cus_indices = [total_cus_ids.index(cus_id) for cus_id in cus_50]
# get top 50 customer details from the

top_50_cus_details = df.iloc[top_cus_indices]
top_50_cus_details.head()
Out[360]:
Number
Customer Avg Size Relationship Household Education House
of Age
ID Household Status Type Level Type
Houses
Without Home
4616 565836 1 3 49 Married Medium
Children Owner
Other With Rented

2338 349052 1 3 40 Medium
Relationship Children House
With Home
4852 983279 1 3 56 Married Medium
Children Owner
Without Home
3659 849524 1 2 52 Married Low
Children Owner
Without Home
2653 67544 1 2 47 Married High
Children Owner

Prospective Insurance Customers PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Prospective Insurance Customers PDF

Uploaded by

Copyright:

Available Formats

9/1/2018 Prospective Insurance Customers

Prospective Insurance Customers:

In [2]: import numpy as np

import seaborn as sns

1.Collecting the data:

In [4]: #Loading the Spread Sheet into DataFrame

In [5]: #shape of data

Out[5]: (5000, 11)

observation : This dataset is consisting of 5000 rows and 11 columns/attributes.

In [6]: #seeing how the data looks like

Other Without Rented

count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000

mean 500123.188600 1.111600 2.672000 44.892600 0.560400 0.074800

std 289073.995447 0.415668 0.788759 8.714531 0.606651 0.373675

min 175.000000 1.000000 1.000000 20.000000 0.000000 0.000000

25% 248324.750000 1.000000 2.000000 40.000000 0.000000 0.000000

50% 501487.500000 1.000000 3.000000 44.000000 1.000000 0.000000

75% 747474.750000 1.000000 3.000000 50.000000 1.000000 0.000000

max 999399.000000 10.000000 5.000000 80.000000 7.000000 8.000000

In [9]: sns.countplot('Customer Type', data=df)

Out[9]: Prospective 4689

In [10]: sns.countplot(x='Customer Type',hue='Relationship Status', data=df)

df[df['Customer Type']=='Prospective']['Relationship Status'].value_counts()

Out[10]: Married 4061

In [11]: sns.countplot(x='Customer Type',hue='Household Type', data=df)

In [12]: sns.countplot(x='Customer Type',hue='Education Level', data=df)

In [13]: #Finding the numerical features

Numerical features are:

In [14]: #categorical_features = cols - num_cols

Categocal Features are:

In [15]: #Find if there are any Nan values

In [17]: df['Customer Type'].value_counts()

Out[17]: Prospective 4689

df['Customer Type'] = df['Customer Type'].map(customer_type)

In [17]: #Converting categorial features into numerical features

df_final = pd.get_dummies(df, columns=['Household Type', 'House Type',

4.Splitting the data:

In [20]: X_cols = X.columns

Out[20]: ['Number of Houses',

In [21]: #normalizing the data

In [22]: print("shape of X:", X1.shape)

shape of X: (5000, 16)

In [26]: from sklearn.cross_validation import train_test_split

#splitting the data into train as 70% and test as 30%

In [27]: Y_test = Y_test.values

In [28]: print("shape of train data:", X_tr.shape)

shape of train data: (3500, 16)

5.Building the model:

Utility functions for model evaluations

In [53]: def print_results(y_true, y_pred):

sns.heatmap(cm, annot=True,annot_kws={"size": 20}, fmt='g', vmin=0, vmax=1

print("Accuracy on test data:",accuracy_score(Y_test, Y_pred) * 100)

In [54]: from sklearn.linear_model import LogisticRegression

In [55]: #fitting the model with logistic regression

Out[55]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

In [56]: Y_pred = log_model.predict(X_test)

Accuracy on test data: 93.0

Use class weights to make it right

In [57]: from sklearn.cross_validation import cross_val_score, cross_val_predict

#fitting the model using the class_weight parameter with balanced

predicted = cross_val_predict(log_model, X_tr, Y_tr, cv=5)

#finding the train accuracy

Train Accuracy on data: 64.25714285714285