import matplotlib
matplotlib.use('nbagg')
import matplotlib.pyplot as plt
http://localhost:8888/nbconvert/html/Documents/MedavieCase/Prospective%20Insurance%20Customers.ipynb?download=false 1/26
9/1/2018 Prospective Insurance Customers
Out[6]:
   Customer ID  Number of Houses  Avg Size Household  Age  Relationship Status  Household Type    Education Level  House Type    Nu... o... Pol...
0  529229       1                 2                   50   Married              Without Children  Low              Home Owner    1
1  926757       1                 3                   46   Married              With Children     Low              Rented House  0
2  475463       1                 3                   45   Married              With Children     Low              Rented House  0
3  900971       1                 2                   40   Married              Without Children  Low              Rented House  1
2. Analyzing the data:
In [7]: #Giving the information about the attributes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
Customer ID 5000 non-null int64
Number of Houses 5000 non-null int64
Avg Size Household 5000 non-null int64
Age 5000 non-null int64
Relationship Status 5000 non-null object
Household Type 5000 non-null object
Education Level 5000 non-null object
House Type 5000 non-null object
Number of Car Policies 5000 non-null int64
Number of Life Insurance Policies 5000 non-null int64
Customer Type 5000 non-null object
dtypes: int64(6), object(5)
memory usage: 429.8+ KB
In [8]: df.describe()
Out[8]:
Customer ID  Number of Houses  Avg Size Household  Age  Number of Car Policies  Number of Life Insurance Policies
Observation: The dataset is imbalanced. Prospective is the majority class and Current is the minority class.
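A quick way to quantify the imbalance is to count the labels. The sketch below is a minimal illustration in plain Python. The list `labels` is a hypothetical stand-in for the `Customer Type` column: it uses the 5000 rows from df.info() and the 4689 Prospective rows counted later in the notebook, and assumes the remaining 311 rows are Current.

```python
from collections import Counter

# Hypothetical stand-in for df['Customer Type']: 5000 rows in total, of which
# 4689 are Prospective (count shown in the notebook) and the rest are assumed Current.
labels = ["Prospective"] * 4689 + ["Current"] * (5000 - 4689)

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, largest first.
shares = {name: n / total for name, n in counts.most_common()}
for name, share in shares.items():
    print(f"{name:<12} {counts[name]:>5}  {share:.1%}")

# Ratio of majority to minority class size.
imbalance_ratio = counts["Prospective"] / counts["Current"]
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
```

With a ratio this skewed, plain accuracy is a weak guide, because a model that always answers Prospective would already look accurate.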
Married 294
Other Relationship 13
Singles 3
Living Together 1
Name: Relationship Status, dtype: int64
cols = df.columns
num_cols = df.select_dtypes(include='number').columns  # public API instead of the private _get_numeric_data()
print("Numerical features are:")
print(list(num_cols))
3. Data Cleaning:
Out[15]: Customer ID 0
Number of Houses 0
Avg Size Household 0
Age 0
Relationship Status 0
Household Type 0
Education Level 0
House Type 0
Number of Car Policies 0
Number of Life Insurance Policies 0
Customer Type 0
dtype: int64
In [16]: # Customer Type has two categories: map Current to 1 and Prospective to 0
def customer_type(x):
    if x == 'Current':
        return 1
    return 0
In [18]: print(df_final.shape)
df_final.head()
(5000, 18)
Out[18]:
   Customer ID  Number of Houses  Avg Size Household  Age  Number of Car Policies  Number of Life Insurance Policies  Customer Type  Household Type_With Children  Ho... Type_...
0  529229       1                 2                   50   1                       0                                  1              0                             1
1  926757       1                 3                   46   0                       0                                  0              1                             0
2  475463       1                 3                   45   0                       0                                  0              1                             0
3  900971       1                 2                   40   1                       0                                  0              0                             1
4  628437       1                 1                   50   0                       0                                  0              0                             1
X_cols.tolist()
cm = confusion_matrix(Y_test, Y_pred)
print_results(Y_test, Y_pred)
Judging by the metrics above, this model predicts only the dominant (majority) class, because the data is imbalanced.
To overcome this, the class_weight parameter can be set to 'balanced'.
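scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * count of that class). The sketch below reproduces that formula in plain Python so the effect is visible. The helper name `balanced_class_weights` is hypothetical, and the counts describe the full 5000-row dataset (311 Current is derived as 5000 - 4689), so the weights actually applied to a training split would differ slightly.

```python
def balanced_class_weights(class_counts):
    """Return {label: weight} using n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Counts for the full dataset (0 = Prospective, 1 = Current).
# The 311 figure is derived as 5000 - 4689 and is an assumption of this sketch.
class_weights = balanced_class_weights({0: 4689, 1: 311})
for label, weight in class_weights.items():
    print(label, round(weight, 3))

# How much more a Current row counts than a Prospective row during fitting.
print(round(class_weights[1] / class_weights[0], 1))
```

Under these assumptions an error on a Current customer is penalised roughly 15 times as heavily as an error on a Prospective one, which is what pushes the model away from always predicting the majority class.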
print_results(Y_test, Y_pred)
Observation:
According to the problem, we have to predict current customers as current. Even if a prospective customer
is predicted as a current customer, that is fine, because finding such customers is our final objective.
So we have to increase the true positives (predicting a current customer as current), which
means decreasing the false negatives (i.e., we don't want our model to predict current customers as
prospective customers). For that reason we choose recall as our metric.
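To make the link between false negatives and recall concrete, the sketch below reads recall off a 2x2 confusion matrix. It follows the scikit-learn layout (rows are true labels, columns are predicted labels). The function name and the counts are hypothetical and are not taken from the notebook.

```python
def recall_from_confusion(cm):
    """Recall for the positive class (label 1) of a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are true labels, columns are predictions,
    so cm[1][0] counts false negatives and cm[1][1] counts true positives.
    """
    false_negatives = cm[1][0]
    true_positives = cm[1][1]
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts for illustration only.
cm_example = [[900, 100],   # true Prospective: 900 kept, 100 flagged as Current
              [20, 80]]     # true Current: 20 missed, 80 found
print(recall_from_confusion(cm_example))
```

The 100 prospective customers flagged as current do not enter the formula at all, which is why recall matches an objective that tolerates them.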
From the above list, the cross-validation scores do not change much across the different CV folds, which
suggests that the model is not overfitting.
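One way to make "not changing much" measurable is to report the mean and standard deviation of the fold scores. The sketch below uses hypothetical scores, since the notebook's list is not reproduced here. A small spread shows stability across folds; comparing training and validation scores would be a more direct overfitting check.

```python
from statistics import mean, stdev

# Hypothetical recall scores from 5 cross-validation folds (not the notebook's values).
cv_scores = [0.62, 0.60, 0.64, 0.61, 0.63]

avg = mean(cv_scores)
spread = stdev(cv_scores)
# Spread relative to the mean score, a scale-free measure of stability.
relative_spread = spread / avg

print(f"mean={avg:.3f} std={spread:.3f} relative spread={relative_spread:.1%}")
```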
=======================================
In [61]: probs[0]
In [73]: # take the probability of each customer being a current customer (column 1) from the predicted probabilities
probs = probs[:,1]
probs[0]
Out[73]: 0.5294913551020631
for th in ths:
    th = np.round(th, 3)
    act_y = []  # true values for the current threshold
    pre_y = []  # predicted values for the current threshold
    for i, pred_prob in enumerate(pred_probs):
        act_y.append(y_true[i])
        pre_y.append(1 if pred_prob > th else 0)
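The rest of this cell is not visible in the export. A minimal sketch of how such a sweep could be completed is given below: for each threshold it converts probabilities to labels and records precision and recall. The function name `sweep_thresholds` and the toy data are assumptions, and the metrics are computed by hand rather than with scikit-learn so the block stays self-contained.

```python
def sweep_thresholds(y_true, pred_probs, thresholds):
    """Return a list of (threshold, precision, recall) tuples."""
    rows = []
    for th in thresholds:
        preds = [1 if p > th else 0 for p in pred_probs]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
        # Guard against division by zero when nothing is predicted positive.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((round(th, 3), precision, recall))
    return rows

# Toy data for illustration, not from the notebook.
y_true = [1, 0, 1, 0, 0, 1]
pred_probs = [0.9, 0.6, 0.55, 0.4, 0.2, 0.3]
for th, precision, recall in sweep_thresholds(y_true, pred_probs, [0.25, 0.5, 0.75]):
    print(th, round(precision, 2), round(recall, 2))
```

On the toy data, raising the threshold trades recall for precision, which is the pattern the precision and recall curves are meant to expose.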
As we already know, recall should be high and precision should not be too low: if the precision is too low,
the model is predicting almost every customer as current, which we should avoid.
So we will pick the range where precision starts increasing and recall is still reasonably high (0.45 to
0.5).
Around 0.48, precision stays about the same while recall drops suddenly, so we take 0.48 as our optimal
threshold.
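The threshold here is read off the plot by eye. One possible way to automate that reading, shown as a sketch only, is to look for the largest fall in recall between consecutive thresholds and keep the threshold just before it. The function name and the curve values below are hypothetical and chosen for illustration.

```python
def largest_recall_drop(curve):
    """curve: list of (threshold, precision, recall), sorted by threshold.

    Returns (threshold just before the biggest fall in recall, size of the fall).
    """
    best_th, best_drop = None, 0.0
    for (th_a, _, rec_a), (_, _, rec_b) in zip(curve, curve[1:]):
        drop = rec_a - rec_b
        if drop > best_drop:
            best_th, best_drop = th_a, drop
    return best_th, best_drop

# Hypothetical curve values for illustration only.
curve = [
    (0.45, 0.095, 0.88),
    (0.46, 0.097, 0.87),
    (0.47, 0.098, 0.86),
    (0.48, 0.098, 0.85),
    (0.49, 0.099, 0.70),
    (0.50, 0.110, 0.62),
]
th, drop = largest_recall_drop(curve)
print(th, round(drop, 2))
```

A rule like this is sensitive to noise in the curve, so it is best treated as a cross-check on the visual choice rather than a replacement for it.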
In [167]: th = 0.48
# converting probabilities into 1/0 based on threshold
Y_pred_optimal = [1 if prob>th else 0 for prob in probs]
# find confusion matrix
cm_optimal = confusion_matrix(Y_test, Y_pred_optimal)
ax1 = plt.subplot(121)
ax1.set_adjustable('box')
sns.heatmap(cm_default, annot=True, annot_kws={"size": 20}, fmt='g', cbar=False,
            vmin=0, vmax=1500, ax=ax1)
plt.title("threshold=0.5")
ax2 = plt.subplot(122)
# *******************************************
print('\n\nDefault threshold')
print('*'*25)
print("Precision:",precision_score(Y_test, Y_pred_default) * 100)
print("Recall :",recall_score(Y_test, Y_pred_default) * 100)
print('\n\nOptimal threshold')
print('*'*25)
print("Precision :",precision_score(Y_test, Y_pred_optimal) * 100)
print("Recall :",recall_score(Y_test, Y_pred_optimal) * 100)
Default threshold
*************************
Precision: 11.408199643493761
Recall : 60.952380952380956
Optimal threshold
*************************
Precision : 9.83425414364641
Recall : 84.76190476190476
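The printed percentages can be turned back into counts to see what the lower threshold costs. The counts below are inferred from the ratios and are an assumption of this sketch: the confusion matrices themselves are not visible in this export. The numbers fit a test set with 105 Current customers, with 64 found among 561 flagged at 0.5 and 89 found among 905 flagged at 0.48.

```python
# Counts inferred from the printed percentages (an assumption of this sketch):
# 105 Current customers in the test set; (true positives, total flagged) per threshold.
actual_current = 105
scenarios = {"default (0.50)": (64, 561), "optimal (0.48)": (89, 905)}

for name, (true_pos, flagged) in scenarios.items():
    precision = true_pos / flagged
    recall = true_pos / actual_current
    print(f"{name}: precision={precision:.2%} recall={recall:.2%}")

# Cost of the lower threshold: extra customers flagged per extra Current customer found.
extra_found = 89 - 64
extra_flagged = 905 - 561
per_hit = extra_flagged / extra_found
print(f"{extra_flagged} more flagged for {extra_found} more found "
      f"({per_hit:.1f} flagged per additional hit)")
```

Under these inferred counts, the gain of about 24 points of recall comes with a precision loss of under 2 points, which matches the recall-first objective stated earlier.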
Questions
weights = log_model.coef_
weights
a = X_cols.tolist()
In [273]: # Build a dictionary mapping each feature name to its corresponding weight
d = {}
n = len(a)  # number of features
for i in range(n):
    d[a[i]] = weights[0][i]
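The same mapping can be built in one step with zip and then ranked. The sketch below is an illustration under stated assumptions: `feature_names` and `coefficients` are hypothetical stand-ins for `X_cols.tolist()` and `log_model.coef_[0]`, and the coefficient values are made up.

```python
# Hypothetical feature names and coefficients standing in for
# X_cols.tolist() and log_model.coef_[0]; the values are made up.
feature_names = ["Age", "Number of Houses", "Number of Car Policies"]
coefficients = [0.02, -0.35, 0.80]

# zip pairs each name with its weight, replacing the index-based loop.
feature_weights = dict(zip(feature_names, coefficients))

# Rank by absolute weight so strong negative effects are not buried at the bottom.
ranked = sorted(feature_weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:<25} {weight:+.2f}")
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so a ranking like this is more trustworthy after standardising the inputs.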
In [369]: result[2][0]
In [394]: n = len(result)
keys = []
values = []
for i in range(n):
    keys.append(result[i][0])
    values.append(result[i][1])
print(keys)
In [284]: df_final.head()
Out[284]:
   Customer ID  Number of Houses  Avg Size Household  Age  Number of Car Policies  Number of Life Insurance Policies  Customer Type  Household Type_With Children  Ho... Type_...
0  529229       1                 2                   50   1                       0                                  1              0                             1
1  926757       1                 3                   46   0                       0                                  0              1                             0
2  475463       1                 3                   45   0                       0                                  0              1                             0
3  900971       1                 2                   40   1                       0                                  0              0                             1
4  628437       1                 1                   50   0                       0                                  0              0                             1
In [289]: # Taking only the prospective customers from the whole data
df_final1 = df_final[df_final['Customer Type'] == 0]
df_final1['Customer Type'].value_counts()
Out[289]: 0 4689
Name: Customer Type, dtype: int64
In [294]: #Splitting the data into independent variables and dependent variables
X = df_final1.drop(['Customer ID', 'Customer Type'], axis=1)
y = df_final1['Customer Type']
In [304]: probs1[][1]
Out[304]: 0.028084328873035802
In [320]: th = 0.48
n = len(Y_pred1)
dis = {}
for i in range(n):
In [329]: cus_50 = []
for i in range(50):
    cus_50.append(result2[i][0])
print(cus_50)
[565836, 349052, 983279, 849524, 67544, 942610, 100040, 139962, 274774, 334918,
956126, 786112, 594755, 280767, 958667, 558277, 16719, 385947, 428279, 216740,
966356, 67752, 285831, 99666, 921421, 628626, 667416, 268739, 304136, 678156,
188763, 328206, 167788, 856187, 260768, 512893, 205384, 343236, 962491, 91323,
109910, 498509, 862950, 87493, 596017, 922856, 468284, 465313, 458265, 589181]
Getting the details of the top 50 customers who are most likely to purchase a plan
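The ranking step behind this list is only partly visible in the export. A minimal sketch of one way to select the top N customers by predicted probability is shown below. The function name, the customer IDs, and the probabilities are all hypothetical.

```python
import heapq

def top_n_customers(customer_ids, probabilities, n):
    """Return the n (customer_id, probability) pairs with the highest probability."""
    return heapq.nlargest(n, zip(customer_ids, probabilities), key=lambda pair: pair[1])

# Hypothetical IDs and probabilities for illustration only.
ids = [1001, 1002, 1003, 1004, 1005]
scores = [0.41, 0.77, 0.63, 0.12, 0.58]

for cust_id, prob in top_n_customers(ids, scores, 3):
    print(cust_id, prob)
```

heapq.nlargest avoids sorting the full list when only a small N is needed. With about 4700 prospective customers, a plain sort would work just as well.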
top_50_cus_details.head()
Out[360]:
      Customer ID  Number of Houses  Avg Size Household  Age  Relationship Status  Household Type    Education Level  House Type
4616  565836       1                 3                   49   Married              Without Children  Medium           Home Owner
4852  983279       1                 3                   56   Married              With Children     Medium           Home Owner
3659  849524       1                 2                   52   Married              Without Children  Low              Home Owner
2653  67544        1                 2                   47   Married              Without Children  High             Home Owner