
Project 3

November 28, 2017

0.1 Clustering and classification application project


0.2 Data Science
0.3 Authors:
0.3.1 José Ignacio González Cárdenas
0.3.2 Marysol Cantarero
0.3.3 Alejandro Preciado
In [36]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
f1_score, accuracy_score)
import scipy.optimize as opt
from sklearn import svm
from sklearn.model_selection import train_test_split

In [3]: data = pd.read_csv('HR_comma_sep.csv')

In [4]: data.shape

Out[4]: (14999, 10)

In [69]: data.head(10)

Out[69]: satisfaction_level last_evaluation number_project average_montly_hours \


0 0.38 0.53 2 157
1 0.80 0.86 5 262
2 0.11 0.88 7 272
3 0.72 0.87 5 223
4 0.37 0.52 2 159
5 0.41 0.50 2 153
6 0.10 0.77 6 247

7 0.92 0.85 5 259
8 0.89 1.00 5 224
9 0.42 0.53 2 142

time_spend_company Work_accident left promotion_last_5years sales \


0 3 0 1 0 sales
1 6 0 1 0 sales
2 4 0 1 0 sales
3 5 0 1 0 sales
4 3 0 1 0 sales
5 3 0 1 0 sales
6 4 0 1 0 sales
7 5 0 1 0 sales
8 5 0 1 0 sales
9 3 0 1 0 sales

salary
0 low
1 medium
2 medium
3 low
4 low
5 low
6 low
7 low
8 low
9 low

In [71]: data.tail()

Out[71]: satisfaction_level last_evaluation number_project \


14994 0.40 0.57 2
14995 0.37 0.48 2
14996 0.37 0.53 2
14997 0.11 0.96 6
14998 0.37 0.52 2

average_montly_hours time_spend_company Work_accident left \


14994 151 3 0 1
14995 160 3 0 1
14996 143 3 0 1
14997 280 4 0 1
14998 158 3 0 1

promotion_last_5years sales salary


14994 0 support low
14995 0 support low
14996 0 support low

14997 0 support low
14998 0 support low

0.4 Data quality assessment


In [5]: from datacleaner import datacleaner

In [6]: qual = datacleaner.quality_report(data)


qual

Out[6]: MissingValues UniqueValues MinimumValues MaximumValues \


satisfaction_level 0 92 0.09 1
last_evaluation 0 65 0.36 1
number_project 0 6 2 7
average_montly_hours 0 215 96 310
time_spend_company 0 8 2 10
Work_accident 0 2 0 1
left 0 2 0 1
promotion_last_5years 0 2 0 1
sales 0 10 IT technical
salary 0 3 high medium

DataTypes
satisfaction_level float64
last_evaluation float64
number_project int64
average_montly_hours int64
time_spend_company int64
Work_accident int64
left int64
promotion_last_5years int64
sales object
salary object

The column to predict is left: whether an employee has left the company. The categorical variables are sales and salary. The columns left, Work_accident, and promotion_last_5years hold boolean (0/1) values. The remaining variables are numeric.
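If the local datacleaner.quality_report helper is not available, a similar summary can be assembled with pandas alone; a minimal sketch (the choice of columns is ours):

qual_manual = pd.DataFrame({
    'MissingValues': data.isnull().sum(),   # count of missing entries per column
    'UniqueValues': data.nunique(),         # number of distinct values per column
    'DataTypes': data.dtypes,               # inferred dtype per column
})
qual_manual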

0.4.1 Behavior of the target variable


In [7]: data.left.describe()

Out[7]: count 14999.000000


mean 0.238083
std 0.425924
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000

max 1.000000
Name: left, dtype: float64

In [8]: data.left.hist()
plt.title('Empleados que han dejado la compañía')
plt.xlabel('0 = No , 1 = Sí')
plt.ylabel('Número de empleados')
plt.show()

In [9]: data.left.value_counts()

Out[9]: 0 11428
1 3571
Name: left, dtype: int64

In [10]: data.left.value_counts() / data.left.count()

Out[10]: 0 0.761917
1 0.238083
Name: left, dtype: float64

Only about 24% of employees (23.8%) have left the company.
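Given this imbalance (roughly 24% positives), accuracy alone can be misleading, which is why precision, recall, and F1 are also reported later. If needed, scikit-learn classifiers accept a class-weighting option; a sketch, not used in this notebook:

logreg_balanced = LogisticRegression(C=1, class_weight='balanced')   # reweights classes inversely to their frequency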

0.4.2 Separating the target variable
In [11]: empl = data.drop('left', axis=1)

In [12]: empl.columns

Out[12]: Index(['satisfaction_level', 'last_evaluation', 'number_project',


'average_montly_hours', 'time_spend_company', 'Work_accident',
'promotion_last_5years', 'sales', 'salary'],
dtype='object')

0.4.3 Transforming categorical variables


Sales

In [13]: le = preprocessing.LabelEncoder()
empl.sales = le.fit_transform(empl.sales)

In [14]: sales_dummies = pd.get_dummies(empl.sales, prefix='dummy_sales')

In [15]: empl = empl.join(sales_dummies).drop('sales', axis=1)
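As a side note, pd.get_dummies also works directly on the original string column, so the LabelEncoder step is not strictly required; a sketch of the equivalent call (the dummy columns would then carry the department names instead of integer codes):

sales_dummies = pd.get_dummies(data.sales, prefix='dummy_sales')   # e.g. dummy_sales_IT, dummy_sales_sales, ...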

Salary

In [16]: empl.replace('low',1, inplace=True)


empl.replace('medium',2, inplace=True)
empl.replace('high',3, inplace=True)
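An alternative sketch that touches only the salary column (the global replace above rewrites any matching string anywhere in the DataFrame):

salary_order = {'low': 1, 'medium': 2, 'high': 3}    # ordinal encoding: low < medium < high
empl['salary'] = empl['salary'].map(salary_order)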

In [17]: empl

Out[17]: satisfaction_level last_evaluation number_project \


0 0.38 0.53 2
1 0.80 0.86 5
2 0.11 0.88 7
3 0.72 0.87 5
4 0.37 0.52 2
5 0.41 0.50 2
6 0.10 0.77 6
7 0.92 0.85 5
8 0.89 1.00 5
9 0.42 0.53 2
10 0.45 0.54 2
11 0.11 0.81 6
12 0.84 0.92 4
13 0.41 0.55 2
14 0.36 0.56 2
15 0.38 0.54 2
16 0.45 0.47 2
17 0.78 0.99 4
18 0.45 0.51 2

19 0.76 0.89 5
20 0.11 0.83 6
21 0.38 0.55 2
22 0.09 0.95 6
23 0.46 0.57 2
24 0.40 0.53 2
25 0.89 0.92 5
26 0.82 0.87 4
27 0.40 0.49 2
28 0.41 0.46 2
29 0.38 0.50 2
... ... ... ...
14969 0.43 0.46 2
14970 0.78 0.93 4
14971 0.39 0.45 2
14972 0.11 0.97 6
14973 0.36 0.52 2
14974 0.36 0.54 2
14975 0.10 0.79 7
14976 0.40 0.47 2
14977 0.81 0.85 4
14978 0.40 0.47 2
14979 0.09 0.93 6
14980 0.76 0.89 5
14981 0.73 0.93 5
14982 0.38 0.49 2
14983 0.72 0.84 5
14984 0.40 0.56 2
14985 0.91 0.99 5
14986 0.85 0.85 4
14987 0.90 0.70 5
14988 0.46 0.55 2
14989 0.43 0.57 2
14990 0.89 0.88 5
14991 0.09 0.81 6
14992 0.40 0.48 2
14993 0.76 0.83 6
14994 0.40 0.57 2
14995 0.37 0.48 2
14996 0.37 0.53 2
14997 0.11 0.96 6
14998 0.37 0.52 2

average_montly_hours time_spend_company Work_accident \


0 157 3 0
1 262 6 0
2 272 4 0
3 223 5 0

4 159 3 0
5 153 3 0
6 247 4 0
7 259 5 0
8 224 5 0
9 142 3 0
10 135 3 0
11 305 4 0
12 234 5 0
13 148 3 0
14 137 3 0
15 143 3 0
16 160 3 0
17 255 6 0
18 160 3 1
19 262 5 0
20 282 4 0
21 147 3 0
22 304 4 0
23 139 3 0
24 158 3 0
25 242 5 0
26 239 5 0
27 135 3 0
28 128 3 0
29 132 3 0
... ... ... ...
14969 157 3 0
14970 225 5 0
14971 140 3 0
14972 310 4 0
14973 143 3 0
14974 153 3 0
14975 310 4 0
14976 136 3 0
14977 251 6 0
14978 144 3 0
14979 296 4 0
14980 238 5 0
14981 162 4 0
14982 137 3 0
14983 257 5 0
14984 148 3 0
14985 254 5 0
14986 247 6 0
14987 206 4 0
14988 145 3 0
14989 159 3 1

14990 228 5 1
14991 257 4 0
14992 155 3 0
14993 293 6 0
14994 151 3 0
14995 160 3 0
14996 143 3 0
14997 280 4 0
14998 158 3 0

promotion_last_5years salary dummy_sales_0 dummy_sales_1 \


0 0 1 0 0
1 0 2 0 0
2 0 2 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 0 0
7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
11 0 1 0 0
12 0 1 0 0
13 0 1 0 0
14 0 1 0 0
15 0 1 0 0
16 0 1 0 0
17 0 1 0 0
18 1 1 0 0
19 0 1 0 0
20 0 1 0 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
24 0 1 0 0
25 0 1 0 0
26 0 1 0 0
27 0 1 0 0
28 0 1 0 0
29 0 1 0 0
... ... ... ... ...
14969 0 2 0 0
14970 0 2 0 0
14971 0 2 0 0
14972 0 2 0 0
14973 0 2 0 0
14974 0 2 0 0

14975 0 2 0 0
14976 0 2 0 0
14977 0 2 0 0
14978 0 2 0 0
14979 0 2 0 0
14980 0 3 0 0
14981 0 1 0 0
14982 0 2 0 0
14983 0 2 0 0
14984 0 2 0 0
14985 0 2 0 0
14986 0 1 0 0
14987 0 1 0 0
14988 0 1 0 0
14989 0 1 0 0
14990 0 1 0 0
14991 0 1 0 0
14992 0 1 0 0
14993 0 1 0 0
14994 0 1 0 0
14995 0 1 0 0
14996 0 1 0 0
14997 0 1 0 0
14998 0 1 0 0

dummy_sales_2 dummy_sales_3 dummy_sales_4 dummy_sales_5 \


0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 0 0 0
13 0 0 0 0
14 0 0 0 0
15 0 0 0 0
16 0 0 0 0
17 0 0 0 0
18 0 0 0 0
19 0 0 0 0
20 0 0 0 0
21 0 0 0 0

22 0 0 0 0
23 0 0 0 0
24 0 0 0 0
25 0 0 0 0
26 0 0 0 0
27 0 0 0 0
28 1 0 0 0
29 1 0 0 0
... ... ... ... ...
14969 0 0 0 0
14970 0 0 0 0
14971 0 0 0 0
14972 1 0 0 0
14973 1 0 0 0
14974 1 0 0 0
14975 0 1 0 0
14976 0 1 0 0
14977 0 1 0 0
14978 0 1 0 0
14979 0 0 0 0
14980 0 0 0 0
14981 0 0 0 0
14982 0 0 0 0
14983 0 0 0 0
14984 0 0 0 0
14985 0 0 0 0
14986 0 0 0 0
14987 0 0 0 0
14988 0 0 0 0
14989 0 0 0 0
14990 0 0 0 0
14991 0 0 0 0
14992 0 0 0 0
14993 0 0 0 0
14994 0 0 0 0
14995 0 0 0 0
14996 0 0 0 0
14997 0 0 0 0
14998 0 0 0 0

dummy_sales_6 dummy_sales_7 dummy_sales_8 dummy_sales_9


0 0 1 0 0
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 0 0

7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
11 0 1 0 0
12 0 1 0 0
13 0 1 0 0
14 0 1 0 0
15 0 1 0 0
16 0 1 0 0
17 0 1 0 0
18 0 1 0 0
19 0 1 0 0
20 0 1 0 0
21 0 1 0 0
22 0 1 0 0
23 0 1 0 0
24 0 1 0 0
25 0 1 0 0
26 0 1 0 0
27 0 1 0 0
28 0 0 0 0
29 0 0 0 0
... ... ... ... ...
14969 0 1 0 0
14970 0 1 0 0
14971 0 1 0 0
14972 0 0 0 0
14973 0 0 0 0
14974 0 0 0 0
14975 0 0 0 0
14976 0 0 0 0
14977 0 0 0 0
14978 0 0 0 0
14979 0 0 0 1
14980 0 0 0 1
14981 0 0 0 1
14982 0 0 0 1
14983 0 0 0 1
14984 0 0 0 1
14985 0 0 0 1
14986 0 0 0 1
14987 0 0 0 1
14988 0 0 0 1
14989 0 0 0 1
14990 0 0 1 0
14991 0 0 1 0
14992 0 0 1 0

14993 0 0 1 0
14994 0 0 1 0
14995 0 0 1 0
14996 0 0 1 0
14997 0 0 1 0
14998 0 0 1 0

[14999 rows x 18 columns]

1 Feature selection
1.1 Variance criterion
In [18]: varianza = np.var(empl, axis =0)
plt.bar(np.arange(len(varianza)),varianza)
plt.title('Varianza de las variables')
plt.ylabel('varianza')
plt.show()

The variance that stands out is that of average_montly_hours.

In [19]: empl['average_montly_hours'].hist()
plt.show()

As can be observed, the average number of monthly hours worked has far more variance than the other variables, because it is measured on a much larger scale; under this criterion it will not be used to classify the data, and we exclude it from the variable comparison below.

In [20]: varianza = np.var(empl.drop('average_montly_hours', axis =1) , axis =0)


plt.bar(np.arange(len(varianza)),varianza)
plt.title('Varianza de las variables')
plt.ylabel('varianza')
plt.show()

In [21]: varianza

Out[21]: satisfaction_level 0.061813


last_evaluation 0.029297
number_project 1.519183
time_spend_company 2.131856
Work_accident 0.123698
promotion_last_5years 0.020816
salary 0.405975
dummy_sales_0 0.075113
dummy_sales_1 0.049717
dummy_sales_2 0.048522
dummy_sales_3 0.046842
dummy_sales_4 0.040239
dummy_sales_5 0.053932
dummy_sales_6 0.056521
dummy_sales_7 0.199832
dummy_sales_8 0.126525
dummy_sales_9 0.148459
dtype: float64

The only variable to remove will be average_montly_hours, since the remaining variables all lie within an acceptable range, roughly 0 to 2.
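A small sketch of applying this criterion programmatically (the cutoff of 3.0 is our assumption, chosen so that only average_montly_hours exceeds it):

var = empl.var()
to_drop = var[var > 3.0].index          # columns whose variance is far above the others
empl_reducido = empl.drop(to_drop, axis=1)

Note that scikit-learn's VarianceThreshold filters in the opposite direction (it removes low-variance features), so it does not directly fit this case.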

1.2 Correlation criterion
In [22]: plt.matshow(empl.corr())
plt.show()

There is a noticeably stronger correlation among the first five variables.
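The matshow plot is easier to interpret with tick labels and a colorbar; a small sketch:

corr = empl.corr()
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)   # column names on both axes
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()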

2 New model after feature reduction


In [23]: Y = data.left
X = empl

In [72]: grados = np.arange(1,5)

Accu = np.zeros(grados.shape)
Prec = np.zeros(grados.shape)
Reca = np.zeros(grados.shape)
F1 = np.zeros(grados.shape)
Nvar = np.zeros(grados.shape)

for ngrado in grados:

poly = PolynomialFeatures(ngrado)
Xa = poly.fit_transform(X)
logreg = LogisticRegression(C=1)
logreg.fit(Xa,Y)

Yg = logreg.predict(Xa)
Accu[ngrado-1] = accuracy_score(Y,Yg)
Prec[ngrado-1] = precision_score(Y,Yg)
Reca[ngrado-1] = recall_score(Y,Yg)
F1[ngrado-1] = f1_score(Y,Yg)
Nvar[ngrado-1] = len(logreg.coef_[0])

plt.plot(grados,Accu)
plt.plot(grados,Prec)
plt.plot(grados,Reca)
plt.plot(grados,F1)
plt.xlabel('Grado del Polinomio')
plt.legend(('Accuracy','Precision','Recall','F1'),loc='best')
plt.ylabel('%')
plt.grid()
plt.show()

From the plot we infer that the best polynomial degree is 2: beyond that, accuracy, recall, and F1 remain essentially the same, so higher degrees only add computational cost without improving the model's predictions.
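Note that these metrics are computed on the same data used for fitting, so higher degrees can look better than they generalize; a sketch of the same sweep with 5-fold cross-validation (not run in the original notebook):

from sklearn.model_selection import cross_val_score

for ngrado in grados:
    Xa = PolynomialFeatures(ngrado).fit_transform(X)
    scores = cross_val_score(LogisticRegression(C=1), Xa, Y, cv=5, scoring='f1')
    print(ngrado, scores.mean())   # mean cross-validated F1 for each polynomial degree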

In [25]: poly = PolynomialFeatures(2)


Xa = poly.fit_transform(X)
logreg = LogisticRegression(C=1)
logreg.fit(Xa,Y)
Yg = logreg.predict(Xa)

accu = accuracy_score(Y,Yg)
prec = precision_score(Y,Yg)
reca = recall_score(Y,Yg)
f1 = f1_score(Y,Yg)
Nvar = len(logreg.coef_[0])
In [26]: accu
Out[26]: 0.83125541702780181
In [27]: prec
Out[27]: 0.72413793103448276
In [28]: reca
Out[28]: 0.47045645477457293
In [29]: f1
Out[29]: 0.57036156849431341
Interpretation: the logistic regression model has low predictive power, so an SVM model will be applied next.

2.0.1 SVM
In [33]: clf = svm.SVC(kernel = 'rbf') # Gaussian (RBF) kernel
clf.fit(X,Y)
Out[33]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
In [34]: Yg = clf.predict(X)
Accuracy = accuracy_score(Y,Yg)
Precision = precision_score(Y,Yg)
Recall = recall_score(Y,Yg)
F1 = f1_score(Y,Yg)
In [35]: print(Accuracy)
print(Precision)
print(Recall)
print(F1)
0.956997133142
0.899945325314
0.921870624475
0.910776040946

This model needs to be corrected because there is overfitting: the model is being evaluated on the same data it was trained on, so the metrics above are optimistic.

2.1 Fixing the overfitting
Next, the data will be split: one part will be used to train the model, and the remaining part will be used to test it, to check whether predictions on data the model has never seen are still good.
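Since only about 24% of employees left, a stratified split keeps that proportion in both halves; a small variation of the call below (the stratify argument and the seed are our additions, not what the notebook ran):

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.50, stratify=Y, random_state=0)   # stratify preserves the 76/24 class ratio; the seed is arbitrary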

In [73]: X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.50, random_state=

In [74]: X_train.shape

Out[74]: (7499, 18)

In [62]: X_test.shape

Out[62]: (7500, 18)

In [63]: clf = svm.SVC(kernel = 'rbf') # Gaussian (RBF) kernel


clf.fit(X_train,y_train)

Out[63]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [64]: Yg = clf.predict(X_test)
Accuracy = accuracy_score(y_test,Yg)
Precision = precision_score(y_test,Yg)
Recall = recall_score(y_test,Yg)
F1 = f1_score(y_test,Yg)

In [65]: print(Accuracy)
print(Precision)
print(Recall)
print(F1)

0.947333333333
0.874045801527
0.907187323147
0.890308247709

When the new SVM model is applied, even on data it has never seen, the metrics drop only slightly. From this we conclude that our model is good enough.
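A natural next step (not done in the original notebook) would be to tune the SVM hyperparameters with a cross-validated grid search over the training half; a minimal sketch, with the candidate values as assumptions:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['auto', 0.1, 1]}   # candidate values are assumptions
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3, scoring='f1')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # best hyperparameters and their cross-validated F1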

