
The Boston Housing Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the
StatLib archive (http://lib.stat.cmu.edu/datasets/boston). The analytics task for this dataset is to predict the median value of owner-occupied homes.
The dataset is bundled with the sklearn library, and we will load it from there for our analytics problem.

In [1]: import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
# load_boston was removed in scikit-learn 1.2; this notebook needs an earlier version
from sklearn.datasets import load_boston

%matplotlib inline

We will load the dataset and print the description that ships with it.

In [2]: boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning,
236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [3]: df = pd.DataFrame(columns=boston.feature_names,data=boston.data)
df['PRICE'] = boston.target

In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
PRICE 506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB

The summary above shows that the data contains no missing values (every column has 506 non-null entries) and that all features are numeric (float64).
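The same two checks can also be made programmatically rather than read off the printed summary. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not part of the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "RM": [6.5, 6.4, np.nan, 7.0],       # one deliberately missing value
    "LSTAT": [4.98, 9.14, 4.03, 2.94],
})

# Number of missing (NaN) entries in each column
missing_per_column = toy.isna().sum()

# True only if every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(missing_per_column)
print("all numeric:", all_numeric)
```

Applied to the notebook's own frame, the equivalent call would be df.isna().sum().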

In [51]: plt.style.use('seaborn-whitegrid')
plot=pd.plotting.scatter_matrix(df,figsize=(21,20),alpha=0.8,grid=True)

The scatter matrix above suggests several relationships:

Price ~ 1/LSTAT
Price ~ RM
NOX ~ 1/DIS
RM ~ LSTAT
AGE ~ DIS

There also appear to be outliers in the INDUS~RAD, INDUS~TAX, NOX~TAX and NOX~RAD plots. Next we visualize the data to check the correlation between
variables and the presence of outliers.
In [32]: fig = plt.figure(figsize=(18,6))
# Left half of the figure: absolute pairwise correlations
plt.subplot(121)
sns.heatmap(df.corr().abs(),cmap='Blues')
# Right half of the figure: 2x2 grid of boxplots
plt.subplot(243)
df[['NOX']].boxplot()
plt.subplot(244)
df[['TAX']].boxplot()
plt.subplot(247)
df[['RAD']].boxplot()
plt.subplot(248)
df[['INDUS']].boxplot()

Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa1c2cb1908>

The heatmap above shows a strong correlation between RAD and TAX. The boxplots, however, do not confirm the presence of outliers: no points lie beyond
the whiskers, which extend up to 1.5 times the IQR from the quartiles. Because the scatter plots showed inverse relationships, we create two new
variables, the reciprocals of LSTAT and DIS, which should be useful when we fit a linear regression.
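The whisker rule behind a boxplot can also be checked numerically. The following is a minimal sketch, assuming matplotlib's default whisker length of 1.5 times the IQR; the helper name `iqr_outliers` and the sample values are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample: eight ordinary values and one extreme value
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]
flagged = iqr_outliers(sample)
print(flagged)
```

A column for which this helper returns an empty array would show no dots in its boxplot, which is the situation described for NOX, TAX, RAD and INDUS.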

In [33]: df1 = df.copy()
df1['I_LSTAT'] = 1/df1['LSTAT']
df1['I_DIS'] = 1/df1['DIS']
x = df1.drop('PRICE',axis=1)
y = df1['PRICE']
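To illustrate why a reciprocal feature can help a linear model, here is a small sketch on synthetic data (not the housing data) in which the target falls off as 1/x. The names `feature` and `target` and the constants are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inverse relationship: target is about 30 / feature, plus noise
feature = rng.uniform(1.0, 10.0, size=200)
target = 30.0 / feature + rng.normal(0.0, 0.5, size=200)

# Pearson correlation with the raw feature and with its reciprocal
r_raw = np.corrcoef(feature, target)[0, 1]
r_inv = np.corrcoef(1.0 / feature, target)[0, 1]

print("raw feature:       ", r_raw)
print("reciprocal feature:", r_inv)
```

The reciprocal correlates more strongly, in absolute terms, with the target than the raw feature does, which is what a straight-line fit needs.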

We will now fit a linear regression model to the dataset and see how well it predicts house prices.

In [113]: from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.model_selection import cross_val_score

lm = LinearRegression()
scores = cross_val_score(lm,x,y, cv=7)
scores.mean()

Out[113]: 0.5117435579867984
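For a regressor, cross_val_score uses the estimator's own score method by default, which is the coefficient of determination R². An integer cv splits the rows into consecutive, unshuffled folds, which can matter if the rows are stored in a meaningful order. The sketch below makes both choices explicit on synthetic data; `X_demo`, `y_demo` and the random seeds are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(
    n_samples=210, n_features=5, noise=10.0, random_state=0
)

# Seven shuffled folds with a fixed seed, scored explicitly with R^2
folds = KFold(n_splits=7, shuffle=True, random_state=0)
r2_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2"
)

print("per-fold R^2:", r2_scores)
print("mean:", r2_scores.mean(), "std:", r2_scores.std())
```

Reporting the spread across folds alongside the mean shows how stable the estimate is.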

In [114]: from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=8)
scores = cross_val_score(rf,x,y, cv=7)
scores.mean()

Out[114]: 0.6081295296332616

The random forest regressor, an ensemble of decision trees, raises the mean cross-validated score from about 0.51 to about 0.61. The plot below shows
the importance of the individual features.

In [115]: rf.fit(x,y)
feat_imp = pd.Series(rf.feature_importances_, index=x.columns)
feat_imp.nlargest(18).plot(kind='barh')

Out[115]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa1bf10b908>
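The feature_importances_ attribute of a fitted forest holds impurity-based importances that are normalized to sum to 1. The following sketch shows this on synthetic data with one strong, one weak and one irrelevant feature; the column names and coefficients are made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features: only 'strong' and 'weak' influence the target
X_demo = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(0.0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=0)
forest.fit(X_demo, y_demo)

# Importances indexed by column name, largest first
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Since the values are relative shares, a small importance means a feature is used little by this particular forest, not necessarily that it is unrelated to the target.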


We will remove the less important features from the model and see whether this improves the score.

In [117]: xx = x.drop(['CHAS','ZN','RAD','INDUS','B'],axis=1)
scores = cross_val_score(rf,xx,y, cv=7)
scores.mean()

Out[117]: 0.6074480065154999
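A random forest built without a fixed random_state gives slightly different scores on every run, so small differences between two feature sets can be run-to-run noise. The sketch below, on synthetic data, fixes the seeds so that the comparison is repeatable; `mean_cv_r2`, the seeds and the choice of dropped columns are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=140, n_features=6, n_informative=3, noise=5.0, random_state=1
)

def mean_cv_r2(features):
    """Mean 7-fold R^2 of a seeded forest on the given feature matrix."""
    model = RandomForestRegressor(n_estimators=50, min_samples_leaf=8, random_state=42)
    folds = KFold(n_splits=7, shuffle=True, random_state=42)
    return cross_val_score(model, features, y_demo, cv=folds).mean()

full_score = mean_cv_r2(X_demo)
repeat_score = mean_cv_r2(X_demo)         # same seeds, so same result
reduced_score = mean_cv_r2(X_demo[:, :4])  # arbitrary subset of columns

print("full:", full_score, "repeat:", repeat_score, "reduced:", reduced_score)
```

With the seeds fixed, any difference between the full and reduced scores comes from the features rather than from the randomness of the forest.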

Dropping the less important features leaves the score essentially unchanged (0.607 versus 0.608), so we keep the model as it is. The resulting model
reaches a mean cross-validated R² of about 0.61; that is, it explains roughly 60% of the variance in house prices.
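The score that cross_val_score reports for a regressor is, by default, the coefficient of determination R² = 1 - SS_res / SS_tot, not a classification accuracy. The sketch below computes it by hand on made-up values and compares it with sklearn's r2_score; it also computes the root mean squared error, which is expressed in the units of the target. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted prices (in $1000's)
y_true = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_pred = np.array([22.0, 24.0, 27.0, 36.0, 38.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # typical error size

print("manual R^2: ", r2_manual)
print("sklearn R^2:", r2_score(y_true, y_pred))
print("RMSE:       ", rmse)
```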
