The Boston Housing Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the
StatLib archive (http://lib.stat.cmu.edu/datasets/boston). The analytics task for this dataset is to predict the median value of owner-occupied homes.
The dataset is available through the sklearn library, and we will use that copy for our analytics problem.
In [1]: import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
%matplotlib inline
We will load the dataset and print the description that ships with the Boston dataset.
In [2]: boston = load_boston()
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):

- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
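A note for readers on a recent scikit-learn: `load_boston` was deprecated in version 1.0 and removed in 1.2, so the loading cell above fails there. The following is a minimal sketch of one workaround, assuming the StatLib file keeps its layout of 22 header lines followed by two physical lines per record (11 values, then 3). The helper names `reshape_statlib_rows` and `load_boston_from_statlib` are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd

FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def reshape_statlib_rows(raw):
    """Merge each pair of physical rows into one record (hypothetical helper).

    `raw` is a 2-D array with 11 columns: even rows hold the first 11
    features, odd rows hold the last 2 features and the target (rest NaN).
    """
    raw = np.asarray(raw, dtype=float)
    data = np.hstack([raw[::2, :], raw[1::2, :2]])  # 11 + 2 = 13 features
    target = raw[1::2, 2]                           # median home value
    return data, target


def load_boston_from_statlib(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download and parse the raw file (needs network access; assumed layout)."""
    raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    data, target = reshape_statlib_rows(raw_df.values)
    frame = pd.DataFrame(data, columns=FEATURE_NAMES)
    frame["PRICE"] = target
    return frame


# Tiny synthetic check of the reshaping logic (made-up numbers, not Boston data)
demo = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    [12, 13, 99] + [np.nan] * 8,
])
demo_data, demo_target = reshape_statlib_rows(demo)
print(demo_data.shape, demo_target)
```

With network access, `df = load_boston_from_statlib()` should give a frame with the same 14 columns that the notebook builds in the next cell.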
In [3]: df = pd.DataFrame(columns=boston.feature_names,data=boston.data)
df['PRICE'] = boston.target
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
PRICE 506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB
We can see from the above information that the data contains no missing values and that all features are stored as numeric (float64) values.
In [51]: plt.style.use('seaborn-whitegrid')
plot=pd.plotting.scatter_matrix(df,figsize=(21,20),alpha=0.8,grid=True)
We can see some relationships in the scatter matrix above (read ~ as "varies with"):
Price ~ 1/LSTAT
Price ~ RM
NOX ~ 1/DIS
RM ~ LSTAT
AGE ~ DIS
We can also see possible outliers in the INDUS~RAD, INDUS~TAX, NOX~TAX and NOX~RAD plots. Next we will visualize the data to check the correlation
between variables and the presence of outliers.
In [32]: plt.figure(figsize=(18,6))
# Left half of the figure: heatmap of absolute correlations
plt.subplot(121)
sns.heatmap(df.corr().abs(),cmap='Blues')
# Right half of the figure: 2x2 grid of boxplots
plt.subplot(243)
df[['NOX']].boxplot()
plt.subplot(244)
df[['TAX']].boxplot()
plt.subplot(247)
df[['RAD']].boxplot()
plt.subplot(248)
df[['INDUS']].boxplot()
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa1c2cb1908>
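A boxplot with default settings draws a point as a flier when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The same check can be done numerically instead of by eye. The following is a minimal sketch on made-up numbers; the helper name `count_iqr_outliers` is hypothetical.

```python
import pandas as pd


def count_iqr_outliers(series, whis=1.5):
    """Count values beyond whis * IQR from the quartiles (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((series < lower) | (series > upper)).sum())


# Made-up columns: one well-behaved, one with a single extreme value
demo = pd.DataFrame({
    "clean": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "with_outlier": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
})
outlier_counts = demo.apply(count_iqr_outliers)
print(outlier_counts)
```

On the notebook's data, `df[['NOX','TAX','RAD','INDUS']].apply(count_iqr_outliers)` would give the count per column.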
The heatmap above shows a strong correlation between RAD and TAX. The boxplots, however, do not confirm the presence of outliers, since no points fall
beyond the whiskers (1.5 × IQR from the quartiles). Because the scatter plots showed inverse relationships, we create two new variables in the dataset;
these should be useful when we fit a linear regression.
In [33]: df1 = df.copy()
df1['I_LSTAT'] = 1/df1['LSTAT']
df1['I_DIS'] = 1/df1['DIS']
x = df1.drop('PRICE',axis=1)
y = df1['PRICE']
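To illustrate why a reciprocal feature can help a linear model, the sketch below generates made-up data in which the target really is proportional to 1/x plus noise. It then compares the linear correlation of the target with x and with 1/x. The numbers are synthetic and say nothing about the Boston data itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on 1/x, not on x directly
x_demo = rng.uniform(1.0, 30.0, size=500)
y_demo = 40.0 / x_demo + rng.normal(0.0, 0.5, size=500)

# Pearson correlation of y with the raw and the reciprocal feature
corr_raw = np.corrcoef(x_demo, y_demo)[0, 1]
corr_inv = np.corrcoef(1.0 / x_demo, y_demo)[0, 1]

print(f"corr(x, y)   = {corr_raw:.3f}")
print(f"corr(1/x, y) = {corr_inv:.3f}")
```

The reciprocal feature is the one a straight line can follow. That is the motivation for adding I_LSTAT and I_DIS next to the original columns.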
We will now fit a linear regression model to the dataset and see how well it predicts house prices.
In [113]: from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.model_selection import cross_val_score
lm = LinearRegression()
# With no scoring argument, cross_val_score reports R^2 for regressors
scores = cross_val_score(lm,x,y, cv=7)
scores.mean()
Out[113]: 0.5117435579867984
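Two defaults are worth knowing when reading this score. With an integer `cv`, `cross_val_score` splits a regression dataset into consecutive, unshuffled folds, and with no `scoring` argument it reports R². The following is a minimal sketch that makes both choices explicit, using synthetic data from `make_regression` rather than the Boston data. Whether fold order affects the scores in this notebook has not been checked here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (stand-in for x, y in the notebook)
X_demo, y_demo = make_regression(n_samples=210, n_features=5,
                                 noise=10.0, random_state=0)

# Explicit 7-fold split with shuffling and a fixed seed for repeatability
cv = KFold(n_splits=7, shuffle=True, random_state=0)

# Explicit scoring: R^2, the same metric used by default for regressors
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                            cv=cv, scoring="r2")
print(r2_scores.mean(), r2_scores.std())
```

Reporting the standard deviation next to the mean shows how much the score varies from fold to fold.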
In [114]: from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=8)
scores = cross_val_score(rf,x,y, cv=7)
scores.mean()
Out[114]: 0.6081295296332616
We see that the random forest regressor, an ensemble of decision-tree regressors, raises the mean cross-validated R² from about 0.51 to about 0.61.
The plot below shows the importance of the individual features.
In [115]: rf.fit(x,y)
feat_imp = pd.Series(rf.feature_importances_, index=x.columns)
feat_imp.nlargest(18).plot(kind='barh')
Out[115]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa1bf10b908>
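The `feature_importances_` attribute used above is the impurity-based importance computed on the training data. As a cross-check, one could also compute permutation importance on held-out data. This is a swapped-in technique, not what the notebook does. The following is a minimal sketch on synthetic data in which only the first two columns carry signal.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the informative features are columns 0 and 1
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2,
                                 noise=5.0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=8, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on the held-out set and measure the drop in R^2
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # most important first
print(ranking, result.importances_mean.round(3))
```

If both measures rank the same features at the bottom, dropping those features is easier to justify.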

We will remove the less significant features from the model and see whether that improves the score.
In [117]: xx= x.drop(['CHAS','ZN','RAD','INDUS','B'],axis=1)
scores = cross_val_score(rf,xx,y, cv=7)
scores.mean()
Out[117]: 0.6074480065154999
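Dropping the less significant features does not improve the score (0.607 versus 0.608), so we keep the model as it is. We have therefore built a model
that predicts house prices with a mean cross-validated R² of about 0.61. Note that cross_val_score reports R² for regressors, not an accuracy.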
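Since the score reported throughout is R², it may help to see how that number is computed. The following is a minimal sketch with made-up targets and predictions (not model output from this notebook), checked against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0, 52.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(f"{r2_manual:.2f}")                  # 0.97
print(f"{r2_score(y_true, y_pred):.2f}")   # 0.97
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for models that do worse than the mean.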
