This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib
archive (http://lib.stat.cmu.edu/datasets/boston). The analytics task for this dataset is to predict the median value of a home. The dataset ships with
older versions of the sklearn library (the load_boston loader was removed in scikit-learn 1.2), and we will use that copy for our analytics problem.
%matplotlib inline
We will load the dataset and read the description that ships with the Boston dataset.
.. _boston_dataset:
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
In [3]: df = pd.DataFrame(columns=boston.feature_names,data=boston.data)
df['PRICE'] = boston.target
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
PRICE 506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB
The output above shows that the data contains no missing values (every column has 506 non-null entries) and that all features are numeric (float64).
In [51]: plt.style.use('seaborn-whitegrid')
plot=pd.plotting.scatter_matrix(df,figsize=(21,20),alpha=0.8,grid=True)
The scatter matrix suggests the following relationships:
Price ~ 1/LSTAT
Price ~ RM
NOX ~ 1/DIS
RM ~ LSTAT
AGE ~ DIS
We can also see possible outliers in the INDUS~RAD, INDUS~TAX, NOX~TAX and NOX~RAD plots. Next, we visualize the data to check the correlation between
variables and the presence of outliers.
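An inverse relationship such as Price ~ 1/LSTAT can be made linear by adding the reciprocal of the feature as a new column. The sketch below illustrates the idea on synthetic data only; the arrays lstat and price are hypothetical stand-ins, not the Boston values, and the noise level is an arbitrary assumption.

```python
import numpy as np

# Synthetic stand-in for a feature/target pair with an inverse relationship
# (hypothetical data, not the Boston values).
rng = np.random.default_rng(0)
lstat = rng.uniform(2.0, 35.0, size=500)
price = 100.0 / lstat + rng.normal(0.0, 0.5, size=500)

# Reciprocal feature: linear in the target if price ~ 1/lstat holds.
inv_lstat = 1.0 / lstat

# Compare the linear correlation before and after the transform.
corr_raw = np.corrcoef(lstat, price)[0, 1]
corr_inv = np.corrcoef(inv_lstat, price)[0, 1]
print(f"corr(lstat, price)   = {corr_raw:.3f}")
print(f"corr(1/lstat, price) = {corr_inv:.3f}")
```

With the notebook's DataFrame, the analogous step would be something like df['INV_LSTAT'] = 1 / df['LSTAT'], where the column name INV_LSTAT is only an illustrative choice.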
In [32]: # Left half: heatmap of absolute correlations; right half: boxplots of four features
plt.figure(figsize=(18,6))
plt.subplot(121)
sns.heatmap(df.corr().abs(),cmap='Blues')
# Boxplots occupy the two right-hand columns of a 2x4 grid
plt.subplot(243)
df[['NOX']].boxplot()
plt.subplot(244)
df[['TAX']].boxplot()
plt.subplot(247)
df[['RAD']].boxplot()
plt.subplot(248)
df[['INDUS']].boxplot()
The heatmap above shows a strong correlation between RAD and TAX. The boxplots, however, give no evidence of outliers in NOX, TAX, RAD or INDUS, since no
points fall beyond the whiskers. Because the scatterplots showed inverse relationships, we created two new variables in the dataset; these should be
useful when we fit a linear regression line.
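A boxplot marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile, which is matplotlib's default whisker setting. The same check can be sketched numerically; the helper name iqr_outliers and the sample values are hypothetical and serve only as an illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy sample: one value sits far above the rest.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(iqr_outliers(sample))
```

Applied to a column such as df['TAX'], an empty result would agree with a boxplot that shows no points beyond the whiskers.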
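We will now fit a linear regression model to the dataset and see how well it predicts house prices.
In [113]: from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# x holds the feature columns and y the target (PRICE); both are defined in a cell not shown here
lm = LinearRegression()
scores = cross_val_score(lm,x,y, cv=7)
scores.mean()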
Out[113]: 0.5117435579867984
Out[114]: 0.6081295296332616
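The cell that produced Out[114] does not appear in this export. A score of this kind could be obtained along the following lines. This is a sketch under stated assumptions, not the notebook's actual cell: the data is synthetic (make_regression), and the names x_demo, y_demo and rf_demo as well as the hyperparameters are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's x and y (hypothetical data).
x_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=5, noise=10.0, random_state=0
)

# Hyperparameters are illustrative; the notebook's settings are not shown.
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)

# Same 7-fold cross-validation as used for the linear model.
rf_scores = cross_val_score(rf_demo, x_demo, y_demo, cv=7)
print(rf_scores.mean())
```

Using the same cv value for both models keeps the two mean scores comparable.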
We see that the Random Forest Regressor, which is an ensemble of multiple regressors, raises the mean cross-validation score from about 0.51 to about
0.61. The plot below shows the importance of the individual features.
In [115]: rf.fit(x,y)
feat_imp = pd.Series(rf.feature_importances_, index=x.columns)
feat_imp.nlargest(18).plot(kind='barh')
Out[117]: 0.6074480065154999
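Dropping the less significant features leaves the score essentially unchanged (0.607 versus 0.608), so we keep the model as it is. We have built a model
that predicts house prices with a mean cross-validated score of about 0.61. Because cross_val_score reports R² for regressors by default, this means the
model explains roughly 60% of the variance in price on held-out folds; it is not a 60% accuracy.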
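The scores quoted above come from cross_val_score, which for a regressor defaults to the coefficient of determination R² rather than a classification-style accuracy. A minimal hand-written sketch of that metric follows; the function name r2_score_manual and the toy values are ours, for illustration only.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
r2_perfect = r2_score_manual(y_true, y_true)               # exact predictions
r2_mean = r2_score_manual(y_true, [6.0] * 4)               # always predict the mean
r2_reversed = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(r2_perfect, r2_mean, r2_reversed)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.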