Regression Predictive Modelling Problem In R | Investigate The Boston House Price Dataset

Problem Definition

For this project we will investigate the Boston House Price dataset. It is already included in the library mlbench, so you just need to install the package, as follows:


install.library("mlbench") 
# load the package 
library(mlbench) 
# list the contents of the package 
library(help = "mlbench") 
# Boston Housing Data 
data(BostonHousing) 

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository)


1. CRIM: per capita crime rate by town

2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

3. INDUS: proportion of non-retail business acres per town2

4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

5. NOX: nitric oxides concentration (parts per 10 million)

6. RM: average number of rooms per dwelling

7. AGE: proportion of owner-occupied units built prior to 1940

8. DIS: weighted distances to five Boston employment centers

9. RAD: index of accessibility to radial highways

10. TAX: full-value property-tax rate per $10,000

11. PTRATIO: pupil-teacher ratio by town

12. B: 1000(𝐵𝑘 − 0.63) 2 , where 𝐵𝑘 is the proportion of blacks by town

13. LSTAT: % lower status of the population

14. MEDV: Median value of owner-occupied homes in $1000s We can see that the input attributes have a mixture of units


Load the Dataset

The dataset is available in the mlbench package. Let's start by offloading the required packages and loading the dataset.


# load packages 
library(mlbench) 
library(caret) 
library(corrplot) 
# attach the BostonHousing dataset 
data(BostonHousing) 

Validation Dataset

It is a good idea to use a validation hold out set. This is a sample of the data that we hold back from our analysis and modelling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence on our estimates of accuracy on unseen data.

# Split out validation dataset 
# create a list of 80% of the rows in the original dataset we can use for  training 
set.seed(7)
validationIndex <- createDataPartition(BostonHousing$medv, p=0.80,  list=FALSE) 
# select 20% of the data for validation 
validation <- BostonHousing[-validationIndex,] 
# use the remaining 80% of data to training and testing the models 
dataset <- BostonHousing[validationIndex,] 

Analyse Data

The objective of this step in the process is to better understand the problem.


Descriptive Statistics

Let's start off by confirming the dimensions of the dataset, e.g., the number of rows and columns.

# dimensions of dataset 
dim(dataset) 

We have 407 instances to work with and can confirm the data has 14 attributes including the class attribute medv.

407 14


Let's also look at the data types of each attribute.

# list types for each attribute 
sapply(dataset, class) 

We can see that one of the attributes (chas) is a factor while all of the others are numeric.


crim zn indus chas nox rm age dis rad tax

ptratio b

"numeric" "numeric" "numeric" "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

lstat medv

"numeric" "numeric"


Let's now take a peak at the first 20 rows of the data.

# take a peek at the first 5 rows of the data 
head(dataset, n=20)

We can confirm that the scales for the attributes are all over the place because of the differing units. We may benefit from some transforms later on.


crim zn indus chas nox rm age dis rad tax ptratio b lstat medv

2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1 9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5 13 0.09378 12.5 7.87 0 0.524 5.889 39.0 5.4509 5 311 15.2 390.50 15.71 21.7 14 0.62976 0.0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.26 20.4 15 0.63796 0.0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 380.02 10.26 18.2 16 0.62739 0.0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 395.62 8.47 19.9 17 1.05393 0.0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21.0 386.85 6.58 23.1 18 0.78420 0.0 8.14 0 0.538 5.990 81.7 4.2579 4 307 21.0 386.75 14.67 17.5 19 0.80271 0.0 8.14 0 0.538 5.456 36.6 3.7965 4 307 21.0 288.99 11.69 20.2 20 0.72580 0.0 8.14 0 0.538 5.727 69.5 3.7965 4 307 21.0 390.95 11.28 18.2 23 1.23247 0.0 8.14 0 0.538 6.142 91.7 3.9769 4 307 21.0 396.90 18.72 15.2 25 0.75026 0.0 8.14 0 0.538 5.924 94.1 4.3996 4 307 21.0 394.33 16.30 15.6 26 0.84054 0.0 8.14 0 0.538 5.599 85.7 4.4546 4 307 21.0 303.42 16.51 13.9 27 0.67191 0.0 8.14 0 0.538 5.813 90.3 4.6820 4 307 21.0 376.88 14.81 16.6


Let's summarize the distribution of each attribute.

# summarize attribute distributions 
summary(dataset) 

We can note that chas is a pretty unbalanced factor. We could transform this attribute to numeric to make calculating descriptive statistics and plots easier.


crim zn indus chas nox rm age

Min. : 0.00906 Min. : 0.00 Min. : 0.46 0:376 Min. :0.3850 Min. :3.863 Min. : 2.90


1st Qu.: 0.08556 1st Qu.: 0.00 1st Qu.: 5.19 1: 31 1st Qu.:0.4530 1st Qu.:5.873 1st Qu.: 45.05


Median : 0.28955 Median : 0.00 Median : 9.90 Median :0.5380 Median :6.185 Median : 77.70


Mean : 3.58281 Mean :10.57 Mean :11.36 Mean :0.5577 Mean :6.279 Mean : 68.83


3rd Qu.: 3.50464 3rd Qu.: 0.00 3rd Qu.:18.10 3rd Qu.:0.6310 3rd Qu.:6.611 3rd Qu.: 94.55


Max. :88.97620 Max. :95.00 Max. :27.74 Max. :0.8710 Max. :8.780 Max. :100.00


dis rad tax ptratio b lstat medv

Min. : 1.130 Min. : 1.000 Min. :188.0 Min. :12.60 Min. : 0.32 Min. : 1.730 Min. : 5.00


1st Qu.: 2.031 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:374.50 1st Qu.: 6.895 1st Qu.:17.05


Median : 3.216 Median : 5.000 Median :330.0 Median :19.00 Median :391.13 Median :11.500 Median :21.20


Mean : 3.731 Mean : 9.464 Mean :405.6 Mean :18.49 Mean :357.88 Mean :12.827 Mean :22.61


3rd Qu.: 5.100 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.27 3rd Qu.:17.175 3rd Qu.:25.00


Max. :10.710 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.970 Max. :50.00


Let's go ahead and convert chas to a numeric attribute.


dataset[,4] <- as.numeric(as.character(dataset[,4]))


Now, let's now take a look at the correlation between all of the numeric attributes. cor(dataset[,1:13])


This is interesting. We can see that many of the attributes have a strong correlation (e.g.> 0:70 or < 0:70). For example:

  • nox and indus with 0.77

  • dist and indus with 0.71

  • tax and indus with 0.72

  • age and nox with 0.72

  • dist and nox with 0.76

crim zn indus chas nox rm age dis rad tax ptratio b lstat

crim 1.00000000 -0.19790631 0.40597009 -0.05713065 0.4232413 -0.21513269 0.3543819 -0.3905097 0.64240501 0.60622608 0.2892983 -0.3021185 0.47537617


zn -0.19790631 1.00000000 -0.51895069 -0.04843477 -0.5058512 0.28942883 - 0.5707027 0.6561874 -0.29952976 -0.28791668 -0.3534121 0.1692749 -0.39712686


indus 0.40597009 -0.51895069 1.00000000 0.08003629 0.7665481 -0.37673408 0.6585831 -0.7230588 0.56774365 0.68070916 0.3292061 -0.3359795 0.59212718


chas -0.05713065 -0.04843477 0.08003629 1.00000000 0.1027366 0.08252441 0.1093812 -0.1114242 -0.00901245 -0.02779018 -0.1355438 0.0472442 -0.04569239


nox 0.42324132 -0.50585121 0.76654811 0.10273656 1.0000000 -0.29885055 0.7238371 -0.7708680 0.58516760 0.65217875 0.1416616 -0.3620791 0.58196447


rm -0.21513269 0.28942883 -0.37673408 0.08252441 -0.2988506 1.00000000 - 0.2325359 0.1952159 -0.19149122 -0.26794733 -0.3200037 0.1553992 -0.62038075


age 0.35438190 -0.57070265 0.65858310 0.10938121 0.7238371 -0.23253586 1.0000000 -0.7503321 0.45235421 0.50164657 0.2564318 -0.2512574 0.59321281


This is collinearity and we may see better results with regression algorithms if the correlated attributes are removed.


Unimodal Data Visualizations

Let's look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let's look at histograms of each attribute to get a sense of the data distributions.


# histograms each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
hist(dataset[,i], main=names(dataset)[i]) 
}



We can see that some attributes may have an exponential distribution, such as crim, zn, age and b. We can see that others may have a bimodal distribution such as rad and tax


Result:












Let's look at the same distributions using density plots that smooth them out a bit.


# density plot for each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
plot(density(dataset[,i]), main=names(dataset)[i]) 
} 

Output Result:











This perhaps adds more evidence to our suspicion about possible exponential and bimodal distributions. It also looks like nox, rm and lsat may be skewed Gaussian distributions, which might be helpful later with transforms.


Let's look at the data with box and whisker plots of each attribute.


# boxplots for each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
boxplot(dataset[,i], main=names(dataset)[i]) 
}

This helps point out the skew in many distributions so much so that data looks like outliers (e.g., beyond the whisker of the plots).


Output Result:











Multi-modal Data Visualisations

Let's look at some visualisations of the interactions between variables. The best place to start is a scatterplot matrix.



# scatterplot matrix 
pairs(dataset[,1:13]) 

We can see that some of the higher correlated attributes do show good structure in their relationship. Not linear, but nice predictable curved relationships


Output Result:

Scatterplot Matrix of Boston House Dataset Input Attributes.













# correlation plot 
correlations <- cor(dataset[,1:13]) 
corrplot(correlations, method="circle") 

The larger darker blue dots confirm the positively correlated attributes we listed early (not the diagonal). We can also see some larger darker red dots that suggest some negatively correlated attributes. For example, tax and rad. These too may be candidates for removal to better improve accuracy of models later on


Output Result:














Summary of Ideas

There is a lot of structure in this dataset. We need to think about transforms that we could use later to better expose the structure which in turn may improve modelling accuracy. So far it would be worth trying:

  • Feature selection and removing the most correlated attributes.

  • Normalizing the dataset to reduce the effect of differing scales.

  • Standardizing the dataset to reduce the effects of differing distributions.

  • Box-Cox transform to see if flattening out some of the distributions improves accuracy.

With lots of additional time I would also explore the possibility of binning (discretization) of the data. This can often improve accuracy for decision tree algorithms.


Evaluate Algorithms: Baseline

We have no idea what algorithms will do well on this problem. Gut feel suggests regression algorithms like GLM and GLMNET may do well. It is also possible that decision trees and even SVM may do well. I have no idea. Let's design our test harness. We will use 10-fold cross-validation (each fold will be about 360 instances for training and 40 for test) with 3 repeats


The dataset is not too small, and this is a good standard test harness configuration. We will evaluate algorithms using the RMSE and R2 metrics. RMSE will give a gross idea of how wrong all predictions are (0 is perfect) and R2 will give an idea of how well the model has fit the data (1 is perfect, 0 is worst).


# Prepare the test harness for evaluating algorithms. 
# Run algorithms using 10-fold cross validation 
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3) metric <- "RMSE"

Let's create a baseline of performance on this problem and spot-check a number of different algorithms. We will select a suite of different algorithms capable of working on this regression problem. The 6 algorithms selected include:

  • Linear Algorithms: Linear Regression (LR), Generalized Linear Regression (GLM) and Penalized Linear Regression (GLMNET)

  • Non-Linear Algorithms: Classification and Regression Trees (CART), Support Vector Machines (SVM) with a radial basis function and k-Nearest Neighbours (KNN)


We know the data has differing units of measure so we will standardize the data for this baseline comparison. This will those algorithms that prefer data in the same scale (e.g., instance-based methods and some regression algorithms) a chance to do well.


# Estimate accuracy of machine learning algorithms

# LM

set.seed(7)

fit.lm <- train(medv~., data=dataset, method="lm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)


# GLM

set.seed(7)

fit.glm <- train(medv~., data=dataset, method="glm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)


# GLMNET

set.seed(7)

fit.glmnet <- train(medv~., data=dataset, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=trainControl)


# SVM

set.seed(7)

fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric,13 preProc=c("center", "scale"), trControl=trainControl)


# CART

set.seed(7) grid <- expand.grid(.cp=c(0, 0.05, 0.1))

fit.cart <- train(medv~., data=dataset, method="rpart", metric=metric, tuneGrid=grid, preProc=c("center", "scale"), trControl=trainControl)


# KNN

set.seed(7)

fit.knn <- train(medv~., data=dataset, method="knn", metric=metric, preProc=c("center", "scale"), trControl=trainControl)


The algorithms all use default tuning parameters, except CART which is fussy on this dataset and has 3 default parameters specified. Let's compare the algorithms. We will use a simple table of results to get a quick idea of what is going on. We will also use a dot plot to show the 95% confidence level for the estimated metrics.


# Collect resample statistics from models and summarize results. 
# Compare algorithms 
results <- resamples(list(LM=fit.lm, GLM=fit.glm, GLMNET=fit.glmnet,  SVM=fit.svm, CART=fit.cart, KNN=fit.knn)) 
summary(results) 
dotplot(results)


Output Result:

Dotplot Compare Machine Learning Algorithms on the Boston House Price Dataset with Box-Cox Power Transform