Regression Predictive Modelling Problem In R | Investigate The Boston House Price Dataset

realcode4you
Jun 12, 2021

Problem Definition

For this project we will investigate the Boston House Price dataset. It is already included in the library mlbench, so you just need to install the package, as follows:

install.packages("mlbench")
# load the package 
library(mlbench) 
# list the contents of the package 
library(help = "mlbench") 
# Boston Housing Data 
data(BostonHousing)

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

1. CRIM: per capita crime rate by town

2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

3. INDUS: proportion of non-retail business acres per town

4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

5. NOX: nitric oxides concentration (parts per 10 million)

6. RM: average number of rooms per dwelling

7. AGE: proportion of owner-occupied units built prior to 1940

8. DIS: weighted distances to five Boston employment centers

9. RAD: index of accessibility to radial highways

10. TAX: full-value property-tax rate per $10,000

11. PTRATIO: pupil-teacher ratio by town

12. B: $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of blacks by town

13. LSTAT: % lower status of the population

14. MEDV: Median value of owner-occupied homes in $1000s

We can see that the input attributes have a mixture of units.

Load the Dataset

The dataset is available in the mlbench package. Let's start by loading the required packages and the dataset.

# load packages 
library(mlbench) 
library(caret) 
library(corrplot) 
# attach the BostonHousing dataset 
data(BostonHousing)

Validation Dataset

It is a good idea to use a validation hold-out set. This is a sample of the data that we hold back from our analysis and modelling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence in our estimates of accuracy on unseen data.

# Split out validation dataset 
# create a list of 80% of the rows in the original dataset we can use for  training 
set.seed(7)
validationIndex <- createDataPartition(BostonHousing$medv, p=0.80,  list=FALSE) 
# select 20% of the data for validation 
validation <- BostonHousing[-validationIndex,] 
# use the remaining 80% of data for training and testing the models 
dataset <- BostonHousing[validationIndex,]
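For intuition, the idea behind the split can be sketched in base R. This is only a sketch on a made-up stand-in data frame (`toy`, with a fake `medv` column), not the article's data. It also draws rows purely at random, whereas `createDataPartition` samples within groups of the outcome, so the exact row counts can differ slightly from the ones reported below.

```r
# Toy stand-in for BostonHousing: 506 rows, one fake outcome column (assumption).
set.seed(7)
toy <- data.frame(id = 1:506, medv = runif(506, min = 5, max = 50))

# Draw 80% of the row numbers at random for training.
train_rows <- sample(nrow(toy), size = floor(0.80 * nrow(toy)))

toy_train <- toy[train_rows, ]        # used for analysis and modelling
toy_validation <- toy[-train_rows, ]  # held back until the very end

# The two parts are disjoint and together cover every row.
dim(toy_train)
dim(toy_validation)
length(intersect(toy_train$id, toy_validation$id))
```

The negative index `-train_rows` is the same trick the article uses with `-validationIndex`: it selects every row that was not drawn.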

Analyse Data

The objective of this step in the process is to better understand the problem.

Descriptive Statistics

Let's start off by confirming the dimensions of the dataset, e.g., the number of rows and columns.

# dimensions of dataset 
dim(dataset)

We have 407 instances to work with and can confirm the data has 14 attributes, including the output attribute medv.

407 14

Let's also look at the data types of each attribute.

# list types for each attribute 
sapply(dataset, class)

We can see that one of the attributes (chas) is a factor while all of the others are numeric.

     crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b
"numeric" "numeric" "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
    lstat      medv
"numeric" "numeric"

Let's now take a peek at the first 20 rows of the data.

# take a peek at the first 20 rows of the data 
head(dataset, n=20)

We can confirm that the scales for the attributes are all over the place because of the differing units. We may benefit from some transforms later on.

      crim   zn indus chas   nox    rm   age    dis rad tax ptratio      b lstat medv
 2 0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90  9.14 21.6
 3 0.02729  0.0  7.07    0 0.469 7.185  61.1 4.9671   2 242    17.8 392.83  4.03 34.7
 4 0.03237  0.0  2.18    0 0.458 6.998  45.8 6.0622   3 222    18.7 394.63  2.94 33.4
 5 0.06905  0.0  2.18    0 0.458 7.147  54.2 6.0622   3 222    18.7 396.90  5.33 36.2
 6 0.02985  0.0  2.18    0 0.458 6.430  58.7 6.0622   3 222    18.7 394.12  5.21 28.7
 7 0.08829 12.5  7.87    0 0.524 6.012  66.6 5.5605   5 311    15.2 395.60 12.43 22.9
 8 0.14455 12.5  7.87    0 0.524 6.172  96.1 5.9505   5 311    15.2 396.90 19.15 27.1
 9 0.21124 12.5  7.87    0 0.524 5.631 100.0 6.0821   5 311    15.2 386.63 29.93 16.5
13 0.09378 12.5  7.87    0 0.524 5.889  39.0 5.4509   5 311    15.2 390.50 15.71 21.7
14 0.62976  0.0  8.14    0 0.538 5.949  61.8 4.7075   4 307    21.0 396.90  8.26 20.4
15 0.63796  0.0  8.14    0 0.538 6.096  84.5 4.4619   4 307    21.0 380.02 10.26 18.2
16 0.62739  0.0  8.14    0 0.538 5.834  56.5 4.4986   4 307    21.0 395.62  8.47 19.9
17 1.05393  0.0  8.14    0 0.538 5.935  29.3 4.4986   4 307    21.0 386.85  6.58 23.1
18 0.78420  0.0  8.14    0 0.538 5.990  81.7 4.2579   4 307    21.0 386.75 14.67 17.5
19 0.80271  0.0  8.14    0 0.538 5.456  36.6 3.7965   4 307    21.0 288.99 11.69 20.2
20 0.72580  0.0  8.14    0 0.538 5.727  69.5 3.7965   4 307    21.0 390.95 11.28 18.2
23 1.23247  0.0  8.14    0 0.538 6.142  91.7 3.9769   4 307    21.0 396.90 18.72 15.2
25 0.75026  0.0  8.14    0 0.538 5.924  94.1 4.3996   4 307    21.0 394.33 16.30 15.6
26 0.84054  0.0  8.14    0 0.538 5.599  85.7 4.4546   4 307    21.0 303.42 16.51 13.9
27 0.67191  0.0  8.14    0 0.538 5.813  90.3 4.6820   4 307    21.0 376.88 14.81 16.6

Let's summarize the distribution of each attribute.

# summarize attribute distributions 
summary(dataset)

We can note that chas is a pretty unbalanced factor. We could transform this attribute to numeric to make calculating descriptive statistics and plots easier.

crim zn indus chas nox rm age

Min. : 0.00906 Min. : 0.00 Min. : 0.46 0:376 Min. :0.3850 Min. :3.863 Min. : 2.90

1st Qu.: 0.08556 1st Qu.: 0.00 1st Qu.: 5.19 1: 31 1st Qu.:0.4530 1st Qu.:5.873 1st Qu.: 45.05

Median : 0.28955 Median : 0.00 Median : 9.90 Median :0.5380 Median :6.185 Median : 77.70

Mean : 3.58281 Mean :10.57 Mean :11.36 Mean :0.5577 Mean :6.279 Mean : 68.83

3rd Qu.: 3.50464 3rd Qu.: 0.00 3rd Qu.:18.10 3rd Qu.:0.6310 3rd Qu.:6.611 3rd Qu.: 94.55

Max. :88.97620 Max. :95.00 Max. :27.74 Max. :0.8710 Max. :8.780 Max. :100.00

dis rad tax ptratio b lstat medv

Min. : 1.130 Min. : 1.000 Min. :188.0 Min. :12.60 Min. : 0.32 Min. : 1.730 Min. : 5.00

1st Qu.: 2.031 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:374.50 1st Qu.: 6.895 1st Qu.:17.05

Median : 3.216 Median : 5.000 Median :330.0 Median :19.00 Median :391.13 Median :11.500 Median :21.20

Mean : 3.731 Mean : 9.464 Mean :405.6 Mean :18.49 Mean :357.88 Mean :12.827 Mean :22.61

3rd Qu.: 5.100 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.27 3rd Qu.:17.175 3rd Qu.:25.00

Max. :10.710 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.970 Max. :50.00

Let's go ahead and convert chas to a numeric attribute.

dataset[,4] <- as.numeric(as.character(dataset[,4]))
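The detour through `as.character()` matters. A small sketch on a made-up factor (`chas_toy` is a hypothetical name, not part of the dataset) shows what would happen without it:

```r
# A small factor with the same labels as chas.
chas_toy <- factor(c("0", "1", "0", "0", "1"))

# Direct conversion returns the internal level codes: 1 2 1 1 2
as.numeric(chas_toy)

# Going through character keeps the original values: 0 1 0 0 1
as.numeric(as.character(chas_toy))
```

Converting the factor directly would silently turn the 0/1 dummy variable into 1/2, which is still usable for correlations but no longer matches the attribute's definition.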

Now let's take a look at the correlation between all of the numeric attributes.

# summarize correlations between input attributes 
cor(dataset[,1:13])

This is interesting. We can see that many of the attributes are strongly correlated (e.g., at or beyond roughly 0.70 in absolute value). For example, reading from the output below:

nox and indus with 0.77
dis and indus with -0.72
tax and indus with 0.68
age and nox with 0.72
dis and nox with -0.77

crim zn indus chas nox rm age dis rad tax ptratio b lstat

crim 1.00000000 -0.19790631 0.40597009 -0.05713065 0.4232413 -0.21513269 0.3543819 -0.3905097 0.64240501 0.60622608 0.2892983 -0.3021185 0.47537617

zn -0.19790631 1.00000000 -0.51895069 -0.04843477 -0.5058512 0.28942883 -0.5707027 0.6561874 -0.29952976 -0.28791668 -0.3534121 0.1692749 -0.39712686

indus 0.40597009 -0.51895069 1.00000000 0.08003629 0.7665481 -0.37673408 0.6585831 -0.7230588 0.56774365 0.68070916 0.3292061 -0.3359795 0.59212718

chas -0.05713065 -0.04843477 0.08003629 1.00000000 0.1027366 0.08252441 0.1093812 -0.1114242 -0.00901245 -0.02779018 -0.1355438 0.0472442 -0.04569239

nox 0.42324132 -0.50585121 0.76654811 0.10273656 1.0000000 -0.29885055 0.7238371 -0.7708680 0.58516760 0.65217875 0.1416616 -0.3620791 0.58196447

rm -0.21513269 0.28942883 -0.37673408 0.08252441 -0.2988506 1.00000000 -0.2325359 0.1952159 -0.19149122 -0.26794733 -0.3200037 0.1553992 -0.62038075

age 0.35438190 -0.57070265 0.65858310 0.10938121 0.7238371 -0.23253586 1.0000000 -0.7503321 0.45235421 0.50164657 0.2564318 -0.2512574 0.59321281

This is collinearity and we may see better results with regression algorithms if the correlated attributes are removed.
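Scanning a 13 by 13 matrix by eye is error prone, so here is a minimal base R sketch that lists the strongly correlated pairs for you. `strong_pairs` is a hypothetical helper written for this post, not a caret or mlbench function, and the demo runs on made-up data so that it is self-contained.

```r
# List attribute pairs whose absolute correlation exceeds a cutoff.
# strong_pairs is a hypothetical helper, not part of caret or mlbench.
strong_pairs <- function(df, cutoff = 0.70) {
  cm <- cor(df)
  # Look only above the diagonal so each pair is reported once.
  hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    attr1 = rownames(cm)[hits[, "row"]],
    attr2 = colnames(cm)[hits[, "col"]],
    r = cm[hits],
    stringsAsFactors = FALSE
  )
}

# Toy data: x2 is built from x1, x3 is independent noise.
set.seed(7)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.3), x3 = rnorm(200))
strong_pairs(toy)
```

In the context of this project the call would be `strong_pairs(dataset[, 1:13])`, run after chas has been converted to numeric.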

Univariate Data Visualizations

Let's look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let's look at histograms of each attribute to get a sense of the data distributions.

# histogram for each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
hist(dataset[,i], main=names(dataset)[i]) 
}

We can see that some attributes may have an exponential distribution, such as crim, zn, age and b. We can see that others may have a bimodal distribution, such as rad and tax.

Result:

Let's look at the same distributions using density plots that smooth them out a bit.

# density plot for each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
plot(density(dataset[,i]), main=names(dataset)[i]) 
}

Output Result:

This perhaps adds more evidence to our suspicion about possible exponential and bimodal distributions. It also looks like nox, rm and lstat may be skewed Gaussian distributions, which might be helpful later with transforms.

Let's look at the data with box and whisker plots of each attribute.

# boxplots for each attribute 
par(mfrow=c(2,7)) 
for(i in 1:13) { 
boxplot(dataset[,i], main=names(dataset)[i]) 
}

This helps point out the skew in many distributions, so much so that some of the data looks like outliers (e.g., beyond the whiskers of the plots).

Output Result:

Multivariate Data Visualisations

Let's look at some visualisations of the interactions between variables. The best place to start is a scatterplot matrix.


# scatterplot matrix 
pairs(dataset[,1:13])

We can see that some of the more highly correlated attributes do show good structure in their relationship. Not linear, but nice predictable curved relationships.

Output Result:

Scatterplot Matrix of Boston House Dataset Input Attributes.

# correlation plot 
correlations <- cor(dataset[,1:13]) 
corrplot(correlations, method="circle")

The larger, darker blue dots confirm the positively correlated attributes we listed earlier (ignoring the diagonal). We can also see some larger, darker red dots that indicate negatively correlated attributes, for example dis with indus, nox and age. These too may be candidates for removal to improve the accuracy of models later on.

Output Result:

Summary of Ideas

There is a lot of structure in this dataset. We need to think about transforms that we could use later to better expose the structure which in turn may improve modelling accuracy. So far it would be worth trying:

Feature selection and removing the most correlated attributes.
Normalizing the dataset to reduce the effect of differing scales.
Standardizing the dataset to reduce the effects of differing distributions.
Box-Cox transform to see if flattening out some of the distributions improves accuracy.

With lots of additional time I would also explore the possibility of binning (discretization) of the data. This can often improve accuracy for decision tree algorithms.

Evaluate Algorithms: Baseline

We have no idea what algorithms will do well on this problem. Gut feel suggests regression algorithms like GLM and GLMNET may do well. It is also possible that decision trees and even SVM may do well. I have no idea. Let's design our test harness. We will use 10-fold cross-validation (each fold will be about 360 instances for training and 40 for test) with 3 repeats.
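The fold arithmetic can be checked with a few lines of base R. This is a sketch of the counting only: caret builds its folds with its own routine, so the exact membership (and occasionally the sizes) will differ from what is shown here.

```r
# Assign each of the 407 training rows to one of 10 folds, as evenly as possible.
n <- 407
k <- 10
set.seed(7)
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out for testing in each fold, and rows left for training.
test_sizes <- as.vector(table(fold_id))
train_sizes <- n - test_sizes
test_sizes   # seven folds of 41 and three folds of 40
train_sizes  # so 366 or 367 rows are used for training

# Repeating the whole procedure 3 times gives 30 resamples.
k * 3
```

The 366 and 367 training sizes are the same ones that appear in the "Summary of sample sizes" line of the caret output further down.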

The dataset is not too small, and this is a good standard test harness configuration. We will evaluate algorithms using the RMSE and R2 metrics. RMSE will give a gross idea of how wrong all predictions are (0 is perfect) and R2 will give an idea of how well the model has fit the data (1 is perfect, 0 is worst).
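To make the two metrics concrete, here is a minimal sketch that computes them by hand. The observed values are the first few medv values from the listing above; the predictions are made up. R2 is taken here as the squared correlation between observed and predicted values, which is an assumption about the definition in use (see `?postResample` in caret to confirm what your version reports).

```r
# Root mean squared error: typical size of a prediction error,
# in the same units as the outcome (here, $1000s).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R2 taken as the squared correlation between observed and predicted (assumption).
r_squared <- function(obs, pred) cor(obs, pred)^2

obs <- c(21.6, 34.7, 33.4, 36.2, 28.7)   # medv values from the listing above
pred <- c(23.0, 32.0, 34.0, 35.0, 27.0)  # made-up predictions

rmse(obs, pred)
r_squared(obs, pred)
```

A perfect set of predictions would give an RMSE of 0 and an R2 of 1, which is why lower RMSE and higher R2 are both read as improvements in the tables that follow.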

# Prepare the test harness for evaluating algorithms. 
# Run algorithms using 10-fold cross validation 
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3) 
metric <- "RMSE"

Let's create a baseline of performance on this problem and spot-check a number of different algorithms. We will select a suite of different algorithms capable of working on this regression problem. The 6 algorithms selected include:

Linear Algorithms: Linear Regression (LR), Generalized Linear Regression (GLM) and Penalized Linear Regression (GLMNET)
Non-Linear Algorithms: Classification and Regression Trees (CART), Support Vector Machines (SVM) with a radial basis function and k-Nearest Neighbours (KNN)

We know the data has differing units of measure, so we will standardize the data for this baseline comparison. This will give those algorithms that prefer data on the same scale (e.g., instance-based methods and some regression algorithms) a chance to do well.
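Standardizing simply means centering each attribute on 0 and scaling it to a standard deviation of 1. A small base R sketch on made-up values in the style of the tax and nox columns shows the effect; in the real run, caret applies the same idea through the preProc argument.

```r
# Two made-up columns on very different scales.
toy <- data.frame(tax = c(242, 222, 311, 307, 666),
                  nox = c(0.469, 0.458, 0.524, 0.538, 0.770))

# scale() subtracts each column's mean and divides by its standard deviation.
standardized <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(standardized), 10)  # each column now has mean 0
apply(standardized, 2, sd)         # and standard deviation 1
```

After this step a one-unit difference means "one standard deviation" in every column, so distance-based methods such as KNN are no longer dominated by the attribute with the largest raw numbers.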

# Estimate accuracy of machine learning algorithms

# LM

set.seed(7)

fit.lm <- train(medv~., data=dataset, method="lm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)

# GLM

set.seed(7)

fit.glm <- train(medv~., data=dataset, method="glm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)

# GLMNET

set.seed(7)

fit.glmnet <- train(medv~., data=dataset, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=trainControl)

# SVM

set.seed(7)

fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric, preProc=c("center", "scale"), trControl=trainControl)

# CART

set.seed(7)

grid <- expand.grid(.cp=c(0, 0.05, 0.1))

fit.cart <- train(medv~., data=dataset, method="rpart", metric=metric, tuneGrid=grid, preProc=c("center", "scale"), trControl=trainControl)

# KNN

set.seed(7)

fit.knn <- train(medv~., data=dataset, method="knn", metric=metric, preProc=c("center", "scale"), trControl=trainControl)

The algorithms all use default tuning parameters, except CART, which is fussy on this dataset and has 3 candidate cp values specified explicitly. Let's compare the algorithms. We will use a simple table of results to get a quick idea of what is going on. We will also use a dot plot to show the 95% confidence intervals for the estimated metrics.

# Collect resample statistics from models and summarize results. 
# Compare algorithms 
results <- resamples(list(LM=fit.lm, GLM=fit.glm, GLMNET=fit.glmnet,  SVM=fit.svm, CART=fit.cart, KNN=fit.knn)) 
summary(results) 
dotplot(results)

Output Result:

Dotplot Compare Machine Learning Algorithms on the Boston House Price Dataset with Box-Cox Power Transform

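The results summarized here come from the run on the Box-Cox transformed dataset, i.e., with a Box-Cox power transform applied in addition to centering and scaling. We can see that the transform did indeed decrease the RMSE and increase the R2 on all except the CART algorithm. The RMSE of SVM dropped to an average of 3.761.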
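For readers who have not met the Box-Cox power transform named in the caption above, here is a minimal base R sketch of what it does to a single skewed attribute. `box_cox` and `skewness` are hypothetical helpers written for this post, the data is made up, and the lambda is fixed by hand; caret instead estimates a lambda for each attribute when "BoxCox" is included in preProc, as in the tuning code later on.

```r
# Box-Cox power transform for strictly positive values.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness: 0 for a symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# A right-skewed toy sample (log-normal), loosely similar in shape to crim.
set.seed(7)
x <- rlnorm(500)

# lambda = 0 is the log transform, which flattens this particular skew.
y <- box_cox(x, lambda = 0)
skewness(x)
skewness(y)
```

The transform is only defined for positive values, which is presumably why the caret output further down reports Box-Cox being applied to 11 of the 13 attributes: zn and chas both contain zeros.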

# Output of estimated accuracy of models on transformed dataset.

Models: LM, GLM, GLMNET, SVM, CART, KNN

Number of resamples: 30

RMSE

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

LM 3.404 3.811 4.399 4.621 5.167 7.781 0

GLM 3.404 3.811 4.399 4.621 5.167 7.781 0

GLMNET 3.312 3.802 4.429 4.611 5.123 7.976 0

SVM 2.336 2.937 3.543 3.761 4.216 8.207 0

CART 2.797 3.434 4.272 4.541 5.437 9.248 0

KNN 2.474 3.608 4.308 4.563 5.080 8.922 0

Rsquared

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

LM 0.5439 0.7177 0.7832 0.7627 0.8257 0.8861 0

GLM 0.5439 0.7177 0.7832 0.7627 0.8257 0.8861 0

GLMNET 0.5198 0.7172 0.7808 0.7634 0.8297 0.8909 0

SVM 0.5082 0.8249 0.8760 0.8452 0.8998 0.9450 0

CART 0.3614 0.6733 0.8197 0.7680 0.8613 0.9026 0

KNN 0.4065 0.7562 0.8073 0.7790 0.8594 0.9043 0

Improve Results with Tuning

We can improve the accuracy of the well performing algorithms by tuning their parameters. In this section we will look at tuning the parameters of SVM with a Radial Basis Function (RBF). With more time it might be worth exploring tuning of the parameters for CART and KNN. It might also be worth exploring other kernels for SVM besides the RBF. Let's look at the default parameters already adopted.

# Display estimated accuracy of a model. 
print(fit.svm)

The C parameter is the cost constraint used by SVM. Learn more in the help for the ksvm function: ?ksvm. We can see from previous results that a C value of 1.0 is a good starting point.

# Output of estimated accuracy of a model.

Support Vector Machines with Radial Basis Function Kernel

407 samples

13 predictor

Pre-processing: centered (13), scaled (13), Box-Cox transformation (11)

Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 366, 367, 366, 366, 367, 367, ...

Resampling results across tuning parameters

C RMSE Rsquared RMSE SD Rsquared SD

0.25 4.555338 0.7906921 1.533391 0.11596196

0.50 4.111564 0.8204520 1.467153 0.10573527

1.00 3.761245 0.8451964 1.323218 0.09487941

Tuning parameter 'sigma' was held constant at a value of 0.07491936

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were sigma = 0.07491936 and C = 1.

Let's design a grid search around a C value of 1. We might see a small trend of decreasing RMSE with increasing C, so let’s try all integer C values between 1 and 10. Another parameter that caret lets us tune is the sigma parameter. This is a smoothing parameter. Good sigma values often start around 0.1, so we will try numbers before and after.
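Before running the search, it is worth seeing how large the grid is, because every combination gets evaluated on every resample. A short base R sketch using the same candidate values:

```r
# The candidate values described above: four sigmas and C = 1..10.
grid <- expand.grid(.sigma = c(0.025, 0.05, 0.1, 0.15), .C = seq(1, 10, by = 1))

nrow(grid)         # 40 sigma/C combinations
head(grid, n = 5)  # the first column varies fastest, then the second

# Each combination is evaluated on every resample (10 folds x 3 repeats).
nrow(grid) * 10 * 3
```

That is 1200 model fits for this one algorithm, which is why the search can take noticeably longer than the baseline run.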

# Tune the parameters of a model. 
# tune SVM sigma and C parameters 
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3) 
metric <- "RMSE" 
set.seed(7) 
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1)) 
fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric,  tuneGrid=grid, preProc=c("BoxCox"), trControl=trainControl) 
print(fit.svm) 
plot(fit.svm)

Output Result:

Algorithm Tuning Results for SVM on the Boston House Price Dataset.

We can see that the RMSE curves for the different sigma values flatten out as the C cost constraint grows. It looks like we might do well with a sigma of 0.05 and a C of 10. This gives us a respectable RMSE of 2.977085.

# Output of tuning the parameters of a model.

Support Vector Machines with Radial Basis Function Kernel

407 samples

13 predictor

Pre-processing: Box-Cox transformation (11)

Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 366, 367, 366, 366, 367, 367, ...

Resampling results across tuning parameters:

sigma C RMSE Rsquared RMSE SD Rsquared SD

0.025 1 3.889703 0.8335201 1.4904294 0.11313650

0.025 2 3.685009 0.8470869 1.4126374 0.10919207

0.025 3 3.562851 0.8553298 1.3664097 0.10658536

0.025 4 3.453041 0.8628558 1.3167032 0.10282201

0.025 5 3.372501 0.8686287 1.2700128 0.09837303

------

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were sigma = 0.05 and C = 10.

If we wanted to take this further, we could try even more fine tuning with more grid searches. We could also explore trying to tune other parameters of the underlying ksvm() function. Finally and as already mentioned, we could perform some grid searches on the other non-linear regression methods.

Ensemble Methods

We can try some ensemble methods on the problem and see if we can get a further decrease in our RMSE. In this section we will look at some boosting and bagging techniques for decision trees. Additional approaches you could look into would be blending the predictions of multiple well performing models together, called stacking. Let's take a look at the following ensemble methods:

Random Forest, bagging (RF).
Gradient Boosting Machines, boosting (GBM).
Cubist, boosting (CUBIST).

# Estimate accuracy of ensemble methods. 
# try ensembles 
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3) 
metric <- "RMSE" 

# Random Forest 
set.seed(7) 
fit.rf <- train(medv~., data=dataset, method="rf", metric=metric,  preProc=c("BoxCox"), trControl=trainControl) 

# Stochastic Gradient Boosting 
set.seed(7) 
fit.gbm <- train(medv~., data=dataset, method="gbm", metric=metric,  preProc=c("BoxCox"), trControl=trainControl, verbose=FALSE) 

# Cubist 
set.seed(7) 
fit.cubist <- train(medv~., data=dataset, method="cubist", metric=metric, preProc=c("BoxCox"), trControl=trainControl) 

# Compare algorithms 
ensembleResults <- resamples(list(RF=fit.rf, GBM=fit.gbm,  CUBIST=fit.cubist)) 
summary(ensembleResults) 
dotplot(ensembleResults)

Output:

Ensemble Methods on the Boston House Price Dataset.
