Problem Definition
For this project we will investigate the Boston House Price dataset. It is already included in the library mlbench, so you just need to install the package, as follows:
install.library("mlbench")
# load the package
library(mlbench)
# list the contents of the package
library(help = "mlbench")
# Boston Housing Data
data(BostonHousing)
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository)
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town2
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centers
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(𝐵𝑘 − 0.63) 2 , where 𝐵𝑘 is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000s We can see that the input attributes have a mixture of units
Load the Dataset
The dataset is available in the mlbench package. Let's start by offloading the required packages and loading the dataset.
# load packages
library(mlbench)
library(caret)
library(corrplot)
# attach the BostonHousing dataset
data(BostonHousing)
Validation Dataset
It is a good idea to use a validation hold out set. This is a sample of the data that we hold back from our analysis and modelling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence on our estimates of accuracy on unseen data.
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
validationIndex <- createDataPartition(BostonHousing$medv, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- BostonHousing[-validationIndex,]
# use the remaining 80% of data to training and testing the models
dataset <- BostonHousing[validationIndex,]
Analyse Data
The objective of this step in the process is to better understand the problem.
Descriptive Statistics
Let's start off by confirming the dimensions of the dataset, e.g., the number of rows and columns.
# dimensions of dataset
dim(dataset)
We have 407 instances to work with and can confirm the data has 14 attributes including the class attribute medv.
407 14
Let's also look at the data types of each attribute.
# list types for each attribute
sapply(dataset, class)
We can see that one of the attributes (chas) is a factor while all of the others are numeric.
crim zn indus chas nox rm age dis rad tax
ptratio b
"numeric" "numeric" "numeric" "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
lstat medv
"numeric" "numeric"
Let's now take a peak at the first 20 rows of the data.
# take a peek at the first 5 rows of the data
head(dataset, n=20)
We can confirm that the scales for the attributes are all over the place because of the differing units. We may benefit from some transforms later on.
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1 9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5 13 0.09378 12.5 7.87 0 0.524 5.889 39.0 5.4509 5 311 15.2 390.50 15.71 21.7 14 0.62976 0.0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.26 20.4 15 0.63796 0.0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 380.02 10.26 18.2 16 0.62739 0.0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 395.62 8.47 19.9 17 1.05393 0.0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21.0 386.85 6.58 23.1 18 0.78420 0.0 8.14 0 0.538 5.990 81.7 4.2579 4 307 21.0 386.75 14.67 17.5 19 0.80271 0.0 8.14 0 0.538 5.456 36.6 3.7965 4 307 21.0 288.99 11.69 20.2 20 0.72580 0.0 8.14 0 0.538 5.727 69.5 3.7965 4 307 21.0 390.95 11.28 18.2 23 1.23247 0.0 8.14 0 0.538 6.142 91.7 3.9769 4 307 21.0 396.90 18.72 15.2 25 0.75026 0.0 8.14 0 0.538 5.924 94.1 4.3996 4 307 21.0 394.33 16.30 15.6 26 0.84054 0.0 8.14 0 0.538 5.599 85.7 4.4546 4 307 21.0 303.42 16.51 13.9 27 0.67191 0.0 8.14 0 0.538 5.813 90.3 4.6820 4 307 21.0 376.88 14.81 16.6
Let's summarize the distribution of each attribute.
# summarize attribute distributions
summary(dataset)
We can note that chas is a pretty unbalanced factor. We could transform this attribute to numeric to make calculating descriptive statistics and plots easier.
crim zn indus chas nox rm age
Min. : 0.00906 Min. : 0.00 Min. : 0.46 0:376 Min. :0.3850 Min. :3.863 Min. : 2.90
1st Qu.: 0.08556 1st Qu.: 0.00 1st Qu.: 5.19 1: 31 1st Qu.:0.4530 1st Qu.:5.873 1st Qu.: 45.05
Median : 0.28955 Median : 0.00 Median : 9.90 Median :0.5380 Median :6.185 Median : 77.70
Mean : 3.58281 Mean :10.57 Mean :11.36 Mean :0.5577 Mean :6.279 Mean : 68.83
3rd Qu.: 3.50464 3rd Qu.: 0.00 3rd Qu.:18.10 3rd Qu.:0.6310 3rd Qu.:6.611 3rd Qu.: 94.55
Max. :88.97620 Max. :95.00 Max. :27.74 Max. :0.8710 Max. :8.780 Max. :100.00
dis rad tax ptratio b lstat medv
Min. : 1.130 Min. : 1.000 Min. :188.0 Min. :12.60 Min. : 0.32 Min. : 1.730 Min. : 5.00
1st Qu.: 2.031 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:374.50 1st Qu.: 6.895 1st Qu.:17.05
Median : 3.216 Median : 5.000 Median :330.0 Median :19.00 Median :391.13 Median :11.500 Median :21.20
Mean : 3.731 Mean : 9.464 Mean :405.6 Mean :18.49 Mean :357.88 Mean :12.827 Mean :22.61
3rd Qu.: 5.100 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.27 3rd Qu.:17.175 3rd Qu.:25.00
Max. :10.710 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.970 Max. :50.00
Let's go ahead and convert chas to a numeric attribute.
dataset[,4] <- as.numeric(as.character(dataset[,4]))
Now, let's now take a look at the correlation between all of the numeric attributes. cor(dataset[,1:13])
This is interesting. We can see that many of the attributes have a strong correlation (e.g.> 0:70 or < 0:70). For example:
nox and indus with 0.77
dist and indus with 0.71
tax and indus with 0.72
age and nox with 0.72
dist and nox with 0.76
crim zn indus chas nox rm age dis rad tax ptratio b lstat
crim 1.00000000 -0.19790631 0.40597009 -0.05713065 0.4232413 -0.21513269 0.3543819 -0.3905097 0.64240501 0.60622608 0.2892983 -0.3021185 0.47537617
zn -0.19790631 1.00000000 -0.51895069 -0.04843477 -0.5058512 0.28942883 - 0.5707027 0.6561874 -0.29952976 -0.28791668 -0.3534121 0.1692749 -0.39712686
indus 0.40597009 -0.51895069 1.00000000 0.08003629 0.7665481 -0.37673408 0.6585831 -0.7230588 0.56774365 0.68070916 0.3292061 -0.3359795 0.59212718
chas -0.05713065 -0.04843477 0.08003629 1.00000000 0.1027366 0.08252441 0.1093812 -0.1114242 -0.00901245 -0.02779018 -0.1355438 0.0472442 -0.04569239
nox 0.42324132 -0.50585121 0.76654811 0.10273656 1.0000000 -0.29885055 0.7238371 -0.7708680 0.58516760 0.65217875 0.1416616 -0.3620791 0.58196447
rm -0.21513269 0.28942883 -0.37673408 0.08252441 -0.2988506 1.00000000 - 0.2325359 0.1952159 -0.19149122 -0.26794733 -0.3200037 0.1553992 -0.62038075
age 0.35438190 -0.57070265 0.65858310 0.10938121 0.7238371 -0.23253586 1.0000000 -0.7503321 0.45235421 0.50164657 0.2564318 -0.2512574 0.59321281
This is collinearity and we may see better results with regression algorithms if the correlated attributes are removed.
Unimodal Data Visualizations
Let's look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let's look at histograms of each attribute to get a sense of the data distributions.
# histograms each attribute
par(mfrow=c(2,7))
for(i in 1:13) {
hist(dataset[,i], main=names(dataset)[i])
}
We can see that some attributes may have an exponential distribution, such as crim, zn, age and b. We can see that others may have a bimodal distribution such as rad and tax
Result:
Let's look at the same distributions using density plots that smooth them out a bit.
# density plot for each attribute
par(mfrow=c(2,7))
for(i in 1:13) {
plot(density(dataset[,i]), main=names(dataset)[i])
}
Output Result:
This perhaps adds more evidence to our suspicion about possible exponential and bimodal distributions. It also looks like nox, rm and lsat may be skewed Gaussian distributions, which might be helpful later with transforms.
Let's look at the data with box and whisker plots of each attribute.
# boxplots for each attribute
par(mfrow=c(2,7))
for(i in 1:13) {
boxplot(dataset[,i], main=names(dataset)[i])
}
This helps point out the skew in many distributions so much so that data looks like outliers (e.g., beyond the whisker of the plots).
Output Result:
Multi-modal Data Visualisations
Let's look at some visualisations of the interactions between variables. The best place to start is a scatterplot matrix.
# scatterplot matrix
pairs(dataset[,1:13])
We can see that some of the higher correlated attributes do show good structure in their relationship. Not linear, but nice predictable curved relationships
Output Result:
Scatterplot Matrix of Boston House Dataset Input Attributes.
# correlation plot
correlations <- cor(dataset[,1:13])
corrplot(correlations, method="circle")
The larger darker blue dots confirm the positively correlated attributes we listed early (not the diagonal). We can also see some larger darker red dots that suggest some negatively correlated attributes. For example, tax and rad. These too may be candidates for removal to better improve accuracy of models later on
Output Result:
Summary of Ideas
There is a lot of structure in this dataset. We need to think about transforms that we could use later to better expose the structure which in turn may improve modelling accuracy. So far it would be worth trying:
Feature selection and removing the most correlated attributes.
Normalizing the dataset to reduce the effect of differing scales.
Standardizing the dataset to reduce the effects of differing distributions.
Box-Cox transform to see if flattening out some of the distributions improves accuracy.
With lots of additional time I would also explore the possibility of binning (discretization) of the data. This can often improve accuracy for decision tree algorithms.
Evaluate Algorithms: Baseline
We have no idea what algorithms will do well on this problem. Gut feel suggests regression algorithms like GLM and GLMNET may do well. It is also possible that decision trees and even SVM may do well. I have no idea. Let's design our test harness. We will use 10-fold cross-validation (each fold will be about 360 instances for training and 40 for test) with 3 repeats
The dataset is not too small, and this is a good standard test harness configuration. We will evaluate algorithms using the RMSE and R2 metrics. RMSE will give a gross idea of how wrong all predictions are (0 is perfect) and R2 will give an idea of how well the model has fit the data (1 is perfect, 0 is worst).
# Prepare the test harness for evaluating algorithms.
# Run algorithms using 10-fold cross validation
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3) metric <- "RMSE"
Let's create a baseline of performance on this problem and spot-check a number of different algorithms. We will select a suite of different algorithms capable of working on this regression problem. The 6 algorithms selected include:
Linear Algorithms: Linear Regression (LR), Generalized Linear Regression (GLM) and Penalized Linear Regression (GLMNET)
Non-Linear Algorithms: Classification and Regression Trees (CART), Support Vector Machines (SVM) with a radial basis function and k-Nearest Neighbours (KNN)
We know the data has differing units of measure so we will standardize the data for this baseline comparison. This will those algorithms that prefer data in the same scale (e.g., instance-based methods and some regression algorithms) a chance to do well.
# Estimate accuracy of machine learning algorithms
# LM
set.seed(7)
fit.lm <- train(medv~., data=dataset, method="lm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)
# GLM
set.seed(7)
fit.glm <- train(medv~., data=dataset, method="glm", metric=metric, preProc=c("center", "scale"), trControl=trainControl)
# GLMNET
set.seed(7)
fit.glmnet <- train(medv~., data=dataset, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=trainControl)
# SVM
set.seed(7)
fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric,13 preProc=c("center", "scale"), trControl=trainControl)
# CART
set.seed(7) grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=dataset, method="rpart", metric=metric, tuneGrid=grid, preProc=c("center", "scale"), trControl=trainControl)
# KNN
set.seed(7)
fit.knn <- train(medv~., data=dataset, method="knn", metric=metric, preProc=c("center", "scale"), trControl=trainControl)
The algorithms all use default tuning parameters, except CART which is fussy on this dataset and has 3 default parameters specified. Let's compare the algorithms. We will use a simple table of results to get a quick idea of what is going on. We will also use a dot plot to show the 95% confidence level for the estimated metrics.
# Collect resample statistics from models and summarize results.
# Compare algorithms
results <- resamples(list(LM=fit.lm, GLM=fit.glm, GLMNET=fit.glmnet, SVM=fit.svm, CART=fit.cart, KNN=fit.knn))
summary(results)
dotplot(results)
Output Result:
Dotplot Compare Machine Learning Algorithms on the Boston House Price Dataset with Box-Cox Power Transform
We can see that this indeed decrease the RMSE and increased the 𝑅 2 on all except the CART algorithms. The RMSE of SVM dropped to an average of 3.761.
# Output of estimated accuracy of models on transformed dataset.
Models: LM, GLM, GLMNET, SVM, CART, KNN20
Number of resamples: 30
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LM 3.404 3.811 4.399 4.621 5.167 7.781 0
GLM 3.404 3.811 4.399 4.621 5.167 7.781 0
GLMNET 3.312 3.802 4.429 4.611 5.123 7.976 0
SVM 2.336 2.937 3.543 3.761 4.216 8.207 0
CART 2.797 3.434 4.272 4.541 5.437 9.248 0
KNN 2.474 3.608 4.308 4.563 5.080 8.922 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LM 0.5439 0.7177 0.7832 0.7627 0.8257 0.8861 0
GLM 0.5439 0.7177 0.7832 0.7627 0.8257 0.8861 0
GLMNET 0.5198 0.7172 0.7808 0.7634 0.8297 0.8909 0
SVM 0.5082 0.8249 0.8760 0.8452 0.8998 0.9450 0
CART 0.3614 0.6733 0.8197 0.7680 0.8613 0.9026 0
KNN 0.4065 0.7562 0.8073 0.7790 0.8594 0.9043 0
Improve Results with Tuning
We can improve the accuracy of the well performing algorithms by tuning their parameters. In this section we will look at tuning the parameters of SVM with a Radial Basis Function (RBF). with more time it might be worth exploring tuning of the parameters for CART and KNN. It might also be worth exploring other kernels for SVM besides the RBF. Let's look at the default parameters already adopted.
# Display estimated accuracy of a model.
print(fit.svm)
The C parameter is the cost constraint used by SVM. Learn more in the help for the ksvm function? ksvm. We can see from previous results that a C value of 1.0 is a good starting point.
# Output of estimated accuracy of a model.
Support Vector Machines with Radial Basis Function Kernel
407 samples
13 predictor
Pre-processing: centered (13), scaled (13), Box-Cox transformation (11)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 366, 367, 366, 366, 367, 367, ...
Resampling results across tuning parameters
C RMSE Rsquared RMSE SD Rsquared SD
0.25 4.555338 0.7906921 1.533391 0.11596196
0.50 4.111564 0.8204520 1.467153 0.10573527
1.00 3.761245 0.8451964 1.323218 0.09487941
Tuning parameter 'sigma' was held constant at a value of 0.07491936
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.07491936 and C = 1.
Let's design a grid search around a C value of 1. We might see a small trend of decreasing RMSE with increasing C, so let’s try all integer C values between 1 and 10. Another parameter that caret lets us tune is the sigma parameter. This is a smoothing parameter. Good sigma values are often start around 0.1, so we will try numbers before and after.
# Tune the parameters of a model.
# tune SVM sigma and C parametres
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1))
fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox"), trControl=trainControl)
print(fit.svm)
plot(fit.svm)
Output Result:
Algorithm Tuning Results for SVM on the Boston House Price Dataset.
We can see that the sigma values flatten out with larger C cost constraints. It looks like we might do well with a sigma of 0.05 and a C of 10. This gives us a respectable RMSE of 2.977085.
# Output of tuning the parameters of a model.
Support Vector Machines with Radial Basis Function Kernel
407 samples
13 predictor
Pre-processing: Box-Cox transformation (11)
Resampling: Cross-Validated (10 fold, repeated 3 times)
19.7. Ensemble Methods 174
Summary of sample sizes: 366, 367, 366, 366, 367, 367, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared RMSE SD Rsquared SD
0.025 1 3.889703 0.8335201 1.4904294 0.11313650
0.025 2 3.685009 0.8470869 1.4126374 0.10919207
0.025 3 3.562851 0.8553298 1.3664097 0.10658536
0.025 4 3.453041 0.8628558 1.3167032 0.10282201
0.025 5 3.372501 0.8686287 1.2700128 0.09837303
------
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.05 and C = 10.
If we wanted to take this further, we could try even more fine tuning with more grid searches. We could also explore trying to tune other parameters of the underlying ksvm() function. Finally and as already mentioned, we could perform some grid searches on the other non-linear regression methods.
Ensemble Methods
We can try some ensemble methods on the problem and see if we can get a further decrease in our RMSE. In this section we will look at some boosting and bagging techniques for decision trees. Additional approaches you could look into would be blending the predictions of multiple well performing models together, called stacking. Let's take a look at the following ensemble methods:
Random Forest, bagging (RF).
Gradient Boosting Machines boosting (GBM).
Cubist, boosting (CUBIST).
# Estimate accuracy of ensemble methods.
# try ensembles
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# Random Forest
set.seed(seed)
fit.rf <- train(medv~., data=dataset, method="rf", metric=metric, preProc=c("BoxCox"), trControl=trainControl)
# Stochastic Gradient Boosting set.seed(seed)
fit.gbm <- train(medv~., data=dataset, method="gbm", metric=metric, preProc=c("BoxCox"), trControl=trainControl, verbose=FALSE)
# Cubist set.seed(seed)
fit.cubist <- train(medv~., data=dataset, method="cubist", metric=metric, preProc=c("BoxCox"), trControl=trainControl)
# Compare algorithms
ensembleResults <- resamples(list(RF=fit.rf, GBM=fit.gbm, CUBIST=fit.cubist))
summary(ensembleResults)
dotplot(ensembleResults)
Output:
Ensemble Methods on the Boston House Price Dataset.
Get R Programming Project help, R Programming Homework Help with an affordable price. Send your requirement details at realcode4you@gmail.com and get instant help with an affordable prices.
コメント