Boston Housing Price Prediction Report Using Sequential Model

Introduction

The problem we address here is the following: given a set of features that describe a house in Boston, our machine learning model must predict the house price. In this report, we predict Boston housing prices using a deep learning model implemented with TensorFlow, trained on the Boston Housing dataset loaded from a CSV file. We'll go through the following steps:

Data Loading
Data Exploration
Train/Test Split
Data Preprocessing
Model Building
Model Training
Model Evaluation

# Importing required modules >>

# For mathematical computation >>
import numpy as np
# For dealing with Dataframe >>
import pandas as pd
# For Data Visualization >>
import matplotlib.pyplot as plt
import seaborn as sns
# Data processing >>
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# TensorFlow ( Modeling ) >>
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

#1. Data Loading

We begin by importing the Boston Housing dataset from a CSV file using the Pandas library. This dataset contains various features related to housing in Boston, with the goal of predicting housing prices.

# Import the dataset >>
df = pd.read_csv('./BostonHousing.csv')

#2. Data Exploration

Let's take a glimpse at the dataset to understand its structure and features by displaying its first few rows.

df.head()

#Dataset Information

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s (the target variable)

# Additional Information >>
print("\nAdditional Information:")

print("[$] Total Rows :",df.shape[0])
print("[$] Total Columns :",df.shape[1])

[$] Total Rows : 506
[$] Total Columns : 14

#Dataset Columns & Datatypes

# Check for missing values and data types >>
print("\nInformation about the dataset:")
df.info()

Output:

Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB

# Identify the number of unique values in each column >>
df.nunique()

crim       504
zn          26
indus       76
chas         2
nox         81
rm         446
age        356
dis        412
rad          9
tax         66
ptratio     46
b          357
lstat      455
medv       229
dtype: int64

Let's create a bar plot showing the number of unique values in each column using Seaborn.

# Visualizing the number of unique values in each column >>

plt.figure(figsize=(10, 5))
sns.barplot(x=df.columns, y=df.nunique())
plt.title('Number of Unique Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Unique Values')
plt.xticks(rotation=45, ha='right')
plt.show()

#Check Null Values

df.isnull().sum()

crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
b          0
lstat      0
medv       0
dtype: int64

#Describe the Dataset

# [$] Get statistical description of the dataset >>
df.describe()

#Heatmap

Let's create a heatmap visualization of the correlation between features using Seaborn.

df.corr()

# the heatmap of correlation between features >>
plt.figure(figsize=(10,7))
sns.heatmap(df.corr(), cbar=True, square=True, fmt='.1f', annot=True)
plt.show()
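To make the numbers in the heatmap easier to read, here is a minimal sketch of the Pearson correlation coefficient, which is what df.corr() computes by default. The helper name pearson_demo and the toy values are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def pearson_demo(x, y):
    """Hypothetical helper: Pearson correlation between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Made-up values: price rises perfectly linearly with the number of rooms.
rooms = [4.0, 5.0, 6.0, 7.0]
price = [15.0, 20.0, 25.0, 30.0]

r_pos = pearson_demo(rooms, price)                # approximately +1
r_neg = pearson_demo(rooms, [-p for p in price])  # approximately -1
print(r_pos, r_neg)
```

Values near +1 or -1 in the heatmap indicate a strong linear relationship, while values near 0 indicate little linear relationship.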

#3. Train/Test Split

Features (X): The variable X represents the independent variables or features of our dataset, excluding the target variable medv.
Target Variable (y): The variable y represents the dependent variable or the target variable we want to predict, which is medv in this case.

# Importing library >>
from sklearn.model_selection import train_test_split

# Separate X (features) and y (target) >>
X = df.drop(['medv'],axis=1)
y = df['medv']

train_test_split

The train_test_split function is used to randomly split the dataset into training and testing sets. The test_size parameter specifies the proportion of the dataset to include in the test split (in this case, 20% for testing). The random_state parameter ensures reproducibility by fixing the random seed for the split. This separation allows us to train the model on one subset (training set) and evaluate its performance on another independent subset (testing set). It helps us assess how well the model generalizes to new, unseen data, and it is a crucial step in the machine learning workflow.

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=101)

#Shape Of Data

print("X Train Shape",X_train.shape)
print("X Test Shape",X_test.shape)
print("Y Train Shape",y_train.shape)
print("Y Test Shape",y_test.shape)

Output:

X Train Shape (404, 13)
X Test Shape (102, 13)
Y Train Shape (404,)
Y Test Shape (102,)
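The shapes above follow from simple arithmetic on the 506 rows. The sketch below reproduces the split sizes and shows the idea of a seeded shuffle. It is only an illustration: the NumPy permutation is an assumption standing in for the shuffling and will not select the same rows as train_test_split.

```python
import math
import numpy as np

n_samples = 506   # rows in the dataset, as reported by df.shape
test_size = 0.2   # same proportion passed to train_test_split

# The test share is rounded up; here the remaining rows go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# A seeded permutation stands in for the random shuffle (illustrative only).
rng = np.random.default_rng(101)
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(n_train, n_test)  # 404 102
```

Because the generator is seeded, rerunning the sketch yields the same index split each time, which is the role random_state plays in the real call.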

#4. Data Preprocessing

Standardizing the Features

Standardization is an essential preprocessing step to ensure that all features have a similar scale. This matters for many machine learning models, particularly those trained with gradient-based optimization (such as neural networks) and those that rely on distance calculations.

# Standardize the features using StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

StandardScaler:

The StandardScaler is a preprocessing step that transforms the features by removing the mean and scaling to unit variance. It ensures that the distribution of each feature has a mean value of 0 and a standard deviation of 1.

Fit and Transform (Training Set):

The fit_transform method is applied to the training set (X_train), which calculates the mean and standard deviation of each feature and then transforms the data accordingly.

Transform (Testing Set):

The same scaler is then used to transform the testing set (X_test). It's important to use the same mean and standard deviation values calculated from the training set to maintain consistency.
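The fit/transform distinction can be sketched in plain NumPy. This is an illustration under the assumption of tiny made-up matrices (X_train_demo, X_test_demo); it mirrors what the scaler does conceptually rather than reproducing scikit-learn's implementation.

```python
import numpy as np

# Tiny made-up feature matrices (rows = samples, columns = features).
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[4.0, 400.0]])

# "fit": learn per-column statistics from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets.
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print(X_train_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_demo_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_demo_scaled)                # test values are not guaranteed to be centred
```

Note that the test rows are scaled with the training mean and standard deviation, so the scaled test set will generally not have mean 0 and standard deviation 1 itself.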

#5. Model Training

Building the Sequential Model

The core of our predictive model is constructed using a Sequential model from TensorFlow's Keras API. This model consists of densely connected layers and dropout layers to prevent overfitting.

# Build the Sequential model
model = Sequential()

# Add Dense layers with dropout
model.add(Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))

Sequential Model:

The Sequential model is a linear stack of layers, where you can simply add one layer at a time. It's a common choice for building neural networks layer by layer.

Dense Layers:

Dense layers are fully connected layers, where each neuron in one layer connects to every neuron in the next layer. The first Dense layer has 64 neurons with the rectified linear unit (ReLU) activation function and is specified with an input shape matching the number of features in our dataset.

Dropout Layers:

Dropout layers are added to prevent overfitting. They randomly drop a fraction of input units during training, which helps in generalizing the model to new, unseen data.

Output Layer:

The final Dense layer has 1 neuron with a linear activation function. This is suitable for regression problems, where we aim to predict a continuous output (in this case, the median value of owner-occupied homes).
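To illustrate what a dropout layer does, here is a minimal NumPy sketch of "inverted" dropout: during training a random fraction of units is zeroed and the survivors are scaled up by 1 / (1 - rate), while at inference the input passes through unchanged. The function dropout_demo is a hypothetical helper written for this report, not the Keras implementation.

```python
import numpy as np

def dropout_demo(activations, rate, training, rng):
    """Hypothetical helper sketching inverted dropout."""
    if not training:
        return activations  # inference: the layer does nothing
    keep_mask = rng.random(activations.shape) >= rate  # True for units that survive
    return activations * keep_mask / (1.0 - rate)      # rescale the surviving units

rng = np.random.default_rng(0)
acts = np.ones((1000, 64))  # stand-in for the output of the 64-unit Dense layer

train_out = dropout_demo(acts, rate=0.5, training=True, rng=rng)
infer_out = dropout_demo(acts, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))  # only zeros and rescaled ones remain
print(train_out.mean())      # close to 1: the expected activation is preserved
```

With a rate of 0.5, roughly half of the activations are zeroed on every training step, which is a fairly strong amount of regularization for a dataset of this size.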

#Summary Of Model

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                896       
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 3009 (11.75 KB)
Trainable params: 3009 (11.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
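The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; dense_params is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 13 input features, then the three Dense layers of the model above.
layer_sizes = [13, 64, 32, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)  # [896, 2080, 33] 3009
```

Dropout layers contribute no parameters, which is why they show 0 in the summary.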

#Compiling the Model

After constructing the neural network architecture, the next step is to compile the model. Compiling involves configuring the learning process, specifying the optimizer and loss function, among other parameters.

# Compile the model >>
model.compile(optimizer='adam', loss='mean_squared_error')

Optimizer:

The optimizer is a crucial component that adjusts the weights of the neural network during training. Here, we use the 'adam' optimizer, which is an adaptive learning rate optimization algorithm known for its efficiency and effectiveness.

Loss Function:

For regression tasks, where the goal is to predict a continuous value, the mean squared error (MSE) is a commonly used loss function. The MSE measures the average squared difference between the predicted values and the true values.
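As a rough illustration of how Adam adapts its step size, the sketch below applies the Adam update rule to a single weight on a toy squared-error loss, (w - 3)^2. The toy loss, the starting point, and the learning rate of 0.1 are assumptions chosen so the example converges quickly; they are not the settings used when training the model above.

```python
import math

def grad(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0                                   # arbitrary starting weight
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = 0.0                                   # running mean of gradients
v = 0.0                                   # running mean of squared gradients

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(w)  # ends close to the minimiser w = 3
```

The division by the square root of v_hat is what makes the step size adaptive: weights with consistently large gradients take proportionally smaller steps.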

#Training the Model

With the model compiled, we can now proceed to train it on the prepared training dataset. Training involves iteratively adjusting the weights of the neural network based on the provided input features and target values.

# Train the model >>
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=2)

Epoch 1/50
11/11 - 2s - loss: 593.7202 - val_loss: 503.8979 - 2s/epoch - 150ms/step
Epoch 2/50
11/11 - 0s - loss: 569.7773 - val_loss: 484.8472 - 95ms/epoch - 9ms/step
Epoch 3/50
11/11 - 0s - loss: 551.3031 - val_loss: 466.6956 - 161ms/epoch - 15ms/step
Epoch 4/50
11/11 - 0s - loss: 529.8636 - val_loss: 446.2093 - 112ms/epoch - 10ms/step
Epoch 5/50
11/11 - 0s - loss: 512.2217 - val_loss: 421.4626 - 121ms/epoch - 11ms/step
Epoch 6/50
11/11 - 0s - loss: 474.4426 - val_loss: 390.3333 - 171ms/epoch - 16ms/step
Epoch 7/50
11/11 - 0s - loss: 435.3330 - val_loss: 350.8893 - 145ms/epoch - 13ms/step
Epoch 8/50
11/11 - 0s - loss: 395.2251 - val_loss: 303.3921 - 126ms/epoch - 11ms/step
Epoch 9/50
11/11 - 0s - loss: 338.3850 - val_loss: 249.2575 - 180ms/epoch - 16ms/step
Epoch 10/50
11/11 - 0s - loss: 274.9074 - val_loss: 192.7155 - 178ms/epoch - 16ms/step
Epoch 11/50
11/11 - 0s - loss: 235.8356 - val_loss: 141.8684 - 140ms/epoch - 13ms/step
Epoch 12/50
11/11 - 0s - loss: 191.6234 - val_loss: 100.9594 - 119ms/epoch - 11ms/step
Epoch 13/50
11/11 - 0s - loss: 149.4445 - val_loss: 71.9506 - 111ms/epoch - 10ms/step
Epoch 14/50
11/11 - 0s - loss: 120.7500 - val_loss: 57.8792 - 144ms/epoch - 13ms/step
Epoch 15/50
11/11 - 0s - loss: 115.0784 - val_loss: 48.9722 - 183ms/epoch - 17ms/step
Epoch 16/50
11/11 - 0s - loss: 110.4878 - val_loss: 45.5624 - 189ms/epoch - 17ms/step
Epoch 17/50
11/11 - 0s - loss: 101.6381 - val_loss: 41.9764 - 183ms/epoch - 17ms/step
Epoch 18/50
11/11 - 0s - loss: 96.8864 - val_loss: 39.8332 - 250ms/epoch - 23ms/step
Epoch 19/50
11/11 - 0s - loss: 91.9704 - val_loss: 38.9673 - 186ms/epoch - 17ms/step
Epoch 20/50
11/11 - 0s - loss: 92.7428 - val_loss: 38.1128 - 193ms/epoch - 18ms/step
Epoch 21/50
11/11 - 0s - loss: 92.5336 - val_loss: 37.6970 - 164ms/epoch - 15ms/step
Epoch 22/50
11/11 - 0s - loss: 80.9204 - val_loss: 36.4323 - 140ms/epoch - 13ms/step
Epoch 23/50
11/11 - 0s - loss: 77.0351 - val_loss: 35.4173 - 128ms/epoch - 12ms/step
Epoch 24/50
11/11 - 0s - loss: 80.5209 - val_loss: 32.7770 - 175ms/epoch - 16ms/step
Epoch 25/50
11/11 - 0s - loss: 74.1207 - val_loss: 32.6497 - 179ms/epoch - 16ms/step
Epoch 26/50
11/11 - 0s - loss: 78.2702 - val_loss: 33.9887 - 143ms/epoch - 13ms/step
Epoch 27/50
11/11 - 0s - loss: 78.0483 - val_loss: 36.4695 - 93ms/epoch - 8ms/step
Epoch 28/50
11/11 - 0s - loss: 66.7863 - val_loss: 35.1577 - 84ms/epoch - 8ms/step
Epoch 29/50
11/11 - 0s - loss: 68.1533 - val_loss: 33.2069 - 155ms/epoch - 14ms/step
Epoch 30/50
11/11 - 0s - loss: 77.8594 - val_loss: 32.6738 - 134ms/epoch - 12ms/step
Epoch 31/50
11/11 - 0s - loss: 73.1195 - val_loss: 32.1478 - 129ms/epoch - 12ms/step
Epoch 32/50
11/11 - 0s - loss: 75.1272 - val_loss: 31.6717 - 136ms/epoch - 12ms/step
Epoch 33/50
11/11 - 0s - loss: 73.7986 - val_loss: 31.4472 - 110ms/epoch - 10ms/step
Epoch 34/50
11/11 - 0s - loss: 74.1163 - val_loss: 31.4819 - 154ms/epoch - 14ms/step
Epoch 35/50
11/11 - 0s - loss: 76.6583 - val_loss: 32.7706 - 112ms/epoch - 10ms/step
Epoch 36/50
11/11 - 0s - loss: 77.2155 - val_loss: 32.3915 - 204ms/epoch - 19ms/step
Epoch 37/50
11/11 - 0s - loss: 70.1360 - val_loss: 32.2399 - 146ms/epoch - 13ms/step
Epoch 38/50
11/11 - 0s - loss: 77.0289 - val_loss: 30.9727 - 65ms/epoch - 6ms/step
Epoch 39/50
11/11 - 0s - loss: 65.1568 - val_loss: 30.0533 - 46ms/epoch - 4ms/step
Epoch 40/50
11/11 - 0s - loss: 69.7074 - val_loss: 30.1188 - 65ms/epoch - 6ms/step
Epoch 41/50
11/11 - 0s - loss: 66.0095 - val_loss: 29.5351 - 93ms/epoch - 8ms/step
Epoch 42/50
11/11 - 0s - loss: 66.9036 - val_loss: 28.2009 - 49ms/epoch - 4ms/step
Epoch 43/50
11/11 - 0s - loss: 63.0430 - val_loss: 28.8434 - 50ms/epoch - 5ms/step
Epoch 44/50
11/11 - 0s - loss: 69.8645 - val_loss: 28.8378 - 47ms/epoch - 4ms/step
Epoch 45/50
11/11 - 0s - loss: 75.1537 - val_loss: 28.2755 - 67ms/epoch - 6ms/step
Epoch 46/50
11/11 - 0s - loss: 55.5705 - val_loss: 26.8995 - 86ms/epoch - 8ms/step
Epoch 47/50
11/11 - 0s - loss: 66.0437 - val_loss: 26.4634 - 84ms/epoch - 8ms/step
Epoch 48/50
11/11 - 0s - loss: 71.3588 - val_loss: 25.7471 - 65ms/epoch - 6ms/step
Epoch 49/50
11/11 - 0s - loss: 63.4523 - val_loss: 26.4709 - 48ms/epoch - 4ms/step
Epoch 50/50
11/11 - 0s - loss: 62.2586 - val_loss: 26.8478 - 47ms/epoch - 4ms/step

Training Process:

The fit method is used to train the model. It takes the scaled training features (X_train_scaled) and corresponding target values (y_train).

Epochs:

The epochs parameter specifies the number of times the entire training dataset is passed through the neural network. One epoch represents one complete cycle through the dataset.

Batch Size:

The batch_size parameter determines the number of samples processed before each update of the model's weights. Smaller batches use less memory and produce more frequent, noisier updates.

Validation Split:

The validation_split parameter (here set to 0.2) holds out the last 20% of the training data for validation. The model is not trained on these samples; their loss (val_loss) is reported after each epoch, which allows monitoring performance on unseen data and detecting overfitting.

Verbose:

The verbose parameter controls the amount of information printed during training. Setting it to 2 prints one summary line per epoch, without the per-batch progress bar.
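The "11/11" shown on each epoch line of the training log follows from these parameters. The sketch below redoes the arithmetic, assuming the held-out share is the part left after flooring the training share, which is consistent with the log above.

```python
import math

n_train_rows = 404        # rows in X_train_scaled
validation_split = 0.2
batch_size = 32

# Rows actually used for weight updates, and rows held out for val_loss.
n_fit = math.floor(n_train_rows * (1.0 - validation_split))
n_val = n_train_rows - n_fit

# Number of weight updates (batches) per epoch; the last batch is smaller.
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)  # 323 81 11
```

So each epoch performs 11 weight updates on 323 samples, and val_loss is computed on the remaining 81.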

#6. Model Evaluation

# Evaluate the model on the test set >>
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error on Test Set: {mse}')
print(f'R2 Score :{r2}')

4/4 [==============================] - 0s 4ms/step
Mean Squared Error on Test Set: 27.693457764356012
R2 Score :0.7531583707990598

Prediction on Test Set:

The predict method is used to obtain model predictions on the scaled test set (X_test_scaled).

Metrics Calculation:

The mean squared error (MSE) and R2 score are common regression metrics used to assess the model's accuracy.

Mean Squared Error (MSE):

It measures the average squared difference between the predicted values and the true values. Lower MSE values indicate better model performance.

R2 Score:

Also known as the coefficient of determination, it represents the proportion of variance in the target variable that can be explained by the model. Its maximum value is 1, and higher values indicate better performance; it can be negative when the model performs worse than always predicting the mean of the target.

Printed Results:

The results are printed to the console. Here, the model reaches an MSE of about 27.69 and an R2 score of about 0.75 on the test set.
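Both metrics are simple enough to compute by hand. The sketch below is a minimal NumPy version with made-up numbers; mse_demo and r2_demo are hypothetical helper names, and in the report itself the scikit-learn functions are used.

```python
import numpy as np

def mse_demo(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_demo(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true_demo = [3.0, 5.0, 7.0]
y_good = [2.0, 5.0, 9.0]   # roughly right
y_bad = [7.0, 5.0, 3.0]    # trend reversed

print(round(mse_demo(y_true_demo, y_good), 3))  # 1.667
print(r2_demo(y_true_demo, y_good))             # 0.375
print(r2_demo(y_true_demo, y_bad))              # -3.0
```

The last line shows that R2 drops below 0 when predictions are worse than simply predicting the mean of the targets.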

# Save the model if needed >>
model.save('boston_housing_model.h5')

/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py:3103: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
  saving_api.save_model(