This machine learning algorithm is used to reduce the dimensionality of a feature set. When your data has many columns (features), you can reduce its dimensionality so that machine learning algorithms run faster and take less time to execute. Dimensionality reduction has several other benefits, listed below:
Space required to store the data is reduced as the number of dimensions comes down
Fewer dimensions lead to less computation/training time
Some algorithms do not perform well when the data has a large number of dimensions, so the dimensions need to be reduced for the algorithm to be useful
It takes care of multicollinearity by removing redundant features.
It helps in visualizing data. As discussed earlier, it is very difficult to visualize data in higher dimensions, so reducing the space to 2D or 3D makes the data easier to plot.
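The points above about redundant features and 2D visualization can be illustrated with a minimal sketch. It assumes a small synthetic dataset (the array `X` and its construction are hypothetical, not part of the dataset used in this article) in which three of the five columns are near copies or sums of the first two. PCA should then compress the data to two uncorrelated components while keeping almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# hypothetical data: 200 rows, 2 independent columns
base = rng.normal(size=(200, 2))

# 3 redundant columns built from the first two, plus a little noise
noise = 0.01 * rng.normal(size=(200, 3))
redundant = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]]) + noise

# full feature set: 5 columns, strongly collinear
X = np.hstack([base, redundant])

# project the 5 columns onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# correlation between the two new components
corr = np.corrcoef(X_2d, rowvar=False)

print('Original shape :', X.shape)
print('Reduced shape  :', X_2d.shape)
print('Correlation between components :', corr[0, 1])
print('Share of variance retained :', pca.explained_variance_ratio_.sum())
```

The two columns of `X_2d` could be passed directly to a scatter plot, which is what makes the 2D reduction useful for visualization.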
# importing required libraries
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# view the top 3 rows of the dataset
print(train_data.head(3))
# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)
# separate the independent and target variable on training data
# target variable - Income
x_train = train_data.drop(columns=['Income'])
y_train = train_data['Income']
# separate the independent and target variable on testing data
x_test = test_data.drop(columns=['Income'])
y_test = test_data['Income']
print('\nTraining model with {} dimensions.'.format(x_train.shape[1]))
# create object of model
model = LinearRegression()
# fit the model with the training data
model.fit(x_train,y_train)
# predict the target on the train dataset
predict_train = model.predict(x_train)
# RMSE (root mean squared error) on train dataset
rmse_train = mean_squared_error(y_train,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)
# predict the target on the test dataset
predict_test = model.predict(x_test)
# RMSE (root mean squared error) on test dataset
rmse_test = mean_squared_error(y_test,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)
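# create the PCA object that reduces the data to 12 dimensions
model_pca = PCA(n_components=12)
# fit PCA on the training data only, then apply the same projection to the test data
new_train = model_pca.fit_transform(x_train)
new_test = model_pca.transform(x_test)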
print('\nTraining model with {} dimensions.'.format(new_train.shape[1]))
# create object of model
model_new = LinearRegression()
# fit the model with the training data
model_new.fit(new_train,y_train)
# predict the target on the new train dataset
predict_train_pca = model_new.predict(new_train)
# RMSE on new train dataset
rmse_train_pca = mean_squared_error(y_train,predict_train_pca)**(0.5)
print('\nRMSE on new train dataset : ', rmse_train_pca)
# predict the target on the new test dataset
predict_test_pca = model_new.predict(new_test)
# RMSE on new test dataset
rmse_test_pca = mean_squared_error(y_test,predict_test_pca)**(0.5)
print('\nRMSE on new test dataset : ', rmse_test_pca)
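The script above fixes the number of components at 12. As a rough sketch of one common alternative, the number of components can be chosen from the share of variance to keep. Everything specific here is an assumption for illustration: the synthetic data stands in for `train.csv` and `test.csv`, the 0.95 variance threshold is arbitrary, and the `StandardScaler` step is an addition that the script above does not use. Passing a float between 0 and 1 as `n_components` with `svd_solver='full'` asks scikit-learn's `PCA` to keep as many components as needed to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# hypothetical data: 20 observed columns driven by 5 hidden factors
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))
y = latent @ rng.normal(size=5) + 0.1 * rng.normal(size=500)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scale, keep enough components for 95% of the variance, then regress
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver='full'),
    LinearRegression(),
)

# every step is fitted on the training data only
pipe.fit(x_train, y_train)

# number of components PCA actually kept
n_kept = pipe.named_steps['pca'].n_components_
print('Components kept :', n_kept, 'out of', x_train.shape[1])

# the pipeline applies the fitted scaler and PCA to the test data before predicting
rmse_test = mean_squared_error(y_test, pipe.predict(x_test)) ** 0.5
print('RMSE on test dataset :', rmse_test)
```

Wrapping the steps in a pipeline means the scaler and PCA are fitted once on the training data and reused unchanged on the test data, which avoids refitting the projection on the test set.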