In this blog post we will learn how to analyze Amazon reviews using training and test data.
The multilayer perceptron (MLP) has a wide range of classification and regression applications in many fields, such as pattern recognition, voice recognition, and other classification problems. However, the choice of architecture has a great impact on the convergence of these networks. In this post we introduce a new approach to optimizing a model on the Amazon review data; to solve the resulting model we use the genetic algorithm, and we train it on the Amazon reviews.
# Introduction
Here we will analyze positive and negative reviews from the Amazon dataset and measure the accuracy on the training and test data.
# Part I - Data preparation
# Importing, reading, cleaning, and splitting the data
Data Source
Importing Libraries:
import gzip
import itertools
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
%matplotlib inline
Reviews into Pandas DataFrame
Here we first parse the dataset, which is provided in gzip format, using the parse_gz() function, and then convert it into a DataFrame using the convert_to_DataFrame() function.
Code to unzip the file:
It unzips the file and then converts it into a DataFrame.
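import ast

def parse_gz(file_path):
    # Read the gzip file line by line; each line holds one review as a dict literal
    with gzip.open(file_path, 'rb') as g:
        for l in g:
            # literal_eval parses the literal without executing arbitrary code, unlike eval
            yield ast.literal_eval(l.decode('utf-8'))

def convert_to_DataFrame(file_path):
    # Collect the parsed reviews keyed by row number, then build the DataFrame
    i = 0
    df = {}
    for d in parse_gz(file_path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')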
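To try the parsing step without downloading the full dataset, here is a minimal sketch that writes a tiny gzip file and reads it back. It assumes each line is strict JSON, so it uses json.loads rather than the parser above; the function name read_json_lines_gz, the file name, and the sample records are all made up for illustration.

```python
import gzip
import json
import os
import tempfile

def read_json_lines_gz(file_path):
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    # 'rt' opens the archive in text mode, so each line arrives as a str
    with gzip.open(file_path, 'rt', encoding='utf-8') as g:
        for line in g:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Hypothetical sample records; the field names follow the columns used in this post
sample = [
    {"reviewText": "Great tent, easy to set up", "overall": 5.0},
    {"reviewText": "Broke after one use", "overall": 1.0},
]

# Write a small throwaway .json.gz file, then read it back
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(tmp_path, 'wt', encoding='utf-8') as g:
    for record in sample:
        g.write(json.dumps(record) + "\n")

records = list(read_json_lines_gz(tmp_path))
print(len(records), records[0]["overall"])
```

The resulting list of dicts can be passed straight to pd.DataFrame(records) to get one row per review.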
We are going to classify Amazon product reviews as positive or negative. Each Amazon review carries a star rating (1 star, 2 stars, etc.), which is given in the overall column. We will use that rating to check our predictions.
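One possible way to turn the star rating into a positive/negative label is sketched below. The threshold is an assumption made for illustration (4 and 5 stars positive, 1 and 2 stars negative, 3 stars dropped as neutral), and the sentiment column name and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'reviewText' and 'overall' are the column names used in this post
reviews = pd.DataFrame({
    "reviewText": ["Love it", "Terrible", "It is okay", "Works well"],
    "overall": [5.0, 1.0, 3.0, 4.0],
})

# Assumption: 4 and 5 stars count as positive (1.0), 1 and 2 stars as negative (0.0).
# 3-star reviews are treated as neutral and dropped.
labelled = reviews[reviews["overall"] != 3.0].copy()
labelled["sentiment"] = (labelled["overall"] >= 4.0).astype(float)

print(labelled[["overall", "sentiment"]])
```

Dropping the neutral reviews keeps the task binary; keeping them would call for a third class or a different threshold.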
Split data:
x_train, x_test, y_train, y_test = train_test_split(sports_data.reviewText,sports_data.review_in_float, random_state=0)
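To see what the split does, here is a small sketch on made-up data. With no test_size given, train_test_split holds out 25% of the rows for testing, and a fixed random_state makes the shuffle reproducible. The texts and labels below are placeholders.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: eight fake reviews with alternating labels
texts = [f"review {i}" for i in range(8)]
labels = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

# Default split is 75% train / 25% test; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(texts, labels, random_state=0)

print(len(x_train), len(x_test))  # 6 2
```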
How to use CountVectorizer()
It converts the review text into a matrix of integer token counts.
cv = CountVectorizer()
X_traincv = cv.fit_transform(x_train)
X_testcv = cv.transform(x_test)
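A toy example may make the vectorizing step clearer. The sketch below fits CountVectorizer on three made-up sentences and then transforms an unseen one; the vocabulary is learned from the training text only, so a word that appears only in the test text (here "great") is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
train_docs = ["good product", "bad product", "good good value"]
test_docs = ["great product"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_docs)  # learn the vocabulary and count tokens
X_test = cv.transform(test_docs)        # reuse the learned vocabulary

# Vocabulary ordered by column index
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)              # ['bad', 'good', 'product', 'value']
print(X_train.toarray())  # [[0 1 1 0] [1 0 1 0] [0 2 0 1]]
print(X_test.toarray())   # [[0 0 1 0]]
```

This is why fit_transform is called on the training data and only transform on the test data: the test set must be encoded with the same columns the model was trained on.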
After this we fit the vectorized data to a model.
Here we use the MLP classifier.
## Import MLP classifier libraries
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report,confusion_matrix
# Training the model
mlp = MLPClassifier()
mlp.fit(X_traincv, y_train)
# Predict the target on the train dataset
pred_train = mlp.predict(X_traincv)
# Accuracy score on the train dataset
accur_train = accuracy_score(y_train,pred_train)
print('accuracy_score on train dataset : ', accur_train)
# Predict the target on the test dataset
predictions = mlp.predict(X_testcv)
# Confusion matrix comparing the true and predicted test labels
cnf = confusion_matrix(y_test,predictions)
# Result with score and accuracy
print(cnf)
print(classification_report(y_test,predictions))
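Putting the steps together, here is a minimal end-to-end sketch on a handful of made-up reviews. The texts, the labels, and the MLPClassifier settings (hidden_layer_sizes, max_iter, random_state) are assumptions chosen to keep the example small and repeatable, not tuned values, and the scores it prints say nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neural_network import MLPClassifier

# Made-up reviews: 1 = positive, 0 = negative
train_texts = [
    "great product works well", "love it very good", "excellent quality",
    "terrible broke quickly", "bad quality do not buy", "awful waste of money",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["good quality love it", "terrible waste"]
test_labels = [1, 0]

# Turn text into token counts, fitting the vocabulary on the training text only
cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)
X_test = cv.transform(test_texts)

# Small network; the settings are illustrative assumptions
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
mlp.fit(X_train, train_labels)

pred_train = mlp.predict(X_train)
pred_test = mlp.predict(X_test)

train_acc = accuracy_score(train_labels, pred_train)
test_acc = accuracy_score(test_labels, pred_test)
# Rows are true labels, columns are predicted labels, in the order given by labels=
cnf = confusion_matrix(test_labels, pred_test, labels=[0, 1])

print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
print(cnf)
```

Comparing the train and test accuracy is a quick check for overfitting: a large gap suggests the network has memorized the training reviews.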
