
Analyze Positive And Negative Review Of Amazon Dataset Using Python Machine Learning

The multilayer perceptron (MLP) has a wide range of classification and regression applications in many fields: pattern recognition, speech processing, and general classification problems. The choice of architecture, however, has a great impact on how well these networks converge. In this article we apply the MLP to Amazon review data: we build a sentiment model and train it on Amazon reviews.


Introduction

Here we will analyze positive and negative reviews from an Amazon dataset and evaluate the accuracy of the model on the training and test data.



Data preparation



#importing Libraries
import gzip
import itertools
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

%matplotlib inline

Reviews into a Pandas DataFrame

Here we first parse the dataset, which is provided in gzip format, with the parse_gz() method, and then convert it into a DataFrame with the convert_to_DataFrame() method.

def parse_gz(file_path):
    # read the gzipped review file line by line;
    # each line is a Python dict literal, so eval() turns it into a dict
    g = gzip.open(file_path, 'rb')
    for l in g:
        yield eval(l)

def convert_to_DataFrame(file_path):
    # collect every parsed review into a dict keyed by row number,
    # then build a DataFrame from it
    i = 0
    df = {}
    for d in parse_gz(file_path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
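
The eval() call above works because each line of this file is a Python dict literal. If the lines in your copy are strict JSON (as in some releases of this dataset), json.loads is a safer drop-in for eval; this is a sketch under that assumption:

import gzip
import json

def parse_gz_json(file_path):
    # assumes each line is a strict JSON object; avoids eval() on untrusted input
    with gzip.open(file_path, 'rb') as g:
        for line in g:
            yield json.loads(line)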

Loading Data

We are going to classify Amazon product reviews to understand whether a review is positive or negative. Amazon has star ratings (1 star, 2 stars, etc.), which are given in the overall column. We will use that column to compare against our predictions.



# passing the file path or name
sports_data = convert_to_DataFrame('reviews_Sports_and_Outdoors_5.json.gz')
# checking the size of the dataset in rows
print('Dataset review size: {:,} rows'.format(len(sports_data)))
# selecting the first three records
sports_data[:3]

Output:


sports_data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296337 entries, 0 to 296336
Data columns (total 9 columns):
reviewerID        296337 non-null object
asin              296337 non-null object
reviewerName      294935 non-null object
helpful           296337 non-null object
reviewText        296337 non-null object
overall           296337 non-null float64
summary           296337 non-null object
unixReviewTime    296337 non-null int64
reviewTime        296337 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 14.7+ MB

sports_data.describe()

Output:













# displaying the shape of the data
sports_data.shape

Output:

(296337, 9)


Reformat reviewTime from its raw string form into a datetime.

sports_data["reviewTime"] = pd.to_datetime(sports_data["reviewTime"])

Choosing the fields (columns) we want to keep, in order:

sports_data = sports_data[['asin', 'summary', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful', 'reviewTime',
      'unixReviewTime']]

View the top three records:

sports_data.head(3)

Output:


View the bottom three records of the DataFrame:

sports_data.tail(3)

Output:


Number of Reviews by Unique Products


products = sports_data['overall'].groupby(sports_data['asin']).count()
print("Number of Unique Products in the Sports Category = {}".format(products.count()))

Output:

Number of Unique Products in the Sports Category = 18357
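
If you also want to see which products attract the most reviews, the same grouped counts can be sorted; this is a small illustrative addition, not part of the original analysis:

# five products (asin) with the largest number of reviews (illustrative)
print(products.sort_values(ascending=False).head(5))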


Modeling

sports_data[:3]

Output:


Insert a review_in_float column for sentiment modeling

Here we treat a review as negative for an overall rating of 1-3 and as positive for a rating of 4-5, as sketched below:

     Negative reviews:      1-3 stars  = 0
     Positive reviews:      4-5 stars  = 1
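
This mapping is not applied in the code below, which keeps the raw overall rating as the target; a minimal sketch of the labeling, assuming the review_in_float column name from the heading above, would be:

# 4-5 stars -> 1 (positive), 1-3 stars -> 0 (negative)
sports_data["review_in_float"] = (sports_data["overall"] >= 4).astype(int)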

review_text = sports_data["reviewText"]

Train/test split over the overall column, which is used as the target of the review analysis.

We build a sentiment classifier to identify whether a review has positive or negative sentiment. The MLP classifier model will use the review words (reviewText column) and ratings (overall column) from the training data to develop a model that predicts the target (overall).


x_train, x_test, y_train, y_test = train_test_split(sports_data.reviewText, sports_data.overall, random_state=0)
print("x_train shape: {}".format(x_train.shape), end='\n')
print("y_train shape: {}".format(y_train.shape), end='\n\n')
print("x_test shape: {}".format(x_test.shape), end='\n')
print("y_test shape: {}".format(y_test.shape), end='\n\n')

Output:

x_train shape: (222252,)
y_train shape: (222252,)

x_test shape: (74085,)
y_test shape: (74085,)



The text data has to be converted into numeric values, because the model can only read numeric input.

Here we select only the first 500 reviews because of memory constraints; if your system has enough memory, you can fit the full train and test data into the model.

data = x_train[:500]
data1 = x_test[:500]
test_y = y_test[:500]

train_y = y_train[:500]


Here we use CountVectorizer() because it changes the data from string format into an integer count matrix, so that it can be fed to the model.


cv = CountVectorizer()
X_traincv = cv.fit_transform(data)   # learn the vocabulary from the training reviews and vectorize them
X_testcv = cv.transform(data1)       # vectorize the test reviews with the same vocabulary
feature_names1 = cv.get_feature_names()   # on scikit-learn >= 1.0 use cv.get_feature_names_out()
print("Number of features: {}".format(len(feature_names1)))

Output:

Number of features: 5479
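
As a small illustration of what CountVectorizer produces, here is a toy corpus (these sentences are made up and are not part of the dataset):

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(["good product", "bad product", "very good"])
print(toy_cv.get_feature_names())   # ['bad', 'good', 'product', 'very']
print(toy_matrix.toarray())
# [[0 1 1 0]
#  [1 0 1 0]
#  [0 1 0 1]]

Each row is one sentence and each column counts one vocabulary word.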

MLP deep neural network


Now we will fit the training data to the MLPClassifier and use it to predict the scores.


# Training the model
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

mlp = MLPClassifier()
mlp.fit(X_traincv, train_y)   # train on the vectorized 500-review subset

Output:

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
# predict the target on the train dataset
pred_train = mlp.predict(X_traincv)
pred_train

Output:


















# Accuracy score on the train dataset
accur_train = accuracy_score(train_y, pred_train)
print('accuracy_score on train dataset : ', accur_train)

# Predictions and evaluation on the test dataset
predictions = mlp.predict(X_testcv)
predictions

# confusion matrix comparing the true and predicted labels
cnf = confusion_matrix(test_y, predictions)
cnf
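
For a single summary number on the held-out reviews as well, the same accuracy_score helper can be applied to the test predictions (a small addition, not shown in the original run):

# accuracy on the 500-review test subset, using the predictions computed above
accur_test = accuracy_score(test_y, predictions)
print('accuracy_score on test dataset : ', accur_test)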

Result with score and accuracy

#result with score and accuracy
print(classification_report(test_y,predictions))

Output:


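The run above uses scikit-learn's default MLP configuration (a single hidden layer of 100 units, at most 200 iterations). Since the introduction notes that the architecture choice affects convergence, one possible experiment, with illustrative rather than tuned values, is to change the hidden layers and iteration budget:

# illustrative alternative architecture; these values are assumptions, not tuned results
mlp2 = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp2.fit(X_traincv, train_y)
print(classification_report(test_y, mlp2.predict(X_testcv)))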