top of page

Data Analysis Assignment Help In India | Analysis of Epinions Data | Realcode4you

We are provided with the Epinions data collected through a general consumer review web site Epinions.com. Each user on the site can rate and post a review for products (items) and other users would rate the review to show the helpfulness of the review. In this data, for each user, we have user ID, gender, ratings (1-5) and reviews towards the products they bought. For each rating, we have the product name and its category, the rating score, the time point when the rating is created, and the helpfulness (1-5) of this rating.


Originally, there are several separate files to record the user-item rating info, user demographic info and item descriptive info, we have conducted preliminary processing to map and merge the data into one csv file named 'Epinions_test_data.csv'. The header of the csv file is shown below.

Description of Fields

  • userId - the user's id

  • gender - the user's gender

  • item - the product's name

  • category - the category of the item

  • rating - the user's rating towards the item

  • review - the user's review towards the item

  • helpfulness - the helpfulness of this rating

  • timestamp - the timestamp when the rating is created


Note that, a same user (i.e., a same userId) corresponds to a same gender; a same item corresponds to a same category; a same user (i.e., a same userId) may rate multiple items; a same item may receive ratings and reviews from multiple users and each review is associted with a single helpfulness value calculated based on the feedbacks from other users.


Read Data

#Importing Pandas, Numpy & Matplotlibrary
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#Reading and display the data and its length before removing the missing data
ds = 'Epinions_test_data.csv'
df = pd.read_csv(ds)
display(df)
print(df.shape)

output:


#Displaying the information of the Epinion Data
df.info()

output:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 10000 entries, 0 to 9999

Data columns (total 8 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 userId 10000 non-null int64

1 gender 9973 non-null object

2 item 10000 non-null object

3 rating 9969 non-null float64

4 review 10000 non-null object

5 helpfulness 9974 non-null float64

6 timestamp 10000 non-null int64

7 category 10000 non-null object

dtypes: float64(2), int64(2), object(4)

memory usage: 625.1+ KB



Q1. Remove missing data

Please remove the following records in the csv file:

  • gender/rating/helpfulness is missing

  • review is 'none'

Display the DataFrame, counting number of Null values in each column, and print the length of the data before and after removing the missing data.

#Counting the number of missing data in each column and printing the same
df2 = df.isnull().sum()
print (df2)

output:

userId 0

gender 27

item 0

rating 31

review 0

helpfulness 26

timestamp 0

category 0

dtype: int64


#Dropping the null values of each column and printing the data after removing the null values
df_clean = df.dropna()

#Removing the columns which has review as 'none' and printing it 
df_clean1 = df_clean.drop(df_clean.index[df_clean["review"] == "none"])
df_clean1

output:


#Printing the length of the new data after removing the specified data
df_clean1.shape

output:

(9890, 8)


Q2. Descriptive statistics

With the cleaned data in Q1, please provide the data summarization as below:

  • Q2.1 total number of unique users, unique reviews, unique items, and unique categories

  • Q2.2 descriptive statistics, e.g., the total number, mean, std, min and max regarding all rating records

  • Q2.3 descriptive statistics, e.g., mean, std, max, and min of the number of items rated by different genders

  • Q2.4 descriptive statistics, e.g., mean, std, max, min of the number of ratings that received by each items


# Total no of Unique "UserId"
df_clean1["userId"].unique().size

output:

4326


# Total no of Unique "review"
df_clean1["review"].unique().size

output:

9727


# Total no of Unique "item"
df_clean1["item"].unique().size

output:

89


# Total no of Unique "category"
df_clean1["category"].unique().size

output:

9


#Q2.2 - Showing descriptive statistics of all the rating records
df_clean1["rating"].describe()

output:

count 9890.000000

mean 3.701517

std 1.387026

min 1.000000

25% 3.000000

50% 4.000000

75% 5.000000

max 5.000000

Name: rating, dtype: float64



#Q2.3 - Counting the number of items rated by different genders
df_clean1.groupby("gender").count()["item"]

output:

gender

F 5025

M 4865

Name: item, dtype: int64


#Displaying the number of items rated by different genders in a tabular form
df_gender_group = df_clean1.groupby("gender").count()["item"].reset_index(name="count")
df_gender_group

output:

gender count

0 F 5025

1 M 4865



#Displaying the descriptive statistics for the number of items rated by different genders
df_gender_group["count"].describe()

output:

count 2.000000

mean 4945.000000

std 113.137085

min 4865.000000

25% 4905.000000

50% 4945.000000

75% 4985.000000

max 5025.000000

Name: count, dtype: float64


#Q2.4 - Counting the ratings recieved by each individual item
df_clean1.groupby("item").count()["rating"]

output:

item

AOL (America Online) 446

All Advantage 306

Alta Vista (by 1stUp) 84

Amazon.com 110

American Airlines 92

...

Wendy's 88

What Lies Beneath 85

X-Men 121

X.com 93

eBay 295

Name: rating, Length: 89, dtype: int64


#Displaying the ratings recieved by each item in a tabular form
df_item_group = df_clean1.groupby("item").count()["rating"].reset_index(name = "count")
df_item_group

output:

item count

0 AOL (America Online) 446

1 All Advantage 306

2 Alta Vista (by 1stUp) 84

3 Amazon.com 110

4 American Airlines 92

... ... ...

84 Wendy's 88

85 What Lies Beneath 85

86 X-Men 121

87 X.com 93

88 eBay 295


89 rows × 2 columns



#Displaying the descriptive statistics of the ratings recieved by each individual item
df_item_group["count"].describe()

output:

count 89.000000

mean 111.123596

std 55.955709

min 60.000000

25% 82.000000

50% 92.000000

75% 121.000000

max 446.000000

Name: count, dtype: float64



Q3. Plotting and Analysis

Please try to explore the correlation between gender/helpfulness/category and ratings; for instance, do female/male users tend to provide higher ratings than male/female users? Hint: you may use the boxplot function to plot figures for comparison (Challenge) You may need to select the most suitable graphic forms for ease of presentation. Most importantly, for each figure or subfigure, please summarise what each plot shows (i.e. observations and explanations). Finally, you may need to provide an overall summary of the data.


#Printing the data which has been cleaned in Q1
df_clean1

output:


# Creating a box plot of ratings and  genders correlation
df_clean1.boxplot("rating", by = "gender")

output:


#Creating a box plot of helpfulness- rating correlation
df_clean1.boxplot("rating", by = "helpfulness")

output:


# Creating a box plot of ratings and  category correlation
df_clean1.boxplot("rating", by = "category")
plt.xticks(rotation = 45)

output:


Q4. Detect and remove outliers

We may define outlier users and items with three rules:

  • users whose helpfulness is no more than 2

  • users who rate less than 7 items

  • items that receives less than 11 ratings

Please remove the corresponding records in the csv file that involves outlier users and items. After that, print the length of the data.

# Droppiing the indexes which has helpfulness<=2 and displaying the altered result
df_helpfulness_clean = df_clean1.drop(df_clean1.index[df_clean1["helpfulness"]<=2])
df_helpfulness_clean

output:


#Group users by their ids and counting the number of their ratings
df_user = df_helpfulness_clean.groupby("userId").count()["item"].reset_index(name = "count")
df_user

output:

userId count

0 3 3

1 4 3

2 7 1

3 10 1

4 12 1

... ... ...

3662 15720 1

3663 15723 1

3664 15726 5

3665 15727 1

3666 15730 1


3667 rows × 2 column



#Finding the number of users who have rated less than 7 items
list_user_less7 = df_user[df_user["count"]<7]["userId"].tolist()
list_user_less7

output:

[3,
 4,
 7,
 10,
 12,
 16,
 17,
 18,
 22,
 25,
 ...
 ...


#Dropping the no of users who have rated less than 7 items
df_user_clean = df_helpfulness_clean.drop(df_helpfulness_clean.index[df_helpfulness_clean["userId"].isin(list_user_less7)])
df_user_clean

output:


#Counting the number of items by their ratings and displaying it in a tabular form
df_item = df_user_clean.groupby("item").count()["rating"].reset_index(name = "count")
df_item

output:

item count

0 AOL (America Online) 26

1 All Advantage 29

2 Alta Vista (by 1stUp) 3

3 Amazon.com 10

4 American Airlines 9

... ... ...

84 Wendy's 14

85 What Lies Beneath 12

86 X-Men 31

87 X.com 7

88 eBay 35


89 rows × 2 columns



#Displaying the items which have recieved less than 7 ratings
list_item_less11 = df_item[df_item["count"]<11]["item"].tolist()
list_item_less11

output:

['Alta Vista (by 1stUp)',

'Amazon.com',

'American Airlines',

'Any Given Sunday',

"Applebee's",

'Bank of America',

'BlueLight.com',

'Crouching Tiger, Hidden Dragon',

'Deep Blue Sea',

'Diablo 2 - Bestseller Series for Windows',

'Earthlink',

'Golden Eye 007 for Nintendo 64',

'Google',

'Huggies Ultra Trim Diapers',

'Luvs Ultra Leakguards Stretch',

'MTV',

'NetFlix',

'Northwest Airlines',

'Playtex Diaper Genie System',

'Priceline Flight Reservations',

'Quicken Loans',

'Road Runner (includes MediaOne)',

'Roller Coaster Tycoon For Windows',

'Sega Dreamcast Grey Console',

'Sony PlayStation 2 Slimline Console',

'Spedia',

'Teletubbies',

'The Legend Of Zelda Ocarina Of Time for Nintendo 64',

'The Original Furby',

'Treeloot.com',

'United Airlines',



#Dropping the items which have recieved less than 7 ratings
df_item_clean = df_user_clean.drop(df_user_clean.index[df_user_clean["item"].isin(list_item_less11)])
df_item_clean

output:


#Displaying the final size of the data after performing all the operations
df_item_clean.shape

output:

(1080, 8)

#exporting the data
df_item_clean.to_csv("clean_data.csv", index = False)

bottom of page