We are provided with the Epinions data collected through a general consumer review web site Epinions.com. Each user on the site can rate and post a review for products (items) and other users would rate the review to show the helpfulness of the review. In this data, for each user, we have user ID, gender, ratings (1-5) and reviews towards the products they bought. For each rating, we have the product name and its category, the rating score, the time point when the rating is created, and the helpfulness (1-5) of this rating.
Originally, there are several separate files to record the user-item rating info, user demographic info and item descriptive info, we have conducted preliminary processing to map and merge the data into one csv file named 'Epinions_test_data.csv'. The header of the csv file is shown below.
Description of Fields
userId - the user's id
gender - the user's gender
item - the product's name
category - the category of the item
rating - the user's rating towards the item
review - the user's review towards the item
helpfulness - the helpfulness of this rating
timestamp - the timestamp when the rating is created
Note that, a same user (i.e., a same userId) corresponds to a same gender; a same item corresponds to a same category; a same user (i.e., a same userId) may rate multiple items; a same item may receive ratings and reviews from multiple users and each review is associted with a single helpfulness value calculated based on the feedbacks from other users.
Read Data
#Importing Pandas, Numpy & Matplotlibrary
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#Reading and display the data and its length before removing the missing data
ds = 'Epinions_test_data.csv'
df = pd.read_csv(ds)
display(df)
print(df.shape)
output:
#Displaying the information of the Epinion Data
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 10000 non-null int64
1 gender 9973 non-null object
2 item 10000 non-null object
3 rating 9969 non-null float64
4 review 10000 non-null object
5 helpfulness 9974 non-null float64
6 timestamp 10000 non-null int64
7 category 10000 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 625.1+ KB
Q1. Remove missing data
Please remove the following records in the csv file:
gender/rating/helpfulness is missing
review is 'none'
Display the DataFrame, counting number of Null values in each column, and print the length of the data before and after removing the missing data.
#Counting the number of missing data in each column and printing the same
df2 = df.isnull().sum()
print (df2)
output:
userId 0
gender 27
item 0
rating 31
review 0
helpfulness 26
timestamp 0
category 0
dtype: int64
#Dropping the null values of each column and printing the data after removing the null values
df_clean = df.dropna()
#Removing the columns which has review as 'none' and printing it
df_clean1 = df_clean.drop(df_clean.index[df_clean["review"] == "none"])
df_clean1
output:
#Printing the length of the new data after removing the specified data
df_clean1.shape
output:
(9890, 8)
Q2. Descriptive statistics
With the cleaned data in Q1, please provide the data summarization as below:
Q2.1 total number of unique users, unique reviews, unique items, and unique categories
Q2.2 descriptive statistics, e.g., the total number, mean, std, min and max regarding all rating records
Q2.3 descriptive statistics, e.g., mean, std, max, and min of the number of items rated by different genders
Q2.4 descriptive statistics, e.g., mean, std, max, min of the number of ratings that received by each items
# Total no of Unique "UserId"
df_clean1["userId"].unique().size
output:
4326
# Total no of Unique "review"
df_clean1["review"].unique().size
output:
9727
# Total no of Unique "item"
df_clean1["item"].unique().size
output:
89
# Total no of Unique "category"
df_clean1["category"].unique().size
output:
9
#Q2.2 - Showing descriptive statistics of all the rating records
df_clean1["rating"].describe()
output:
count 9890.000000
mean 3.701517
std 1.387026
min 1.000000
25% 3.000000
50% 4.000000
75% 5.000000
max 5.000000
Name: rating, dtype: float64
#Q2.3 - Counting the number of items rated by different genders
df_clean1.groupby("gender").count()["item"]
output:
gender
F 5025
M 4865
Name: item, dtype: int64
#Displaying the number of items rated by different genders in a tabular form
df_gender_group = df_clean1.groupby("gender").count()["item"].reset_index(name="count")
df_gender_group
output:
gender count
0 F 5025
1 M 4865
#Displaying the descriptive statistics for the number of items rated by different genders
df_gender_group["count"].describe()
output:
count 2.000000
mean 4945.000000
std 113.137085
min 4865.000000
25% 4905.000000
50% 4945.000000
75% 4985.000000
max 5025.000000
Name: count, dtype: float64
#Q2.4 - Counting the ratings recieved by each individual item
df_clean1.groupby("item").count()["rating"]
output:
item
AOL (America Online) 446
All Advantage 306
Alta Vista (by 1stUp) 84
Amazon.com 110
American Airlines 92
...
Wendy's 88
What Lies Beneath 85
X-Men 121
X.com 93
eBay 295
Name: rating, Length: 89, dtype: int64
#Displaying the ratings recieved by each item in a tabular form
df_item_group = df_clean1.groupby("item").count()["rating"].reset_index(name = "count")
df_item_group
output:
item count
0 AOL (America Online) 446
1 All Advantage 306
2 Alta Vista (by 1stUp) 84
3 Amazon.com 110
4 American Airlines 92
... ... ...
84 Wendy's 88
85 What Lies Beneath 85
86 X-Men 121
87 X.com 93
88 eBay 295
89 rows × 2 columns
#Displaying the descriptive statistics of the ratings recieved by each individual item
df_item_group["count"].describe()
output:
count 89.000000
mean 111.123596
std 55.955709
min 60.000000
25% 82.000000
50% 92.000000
75% 121.000000
max 446.000000
Name: count, dtype: float64
Q3. Plotting and Analysis
Please try to explore the correlation between gender/helpfulness/category and ratings; for instance, do female/male users tend to provide higher ratings than male/female users? Hint: you may use the boxplot function to plot figures for comparison (Challenge) You may need to select the most suitable graphic forms for ease of presentation. Most importantly, for each figure or subfigure, please summarise what each plot shows (i.e. observations and explanations). Finally, you may need to provide an overall summary of the data.
#Printing the data which has been cleaned in Q1
df_clean1
output:
# Creating a box plot of ratings and genders correlation
df_clean1.boxplot("rating", by = "gender")
output:
#Creating a box plot of helpfulness- rating correlation
df_clean1.boxplot("rating", by = "helpfulness")
output:
# Creating a box plot of ratings and category correlation
df_clean1.boxplot("rating", by = "category")
plt.xticks(rotation = 45)
output:
Q4. Detect and remove outliers
We may define outlier users and items with three rules:
users whose helpfulness is no more than 2
users who rate less than 7 items
items that receives less than 11 ratings
Please remove the corresponding records in the csv file that involves outlier users and items. After that, print the length of the data.
# Droppiing the indexes which has helpfulness<=2 and displaying the altered result
df_helpfulness_clean = df_clean1.drop(df_clean1.index[df_clean1["helpfulness"]<=2])
df_helpfulness_clean
output:
#Group users by their ids and counting the number of their ratings
df_user = df_helpfulness_clean.groupby("userId").count()["item"].reset_index(name = "count")
df_user
output:
userId count
0 3 3
1 4 3
2 7 1
3 10 1
4 12 1
... ... ...
3662 15720 1
3663 15723 1
3664 15726 5
3665 15727 1
3666 15730 1
3667 rows × 2 column
#Finding the number of users who have rated less than 7 items
list_user_less7 = df_user[df_user["count"]<7]["userId"].tolist()
list_user_less7
output:
[3,
4,
7,
10,
12,
16,
17,
18,
22,
25,
...
...
#Dropping the no of users who have rated less than 7 items
df_user_clean = df_helpfulness_clean.drop(df_helpfulness_clean.index[df_helpfulness_clean["userId"].isin(list_user_less7)])
df_user_clean
output:
#Counting the number of items by their ratings and displaying it in a tabular form
df_item = df_user_clean.groupby("item").count()["rating"].reset_index(name = "count")
df_item
output:
item count
0 AOL (America Online) 26
1 All Advantage 29
2 Alta Vista (by 1stUp) 3
3 Amazon.com 10
4 American Airlines 9
... ... ...
84 Wendy's 14
85 What Lies Beneath 12
86 X-Men 31
87 X.com 7
88 eBay 35
89 rows × 2 columns
#Displaying the items which have recieved less than 7 ratings
list_item_less11 = df_item[df_item["count"]<11]["item"].tolist()
list_item_less11
output:
['Alta Vista (by 1stUp)',
'Amazon.com',
'American Airlines',
'Any Given Sunday',
"Applebee's",
'Bank of America',
'BlueLight.com',
'Crouching Tiger, Hidden Dragon',
'Deep Blue Sea',
'Diablo 2 - Bestseller Series for Windows',
'Earthlink',
'Golden Eye 007 for Nintendo 64',
'Google',
'Huggies Ultra Trim Diapers',
'Luvs Ultra Leakguards Stretch',
'MTV',
'NetFlix',
'Northwest Airlines',
'Playtex Diaper Genie System',
'Priceline Flight Reservations',
'Quicken Loans',
'Road Runner (includes MediaOne)',
'Roller Coaster Tycoon For Windows',
'Sega Dreamcast Grey Console',
'Sony PlayStation 2 Slimline Console',
'Spedia',
'Teletubbies',
'The Legend Of Zelda Ocarina Of Time for Nintendo 64',
'The Original Furby',
'Treeloot.com',
'United Airlines',
#Dropping the items which have recieved less than 7 ratings
df_item_clean = df_user_clean.drop(df_user_clean.index[df_user_clean["item"].isin(list_item_less11)])
df_item_clean
output:
#Displaying the final size of the data after performing all the operations
df_item_clean.shape
output:
(1080, 8)
#exporting the data
df_item_clean.to_csv("clean_data.csv", index = False)
Comments