Requirement Details
An organization wants to examine the associations among employee experience, skills, traits, and similar attributes in order to manage its human resources better.
As a data scientist, you are required to recognize patterns in the available data and evaluate the efficacy of the methods used to obtain those patterns. Your activities should include preparing the dataset for analysis, investigating the relationships in the data set with visualization, identifying frequent patterns, formulating association rules, and evaluating the quality of the rules.
Demonstrate the KDD process with the following activities:
Problem statement
Perform exploratory data analysis
Preprocess the data
Propose parameters such as support, confidence, etc. (a short worked example follows the notes below)
Discover frequent patterns
Iterate the previous steps by varying the parameters
Formulate association rules
Compare the association rules
Briefly explain the importance of the discovered rules
Following are some points to take note of while doing the assignment:
The data in some of the rows in the data set may be noisy
Some of the attributes have a large number of values – you can consider merging them into 2 or 3 values to simplify the solution
State all your assumptions clearly
Provide clear explanations to support your stand
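Since the brief centers on support and confidence, here is a minimal worked sketch of both measures on a tiny hand-made 0/1 table. The toy column names merely echo the dataset's naming style and are not taken from it; the numbers are illustrative only.
#Toy illustration of support and confidence (hypothetical data,
#not from the assignment's dataset)
import pandas as pd

toy = pd.DataFrame({
    'SQL_Server': [1, 1, 0, 1, 1],
    'Team_leader': [1, 1, 0, 0, 1],
})

support_A = toy['SQL_Server'].mean()            # P(A) = 4/5 = 0.80
support_AB = ((toy['SQL_Server'] == 1) &
              (toy['Team_leader'] == 1)).mean() # P(A and B) = 3/5 = 0.60
confidence = support_AB / support_A             # P(B | A) = 0.75
print('support(A,B)=%.2f confidence(A->B)=%.2f' % (support_AB, confidence))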
Solution:
Import Libraries
#import libraries
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error, confusion_matrix
import sklearn.metrics as metrics
Read Data
#Read Dataset
df_train = pd.read_csv("Employee_skills_traits.csv")
df_train
Output:
Replace spaces in column attribute names
#Replace spaces in column names with '_'
df_train.columns = df_train.columns.str.replace(' ','_')
Now check the columns
#Show all dataset columns
df_train.columns
Output:
Index(['ID', 'Employment_period_', 'Time_in_current_department_', 'Gender_',
'Team_leader_', 'Age_', 'Member_of_professional_organizations_',
'.Net_', 'SQL_Server_', 'HTML_CSS_Java_Script_', 'PHP_mySQL_',
'Fast_working', 'Awards', 'Communicative_'],
dtype='object')
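The trailing underscores in names such as 'Employment_period_' suggest the original headers ended with spaces. As an optional extra cleanup (an addition, not applied in the rest of this walkthrough, which keeps the underscored names), the trailing underscores could be stripped on a copy:
#Optional: strip trailing underscores on a copy (assumption: the
#trailing spaces in the raw headers were unintentional)
df_clean = df_train.rename(columns=lambda c: c.rstrip('_'))
df_clean.columns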
Check Data After Renaming Attributes
df_train
Here we see that all column names have been renamed and the spaces removed (we do this because spaces in column names make the columns awkward to reference).
Checking Dataset Null Values
#Check for null values by visualizing them
check_null_value = df_train.isnull()
sns.heatmap(check_null_value,yticklabels=False,cbar=False,cmap='viridis')
Output:
As per the heatmap above, the dataset has no null values.
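A numeric cross-check (a small addition using standard pandas) confirms the same thing; every per-column count should be zero:
#Numeric complement to the heatmap: per-column null counts
print(df_train.isnull().sum())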
Checking Shape of dataset
#Shape of the dataset
df_train.shape
Output:
(998, 14)
The dataset has 998 rows and 14 columns.
Checking Data Type
#Info of the dataset
df_train.info()
Output:
Summary of the dataset
#Summary of the dataset
df_train.describe()
Output:
Find Features and Target Variable and Split the Dataset
#Divide the dataset, with "Awards" as the target attribute
x=df_train.drop(['Awards'],axis=1)
target=df_train.Awards
#Split the dataset with a 25 percent test sample
X_train, X_test, y_train, y_test = train_test_split(x, target, test_size=0.25, random_state=0)
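An optional variant (an assumption, not used above): passing stratify keeps the Awards class balance identical in the train and test splits, which makes accuracy comparisons a little more stable.
#Stratified split (optional alternative to the cell above)
X_train, X_test, y_train, y_test = train_test_split(
    x, target, test_size=0.25, random_state=0, stratify=target)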
Use the K-Means Clustering Algorithm and Fit the Model
#Use data mining clustering and find the mean absolute error
#(note: KMeans is unsupervised, so fit() ignores the y_train argument)
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)
Output:
Validation MAE: 0.548
Accuracy
#Find Accuracy of Model
score = metrics.accuracy_score(y_test,y_pred)
print('Accuracy: %0.2f' % score)
Output:
Accuracy: 0.45
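One caveat worth adding here: K-means cluster ids (0/1) carry no class meaning, so an accuracy below 0.5 with two clusters usually means the cluster labels are simply swapped relative to the Awards labels. A small alignment step (an addition to the original code) flips them when that happens:
#Align arbitrary cluster ids with the true labels: with two clusters,
#flipping the 0/1 labels turns accuracy a into 1 - a
if metrics.accuracy_score(y_test, y_pred) < 0.5:
    y_pred = 1 - y_pred
print('Aligned accuracy: %0.2f' % metrics.accuracy_score(y_test, y_pred))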
Fine-Tune the Model to Increase the Accuracy
#Use data mining clustering and find the mean absolute error
#Fine-tune the model to minimize the MAE and increase accuracy
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=5, max_iter=100, tol=0.0001)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)
Output:
Validation MAE: 0.452
Accuracy
#Find accuracy of the model after fine-tuning the parameters
score = metrics.accuracy_score(y_test,y_pred)
print('Accuracy: %0.2f' % score)
Here we see that the model accuracy increases when we tune the model with the extra K-means parameters. Because the labels and predictions are both binary (0/1), accuracy equals 1 − MAE, so the tuned accuracy is about 0.55 (1 − 0.452), up from 0.45.
Plot the Confusion Matrix
#Evaluation of Model - Confusion Matrix Plot
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

#Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

#Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['0','1'],
                      title='Confusion matrix, without normalization')
Output:
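For reference, newer scikit-learn (1.0+) can draw the same plot in two lines; this is an alternative, not part of the original walkthrough:
#Shorter alternative, assuming scikit-learn 1.0 or newer
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()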
Find Recall And Precision
#Find Recall and Precision
cnf_matrix = confusion_matrix(y_test, y_pred)
recall = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis = 1)
precision = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis = 0)
Confusion Matrix:
cnf_matrix
Output:
array([[63, 61],
[52, 74]], dtype=int64)
Precision:
#Precision Score
precision
Output:
array([0.55, 0.55])
Recall Score:
#Recall Score
recall
Output:
array([0.51, 0.59])
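As a cross-check (an addition using scikit-learn's built-in report), classification_report should reproduce the per-class precision and recall computed by hand above:
#Built-in cross-check for precision and recall
print(metrics.classification_report(y_test, y_pred, digits=2))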
Exploratory Data Analysis
#EDA (Exploratory Data Analysis)
#Plot the correlation heatmap
fig = plt.figure(figsize=(10,5))
sns.heatmap(df_train.corr())
Output:
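One version note (an assumption about the environment): in pandas 2.0+, DataFrame.corr() raises an error when non-numeric columns are present, so if the cell above fails, restricting the correlation to numeric columns keeps it working:
#pandas 1.5+ variant: only correlate numeric columns
sns.heatmap(df_train.corr(numeric_only=True))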
Box Plot
#Box plot between ID & Employment_period_
#Here we examine whether the two attributes (ID & Employment_period_) are associated
sns.boxplot(x=df_train['Employment_period_'], y=df_train['ID'])
Output:
# visualize frequency distribution of `Gender` variable
f, ax = plt.subplots(figsize=(9, 7))
ax = sns.countplot(x="Gender_", data=df_train, palette="Set1")
ax.set_title("Frequency distribution of Gender variable")
ax.set_xticklabels(df_train.Gender_.value_counts().index, rotation=30)
plt.show()
Output:
Bar Chart
#Bar plot (we use only the first 18 records of the dataset so the bars stay readable)
df_train_plot = df_train.iloc[:18]
df_train_plot.plot(x= "ID", y= "Employment_period_", kind="bar")
Output:
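The brief also asks for frequent-pattern discovery and association rules, which the clustering workflow above does not cover. A minimal sketch with the mlxtend library (an assumed add-on, installed via pip install mlxtend; it also assumes the listed skill/trait columns hold 0/1 values, as the column list above suggests) could look like this, with min_support=0.2 and min_confidence=0.6 as proposed starting parameters to iterate on:
#Frequent itemsets and association rules with mlxtend (sketch;
#assumes the skill/trait columns are 0/1 encoded)
from mlxtend.frequent_patterns import apriori, association_rules

skill_cols = ['.Net_', 'SQL_Server_', 'HTML_CSS_Java_Script_',
              'PHP_mySQL_', 'Fast_working', 'Awards', 'Communicative_']
basket = df_train[skill_cols].astype(bool)

#Frequent itemsets at a proposed minimum support of 0.2...
itemsets = apriori(basket, min_support=0.2, use_colnames=True)
#...then rules filtered at a proposed minimum confidence of 0.6
rules = association_rules(itemsets, metric='confidence', min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])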
If you need any other help related to machine learning, then send your requirement details to realcode4you@gmail.com and get instant help.