
# Case Study Assignment Help In Data Mining

### Requirement Details

An organization wants to examine associations among employee experience, skills, traits, and similar attributes to better manage its human resources.

As a data scientist, you are required to recognize patterns in the available data and evaluate the efficacy of the methods used to obtain them. Your activities should include preparing the dataset for analysis, investigating relationships in the data set with visualization, identifying frequent patterns, formulating association rules, and evaluating the quality of those rules.

Demonstrate the KDD process with the following activities:

• Problem statement

• Perform exploratory data analysis

• Preprocess the data

• Propose parameters such as support, confidence etc.

• Discover frequent patterns

• Iterate previous steps by varying parameters

• Formulate association rules

• Compare association rules

• Briefly explain the importance of the discovered rules

Following are some points to take note of while doing the assignment:

• The data in some of the rows in the data set may be noisy

• Some of the attributes have a large number of values – you can consider merging them into 2 or 3 values to simplify the solution

• State all your assumptions clearly

• Provide clear explanations to explain your stand
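Before working on the real data, the two quantities the assignment asks you to propose – support and confidence – can be computed by hand. The sketch below uses a tiny hypothetical one-hot table (the column names only echo the dataset's skill attributes, the values are made up); in practice a library such as mlxtend's `apriori` would mine frequent itemsets at scale.

```python
import pandas as pd

# Toy one-hot transaction table (hypothetical values, not the assignment dataset)
df = pd.DataFrame({
    'SQL_Server': [1, 1, 0, 1, 1],
    'PHP_mySQL':  [1, 1, 1, 0, 1],
    'Awards':     [1, 1, 0, 0, 0],
})

def support(items):
    """Fraction of rows where every item in `items` equals 1."""
    return df[list(items)].all(axis=1).mean()

# Support of the itemset {SQL_Server, PHP_mySQL}
s_ab = support(['SQL_Server', 'PHP_mySQL'])
# Confidence of the rule {SQL_Server, PHP_mySQL} -> {Awards}
conf = support(['SQL_Server', 'PHP_mySQL', 'Awards']) / s_ab
print('support=%.2f, confidence=%.2f' % (s_ab, conf))
```

A rule is usually kept only when both values clear the thresholds you propose in the KDD iteration (for example, minimum support 0.3 and minimum confidence 0.6).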

### Solution:

Import Libraries

```
#Import libraries
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error, confusion_matrix
import sklearn.metrics as metrics
```

```
#Read the dataset ('employee_data.csv' is a placeholder file name)
df_train = pd.read_csv('employee_data.csv')
df_train
```

Output: the data frame is displayed.

Replace the spaces in the column names:

```
#Replace spaces in column names with '_'
df_train.columns = df_train.columns.str.replace(' ', '_')
```

Now check the columns

```
#Show all dataset columns
df_train.columns
```

Output:

```
Index(['ID', 'Employment_period_', 'Time_in_current_department_', 'Gender_',
       '.Net_', 'SQL_Server_', 'HTML_CSS_Java_Script_', 'PHP_mySQL_',
       'Fast_working', 'Awards', 'Communicative_'],
      dtype='object')
```

Check the data after renaming the attributes

`df_train` Here we see that all columns are renamed and the spaces are removed (we do this because spaces in column names cause issues when selecting columns).

Checking the dataset for null values

```
#Check for null values
#Visualize the null values with a heat map
check_null_value = df_train.isnull()
sns.heatmap(check_null_value, yticklabels=False, cbar=False, cmap='viridis')
```

Output: as per the heat map above, the dataset has no null values.
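The heat map gives a visual check; a numeric cross-check is `isnull().sum()`, which counts the nulls per column. A minimal sketch on a hypothetical two-column frame (standing in for `df_train`):

```python
import pandas as pd

# Toy frame standing in for df_train (hypothetical values)
toy = pd.DataFrame({'Gender_': ['M', 'F', None], 'Awards': [1, 0, 1]})

# Per-column null counts complement the heat-map view
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real data every count should be 0 if the heat map shows no null values.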

Checking the shape of the dataset

```
#Shape of the dataset
df_train.shape
```

Output:

`(998, 14)`

The dataset has 998 rows and 14 columns.

Checking the data types

```
#Info of the dataset
df_train.info()
```

Output: the column data types and non-null counts.

Summary of the dataset:

```
#Summary of the dataset
df_train.describe()
```

Output: summary statistics of the numeric columns.

Separate the features from the target variable and split the dataset:

```
#Divide the dataset: features vs. the target attribute "Awards"
x = df_train.drop(['Awards'], axis=1)
target = df_train.Awards
```
```
#Split the dataset with a 25 percent test sample
X_train, X_test, y_train, y_test = train_test_split(x, target, test_size=0.25, random_state=0)
```

Use the K-means clustering algorithm and fit the model

```
#Use clustering and find the mean absolute error
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)
```

Output:

`Validation MAE: 0.548`

Accuracy

```#Find Accuracy of Model
score = metrics.accuracy_score(y_test,y_pred)
print('Accuracy: %0.2f' % score)```

Output:

`Accuracy: 0.45`
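An accuracy below 0.5 on a binary target often just means the cluster ids are flipped: K-means assigns arbitrary labels (cluster 0 need not mean "no award"), so comparing them directly to `Awards` can understate the result. A minimal sketch of aligning clusters to true labels by majority vote, on hypothetical arrays standing in for `y_pred` and `y_test`:

```python
import numpy as np

# Hypothetical cluster ids and true labels (not the assignment data)
y_pred = np.array([0, 0, 1, 1, 1, 0])
y_true = np.array([1, 1, 0, 0, 0, 1])

# Map each cluster id to the majority true label inside that cluster
mapped = np.empty_like(y_pred)
for c in np.unique(y_pred):
    mask = y_pred == c
    mapped[mask] = np.bincount(y_true[mask]).argmax()

acc = (mapped == y_true).mean()
print('Aligned accuracy: %0.2f' % acc)
```

Here the clusters perfectly mirror the labels but with flipped ids, so the aligned accuracy is 1.0 while the raw comparison would be 0.0.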

Fine-tune the model to increase the accuracy

```
#Fine-tune the model to reduce the MAE and increase accuracy
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=5, max_iter=100, tol=0.0001)
kmeans.fit(X_train)
y_pred = kmeans.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print('Validation MAE: %0.3f' % test_mae)
```

Output:

`Validation MAE: 0.452`

Accuracy

```
#Find accuracy after fine-tuning the parameters
score = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: %0.2f' % score)
```

Here we see that the model accuracy increases when we tune the model with the extra K-means parameters.
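The assignment also asks to iterate by varying parameters; for K-means the main one is `n_clusters`, which the elbow method helps pick by plotting inertia (within-cluster sum of squares) against k. A sketch on synthetic data (a stand-in for the employee features, generated with two real clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 3-feature data with two well-separated groups
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inertia drops sharply up to the true k, then flattens (the "elbow")
print(inertias)
```

Plotting `inertias` would show a sharp bend at k=2 here; on the real data the bend suggests how many employee groups the features actually support.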

### Plot the Confusion Matrix

```
#Evaluation of the model - confusion matrix plot

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot the non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['0', '1'],
                      title='Confusion matrix, without normalization')
```

Output: the confusion matrix plot.

Find recall and precision:

```
#Find recall and precision
cnf_matrix = confusion_matrix(y_test, y_pred)
recall = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis=1)
precision = np.diag(cnf_matrix) / np.sum(cnf_matrix, axis=0)
```

Confusion Matrix:

`cnf_matrix`

Output:

```
array([[63, 61],
       [52, 74]], dtype=int64)
```

Precision:

```
#Precision score
precision
```

Output:

`array([0.55, 0.55])`

Recall Score:

```
#Recall score
recall
```

Output:

`array([0.51, 0.59])`
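The per-class arrays above come from dividing the matrix diagonal by its row and column sums; this can be cross-checked against scikit-learn's built-in scorers. A sketch on hypothetical labels small enough to verify by eye:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels producing a small 2x2 confusion matrix
y_true = np.array([0, 0, 1, 1, 1])
y_hat  = np.array([0, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_hat)
# Manual per-class recall (matrix rows) and precision (matrix columns)
manual_recall = np.diag(cm) / cm.sum(axis=1)
manual_precision = np.diag(cm) / cm.sum(axis=0)

# sklearn's scorers default to the positive class, i.e. index 1
assert abs(manual_recall[1] - recall_score(y_true, y_hat)) < 1e-9
assert abs(manual_precision[1] - precision_score(y_true, y_hat)) < 1e-9
print(manual_recall, manual_precision)
```

Both checks pass, confirming the diagonal-over-sums formulas used above.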

### Exploratory Data Analysis

```
#EDA (Exploratory Data Analysis)
#Plot the correlation heat map
fig = plt.figure(figsize=(10,5))
sns.heatmap(df_train.corr())
```

Output: the correlation heat map.

Box plot:

```
#Box plot between ID & Employment_period_
#Here both attributes (ID & Employment_period_) are associated with each other
sns.boxplot(x=df_train['Employment_period_'], y=df_train['ID'])
```

Output: the box plot.

Count plot of the Gender variable:

```
#Visualize the frequency distribution of the `Gender` variable
f, ax = plt.subplots(figsize=(9, 7))
ax = sns.countplot(x="Gender_", data=df_train, palette="Set1")
ax.set_title("Frequency distribution of Gender variable")
ax.set_xticklabels(df_train.Gender_.value_counts().index, rotation=30)
plt.show()
```

Output: the count plot.

Bar plot:

```
#Bar plot (we take only the first 18 records of the dataset so the plot stays readable)
df_train_plot = df_train.iloc[:18]
df_train_plot.plot(x="ID", y="Employment_period_", kind="bar")
```

Output: the bar plot.

If you need any other help related to machine learning, send your requirement details to realcode4you@gmail.com and get instant help.