Machine Learning Sample | Practice Set


This post collects some important machine learning questions that will help you strengthen your machine learning and data science concepts:


Question 1:

DESCRIPTION

Problem Statement: Build a tic-tac-toe game classification algorithm using the concept of supervised machine learning.


Requirements:

  • Python 3.6

  • Scikit-Learn

  • Pandas and Numpy

Dataset Used: tic-tac-toe.txt


Attribute Description:

Name | Type | Description

top_left_square | string | Value is x, o, or b (blank)

top_middle_square | string | Value is x, o, or b (blank)

top_right_square | string | Value is x, o, or b (blank)

middle_left_square | string | Value is x, o, or b (blank)

middle_middle_square | string | Value is x, o, or b (blank)

middle_right_square | string | Value is x, o, or b (blank)

bottom_left_square | string | Value is x, o, or b (blank)

bottom_middle_square | string | Value is x, o, or b (blank)

bottom_right_square | string | Value is x, o, or b (blank)

class | string | Target class: values can be positive (X won) or negative (X lost or tied)


Dataset Description:

This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").

Training dataset:

This dataset will be used to test the developer's solution. It will be available at


/data/train/tic-tac-toe.data.txt

Tasks to be performed:

1. Data Preprocessing:

Use random_state = 3 while splitting the dataset into train and test sets.


Label Val | Decoded Val (features)

0 | b

1 | o

2 | x

Label Val | Decoded Val (class)

0 | negative

1 | positive


Hint: Use label encoding, i.e. map the values manually.
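
The manual mapping from the hint could look like the following minimal sketch. The helper name encode_board and the tiny demo frame are illustrative assumptions; only the two mapping tables come from the task description.

```python
import pandas as pd

# Mapping tables taken from the task description.
FEATURE_MAP = {"b": 0, "o": 1, "x": 2}
CLASS_MAP = {"negative": 0, "positive": 1}


def encode_board(df, target_col="class"):
    """Return a copy of df with every column replaced by its integer code."""
    encoded = df.copy()
    feature_cols = [c for c in encoded.columns if c != target_col]
    for col in feature_cols:
        encoded[col] = encoded[col].map(FEATURE_MAP)
    encoded[target_col] = encoded[target_col].map(CLASS_MAP)
    return encoded


# Hand-made demo frame with only two of the nine squares (not real data).
demo = pd.DataFrame({
    "top_left_square": ["x", "o", "b"],
    "top_middle_square": ["b", "x", "o"],
    "class": ["positive", "negative", "positive"],
})
encoded_demo = encode_board(demo)
print(encoded_demo)
```

Using an explicit dictionary keeps the codes identical to the table above, whereas an automatic encoder would assign them in its own order.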


2. Create a Random Forest model (random_state = 0) using the k-fold cross-validation technique.


3. Apply the AdaBoost algorithm to improve the accuracy score (random_state = 0).


Hint: For the above scenario, you can choose the best value of k (from 2 to 10) for cross-validation and use n_estimators = 100, n_splits = 20 (you need to understand which parameter to use and when).


Print the accuracy score before and after implementing the AdaBoost algorithm.
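
One possible way to compare the two models is sketched below. It is not the reference solution: the helper best_cv_accuracy is a hypothetical name, the data is a synthetic stand-in for the encoded board, AdaBoost is used with its default base learner (an assumption, since the task does not say what to boost), and the estimator counts and k values are reduced so the sketch runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def best_cv_accuracy(model, X, y, k_values=range(2, 11)):
    """Return (best_k, mean accuracy) over the candidate fold counts."""
    scores = {}
    for k in k_values:
        folds = KFold(n_splits=k, shuffle=True, random_state=0)
        scores[k] = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


# Synthetic stand-in for the encoded board (9 features, binary target).
X, y = make_classification(n_samples=200, n_features=9, random_state=3)

# The hint suggests n_estimators = 100; 20 is used here only for speed.
forest = RandomForestClassifier(n_estimators=20, random_state=0)
boosted = AdaBoostClassifier(n_estimators=20, random_state=0)

k_rf, acc_rf = best_cv_accuracy(forest, X, y, k_values=[3, 5])
k_ada, acc_ada = best_cv_accuracy(boosted, X, y, k_values=[3, 5])

# Same [before, after] layout as the result list in the stub.
result = [round(float(acc_rf), 3), round(float(acc_ada), 3)]
print(result)
```

For the actual task, k_values would be left at its default of 2 to 10 and the models would be fitted on the encoded tic-tac-toe data.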


Output Format:

  • Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv

  • output.csv should contain the answer to each question on consecutive rows.

NOTE: If the accuracy before implementing AdaBoost is 0.713 and after implementing it is 0.811, then create a list result = [0.713, 0.811] and convert it to a CSV file (the process is shown in the stub).

import pandas as pd
import numpy as np 
import seaborn as sns
train=pd.read_csv('/data/training/tic-tac-toe.data.txt')
#********Write your code here***************
#*******************************************
#*******************************************
result=[0.713, 0.811]
result=pd.DataFrame(result)
#writing output to output.csv
result.to_csv('/code/output/output.csv', header=False, index=False)

Question 2:

DESCRIPTION

Dataset Used: PredictionsFor4April2019.csv

Problem Statement: ABC Company has built a model to predict the daily number of units sold of different products.

You have to help this company get the metrics at the country level.

Write Python code to compute the following metrics using the mean_squared_error function:

  1. RMSE for Country DE

  2. RMSE for Country AT

  3. RMSE for Country PL

Output Format:

  • Calculate up to 2 decimal places

  • Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv

  • output.csv should contain the answer to each question on consecutive rows.

NOTE: If the answers to the 1st, 2nd and 3rd questions are 0.7, 0.6 and 0.8 respectively, then create a list result = [0.7, 0.6, 0.8] and convert it to a CSV file (the process is shown in the stub).

import pandas as pd
import numpy as np
forecast=pd.read_csv('/data/training/PredictionsFor4April2019.csv')
#********Write your code here***************
#******************************************
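
A per-country RMSE can be obtained by taking the square root of mean_squared_error, as in this rough sketch. The column names Country, Actual and Prediction are assumptions, because the real header of PredictionsFor4April2019.csv is not shown here, and the demo rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error


def rmse_by_country(df, countries, actual_col="Actual", pred_col="Prediction"):
    """RMSE per country, rounded to 2 decimals, in the order given."""
    values = []
    for country in countries:
        rows = df[df["Country"] == country]
        mse = mean_squared_error(rows[actual_col], rows[pred_col])
        values.append(round(float(np.sqrt(mse)), 2))
    return values


# Made-up rows; the column names are hypothetical.
demo = pd.DataFrame({
    "Country": ["DE", "DE", "AT", "AT", "PL", "PL"],
    "Actual": [10, 12, 5, 7, 3, 3],
    "Prediction": [10, 14, 4, 8, 3, 3],
})
result = rmse_by_country(demo, ["DE", "AT", "PL"])
print(result)  # [1.41, 1.0, 0.0]
```

The resulting list can be written to output.csv in the same way as in the other stubs.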


Question 3:

DESCRIPTION

Dataset Used: PredictionsFor4April2019.csv

Problem Statement: ABC Company has built a model to predict the daily number of units sold of different products.

You have to help this company get the metrics at the country level.

Write Python code to compute the following metrics:


1. Percentage of Identical Predictions for Country DE

2. Percentage of Identical Predictions for Country AT

3. Percentage of Identical Predictions for Country PL


Output Format:

  • Calculate up to 2 decimal places (example for DE it is 60.28)

  • Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv

  • output.csv should contain the answer to each question on consecutive rows.

NOTE: If the answers to the 1st, 2nd and 3rd questions are 0.7, 0.6 and 0.8 respectively, then create a list result = [0.7, 0.6, 0.8] and convert it to a CSV file (the process is shown in the stub).

import pandas as pd
import numpy as np
forecast=pd.read_csv('/data/training/PredictionsFor4April2019.csv')
#********Write your code here***************
#*******************************************
#*******************************************
result=[0.7, 0.8,0.97]
result=pd.DataFrame(result)
#writing output to output.csv
result.to_csv('/code/output/output.csv', header=False, index=False)
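
Assuming that an "identical prediction" means a row where the predicted value equals the actual value, the percentage per country could be computed as sketched below. The column names Country, Actual and Prediction are hypothetical and the demo rows are made up.

```python
import pandas as pd


def pct_identical(df, countries, actual_col="Actual", pred_col="Prediction"):
    """Share of rows (in percent) where prediction equals actual, per country."""
    values = []
    for country in countries:
        rows = df[df["Country"] == country]
        share = (rows[actual_col] == rows[pred_col]).mean() * 100
        values.append(round(float(share), 2))
    return values


# Made-up rows; the column names are hypothetical.
demo = pd.DataFrame({
    "Country": ["DE", "DE", "DE", "AT", "AT", "PL"],
    "Actual": [10, 12, 11, 5, 7, 3],
    "Prediction": [10, 14, 11, 4, 7, 3],
})
result = pct_identical(demo, ["DE", "AT", "PL"])
print(result)  # [66.67, 50.0, 100.0]
```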


Question 4:

DESCRIPTION

Problem Statement: The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient is diabetic or not, based on diagnostic measurements included in the dataset. Create a classification model using AdaBoost Algorithm and XGBoost Algorithm. Use grid search to find the optimal value for the hyperparameters: learning_rate and n_estimators.

Dataset:

  • diabetes_train.csv

  • diabetes_test.csv

Dataset Parameters:

  • Pregnancies: Number of pregnancies (0-14)

  • Glucose: Glucose Level (0-198)

  • BloodPressure: Diastolic blood pressure (0-122)

  • SkinThickness: Triceps skin fold thickness (0-52)

  • Insulin: 2-Hour serum insulin (0-543)

  • BMI: Body Mass Index (0-57.3)

  • DiabetesPedigreeFunction: Diabetes Pedigree Function (0.078-2.288)

  • Age: Age of Patient (21-81)

  • Outcome: Patient is diabetic or not (0 or 1)


Tasks to be Performed:

1. What are the optimal values for learning_rate and n_estimators?

Example: If 1 and 100 are optimal values then the output should be:

Output: 1 , 100

Hint: Take hyperparameters as:

  • learning_rate: 0.1 to 1 step 0.1

  • n_estimators: 50 to 300 step 50
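
The grid from the hint can be searched with scikit-learn's GridSearchCV, roughly as follows. This is a sketch on synthetic data with a reduced grid so that it runs quickly; the helper name tune and the choice of 3 folds are assumptions. Only AdaBoost is shown; any other classifier that follows the scikit-learn estimator interface and accepts these two parameters could be passed to the same helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Full grid from the hint: 0.1 to 1.0 in steps of 0.1, 50 to 300 in steps of 50.
FULL_GRID = {
    "learning_rate": [float(round(v, 1)) for v in np.arange(0.1, 1.01, 0.1)],
    "n_estimators": list(range(50, 301, 50)),
}


def tune(model, X, y, grid, cv=3):
    """Return the parameter combination with the best mean CV accuracy."""
    search = GridSearchCV(model, grid, scoring="accuracy", cv=cv)
    search.fit(X, y)
    return search.best_params_


# Synthetic stand-in for the 8 diabetes features (not the real data).
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Reduced grid for a quick run; pass FULL_GRID for the actual task.
small_grid = {"learning_rate": [0.5, 1.0], "n_estimators": [50, 100]}
best = tune(AdaBoostClassifier(random_state=0), X, y, small_grid)
print(best["learning_rate"], ",", best["n_estimators"])
```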

2. Calculate the following performance metrics for both models (AdaBoost and XGBoost) and report the larger of the two values for each metric (up to 2 decimal places):

  • Accuracy

  • Sensitivity

  • Specificity


Example: If the metric values of AdaBoost are:

  • Accuracy: 80.0

  • Sensitivity: 40.12

  • Specificity: 30.34

And the metric values of XGBoost are:

  • Accuracy: 90.0

  • Sensitivity: 50.56

  • Specificity: 20.78

Then the output should be:

Output: 90.0, 50.56, 30.34


Hint: Use the confusion matrix to calculate the above values.
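
As a sketch of the hint, the three metrics can be read off a binary confusion matrix, and the larger value per metric picked afterwards. The helper name summary_metrics is an assumption, and the two prediction lists are hand-made stand-ins for the outputs of the two models.

```python
from sklearn.metrics import confusion_matrix


def summary_metrics(y_true, y_pred):
    """Return [accuracy, sensitivity, specificity] in percent, 2 decimals."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    sensitivity = tp / (tp + fn) * 100  # true positive rate
    specificity = tn / (tn + fp) * 100  # true negative rate
    return [round(float(v), 2) for v in (accuracy, sensitivity, specificity)]


# Hand-made labels and predictions (not real model output).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
pred_a = [1, 0, 1, 0, 0, 1, 0, 1]
pred_b = [1, 1, 1, 0, 1, 1, 0, 1]

scores_a = summary_metrics(y_true, pred_a)
scores_b = summary_metrics(y_true, pred_b)

# Keep the larger value for each metric, as in the example above.
best_scores = [max(a, b) for a, b in zip(scores_a, scores_b)]
print(best_scores)  # [75.0, 100.0, 75.0]
```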


Final Output Sample:

1, 100, 90.0, 50.56, 30.34

NOTE: Multiple answers are separated by commas.


Input Format:

  • The first file ‘diabetes_train.csv’ contains data as mentioned in the problem to train the models. The file is in *.csv format and is present at the location /data/training/diabetes_train.csv.

  • The second file ‘diabetes_test.csv’ contains data as mentioned in the problem to test the models. The file is in *.csv format and is present at the location /data/test/diabetes_test.csv.


Output Format:

  • Perform the above operations and write answers to all queries asked in the questions to a file named output.csv.

  • Each answer should be separated by a comma.

  • Your file output.csv should be present at the location /code/output/output.csv.

# Import libraries here
# import numpy as np
# from sklearn import linear_model
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt


Question 5:

DESCRIPTION

Problem Statement:


You are provided with a dataset named “Retail.csv”. Perform market basket analysis on this dataset: apply the Apriori algorithm and generate association rules using appropriate parameters.

Dataset: Retail.csv

Dataset Parameters:

  • POS Txn : Transaction ID

  • Dept: Department

  • ID: Item ID

  • Sales U: Units sold


1. Remove the ‘0999: UNSCANNED ITEMS’ rows from the ‘Dept’ column and print the number of times ‘0973:CANDY’ was sold.

Example: If ‘0973:CANDY’ was sold 100 times then the output should be:

Output: 100

Hint: We need to find the number of times ‘0973:CANDY’ was sold, not the total units sold.
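
The difference between counting sales and summing units can be illustrated with a small sketch. It assumes that each row of the file is one sale line, that the department labels are spelled exactly as in the task text, and that the helper name count_dept_rows and the demo rows (including the label DEMO:OTHER) are made up for illustration.

```python
import pandas as pd


def count_dept_rows(df, dept, drop_dept):
    """Drop the unwanted department, then count rows of the requested one."""
    kept = df[df["Dept"] != drop_dept]
    return int((kept["Dept"] == dept).sum())


# Made-up rows using the column names from the dataset description.
demo = pd.DataFrame({
    "POS Txn": [1, 1, 2, 3],
    "Dept": ["0973:CANDY", "0999: UNSCANNED ITEMS", "0973:CANDY", "DEMO:OTHER"],
    "ID": [10, 11, 10, 12],
    "Sales U": [2, 1, 5, 1],
})

times_sold = count_dept_rows(demo, "0973:CANDY", "0999: UNSCANNED ITEMS")
units_sold = int(demo.loc[demo["Dept"] == "0973:CANDY", "Sales U"].sum())
print(times_sold, units_sold)  # 2 7
```

Here the answer asked for is 2 (two sale lines), not 7 (total units).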


2. For the frequent itemsets, keep the minimum support at 0.02 and find the maximum support (up to 5 decimal places).

Example: If maximum support is 0.54321 then the output should be:

Output: 0.54321
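
Frequent-itemset miners generally expect a boolean transaction-by-item table. The sketch below builds one with pandas and computes single-item supports. It treats each ‘Dept’ value as an item, which is an assumption, and the helper names and demo rows are made up. Since an itemset can never be more frequent than any single item it contains, the largest support among the frequent itemsets is the largest single-item support, provided that item clears the minimum support.

```python
import pandas as pd


def basket_matrix(df, txn_col="POS Txn", item_col="Dept"):
    """One row per transaction, one boolean column per item."""
    return pd.crosstab(df[txn_col], df[item_col]) > 0


def item_support(basket):
    """Fraction of transactions that contain each item, highest first."""
    return basket.mean().sort_values(ascending=False)


# Made-up transactions: 1={A,B}, 2={A}, 3={B,C}, 4={A,B}.
demo = pd.DataFrame({
    "POS Txn": [1, 1, 2, 3, 3, 4, 4],
    "Dept": ["A", "B", "A", "B", "C", "A", "B"],
})
basket = basket_matrix(demo)
support = item_support(basket)
max_support = round(float(support.max()), 5)
print(max_support)  # 0.75
```

A boolean table of this shape can then be handed to an Apriori implementation with the minimum support set to 0.02.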

Hint: Generate the rules using the “lift” metric with a minimum threshold of 2.


3. Filter the rules having lift >= 3 and confidence >= 0.1, and calculate the total number of rules and the number of filtered rules.

Example: If the total number of rules is 40 and the number of filtered rules is 20 then output should be:

Output: 40, 20
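
Once the rules are available as a table with lift and confidence columns, the filtering step is a plain boolean selection. In this sketch the rules frame is hand-made with toy values, and the column names lift and confidence are assumed to match whatever the rule-generation step produces.

```python
import pandas as pd


def count_rules(rules, min_lift=3, min_confidence=0.1):
    """Return (total number of rules, number passing both thresholds)."""
    mask = (rules["lift"] >= min_lift) & (rules["confidence"] >= min_confidence)
    return len(rules), int(mask.sum())


# Toy rules table with invented values, for illustration only.
rules = pd.DataFrame({
    "antecedents": ["A", "B", "C", "D"],
    "consequents": ["B", "C", "D", "A"],
    "lift": [2.1, 3.0, 4.5, 3.2],
    "confidence": [0.30, 0.08, 0.25, 0.10],
})
total, filtered = count_rules(rules)
print(f"{total}, {filtered}")  # 4, 2
```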

Final Output Sample:

100, 0.54321, 40, 20


NOTE: Multiple answers are separated by commas.


Input Format:

  • The file ‘Retail.csv’ contains the data described in the problem. The file is in *.csv format and is present at the location /data/training/Retail.csv.

Output Format:

  • Perform the above operations and write answers to all queries asked in the questions to a file named output.csv.

  • Each answer should be separated by a comma.