Machine Learning Sample | Practice Set

In this post we have added some important machine learning questions that will help you sharpen your machine learning and data science concepts:
Question 1:
DESCRIPTION
Problem Statement: Build a tic-tac-toe game classification algorithm using the concept of supervised machine learning.
Requirements:
Python 3.6
Scikit-Learn
Pandas and Numpy
Dataset Used: tic-tac-toe.txt
Attribute Description:
Name | Type | Description
top_left_square | string | Values: x, o, or b (blank)
top_middle_square | string | Values: x, o, or b (blank)
top_right_square | string | Values: x, o, or b (blank)
middle_left_square | string | Values: x, o, or b (blank)
middle_middle_square | string | Values: x, o, or b (blank)
middle_right_square | string | Values: x, o, or b (blank)
bottom_left_square | string | Values: x, o, or b (blank)
bottom_middle_square | string | Values: x, o, or b (blank)
bottom_right_square | string | Values: x, o, or b (blank)
class | string | Predictor class: positive (X won) or negative (X lost or tied)
Dataset Description:
This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").
Training dataset:
This dataset will be used to train and evaluate the developer's solution. It will be available at
/data/train/tic-tac-toe.data.txt
Tasks to be performed:
1. Data Preprocessing:
Use random_state = 3 while splitting the dataset into train and test sets.
Encode the features and the class as follows:
Label Val | Decoded Val (features)
0 | b
1 | o
2 | x
Label Val | Decoded Val (class)
0 | negative
1 | positive
Hint: Use the concept of label encoding, i.e., map the values manually.
2. Create a Random Forest model (random_state = 0) using the k-fold cross-validation technique.
3. Apply the AdaBoost algorithm to improve the accuracy score (random_state = 0).
Hint: For the above scenario, you can choose the best value of k (from 2 to 10) for cross-validation and use n_estimators = 100, n_splits = 20 (you need to understand which parameter to use and when).
Print the accuracy score before and after implementing the AdaBoost algorithm.
Output Format:
Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv
output.csv should contain the answer to each question on consecutive rows.
NOTE: If the accuracy before implementing AdaBoost is 0.713 and after implementing it is 0.811, then create a list result = [0.713, 0.811] and convert it to a CSV file (the process of which is mentioned in the stub).
import pandas as pd
import numpy as np
import seaborn as sns
train=pd.read_csv('/data/training/tic-tac-toe.data.txt')
#********Write your code here***************
#*******************************************
#*******************************************
result=[0.713, 0.811]
result=pd.DataFrame(result)
#writing output to output.csv
result.to_csv('/code/output/output.csv', header=False, index=False)
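Below is a minimal sketch of one possible way to fill in the stub above. It is not the official solution: the header-less read, the column names, and the use of test-set accuracy for the before/after comparison are assumptions to verify against the actual file and grader.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Assumed: the file has no header row and columns appear in the order listed above
cols = ['top_left_square', 'top_middle_square', 'top_right_square',
        'middle_left_square', 'middle_middle_square', 'middle_right_square',
        'bottom_left_square', 'bottom_middle_square', 'bottom_right_square',
        'class']
train = pd.read_csv('/data/training/tic-tac-toe.data.txt', names=cols)

# Manual label encoding, as the hint suggests
feature_map = {'b': 0, 'o': 1, 'x': 2}
X = train[cols[:-1]].apply(lambda col: col.map(feature_map))
y = train['class'].map({'negative': 0, 'positive': 1})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Random forest scored with k-fold cross-validation; pick the best k in 2..10
rf = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = {k: cross_val_score(rf, X_train, y_train, cv=k).mean() for k in range(2, 11)}
best_k = max(cv_scores, key=cv_scores.get)  # could also report cv_scores[best_k] as the "before" accuracy

rf.fit(X_train, y_train)
acc_before = round(rf.score(X_test, y_test), 3)

# AdaBoost with the same random_state, to compare accuracy
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
acc_after = round(ada.score(X_test, y_test), 3)

result = pd.DataFrame([acc_before, acc_after])
result.to_csv('/code/output/output.csv', header=False, index=False)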
Question 2:
DESCRIPTION
Dataset Used: PredictionsFor4April2019.csv
Problem Statement: ABC Company has made a model to predict the daily number of units sold of different products.

You have to help the company compute these metrics at the country level.
Write Python code to compute the following metrics using the mean_squared_error function:
RMSE for Country DE
RMSE for Country AT
RMSE for Country PL
Output Format:
Calculate each value up to 2 decimal places.
Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv
output.csv should contain the answer to each question on consecutive rows.
NOTE: If the answers to the 1st, 2nd and 3rd questions are 0.7, 0.6 and 0.8 respectively, then create a list result = [0.7, 0.6, 0.8] and convert it to a CSV file (the process of which is mentioned in the stub).
import pandas as pd
import numpy as np
forecast=pd.read_csv('/data/training/PredictionsFor4April2019.csv')
#********Write your code here***************
#*******************************************
#*******************************************
result=[0.7, 0.6, 0.8]
result=pd.DataFrame(result)
#writing output to output.csv
result.to_csv('/code/output/output.csv', header=False, index=False)
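A minimal sketch of one way to compute the per-country RMSE. The column names 'Country', 'Actuals', and 'Forecast' are assumptions for illustration; substitute the actual column names from PredictionsFor4April2019.csv.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

forecast = pd.read_csv('/data/training/PredictionsFor4April2019.csv')

def rmse_for(df, country):
    # Assumed column names: 'Country' (country code), 'Actuals', 'Forecast'
    sub = df[df['Country'] == country]
    return round(np.sqrt(mean_squared_error(sub['Actuals'], sub['Forecast'])), 2)

result = [rmse_for(forecast, c) for c in ['DE', 'AT', 'PL']]
pd.DataFrame(result).to_csv('/code/output/output.csv', header=False, index=False)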
QUESTION 3:
DESCRIPTION
Dataset Used: PredictionsFor4April2019.csv
Problem Statement: ABC Company has made a model to predict the daily number of units sold of different products.

You have to help the company compute these metrics at the country level.
Write Python code to compute the following metrics:
1. Percentage of Identical Predictions for Country DE
2. Percentage of Identical Predictions for Country AT
3. Percentage of Identical Predictions for Country PL
Output Format:
Calculate each value up to 2 decimal places (for example, for DE it is 60.28).
Perform the above operations and write your output to a file named output.csv, which should be present at the location /code/output/output.csv
output.csv should contain the answer to each question on consecutive rows.
NOTE: If the answers to the 1st, 2nd and 3rd questions are 0.7, 0.6 and 0.8 respectively, then create a list result = [0.7, 0.6, 0.8] and convert it to a CSV file (the process of which is mentioned in the stub).
import pandas as pd
import numpy as np
forecast=pd.read_csv('/data/training/PredictionsFor4April2019.csv')
#********Write your code here***************
#*******************************************
#*******************************************
result=[0.7, 0.8, 0.97]
result=pd.DataFrame(result)
#writing output to output.csv
result.to_csv('/code/output/output.csv', header=False, index=False)
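A minimal sketch under the same assumed column names as in Question 2 ('Country', 'Actuals', 'Forecast'); an "identical prediction" is taken to mean a row where the predicted value equals the actual value.
import pandas as pd

forecast = pd.read_csv('/data/training/PredictionsFor4April2019.csv')

def pct_identical(df, country):
    # Share of rows where the prediction equals the actual, as a percentage
    sub = df[df['Country'] == country]
    return round((sub['Forecast'] == sub['Actuals']).mean() * 100, 2)

result = [pct_identical(forecast, c) for c in ['DE', 'AT', 'PL']]
pd.DataFrame(result).to_csv('/code/output/output.csv', header=False, index=False)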
QUESTION 4:
DESCRIPTION
Problem Statement: The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient is diabetic, based on the diagnostic measurements included in the dataset. Create classification models using the AdaBoost and XGBoost algorithms. Use grid search to find the optimal values for the hyperparameters learning_rate and n_estimators.
Dataset:
diabetes_train.csv
diabetes_test.csv
Dataset Parameters:
Pregnancies: Number of times pregnant (0-14)
Glucose: Glucose Level (0-198)
BloodPressure: Diastolic blood pressure (0-122)
SkinThickness: Triceps skin fold thickness (0-52)
Insulin: 2-Hour serum insulin (0-543)
BMI: Body Mass Index (0-57.3)
DiabetesPedigreeFunction: Diabetes Pedigree Function (0.078-2.288)
Age: Age of Patient (21-81)
Outcome: Patient is diabetic or not (0 or 1)
Tasks to be Performed:
1. What are the optimal values for learning_rate and n_estimators?
Example: If 1 and 100 are the optimal values, then the output should be:
Output: 1, 100
Hint: Take the hyperparameter ranges as:
learning_rate: 0.1 to 1 step 0.1
n_estimators: 50 to 300 step 50
2. Calculate the metric values below for both models (AdaBoost and XGBoost) and, for each metric, report the larger of the two values (up to 2 decimal places):
Accuracy
Sensitivity
Specificity
Example: If the metric values for AdaBoost are:
Accuracy: 80.0
Sensitivity: 40.12
Specificity: 30.34
And the metric values for XGBoost are:
Accuracy: 90.0
Sensitivity: 50.56
Specificity: 20.78
Then the output should be:
Output: 90.0, 50.56, 30.34
Hint: Use the confusion matrix to calculate the above values.
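For reference, with true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) taken from the confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Multiply each by 100 to express it as a percentage, as in the examples above.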
Final Output Sample:
1, 100, 90.0, 50.56, 30.34
NOTE: Here, the multiple answers are separated by commas.
Input Format:
The first file ‘diabetes_train.csv’ contains data as mentioned in the problem to train the models. The file is in *.csv format and is present at the location /data/training/diabetes_train.csv.
The second file ‘diabetes_test.csv’ contains data as mentioned in the problem to test the models. The file is in *.csv format and is present at the location /data/test/diabetes_test.csv.
Output Format:
Perform the above operations and write answers to all queries asked in the questions to a file named output.csv.
Each answer should be separated by a comma.
Your file output.csv should be present at the location
/code/output/output.csv.
# Import libraries here
# import numpy as np
# from sklearn import linear_model
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
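A minimal sketch of one possible approach. Assumptions: the target column is 'Outcome', the grid search is run on the AdaBoost model with 5-fold cross-validation (the problem does not fix the CV scheme), and the xgboost package is available; verify these against the actual environment.
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

train = pd.read_csv('/data/training/diabetes_train.csv')
test = pd.read_csv('/data/test/diabetes_test.csv')
X_train, y_train = train.drop('Outcome', axis=1), train['Outcome']
X_test, y_test = test.drop('Outcome', axis=1), test['Outcome']

# Grid over the ranges given in the hint
param_grid = {'learning_rate': [round(x, 1) for x in np.arange(0.1, 1.01, 0.1)],
              'n_estimators': list(range(50, 301, 50))}
grid = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
best_lr = grid.best_params_['learning_rate']
best_n = grid.best_params_['n_estimators']

def scores(model):
    # Accuracy, sensitivity and specificity (in %) from the confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return ((tp + tn) / (tp + tn + fp + fn) * 100,
            tp / (tp + fn) * 100,
            tn / (tn + fp) * 100)

ada = AdaBoostClassifier(learning_rate=best_lr, n_estimators=best_n,
                         random_state=0).fit(X_train, y_train)
xgb = XGBClassifier(learning_rate=best_lr, n_estimators=best_n,
                    random_state=0).fit(X_train, y_train)

# For each metric, keep the larger of the two models' values
answers = [best_lr, best_n] + [round(max(a, b), 2)
                               for a, b in zip(scores(ada), scores(xgb))]
pd.DataFrame([answers]).to_csv('/code/output/output.csv', header=False, index=False)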
QUESTION 5:
DESCRIPTION
Problem Statement:
You are provided with a dataset named “Retail.csv” on which you have to perform market basket analysis. Apply the Apriori algorithm and association rules using appropriate parameters.
Dataset: Retail.csv
Dataset Parameters:
POS Txn: Transaction ID
Dept: Department
ID: Item ID
Sales U: Units sold
Tasks to be Performed:
1. Remove ‘0999: UNSCANNED ITEMS’ from the ‘Dept’ column and print the number of times ‘0973:CANDY’ was sold.
Example: If ‘0973:CANDY’ was sold 100 times, then the output should be:
Output: 100
Hint: We need to find the number of times ‘0973:CANDY’ was sold, not the total units sold.
2. For the frequent itemsets, keep the minimum support as 0.02 and find the maximum support (up to 5 decimal places).
Example: If the maximum support is 0.54321, then the output should be:
Output: 0.54321
Hint: Get rules using the “lift” metric with a minimum threshold of 2.
3. Filter the rules having lift >= 3 and confidence >= 0.1, and calculate the total number of rules and the number of filtered rules.
Example: If the total number of rules is 40 and the number of filtered rules is 20, then the output should be:
Output: 40, 20
Final Output Sample:
100, 0.54321, 40, 20
NOTE: Here, the multiple answers are separated by commas.
Input Format:
The first file ‘Retail.csv’ contains data as mentioned in the problem. The file is in *.csv format and is present at the location /data/training/Retail.csv.
Output Format:
Perform the above operations and write answers to all queries asked in the questions to a file named output.csv.
Each answer should be separated by a comma.
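A minimal sketch using the mlxtend implementations of Apriori and association rules. The basket construction (one boolean column per department per transaction) and the exact label strings are assumptions to be checked against Retail.csv.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

retail = pd.read_csv('/data/training/Retail.csv')
retail = retail[retail['Dept'] != '0999: UNSCANNED ITEMS']

# 1. Number of times '0973:CANDY' was sold: count of line items, not total units
candy_count = (retail['Dept'] == '0973:CANDY').sum()

# One-hot basket: rows are transactions, columns are departments
basket = (retail.groupby(['POS Txn', 'Dept'])['Sales U'].sum()
                .unstack().fillna(0).gt(0))

# 2. Frequent itemsets at minimum support 0.02; report the maximum support found
itemsets = apriori(basket, min_support=0.02, use_colnames=True)
max_support = round(itemsets['support'].max(), 5)

# 3. Rules on the lift metric (minimum threshold 2), then the filtered subset
rules = association_rules(itemsets, metric='lift', min_threshold=2)
filtered = rules[(rules['lift'] >= 3) & (rules['confidence'] >= 0.1)]

result = [candy_count, max_support, len(rules), len(filtered)]
pd.DataFrame([result]).to_csv('/code/output/output.csv', header=False, index=False)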