
Machine Learning Practice Example Set: 1



Part 1:

Write a Python program to calculate the density estimator of a histogram. Use the field x in the NormalSample.csv file.

a) Let a be the largest integer less than the minimum value of the field x, and b be the smallest integer greater than the maximum value of the field x. What are the values of a and b?

b) Use h = 0.25, minimum = a and maximum = b. List the coordinates of the density estimator. Paste the histogram drawn using Python or your favorite graphing tools.

c) Use h = 0.5, minimum = a and maximum = b. List the coordinates of the density estimator. Paste the histogram drawn using Python or your favorite graphing tools.

d) Use h = 1, minimum = a and maximum = b. List the coordinates of the density estimator. Paste the histogram drawn using Python or your favorite graphing tools.

e) Use h = 2, minimum = a and maximum = b. List the coordinates of the density estimator. Paste the histogram drawn using Python or your favorite graphing tools.

f) Among the four histograms, which one, in your honest opinion, best provides insight into the shape and the spread of the distribution of the field x? Please state your arguments.
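
As a rough sketch of parts a) through e), the histogram density estimator can be computed by counting the observations in each bin of width h and dividing each count by N times h, where N is the number of observations. The function names, the half-open bin convention [left, right), and the small in-memory sample below are assumptions made for illustration; the real values would come from the field x in NormalSample.csv.

```python
import math


def integer_bounds(values):
    """a = largest integer below min(values); b = smallest integer above max(values)."""
    a = math.ceil(min(values)) - 1
    b = math.floor(max(values)) + 1
    return a, b


def histogram_density(values, h, a, b):
    """Return (midpoint, density) pairs for bins of width h covering [a, b].

    Assumed convention: bin i is the half-open interval [a + i*h, a + (i+1)*h).
    """
    n = len(values)
    n_bins = math.ceil((b - a) / h)
    counts = [0] * n_bins
    for v in values:
        i = int((v - a) // h)
        i = min(i, n_bins - 1)  # guard in case v == b
        counts[i] += 1
    # Density of a bin = count / (N * h), reported at the bin midpoint.
    return [(a + (i + 0.5) * h, c / (n * h)) for i, c in enumerate(counts)]


# Illustrative sample only, not the contents of NormalSample.csv.
sample = [1.2, 1.4, 2.1, 2.6, 2.9, 3.3, 3.8, 4.5]
a, b = integer_bounds(sample)
for h in (0.25, 0.5, 1, 2):
    print(f"h = {h}")
    for midpoint, density in histogram_density(sample, h, a, b):
        print(f"  ({midpoint}, {density:.4f})")
```

Because each density is count / (N h), the densities multiplied by h sum to one for every bin width, which is a quick sanity check on the listed coordinates.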


Part 2:

Use the field x in the NormalSample.csv file to generate box-plots for answering the following questions.

a) (5 points) What is the five-number summary of x for each category of Group? What are the values of the 1.5 IQR whiskers for each category of Group?


b) (5 points) Draw a graph that contains the boxplot of x and the boxplots of x for each category of Group (i.e., three boxplots within the same graph frame). Using the 1.5 IQR whiskers, identify the outliers of x, if any, for the entire data and for each category of Group.


Hint: Consider using the concat function in the pandas module to append observations.
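
The arithmetic behind the five-number summary and the 1.5 IQR whiskers can be sketched in plain Python as below. Several things here are assumptions: quartiles are computed by linear interpolation between ranks (other quartile conventions give slightly different values), the whisker values are taken as Q1 - 1.5 IQR and Q3 + 1.5 IQR, and the records, the group labels 0 and 1, and the function names are made up for illustration. The "all" entry plays the role of the overall boxplot, alongside one entry per group.

```python
def percentile(sorted_values, p):
    """Percentile by linear interpolation between the two closest ranks."""
    pos = (len(sorted_values) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def box_summary(values):
    """Five-number summary, 1.5 IQR whisker bounds, and points outside them."""
    s = sorted(values)
    q1, median, q3 = (percentile(s, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": s[0], "q1": q1, "median": median, "q3": q3, "max": s[-1],
        "lower_whisker": lower, "upper_whisker": upper,
        "outliers": [v for v in s if v < lower or v > upper],
    }


# Illustrative (group, x) records; not the contents of NormalSample.csv.
records = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 30),
           (1, 10), (1, 12), (1, 13), (1, 15), (1, 20)]

# One list for the entire data, plus one list per group.
by_group = {"all": [x for _, x in records]}
for group, x in records:
    by_group.setdefault(group, []).append(x)

for name, values in by_group.items():
    print(name, box_summary(values))
```

Some plotting tools draw the whiskers at the most extreme observations that fall inside these bounds rather than at the bounds themselves, so it is worth stating which convention a reported whisker value follows.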


Part 3:

The data, FRAUD.csv, contains results of fraud investigations of 5,960 cases. The binary variable FRAUD indicates the result of a fraud investigation: 1 = Fraudulent, 0 = Otherwise.


The other interval variables contain information about the cases.

  1. TOTAL_SPEND: Total amount of claims in dollars

  2. DOCTOR_VISITS: Number of visits to a doctor

  3. NUM_CLAIMS: Number of claims made recently

  4. MEMBER_DURATION: Membership duration in number of months

  5. OPTOM_PRESC: Number of optical examinations

  6. NUM_MEMBERS: Number of members covered


You are asked to use the Nearest Neighbors algorithm to predict the likelihood of fraud.


a) What percent of investigations are found to be fraudulent? Please give your answer up to 4 decimal places.


b) Use the BOXPLOT function to produce horizontal box-plots. For each interval variable, draw one box-plot for the fraudulent observations and another for the non-fraudulent observations. These two box-plots must appear in the same graph for each interval variable.


c) Orthonormalize interval variables and use the resulting variables for the nearest neighbor analysis. Use only the dimensions whose corresponding eigenvalues are greater than one.

i. How many dimensions are used?

ii. Please provide the transformation matrix. You must provide proof that the resulting variables are actually orthonormal.


d) Use the Nearest Neighbors module to execute the Nearest Neighbors algorithm using exactly five neighbors and the resulting variables you have chosen in c). The K Neighbors Classifier module has a score function.

i. Run the score function and provide the function's return value.

ii. Explain the meaning of the score function return value.



e) For the observation which has these input variable values: TOTAL_SPEND = 7500, DOCTOR_VISITS = 15, NUM_CLAIMS = 3, MEMBER_DURATION = 127, OPTOM_PRESC = 2, and NUM_MEMBERS = 2, find its five neighbors. Please list their input variable values and the target values. Reminder: transform the input observation using the results in c) before finding the neighbors.


f) Following up on e), what is the predicted probability of fraud (i.e., FRAUD = 1)? If your predicted probability is greater than or equal to your answer in a), then the observation will be classified as fraudulent. Otherwise, it will be classified as non-fraudulent. Based on this criterion, will this observation be misclassified?
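
Parts c), e) and f) can be sketched with NumPy as follows. This is one possible approach under stated assumptions, not necessarily the intended solution: orthonormalization is done through an eigen-decomposition of X'X, the neighbor search is a plain brute-force Euclidean search that stands in for a library Nearest Neighbors module (such as scikit-learn's KNeighborsClassifier), and the training data is random synthetic data rather than FRAUD.csv. The function names are hypothetical; only the query observation is taken from e).

```python
import numpy as np


def orthonormal_transform(X, min_eigenvalue=1.0):
    """Return (T, eigenvalues) such that the columns of X @ T are orthonormal.

    Assumed approach: eigen-decomposition of X'X, keeping only the eigenvectors
    whose eigenvalues exceed min_eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
    keep = eigenvalues > min_eigenvalue
    # Scale each kept eigenvector by 1 / sqrt(its eigenvalue).
    T = eigenvectors[:, keep] / np.sqrt(eigenvalues[keep])
    return T, eigenvalues


def nearest_neighbors(Z, z_query, k=5):
    """Indices of the k rows of Z closest to z_query in Euclidean distance."""
    distances = np.linalg.norm(Z - z_query, axis=1)
    return np.argsort(distances)[:k]


# Synthetic stand-in for FRAUD.csv (random values, for illustration only).
rng = np.random.default_rng(0)
upper = np.array([10000.0, 30.0, 10.0, 300.0, 10.0, 5.0])
X = rng.uniform(0.0, upper, size=(200, 6))
y = rng.integers(0, 2, size=200)

T, eigenvalues = orthonormal_transform(X)
Z = X @ T

# Orthonormality check: Z'Z should be (numerically) the identity matrix.
print(np.allclose(Z.T @ Z, np.eye(Z.shape[1]), atol=1e-6))

# Observation from e), transformed with T before searching.
query = np.array([7500.0, 15.0, 3.0, 127.0, 2.0, 2.0])
neighbors = nearest_neighbors(Z, query @ T, k=5)

prob_fraud = y[neighbors].mean()  # share of fraudulent cases among the neighbors
base_rate = y.mean()              # answer to a), expressed as a proportion
predicted = int(prob_fraud >= base_rate)
print(X[neighbors], y[neighbors], prob_fraud, predicted)
```

The identity check on Z'Z is the kind of evidence c) ii asks for. For d), if scikit-learn is the module in use, the score method of KNeighborsClassifier returns the mean accuracy on the data and labels passed to it.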



If you need a solution to a machine learning problem like the one above, you can contact us at realcode4you@gmail.com and we can provide a solution as per your needs.

