K-means Clustering on the Car Dataset to segment the cars into various categories | Sample Paper

Part A:

  1. Domain: Automobile

  2. Context: The data concerns city-cycle fuel consumption in miles per gallon to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

  3. Data Description:


  4. Project Objective: To understand K-means clustering by applying it to the Car Dataset to segment the cars into various categories.


Steps and Tasks

1. Data Understanding and Exploration

  • Read ‘Car name.csv’ as a DataFrame and assign a variable to it.

  • Read ‘Car-Attribute.csv’ as a DataFrame and assign a variable to it.

  • Merge both the Dataframes together to form a single Dataframe.

  • Print 5-point summary of the numerical features and share insights.
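
The loading, merging and summary steps above can be sketched as follows. This is only a minimal illustration: the two small DataFrames stand in for the CSV files, the column `car_name` is a hypothetical name, and joining on the row index is an assumption, since the brief does not say which key the two files share.

```python
import pandas as pd

# Stand-in data (assumed). In the assignment these frames would come from
# pd.read_csv("Car name.csv") and pd.read_csv("Car-Attribute.csv").
car_names = pd.DataFrame({"car_name": ["car a", "car b", "car c", "car d"]})
car_attrs = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0],
    "cyl": [8, 8, 4, 4],
    "disp": [307.0, 350.0, 97.0, 85.0],
    "wt": [3504, 3693, 2130, 1835],
})

# Assumption: rows of the two files line up one-to-one, so join on the index.
cars = pd.concat([car_names, car_attrs], axis=1)

# describe() covers numeric columns only; keep the five-number summary rows.
five_point = cars.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)
```

If the real files share an explicit key column, `pd.merge` on that key would be the safer choice than an index-based join.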


2. Data Preparation and Analysis

  • Check and print the feature-wise percentage of missing values present in the data and impute with the most suitable approach.

  • Check for duplicate values in the data and handle them with the most suitable approach.

  • Plot a Pairplot for all the features.

  • Visualize a scatter plot for “wt” and “disp”. Datapoints should be distinguishable by “cyl”.

  • Share insights for Q2.d (the 'wt' vs 'disp' scatter plot).

  • Visualize a scatterplot for 'wt' and 'mpg'. Datapoints should be distinguishable by 'cyl'.

  • Share insights for Q2.f (the 'wt' vs 'mpg' scatter plot).

  • Check for unexpected values in all the features and identify the datapoints with such values.
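
One possible way to approach the data-quality checks in this step is sketched below. The toy frame, the column `hp` and the `"?"` placeholder are hypothetical, and median imputation and dropping duplicates are just one reasonable choice, not necessarily the "most suitable approach" for the real data. The plots are left out.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged car data (assumed values).
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, np.nan, 32.0, 32.0],
    "cyl": [8, 8, 4, 4, 4],
    "hp": ["130", "165", "?", "65", "65"],  # "?" is a hypothetical bad entry
})

# Unexpected values: coerce non-numeric entries to NaN so they show up as missing.
cars["hp"] = pd.to_numeric(cars["hp"], errors="coerce")

# Feature-wise percentage of missing values.
missing_pct = cars.isna().mean() * 100
print(missing_pct)

# Assumed strategy: fill numeric gaps with the column median.
num_cols = cars.select_dtypes("number").columns
cars[num_cols] = cars[num_cols].fillna(cars[num_cols].median())

# Assumed strategy: drop exact duplicate rows.
print("duplicate rows:", cars.duplicated().sum())
cars = cars.drop_duplicates().reset_index(drop=True)
```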


3. Clustering

  • Apply K-Means clustering for 2 to 10 clusters.

  • Plot a visual and find elbow point.

  • On the above visual, highlight which are the possible Elbow points.

  • Train a K-means clustering model once again on the optimal number of clusters.

  • Add a new feature in the DataFrame which will have labels based upon cluster value.

  • Plot a visual and color the datapoints based upon clusters.

  • Pass a new DataPoint and predict which cluster it belongs to.
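
The clustering tasks above might look roughly like this with scikit-learn. The data is synthetic (three artificial groups of 'wt' and 'mpg' values), the choice of three clusters is an assumption made for that synthetic data, and the elbow plot itself is omitted; only the inertia values that would be plotted are computed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three loose groups of (wt, mpg) values (assumed).
rng = np.random.default_rng(0)
centers = np.array([[2000.0, 32.0], [3000.0, 22.0], [4200.0, 14.0]])
points = np.vstack(
    [c + rng.normal(scale=[100.0, 1.5], size=(40, 2)) for c in centers]
)
cars = pd.DataFrame(points, columns=["wt", "mpg"])

# K-means is distance based, so put the features on a common scale first.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(cars[["wt", "mpg"]])

# Fit K-means for 2 to 10 clusters and record the inertia for an elbow plot.
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

# Assumption: the elbow sits at k=3 for this synthetic data.
best_k = 3
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_scaled)
cars["cluster"] = final_model.labels_

# Predict the cluster of a new, hypothetical car.
new_car = pd.DataFrame({"wt": [2100.0], "mpg": [30.0]})
new_label = final_model.predict(scaler.transform(new_car))[0]
print(inertias)
print("new car cluster:", new_label)
```

Plotting `inertias` against `k` gives the elbow curve; on real data the elbow is often less sharp, so more than one value of `k` may be worth highlighting.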



Part B

  • Domain: Automobile

  • Context: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

  • Data Description: The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

  • All the features are numeric i.e., geometric features extracted from the silhouette.

  • Project Objective: Apply a dimensionality reduction technique (PCA), train a model, and compare the relative results.


Steps and Tasks

1. Data Understanding and Cleaning

  • Read ‘vehicle.csv’ and save as a DataFrame.

  • Check percentage of missing values and impute with correct approach.

  • Visualize a Pie-chart and print percentage of values for variable 'class'.

  • Check for duplicate rows in the data and handle them with the correct approach.

2. Data Preparation

  • Split data into X and Y. [Train and Test optional]

  • Standardize the Data.
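
A minimal sketch of the split and standardization steps follows. Since ‘vehicle.csv’ is not available here, `make_classification` generates a stand-in with 18 numeric features and three classes; the feature count, the column names `f0` to `f17` and the 75/25 split are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for vehicle.csv (assumed shape and column names).
features, target = make_classification(
    n_samples=300, n_features=18, n_informative=8, n_redundant=6,
    n_classes=3, random_state=0,
)
vehicles = pd.DataFrame(features, columns=[f"f{i}" for i in range(18)])
vehicles["class"] = target

# Split into predictors X and target y.
X = vehicles.drop(columns="class")
y = vehicles["class"]

# Optional train/test split, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
print(X_train_std.shape, X_test_std.shape)
```

Fitting the scaler on the training portion only keeps information from the test rows out of the preprocessing.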

3. Model Building

  • Train a base Classification model using SVM.

  • Print Classification metrics for train data.

  • Apply PCA on the data with 10 components.

  • Visualize Cumulative Variance Explained with Number of Components.

  • Draw a horizontal line on the above plot to highlight the threshold of 90%.

  • Apply PCA on the data. This time Select Minimum Components with 90% or above variance explained.

  • Train SVM model on components selected from above step.

  • Print Classification metrics for train data of above model and share insights.
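
The model-building steps above could be sketched as below, again on a synthetic stand-in for ‘vehicle.csv’ (an assumption) and with default `SVC` settings. The cumulative-variance plot and the 90% line are left out; `cum_var` holds the values that would be plotted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle data (assumed: 18 features, 3 classes).
X, y = make_classification(
    n_samples=300, n_features=18, n_informative=8, n_redundant=6,
    n_classes=3, random_state=0,
)
X_std = StandardScaler().fit_transform(X)

# Base SVM on all standardized features, evaluated on the training data.
base_svm = SVC().fit(X_std, y)
print(classification_report(y, base_svm.predict(X_std)))

# PCA with 10 components and the cumulative share of variance explained.
pca10 = PCA(n_components=10).fit(X_std)
cum_var = np.cumsum(pca10.explained_variance_ratio_)
print(cum_var)

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that share.
pca90 = PCA(n_components=0.90).fit(X_std)
X_pca = pca90.transform(X_std)
print("components kept:", pca90.n_components_)

# SVM trained on the reduced components.
pca_svm = SVC().fit(X_pca, y)
print(classification_report(y, pca_svm.predict(X_pca)))
```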

4. Performance Improvement

  • Train another SVM on the components out of PCA. Tune the parameters to improve performance.

  • Share best Parameters observed from above step.

  • Print Classification metrics for train data of above model and share relative improvement in performance in all the models along with insights.
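
One way to tune the SVM on the PCA components is a grid search over a pipeline, sketched below. The data is a synthetic stand-in for ‘vehicle.csv’, and the parameter grid is a small assumed example rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle data (assumed: 18 features, 3 classes).
X, y = make_classification(
    n_samples=300, n_features=18, n_informative=8, n_redundant=6,
    n_classes=3, random_state=0,
)

# Scaling, PCA (90% variance) and the SVM are refit inside each CV fold.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.90), SVC())

# Assumed, deliberately small grid over the RBF-kernel SVM parameters.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Keeping the preprocessing inside the pipeline means each cross-validation fold is scaled and projected using only its own training rows, which gives a fairer comparison between parameter settings.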


5. Data Understanding and Cleaning

  • Explain pre-requisite/assumptions of PCA.

  • Explain advantages and limitations of PCA.




To get the solution to the above problem, leave a comment in the comment section below or send your query to:


realcode4you@gmail.com

Here you get code without any plagiarism issues at an affordable price.