Exploratory Factor Analysis And Clustering In R Programming

realcode4you
Jun 3, 2021

Exploratory Factor Analysis (EFA), often simply called factor analysis, is a statistical technique used to identify the latent relational structure among a set of variables and reduce them to a smaller number of variables. This essentially means that the variance of a large number of variables can be described by a few summary variables, i.e., factors. Here is an overview of exploratory factor analysis in R.
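
As a rough illustration of that idea, here is a minimal sketch using base R's factanal() on simulated data. The variable names (sugar, cocoa and so on) and the two latent factors are invented for the example and do not come from any real data set.

```r
# Minimal EFA sketch on simulated data (all names here are hypothetical).
set.seed(42)
n <- 300

# Two latent factors drive six observed variables.
sweet  <- rnorm(n)
bitter <- rnorm(n)
ratings <- data.frame(
  sugar   = 0.8 * sweet  + rnorm(n, sd = 0.5),
  caramel = 0.7 * sweet  + rnorm(n, sd = 0.5),
  vanilla = 0.9 * sweet  + rnorm(n, sd = 0.5),
  cocoa   = 0.8 * bitter + rnorm(n, sd = 0.5),
  acid    = 0.7 * bitter + rnorm(n, sd = 0.5),
  astring = 0.9 * bitter + rnorm(n, sd = 0.5)
)

# Fit a two-factor model with base R's maximum-likelihood factanal().
efa_fit <- factanal(ratings, factors = 2, rotation = "varimax")

# Loadings show which observed variables group under each factor.
print(efa_fit$loadings, cutoff = 0.3)
```

With data built this way, the first three variables should load mainly on one factor and the last three on the other, which is the "few summary variables" picture described above.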

Cluster analysis is a form of unsupervised learning. A cluster is a group of observations that share similar features. Clustering is therefore more about discovery than prediction: the algorithm searches for similarity in the data. For instance, you can use cluster analysis for the following applications:

Customer segmentation: Looks for similarity between groups of customers
Stock market clustering: Groups stocks based on performance
Reduce dimensionality of a dataset by grouping observations with similar values

Let's go through the EFA and clustering analysis example below:

Instructions

You have to submit two files: R-code to run your program and your report in PDF or MS Word format.

To obtain the maximum available marks you should aim to:

1. Code all requested components (30%).

2. Use a clear style of code presentation (10%). Code clarity is an important part of your submission. Thus you should choose meaningful variable names and use comments. You don't need to comment every single line, as this will affect readability; however, you should aim to comment at least each section of code.

3. Have the code run successfully (5%).

4. Output the information in a presentable manner and present your written analysis of the output (55%).

Plagiarism is a specific form of academic misconduct. Although the University encourages discussing work with others and the Social Forum will support this, ultimately this submission is to represent your individual work. If plagiarism is found, all parties will be penalised. You should retain copies of all assignment computer files used during development. These files must remain unchanged after submission, for the purpose of checking if required.

For the purpose of this exam, a “paragraph” is considered to consist of approximately 6 to 8 lines. You are welcome to exceed this amount.

Dataset

The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in supermarkets. Two groups were asked to rate the qualities of the chocolates. The first group contained a panel of sensometric experts. The second group contained a panel of volunteers chosen to represent ‘regular shoppers’ who underwent a three-hour sensometric training session before rating the qualities of the chocolate.

The responses were recorded on a scale from 0 to 10, with 0 indicating the absence of the sensometric quality and 10 indicating that it is fully present. There are 14 sensometric variables (Chocolate Aroma through to Granular Texture in the data file) and a variable Role indicating whether the responses were provided by experts or amateurs.
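
To make that layout concrete, here is a small sketch that builds a stand-in data frame with the same shape and splits it by Role. The real file name, the twelve middle column names and the exact Role labels are not given in the text, so everything named below (Quality2 to Quality13, "Expert", "Amateur", the row count) is an assumption for illustration only.

```r
# Hypothetical stand-in for the chocolate data: the real file name and the
# twelve middle column names are not given in the text, so they are invented.
set.seed(1)
n_rows <- 60
sensory_names <- c("ChocolateAroma", paste0("Quality", 2:13), "GranularTexture")

# Ratings on the 0-10 scale described above, one column per sensometric variable.
scores <- matrix(round(runif(n_rows * 14, min = 0, max = 10), 1), ncol = 14,
                 dimnames = list(NULL, sensory_names))
chocolate <- data.frame(scores,
                        Role = factor(rep(c("Expert", "Amateur"), each = n_rows / 2)))

# Basic checks before any analysis: scale limits and missing values.
stopifnot(all(chocolate[sensory_names] >= 0 & chocolate[sensory_names] <= 10))
colSums(is.na(chocolate))

# Separate data sets for the two EFAs; Role is dropped from both.
expert_data  <- subset(chocolate, Role == "Expert",  select = -Role)
amateur_data <- subset(chocolate, Role == "Amateur", select = -Role)
table(chocolate$Role)
```

With the real data you would replace the simulated block with a read step (for example read.csv() on the supplied file) and keep the checks and the split.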

Your task

It is of interest to determine if experts perceive supermarket chocolate differently to non-experts (the amateurs). You have to run an analysis and prepare a report using two types of analysis: EFA and clustering (weighted equally in terms of your grade). Specifically, your analysis should include:

Initial data discussion: Write an explanation of the data and of any data manipulation you perform prior to analysis.

Then, for each method separately:

Research methods introduction: Write a short explanation (approximately 1 paragraph) of the analysis to be performed.

Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for the expert responses, the other for the amateur responses. You may present the analyses side-by-side or in sequence; whatever you believe is best. For each Exploratory Factor Analysis, you only need to include the following:

If appropriate, Cronbach Alpha output and a short discussion (2 to 3 lines) of whether the data is trustworthy and why.
Correlation output of your choosing (graphical and/or numerical) with an accompanying discussion (3 to 4 lines). If numerical, round the correlations to 2 digits.
A single paragraph explaining the outcome of the determinant test, Bartlett’s test of sphericity and the KMO statistic for both data sets. Do not include R output.
Your decision regarding the number of factors to estimate (a scree plot may be shown, but do not show the R console output).
The FINAL factor solution. You do not need to discuss results of any of the other solutions, however you should justify your final factor solution, including loadings, and name the factors in each analysis. You should also include up to two sentences indicating whether the test of residuals was passed and whether the factors are correlated.
All factors should be named and an explanation as to how you come up with these names should be included.
Based on the factor analysis results and your chosen factor names, discuss the factors that have emerged from the study. What types of differences or similarities (if any) exist between the expert and amateur sensometric ratings?
Conclusions: write 2 paragraphs of conclusions based on your analysis.
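
The adequacy checks named in the list above (Cronbach's alpha, the determinant, Bartlett's test of sphericity and the KMO statistic) can be sketched with base R formulas, as below. This is an illustration on simulated items, not the assignment's data, and the thresholds you apply to the results are up to your own justification. The psych package offers alpha(), cortest.bartlett() and KMO() for the same quantities; the hand-written versions here are only meant to keep the sketch dependency-free.

```r
# Simulated ratings (hypothetical data): one latent factor behind five items.
set.seed(7)
n <- 200
latent <- rnorm(n)
items <- as.data.frame(sapply(1:5, function(i) 0.7 * latent + rnorm(n, sd = 0.6)))

p <- ncol(items)
cor_mat <- cor(items)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total).
cronbach_alpha <- (p / (p - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))

# Determinant of the correlation matrix (values near 0 signal multicollinearity).
cor_det <- det(cor_mat)

# Bartlett's test of sphericity: H0 is that cor_mat is an identity matrix.
bartlett_chisq <- -(n - 1 - (2 * p + 5) / 6) * log(cor_det)
bartlett_df    <- p * (p - 1) / 2
bartlett_p     <- pchisq(bartlett_chisq, df = bartlett_df, lower.tail = FALSE)

# KMO: squared correlations relative to squared correlations plus squared
# partial correlations, using off-diagonal entries only.
inv_cor <- solve(cor_mat)
partial <- -inv_cor / sqrt(outer(diag(inv_cor), diag(inv_cor)))
off     <- row(cor_mat) != col(cor_mat)
kmo     <- sum(cor_mat[off]^2) / (sum(cor_mat[off]^2) + sum(partial[off]^2))

round(c(alpha = cronbach_alpha, det = cor_det, bartlett_p = bartlett_p, KMO = kmo), 3)
```

For the assignment you would run the same steps twice, once on the expert responses and once on the amateur responses, and report the outcome in prose rather than as console output.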

Clustering Analysis: For this question, you are asked to conduct clustering analysis using both hierarchical and partitional clustering techniques for the entire data set combining experts and amateurs. The variable Role should not be used for clustering; use only the 14 sensometric variables. Specifically, your analysis should include:

Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-based method from single, complete or average linkage, or Ward’s method. Ensure you justify your choice in your write-up and include the resulting dendrogram, as well as a discussion of the outcomes of hierarchical clustering on your data.
Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain and include any relevant R output (including graphics) supporting your choice of k, the number of clusters.
Validation: as a form of cluster validation, consider the following:
- If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of the new value for k.
- If there are no obvious outliers/distances that should be removed, then explain this conclusion with justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that used above. Include justification of your choice for the new value for k.
Select the one best solution (from any method), analyse the values of the 14 sensometric variables for each cluster, describe the observed patterns and name your clusters. Compare cluster membership to the variable Role (use the function table to get a cross-tabulation). Are there any patterns?
Conclusions: write 2 paragraphs of conclusions based on your analysis including a statement regarding which clustering solution is the better one and why.
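
The clustering steps listed above can be sketched roughly as follows. The data are simulated (two well-separated groups of raters), the Role labels are assumed, and base R's hclust() is used in place of cluster::agnes(), which is an alternative agglomerative implementation. Since every variable shares the same 0 to 10 scale, no rescaling is applied here; that is a choice you would need to justify for the real data.

```r
# Hypothetical stand-in data: two groups of raters with different mean profiles.
set.seed(123)
n_per_group <- 40
ratings <- rbind(
  matrix(rnorm(n_per_group * 14, mean = 3, sd = 1), ncol = 14),
  matrix(rnorm(n_per_group * 14, mean = 7, sd = 1), ncol = 14)
)
role <- rep(c("Expert", "Amateur"), each = n_per_group)

# Hierarchical (agglomerative) clustering with Ward's method on Euclidean distances.
ward_tree <- hclust(dist(ratings), method = "ward.D2")
plot(ward_tree, labels = FALSE, main = "Ward dendrogram")
ward_groups <- cutree(ward_tree, k = 2)

# Elbow check for K-means: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(ratings, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Fit the chosen solution and inspect the cluster profiles (one row per cluster).
km_fit <- kmeans(ratings, centers = 2, nstart = 25)
round(km_fit$centers, 1)

# Cross-tabulate cluster membership against Role (Role was not used to cluster).
table(Cluster = km_fit$cluster, Role = role)
```

For the validation step, the same kmeans() call can be re-run after removing any outliers you identify, or with a different value of k, and the cross-tabulations compared.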