Download the file data.zip from the CMM510 assessment area of CampusMoodle.
Unzip data.zip. It contains the following files:
- sampleWater1.csv: contains the first dataset about water base analysis of samples. In the remainder of this document, this dataset is referred to as water1.
- sampleWater2.csv: contains a second dataset about water base analysis of samples. In the remainder of this document, this dataset is referred to as water2.
- testWater.csv: contains a third dataset about water base analysis of samples, which you will use for testing. In the remainder of this document, this dataset is referred to as testWater.
Each of the above files contains a dataset with details of analyses of the water base using the following features (attributes):
siteIDScheme – either eionetMonitoringSiteCode or euMonitoringSiteCode.
WBCategory – the water base category (GW, LW, RW, TW).
determinandC – the determinand code (CAS_14798-03-9:328, CAS_7723-14-0:632, EEA_3131-01-9:700, EEA_3132-01-2:825).
analysed – fraction of the sample analysed (dissolved, SPM, total).
media – type of media monitored (sediment, water).
NSamples – the number of samples.
minValue – minimum sample value used.
meanValue – mean sample value used.
maxValue – maximum sample value used.
sd – standard deviation of the sample values.
method – CEN/ISO code of the analytical method used.
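As a starting point for inspecting the files, the sketch below reads one CSV and reports its dimensions, the missing values per column, and the class distribution. It is only an illustration: the helper name inspect_water is hypothetical, and it assumes each file has a header row whose column names match the feature names listed above and that missing values appear as empty fields or NA.

```r
# Hypothetical helper (not part of the coursework files): read one CSV
# and collect a few basic facts about it.
# Assumption: the CSV has a header row including a WBCategory column.
inspect_water <- function(path) {
  df <- read.csv(path, stringsAsFactors = TRUE, na.strings = c("", "NA"))
  list(
    data    = df,
    dims    = dim(df),                               # rows, columns
    missing = colSums(is.na(df)),                    # NA count per column
    classes = table(df$WBCategory, useNA = "ifany")  # class distribution
  )
}

# Usage, assuming data.zip was unzipped into the working directory:
# w1 <- inspect_water("sampleWater1.csv")
# w1$dims; w1$missing; w1$classes
# str(w1$data); summary(w1$data)
```

Running the same helper on all three files makes it easy to compare their sizes and class balance side by side.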
TASK 1 Dataset Exploration and Classification
a. Load and inspect the three data files above. Use R code to explore and analyse the datasets, and highlight key observations about the data. [Word limit: 200 excluding code and/or plots]
b. Run two tree classifiers and an instance-based classifier on water1 and on a new dataset, water, containing both water1 and water2. The class is WBCategory. Note that you may have to pre-process the data files before you can use them for classification. Critically compare the performance of the 3 algorithms on the two datasets. In your explanations, include the performance metric(s), the evaluation method, the parameters used, and the size of the datasets. Give details of any data pre-processing. [Word limit: 200 excluding code and/or results]
c. If one algorithm's performance was better than another's, discuss any circumstances in which the lower-performing algorithm may be preferred, and state the confidence level (if any) at which the difference in performance is not statistically significant. [Word limit: 100, excluding code and/or plots]
d. Test the 3 models you trained using the water dataset on the testWater dataset and compare their performance. [Word limit: 100, excluding code and/or plots]
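One possible way to organise the evaluation in parts b and c is sketched below in base R. It is not a prescribed method: combine_water, cv_accuracy and compare_folds are hypothetical names, accuracy under k-fold cross-validation is just one choice of metric and evaluation method, and the classifier itself is passed in as a function so that any tree or instance-based learner can be plugged in. The paired t-test is only meaningful if both algorithms are evaluated on the same folds (same k and seed).

```r
# Build the combined dataset "water" from water1 and water2.
# Assumption: both data frames have identical column names.
combine_water <- function(water1, water2) {
  stopifnot(identical(names(water1), names(water2)))
  water <- rbind(water1, water2)
  water$WBCategory <- factor(water$WBCategory)
  water
}

# k-fold cross-validation: returns one accuracy value per fold.
# fit_predict(train, test) must return predicted class labels for test.
# Assumption: k does not exceed the number of rows.
cv_accuracy <- function(data, fit_predict, k = 10, seed = 1) {
  set.seed(seed)  # same seed gives the same folds for every algorithm
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    pred  <- fit_predict(train, test)
    mean(as.character(pred) == as.character(test$WBCategory))
  })
}

# Paired t-test on per-fold accuracies of two algorithms.
# The difference is not significant at any confidence level above
# 1 - p_value.
compare_folds <- function(acc_a, acc_b) {
  stopifnot(length(acc_a) == length(acc_b))
  res <- t.test(acc_a, acc_b, paired = TRUE)
  list(mean_diff = unname(res$estimate), p_value = res$p.value)
}

# Usage sketch with a decision tree from the rpart package:
# library(rpart)
# tree_fp <- function(train, test) {
#   model <- rpart(WBCategory ~ ., data = train, method = "class")
#   predict(model, newdata = test, type = "class")
# }
# water <- combine_water(water1, water2)
# acc_tree <- cv_accuracy(water, tree_fp, k = 10)
```

Passing the learner as a function keeps the folds identical across algorithms, which is what makes the paired comparison in part c valid.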
TASK 2 Clustering and Additional Insights
a. Cluster the water dataset using ONE clustering algorithm discussed in class, undertaking any pre-processing required. Discuss the ideal number of clusters and comment on what the clusters represent. Justify your choice of clustering algorithm and discuss whether the resulting clusters correlate with any attribute. [Word limit: 150, excluding code and/or plots]
b. Undertake one further data mining activity of your choice using one or more of the datasets available with this coursework to demonstrate your understanding of data mining. The data mining activity you choose must have been covered in this module (CMM510). [Word limit: 200, excluding code and/or plots]
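For part a of Task 2, the sketch below shows one way to pre-process the data and examine the number of clusters, assuming k-means is among the algorithms covered in class (the brief does not say which ones are). The helper names prepare_numeric and elbow_wss are hypothetical. The sketch keeps only numeric columns, drops rows with missing values and constant columns, standardises the rest, and computes the total within-cluster sum of squares for a range of k (the "elbow" heuristic).

```r
# Keep numeric columns, drop incomplete rows and constant columns,
# then standardise. "keep" records which original rows were retained.
prepare_numeric <- function(data) {
  num  <- data[, sapply(data, is.numeric), drop = FALSE]
  keep <- complete.cases(num)
  num  <- num[keep, , drop = FALSE]
  num  <- num[, sapply(num, sd) > 0, drop = FALSE]
  list(x = scale(num), keep = keep)
}

# Total within-cluster sum of squares for k = 1..max_k.
# Assumption: max_k is smaller than the number of distinct rows.
elbow_wss <- function(x, max_k = 8, seed = 1) {
  set.seed(seed)
  sapply(seq_len(max_k), function(k) {
    kmeans(x, centers = k, nstart = 10)$tot.withinss
  })
}

# Usage sketch (the value 4 below is a placeholder, not a recommendation):
# prep <- prepare_numeric(water)
# wss  <- elbow_wss(prep$x)
# plot(wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
# fit  <- kmeans(prep$x, centers = 4, nstart = 10)
# table(fit$cluster, water$WBCategory[prep$keep])
```

The final cross-tabulation is one simple way to check whether the clusters line up with an attribute such as WBCategory.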