Table of Contents
1. Introduction
2. Methodology
2.1 Data Preparation
2.2 Sparse Vector Representation
2.3 Initial Centroid Initialization
2.4 K-Means Algorithm Execution
2.5 Evaluation
3. Results and Discussion
3.1 Optimal Number of Clusters
3.2 Cluster Analysis with Different Distance Measures
3.3 Limitations of MapReduce and Hadoop's MapReduce Engine
3.3.1 Scalability and Efficiency
3.3.2 Communication Overhead
3.3.3 Storage Requirements
3.3.4 Complexity of Development
4. Conclusion
5. References
Large data sets can be processed and produced using the MapReduce programming model and its associated implementation. The model is defined by a map procedure, which filters and sorts the data, and a reduce procedure, which performs a summary operation. Using straightforward programming paradigms such as MapReduce, the Hadoop software library provides a framework that enables the distributed processing of massive data volumes across clusters of computers.
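To make the map and reduce roles concrete, the following is a minimal in-memory Python sketch of the map, shuffle, and reduce flow. It illustrates the pattern only, not Hadoop's actual implementation, and the word-count task is a stand-in example.

from collections import defaultdict

def mapper(record):
    # Emit (key, value) pairs; here a toy word count.
    for word in record.split():
        yield word, 1

def reducer(key, values):
    # Summary operation over all values that share a key.
    return key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)          # shuffle: group mapper output by key
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(k, v) for k, v in groups.items()]

print(run_mapreduce(["hadoop mapreduce", "hadoop mahout"]))
# [('hadoop', 2), ('mapreduce', 1), ('mahout', 1)]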
Cluster analysis is a fundamental technique in data mining and machine learning for grouping data items that share similar characteristics. In this study we investigate cluster analysis using Apache Mahout, a scalable machine learning toolkit built on top of Apache Hadoop. We implemented the K-Means clustering algorithm with several distance measures, namely Euclidean, Manhattan, and Cosine, and assessed the outcomes. Additionally, we compared the clusters obtained with the different distance measures and determined the optimal number of clusters (K) for the K-Means algorithm.
The National Climatic Data Center (NCDC) provided the weather dataset used in the experiment, which contained hourly records for four months of 2007. The goal was to calculate a number of descriptive statistics: the difference between the daily maximum and minimum wind speed, the daily minimum relative humidity, the daily mean and variance of the dew point temperature, and the correlation matrix showing the monthly correlation between relative humidity, wind speed, and dry bulb temperature.
Q1) For the MapReduce tasks, a pseudocode implementation and Python code utilising the Pandas library were provided.
Mapper
The mapper function takes a row from the dataset and emits a key-value pair, with the date as the key and a tuple of wind speed, relative humidity, dew point temperature, and dry bulb temperature as the value.
function MAPPER(row):
    parse the row into fields
    date = row's date
    wind_speed = row's wind speed
    relative_humidity = row's relative humidity
    dew_point_temp = row's dew point temperature
    dry_bulb_temp = row's dry bulb temperature
    emit (date, (wind_speed, relative_humidity, dew_point_temp, dry_bulb_temp))
Reducer
The reducer function receives the key-value pairs grouped by date from the mapper and computes the required statistics.
function REDUCER(date, values):
    wind_speeds, relative_humidities, dew_point_temps, dry_bulb_temps = unzip values into separate lists
    diff_wind_speed = maximum of wind_speeds - minimum of wind_speeds
    min_relative_humidity = minimum of relative_humidities
    mean_dew_point_temp = mean of dew_point_temps
    variance_dew_point_temp = variance of dew_point_temps
    correlation_matrix = correlation among relative_humidities, wind_speeds, and dry_bulb_temps
    emit (date, (diff_wind_speed, min_relative_humidity, mean_dew_point_temp, variance_dew_point_temp, correlation_matrix))
Instructions to Run the Code
Load your data into a pandas DataFrame. Make sure your DataFrame has the columns 'Date', 'Wind Speed (kt)', '% Relative Humidity', 'Dew Point Temp', 'Dry Bulb Temp'. The 'Date' column should be the index of the DataFrame.
Replace 'data' in the Python code with the variable name of your DataFrame.
Run the Python code. It will print the calculated statistics.
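Since the original Python listing is not reproduced here, the following is a minimal Pandas sketch of the same computations. It assumes, as described above, a DataFrame named data whose 'Date' index has been parsed as datetimes and which has the columns 'Wind Speed (kt)', '% Relative Humidity', 'Dew Point Temp', and 'Dry Bulb Temp'.

# `data` is the pandas DataFrame loaded as described in the instructions above.

# Daily statistics: wind speed range, minimum relative humidity, and the mean
# and variance of dew point temperature.
daily = data.groupby(data.index.date).agg(
    diff_wind_speed=('Wind Speed (kt)', lambda s: s.max() - s.min()),
    min_relative_humidity=('% Relative Humidity', 'min'),
    mean_dew_point_temp=('Dew Point Temp', 'mean'),
    variance_dew_point_temp=('Dew Point Temp', 'var'),
)

# Monthly correlation matrix between relative humidity, wind speed, and dry bulb temperature.
monthly_corr = (
    data[['% Relative Humidity', 'Wind Speed (kt)', 'Dry Bulb Temp']]
    .groupby(data.index.month)
    .corr()
)

print(daily)
print(monthly_corr)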
We followed a step-by-step approach to conduct the cluster analysis using Apache Mahout:
The dataset we utilised consisted of text documents, which the 'seqdirectory' command converted into sequence files. These sequence files were the starting point for further processing.
Using the 'seq2sparse' command, we produced sparse vectors to represent the textual data as numerical values. The TF-IDF approach was used to normalise and weight the vectors, giving words with greater discriminating power higher weights.
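Mahout's 'seq2sparse' performs this weighting over sequence files on the cluster; purely as a small stand-in illustration of the TF-IDF idea, the sketch below uses scikit-learn's TfidfVectorizer on an in-memory list of made-up documents.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "heavy rain and strong wind",
    "clear sky with light wind",
    "heavy rain expected again",
]

# Terms that discriminate between documents receive higher TF-IDF weights;
# the result is a sparse matrix with one row per document.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))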
Using the 'kmeanspp' command, we initialised the centroids for the K-Means algorithm. This command applies the K-Means++ initialisation method, which helps select a well-spread set of initial centroids.
The K-Means algorithm was then executed using the 'kmeans' command. Input directory, maximum iterations, distance measure (Euclidean, Manhattan, or Cosine), and the desired number of clusters (K) were all supplied.
By taking into account evaluation measures including cluster purity, silhouette score, and within-cluster sum of squares (WCSS), we assessed the calibre of the final clustering solution. In order to estimate the ideal number of clusters, we also used the elbow approach and plotted the average distance to the centroid as a function of K.
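Mahout reports cluster statistics through its own tooling; the sketch below illustrates the elbow idea only, computing the within-cluster sum of squares (WCSS) and silhouette score for a range of K with scikit-learn's KMeans on synthetic stand-in data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))   # stand-in for the TF-IDF document vectors

# The "elbow" is the value of K beyond which the decrease in WCSS flattens out.
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = model.inertia_                      # within-cluster sum of squares
    sil = silhouette_score(X, model.labels_)   # higher is better
    print(k, round(wcss, 1), round(sil, 3))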
The MapReduce algorithm successfully computed the required statistics, underlining the efficacy of the methodology for large-scale data processing tasks.
As a methodology, MapReduce has a number of benefits. Because it abstracts away the underlying hardware and system architecture, developers can concentrate on writing the map and reduce functions, which simplifies coding. Its inherent scalability and fault tolerance make MapReduce well suited to processing huge datasets on commodity hardware.
It is not without its restrictions, though. MapReduce works well when a task can be divided into independent parts, but it is less effective when a task requires globally shared state, as in graph processing and iterative algorithms. Furthermore, the MapReduce computing architecture has high latency and was not intended for real-time processing, where results are required immediately.
Although Hadoop's MapReduce engine offers a stable and scalable environment for carrying out MapReduce tasks, it also presents several difficulties. Due to its disk-based architecture, Hadoop is less suited for workloads requiring repeated iterations and high-speed data processing because it writes intermediate data to disc, incurring significant cost. A Hadoop cluster must also be set up and maintained, which takes considerable time and knowledge.
The following outcomes were attained following the cluster analysis we carried out with Apache Mahout:
We found the optimal number of clusters (K) by applying the elbow approach to the plot of average distance to the centroid. At a particular value of K, the rate of decline of the average distance slowed noticeably; this value marks the point beyond which increasing K does not significantly improve cluster quality.
The K-Means clustering algorithm was run with the Euclidean, Manhattan, and Cosine distance measures. The results show that the clustering solutions obtained with each distance measure differ, and the appropriate choice of measure depends on the type of data and the intended cluster characteristics.
For instance, the Euclidean distance measure is appropriate for continuous numerical data, while the Manhattan distance, which sums absolute differences along each dimension, is often preferred when robustness to outliers is desired. For text-based data, the Cosine distance measure is frequently employed because it considers the angle between the vectors rather than their magnitude.
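As a small numerical illustration of how the three measures behave (using SciPy's distance functions rather than Mahout's implementations), consider two vectors that point in the same direction but differ in magnitude:

from scipy.spatial.distance import euclidean, cityblock, cosine

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice the magnitude

print(euclidean(a, b))   # about 2.24: sensitive to the difference in magnitude
print(cityblock(a, b))   # 3.0: Manhattan distance, sum of absolute differences
print(cosine(a, b))      # 0.0: cosine distance ignores magnitude, only direction matters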
Although Apache Mahout uses Hadoop's MapReduce computing engine for distributed processing, it's important to take into account the drawbacks of this method:
MapReduce is intended for the distributed processing of huge datasets. However, network communication and disc I/O overhead can affect overall performance, particularly for iterative algorithms such as K-Means, and may constrain the scalability and effectiveness of the cluster analysis process.
Between the map and reduce phases of a MapReduce framework, the data must be shuffled and sorted, which adds to the communication overhead. Particularly when working with high-dimensional data or intricate distance calculations, this overhead might constitute a bottleneck.
MapReduce algorithms sometimes include numerous iterations, necessitating the storage and exchange of interim results between steps. When working with huge datasets, this storage demand might be high and result in increased disc space utilisation.
It takes a thorough understanding of distributed systems and parallel processing to implement algorithms utilising MapReduce. It frequently entails addressing low-level details and writing bespoke code, which can make the development process complicated and error-prone.
In this study, we investigated cluster analysis using the K-Means clustering technique and Apache Mahout. We put the algorithm into practice with various distance metrics, chose the optimal number of clusters, and assessed the outcomes. In the context of cluster analysis, we also discussed the drawbacks of the MapReduce methodology and computing engine.
Despite these drawbacks, MapReduce remains a potent tool for handling massive datasets, especially when combined with frameworks like Hadoop. However, the particular requirements and limitations of the task at hand should guide the choice of technique; tasks requiring low latency and iterative computation may be better served by platforms such as Spark, which offers in-memory processing.
While MapReduce offers a framework for scaling cluster analysis, it has inherent limits in terms of scalability, communication costs, storage needs, and complexity of development. When creating and implementing cluster analytic solutions utilising Apache Mahout or any other MapReduce-based method, these restrictions must be taken into account.
Practitioners can select the best tools and approaches for their unique cluster analysis assignments by being aware of the advantages and disadvantages of the MapReduce paradigm.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc.
Anil, R., Capan, G., Drost-Fromm, I., Dunning, T., Friedman, E., Grant, T., ... & Yilmazel, Ö. (2020). Apache Mahout: Machine learning on distributed dataflow systems. The Journal of Machine Learning Research, 21(1), 4999-5004.
Solanki, R., Ravilla, S. H., & Bein, D. (2019, January). Study of distributed framework Hadoop and overview of machine learning using Apache Mahout. In 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC) (pp. 0252-0257). IEEE.
Vats, D., & Sharma, A. (2022, January). A collaborative filtering recommender system using Apache Mahout, ontology and dimensionality reduction technique. In 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) (pp. 1-12). IEEE.
Al Waili, A. R. (2023). High performance scalable big data and machine learning using Apache Mahout. Journal of Misan Researches, 13(26-1).