Parallel Implementation Of K-Means Algorithm Using Hadoop

Clustering is one of the most important tasks in data mining; its primary purpose is to group similar data together. Clustering large datasets remains a challenge. In recent years, data clustering has been studied extensively, and many methods and theories have been developed. Hadoop is a software framework for the distributed processing of vast amounts of data across clusters of computers using the Map-Reduce programming model. The Map-Reduce computing model has two phases: a map phase and a reduce phase. For K-Means, the map phase calculates the distance between each point and each cluster center and assigns each point to its nearest cluster. All the points that belong to the same cluster are sent to a single reducer. The reduce phase calculates the new cluster centers for the next Map-Reduce job. Map-Reduce thus parallelizes problems that involve large datasets across computing clusters, which makes it well suited to clustering large datasets. This paper studies the parallel implementation of the K-Means clustering algorithm using the Map-Reduce computing model of Hadoop on different datasets.

Keywords: Data Mining, Data Clustering, Parallel Computing, Map-Reduce, K-Means algorithm, Hadoop, HDFS, Machine Learning.
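
The map and reduce phases described above can be illustrated with a minimal single-process sketch in plain Python. It is not the Hadoop implementation studied in this paper: the function names (`map_phase`, `shuffle`, `reduce_phase`, `kmeans_iteration`), the use of Euclidean distance, and the rule that an empty cluster keeps its previous center are all assumptions made for illustration. On a real cluster, Hadoop performs the grouping step (`shuffle` here) between the two phases.

```python
import math
from collections import defaultdict


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance, assumed)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def map_phase(points, centers):
    """Map: emit one (cluster_index, point) pair per input point."""
    for point in points:
        yield nearest_center(point, centers), point


def shuffle(pairs):
    """Group emitted values by key, standing in for the grouping Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(members):
    """Reduce: compute the new center of one cluster as the coordinate-wise mean of its points."""
    dim = len(members[0])
    return tuple(sum(m[d] for m in members) / len(members) for d in range(dim))


def kmeans_iteration(points, centers):
    """Run one map/shuffle/reduce round and return the updated list of centers."""
    groups = shuffle(map_phase(points, centers))
    # Assumption: a cluster that received no points keeps its previous center.
    return [
        reduce_phase(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    centers = [(0, 0), (10, 10)]
    print(kmeans_iteration(points, centers))
```

Each call to `kmeans_iteration` corresponds to one Map-Reduce job; repeating it with the returned centers until they stop changing gives the full iterative algorithm.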