Chapter 4 Cluster Analysis
Section 2 K-Means
Page 2 K-Means Algorithm

Objectives

The objectives of this section are:
to explain the K-Means clustering algorithm
to introduce its limitations
to examine the various major parts

Outcomes

By the time you have completed this section you will be able to:
briefly explain the K-Means clustering algorithm
list the major steps in the algorithm

Definition

K-means is one of the oldest and most commonly used clustering algorithms. It is a prototype based clustering technique defining the prototype in terms of a centroid which is considered to be the mean of a group of points and is applicable to objects in a continuous n-dimensional space.

Description

The K-means algorithm involves randomly selecting K initial centroids where K is a user defined number of desired clusters. Each point is then assigned to a closest centroid and the collection of points close to a centroid form a cluster. The centroid gets updated according to the points in the cluster and this process continues until the points stop changing their clusters.  A high-level representation of the K-means algorithm is shown in Figure 1.

K-Means Algorithm

 

Initialization step: Choosing Initial Centroid

Randomly: different runs of the algorithm can produce poor results when the initial centriods are choosing randomly. It is important to realize that the choice of the initial centroid has a huge effect on the final result, for instance the figure below shows the resulting clusters based on two different randomly selected centroids. As one can see, the bottom left image is a more accurate representation.
Solutions to this problem