Chapter 4 Cluster Analysis
Section 2 K-Means
Page 4 K-Means Issues

Objectives

The objectives of this section are:
to explain the K-Means clustering algorithm
to introduce its limitations
to examine the various major parts

Outcomes

By the time you have completed this section you will be able to:
briefly explain the K-Means clustering algorithm
list the major steps in the algorithm

The K-Means basic algorithm creates a couple of additional issues that must be considered and in some situations resolved in order to provide a realistic output.

Handling Empty Clusters

This occurs when no points are assigned to a centriod during the assignment step, the re-calculation step does not get rid of this cluster, and it also does not re-calculate the centriod value because no points are being used and so essentially we will have an output with k-1 cluster.
The only solution is to choose a replacement centriod, this can be done by

Choosing the point that contributes most to SSE
Choose a point from the cluster with the highest SSE

If there are several empty clusters, the above can be repeated several times.           

Outliers

Outliers are points that are far away from the centriod and they can unduly influence centriod values and in turn this can skew cluster grouping and increase the amount of time needed to find an optimal solution. Outliers need to be discovered and eliminated beforehand so as to avoid these complications. It is important to note that not all outliers need to be removed in some situations like for data compression every point must be clustered. There a number of approaches used to avoid and remove outliers but that is beyond the scope of this course.

Reduce the SSE with Postprocessing

The objective is to reduce the SSE and one obvious way to do this is to increase the number of clusters. Two strategies that decrease the total SSE by increasing the number of clusters are to split the cluster or to introduce a new cluster centriod.

Another alternative is to decrease the number of clusters, while trying to minimize the increase in total SSE. This can be done by merging two clusters or by dispersing a cluster. In some situations we would prefer to keep constant the number of clusters and at the same time reduce the SSE, below are two strategies that decrease.