Chapter 4 Cluster Analysis
Section 2 K-Means
Page 2 K-Means Introduction

Objectives

The objectives of this section are:
to explain the K-Means clustering algorithm
to introduce its limitations
to examine the various major parts

Outcomes

By the time you have completed this section you will be able to:
briefly explain the K-Means clustering algorithm
list the major steps in the algorithm

 

Definition

K-means is one of the oldest and most commonly used clustering algorithms. It is a prototype based clustering technique defining the prototype in terms of a centroid which is considered to be the mean of a group of points and is applicable to objects in a continuous n-dimensional space.

Description

The K-means algorithm involves randomly selecting K initial centroids where K is a user defined number of desired clusters. Each point is then assigned to a closest centroid and the collection of points close to a centroid form a cluster. The centroid gets updated according to the points in the cluster and this process continues until the points stop changing their clusters.  A high-level representation of the K-means algorithm is shown in Figure 1.

K-Means Algorithm

Initialization step: Initial centroids are can be selected randomly.

Assignment step: this step raises the question, how do you define “closest centriod”. In order determine which centroid is closest to a particular data point we have to use a proximity measure. The Euclidean distance is commonly used.

Re-estimation step: a simplistic view point is for the centroid to be calculated as the mean of the points that have been assigned to its grouping.