next up previous contents
Next: Other clustering: Up: Clustering Previous: Similarity Clustering:

Algorithms:

The most famous algorithm from this group is probably the K-means algorithm, where K is a predefined number of clusters [18]. The algorithm starts out with randomly associating data-entities to one of the K clusters. Then the algorithm loops through the following steps until it converges:
1.
Calculate the center data-point (mean) for each cluster.
2.
For each data-record compare the distance-measure to each of the K cluster-centers. Associate the record to the closest cluster.
Note that the K-means algorithm does not work for nominal data as `mean' is not defined on nominal or even ordinal data. A variation of the K-means algorithm is the ``K-modes'' algorithm. Instead of calculating in step 1 the mean for each cluster, the mode of the cluster-distribution is used (the mode is defined as the most occurring value). The problem is that for a reasonable mode many data entities distributed over few values are needed.

For clustering on both nominal and continuous data the ``K-prototypes'' algorithm is proposed which uses modes for nominal and ordinal variables and means for continuous variables. Note that for defining the prototype (center) we only need to consider variables that are used by our given distance-measure. If the distance-measure is defined purely on nominal data then the K-modes algorithm is sufficient, the K-mean algorithm is sufficient if only continuous variables are evaluated for the distance.

Other methods for similarity clustering include hierachical clustering, link clustering, nearest-beighbor clustering, fuzzy clustering, and Kohonen's self-organizing maps.


next up previous contents
Next: Other clustering: Up: Clustering Previous: Similarity Clustering:
Thomas Prang
1998-06-07