
Similarity Clustering:

Most clustering algorithms partition the data based on how similar individual records are; the more similar two records are, the more likely they belong to the same cluster. The goal is to identify clusters that maximize the inter-cluster distance while minimizing the intra-cluster distance, so that we obtain clearly distinct groups of similar entities. This grouping introduces a ``natural'' unsupervised classification schema based on similarity under the given distance measure.
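To make this objective concrete, the following minimal sketch (Python with NumPy, not part of the original text) computes the average intra-cluster distance and the average inter-cluster centroid distance for a given partition; the function name and the use of centroids are illustrative assumptions rather than a prescribed method.

    import numpy as np

    def cluster_separation(points, labels):
        """Average intra-cluster and inter-cluster (centroid) distances.

        points : (n, d) array of records
        labels : length-n array of cluster assignments
        """
        clusters  = {c: points[labels == c] for c in np.unique(labels)}
        centroids = {c: members.mean(axis=0) for c, members in clusters.items()}

        # intra-cluster distance: mean distance of members to their own centroid
        intra = np.mean([np.linalg.norm(m - centroids[c], axis=1).mean()
                         for c, m in clusters.items()])

        # inter-cluster distance: mean pairwise distance between centroids
        # (requires at least two clusters)
        cs = list(centroids.values())
        inter = np.mean([np.linalg.norm(cs[i] - cs[j])
                         for i in range(len(cs))
                         for j in range(i + 1, len(cs))])
        return intra, inter

A partition is ``good'' in the sense above when inter is large relative to intra.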

Creating unsupervised classification schemas is also an important part of human recognition. Humans constantly group new things and attach natural language labels to these groups: trees, houses, cars, clouds, etc. These labels are abstractions which identify sets of entities that are similar in some aspects, while other aspects are unimportant. Shape and function are important characteristics of a house or car, while color is largely irrelevant; for trees, on the other hand, color carries more weight. It follows that different distance measures are needed for different ``classification schemas''.
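As a small illustration of this last point (again not taken from the text), a weighted distance lets one measure encode which attributes matter for a given classification schema; the attribute encoding and the weights below are hypothetical.

    import numpy as np

    def weighted_distance(x, y, weights):
        """Euclidean distance with per-attribute weights.

        A 'house/car' measure might weight shape and function highly and
        color near zero; a 'tree' measure would raise the weight on color.
        Attributes are assumed to be numerically encoded.
        """
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return np.sqrt(np.sum(np.asarray(weights, dtype=float) * diff ** 2))

Swapping the weight vector effectively swaps the notion of similarity, and with it the ``natural'' classification that clustering produces.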

The new ``natural'' classification schema constitutes knowledge that can be added to our relation as a new dimension. This new dimension encodes knowledge based on similarities under the chosen distance measure. The ``right'' choice of distance measure is therefore crucial and carries the implicit assumption that the induced similarities are meaningful for classification.
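A minimal sketch of this step, assuming a numeric relation and using scikit-learn's k-means purely as an example clustering algorithm (the data, the choice of k-means, and k=3 are illustrative assumptions, not prescribed by the text):

    import numpy as np
    from sklearn.cluster import KMeans

    # the original relation as an (n, d) numeric array (synthetic here)
    records = np.random.rand(200, 4)

    # derive the "natural" classification; k-means implies a (squared)
    # Euclidean distance measure, and k=3 is an arbitrary choice
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(records)

    # append the cluster label as a new dimension of the relation
    augmented = np.column_stack([records, labels])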

If some externally provided classification of the data is also available, then the overlap between the new ``natural'' classification and the given classification is of interest. Projecting onto these two dimensions and comparing their respective distributions can be used to investigate common properties. Iteratively trying to create clusters that are close to the given classification schema is also known as ``supervised clustering''.
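One possible way to quantify this overlap, sketched here with hypothetical labelings and standard scikit-learn utilities (neither appears in the original text), is a cross-tabulation of the two classifications together with a single agreement score such as the adjusted Rand index:

    import numpy as np
    from sklearn.metrics import adjusted_rand_score, confusion_matrix

    # hypothetical labelings: `given` is the external classification,
    # `natural` the cluster labels obtained in the previous step
    given   = np.array([0, 0, 1, 1, 2, 2, 2, 0])
    natural = np.array([1, 1, 0, 0, 2, 2, 0, 1])

    # cross-tabulation: how the clusters distribute over the given classes
    print(confusion_matrix(given, natural))

    # a single overlap score (1.0 = identical partitions, ~0 = random agreement)
    print(adjusted_rand_score(given, natural))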

As clusters are identified as distinct groups, the structural properties of the individual clusters can be investigated separately. This means that all supervised and unsupervised methods can be applied to each cluster on its own. For supervised classification problems this may lead to more accurate models and predictions; for unsupervised structure finding, significant differences between the patterns identified in different clusters can be examined. In both cases clustering can give insight into the structural relationships among ``natural'' classifications.
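A rough sketch of the per-cluster idea, using synthetic data, k-means, and a decision tree purely as stand-ins for whichever clustering and supervised methods one prefers (all names and parameters below are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    # X: feature matrix, y: an externally given class to predict (synthetic)
    X = np.random.rand(300, 5)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # fit one supervised model per cluster and inspect each separately
    models = {}
    for c in np.unique(clusters):
        mask = clusters == c
        models[c] = DecisionTreeClassifier(max_depth=3).fit(X[mask], y[mask])
        print(f"cluster {c}: {mask.sum()} records, "
              f"training accuracy {models[c].score(X[mask], y[mask]):.2f}")

Comparing the per-cluster models (or per-cluster patterns in the unsupervised case) is what exposes the structural differences between the ``natural'' groups.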

