
Problem

In my project I deal with unsupervised approaches to finding structures, patterns, or relationships in large databases. Many methods have been developed for finding structure in continuous and ordered data, but the nominal aspects of this problem are often ignored.

An important characteristic of a database is how the entries (rows) within a table are related to each other. In a customer table the entries may be unrelated; in a table of daily temperature measurements the individual entries are related through time. Database tables usually contain one or more variables that uniquely identify and distinguish each entry (a functional relationship). Such variables, which often have no other purpose, are referred to as the ``support'' [26]. For example, a customer ID may uniquely identify customers, and a time stamp may identify temperature measurements. The kind of relationship among data entries can now be expressed by the support type. In the case of time or space coordinates the support is ordered. The consequence for structure finding is that we can look for behavior patterns over this support (time, space, etc.), for example the influence a previous entry has on the current one: what influence does yesterday's temperature (and perhaps some other climate variables) have on today's temperature? In the case of customer IDs, record numbers, and Social Security Numbers, the support is nominal. Here the structure finding needs to concentrate on general patterns within data entries. In this paper I focus on nominal support.
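To make the contrast concrete, the following minimal sketch shows what an ordered support allows. It is an illustration only: the table contents and the helper name `lagged_pairs` are assumptions, not part of any method described here. With a time stamp as support, each entry can be paired with its predecessor; with a nominal support such as a customer ID, the same sorting step would run but the resulting pairs would carry no meaning.

```python
from datetime import date

# Hypothetical table of daily temperatures: (support, value) pairs.
# The dates and readings are made up for illustration.
temperatures = [
    (date(1998, 6, 3), 21.5),
    (date(1998, 6, 1), 18.0),
    (date(1998, 6, 2), 19.5),
]


def lagged_pairs(rows):
    """Pair each value with the value of its predecessor along the support.

    This only makes sense when the support (first element of each row)
    is ordered, e.g. a time stamp.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return [(prev[1], curr[1]) for prev, curr in zip(ordered, ordered[1:])]


# Each pair is (yesterday's temperature, today's temperature).
pairs = lagged_pairs(temperatures)
```

Pairs of this kind would be the raw material for asking how a previous entry influences the current one.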

As with the support, I also want to concentrate on nominal data, that is, data fields whose ``real world'' counterparts are neither continuous nor ordered. Examples are 'Area code' (which could be sorted numerically, but the ordering would be meaningless), 'Name of friend', 'Product bought', and 'Kind of payment used'. Other examples include normalized string variables such as normalized addresses. Note that we can still define a distance measure (not necessarily a metric in the mathematical sense) on these kinds of fields, so that clustering is possible. For example, there is no intrinsic ordering of persons as friends, but we can have a distance measure of friendship; products have no intrinsic ordering (though they could be ordered by price, just as persons could be ordered by IQ, which would be an implicit relabeling of the fields), but they can have a similarity measure.
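One simple way to obtain such a distance on nominal records is to count the fields on which two records disagree. The sketch below uses this simple matching dissimilarity purely as an illustration; it is one common choice, not necessarily the measure used later in this work, and the function name and sample records are assumptions.

```python
def mismatch_dissimilarity(a, b):
    """Fraction of fields on which two nominal records disagree.

    Returns 0.0 for identical records and 1.0 when no field matches.
    Only equality tests are used, so no ordering of values is needed.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical records: (area code, kind of payment, product bought).
r1 = ("505", "credit card", "book")
r2 = ("505", "cash", "book")
r3 = ("212", "cash", "music CD")

d12 = mismatch_dissimilarity(r1, r2)  # differ in one of three fields
d13 = mismatch_dissimilarity(r1, r3)  # differ in all three fields
```

Note that the area codes are compared as labels only; their numeric order plays no role.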

The distinction between ``supervised'' and ``unsupervised'' Data Mining methods comes from the classification problem. Methods that use a training data set with correct classifications to learn specific predictive patterns are called supervised. Many neural networks, as well as logistic regression, Fisher analysis, etc., work this way. Methods that use only the data itself to find internal structure are called unsupervised. In summary, ``supervised'' denotes structure finding directed at one classification variable, while ``unsupervised'' means general structure finding.

In general, three data types are distinguished: nominal, ordinal, and scalar. Nominal values have no ordered relationship to each other. Ordinal values can be ordered, but the ordering has no associated distance: we cannot say that ``one value is twice as good as another''. Finally, scalar values, also referred to as continuous values, have a quantitative relationship associated with the ordering.
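The three types can be told apart by the comparisons that are meaningful for each. The following sketch illustrates this with made-up values; the field names, the rating scale, and the helper `rating_less_than` are assumptions chosen for the example.

```python
# Nominal: only equality is meaningful.
payment_a, payment_b = "cash", "credit card"
same_payment = payment_a == payment_b

# Ordinal: the order is meaningful, but differences and ratios are not.
# The position in this list encodes rank only, not a distance.
RATING_ORDER = ["poor", "fair", "good"]


def rating_less_than(a, b):
    """Return True if rating a ranks below rating b."""
    return RATING_ORDER.index(a) < RATING_ORDER.index(b)


# Scalar: ratios are meaningful, so "twice as much" can be stated.
price_a, price_b = 10.0, 20.0
price_ratio = price_b / price_a
```

Dividing the rank positions of two ratings would also produce a number, but that number would have no interpretation, which is exactly what separates ordinal from scalar values.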

Although I will concentrate my discussion of basic techniques, methods, and implementations on the more general unsupervised, nominal domain, I will also present some supervised methods and methods for continuous data for completeness. Relating them to the discussed background will help to ``see'' the complete picture and to better understand the unsupervised, nominal problem.


Thomas Prang
1998-06-07