In this section I will present several methods used for data-mining. Some of them are traditional methods such as Neural Networks (NN's) for classification, clustering for similarity hierarchies, and regression and other statistical methods for modeling. Others are based on GSPS (General Systems Problem Solver) as an overall problem-solving framework for the inductive modeling of databases [26]. In particular, reconstructability analysis for finding simpler overall models of the database, mask analysis for investigating the behavior of a system on an ordered support (time, etc.), and DEEP for determining high local structure are GSPS methods. Methods derived from other fields include decision trees and rule inference.
Although in this paper I concentrate on nominal data and unsupervised methods, I will first present some of the classical approaches, which are supervised and use continuous, ordered data, both because they are historically important and because their connections to and ideas for data-mining are instructive. Moreover, the unsupervised, nominal structure-finding problem becomes clearer when it is contrasted with methods that take a different approach.
Note that ``Genetic Algorithms'' (GA's) are also often mentioned in the context of structure-finding. GA's are not data-mining methods in themselves, but they are often used to search and optimize huge search-spaces (e.g. spaces of models). In combination with other methods this can be a very useful approach; see [12,35] for more information.
One other process applicable to all domains is ``Goodness of fit''. Using a simple Chi-Square ($\chi^2$) statistic, a K-S test, etc., the quality of a model can be evaluated against test-data: the expected values of the model, e.g. $e_{ijk}$ for the input variables $i$, $j$ and $k$, are compared with the actually occurring values, e.g. $o_{ijk}$ [34, pp. 294-300, 216-217].
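As a minimal sketch of this comparison, the $\chi^2$ statistic sums $(o - e)^2 / e$ over all cells of the counts table. The cell counts below are hypothetical, and the flattened one-dimensional layout is a simplification of the three-index notation $e_{ijk}$, $o_{ijk}$ used above:

```python
# Sketch of a chi-square goodness-of-fit check; cell counts are hypothetical.

def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all cells of the counts table."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 30, 30]   # o_ijk: counts actually occurring in the data
expected = [25, 25, 25, 25]   # e_ijk: counts predicted by the model

stat = chi_square(observed, expected)
print(round(stat, 2))
```

The resulting statistic would then be compared against a $\chi^2$ critical value for the appropriate degrees of freedom to decide whether the model fits the test-data acceptably.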
In the following sections I will first introduce, discuss and relate the problem contexts of the different approaches: supervised and unsupervised, with scalar, ordinal or nominal data. Then, after a short overview of each method, I will show how it relates to the four basic techniques introduced in Section 2.3.
Note that these methods traditionally use slightly different notation and different labels for similar concepts, though I have tried to standardize them. In NN terminology the investigated database is split into a training set and a cross-validation set. In other situations I may write about a relation or the counts table. All these labels denote the same data set, although sometimes viewed from different perspectives. In the case of a training set we are interested neither in counts nor in the induced probabilities of our relation (Section 2.1); only the data with the correct classification matters, as NN's usually ignore the significance (count) of specific value-combinations. With other methods I refer to counts and probabilities, and the reader will understand the appropriate meaning in that context. Single occurring value-combinations in the relation will be referred to as entities, rows, data-records, etc.
The introduced methods can be categorized as follows:
| Type / Data  | Scalar | Ordinal | Nominal |
|---|---|---|---|
| supervised   | Fisher, linear classifiers; Logistic Regression; Neural Networks | Logit-model | Decision Trees |
| unsupervised | Clustering; Anova | Mask-Analysis | Reconstructability Analysis; DEEP; Log-linear models; Rule Induction |
| either       | Goodness of fit (applicable to all data types) | | |