In this section I will present several methods used for data-mining. Some of them are traditional methods such as Neural Networks (NN's) for classification, clustering for similarity hierarchies, and regression and other statistical methods for modeling. Others are based on GSPS (General Systems Problem Solver) as an overall problem-solving framework for the inductive modeling of databases [26]. In particular, reconstructability analysis for finding simpler overall models of the database, mask analysis for investigating the behavior of a system on an ordered support (time, etc.), and DEEP for determining high local structure are GSPS methods. Methods derived from other fields include decision trees and rule inference.
Although in this paper I concentrate on nominal data and unsupervised methods, I will first present some of the classical approaches, which are supervised and use continuous, ordered data, both because they are historically important and because their connections to and ideas for data-mining are instructive. Moreover, the unsupervised, nominal structure-finding problem becomes clearer when it is contrasted with methods that take a different approach.
Note that ``Genetic Algorithms'' (GA's) are also often mentioned in the context of structure-finding. GA's are not data-mining methods in themselves, but they are often used to search and optimize huge search-spaces (e.g. spaces of models). In combination with other methods this can be a very useful approach; see [12,35] for more information.
One other process applicable to all domains is ``Goodness of fit''. Using a simple Chi-Square ($\chi^2$) statistic, a K-S test, etc., the quality of a model can be evaluated against test-data: the expected values of the model, e.g. $e_{ijk}$ for the input variables $i$, $j$ and $k$, are compared with the actually occurring values, e.g. $o_{ijk}$ [34, pp. 294-300, 216-217].
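As a minimal sketch of this comparison, the $\chi^2$ statistic sums $(o - e)^2 / e$ over all cells of the counts table. The cell counts below are hypothetical, and the flattened one-dimensional layout is a simplification of the three-index notation $e_{ijk}$, $o_{ijk}$ used above:

```python
# Sketch of a chi-square goodness-of-fit check; cell counts are hypothetical.

def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all cells of the counts table."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 30, 30]   # o_ijk: counts actually occurring in the data
expected = [25, 25, 25, 25]   # e_ijk: counts predicted by the model

stat = chi_square(observed, expected)
print(round(stat, 2))
```

The resulting statistic would then be compared against a $\chi^2$ critical value for the appropriate degrees of freedom to decide whether the model fits the test-data acceptably.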
In the following sections I will first introduce, discuss and relate the problem contexts of the different approaches: supervised and unsupervised, with scalar, ordinal or nominal data. Then, after a short overview of each method, I will show how it relates to the four basic techniques introduced in Section 2.3.
Note that these methods traditionally use slightly different notation and different labels for similar concepts, though I have tried to standardize them. In NN terminology the investigated database is split into a training set and a cross-validation set. In other situations I may write about a relation or the counts table. All these labels denote the same data set, although sometimes viewed from different perspectives. In the case of a training set we are interested neither in counts nor in the induced probabilities of our relation (Section 2.1); only the data with the correct classification matters, as NN's usually ignore the significance (count) of specific value-combinations. With other methods I refer to counts and probabilities, and the reader will understand the appropriate meaning in that context. Single occurring value-combinations in the relation will be referred to as entities, rows, data-records, etc.
The introduced methods can be categorized as follows:
| Type / Data  | Scalar | Ordinal | Nominal |
|---|---|---|---|
| supervised   | Fisher, linear classifiers; Logistic Regression; Neural Networks | Logit-model | Decision Trees |
| unsupervised | Clustering; Anova | Mask-Analysis | Reconstructability Analysis; DEEP; Log-linear models; Rule Induction |
| either       | Goodness of fit (applicable to all data types) | | |