Introduction

``Database Mining can be defined as the process of mining for implicit, previously unknown, and potentially useful information from very large databases by efficient knowledge discovery techniques.'' (workshop program, 1995 ACM Computer Science Conference) [31].

``Implicit'' expresses that the information is inductive, discovered from the data, rather than deductive, derived from laws.

``Previously unknown'' has two meanings and can therefore be somewhat misleading. The first is the obvious one: we are only interested in knowledge that is not already known. This is true for all science. The other, more important meaning is that ``data mining is distinguished by the fact that it is aimed at the discovery of information, without a previously formulated hypothesis'' [6, pg. 12]. In this sense it differs from most sciences, where we usually first state hypotheses and then test them. Data mining aims to derive hypotheses from the data in the first place. This is also the main distinction between data mining and data warehousing (data management): data warehousing allows queries for ``validating'' a hypothesis within the data, while data mining searches for general patterns to ``explain'' something. Because we do not start out with hypotheses, and because we are mostly dealing only with archival data (section 1.2), data-mining results need to be interpreted with care (section 4). Depending on the method and measure used, they may not reflect any correlation among variables, let alone causal relationships. Therefore data-mining results should be treated as what they are - hypotheses.
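
To make this distinction concrete, the following small Python sketch (not part of the original text; the item names and the support threshold are invented for illustration) contrasts a warehouse-style query, which validates one pre-stated hypothesis, with a mining-style search, which enumerates all frequent value pairs and thereby generates candidate hypotheses:

    from itertools import combinations
    from collections import Counter

    # Toy nominal records (hypothetical example data).
    records = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk", "coffee"},
        {"bread", "butter", "coffee"},
        {"bread", "milk"},
    ]

    # Warehouse-style query: validate the pre-stated hypothesis
    # "bread and butter occur together".
    support = sum({"bread", "butter"} <= r for r in records) / len(records)
    print("support of {bread, butter}:", support)

    # Mining-style search: count every pair of values and report the
    # frequent ones, i.e. generate hypotheses without stating one in advance.
    pair_counts = Counter()
    for r in records:
        for pair in combinations(sorted(r), 2):
            pair_counts[pair] += 1

    min_support = 0.4  # assumed threshold: a pair must occur in 40% of records
    for pair, count in sorted(pair_counts.items()):
        if count / len(records) >= min_support:
            print(pair, "support:", count / len(records))

The pairs reported by the second loop are exactly the kind of output that, as argued above, must still be treated as hypotheses rather than as established relationships.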

``Potentially useful'' is obvious: we only care about information that is in some sense useful.

``Very large databases'' (VLDB) expresses the problem of dealing with enormous amounts of data. ``Computerization of daily life has caused data about individual behavior to be collected and stored by banks, credit card companies, reservation systems, and electronic points of sale'' [6, pg. 9]. Satellite photographs, climate measurements, video surveillance recordings, etc. add further to this growing volume of data. ``It has been estimated that the amount of information in the world doubles every 20 months. The size and number of databases probably increase even faster.'' [42, pg. 1]. This also points to an important, sometimes misunderstood, fact: data-mining methods are not developed to replace humans, but to guide and support them in the process of discovering information. The amounts of data are so enormous that it is impossible for humans to examine them all. Without computerized support, most of the collected data would remain unexamined.

``By efficient knowledge discovery techniques'' emphasizes again that we need fast algorithms to deal with the data in a useful way. In particular, the curse of dimensionality should be noted: the complexity of methods tends to increase exponentially with the number of dimensions, as does the amount of data needed to meaningfully ``cover'' that dimensionality. This is the main impetus for ``variable selection'', ``field selection'', and ``variable reduction'' methods.
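
As a rough illustration (a sketch with assumed numbers, not taken from the text: five values per nominal variable and a fixed number of records), the following Python lines show how quickly a data set of fixed size fails to cover the attribute space as the number of dimensions grows:

    # With k possible values per variable, the d-dimensional attribute space
    # has k**d cells; N records can cover at most N of them, so the covered
    # fraction shrinks exponentially with d (the curse of dimensionality).
    k = 5        # values per nominal variable (assumption)
    N = 100000   # number of records (assumption)

    for d in (2, 4, 8, 16):
        cells = k ** d
        fraction = min(N, cells) / cells
        print("d=%2d  cells=%.2e  max fraction of cells covered=%.2e"
              % (d, cells, fraction))

This sparsity of coverage is precisely what variable selection and variable reduction methods try to mitigate before the actual mining step.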

There are several other definitions of data mining, but they usually differ only slightly, depending on what the author wants to emphasize or what kind of readership is addressed. For example, Cabena et al. address business managers: ``Data Mining is the process of extracting previously unknown, valid, and actionable information from very large databases and then using the information to make crucial business decisions.'' [6, pp. 12-13]. A good introduction to data mining can be found in Adriaans and Zantinge [1, pp. 1-10].

In this paper I will present some fundamental structure-finding ideas (section 2) and give an overview of a representative selection of data-mining methods (section 3). As my title expresses, the emphasis will be on unsupervised methods (i.e., methods not directed at any particular classification variable) and on nominal data. Therefore, structure and knowledge connected with ordered or even quantitative relationships of variables, e.g. Newton's law, is disregarded in this discussion and postponed for a later paper. Some methods for continuous data are also introduced to contrast with the nominal approach.


Thomas Prang
1998-06-07