Epistemological background

In this section I will discuss the similarities and significant differences between databases and source systems in the system science view. It will be shown that these differences are not specific to the source-system perspective but apply to the data-mining problem in general, centering on the questions of which data are important for our investigation and how they can be cleaned, preprocessed, and transformed.

As we will see, databases are basically collections of data entries over specified sets of variables with specified domains (state sets, value sets). This view corresponds closely to the idea of source and data systems [26]. The definition of variables, their connection to the ``real world'', and the assigned sets of ``allowed'' values constitute a source system. It is important to note that this definition already contains a lot of constraints and information. A source system already specifies which aspects of the ``real world'' are important, where ``important'' must always be seen in the problem context of our system. It also specifies how to map these aspects into our problem space. This mapping is normally homomorphic: the values are simplified and fewer in number, but the ``relevant'' structure is preserved.
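The idea of a source system as variables, domains, and a homomorphic observation mapping can be sketched in code. This is a minimal illustration only, not GSPS notation; the variable names and the coarsening cut points are invented for the example:

```python
# A source system: variables with their domains (allowed state sets),
# plus an observation channel that maps raw "real-world" readings
# into those domains -- a homomorphic simplification.

source_system = {
    "age":    {"young", "middle", "old"},   # coarsened domain
    "income": {"low", "medium", "high"},
}

def observe_age(years):
    """Homomorphic mapping: many raw values collapse onto a few
    states, but the relevant ordering structure is preserved."""
    if years < 30:
        return "young"
    elif years < 60:
        return "middle"
    return "old"

state = observe_age(42)
assert state in source_system["age"]  # observations respect the domain
```

The point of the sketch is only that the domain and the mapping together already encode what counts as ``relevant'' about age for the investigation.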

Relational databases are seen as collections of relations [39]. By joining all these relations we can obtain one overall relation which combines the relational information. The definition of the overall relation, called the relational schema, specifies a set of variables, corresponding domains, and the respective ``real-world'' connections. The constraints and knowledge already given by the variables and their corresponding sets of values are the same as in a source system. The relational schema of the total join is therefore a description equivalent (isomorphic) to a source system of the whole database.
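The construction of the overall relation can be made concrete with a toy natural join over relations represented as lists of dictionaries. The table and attribute names here are invented for illustration:

```python
def natural_join(r1, r2):
    """Natural join of two relations (lists of dicts): combine
    tuples that agree on all shared attributes."""
    shared = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    joined = []
    for t1 in r1:
        for t2 in r2:
            if all(t1[a] == t2[a] for a in shared):
                joined.append({**t1, **t2})
    return joined

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders    = [{"cid": 1, "item": "book"}, {"cid": 1, "item": "pen"}]

overall = natural_join(customers, orders)
# The schema of the join is the union of the two schemas.
assert set(overall[0]) == {"cid", "name", "item"}
```

The schema of `overall` (the union of the individual schemas with their domains) plays the role of the source system in the discussion above.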

Measurements add data to the source system, and we obtain a data system. In the database view, a relation or table is obtained from the relational schema. The data represent structure and relationships between our variables and, therefore, information about specific ``real world'' connections.

Actually, the set of (unjoined) relations in a database already contains deeper constraints and structure among the variables, as some variables are disconnected in different relations. The unjoined relations with their additional structure correspond to a GSPS structure system. More details on structure systems will be given in section 3.3.2. In relational databases this structure within the relational schema is often induced by so-called ``normal forms'' introduced for the manageability of the database. The following discussion will therefore concentrate on the overall relation as one data system, with its relational schema isomorphic to a source system.
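The correspondence between an unjoined set of relations and a structure system can be sketched briefly: each relation is a subsystem over a subset of the overall variable set. The schemas below are invented for the example:

```python
# A structure system: several subsystems, each defined over a subset
# of the overall variable set -- exactly how the unjoined relations
# of a database partition the relational schema.
subsystems = {
    "customers": {"cid", "name", "city"},
    "orders":    {"cid", "item", "price"},
}

# The overall schema (of the total join) is the union of the parts.
overall_schema = set().union(*subsystems.values())
assert overall_schema == {"cid", "name", "city", "item", "price"}

# Variables in different relations are connected only through the
# shared (key) variables; all others are disconnected.
shared = subsystems["customers"] & subsystems["orders"]
assert shared == {"cid"}
```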

Even though source and data systems are theoretically connected with databases, there is still a big practical difference between them. In the system science view, the building of a source system is preceded by several premethodological considerations [26, chapter 1]: The Purpose of Investigation expresses our idealized intention: what do we want to achieve with the system, and what is the reason for our modeling? The Constraints of Investigation restrict this idealized intention and ground our purpose in the real world with all its constraints. For example, parts of our intention may not be realizable: some information may be unavailable, technologies may not be advanced enough, etc. This leads us to the final Object of Investigation, from which we abstract variables and corresponding state sets, as well as supports and support sets for differentiating our data. Building a source system in the system science approach means carefully selecting variables and domains for the kinds of things we want to investigate. For the domains we restrict our attention to as few values as possible, as this simplifies the modeling process.

In data mining the situation looks entirely different. Neither variables nor state sets nor supports are selected by ourselves; rather, they are given to us with the direction to find specific structures, patterns, and dependencies. Often these databases are large and incomplete and contain ``noisy'', uncertain, redundant, useless, NULL, or missing values. The questions are how to deal with the overabundance of information, how to find the relevant information, and whether there is even adequate information for interesting discoveries.
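The simplest of the resulting cleaning tasks, removing duplicates and tuples with missing values, can be sketched as follows. The records and attribute names are invented, and dropping incomplete tuples is only the crudest of the available strategies:

```python
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing value
    {"age": 34, "income": 52000},     # exact duplicate
]

def clean(records):
    """Drop exact duplicates and tuples with missing (None) values."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen or None in r.values():
            continue
        seen.add(key)
        out.append(r)
    return out

assert clean(records) == [{"age": 34, "income": 52000}]
```

More careful treatments would impute missing values or keep them as a distinguished state instead of discarding the tuple.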

The situation can be compared with the three levels of data one can use for scientific discovery (see also [42, pp. 33]). The first level is the experimental level: one can actively select appropriate input variables and an environment to create the needed data. This normally leaves us with an abundance of data, as many parameters can be varied. New data can also be created, measurements can be improved, and tests can be repeated (to validate results).

Observational data constitutes the second level. Here the parameters of the investigated object cannot be changed, and therefore data cannot be created just as needed. But the observer can still choose what kinds of data to select for his investigation: he can measure any observable property of any available object and refine his methods of measurement.

The lowest level consists of historical or archival data: data which is already recorded and which we have no way to change, improve, or extend with other dimensions. In most situations this is the information we have to deal with in databases, whereas in the system science approach we often start with experimental or observational data, though constraints can also require the use of archival data.

There are two approaches to dealing with these differences between databases and source/data systems.

The first solution is the obvious one: for our investigation we define the given database as the universe of discourse. This means that the constraints restrict the available data to that in the database. From the database we then select our variables and corresponding domains (which may differ from the variables and domains in the database) according to our purpose.
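This first solution amounts to a projection onto purpose-driven variables, with domains read off the stored data. A minimal sketch, with invented table contents:

```python
# Universe of discourse = the database itself: select variables
# (a projection) and derive their domains from the stored data.
table = [
    {"name": "Ann", "age": 34, "city": "Berlin"},
    {"name": "Bob", "age": 51, "city": "Hamburg"},
]

chosen = ["age", "city"]   # selection driven by the purpose
projection = [{v: row[v] for v in chosen} for row in table]
domains = {v: {row[v] for row in table} for v in chosen}

assert domains["city"] == {"Berlin", "Hamburg"}
```

The derived domains here need not coincide with the domains declared in the database schema; they reflect only the values actually present.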

The second solution is similar in that it also selects variables, but we do not choose static domains for these variables. When exploring databases we often want to be able to see the details in the data as well as simplified and abstracted values for finding general patterns. Therefore we use the detailed ``precision'' of the database for our source system, but we also incorporate the concept of ``simplification'' (coarsening and refinement) as hierarchies. This means that we introduce a hierarchy of values for each variable in the source system. This makes it possible not only to search for structure at each level of a hierarchy but also to use different levels for different variables. For example, for income we might use only three states {high, medium, low}, for address a clustering into small towns and city districts, and for number of children the whole precision of the database. Later in this process we might want to see how the medium wages spread out over five more refined wage ranges. Hierarchies of values are often used in OLAP (On-Line Analytical Processing) [2,3,1], and in the next chapter I will discuss how they can be used for structure finding. A ``3-4-5 rule'' for building such hierarchies is also known in the database community; it suggests dividing each node of the hierarchy into 3, 4, or 5 subnodes [18].
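A two-level hierarchy of this kind, with the income example above, can be sketched as follows. The cut points are invented for the example; only the idea of consistent coarsening and refinement matters:

```python
# A two-level value hierarchy for "income": the database keeps the
# raw numbers; coarser or finer levels are derived on demand.
def income_level1(x):
    """Coarsest level: three states."""
    return "low" if x < 20000 else "medium" if x < 60000 else "high"

def income_level2(x):
    """Refinement of "medium" into five 8000-wide ranges; the other
    states stay unrefined."""
    if income_level1(x) != "medium":
        return income_level1(x)
    lo = 20000 + 8000 * ((x - 20000) // 8000)
    return (lo, lo + 8000)

# Refinement is consistent with the coarser level: every refined
# band of a medium income still falls inside [20000, 60000).
assert income_level1(35000) == "medium"
assert income_level2(35000) == (28000, 36000)
```

Different variables can then be analyzed at different levels of their hierarchies, and a level can be changed without rebuilding the source system.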

One might ask why I have concentrated so much on the relationship of databases with source systems in this chapter. First, source systems are very similar to relational databases and are therefore an ``interesting'' viewpoint. Second, the differences mentioned between databases and source systems show the importance and necessity of preparing data for data mining. In this sense, ``building a source system'' is just one label (I think a very appropriate one) for a step which in the literature is also called ``Data Cleaning'', ``Data Preprocessing'', ``Data Transformation'', and ``Variable Selection'' [1, pp. 37-47]. This process includes knowing the purpose of investigating the database; dealing with useless, noisy, false, uncertain, and redundant data; transforming it; and selecting useful dimensions and state sets (or hierarchies of state sets). It basically summarizes all the preprocessing that needs to be done before the ``real'' investigation. For more details on source or data systems see [26, chapters 1, 2].


Thomas Prang
1998-06-07