As a formal example, assume a table R of cars, here also called a relation:
s (No.) | x1 (car-type) | x2 (color) |
1 | Ford | black |
2 | Dodge | red |
3 | Ford | white |
4 | Ford | black |
5 | Chevy | blue |
6 | Dodge | red |
7 | Dodge | red |
The relation R can also be described by its characteristic
function over the support S and the variables X1, X2:
(1) c(x1, x2) = |{ s in S : X1(s) = x1, X2(s) = x2 }| (count of tuples)
(2) f(x1, x2) = c(x1, x2) / |S| (relative frequency)
(3) sum over all (x1, x2) of f(x1, x2) = 1
x1 (type) | x2 (color) | c(x1,x2) (count) | f(x1,x2) (probability) |
Ford | black | 2 | 2/7 |
Dodge | red | 3 | 3/7 |
Ford | white | 1 | 1/7 |
Chevy | blue | 1 | 1/7 |
total | | 7 | 1 |
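The counts c(x1, x2) and relative frequencies f(x1, x2) from the table can be computed directly from the relation. A minimal sketch in Python (the tuple encoding of the rows is an assumption for illustration):

```python
from collections import Counter

# The car relation R from the table above, as (no, type, color) tuples.
cars = [
    (1, "Ford", "black"),
    (2, "Dodge", "red"),
    (3, "Ford", "white"),
    (4, "Ford", "black"),
    (5, "Chevy", "blue"),
    (6, "Dodge", "red"),
    (7, "Dodge", "red"),
]

# c(x1, x2): how often each (type, color) tuple occurs over the support S.
c = Counter((x1, x2) for _, x1, x2 in cars)

# f(x1, x2) = c(x1, x2) / |S|: the empirical probability of each tuple.
n = len(cars)
f = {tup: count / n for tup, count in c.items()}

print(c[("Dodge", "red")])   # 3
print(f[("Dodge", "red")])   # 0.42857... = 3/7
```

The probabilities sum to 1, matching the total row of the table.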
Can we have structure in a single dimension? Certainly not if the values in that dimension are randomly distributed. That random distribution would tell us something about the dimension (namely, that there is no structure), but it would leave us with an unstructured mess of values. We therefore associate structure with a variable if its value distribution allows some predictability, that is, if the distribution differs from a random one. As an example, imagine a distribution where 50% of the cars are red and 50% are black. If red and black are the only values for car colors, this dimension is randomly distributed and gives us no information for prediction. If 90% of the cars are red and only 10% are black, the situation is entirely different: we have found the ``structure'' that red cars are much more likely to appear than black ones. ``Structure'' thus seems to be connected with the distribution of the variable's values. Compare this with fitting a distribution (Normal, Exponential, Gamma, ...) in the continuous case.
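One standard way to make this predictability precise (not named in the text, but in line with measuring uncertainty) is Shannon entropy: the 50/50 distribution has maximal entropy, while the 90/10 distribution has much less. A short sketch:

```python
import math

def entropy(dist):
    """Shannon entropy in bits; lower means more predictable."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

uniform = {"red": 0.5, "black": 0.5}   # random: no structure
skewed  = {"red": 0.9, "black": 0.1}   # structured: red far more likely

print(entropy(uniform))  # 1.0 bit (maximal for two values)
print(entropy(skewed))   # ~0.469 bits
```

The skewed distribution's lower entropy reflects exactly the predictive structure described above.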
Going to two or more dimensions, the relationships between variables come into play. We think of high structure if specific values of one variable mostly appear together with specific values of another. In statistical terms, we would say the variables are ``correlated''; however, the usual correlation measures do not work for nominal variables (neither mean, variance, nor covariance is defined on a nominal probability space). Looking at the joint distribution of the variables, we see that this ``appearing together'' again just means a ``structured'' distribution of values instead of a random one: the probability of some value tuples (those whose values mostly appear together) is high, while the other probabilities remain small.
This becomes even clearer if we reduce the problem to the one-dimensional case by looking at the conditional distribution f(Y | X = x). We fix one value in dimension X and examine how the values of Y are distributed in this case (see Section 2.3.2). If the resulting distribution is random, then our chosen value of X seems unrelated to the dimension Y. But if the value x mostly occurs with one value of Y, the conditional distribution will be highly structured and the predictive uncertainty low.
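Computing f(Y | X = x) for the car relation makes this concrete. A sketch (the `conditional` helper is hypothetical, introduced here for illustration):

```python
from collections import Counter

# The (type, color) pairs of the car relation from the table above.
pairs = [("Ford", "black"), ("Dodge", "red"), ("Ford", "white"),
         ("Ford", "black"), ("Chevy", "blue"), ("Dodge", "red"),
         ("Dodge", "red")]

def conditional(pairs, x):
    """f(Y | X = x): distribution of colors among cars of type x."""
    colors = Counter(color for typ, color in pairs if typ == x)
    total = sum(colors.values())
    return {color: count / total for color, count in colors.items()}

print(conditional(pairs, "Dodge"))  # {'red': 1.0} -- highly structured
print(conditional(pairs, "Ford"))   # black 2/3, white 1/3
```

Fixing X = Dodge leaves no uncertainty about the color at all, while X = Ford still leaves some.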
When we look for some kind of pattern, we often have entities that already have something in common (the property on which we conditionalize), and we want to figure out what else they have in common (what structure there might be in the conditional distribution). Consider an example from programming: in some cases a program returns an error (in the space of program runs, this is the first thing they have in common). The programmer then wants to know what else these runs have in common, so that they can find out what could have triggered the error. If all these runs show a specific and distinct pattern in the input values, then the problem might be connected with these inputs.
In this sense, the structure within a dataset can be measured by the randomness or uncertainty of its value distribution (or conditional distribution, etc.).
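The two ideas can be combined: the entropy of a conditional distribution quantifies the predictive uncertainty that remains once we conditionalize. A sketch, again using the car relation and Shannon entropy as the (assumed) uncertainty measure:

```python
import math
from collections import Counter

# The (type, color) pairs of the car relation from the table above.
pairs = [("Ford", "black"), ("Dodge", "red"), ("Ford", "white"),
         ("Ford", "black"), ("Chevy", "blue"), ("Dodge", "red"),
         ("Dodge", "red")]

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cond_entropy(pairs, x):
    """Entropy of f(Y | X = x): uncertainty about color given the type x."""
    colors = Counter(color for typ, color in pairs if typ == x)
    n = sum(colors.values())
    return entropy([count / n for count in colors.values()])

print(cond_entropy(pairs, "Dodge"))  # 0.0 -- color fully determined by type
print(cond_entropy(pairs, "Ford"))   # ~0.918 -- some residual uncertainty
```

Zero conditional entropy corresponds to maximal structure: knowing X = Dodge removes all uncertainty about Y.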