
   
Projection and Extension

To project a dataset means to aggregate it over a subset of its dimensions (formally defined in Section 3.3.2): the data is viewed only through the selected dimensions, independently of all others. Projection is used to find direct relationships within a smaller subset of dimensions. Such a relationship may not be obvious from the whole dataset because of other `noisy' dimensions. A simple example is a personal database of mood, quality of food, amount of work, and weather conditions with 61 data records. This data is given in the form of the following counts table:



Mood     Food     Work     Weather   Count   Probability
bad      bad      few      rainy         9          9/61
good     good     much     sunny        10         10/61
good     good     few      sunny        12         12/61
medium   good     medium   cloudy       13         13/61
good     medium   medium   sunny         9          9/61
medium   bad      few      cloudy        2          2/61
bad      medium   much     rainy         6          6/61
total                                   61             1


A 2-dimensional projection of this data onto (Mood, Weather) shows a direct relation between these two variables: `rainy' always comes with `bad' mood, `cloudy' with `medium', and `sunny' with `good' mood.



Mood     Weather   Count   Probability
bad      rainy        15         15/61
good     sunny        31         31/61
medium   cloudy       15         15/61
total                 61             1
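
The projection itself is a plain group-and-sum over the omitted dimensions. The following Python sketch (the record layout and the helper name `project' are illustrative choices, not taken from the thesis) encodes the counts table above and reproduces the 2-dimensional projection onto (Mood, Weather):

    from collections import Counter

    # The counts table above: one entry per observed value combination.
    records = [
        # (mood, food, work, weather, count)
        ("bad",    "bad",    "few",    "rainy",   9),
        ("good",   "good",   "much",   "sunny",  10),
        ("good",   "good",   "few",    "sunny",  12),
        ("medium", "good",   "medium", "cloudy", 13),
        ("good",   "medium", "medium", "sunny",   9),
        ("medium", "bad",    "few",    "cloudy",  2),
        ("bad",    "medium", "much",   "rainy",   6),
    ]

    def project(records, keep):
        """Sum the counts over every dimension that is not listed in `keep`."""
        dims = ("mood", "food", "work", "weather")
        idx = [dims.index(d) for d in keep]
        projected = Counter()
        for *values, count in records:
            projected[tuple(values[i] for i in idx)] += count
        return projected

    total = sum(count for *_, count in records)            # 61
    for key, count in project(records, ("mood", "weather")).items():
        print(key, count, f"{count}/{total}")
    # ('bad', 'rainy') 15 15/61
    # ('good', 'sunny') 31 31/61
    # ('medium', 'cloudy') 15 15/61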


This direct relation between mood and weather is easily indicated by uncertainty measures: the relative entropy H_relative(Mood, Weather) = 0.4706 is already low, as only 3 of the 9 possible (mood, weather) combinations occur in the projection. The direct connection between the values becomes obvious by calculating the conditional entropy: H(Mood | Weather) = H(Weather | Mood) = 0; if we know one of the variables, the uncertainty about the other one is 0 in our given table. This shows how entropy measures can be used on projections to indicate inner structures.
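
These values can be checked with a few lines of Python, assuming that the relative entropy is the joint entropy divided by the maximum entropy log2(9) of the 3 x 3 possible value combinations (this assumption reproduces the quoted 0.4706):

    from math import log2

    total = 61
    joint = {("bad", "rainy"): 15, ("good", "sunny"): 31, ("medium", "cloudy"): 15}

    def entropy(counts, total):
        """Shannon entropy in bits of a distribution given as counts."""
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # Relative entropy: joint entropy over the maximum entropy of the
    # 3 mood values x 3 weather values = 9 possible combinations.
    h_joint = entropy(joint.values(), total)
    print(round(h_joint / log2(9), 4))                         # 0.4706

    # Conditional entropy H(Mood | Weather) = H(Mood, Weather) - H(Weather).
    weather = {}
    for (_, w), c in joint.items():
        weather[w] = weather.get(w, 0) + c
    print(round(h_joint - entropy(weather.values(), total), 4))  # 0.0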

So far we have only dealt with counts and probability distributions as measures for our table (relation). They are important because we use them to measure the amount of structure in a given relationship (table). But we can also use other measures for further investigation, for example summary statistics such as percentiles, averages, minimum, maximum, and standard deviation. These aggregational statistics are defined over dimensions which, due to a projection, are not shown directly. Every dimension can be folded into a measure; this is just another visualization and viewpoint of the same aspect. Instead of showing all possible combinations for the variable ``food'' we could aggregate this information into a measure ``Percentile of good Food''. This process of interchanging measures with dimensions is described in more detail in [2]. With this new viewpoint, relationships between measures or between dimensions and measures can be investigated.

From our point of view, ``folding'' a dimension into a measure is a combination of projection and the subsequent creation of a new measure holding the aggregational statistic (a sketch follows below). First we get rid of the ``food'' dimension by projecting and aggregating over its values, then we add the aggregational measure ``Percentile of good Food'' to the resulting table. Note that these aggregational measures usually combine several of the original rows (due to the projection) into one value. Therefore a table with a ``folded'' dimension usually consists of fewer rows.
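
A minimal sketch of this two-step view, continuing the Python example above; the measure ``share of good food'' is an illustrative stand-in for the thesis' ``Percentile of good Food'':

    from collections import Counter

    # Same records as in the projection sketch above:
    # (mood, food, work, weather, count)
    records = [
        ("bad",    "bad",    "few",    "rainy",   9),
        ("good",   "good",   "much",   "sunny",  10),
        ("good",   "good",   "few",    "sunny",  12),
        ("medium", "good",   "medium", "cloudy", 13),
        ("good",   "medium", "medium", "sunny",   9),
        ("medium", "bad",    "few",    "cloudy",  2),
        ("bad",    "medium", "much",   "rainy",   6),
    ]

    # Step 1: project away the "food" dimension (keep mood, work, weather).
    counts, good_food = Counter(), Counter()
    for mood, food, work, weather, count in records:
        key = (mood, work, weather)
        counts[key] += count
        if food == "good":
            good_food[key] += count     # step 2: fold "food" into a measure

    # The folded table: "food" now appears only as an aggregated measure.
    # (In this tiny example no rows happen to merge; with more data several
    # original rows would usually collapse into one.)
    for key in counts:
        share = good_food[key] / counts[key]
        print(key, counts[key], f"good-food share: {share:.2f}")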

As described, we can always view our data as a table of dimensions with counts attached, but for understanding different viewpoints of the data the distinction between dimensions and measures makes sense. More on new dimensions follows in Section 2.3.4.

In statistics, projection is known as ``marginalization'': by projecting our relation (table) onto fewer dimensions we marginalize the probability distribution.
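
For the example above, this means summing the joint probability over the omitted dimensions Food and Work:

    P(Mood, Weather) = sum over Food, Work of P(Mood, Food, Work, Weather),
    e.g. P(bad, rainy) = 9/61 + 6/61 = 15/61.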

In OLAP terminology this is called ``pivoting'' and is equivalent to a ``SELECT x1, x2, ..., COUNT(*) ... GROUP BY x1, x2, ...'' statement in SQL.
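
The same projection can be run directly against a relational database. The sketch below uses Python's built-in sqlite3 module with an in-memory table; the table name `diary' and the one-row-per-observation layout are assumptions made for the example:

    import sqlite3

    # Expand the counts table to one row per observation so that COUNT(*)
    # reproduces the Count column.
    counts = [
        ("bad",    "bad",    "few",    "rainy",   9),
        ("good",   "good",   "much",   "sunny",  10),
        ("good",   "good",   "few",    "sunny",  12),
        ("medium", "good",   "medium", "cloudy", 13),
        ("good",   "medium", "medium", "sunny",   9),
        ("medium", "bad",    "few",    "cloudy",  2),
        ("bad",    "medium", "much",   "rainy",   6),
    ]
    rows = [(m, f, w, s) for m, f, w, s, c in counts for _ in range(c)]

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE diary (mood TEXT, food TEXT, work TEXT, weather TEXT)")
    con.executemany("INSERT INTO diary VALUES (?, ?, ?, ?)", rows)

    # The projection onto (mood, weather) as a GROUP BY aggregation.
    query = "SELECT mood, weather, COUNT(*) FROM diary GROUP BY mood, weather"
    for mood, weather, n in con.execute(query):
        print(mood, weather, n)   # bad/rainy 15, good/sunny 31, medium/cloudy 15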

