
   
Subset and Superset

Subsetting the dataset is another important operation: the available data is restricted according to specified conditions, so that certain values of given variables are eliminated. This allows focusing on ``interesting'' values. Especially with nominal variables, there may be no general linkage between the variables as a whole, but there may be one between specific values.

In a health database of patients, we could create a projection onto just the variables illness and food to look for a linkage between them. The distribution of this projection might, however, look random: in general there may be no relationship between an illness and the kind of food the patient ate prior to it. After focusing on one specific illness by subsetting the data, the result can look entirely different. Perhaps all patients with stomach pain ate `Hamburger' before being hospitalized.
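The patient example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are invented, and the helper names `project` and `subset` are hypothetical, not part of any system described here.

```python
from collections import Counter

# Hypothetical patient records; all values are invented for illustration.
patients = [
    {"illness": "stomach pain", "food": "Hamburger", "age": 34},
    {"illness": "stomach pain", "food": "Hamburger", "age": 51},
    {"illness": "stomach pain", "food": "Hamburger", "age": 27},
    {"illness": "flu", "food": "Hamburger", "age": 45},
    {"illness": "flu", "food": "Salad", "age": 62},
    {"illness": "flu", "food": "Pasta", "age": 19},
    {"illness": "broken leg", "food": "Salad", "age": 38},
    {"illness": "broken leg", "food": "Pasta", "age": 70},
    {"illness": "broken leg", "food": "Hamburger", "age": 23},
]


def project(records, variables):
    """Projection: keep only the named variables of every record."""
    return [{v: r[v] for v in variables} for r in records]


def subset(records, variable, allowed):
    """Subsetting: keep only records whose value for `variable` is in `allowed`."""
    return [r for r in records if r[variable] in allowed]


# Projection onto illness and food: the food counts are spread out.
projection = project(patients, ["illness", "food"])
food_overall = Counter(r["food"] for r in projection)

# Subsetting to one illness: a single food value remains.
stomach = subset(projection, "illness", {"stomach pain"})
food_stomach = Counter(r["food"] for r in stomach)

print(food_overall)
print(food_stomach)
```

In this toy data the food counts over all patients are mixed, while the subset for stomach pain contains only `Hamburger'.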

The opposite of subsetting is supersetting. It removes a restriction again, so that the relation between the unconditioned values becomes visible. There might be some general connection between the variables even though the chosen values reveal only random behavior. In this case the structure lies in other values, and we want to go back to see the whole picture.

Subsetting corresponds to ``conditionalizing'' in statistics: we restrict the values of some variables and look at the ``conditional probabilities''. Subsetting variables to different sets of values yields several conditional probability distributions, which can then be compared. Differences between the distributions can be tested with several statistical tests ($\chi^2$-test, H-test, U-test) [34].
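As a rough sketch of this comparison, the following Python code computes two conditional distributions and the Pearson $\chi^2$ statistic for the corresponding contingency table. The counts are invented, the function names are hypothetical, and the statistic is computed by hand rather than with a statistics library; the H-test and U-test are not shown.

```python
from collections import Counter

# Hypothetical counts, invented for illustration: (illness, food, patients).
counts = [
    ("stomach pain", "Hamburger", 8),
    ("stomach pain", "Salad", 2),
    ("flu", "Hamburger", 3),
    ("flu", "Salad", 7),
]
records = [
    {"illness": illness, "food": food}
    for illness, food, n in counts
    for _ in range(n)
]


def conditional_distribution(records, target, given, value):
    """Relative frequencies of `target` among records where `given` == value."""
    freq = Counter(r[target] for r in records if r[given] == value)
    total = sum(freq.values())
    return {k: c / total for k, c in freq.items()}


def contingency_table(records, row_var, col_var):
    """Observed counts, rows and columns in sorted order of their values."""
    rows = sorted({r[row_var] for r in records})
    cols = sorted({r[col_var] for r in records})
    cell = Counter((r[row_var], r[col_var]) for r in records)
    return [[cell[(a, b)] for b in cols] for a in rows]


def chi_square_statistic(table):
    """Pearson chi-square statistic; assumes no row or column sums to zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


dist_stomach = conditional_distribution(records, "food", "illness", "stomach pain")
dist_flu = conditional_distribution(records, "food", "illness", "flu")
table = contingency_table(records, "illness", "food")
stat = chi_square_statistic(table)

print(dist_stomach, dist_flu)
print(table, round(stat, 2))
```

For these invented counts the statistic is about 5.05 with one degree of freedom, which exceeds the 5% critical value of 3.841, so the two conditional distributions would be judged different at that level. With counts this small, the result should be read as an illustration only.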

In OLAP terminology this technique is called ``slicing and dicing'', and it corresponds to the ``WHERE'' clause in SQL. In GSPS terminology it is known as ``simplification''.
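The SQL correspondence can be illustrated with Python's built-in `sqlite3` module. The table name `patients`, its columns, and the rows are assumptions made up for this sketch.

```python
import sqlite3

# Hypothetical rows, invented for illustration: (illness, food).
rows = [
    ("stomach pain", "Hamburger"),
    ("stomach pain", "Hamburger"),
    ("stomach pain", "Hamburger"),
    ("flu", "Hamburger"),
    ("flu", "Salad"),
    ("broken leg", "Salad"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (illness TEXT, food TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)

# Subsetting: the WHERE clause restricts illness to one value.
subset_rows = conn.execute(
    "SELECT food, COUNT(*) FROM patients "
    "WHERE illness = ? GROUP BY food ORDER BY food",
    ("stomach pain",),
).fetchall()

# Supersetting: the same query without the WHERE clause.
all_rows = conn.execute(
    "SELECT food, COUNT(*) FROM patients GROUP BY food ORDER BY food"
).fetchall()

conn.close()

print(subset_rows)
print(all_rows)
```

Dropping the WHERE clause again is the supersetting step: the query returns to the unconditioned counts.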


Thomas Prang
1998-06-07