
DEEP

Whereas the aim of Reconstructability Analysis is an overall model of the structure among variables, DEEP (Data Exploration through Extension and Projection) [24] is a user-guided approach for investigating strong local structure.

DEEP is a new method within the GSPS framework for exploring more ``local'' structures without reconstructing the whole relation. The user specifies an initial set of ``interesting'' variables V1. The overall relation is projected onto this variable set, $\pi_{V_1}$, and one can observe how the data is distributed over the corresponding domain dom(V1). The user then specifies a second set of variables V2 over which the first projection $\pi_{V_1}$ can be spread out as conditional distributions; the first projection is extended to include the second variable set. This means that for every value $\vec{v_1} \in dom(V_1)$ we create a distribution over the variable set V2: the count associated with each $\vec{v_1}$ is partitioned by the values of V2. The process can be visualized in a two-dimensional table: the count table of $\pi_{V_1}$ forms one dimension, and the partitioning of these counts by V2 forms the second. As an illustration, consider the following fictitious example with $V_1=\{income\}$ and $V_2=\{car\}$:



income    count   Ford   Dodge   Toyota   GMC        H      K      G
high         10      7       3        0     0   0.8813   0.30   0.88
medium       35     10      10       10     5   1.9502   0.39   0.97
low          20      0       2        5    13   1.2362   0.37   0.78
total        65     17      15       15    18   1.9954   0.33   0.998
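
Constructing such an extension table is mechanical once the two variable sets are fixed. The following Python sketch illustrates the idea; it is not part of the DEEP software, and the record format and the function name extension_table are assumptions made for illustration.

from collections import Counter, defaultdict

def extension_table(records, v1, v2):
    """Project the data onto the variable set v1 and partition
    each projected count by the values of the variable set v2."""
    counts = Counter()                    # counts of the projection pi_{V1}
    partition = defaultdict(Counter)      # partition of each count by V2
    for rec in records:
        key1 = tuple(rec[v] for v in v1)
        key2 = tuple(rec[v] for v in v2)
        counts[key1] += 1
        partition[key1][key2] += 1
    return counts, partition

# Fictitious records reproducing the income/car table above.
records = (
    [{"income": "high",   "car": "Ford"}]   * 7 +
    [{"income": "high",   "car": "Dodge"}]  * 3 +
    [{"income": "medium", "car": "Ford"}]   * 10 +
    [{"income": "medium", "car": "Dodge"}]  * 10 +
    [{"income": "medium", "car": "Toyota"}] * 10 +
    [{"income": "medium", "car": "GMC"}]    * 5 +
    [{"income": "low",    "car": "Dodge"}]  * 2 +
    [{"income": "low",    "car": "Toyota"}] * 5 +
    [{"income": "low",    "car": "GMC"}]    * 13
)

counts, partition = extension_table(records, ["income"], ["car"])
print(counts[("medium",)])                  # 35
print(partition[("medium",)][("GMC",)])     # 5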


The entropy H for all these conditional distributions ( $H( V_2 \vert V_1 = \vec{v_1})$ ) is calculated and compared. The conditional distributions with low entropies contain the most structure.

K and G are two further important measures for this method. K is defined as a relative Hartley measure (equation 6, page [*]) based on the partitioning of the counts by the values of the second variable set V2. In our example the 10 data entities with ``high income'' partition into two groups, one group driving ``Ford'' and the other ``Dodge'', therefore $K= \log_2(2)/\log_2(10)=0.30$. The 35 ``medium income'' data entities are partitioned into 4 groups, $K= \log_2(4)/\log_2(35)=0.39$, while the 20 ``low income'' entities fall into 3 partitions, $K= \log_2(3)/\log_2(20)=0.37$.

G is defined as a relative entropy (equation 10, page [*]) over the partition, which means the entropy relative not to the whole domain of V2 but to the size of the partition. The entropy of the ``high income'' row is spread over two clusters, therefore $G= H/\log_2(2)=0.8813$; the entropy of the ``medium income'' row arises from 4 clusters, therefore $G=H/\log_2(4)=0.97$.
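
A minimal sketch of the three measures, assuming the counts of the table above; the function name measures is illustrative and not taken from the DEEP software.

import math

def measures(row_counts):
    """Return (H, K, G) for one row of the extension table.

    row_counts : non-zero counts of the partition of one V1 value by V2
    H : conditional entropy H(V2 | V1 = v1)
    K : relative Hartley, log2(#clusters) / log2(total count)
    G : entropy relative to the partition, H / log2(#clusters)
    """
    total = sum(row_counts)
    clusters = len(row_counts)
    H = -sum(c / total * math.log2(c / total) for c in row_counts)
    K = math.log2(clusters) / math.log2(total)
    G = H / math.log2(clusters) if clusters > 1 else 0.0
    return H, K, G

print(measures([7, 3]))           # high income:   (0.8813, 0.30, 0.8813)
print(measures([10, 10, 10, 5]))  # medium income: (1.9502, 0.39, 0.975)
print(measures([2, 5, 13]))       # low income:    (1.2362, 0.37, 0.78)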

Guided by these three measures, the user restricts the data to specific values $v_1^{(1)}, \ldots, v_1^{(k_1)}$, called the condition $C_1=(V_1 \in \{v_1^{(1)}, \ldots, v_1^{(k_1)}\} )$. We end up with a data set conditionalized on the first variable set. Next we use the projection $\pi_{V_2}$ and calculate entropies for the conditional distributions over a third variable set V3, and continue subsetting the data with a second condition $C_2=(V_2 \in \{v_2^{(1)}, \ldots, v_2^{(k_2)}\} )$. This procedure is iterated, and we end up with a small subset of data which is highly correlated over the chosen variable sets. The user-guided process can be summarized as an ordered set (vector) of (variable-set, condition) tuples:

\begin{displaymath}( (V_1,C_1), \; (V_2,C_2), \;\ldots, \; (V_n,C_n) ) \end{displaymath}
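
Such a sequence of conditions can also be applied mechanically, which is what the automated mode mentioned below amounts to. The Python sketch that follows assumes a condition is represented as a set of admissible value tuples over its variable set; the function name deep and the record format are illustrative, not taken from the original implementation.

def deep(records, steps):
    """Restrict the data by a sequence of (variable_set, condition) pairs.

    Each condition is a set of admissible value tuples over its variable set;
    after each step only the records satisfying the condition remain.
    """
    subset = records
    for variables, admissible in steps:
        subset = [rec for rec in subset
                  if tuple(rec[v] for v in variables) in admissible]
        # At this point the user would inspect the extension table and the
        # H, K, G measures of `subset` before choosing the next condition.
    return subset

# Toy data and two conditions: C1 = (income in {high}), C2 = (car in {Ford}).
records = [{"income": "high",   "car": "Ford"},
           {"income": "high",   "car": "Dodge"},
           {"income": "medium", "car": "Toyota"},
           {"income": "low",    "car": "GMC"}]
steps = [(["income"], {("high",)}),
         (["car"],    {("Ford",)})]
print(deep(records, steps))   # [{'income': 'high', 'car': 'Ford'}]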

As this process is entirely user-guided, it is very flexible in finding the required structures. It is also possible to predefine some of the subsetting and conditionalizing criteria and run DEEP in an automated fashion.

DEEP is thus based on an iterative procedure of projecting, conditionalizing, subsetting, and extending again into other dimensions.

