next up previous contents
Next: Formal Definition: Up: Reconstructability Analysis (RA) Previous: Reconstructability Analysis (RA)

Informal Introduction:

In the original overall relation every possible connection between variable- subsets and their effects on the counts is accounted for; every ``correlated'' behavior between variables can be represented as all variables are ``connected'' with each other in this relation. This may be either unclear or self-evident at this point, it just means that every possible value combination over the overall cartesian product space can be assigned an individual count and induced probability measure.

On the other hand, if all the variables would be independent from each other we could simplify the overall probability distribution dramatically by just representing each dimension separately with its own distribution; in other words, the whole system could be represented by the projections through the individual dimensions xi. All subsets of variables which are not ``correlated'' do not need to appear together in the same relation. So by ``decomposing'' the overall relation into a set of subrelations we can obtain a simplified model which allows to see the connections between variables more directly. The less the variables are connected the simpler our model. We also might wish to ignore slight connections among variables in order to reduce the complexity of the system and to emphasis the the strong relations.

Imagine an example where 70% of the cars are red, and 30% blue. Independently to the car color, 20% of the cars are standard , the other 80% automatic. We could represent this information in an overall relation with probabilities attached:



Color Shift Probability
red standard 14%
red automatic 56%
blue standard 6%
blue automatic 24%
total   100%


From this table the independence between the variables is not obvious. But if we use a model of two subrelations {Color} and {Shifting} then the independence becomes clear as this ``simpler'' representation is able to support the same knowledge. We have the same accuracy as in the above table and the same information is represented. By the way I already made use of this simpler representation in my verbal description. The following are the projections of our ``perfect'' model:



Color Probability
red 70%
blue 30%
total 100%
Shift Probability
standard 20%
automatic 80%
total 100%


In reality we work with much larger sets of variables, but the same principle holds. Only the relationships can be much more complex, like there could be some connection among variables x1,x2,x3, a connection among variables x2,x3,x4 and also a connection among variables x1,x2,x4 but still no combined effect of all the variables x1,x2,x3,x4. In this case we would exchange the later relation, denoted $\{x_1,x_2,x_3,x_4\}$, by the three less complex relations $\{x_1,x_2,x_3\}$, $\{x_2,x_3,x_4\}$ and $\{x_1,x_2,x_4\}$.

Normal forms in relational databases follow a similar idea of decomposing the overall data into simpler (reconstructable) subrelations. Known independencies among variables are used to simplify the data representation and to get rid of redundancies. Manageability of updating procedures and the database is the main reason for doing this in Data Warehousing. The difference is that in creating normal forms we create the relational form according to our knowledge of independencies. We recognize that a sale of some clothes has something to do with the salesperson, but we assume that home-addresses of salespersons are not individually connected with each sale, only with other salesperson specific data.

In reconstructability analysis we don't know the independencies in advance. Given our data we try to infer simpler models which are an optimal tradeoff between ignoring slight connections in the data and a higher complexity of the model. In the further discussion of this method I will ignore existing normal or relational forms of the database and concentrate on decomposing one overall relation. This could be just one of the existing relational tables or an ``artificial'' overall relation constructed by joining the existing relations.

We start out with the whole relation in which all variables are connected by the data entries. We then want to simplify this whole relation by a set of subrelations which connect only a smaller number of variables and therefore reduces the complexity. The definition of the set of subrelations which carries our assumptions about the independencies among the variables is called a ``model'' of the overall relation, or in this particular case the ``reconstruction hypothesis''.

In Reconstruction Analysis many different hypotheses, following some specific search paths, are undertaken to find optimal models. To compare the correctness of hypothesis assumptions, we project the data using the subrelations of this model and then rebuilt a relation over the whole domain via unbiased (maximum entropy) reconstruction. Unbiased reconstruction means we obtain again an overall relation by exactly using the information from our projected model without adding external knowledge.

Taking a distance measure of the reconstructed (projected) distribution to the original distribution we are able to tell the quality of our reconstructability hypothesis. The closer the distance, the more accurate the model and the less information is lost by using that model. Choosing a model is a tradeoff of complexity and accuracy of the model. The original relation is most accurate but so complex that it doesn't give us insights about the internal structure and the connectivity of the variables. The more we ignore weak correlationships, the simpler and more understandable the model gets, but we also lose accuracy. Using a separate subrelation for each variable we obtain the simplest model but in most cases also a very inaccurate and therefore useless one. Because of the two complementary measures, simplicity and accuracy, the model solution set of reconstructability analysis consists of all hypotheses which are not inferior to any other model in a combined (simplicity, accuracy) partial ordering. Models in the solution set are less accurate with increasing simplicity but ``optimal'' for their respective simplicity level.

To summarize, Reconstructability Analysis is ``the problem of breaking down a given overall system into (simpler) subsystems that preserve enough information about the overall system. The principle motivation behind the reconstruction problem is the reduction of complexity in the system involved.'' [28, pg. 278].


next up previous contents
Next: Formal Definition: Up: Reconstructability Analysis (RA) Previous: Reconstructability Analysis (RA)
Thomas Prang
1998-06-07