Next: Parts and Wholes Up: Dangers in Data-mining Previous: Dangers in Data-mining

Causality

In the discussion of rule inference (section 3.3.5), we already noted one problem with data mining results. Rules, written in the form $X \rightarrow Y$, often give humans the impression of a causal relationship: X "implies" Y. In fact, everything is based only on observations in the data. Using only the support and confidence measures, we saw that some "good" results can be produced by totally independent variables (Milk $\rightarrow$ Bread with s = 0.81 and c = 0.90). This rule merely expresses that Milk and Bread each appear, independently, in most transactions, which can still be quite useful information once the rule's hypothesis has been carefully evaluated.
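The Milk and Bread figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the transaction counts are hypothetical, chosen so that they reproduce s = 0.81 and c = 0.90, and the function name `rule_metrics` is my own. It also computes lift (confidence divided by the overall frequency of Y), a measure not used in the text above; a lift of 1 is what independent items would give.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence and lift of the rule X -> Y from transaction counts."""
    support = n_xy / n_total             # fraction of transactions containing X and Y
    confidence = n_xy / n_x              # fraction of X-transactions that also contain Y
    lift = confidence / (n_y / n_total)  # confidence relative to the base rate of Y
    return support, confidence, lift


# Hypothetical counts (an assumption, not data from the text):
# 1000 transactions, 900 with milk, 900 with bread, 810 with both.
s, c, lift = rule_metrics(n_total=1000, n_x=900, n_y=900, n_xy=810)
print(f"support={s:.2f} confidence={c:.2f} lift={lift:.2f}")
```

With these counts the confidence of 0.90 is exactly the fraction of all transactions that contain bread, so knowing that a transaction contains milk tells us nothing new about bread, even though support and confidence both look high.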

Even if we measure a correlation between variables, this relationship does not necessarily describe any causality. I want to state an even stronger hypothesis: from observational and historical data it is generally impossible to infer any causal relationships. The only scientific approach for concluding causality is the experimental one: first state a hypothesis, then carry out experiments in which known input variables are actively changed, and observe how the response variables react to those changes.

To support my hypothesis, consider the following example: when we observe in some database that all vegetarians live longer and are sick less often, we might think that becoming vegetarians ourselves would also improve our health and life expectancy. But that is not necessarily the case. The structure that we found in the database, the rule that all vegetarians live longer and healthier lives, does not imply any causality; it is just a correlation found in the observations. It may be that all these people are vegetarians because of a totally different attitude towards life. Perhaps they take more care of their health, their food, etc. in general. Which, then, is the reason for the longer and healthier life? Perhaps being vegetarian is only one probable effect of this attitude and by no means a cause. On the other hand, if we believe in drawing causal conclusions from observed data, we might just as well conclude that longer life and better health cause (with some probability) being a vegetarian.
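The vegetarian example can be made concrete with a small simulation. The following is a sketch under invented assumptions, not a model of real data: a hidden attitude (here called `conscious`) makes a person both more likely to be vegetarian and longer-lived, while diet itself has no effect on lifespan at all. All probabilities, lifespans, and names are hypothetical. The same function is run once with diets chosen by the people themselves (observation) and once with diets assigned by coin flip (the experimental approach described earlier).

```python
import random
from statistics import mean


def simulate(n, randomize_diet, seed=0):
    """Return mean lifespan of vegetarians minus mean lifespan of non-vegetarians."""
    rng = random.Random(seed)
    veg, non_veg = [], []
    for _ in range(n):
        # Hidden common cause: a health-conscious attitude (assumed 30% of people).
        conscious = rng.random() < 0.3
        if randomize_diet:
            # Experiment: the diet is assigned by coin flip, independent of attitude.
            vegetarian = rng.random() < 0.5
        else:
            # Observation: the attitude influences the choice of diet.
            vegetarian = rng.random() < (0.6 if conscious else 0.05)
        # Lifespan depends on the attitude only; diet has NO effect in this toy world.
        lifespan = 75 + (8 if conscious else 0) + rng.gauss(0, 5)
        (veg if vegetarian else non_veg).append(lifespan)
    return mean(veg) - mean(non_veg)


observed_gap = simulate(20000, randomize_diet=False)
experimental_gap = simulate(20000, randomize_diet=True)
print(f"observational gap: {observed_gap:.2f} years")
print(f"experimental gap:  {experimental_gap:.2f} years")
```

In the observational run the vegetarians live several years longer on average, although in this toy world the diet does nothing. In the randomized run the gap shrinks to roughly zero, because assigning the diet by chance cuts its link to the hidden attitude.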

We still try to infer causalities from recorded data, but it should be noted that all this is done in the context of extensive background knowledge and already known laws about the world. For example, most people would not conclude that better health causes vegetarian eating habits, simply because this conflicts with their a priori knowledge (or belief). In other cases we are able to explain some relationships (even previously unknown ones) by means of external knowledge. This might then justify a causal conclusion.

The whole issue of causality belongs to a much wider context in the philosophy of science, in particular the problem of induction.


Thomas Prang
1998-06-07