
Parts and Wholes

We also need to be careful when drawing conclusions from marginalized and aggregated data. Imagine a "fair" university that accepts female and male applicants at the same rate p_field within each field, for example the best 10% of female applicants and the best 10% of male applicants in education. Suppose also that the total number of female applicants equals the total number of male applicants. Assume the following distribution of applicants (with simplified numbers):


Field             Rate (%)   Female applicants   Female accepted   Male applicants   Male accepted
Education               10                1000               100               100              10
Social Sciences         20                 500               100               300              60
Engineering             30                 200                60              1300             390
Total                                     1700               260              1700             460

If we look only at the aggregated totals, the selection of students seems biased: many more male applicants (460 of 1700, about 27%) are accepted than female applicants (260 of 1700, about 15%). The reason is that female applicants applied more often to fields with a lower acceptance rate (higher competition). This information is no longer visible in the total (projected) distribution: parts (marginals, projections) do not in general determine wholes (overall distributions). A similar example with patients in a hospital can be found in Glymour et al. [11, p. 20].
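The arithmetic behind this example can be checked with a minimal Python sketch. The applicant counts and rates are copied from the table above; the dictionary layout and the function names (accepted, applicants, overall_rate) are hypothetical and chosen only for illustration.

```python
# Per-field acceptance rate (in percent) and applicant counts, copied from the table.
# The data layout and all names below are illustrative assumptions.
fields = {
    "Education":       {"rate_percent": 10, "female": 1000, "male": 100},
    "Social Sciences": {"rate_percent": 20, "female": 500,  "male": 300},
    "Engineering":     {"rate_percent": 30, "female": 200,  "male": 1300},
}


def accepted(group):
    """Total accepted applicants of one group, applying each field's own rate."""
    return sum(f[group] * f["rate_percent"] // 100 for f in fields.values())


def applicants(group):
    """Total applicants of one group across all fields."""
    return sum(f[group] for f in fields.values())


def overall_rate(group):
    """Aggregated acceptance rate, i.e. the number the marginal totals suggest."""
    return accepted(group) / applicants(group)


for group in ("female", "male"):
    print(f"{group}: {accepted(group)} of {applicants(group)} accepted "
          f"({overall_rate(group):.1%})")
# female: 260 of 1700 accepted (15.3%)
# male: 460 of 1700 accepted (27.1%)
```

Within every field both groups are accepted at exactly the same rate, yet the aggregated rates differ by almost a factor of two, purely because the two groups are distributed differently over the fields.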


Thomas Prang
1998-06-07