
Rule-Selection:

How are rules selected for each node? The best-known approach, used in ID3 [45], is called ``information gain'' and is based on Shannon's entropy. ``Variants of the information gain (...) have become the de facto standard metrics for TDIDT split selection'' [32, pg. 263]. Let C be the classification variable and $X=\{X_1,\ldots,X_r\}$ a set of variables with nominal or discretized values. The variable set X is to be evaluated as a node condition for separating the classification classes $c \in C$. The information gain $G_C(X)$ is then defined as:

\begin{displaymath}G_C(X) := H(C) - H(C\vert X)
\end{displaymath} (33)
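
To make the definition concrete, here is a minimal sketch in Python of how equation 33 can be computed from data. The function names (entropy, conditional_entropy, information_gain) and the list-based data representation are illustrative choices for this sketch, not part of ID3 itself.

\begin{verbatim}
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy H(C) in bits of a sequence of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def conditional_entropy(labels, values):
    # H(C|X): entropy of the class labels inside each partition
    # induced by the attribute values, weighted by partition size.
    partitions = {}
    for c, x in zip(labels, values):
        partitions.setdefault(x, []).append(c)
    total = len(labels)
    return sum(len(p) / total * entropy(p)
               for p in partitions.values())

def information_gain(labels, values):
    # G_C(X) := H(C) - H(C|X), equation 33.
    return entropy(labels) - conditional_entropy(labels, values)
\end{verbatim}

For example, with labels ['yes','yes','no','no'] and attribute values ['sunny','sunny','rain','rain'], the attribute separates the classes perfectly, so the gain equals H(C) = 1 bit.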

From section 2.2 we know that the information gain is 0 if the variables C and X are independent (H(C) = H(C|X)). The more strongly they are ``correlated'', the lower the predictive uncertainty H(C|X) and the higher the information gain. Several variations of this measure are known [32]. One of them, the ``gain ratio'', tries to compensate for the known bias of the gain towards variable sets with larger domains (more values):

\begin{displaymath}GR_C(X) := \frac{H(C) - H(C\vert X)}{H(X)} = \frac{G_C(X)}{H(X)}
\end{displaymath} (34)
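
Continuing the sketch above, the gain ratio of equation 34 only adds a normalization by H(X), the entropy of the attribute's own value distribution. The guard against a constant attribute (H(X) = 0) is an assumption of this sketch, not part of the definition.

\begin{verbatim}
def gain_ratio(labels, values):
    # GR_C(X) := G_C(X) / H(X), equation 34. Dividing by H(X)
    # penalizes attributes with many distinct values, which would
    # otherwise be favoured by the plain information gain.
    h_x = entropy(values)   # H(X): entropy of the attribute itself
    if h_x == 0.0:          # constant attribute: no useful split
        return 0.0
    return information_gain(labels, values) / h_x
\end{verbatim}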

Other measures include the chi-square statistic, Fisher's exact test, the category utility measure, etc. [32].

The important point in selecting rules for decision nodes is that we want to maximize class homogeneity within partitions and maximize class distance between partitions.
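
As a hypothetical illustration of this selection criterion, a node-splitting step could simply pick the attribute maximizing the gain ratio over the examples reaching the node. The dict-of-columns representation and the name best_split are assumptions of this sketch.

\begin{verbatim}
def best_split(labels, attributes):
    # attributes: dict mapping attribute name -> list of values,
    # aligned with the class labels. Returns the name of the
    # attribute whose partitions are most class-homogeneous,
    # i.e. the one with the highest gain ratio (equation 34).
    return max(attributes,
               key=lambda a: gain_ratio(labels, attributes[a]))
\end{verbatim}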

