Next: ITRULE Algorithm:
Up: Rule inference
Previous: Formal Definition:
To search for `interesting' rules we need a preference measure to rank candidate rules and
an algorithm that uses this measure to find the `best' rules.
In general, the conditional probability, also called
``transition probability'' or ``confidence'', is a
belief parameter associated with every rule:

\[ c = f(Y=y \mid X=x) \tag{26} \]

It expresses the fraction of cases in which, given that the premise holds, the conclusion of the rule is actually true.
Another measure, mostly used for association rules (see below), is ``support''.
It expresses the significance of a rule by measuring the probability that
premise and conclusion of the rule occur together in the data:

\[ s = f(X=x, Y=y) \tag{27} \]
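As a sketch (not from the original text), confidence and support of a rule ``if X=x then Y=y'' can be estimated from a list of (x, y) observations; the function names and the toy transaction data are illustrative assumptions:

```python
def confidence(data, x, y):
    # f(Y=y | X=x): fraction of records with premise X=x whose conclusion Y=y holds
    matching = [(a, b) for a, b in data if a == x]
    return sum(1 for a, b in matching if b == y) / len(matching)

def support(data, x, y):
    # f(X=x, Y=y): fraction of all records where premise and conclusion co-occur
    return sum(1 for a, b in data if a == x and b == y) / len(data)

# Hypothetical transactions: (premise item, conclusion item)
transactions = [("bread", "milk"), ("bread", "milk"),
                ("bread", "eggs"), ("jam", "milk")]
print(confidence(transactions, "bread", "milk"))  # 2/3
print(support(transactions, "bread", "milk"))     # 0.5
```

Note how support is normalized by all records while confidence is normalized only by the records matching the premise.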
An interesting information-theoretic measure for general rule induction was
introduced in 1988 by Goodman and Smyth [13]. The ``J-measure''
is a product of the probability of the rule's precondition, f(X=x),
and a special case of
Shannon's cross-entropy. As a refresher, cross-entropy, or directed divergence, is
defined as (section 3.3.2, [28, pg. 279], [38, pg. 12]):

\[ D(f \,\|\, g) = \sum_{x} f(x) \log_2 \frac{f(x)}{g(x)} \]
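A small numeric sketch of the directed divergence D(f‖g) = Σ f(x) log₂(f(x)/g(x)); base-2 logarithms (bits) are an assumption here, and the two example distributions are made up:

```python
import math

def cross_entropy(f, g):
    # Directed divergence D(f || g) between two discrete distributions,
    # given as aligned lists of probabilities; zero-probability terms vanish.
    return sum(p * math.log2(p / q) for p, q in zip(f, g) if p > 0)

print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(cross_entropy([0.9, 0.1], [0.5, 0.5]))  # positive: f diverges from g
```

The divergence is zero exactly when the two distributions coincide and grows as they move apart, which is the property the j-measure exploits below.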
In rule-inference we are interested in the distribution of the
``implication'' variable Y,
and especially in its two events y and its complement \(\bar{y}\).
We want to measure the difference between
the a priori distribution f(Y), i.e. f(Y=y) and
\(f(Y=\bar{y})\),
and the a posteriori
distribution
\(f(Y \mid X=x)\),
i.e. \(f(Y=y \mid X=x)\)
and
\(f(Y=\bar{y} \mid X=x)\).
The ``j-measure'' (small j) is defined as ``the average mutual information between the events
(y and \(\bar{y}\))
with the expectation taken with respect to the a posteriori probability
distribution of (Y).'' [41, pg. 304]. Denote \(f(y) = f(Y=y)\)
and
\(f(y \mid x) = f(Y=y \mid X=x)\):

\[ j(Y; X=x) = f(y \mid x)\,\log_2\frac{f(y \mid x)}{f(y)} + f(\bar{y} \mid x)\,\log_2\frac{f(\bar{y} \mid x)}{f(\bar{y})} \tag{28} \]
This measure is maximized when the ``transition'' probability
\(f(Y=y \mid X=x)\) equals 1 (or 0), and minimized (= 0) when the transition probability equals the a priori
probability f(Y=y). ``In this sense the j-measure is a well-defined measure
of how dissimilar our a priori and a posteriori beliefs are about (Y) --
useful rules imply a high degree of dissimilarity.'' [41, pg. 305].
Summarizing, the j-measure combines two important features. The first is the ``goodness of fit''
between the rule hypothesis and the data, expressed by
having maximal values for transition probabilities
close to 1 (or 0 for a negative rule).
The second is the amount of ``dissimilarity'' compared with the unconditioned
distribution. A rule whose confidence is similar to
the overall conclusion probability
f(Y=y)
wouldn't make much sense, even if that
probability is close to 100%. As an example, imagine 90% of all customers buy milk; then a
rule ``buying bread \(\Rightarrow\) buying milk with c = 91%'' wouldn't be very useful.
The implication of buying milk is not given by buying bread;
it is just a general pattern.
A third feature is ``simplicity'', which is combined with the j-measure to form the
J-measure. Simplicity is a measure for the complexity of a rule's precondition.
The more likely the truth of the precondition, the simpler and more
useful the rule. But the likelihood of the precondition is just the probability
f(X=x).
Therefore the average information content of a rule can be defined as:

\[ J(Y; X=x) = f(X=x) \cdot j(Y; X=x) \tag{29} \]
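The milk/bread example can be worked through with this definition. The sketch below (an illustration, not the ITRULE code) weights the j-measure by a hypothetical premise probability f(X=x) = 0.3 and compares the 91%-confidence rule against a far more informative one:

```python
import math

def j_measure(p_prior, p_post):
    # j(Y; X=x) over the events y and its complement, base-2 logs assumed
    j = 0.0
    for post, prior in ((p_post, p_prior), (1.0 - p_post, 1.0 - p_prior)):
        if post > 0:
            j += post * math.log2(post / prior)
    return j

def J_measure(p_premise, p_prior, p_post):
    # Eq. (29): weight the dissimilarity term j by the simplicity term f(X=x)
    return p_premise * j_measure(p_prior, p_post)

# Prior f(milk) = 0.90 as in the text; f(bread) = 0.3 is a made-up premise probability.
weak = J_measure(0.3, 0.90, 0.91)    # confidence barely above the prior
strong = J_measure(0.3, 0.90, 0.999) # confidence far above the prior
print(weak, strong)
```

Despite its 91% confidence, the weak rule scores close to zero because its posterior hardly differs from the prior, which is exactly the behavior the j-measure's dissimilarity term is designed to capture.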
Thomas Prang
1998-06-07