Association Rules:

Next: Discussion: Up: Rule inference Previous: ITRULE Algorithm:

Association Rules:

The similar problem of finding ``association rules'' was introduced by Agrawal, et.al., 1993 [4] in the context of market-basket research (section 3.3). In this field the buying behavior of customers is the main interest; we want to investigate what subsets of items are usually bought together in one ``transaction''. Therefore association rules usually operate on sets of items, but by seeing one (variable, value)-pair as one item and one database-record as a transaction we can transfer our problem to this methodology. Note that by doing so some restrictions are imposed on the transactions: each transaction has the same number of items, that is exactly one for each variable; items with different values for the same variable are excluded from being in the same transaction.

An association rule is defined as an expression:

$\begin{displaymath}\mathcal{X} \rightarrow \mathcal{Y} \mbox{ with confidence $c$\space and support $s$ }, \end{displaymath}$

(30)

where

$\begin{displaymath}\mathcal{I}:= \{ i_1,i_2,\ldots,i_m \} \end{displaymath}$

is the set of all items and

$\begin{displaymath}\mathcal{X}:= \{ i_{x_1},i_{x_2},\ldots,i_{x_r}\} \subseteq \mathcal{I}, \quad 1 \le r \le (m-1) \end{displaymath}$

$\begin{displaymath}\mathcal{Y}:= \{ i_y\} \subset \mathcal{I} \end{displaymath}$

are subsets of items.

The interpretation of an association rule is that ``transactions in the database which contain the items in $\mathcal{X}$ also tend to contain the items in $\mathcal{Y}$ '' [47]. The data $\mathcal{D}$ to be investigated consists of a set of Transactions T, which are themselves subsets of items, i.e. all the items a customer bought at the same time:

$\begin{displaymath}\mathcal{D}:= \{ T_1,T_2,\ldots,T_n \}, \qquad T_i \subseteq \mathcal{I} \end{displaymath}$

As already described there are two basic measures for evaluating the strength of an association rule. The ``confidence'' c) measures the percentage of transactions in which the items $\mathcal{Y}$ were bought whenever the items in $\mathcal{X}$ were bought (frepresents the induced frequency distribution from the data $\mathcal{D}$ ):

$\begin{displaymath}c=c(\mathcal{X}, \mathcal{Y}):= f( \mathcal{Y} \subset T \vert \mathcal{X} \subseteq T) \end{displaymath}$

(31)

The ``support'' (s) expresses the significance of the rule as the percentage of all transactions which contain both the items in $\mathcal{X}$ and $\mathcal{Y}$ :

$\begin{displaymath}s=s(\mathcal{X}, \mathcal{Y})=s(\mathcal{X} \cup \mathcal{Y}):= f( (\mathcal{X} \cup \mathcal{Y}) \subseteq T ) \end{displaymath}$

(32)

The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and confidence. It is decomposed into 3 parts [47]:

1.: Find all sets of items $I \subseteq \mathcal{I}$ (``frequent itemsets'') which are subset of more transaction T_i than the user-defined minimum. In other words find all subsets of $I \subseteq \mathcal{I}$ which satisfy a specified minimum support. This can be done in a simple cumulative algorithm because of the nestedness of the subset property: If an itemset I fullfills the requirement, so do all subsets of I. By starting with single items and sequentially joining itemsets while checking for minimum support we are able to efficiently determine the frequent itemsets. In this procedure we also automatically obtain the support for all frequent itemsets including their subsets (which are also frequent itemsets). This is used in the next step.
2.: Use the frequent itemsets to generate desired rules. The general idea is that if $\{i_1,i_2,i_3,i_4\}$ is a frequent itemset, then we can determine the confidence for a rule $\{i_1,i_2\} \rightarrow \{i_3,i_4\}$ by computing $c(\{i_1,i_2\},\{i_3,i_4\})=s(\{i_1,i_2,i_3,i_4\}) / s(\{i_3,i_4\})$ .
3.: Prune all uninteresting rules from this set. The criteria for ``interesting'' must be defined according to the Purpose of Investigation.

Using this concept several fast algorithms have been developed. For more details and references see one of the many papers on association rules [4,47,36] etc.

As mentioned earlier, hierarchies of values, or in this case hierarchies defined on the itemsets, are important for finding general pattern. This is especially true for this case as we preselect rules according to their support. The name for using hierarchies (taxonomies) in market-basket-research is ``Generalized Association Rules''; appropriate algorithms have been developed [47].

The confidence and support measures are fairly elementary and not necessarily sufficient in describing interesting rules. Other measures based on statistics, i.e. chi-square test, have been proposed [36].

Next: Discussion: Up: Rule inference Previous: ITRULE Algorithm:

Thomas Prang
1998-06-07