Today I will describe one of the more interesting algorithms I've learned recently in my Data Mining class, which aims at finding items that are frequently bought together. For instance, beer and diapers (yes, you guessed it – dads doing shopping).

Market Basket Analysis helps to identify items that are likely to be purchased together. Association rule mining finds correlations between items in a set of transactions. Marketers may then use these association rules to place correlated products next to each other on store shelves or online so that customers buy more items. Finding frequent sets in mining association rules is a computationally intensive problem.

In general, the task can be described as follows for a set of transactions:

Given: a set of transactions.

Find: IF-THEN rules that predict the occurrence of an item based on other items in the transaction.

| Transaction ID | Items |
| --- | --- |
| 1 | Bread, Milk |
| 2 | Bread, Milk, Diaper, Beer, Eggs |
| 3 | Milk, Diaper, Beer, Coke |
| 4 | Bread, Milk, Diaper, Beer |
| 5 | Bread, Milk, Diaper, Coke |

We will therefore look for association rules (implication means co-occurrence, not causality!), for instance:

{Diaper} → {Beer},

{Milk, Bread} → {Eggs, Coke},

{Beer, Bread} → {Milk}

An important thing to keep in mind is that we keep only those association rules with support ≥ s and confidence ≥ c, where the support of a rule X → Y is the fraction of transactions containing both X and Y, and its confidence is the fraction of the transactions containing X that also contain Y. The focus is therefore on common events, not rare ones (unlike, for instance, fraud detection). Many algorithms have been designed for this problem, e.g. the naive solution, Apriori, PCY, and others.
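To make the thresholds concrete, here is a minimal Python sketch that computes support and confidence for the rule {Diaper} → {Beer} over the five example transactions above (the helper names `support` and `confidence` are my own, chosen for illustration):

```python
# The five example transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs): how often rhs appears given lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Diaper", "Beer"}, transactions))                 # 0.6 (3 of 5)
print(round(confidence({"Diaper"}, {"Beer"}, transactions), 4))  # 0.75 (3 of the 4 with Diaper)
```

So {Diaper} → {Beer} would survive thresholds of, say, s = 0.5 and c = 0.7.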

The naive solution grows exponentially in the number of items and eventually becomes too slow, so it is not a viable option.

For each association rule X → Y, keep it if it meets both the support and confidence thresholds.

If we have m items, there are 3^m – 2^(m+1) + 1 possible rules (one for every ordered pair of disjoint, non-empty itemsets X and Y).
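The formula can be checked by brute force for small m; a quick Python sketch (the function names are mine):

```python
from itertools import chain, combinations

def nonempty_subsets(items):
    """All non-empty subsets of `items`, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

def count_rules(m):
    """Brute-force count of rules X -> Y with X, Y non-empty and disjoint."""
    items = range(m)
    count = 0
    for lhs in nonempty_subsets(items):
        rest = [i for i in items if i not in lhs]
        count += sum(1 for _ in nonempty_subsets(rest))
    return count

for m in range(1, 6):
    assert count_rules(m) == 3**m - 2**(m + 1) + 1
print([count_rules(m) for m in range(1, 6)])  # [0, 2, 12, 50, 180]
```

Even for m = 5 there are already 180 candidate rules, which is why smarter algorithms are needed.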

A better algorithm is Apriori, which applies an iterative join-and-prune approach: frequent itemsets of size k − 1 are joined into candidate itemsets of size k, and any candidate with an infrequent subset is pruned, since every subset of a frequent itemset must itself be frequent.
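To show the join and prune steps concretely, here is a minimal Apriori sketch in Python, run against the example transactions above (the function and variable names are my own, and `min_support` is an absolute transaction count here rather than a fraction):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets (frozensets) appearing in >= min_support transactions."""
    # Pass 1: count single items.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s for s, c in counts.items() if c >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Join: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count the surviving candidates in one more pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        result |= frequent
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, min_support=3)
print(frozenset({"Milk", "Diaper", "Beer"}) in freq)  # True
```

With a support threshold of 3 transactions, {Coke} and {Eggs} are already eliminated in pass 1, so no pair containing them is ever counted.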

An even better version of it is the PCY algorithm, which applies hash-based filtering and takes advantage of the memory left idle during pass 1. During pass 1, besides counting single items, maintain a hash table: keep a count for each bucket into which pairs of items are hashed. If the count of a bucket is ≥ the support threshold s, it is called a frequent bucket. A bucket with total count less than s cannot contain any frequent pair, so all of its pairs can be eliminated as candidates. In pass 2, count only pairs that hash to frequent buckets. An example hash function is the remainder of the product of the two item IDs divided by 11 (e.g. the pair of items {1, 3} hashes to bucket 3 because (1*3) % 11 = 3).
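Putting the two passes together, here is a small Python sketch of PCY's hash-based filtering, using the (i*j) % 11 bucket hash from the example (the integer-coded baskets and all the names are made up for illustration):

```python
from itertools import combinations

NUM_BUCKETS = 11
MIN_SUPPORT = 3

# Hypothetical baskets with items encoded as small integers.
baskets = [
    [1, 2],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 6],
    [1, 2, 3, 4],
    [1, 2, 3, 6],
]

# Pass 1: count single items AND hash every pair into a bucket.
item_counts = {}
bucket_counts = [0] * NUM_BUCKETS
for basket in baskets:
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1
    for i, j in combinations(sorted(basket), 2):
        bucket_counts[(i * j) % NUM_BUCKETS] += 1

# Between passes: reduce bucket counts to a bitmap of frequent buckets.
frequent_bucket = [c >= MIN_SUPPORT for c in bucket_counts]
frequent_items = {i for i, c in item_counts.items() if c >= MIN_SUPPORT}

# Pass 2: count only pairs of frequent items that hash to a frequent bucket.
pair_counts = {}
for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        if i in frequent_items and j in frequent_items \
                and frequent_bucket[(i * j) % NUM_BUCKETS]:
            pair_counts[(i, j)] = pair_counts.get((i, j), 0) + 1

frequent_pairs = {p for p, c in pair_counts.items() if c >= MIN_SUPPORT}
print(sorted(frequent_pairs))  # [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
```

In a real implementation the bucket counts would be replaced by a bitmap (one bit per bucket) between the passes, which is what frees the memory for pair counting in pass 2.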

A bit more about the background and implementation in the original article:

Jong Soo Park, Ming-Syan Chen, Philip S. Yu: *An Effective Hash Based Algorithm for Mining Association Rules*. SIGMOD Conference 1995: 175-186

I found it particularly interesting to learn a bit more about this use of a hashing function.