Hash Hash Baby: On Market Basket Analysis

Today I will describe one of the interesting algorithms I've learnt lately in my Data Mining class, which aims at finding items frequently bought together. For instance, beer and diapers (yes, you guessed it: dads doing the shopping).

Market Basket Analysis helps to identify items that are likely to be purchased together. Association rule mining finds correlations between items in a set of transactions. Marketers may then use these association rules to place correlated products next to each other on store shelves or online so that customers buy more items. Finding frequent sets in mining association rules is a computationally intensive problem.

In general, the task can be described as follows:

Given: a set of transactions
Find: IF-THEN rules that predict the occurrence of an item based on the other items in the transaction.

Transaction ID   Items
1                Bread, Milk
2                Bread, Milk, Diaper, Beer, Eggs
3                Milk, Diaper, Beer, Coke
4                Bread, Milk, Diaper, Beer
5                Bread, Milk, Diaper, Coke

We will therefore look for association rules (implication means co-occurrence, not causality!), for instance:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

An important thing to keep in mind is that we keep only those association rules with support ≥ s and confidence ≥ c. The focus is therefore on common events, not on rare ones (which matter, for instance, in fraud detection). Many algorithms have been designed to address this problem: the naive solution, Apriori, PCY, and others.
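To recall the standard definitions: the support of a rule X → Y is the fraction of transactions containing all items of X ∪ Y, and its confidence is support(X ∪ Y) / support(X). In the table above, the rule {Diaper} → {Beer} has support 3/5 = 0.6, because Diaper and Beer occur together in transactions 2, 3, and 4, and confidence 3/4 = 0.75, because Diaper occurs in four transactions and Beer in three of those.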

The naive solution gets exponentially complex in the number of items and eventually becomes too slow, so it is not a viable option:

For each association rule X → Y,
keep it if it meets the support & confidence thresholds.

If we have m items, there are 3^m – 2^(m+1) + 1 possible rules: each item can go to the left-hand side, the right-hand side, or neither (3^m assignments), minus the 2^m assignments with an empty left-hand side and the 2^m with an empty right-hand side, plus 1 for the doubly empty assignment subtracted twice. Already for m = 6 items this gives 729 – 128 + 1 = 602 rules.

A better algorithm is Apriori, which applies an iterative approach of joining and pruning: candidate k-itemsets are generated by joining frequent (k–1)-itemsets, candidates containing an infrequent subset are pruned (no superset of an infrequent itemset can be frequent), and the support of the surviving candidates is counted in a pass over the data.

An even better version of it is the PCY algorithm, which adds hash-based filtering. It takes advantage of the memory that sits mostly idle during pass 1, which only needs to count individual items. During pass 1, we additionally maintain a hash table: we keep a count for each bucket into which pairs of items are hashed. If the count of a bucket is ≥ the support s, it is called a frequent bucket. For a bucket with total count less than s, none of its pairs can be frequent, so they can all be eliminated as candidates (between the passes, the bucket counts are typically compressed into a bitmap with one bit per bucket). In pass 2, we only count pairs that hash to frequent buckets. An example of a hash function could be the remainder of the product of the two items divided by 11; e.g. the pair of items {1, 3} would be hashed to bucket 3, because (1*3) % 11 = 3.
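To make pass 1 concrete, here is a toy, self-contained sketch in Java. The baskets, bucket count, and support threshold are made up for illustration; a real implementation would stream transactions from disk and size the bucket array to fill available memory.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class PcyPassOne {

    static final int NUM_BUCKETS = 11; // toy size; real tables fill available memory
    static final int SUPPORT = 3;      // support threshold s

    // The hash function from the example above: (i * j) mod 11.
    static int bucket(int i, int j) {
        return (i * j) % NUM_BUCKETS;
    }

    public static void main(String[] args) {
        int[][] baskets = {
            {1, 2}, {1, 2, 3, 4, 5}, {2, 3, 4, 6},
            {1, 2, 3, 4}, {1, 2, 3, 6}
        };

        Map<Integer, Integer> itemCounts = new HashMap<>();
        int[] bucketCounts = new int[NUM_BUCKETS];

        // Pass 1: count single items AND hash every pair into a bucket.
        for (int[] basket : baskets) {
            for (int item : basket) {
                itemCounts.merge(item, 1, Integer::sum);
            }
            for (int a = 0; a < basket.length; a++) {
                for (int b = a + 1; b < basket.length; b++) {
                    bucketCounts[bucket(basket[a], basket[b])]++;
                }
            }
        }

        // Between passes: compress bucket counts into a bitmap (1 bit per bucket).
        BitSet frequentBuckets = new BitSet(NUM_BUCKETS);
        for (int k = 0; k < NUM_BUCKETS; k++) {
            if (bucketCounts[k] >= SUPPORT) {
                frequentBuckets.set(k);
            }
        }

        // Pass 2 would count a pair {i, j} only if i and j are both frequent
        // items AND bucket(i, j) is set in the bitmap.
        System.out.println("Item counts:      " + itemCounts);
        System.out.println("Frequent buckets: " + frequentBuckets);
    }
}
```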

More about the background and the implementation can be found in the original article:
Jong Soo Park, Ming-Syan Chen, Philip S. Yu: An Effective Hash Based Algorithm for Mining Association Rules. SIGMOD Conference 1995: 175-186

I found it particularly interesting to learn a bit more about this use of a hashing function.


Hadoop MapReduce: Victory beyond WordCount!

After the Hortonworks Docker image did not want to cooperate with VirtualBox on my Dell computer (with 8 GB of memory), I was left with two options: ask for a remote server to be set up, or try other distributions. I did both, with success. The advantage of Hortonworks is that it provides Spark 2 and Python 2.7, which are needed for my other assignment (predictions from streamed Reddit data, which I will describe in another entry). As an interim solution for some simple MapReduce programs, I worked with the Cloudera QuickStart 5.4.2.

As a beginner Java programmer, I faced quite a steep learning curve figuring out the three classes of the program: the mapper, the reducer, and the so-called driver, which contains the main method (a minimal driver skeleton is sketched below). On top of that, Hadoop has its own data types, e.g. Text instead of String.
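For illustration, a minimal driver looks roughly like this. The job and class names are hypothetical, not my actual assignment code; the matching mapper and reducer for the first task are sketched further below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class: configures the job, wires in the
// mapper and reducer, then submits it to the cluster.
public class TemperatureStatsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "temperature stats");
        job.setJarByClass(TemperatureStatsDriver.class);
        job.setMapperClass(TemperatureMapper.class);
        job.setReducerClass(TemperatureReducer.class);
        job.setOutputKeyClass(Text.class);   // e.g. "1<TAB>Brussels"
        job.setOutputValueClass(Text.class); // e.g. "min=-5 max=3 avg=-0.70"
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```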

For the assignment, I was given two tasks:

The first task's goal was to compute the minimum, maximum, and average temperatures for each month and city (a sketch of a possible mapper and reducer follows the sample data below). The tab-separated data looked as follows, where the first value was the day, the second the month, the third the city, and the fourth the recorded temperature:

1 1 Brussels 1
2 1 Brussels 2
3 1 Brussels 3
4 1 Brussels 1
5 1 Brussels 0
6 1 Brussels -1
7 1 Brussels -2
8 1 Brussels -3
9 1 Brussels -3
10 1 Brussels -5
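Not my actual assignment code (that comes after the deadline), but a minimal sketch of the mapper and reducer for this task could look as follows, continuing the hypothetical class names from the driver skeleton above (each class goes in its own file):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: keys each record by (month, city), keeps the temperature.
public class TemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: day <TAB> month <TAB> city <TAB> temperature
        String[] fields = value.toString().split("\t");
        context.write(new Text(fields[1] + "\t" + fields[2]), new Text(fields[3]));
    }
}

// Hypothetical reducer: folds all temperatures of one (month, city)
// group into their minimum, maximum, and average.
public class TemperatureReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, sum = 0, count = 0;
        for (Text value : values) {
            int t = Integer.parseInt(value.toString());
            min = Math.min(min, t);
            max = Math.max(max, t);
            sum += t;
            count++;
        }
        double avg = (double) sum / count;
        context.write(key, new Text(String.format("min=%d max=%d avg=%.2f", min, max, avg)));
    }
}
```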

A piece of the output of my program looked as follows (yes, I could have rounded the average to two decimal places):

[Image: tempsmapreduce (sample output of the temperature program)]

The second task required finding the common friends of any two friends in the file. The tab-separated data looked as follows, where each line lists the friends of the first person on that line (e.g.: Jan's friends are Huy, Jordan, Robert, Jie, Nicolas, Kexin, and Luc):

Jan Huy Jordan Robert Jie Nicolas Kexin Luc
Jia Danai Lilong David Ivan Michael William Huy Gilbert Luc
Upendo Moise Lilong Axel Mingyan Daria William Huy Jie Bharat
Jordan Jan Carlos Moise Jie Robert Joris Krina Kexin John
Lilong Jia Ysaline Upendo Reinout Yilin Seppe Joris Michael Kexin
Pelle Jie John Qiaoyubing Axel Halyna Mingyan Kexin Carlos
Qiaoyubing Seppe Pelle Ivan Axel Gilbert John
Seiji Bjorn Huy Yilin Ivan Axel Nicolas Gilbert
Halyna Pelle John Robert Bjorn Carlos

The general idea behind this task is described on the blog of Steve Krenzel: http://stevekrenzel.com/finding-friends-with-mapreduce
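A minimal sketch of Krenzel's idea in Hadoop terms (again hypothetical class names, not my assignment code, each class in its own file): the mapper emits, for every friend F of person P, the sorted pair (P, F) as key and P's full friend list as value; the reducer then intersects the two lists it receives for each pair.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: one output per (person, friend) pair, with the
// person's full friend list as the value.
public class MutualFriendsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\t"); // person, friend1, friend2, ...
        String person = tokens[0];
        String friends = String.join(",", Arrays.copyOfRange(tokens, 1, tokens.length));
        for (int i = 1; i < tokens.length; i++) {
            // Sort the pair so (Jan, Huy) and (Huy, Jan) land on the same reducer key.
            String pair = person.compareTo(tokens[i]) < 0
                    ? person + "," + tokens[i]
                    : tokens[i] + "," + person;
            context.write(new Text(pair), new Text(friends));
        }
    }
}

// Hypothetical reducer: intersects the two friend lists of a pair;
// an empty result means the two are friends with no other mutual friends.
public class MutualFriendsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> mutual = null;
        for (Text value : values) {
            Set<String> friends = new HashSet<>(Arrays.asList(value.toString().split(",")));
            if (mutual == null) {
                mutual = friends;          // first person's friend list
            } else {
                mutual.retainAll(friends); // intersect with the second one
            }
        }
        context.write(key, new Text(mutual.toString()));
    }
}
```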

A piece of the output of my program looked as follows:

[Image: friendsmapreduce (sample output of the common-friends program)]

The empty brackets mean that the two persons are friends, but they do not have any other mutual friends.

Once the deadline for the assignments passes on 28 May, I will upload the code to GitHub.

Defeated by a yellow elephant plushie…

This weekend I tried hard to play with Hadoop and prepare the environment for one of the assignments for the Advanced Analytics class. VirtualBox setup: check. Sandbox setup: check. Cannot connect to Ambari. I've read half of the internet and tried all the recommended tricks, but still nothing works. The answer must be in the other half of the internet. The last resort is setting up the environment on the university server. But nobody likes last resorts… Back to petting the Python for the next two weeks of "Easter holidays" at home, starting today.

Programming is like sport – you need to practice a lot

Beautiful March weekend, sun, clear sky, more than 15 degrees outside. What a perfect time for hard training: a kata weekend in Lede, two days, 8 hours of intensive training. Satisfaction guaranteed. I made some observations that are very relevant to learning how to program: the so-called gurus are human too and hence not perfect. You need to set out on your own path to discovering the skill and mastering it. You need to listen, try and fail, and try again and succeed. You are stronger in a group, by sharing knowledge that is complementary and by helping each other. If you cannot reciprocate immediately, remember to send the lift back to those who follow you later.

[Image: JOW_medaille_Uitreiking_2016-2017]

Chocolate is best for creativity when it rains outside

Today a bunch of random thoughts gave birth to the next challenge in my PhD life: let's write a paper implementing a multi-agent system. I found a relevant course at the university that requires Java (never mind that we're already halfway through the semester); now the challenge lies in convincing my promoter. Has anyone tried this approach? Please do leave a comment!

Solving problems while drinking Java

It came somewhat unexpectedly that I need to challenge myself to build a tokenizer for one of my doctoral research projects. Yeah! I will need to make the program read the acquisition announcements from text files (hundreds of them) and parse structured information (e.g.: announcement date, target, SIC codes, address, etc.) into a CSV file, which can be read into Excel. Finally I will get to use the beginner Java skills I learnt last semester to solve a real problem! Otherwise I would have to copy-paste this information into Excel manually (no fun for so many files and so many fields). The adventure continues… I will post an update once it's ready and working.
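In the meantime, here is a first rough sketch of what I have in mind; the field labels, regular expressions, and file layout are all hypothetical at this point:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract labelled fields from announcement files
// and append them as rows of a CSV file.
public class AnnouncementTokenizer {

    // Assumes each field sits on a labelled line, e.g. "Date: 12/05/2017".
    private static final Pattern DATE   = Pattern.compile("Date:\\s*(.+)");
    private static final Pattern TARGET = Pattern.compile("Target:\\s*(.+)");
    private static final Pattern SIC    = Pattern.compile("SIC:\\s*(\\d+)");

    private static String find(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1).trim() : ""; // empty cell if the field is missing
    }

    public static void main(String[] args) throws IOException {
        StringBuilder csv = new StringBuilder("date,target,sic\n");
        // One announcement per .txt file in the directory given as args[0].
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                // Real CSV output would need quoting for values containing commas.
                csv.append(find(DATE, text)).append(',')
                   .append(find(TARGET, text)).append(',')
                   .append(find(SIC, text)).append('\n');
            }
        }
        Files.write(Paths.get("announcements.csv"),
                csv.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```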