Monthly Archives: February 2015

Pattern Discovery in Data Mining - Week 2

I've updated the GitHub repository to include some code for week 2 of the Coursera course Pattern Discovery in Data Mining. It is only a few lines of R that calculate the chi², lift and cosine measures for a contingency table of items; all of this week's questions could have been answered using pen and paper.
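For readers who want to check the three measures by machine, here is a minimal Python sketch (not the R code from the repository). It assumes a 2x2 contingency table for two items A and B given as four transaction counts; the function name `interestingness` and the example counts are made up for illustration.

```python
import math

def interestingness(n_ab, n_a_only, n_b_only, n_neither):
    """Return (lift, cosine, chi2) for a 2x2 contingency table of items A and B."""
    n = n_ab + n_a_only + n_b_only + n_neither
    n_a = n_ab + n_a_only   # transactions containing A
    n_b = n_ab + n_b_only   # transactions containing B

    # Relative supports of A, B and the pair {A, B}
    s_a, s_b, s_ab = n_a / n, n_b / n, n_ab / n

    lift = s_ab / (s_a * s_b)
    cosine = s_ab / math.sqrt(s_a * s_b)

    # chi-squared: sum over the four cells of (observed - expected)^2 / expected,
    # where the expected counts assume A and B are independent
    observed = [n_ab, n_a_only, n_b_only, n_neither]
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return lift, cosine, chi2

# Hypothetical counts: 400 transactions with both items, 350 with only A,
# 200 with only B, 50 with neither
lift, cosine, chi2 = interestingness(400, 350, 200, 50)
print(round(lift, 3), round(cosine, 3), round(chi2, 3))
```

A lift below 1 together with a large chi² value suggests the two items are negatively correlated in this made-up table.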

I am not a big fan of this quiz. Answering most of the questions amounted to revisiting the lecture slides and looking up the relevant definition. This is made harder by the fact that some of the definitions are confusing and the notation is not always well explained.

In any case, I did alright on the test, but I definitely hope there is more actual data mining involved in the coming weeks.

Pattern Discovery in Data Mining - Week 1

Today's the final day of the first week of the first course in the new Coursera Data Mining specialization - "Pattern Discovery in Data Mining".

This introduction covered a lot of ground, from a general introduction to transactional databases to frequent patterns and how to identify them (the Apriori algorithm and FP trees were discussed). The lectures total only a bit over one hour, but this was definitely one of the more difficult first weeks of the Coursera courses I have followed.

This is not only because the material is in itself quite dense, but also because the lecturer, Jiawei Han, has such a strong accent that the course is sometimes hard to follow. I had to google some definitions to make sure I understood what was going on.

For example, when explaining the difference between closed patterns and max patterns, the slides (and Mr. Han) state: "Do not care the real support of the sub patterns of a max-pattern", which is not extremely helpful when one is struggling with the concepts anyway.
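A small brute-force sketch in Python may make the distinction easier to see. It assumes the usual definitions: a frequent pattern is closed if no proper superset has the same support, and it is a max-pattern if no proper superset is frequent at all. The function names and the toy transactions are invented for this illustration.

```python
from itertools import combinations

def frequent_patterns(transactions, min_count):
    """Brute-force support counts for every itemset meeting min_count."""
    items = sorted(set().union(*transactions))
    supports = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                supports[frozenset(candidate)] = count
    return supports

def closed_patterns(supports):
    # Closed: no frequent proper superset has the same support
    return {p: c for p, c in supports.items()
            if not any(p < q and c == supports[q] for q in supports)}

def max_patterns(supports):
    # Max: no frequent proper superset exists at all
    return {p: c for p, c in supports.items()
            if not any(p < q for q in supports)}

# Toy database: four transactions over the items a, b, c
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
supports = frequent_patterns(transactions, min_count=2)
closed = closed_patterns(supports)
maximal = max_patterns(supports)

print(closed)   # {a}: 4, {a, b}: 3, {a, b, c}: 2
print(maximal)  # {a, b, c}: 2
```

The closed patterns keep the exact supports of {a} and {a, b}. The single max-pattern {a, b, c} only tells us that its sub-patterns occur at least twice, which is presumably what the slide means by not caring about their real support.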

In general, though, the course is off to a great start - the selection of material is interesting and the quiz was just hard enough to be challenging but not so difficult as to be frustrating.

I will go through some of the code I used this week below (all code for this specialization can be found at

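import itertools
from collections import Counter

def frequentItems(items, tdb, n, s):
    # All candidate itemsets of length n built from the unique items
    itemsets = set(itertools.combinations(items, n))

    # Record the itemset once for every transaction that contains it
    itemTransactions = []
    for i in itemsets:
        for k,v in tdb.items():
            if set(v).intersection(set(i)) == set(i):
                itemTransactions.append(i)

    # Keep only itemsets whose count reaches the minimum support s
    ret = []
    for k,v in sorted(Counter(itemTransactions).items()):
        if v >= s * len(tdb):
            ret.append([k, v])
    return ret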

After storing all transactions in a dictionary and creating a list of individual items, I defined a function which outputs all frequent itemsets of a given length n with minimum support s. This code first creates all possible itemsets from the list of unique items in the database. In the second step, each itemset is compared to every transaction in the database and recorded if a match is found. Finally, the function outputs all matches and the number of times a transaction matching the itemset was found.
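To see these steps on concrete data, here is a compact, self-contained sketch. The function `frequent_itemsets` is a hypothetical condensed variant of the approach described above, and the five-transaction database is made up for illustration; neither is taken from the repository.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(items, tdb, n, s):
    """Itemsets of length n whose relative support is at least s."""
    # Count, for every candidate itemset, the transactions that contain it
    counts = Counter(
        candidate
        for candidate in combinations(sorted(items), n)
        for transaction in tdb.values()
        if set(candidate) <= set(transaction)
    )
    threshold = s * len(tdb)
    return [[itemset, count] for itemset, count in sorted(counts.items())
            if count >= threshold]

# Toy transaction database: transaction id -> list of items
tdb = {
    1: ["beer", "nuts", "diapers"],
    2: ["beer", "coffee", "diapers"],
    3: ["beer", "diapers", "eggs"],
    4: ["nuts", "eggs", "milk"],
    5: ["nuts", "coffee", "diapers", "eggs", "milk"],
}
items = sorted({item for transaction in tdb.values() for item in transaction})

singles = frequent_itemsets(items, tdb, n=1, s=0.5)
pairs = frequent_itemsets(items, tdb, n=2, s=0.5)
print(singles)
print(pairs)   # [[('beer', 'diapers'), 3]]
```

With a minimum support of 0.5, an itemset has to appear in at least three of the five transactions, so only one pair survives. Enumerating every combination is fine for a toy example but grows very quickly with the number of items, which is the problem the Apriori algorithm and FP trees address.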

Coursera Data Mining Specialisation

I've decided to upgrade my programming skills a bit and get deeper into data mining. In particular, I want to become more adept at handling transactional databases and text processing, two areas which come up frequently at my current job.

That's why the Coursera specialization came at exactly the right time. I'll be updating this blog with code snippets as I follow along. All code can be found at my GitHub page: