Chapter 2 Association Analysis
Section 2 Frequent Itemset Generation
Page 4 Algorithm Deconstructed

Objectives

The objectives of this section are:
to explain the problem of frequent itemset generation
to introduce the main strategies for solving this problem
to explain the Apriori Principle
to explain the Apriori Algorithm and it's various parts

Outcomes

By the time you have completed this section you will be able to:
list the main strategies for solving the frequent itemset generation problem
describe the Apriori Principle
explain the major steps of the Apriori Algorithm
list the requirements for efficient Candidate Generation
distinguish between the methods for efficient Candidate Generation

Apriori Algorithm Deconstructed

There are three main parts of the algorithm that need to be explained a little bit further. The Candidate Generation, Candidate Pruning and Support Counting stages which take place after we generate frequent itemsets of length k are very crucial to grasping the complexity of the Apriori Algorithm.

Candidate Generation: this part of the algorithm does as the name suggests and generates new candidates of length (k+1) from frequent itemsets of length k. For the sake of efficiency there are some requirements that need to be met in order for us to have effective candidate generation.

  1. Avoid generating too many unnecessary candidates
  2. Ensure that all frequent itemsets are included
  3. Avoid repetitious generation of candidates

The table below outlines three candidate generation procedures that are widely used.

CandidateGeneration

 

Candidate Pruning: goes hand in hand with Candidate Generation and is responsible for eliminating some of the candidate k-itemsets that are infrequent.

Support Counting: this step is responsible for determining the frequency of occurrence for every candidate itemset that remains after the candidate pruning stage. The brute force approach would involve comparing each candidate itemset against each transaction but if you had 100 itemsets and 10,000 transactions it starts to become time consuming. An alternative approach is to list the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. Another approach involves using a hash tree to count the support for the itemsets.