The objectives of this section are:
to introduce you to the concept of association analysis
to explain the basic problem that association rules present
to excite you to delve deeper into the world of data mining
By the time you have completed this section you will be able to:
describe what association analysis is
define the key terms described in the problem definition
calculate the support and confidence for various itemsets
Say Beers and Diapers five times while looking at the animation to the left.
Now close your eyes. Do you see the connection between beer bottles and lil’ toddlers in diapers? If you do, then you have an extrasensory gift that I need to acquire. Beers and Diapers. What is the connection? How does one arrive at this rule? The rest of this section focuses on association rule mining and introduces you to the key terms, because in order to see the unbelievable and make these kinds of connections, you have to talk the talk.
Each time a customer checks out items at a supermarket, a list of everything they buy is compiled and subsequently stored on a central system. This collection of data is commonly referred to as market basket data. The two tables below are different representations of the Cougar Supermarket dataset. The table to the left uses a transaction representation (one row per transaction, listing the items bought), and the one to the right is a binary 0/1 representation of the same data (one column per item, with 1 if the item is in the transaction and 0 otherwise).
Words to know
Itemset: a collection of one or more items
k-itemset: an itemset that contains k items
Support count (s): the number of transactions that contain an itemset
Support: the fraction of all transactions that contain an itemset, that is, the support count divided by the total number of transactions
Confidence: the probability that itemset B appears in a transaction given that itemset A appears in that transaction
Association Rule: a relationship discovered between two itemsets, written A → B
Frequent Itemset: an itemset whose support is greater than or equal to a support threshold value
Strong Association Rules: rules whose confidence is greater than or equal to a confidence threshold value
The strength of an association rule is characterized by its support and confidence. These two work hand in hand to ensure that only significant rules are reported. Support helps eliminate uninteresting rules that could have occurred as a result of error or just by chance. Confidence, as its name suggests, measures the reliability of the inference made by a rule. For instance, for the association rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
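To make the definitions concrete, here is a minimal Python sketch of support count, support, and confidence. The five transactions are hypothetical and are not the Cougar Supermarket dataset; the function names are also just illustrative choices.

```python
# Hypothetical market basket data (NOT the Cougar Supermarket dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    antecedent_count = support_count(antecedent, transactions)
    if antecedent_count == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / antecedent_count


# Rule {diapers} -> {beer} on the hypothetical data:
print(support({"diapers", "beer"}, transactions))       # 3 of 5 transactions
print(confidence({"diapers"}, {"beer"}, transactions))  # 3 of the 4 diaper transactions
```

Notice that confidence is just the support count of the whole rule divided by the support count of its antecedent, so both measures come from the same counting step.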
For the market basket data given above, the table below shows the support and confidence for the following association rules.
This might not seem like complex analysis that requires a model, but imagine if we had a dataset with over 200 items and over 5000 transactions. The number of possible itemsets and rules grows exponentially with the number of items, so finding the support of every antecedent and of every possible rule quickly becomes impractical. In order to cut back on computation, the following thresholds are introduced.
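To get a feel for that growth, the sketch below counts the candidates for a dataset with d distinct items. It assumes a rule is any pair of non-empty, disjoint itemsets (antecedent and consequent); under that assumption there are 2^d - 1 non-empty itemsets and 3^d - 2^(d+1) + 1 possible rules. The function names are illustrative.

```python
def count_itemsets(d):
    """Non-empty itemsets over d items: every subset except the empty one."""
    return 2 ** d - 1


def count_rules(d):
    """Rules A -> B where A and B are non-empty and share no items.

    Each item goes to A, to B, or to neither (3**d ways); subtract the
    assignments where A is empty and where B is empty (2**d each), then
    add back the single case where both are empty.
    """
    return 3 ** d - 2 ** (d + 1) + 1


for d in (3, 6, 10, 200):
    print(d, count_itemsets(d), count_rules(d))
```

Even six items already allow 602 rules, and at 200 items the count has dozens of digits, which is why checking every rule by hand (or by brute force) is off the table.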
minsup: this is the minimal support used as a threshold
minconf: this is the minimal confidence used as a threshold
Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold
Strong Association Rules: rules whose confidence is greater than or equal to a minconf threshold
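Association rule mining uses these thresholds to reduce the amount of computation and find strong association rules in the data set. It can be viewed as a two-step process:
Frequent itemset generation: find all the itemsets whose support is greater than or equal to minsup
Rule generation: from each frequent itemset, extract the rules whose confidence is greater than or equal to minconf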
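Here is a brute-force sketch of that two-step process, assuming the first step collects every itemset that meets minsup and the second step splits each frequent itemset into a rule and keeps the ones that meet minconf. The data is hypothetical, the function names are illustrative, and the enumeration is deliberately naive; it is not one of the efficient techniques.

```python
from itertools import combinations

# Hypothetical market basket data (NOT the Cougar Supermarket dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def frequent_itemsets(transactions, minsup):
    """Step 1: map each itemset with support >= minsup to its support count."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count / n >= minsup:
                frequent[itemset] = count
    return frequent


def strong_rules(frequent, n_transactions, minconf):
    """Step 2: keep rules A -> B from frequent itemsets with confidence >= minconf."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is also frequent,
                # so its count is guaranteed to be in `frequent`.
                conf = count / frequent[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, count / n_transactions, conf))
    return rules


freq = frequent_itemsets(transactions, minsup=0.6)
for a, b, sup, conf in strong_rules(freq, len(transactions), minconf=0.8):
    print(sorted(a), "->", sorted(b), "support:", sup, "confidence:", conf)
```

The second step never rescans the transactions: every count it needs was already collected in the first step, which is the main reason the work is split this way.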
The next section focuses on efficient techniques for generating frequent itemsets.