Author: Peter Editor: Peter

Hello, I'm Peter~

Today, I'd like to share a classic machine learning algorithm: association rule analysis, which is full from theory to code to practice.

The main contents of this paper are as follows:

The article is too long. It is recommended to collect it

## Classic case

Association analysis is a method to find interesting relationships from large-scale data sets. An example often used in association analysis: shopping basket analysis.

By checking which goods are often purchased by customers together, you can help stores understand users' purchase behavior.

Classic beer and diaper cases:

When analyzing sales orders, the sales manager of a supermarket found that beer and diapers, two seemingly unrelated goods, often appear in the same order.

Later, the follow-up survey found that the wife of young couples usually arranged for their husband to go to the supermarket to buy diapers on Friday night, and the husband always couldn't help buying himself a few cans of beer when buying diapers.

That's why beer and diapers, two seemingly unrelated items, often appear in the same shopping basket.

In order to solve the problem of beer and diapers at the same time, the algorithm of association rule analysis is introduced.

Association rules are still used in many application fields, including network usage mining, intrusion detection, continuous production and bioinformatics.

## Related terms

In the process of using association rules (analysis), several terms are often encountered:

### Transaction Library

The above commodity shopping data is a transaction database, recording each data.

### affair

Each record in the transaction library is called a transaction. A transaction is a purchase.

### k-itemset

In the above example, each item is called an "item". A collection of items is called an itemset. For example, {diaper}, {diaper, beer}, {diaper, lettuce}, {diaper, beer, lettuce} are itemsets, that is, the combination of different commodities.

An itemset with K items is called k-itemset, 1-itemset, 2-itemset, K-itemset

### Association rules

association rules: implies that there may be A strong relationship between items, in the form of $A - > B $.

Among them, A is called the front part and B is called the back part, which means that if the user buys A commodity, he will also buy B commodity.

Here, AB can be a single commodity or an item set

For example: {A,B} - > {c} means that if the user purchases AB products, he will also purchase C products.

### Frequent itemsets

frequent item sets: a collection of items that often appear together. For example, {diapers, wine} in the above example is a good example.

Support (below ðŸ‘‡) Itemsets that are greater than or equal to the artificially specified threshold (this threshold is called the minimum support) are called frequent itemsets.

Assuming that the minimum support is 50%, if the support of a certain itemset {beer, diaper} is 70% (assuming), we call {beer, diaper} a frequent itemset.

## Evaluation criteria

There are three evaluation indicators in association rules:

- Support
- Confidence
- Lift

### Degree of support

Support: the proportion of records in the dataset containing this item set. For example, there are 5 records in total, including 3 {diapers and wine}. Therefore, the support of {diapers, wine} is 3 / 5.

The formula can be expressed as:

$$support(XY) = \frac {num(XY)}{num(samples)} $$

Support is for itemsets. In actual requirements, you can specify a minimum support to retain the item set that meets the minimum support and play the role of data filtering.

### Credibility / confidence

Confidence: it is defined for an association rule such as {diaper} - > {wine}. We can define credibility as:

Degree of support{diaper, Wine} / Degree of support{diaper}

In the above example: support {diaper, wine} = 3 / 5; Support {diaper} = 4 / 5. Then: the credibility is (3 / 5) / (4 / 5) = 3 / 4 = 75%

This means that 75% of all orders containing diapers contain wine

If the formula of conditional probability is used, it can be expressed as:

$$Confidence(Xâ€”>Y) = P(Y|X) = \frac {P(XY)}{P(X)}$$

That is: under the condition of X occurrence, the probability of Y occurrence = the probability of XY simultaneous occurrence / the probability of X occurrence

In the above example, suppose there is an association rule {diaper, beer} - > {lettuce} which can be expressed as:

$$p (lettuce | (diaper, beer)) = \ frac {P (lettuce, beer, diaper)} {P (diaper, beer)}$$

### Lifting degree

Lift represents the ratio of the probability of containing y under the condition of containing x to the overall probability of Y. That is, the ratio of the confidence of X to y to the overall occurrence probability of Y:

$$Lift(Xâ€”>Y)=\frac {P(Y|X)}{P(Y)} = \frac {Confidence(Xâ€”>Y)}{P(Y)}$$

The promotion degree reflects the correlation between X and Y in association rules

- The higher the lift > 1, the higher the positive correlation
- The degree of improvement < 1 and the lower it is, the higher the negative correlation is
- Lift = 1 indicates no correlation

## Strong association rules

An important concept: strong association rules. In practical application, it is usually:

- First, find the frequent itemsets that meet the minimum support
- Then, the association rules satisfying the minimum confidence are found in the frequent itemsets

The association rules found in this way are called strong association rules.

## case

Understand the three indicators through a simple example. Suppose there are 40 students in a class, 20 men and 20 women, and their interests in basketball and table tennis are counted:

Gender | Basketball | Table Tennis |
---|---|---|

Male (20 persons in total) | 18 | 20 |

Female (20 persons in total) | 0 | 12 |

Suppose: x = {like basketball}, Y = {like table tennis}, and find three indicators of association rule {X - > Y} (like both basketball and table tennis)

1. For boys

Support{X,Y}=18/20 # 18 people like it at the same time; 20 is the total number of boys Confidence{X,Y}=P(Y|X)=P(XY)/P(X)=18/18=1 # The first 18 indicates the number of people who like basketball at the same time, and the second indicates the number of people who like basketball at the same time Lift{X,Y}=Confidence{X,Y} / P(Y)=1/1=1

At this time, the lifting degree is 1, indicating that XY is independent of each other and does not affect each other. In other words, among boys, there is no connection between basketball and table tennis.

Although they have high support and credibility, they are not a strong association rule.

2. For girls

Support{X,Y}=0/20=0 # No girl likes basketball Confidence{X,Y}=P(Y|X)=P(XY)/P(X) = 0 Lift{X,Y}=0

There is no connection at all among girls

3. For the whole

Support{X,Y}=18/40=9/20 Confidence{X,Y}=P(Y|X)=P(XY)/P(X)=(18/40) / (18/40)=1 Lift{X,Y}=Confidence{X,Y}/P(Y) = 1 / (32/40)=5/4

On the whole, the support and credibility are very high, and the improvement degree is greater than 1, indicating that XY is a strong association rule.

## Apriori algorithm

The ultimate goal of association analysis is to find strong association rules. Apriori algorithm is one of the famous association rule mining algorithms. Main steps of the algorithm:

- Set minimum support and minimum confidence
- Find out all frequent itemsets according to the minimum support
- Finding strong association rules according to the minimum confidence

### Commodity mix

Suppose there are four commodities: Commodity 0, commodity 1, commodity 2 and commodity 3. You can quickly sort out the combination of goods purchased together. For example, item 0:

- Purchased separately
- With a commodity: 01, 02, 03
- Together with 2 commodities: 012, 013, 023
- Together with 3 commodities: 0123

Note: the first set is $\ phi $. Represents an empty set or a set that does not contain items

### Support calculation

Support of a set: refers to the proportion of transactions that contain the set. For example, the calculation of support of {0,3} set:

- Traverse each record to check whether it contains {0,3}. If the record contains 1
- After traversing all the data, the support obtained by using is: the total number of records containing the set / the total number of transaction records

In the above example, there are only 4 kinds of goods. We can see from the [item set combination diagram]: it needs to be traversed 15 times. When there are N commodities, the number of traversals is $2^N-1 $times.

The amount of data will increase sharply, and the calculation time will increase greatly. In order to solve this problem, Apriori algorithm came.

The algorithm assumes that if an itemset is frequent, all subsets containing it are also frequent.

If the itemset {1,3} is frequent, then {1} or {3} is also frequent. That is, if an itemset is an infrequent itemset, all its supersets are infrequent itemsets.

In the following figure, the infrequent itemset represented by gray. If {2,3} is an infrequent itemset, its supersets {023}, {123}, {0123} are all infrequent itemsets.

### On supersets and subsets

If set A contains all the elements in set B, and set A may have elements that do not exist in set B, then set A is A superset of set B, on the contrary, set B is A superset of set A.

If there must be elements in set a that do not exist in set B, then a is the true superset of B, on the contrary, B is the true subset of A.

## Algorithm flow

- Give a piece of data or simulate a dataSet dataset
- Create C1 from the original dataset (an itemset with only one element)
- Scan the data through the scan function to find the frequent itemset L1 that meets the minimum support
- Combine each 1-itemset in L1 in pairs and repeat step 3 to find the frequent itemset L2
- Repeat steps 3 and 4 until the k-itemset cycle is completed

- C1 to Ck represent 1-itemset and k-itemset
- L1 to Lk represent frequent itemsets containing k data sets
- Scan method: scan the entire data set. Filter the data to remove the data that does not meet the minimum support

## Generate candidate item set

1. Analog data

def loadData(): """ Data required for simulation """ dataSet = [[1,3,4],[2,3,5],[1,2,3,5],[2,5]] return dataSet

2. Generate an Itemset Based on a single element

You must use frozenset as the key of the dictionary

def createC1(data): """ Function: generate the first candidate data set C1 Parameters: raw dataset dataset return: frozenset Form candidate set C1 """ C1 = [] for transaction in data: # Traverse each element of data, such as: [1,3,4] as an element print(transaction) for item in transaction: # Traverse each element in the data element, for example: 1,3,4,2,3,5 print(item) if not [item] in C1: # Adding a single element as a list to C1 [item] is to build a collection for each item C1.append([item]) print(C1) C1.sort() # [[1], [2], [3], [4], [5]] return list(map(frozenset, C1)) # Perform the function of the frozenset function on each element in C1

In the process of generating C1, if an element is already in it, it will not be added to the large list.

3. Generate candidate item set

def scanD(D, Ck, minSupport): """ Parameters: D: incoming data dataset Ck: Candidate set list Minimum support """ ssCnt = {} for tid in D: # Traverse raw data for can in Ck: # Traverse each data in the k itemset if can.issubset(tid): # Judge whether can is a subset of tid and return Boolean data if can not in ssCnt.keys(): # If it is not in the key of ssCnt, it is assigned as 1 ssCnt[can] = 1 else: # Otherwise, add 1 ssCnt[can] += 1 numItems = float(len(D)) # Length of data set retList = [] # Frequent itemsets supportData = {} # Support dictionary storing candidate item set Ck: key candidate value support for key in ssCnt: # Traverse the key support = ssCnt[key] / numItems # Get the support corresponding to the key supportData[key] = support # Support for storing candidate item sets if support >= minSupport: # If the judgment is greater than the minimum support, it will be added to the list retList.append(key) return retList, supportData # Frequent itemset candidate set

The confidence of {4} itemset is less than the threshold of 0.5 and is directly discarded.

## Generate frequent itemsets

Pseudo code of algorithm:

When the number of items in the collection is greater than 0: Build a k A list of candidate itemsets consisting of items Check the data to make sure that each itemset is frequent Keep frequent itemsets and build k+1 List of candidate itemsets consisting of items

def aprioriGen(Lk,k): """ Function: generate frequent itemsets Parameters: Lk: contain k Frequent itemsets of elements k: Number of elements in itemset return: Ck: Candidate itemset list """ Ck = [] # Frequent itemsets, such as {0, 1} {0, 2} and {1, 2} length = len(Lk) # The number of itemsets in the above example is 3 for i in range(length): for j in range(i+1, length): # If the first k-2 itemsets are the same, merge the two sets to obtain the set of elements of k L1 = list(Lk[i])[:k-2] L2 = list(Lk[j])[:k-2] L1.sort() # Sort operation L2.sort() if L1==L2: Ck.append(Lk[i] | Lk[j]) # merge return Ck # Candidate itemset list

Explanation of k-2 in the above code:

- If we use {0}, {1}, {2}, we can quickly build {0, 1} {0, 2} and {1, 2}
- When {0, 1}, {0, 2} and {1, 2} are used to build {0, 1, 2}, if we merge the set in pairs, we will get {0, 1, 2}, {0, 1, 2}, {0, 1, 2}, that is, the data is reread three times.
- When you scan the itemset list of three elements to get non duplicate results, you need to ensure that the list is traversed at least. If you compare the first element of the set {0, 1} {0, 2} and {1, 2} and merge only those with the same first element, you can get the set {0, 1, 2}

Let's look at the example of binomial set generating trinomial set. If there are multiple elements of binomial set: {0,1}, {0,2}, {0,3}, {1,2}, {1,3}, {2,3}.

The final synthesis is: {0,1,2}, {0,1,3}, {0,2,3}, {1,2,3}. Where {1,2,3} is first generated by {1,2}, {1,3}, and the repeating element is still in the first place instead of {1,3}, {2,3}

By analogy, if 3 itemsets generate 4 itemsets, the first 2 bits are the same; Generating n itemsets is the same as the previous 0 to k-2 elements

aprioriGen(L1, 2) # Call function [frozenset({1, 3}), frozenset({1, 2}), frozenset({1, 5}), frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5})]

## Main function

def apriori(D, minSupport=0.5): """ Function: main function Function: D: Original data minSupport: Degree of support Return value: L: Frequent itemsets supportData: Support data """ C1 = createC1(D) # C1 candidate set L1, support = scanD(D, C1, minSupport) # Scan the original data to find the data that meets minSupport L = [L1] k = 2 while (len(L[k-2]) > 0): Ck = aprioriGen(L[k-2], k) Lk, supK = scanD(D, Ck, minSupport) supportData.update(supK) L.append(Lk) k += 1 return L, supportData

Run the next main function

L, supportData = apriori(dataSet, minSupport=0.5)

Conclusion: when minsupport=0.5, we can see the data satisfying the conditions in the 1-item set

Effect of changing support:

apriori(dataSet, minSupport=0.7) ([[frozenset({3}), frozenset({2}), frozenset({5})], [frozenset({2, 5})], []], {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 3}): 0.5, frozenset({3, 5}): 0.5, frozenset({2, 5}): 0.75, frozenset({1, 2}): 0.25, frozenset({1, 5}): 0.25, frozenset({2, 3, 5}): 0.5})

## technological process

- Firstly, the 1-term set and the corresponding confidence are obtained from the original data through the scanning function; Select a frequent itemset that meets a confidence of 0.5 or more, that is, support>=2 (large arrow on the right corner)
- Then the 1-itemset is combined in pairs to obtain the 2-itemset. Scan the original data again through the scanning function, check the confidence of each element in the 2-item set, and find out the frequent item set with confidence greater than or equal to 0.5 (large arrow turning on the left)
- Combine the data in the 2-item set in pairs to obtain each element in the 3-item set. Scan the original data again to check the confidence of each element in the 3-item set. Finally, it is found that only {235} is satisfied

## Reference books

- Machine learning practice
- Association Analysis, Apriori algorithm and FP growth algorithm