Market Basket Analysis using Association Rule Mining
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.
In the retail and restaurant businesses, market basket analysis (MBA) is a set of statistical affinity calculations that help managers better understand, and ultimately serve, their customers by highlighting purchasing patterns. In simplest terms, MBA shows which combinations of products most frequently occur together in orders. These relationships can be used to increase profitability through cross-selling, recommendations, promotions, or even the placement of items on a menu or in a store. In this blog post, we’ll explain how market basket analysis works and what it takes to deploy a market basket analysis project.
The approach is based on the theory that customers who buy a certain item (or group of items) are more likely to buy another specific item (or group of items). For example: while at a quick-serve restaurant (QSR), if someone buys a sandwich and cookies, they are more likely to buy a drink than someone who did not buy a sandwich. This association becomes even more valuable if it turns out to be stronger than the association between the sandwich and the drink alone, without the cookies.
An example of Association Rules
- Assume there are 100 customers
- 10 of them bought milk, 8 bought butter and 6 bought both of them.
- Rule: bought butter => bought milk (if a customer buys butter, how likely are they to also buy milk?)
- support = P(Milk & Butter) = 6/100 = 0.06
- confidence = support/P(Butter) = 0.06/0.08 = 0.75
- lift = confidence/P(Milk) = 0.75/0.10 = 7.5
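The same arithmetic can be written out in a few lines of Python. This is only a sketch of the toy numbers above (the variable names are ours), not part of the grocery analysis that follows:

total = 100                              # customers
milk, butter, both = 10, 8, 6            # bought milk, bought butter, bought both
support = both / total                   # P(Milk & Butter) = 0.06
confidence = support / (butter / total)  # confidence(butter => milk) = 0.06 / 0.08 = 0.75
lift = confidence / (milk / total)       # 0.75 / 0.10 = 7.5
print(support, confidence, lift)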
Let’s get to the code.
Download the famous grocery dataset, available here: https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/groceries.csv.
Load the packages
import sys
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
grocery = pd.read_csv('groceries.csv', names = ['Products'], sep = ',')
grocery.head(5)
grocery.tail(5)
grocery.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9835 entries, ('citrus fruit', 'semi-finished bread', 'margarine') to ('chicken', 'tropical fruit', 'other vegetables')
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Products  4734 non-null   object
dtypes: object(1)
memory usage: 153.1+ KB
grocery.shape
(9835, 1)
# TransactionEncoder expects one list of items per basket, so re-read the file as a list of transactions
with open('groceries.csv') as f:
    transactions = [line.strip().split(',') for line in f if line.strip()]
te = TransactionEncoder()
te_data = te.fit(transactions).transform(transactions)
gdf = pd.DataFrame(te_data, columns = te.columns_)
# Convert the boolean encoding to 0/1
gdf = gdf.replace(False, 0).replace(True, 1)
gdf
    BISCUIT  BOURNVITA  BREAD  COCK  COFFEE  CORNFLAKES  JAM  MAGGI  MILK  SUGER  TEA
0       1.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   1.0    0.0  0.0
1       1.0        0.0    1.0   0.0     0.0         1.0  0.0    0.0   1.0    0.0  0.0
2       0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    0.0  1.0
3       0.0        0.0    1.0   0.0     0.0         0.0  1.0    1.0   1.0    0.0  0.0
4       1.0        0.0    0.0   0.0     0.0         0.0  0.0    1.0   0.0    0.0  1.0
5       0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    0.0  1.0
6       0.0        0.0    0.0   0.0     0.0         1.0  0.0    1.0   0.0    0.0  1.0
7       1.0        0.0    1.0   0.0     0.0         0.0  0.0    1.0   0.0    0.0  1.0
8       0.0        0.0    1.0   0.0     0.0         0.0  1.0    1.0   0.0    0.0  1.0
9       0.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   1.0    0.0  0.0
10      1.0        0.0    0.0   1.0     1.0         1.0  0.0    0.0   0.0    0.0  0.0
11      1.0        0.0    0.0   1.0     1.0         1.0  0.0    0.0   0.0    0.0  0.0
12      0.0        1.0    0.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
13      0.0        0.0    1.0   1.0     1.0         0.0  0.0    0.0   0.0    0.0  0.0
14      1.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    1.0  0.0
15      0.0        0.0    0.0   0.0     1.0         1.0  0.0    0.0   0.0    1.0  0.0
16      0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    1.0  0.0
17      0.0        0.0    1.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
18      0.0        0.0    1.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
19      0.0        0.0    0.0   0.0     1.0         1.0  0.0    0.0   1.0    0.0  1.0
gdf.sum().to_frame('Frequency').sort_values('Frequency', ascending=False).plot(kind='bar',
                                                                               figsize=(12,8),
                                                                               title='Frequent Items')
plt.show()
gdf1 = apriori(gdf, min_support = 0.2, use_colnames = True)
gdf1
    support               itemsets
0      0.35              (BISCUIT)
1      0.20            (BOURNVITA)
2      0.65                (BREAD)
3      0.40               (COFFEE)
4      0.30           (CORNFLAKES)
5      0.25                (MAGGI)
6      0.25                 (MILK)
7      0.30                (SUGER)
8      0.35                  (TEA)
9      0.20       (BISCUIT, BREAD)
10     0.20          (MILK, BREAD)
11     0.20         (BREAD, SUGER)
12     0.20           (BREAD, TEA)
13     0.20  (COFFEE, CORNFLAKES)
14     0.20       (COFFEE, SUGER)
15     0.20          (MAGGI, TEA)
gdf1.sort_values(by = "support", ascending = False)
    support               itemsets
2      0.65                (BREAD)
3      0.40               (COFFEE)
0      0.35              (BISCUIT)
8      0.35                  (TEA)
4      0.30           (CORNFLAKES)
7      0.30                (SUGER)
5      0.25                (MAGGI)
6      0.25                 (MILK)
1      0.20            (BOURNVITA)
9      0.20       (BISCUIT, BREAD)
10     0.20          (MILK, BREAD)
11     0.20         (BREAD, SUGER)
12     0.20           (BREAD, TEA)
13     0.20  (COFFEE, CORNFLAKES)
14     0.20       (COFFEE, SUGER)
15     0.20          (MAGGI, TEA)
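The frequent itemsets can be explored a little further before mining rules. A minimal sketch (the length column is one we add here for illustration; gdf1 is the DataFrame returned by apriori above):

gdf1['length'] = gdf1['itemsets'].apply(len)   # number of items in each frequent itemset
pairs = gdf1[gdf1['length'] == 2]              # keep only the two-item combinations
pairs.sort_values('support', ascending=False)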
gdf_rules = association_rules(gdf1, metric = 'confidence', min_threshold = 0.6)
gdf_rules
    antecedents   consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
0        (MILK)       (BREAD)                0.25                0.65      0.2    0.800000  1.230769    0.0375        1.75
1       (SUGER)       (BREAD)                0.30                0.65      0.2    0.666667  1.025641    0.0050        1.05
2  (CORNFLAKES)      (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
3       (SUGER)      (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
4       (MAGGI)         (TEA)                0.25                0.35      0.2    0.800000  2.285714    0.1125        3.25
gdf_rules.sort_values(by = "lift", ascending = False)
    antecedents   consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
4       (MAGGI)         (TEA)                0.25                0.35      0.2    0.800000  2.285714    0.1125        3.25
2  (CORNFLAKES)      (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
3       (SUGER)      (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
0        (MILK)       (BREAD)                0.25                0.65      0.2    0.800000  1.230769    0.0375        1.75
1       (SUGER)       (BREAD)                0.30                0.65      0.2    0.666667  1.025641    0.0050        1.05
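In practice you would keep only the rules that are strong enough to act on. Here is a minimal sketch (the lift > 1.5 and confidence >= 0.7 cut-offs are illustrative choices, not part of the analysis above):

strong_rules = gdf_rules[(gdf_rules['lift'] > 1.5) & (gdf_rules['confidence'] >= 0.7)]
for _, rule in strong_rules.iterrows():
    # e.g. "Customers who buy {'MAGGI'} also tend to buy {'TEA'}"
    print(f"Customers who buy {set(rule['antecedents'])} also tend to buy {set(rule['consequents'])}")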
- I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com