shweta mishra
3 min read · Mar 28, 2021

Market Basket Analysis using Association Rule-Mining

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. The goal is to identify strong rules, that is, combinations of items that co-occur more often than chance would suggest, using measures of interestingness such as support, confidence, and lift.

In the retail and restaurant businesses, market basket analysis (MBA) is a set of statistical affinity calculations that help managers better understand, and ultimately serve, their customers by highlighting purchasing patterns. In simplest terms, MBA shows which combinations of products most frequently occur together in orders. These relationships can be used to increase profitability through cross-selling, recommendations, promotions, or even the placement of items on a menu or in a store. In this post, we'll walk through how market basket analysis works and what it takes to run a market basket analysis project.

The approach is based on the theory that customers who buy a certain item (or group of items) are more likely to buy another specific item (or group of items). For example: while at a quick-serve restaurant (QSR), if someone buys a sandwich and cookies, they are more likely to buy a drink than someone who did not buy a sandwich. This correlation becomes more valuable if it is shown to be stronger than that between the sandwich and drink without the cookies.

An example of Association Rules

  • Assume there are 100 customers
  • 10 of them bought milk, 8 bought butter and 6 bought both of them.
  • bought milk => bought butter
  • support = P(Milk & Butter) = 6/100 = 0.06
  • confidence = support/P(Milk) = 0.06/0.10 = 0.6
  • lift = confidence/P(Butter) = 0.6/0.08 = 7.5
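These three measures fall straight out of the counts, and it takes only a few lines of Python to check them:

```python
# Counts from the example above
n, milk, butter, both = 100, 10, 8, 6

support = both / n                 # P(Milk & Butter) = 0.06
confidence = support / (milk / n)  # P(Butter | Milk) = 0.6
lift = confidence / (butter / n)   # confidence / P(Butter) = 7.5

print(support, confidence, lift)
```

A lift well above 1, as here, means milk buyers pick up butter far more often than the overall butter rate alone would predict.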

Let's get to the code.

Download the famous grocery dataset, available here: https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/groceries.csv.

Load the packages

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

grocery = pd.read_csv('groceries.csv', names = ['Products'], sep = ',')

grocery.head(5)

grocery.tail(5)

grocery.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9835 entries, ('citrus fruit', 'semi-finished bread', 'margarine') to ('chicken', 'tropical fruit', 'other vegetables')
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Products 4734 non-null object
dtypes: object(1)
memory usage: 153.1+ KB

grocery.shape

(9835, 1)
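Before encoding the real data, it helps to see what TransactionEncoder produces: one 0/1 column per distinct item, one row per basket. The same effect can be sketched in plain pandas on a toy basket list (the items below are made up for illustration):

```python
import pandas as pd

# Each inner list is one customer's basket
transactions = [['milk', 'bread'],
                ['bread', 'butter'],
                ['milk', 'butter', 'bread']]

# One row per transaction, one 0/1 column per distinct item
onehot = (pd.DataFrame([{item: 1 for item in basket} for basket in transactions])
            .fillna(0).astype(int)
            .sort_index(axis = 1))
print(onehot)
#    bread  butter  milk
# 0      1       0     1
# 1      1       1     0
# 2      1       1     1
```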

# TransactionEncoder expects a list of transactions (one list of items per basket),
# so build it from the raw file rather than the DataFrame above
with open('groceries.csv') as f:
    transactions = [line.strip().split(',') for line in f]

te = TransactionEncoder()
te_data = te.fit(transactions).transform(transactions)
gdf = pd.DataFrame(te_data, columns = te.columns_)
gdf = gdf.replace(False, 0)
gdf

    BISCUIT  BOURNVITA  BREAD  COCK  COFFEE  CORNFLAKES  JAM  MAGGI  MILK  SUGER  TEA
0       1.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   1.0    0.0  0.0
1       1.0        0.0    1.0   0.0     0.0         1.0  0.0    0.0   1.0    0.0  0.0
2       0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    0.0  1.0
3       0.0        0.0    1.0   0.0     0.0         0.0  1.0    1.0   1.0    0.0  0.0
4       1.0        0.0    0.0   0.0     0.0         0.0  0.0    1.0   0.0    0.0  1.0
5       0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    0.0  1.0
6       0.0        0.0    0.0   0.0     0.0         1.0  0.0    1.0   0.0    0.0  1.0
7       1.0        0.0    1.0   0.0     0.0         0.0  0.0    1.0   0.0    0.0  1.0
8       0.0        0.0    1.0   0.0     0.0         0.0  1.0    1.0   0.0    0.0  1.0
9       0.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   1.0    0.0  0.0
10      1.0        0.0    0.0   1.0     1.0         1.0  0.0    0.0   0.0    0.0  0.0
11      1.0        0.0    0.0   1.0     1.0         1.0  0.0    0.0   0.0    0.0  0.0
12      0.0        1.0    0.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
13      0.0        0.0    1.0   1.0     1.0         0.0  0.0    0.0   0.0    0.0  0.0
14      1.0        0.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    1.0  0.0
15      0.0        0.0    0.0   0.0     1.0         1.0  0.0    0.0   0.0    1.0  0.0
16      0.0        1.0    1.0   0.0     0.0         0.0  0.0    0.0   0.0    1.0  0.0
17      0.0        0.0    1.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
18      0.0        0.0    1.0   0.0     1.0         0.0  0.0    0.0   0.0    1.0  0.0
19      0.0        0.0    0.0   0.0     1.0         1.0  0.0    0.0   1.0    0.0  1.0

gdf = gdf.replace(True, 1)
gdf

The output is the same 0/1 matrix as above, with any remaining True values now shown as 1.

gdf.sum().to_frame('Frequency').sort_values('Frequency', ascending = False).plot(kind = 'bar',
                                                                                 figsize = (12, 8),
                                                                                 title = 'Frequent Items')
plt.show()

gdf1 = apriori(gdf, min_support = 0.2, use_colnames = True)
gdf1

    support              itemsets
0      0.35             (BISCUIT)
1      0.20           (BOURNVITA)
2      0.65               (BREAD)
3      0.40              (COFFEE)
4      0.30          (CORNFLAKES)
5      0.25               (MAGGI)
6      0.25                (MILK)
7      0.30               (SUGER)
8      0.35                 (TEA)
9      0.20      (BISCUIT, BREAD)
10     0.20         (MILK, BREAD)
11     0.20        (BREAD, SUGER)
12     0.20          (BREAD, TEA)
13     0.20  (COFFEE, CORNFLAKES)
14     0.20       (COFFEE, SUGER)
15     0.20          (MAGGI, TEA)
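Each support value above is simply the fraction of baskets containing every item in the itemset, which is easy to verify by hand on any 0/1 matrix (the small frame below uses made-up values for illustration):

```python
import pandas as pd

basket = pd.DataFrame({'COFFEE':     [1, 1, 0, 1, 0],
                       'CORNFLAKES': [1, 1, 1, 0, 0]})

# Support of {COFFEE, CORNFLAKES}: share of rows where both columns are 1
support = basket[['COFFEE', 'CORNFLAKES']].all(axis = 1).mean()
print(support)  # 0.4
```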

gdf1.sort_values(by = 'support', ascending = False)

    support              itemsets
2      0.65               (BREAD)
3      0.40              (COFFEE)
0      0.35             (BISCUIT)
8      0.35                 (TEA)
4      0.30          (CORNFLAKES)
7      0.30               (SUGER)
5      0.25               (MAGGI)
6      0.25                (MILK)
1      0.20           (BOURNVITA)
9      0.20      (BISCUIT, BREAD)
10     0.20         (MILK, BREAD)
11     0.20        (BREAD, SUGER)
12     0.20          (BREAD, TEA)
13     0.20  (COFFEE, CORNFLAKES)
14     0.20       (COFFEE, SUGER)
15     0.20          (MAGGI, TEA)

gdf_rules = association_rules(gdf1, metric = 'confidence', min_threshold = 0.6)
gdf_rules

    antecedents  consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
0        (MILK)      (BREAD)                0.25                0.65      0.2    0.800000  1.230769    0.0375        1.75
1       (SUGER)      (BREAD)                0.30                0.65      0.2    0.666667  1.025641    0.0050        1.05
2  (CORNFLAKES)     (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
3       (SUGER)     (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
4       (MAGGI)        (TEA)                0.25                0.35      0.2    0.800000  2.285714    0.1125        3.25
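Every metric in this table derives from just three supports: the antecedent support, the consequent support, and the joint support. Taking the MAGGI => TEA row as a check:

```python
# Supports read off the rules table above
ant, cons, both = 0.25, 0.35, 0.20   # P(MAGGI), P(TEA), P(MAGGI & TEA)

confidence = both / ant                      # 0.80
lift       = confidence / cons               # 2.2857...
leverage   = both - ant * cons               # 0.1125
conviction = (1 - cons) / (1 - confidence)   # 3.25

print(confidence, lift, leverage, conviction)
```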

gdf_rules.sort_values(by = 'lift', ascending = False)

    antecedents  consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
4       (MAGGI)        (TEA)                0.25                0.35      0.2    0.800000  2.285714    0.1125        3.25
2  (CORNFLAKES)     (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
3       (SUGER)     (COFFEE)                0.30                0.40      0.2    0.666667  1.666667    0.0800        1.80
0        (MILK)      (BREAD)                0.25                0.65      0.2    0.800000  1.230769    0.0375        1.75
1       (SUGER)      (BREAD)                0.30                0.65      0.2    0.666667  1.025641    0.0050        1.05
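Since the rules land in an ordinary DataFrame, standard pandas filtering picks out the actionable ones. As an illustration, the thresholds below (lift above 1.5, confidence of at least 0.7) are arbitrary choices, and the small frame mirrors the sorted output above:

```python
import pandas as pd

# Rules copied from the table above
rules = pd.DataFrame({
    'antecedents': ['(MILK)', '(SUGER)', '(CORNFLAKES)', '(SUGER)', '(MAGGI)'],
    'consequents': ['(BREAD)', '(BREAD)', '(COFFEE)', '(COFFEE)', '(TEA)'],
    'confidence':  [0.800000, 0.666667, 0.666667, 0.666667, 0.800000],
    'lift':        [1.230769, 1.025641, 1.666667, 1.666667, 2.285714],
})

strong = rules[(rules['lift'] > 1.5) & (rules['confidence'] >= 0.7)]
print(strong)  # only the (MAGGI) => (TEA) rule passes both cuts
```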

  • I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a coding internship experience. Thank you, www.suvenconsultants.com.