Segment Your Customers Now!

Let’s take a look at RFM Segmentation via K-Means Clustering, which allows us to study users’ purchasing behavior

May 21, 2021

Introduction

Understanding customer behavior is a key element shaping the business strategy of retail and e-commerce firms today. The importance of ‘knowing thy customer’ cannot be over-emphasized as this will lead to retention, loyalty, and building stronger customer relationships. Customers can be understood based on different business metrics such as how often they spend, how much they spend, their favorite products, etc. which would, in turn, help the marketing, sales, and product teams to provide adequate support for customers and improve sales.

Why Segmentation?

Customers’ needs and profiles differ, so they shouldn’t all be treated alike; otherwise they may seek other options. It is therefore better to segment them into groups according to their similarities, so that we can understand the traits of each group and engage it with appropriate campaigns.

There are various segmentation methods. One of the most popular and efficient, and the one applied in this project, is RFM segmentation.

RFM Segmentation

RFM segmentation is often motivated by the Pareto Principle: roughly 80% of the results come from 20% of the causes. Applied to retail, around 20% of customers tend to contribute around 80% of total revenue. Simply put, people who have bought once are more likely to buy again, and people who make big-ticket purchases are more likely to repeat them.

RFM segmentation is an effective marketing technique for identifying groups of customers for specific treatment. Data on existing customers, such as browsing history, demographics, and purchase history, can be used to identify distinct groups of customers that can be targeted with offers relevant to each. Using RFM segmentation, you get to know:

Who are your best customers?
Which customers are on the verge of churning?
Who has the potential to be converted into a more profitable customer?
Which customers must be retained?
Which group of customers is most likely to respond to your current campaign?
Who are the lost customers that you don’t need to pay much attention to?

and many more…

Thus, RFM segmentation will help a business strengthen its customer relationships, sharpen its marketing strategies, and increase customer loyalty.

What does RFM stand for?

Recency (R) — how much time has elapsed since a customer’s last transaction/activity with the brand. It is proposed that the more recent the activity of a customer to a brand, the more likely the customer will be responsive to communications from the brand.
Frequency (F) — how often a customer has completed a transaction or purchased from the brand within a time period. Customers with frequent activities are more engaged and could be more loyal than customers who rarely engage.
Monetary (M) — how much a customer has spent with the brand within a time period. Those that spend more are to be treated differently from those that spend little.

This project aims to identify customer segments based on their overall purchase behavior using RFM segmentation, with K-means clustering serving as the engine that groups customers by their R, F, and M values.

Let’s get right into it!

The Data

The data used for this project is an e-commerce dataset for a UK-based retailer available at Kaggle through this link.

Data Preprocessing

Importing and Data Cleaning

The dataset contains eight columns. According to the Country column, the UK accounts for over 90% of the data, so I’ll be working with UK records only.

It turns out that the data contains null values in the Customer ID and Description columns. Since the segmentation is based on customers, I’ll drop only the rows with a null customer ID.

Rows with a negative unit price or quantity were also found and dropped to avoid ambiguity in our segmentation.
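The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code: the column names (InvoiceNo, UnitPrice, CustomerID, Country), the "United Kingdom" label, and the helper clean_transactions are all assumptions.

```python
import pandas as pd

# Tiny synthetic stand-in for the retailer data (values and column names are assumed).
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3", "A4", "A5"],
    "Description": ["mug", None, "lamp", "pen", "bag"],
    "Quantity": [2, 1, -1, 3, 5],
    "UnitPrice": [3.5, 2.0, 4.0, 1.0, 2.5],
    "CustomerID": [17850.0, None, 13047.0, 12583.0, 13047.0],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "France", "United Kingdom"],
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep UK rows that have a customer id
    and no negative price or quantity."""
    uk = df[df["Country"] == "United Kingdom"]
    # Only a missing customer id disqualifies a row; a missing Description is kept.
    uk = uk.dropna(subset=["CustomerID"])
    uk = uk[(uk["Quantity"] >= 0) & (uk["UnitPrice"] >= 0)]
    return uk.reset_index(drop=True)


cleaned = clean_transactions(raw)
print(cleaned)
```

On this toy input, only invoices A1 and A5 survive: A2 has no customer ID, A3 has a negative quantity, and A4 is not a UK record.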

Feature Engineering

In this step, we transform the data to the appropriate format and generate important columns for the clustering algorithm. The following is done in this case:

· Convert InvoiceDate column to the convenient DateTime format

· Create Revenue(Monetary) column by multiplying Quantity and Unit price
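A minimal sketch of these two transformations, assuming pandas and the column names InvoiceDate, Quantity, and UnitPrice (the sample rows and the ISO date strings are made up for illustration):

```python
import pandas as pd

# Made-up rows; the real file may store dates in a different string format.
tx = pd.DataFrame({
    "InvoiceDate": ["2011-12-01 08:26", "2011-12-09 12:50"],
    "Quantity": [6, 2],
    "UnitPrice": [2.55, 3.39],
})

# Parse the invoice date strings into proper datetime values.
tx["InvoiceDate"] = pd.to_datetime(tx["InvoiceDate"])

# Revenue per line item = quantity * unit price.
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

print(tx.dtypes)
print(tx)
```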

With this cleaned dataset, we are now ready to construct a new data frame for our RFM segmentation. The data frame’s key is CustomerID, with three columns as features: recency, frequency, and monetary.

Recency(R)

A new data frame is first constructed to hold all the RFM scores for each customer by grouping per CustomerID.

Another data frame is created holding the last purchase date for each CustomerID, and the latest invoice date in the dataset is used as the observation point to calculate the recency. Finally, the recency in days is generated by subtracting each customer’s Last Purchase Date from the Last Invoice Day (our observation point):

This data frame is then merged with the RFM score data frame on CustomerID to create the recency column.
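One way the recency calculation might look in pandas is sketched below. The toy transactions and the names last_purchase, LastPurchaseDate, and observation_point are illustrative assumptions, not taken from the original code.

```python
import pandas as pd

# Toy transactions (assumed column names).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-05", "2011-10-10", "2011-12-09"]
    ),
})

# One row per customer; this frame will collect the R, F, and M columns.
rfm = pd.DataFrame({"CustomerID": tx["CustomerID"].unique()})

# Last purchase date per customer.
last_purchase = (
    tx.groupby("CustomerID")["InvoiceDate"]
    .max()
    .reset_index(name="LastPurchaseDate")
)

# Observation point = latest invoice date in the whole dataset.
observation_point = last_purchase["LastPurchaseDate"].max()

# Recency in days = observation point minus the customer's last purchase.
last_purchase["Recency"] = (
    observation_point - last_purchase["LastPurchaseDate"]
).dt.days

rfm = rfm.merge(last_purchase[["CustomerID", "Recency"]], on="CustomerID")
print(rfm)
```

With this ordering of the subtraction, the most recent buyer gets a recency of 0 and every other customer a positive number of days.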

Frequency(F)

To generate the frequency, a simple count of the invoices grouped per CustomerID would suffice since the frequency metric reflects the number of orders per customer:
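A hedged sketch of the frequency count follows. It assumes an InvoiceNo column and counts distinct invoices per customer, since one order can span several line items; whether the original code counted distinct invoices or rows is not stated, so treat this choice as an assumption.

```python
import pandas as pd

# Customer 1 has two invoices (A1 spans two line items); customer 2 has one.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

# Number of distinct invoices (orders) per customer.
frequency = (
    tx.groupby("CustomerID")["InvoiceNo"]
    .nunique()
    .reset_index(name="Frequency")
)
print(frequency)
```

Using count() instead of nunique() would give customer 1 a frequency of 3, because it counts line items rather than orders.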

Monetary(M)

For Revenue, the revenue column is grouped by CustomerID and summed up.
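In pandas this could be a single group-and-sum, sketched here on made-up values (the output column name Monetary is an assumption):

```python
import pandas as pd

# Made-up line-item revenues.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Revenue": [15.0, 5.0, 7.5],
})

# Total revenue per customer.
monetary = (
    tx.groupby("CustomerID")["Revenue"]
    .sum()
    .reset_index(name="Monetary")
)
print(monetary)
```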

The final RFM data frame with its statistics is as shown:

A scatter plot of each pair of variables visualizes the relationships between R, F, and M:

There’s clearly some distinction between high-value segments and low-value ones, but the difficulty in inspecting such scatter plots visually is finding the ideal boundaries and also identifying the middle segment.

Plotting the histogram distribution of the variables:

The distributions of the RFM values, as shown above, are right-skewed, which could bias a distance-based model such as K-means.
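Besides reading the histograms, skewness can be checked numerically. The sketch below uses the pandas skew method on made-up RFM values; a positive value indicates a right-skewed column.

```python
import pandas as pd

# Made-up RFM values with one extreme customer in each column.
rfm = pd.DataFrame({
    "Recency": [1, 3, 5, 8, 300],
    "Frequency": [1, 1, 2, 3, 40],
    "Monetary": [20.0, 35.0, 50.0, 90.0, 4000.0],
})

# Sample skewness per column; values above 0 mean a long right tail.
skewness = rfm.skew()
print(skewness)
```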

K-Means Clustering

K-Means clustering is an unsupervised learning algorithm that groups unlabeled data points into clusters based on the distances between them. The histograms above show that the three variables span very different ranges. Because K-means uses distance as its measure of similarity, fitting the model without normalization would let the variable with the largest range dominate the result. Scaling and normalizing the data is therefore a critical preprocessing step.
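A minimal sketch of such a preprocessing step is shown below. It assumes a log transform (np.log1p) to reduce the skew, followed by scikit-learn's StandardScaler; the article does not say which transform or scaler was actually used, so both are illustrative choices, as are the sample values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM table.
rfm = pd.DataFrame({
    "Recency": [0, 4, 60, 250, 12],
    "Frequency": [30, 12, 2, 1, 5],
    "Monetary": [5200.0, 1800.0, 150.0, 40.0, 600.0],
})

# log1p(x) = log(1 + x) compresses the long right tail and tolerates zeros
# (a recency of 0 days is valid).
rfm_log = np.log1p(rfm)

# Standardize each column to zero mean and unit variance so that no single
# variable dominates the distance computation.
scaler = StandardScaler()
rfm_scaled = pd.DataFrame(scaler.fit_transform(rfm_log), columns=rfm.columns)
print(rfm_scaled.round(2))
```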

Elbow method

The Elbow method is a common way to choose the number of clusters, K. We specify a range of K values, fit the model for each, and plot the sum of squared errors (SSE), also known as inertia, against K. We then choose the K value at which the slope of the SSE curve changes most sharply, that is, the point after which adding more clusters yields only small reductions in SSE.

For this project, the K-means algorithm was fit nine times with nine different K values (2, 3, 4, …, 10), and the inertia/SSE and silhouette score were obtained for each K value.
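The loop over K might look roughly like this. The sketch runs on synthetic, well-separated blobs rather than the real scaled RFM table, and the n_init and random_state settings are assumptions chosen for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix: four blobs in 3-D.
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0], [5, -5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

inertias = {}
silhouettes = {}
for k in range(2, 11):  # K = 2, 3, ..., 10 (nine fits)
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                      # SSE for this K
    silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
```

Plotting inertias against K gives the elbow curve, and the silhouettes dictionary supplies the second opinion used to pick between candidate K values.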

Evaluation Metrics

To evaluate the performance of the K-Means clustering algorithm, the following metrics are used:

Inertia value: the sum of squared distances of all the points within a cluster from the centroid of that cluster, added up over all clusters. That is, it gives the sum of squared intra-cluster distances. The rule of thumb is that the smaller the number, the tighter the clusters, although inertia always decreases as K grows, which is why we look for an elbow rather than the minimum.
Silhouette Score: measures the degree of separation between clusters and ranges between -1 and 1. Values closer to 1 indicate better cluster separation, while values near 0 indicate overlapping clusters. Negative values indicate that samples may have been assigned to the wrong cluster.

The centroid is the arithmetic mean of all the data points that belong to a particular cluster.
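To make the inertia and centroid definitions concrete, the sketch below recomputes both by hand on four made-up points and compares the result with the inertia_ attribute reported by scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: one around x = 0 and one around x = 10.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Centroid = arithmetic mean of the points assigned to each cluster.
centroids = np.array(
    [X[model.labels_ == c].mean(axis=0) for c in range(2)]
)

# Inertia = sum over clusters of squared distances to the cluster centroid.
manual_inertia = sum(
    ((X[model.labels_ == c] - centroids[c]) ** 2).sum() for c in range(2)
)

print(manual_inertia, model.inertia_)
```

Each point sits exactly one unit from its centroid, so both numbers come out as 4.0.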

Let us visualize the Elbow Method!

Looking at the plot above, the SSE decline slope changes significantly between K values of 3 and 5. To adequately choose the optimal K, let’s analyze the silhouette score.

Silhouette scores for K values 2–10

A silhouette score close to 1 indicates better cluster separation. In our case, K = 5 gives the silhouette score closest to 1; however, the gap between this value and the next (i.e. K = 6) is large. This could mean that the model overfitted at K = 5. Therefore, I will choose K = 4 as the optimal value for this model. Now, let’s build the model.

Fit K-means using the optimal K

Now that we have settled on K = 4, we fit the algorithm once again using this value.
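A sketch of this final fit, under the same assumptions as before (made-up RFM values, a log1p transform plus StandardScaler for preprocessing, and an arbitrary random_state), could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up RFM table for eight customers.
rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Recency": [1, 3, 20, 25, 90, 100, 260, 300],
    "Frequency": [40, 35, 12, 10, 3, 4, 1, 1],
    "Monetary": [6000.0, 5200.0, 1500.0, 1200.0, 300.0, 350.0, 40.0, 25.0],
})

features = ["Recency", "Frequency", "Monetary"]
X = StandardScaler().fit_transform(np.log1p(rfm[features]))

# Fit once more with the chosen K and attach the label to each customer.
model = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["Cluster"] = model.fit_predict(X)
print(rfm[["CustomerID", "Cluster"]])
```

Note that the cluster numbers are arbitrary labels; which number ends up meaning "best customers" depends on the run and has to be read off the cluster profiles.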

Visualizing the silhouette plot:
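A silhouette plot is drawn from per-sample silhouette values grouped by cluster. The sketch below computes those values with scikit-learn's silhouette_samples on synthetic blobs (not the real data) and leaves the actual plotting out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled RFM matrix.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0], [5, -5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per sample; the overall score is their mean.
per_sample = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)

# Each cluster's sorted values form one band of the silhouette plot.
for c in np.unique(labels):
    band = np.sort(per_sample[labels == c])
    print(f"cluster {c}: n={band.size}, mean silhouette={band.mean():.3f}")
print(f"overall: {overall:.3f}")
```

Samples with negative values in a band are the ones that sit closer to a neighbouring cluster than to their own.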

Results Analysis

The generated clusters are then analyzed based on their R-F-M centroid values and the number of users in each cluster. The centroid value (which is the mean) is considered to be representative of each of the clusters and can be used in interpreting the overall behavior of users that belong to a specific cluster based on R-F-M.
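One way to build such a per-cluster summary is a grouped aggregation in pandas. In the sketch below the RFM values and cluster labels are hard-coded, made-up numbers, and the means are taken in the original units so that they read directly as days, orders, and revenue.

```python
import pandas as pd

# Made-up RFM table with cluster labels already assigned.
rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Recency": [2, 4, 30, 40, 250, 270],
    "Frequency": [40, 30, 10, 8, 1, 1],
    "Monetary": [6000.0, 5000.0, 1200.0, 1000.0, 50.0, 30.0],
    "Cluster": [1, 1, 2, 2, 0, 0],
})

# Mean R, F, M per cluster plus the number of customers in each.
profile = rfm.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
    Customers=("CustomerID", "count"),
)
print(profile)
```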

The analysis segmented our customers into 4 clusters:

1. Cluster 1: Classified as our best customers. Users in this category spend the most in the store, have a relatively high frequency, and have recently visited the store. They are potential targets for new product launches. Actions such as reward points, a free membership card with benefits, etc. should be considered in order to retain this set of customers.

2. Cluster 2: Classified as regular/loyal customers. Users in this category visit the store frequently and spend well. Their loyalty shouldn’t be taken for granted, or there is a risk of churn. Customer relationship management should be emphasized to enhance the shopping experience and thereby strengthen the loyalty of these customers. Incentives such as membership cards and loyalty cards can also be introduced to this segment.

3. Cluster 3: Classified as customers at risk of leaving. Users in this category do not buy frequently, haven’t visited in a while, and contribute little to the revenue of the store. Unfortunately, this segment holds the largest share of our customers, which means action must be taken to move them out of it and turn them into loyal customers. The reasons for leaving must be identified and a customized marketing plan to encourage repeat purchases must be rolled out, as this segment could be the turning point of the business.

4. Cluster 0: Classified as lost customers. Users in this category have not visited the store in over 8 months, have low frequency, and have very low purchasing power. Bringing these customers back would be hard but not impossible. First, the business should survey this segment to find out why it churned and make corrections where necessary. It could then reach out to this segment, reintroducing its core business and promoting its high-value, in-demand products.

Conclusion

By using RFM analysis via K-Means clustering, we were able to split the customers of the e-commerce store into four segments. In naming these clusters, I have placed more weight on the monetary value because the aim of every business is to make money, and the more money a segment brings in, the more preferential treatment it gets in order to retain it. Nevertheless, every segment, irrespective of its monetary value, should be treated on its own terms, and promotions to move a low-value segment to a high-value one should be strategically designed.

Furthermore, the analysis should be driven by the business goals, so one must have solid domain knowledge of the business in question to adequately understand and meet the needs of each customer segment.

Thank you for your time and attention! I hope this was an informative and interesting project. For more detail about this data, the code, and more visualization, you can check my GitHub account and should you have any questions or any kind of feedback, feel free to drop them in the comment section or contact me on LinkedIn.