Analysis of Taobao user behavior based on AARRR and RMF models

1. Analysis purpose

The concept of customer commercial value is based on the interests of customers to merchants. As the source of enterprise income, customers are the foundation of enterprise survival. From a strategic point of view, customers not only determine the income of the company, but also are the valuable wealth for the company to use its brand image to improve the market value of the company and establish a competitive advantage. For enterprises, customers are a source of profit with commercial value.
This article analyzes the customer data flow during Taobao Double Eleven, using the AARRR model to solve three problems: how does user activity change in different time dimensions; how is user retention (repurchase rate and funnel loss) and user value situation.

2. Data source

field description

This data set has a total of about 12 million pieces of data. The data is the user behavior data of Taobao APP from November 18, 2014 to December 18, 2014. There are a total of 6 columns of fields. The column fields are as follows:

Field Namefield meaning
user_idUser identity (desensitization)
item_idCommodity ID (desensitization)
behavior_typeUser behavior type (including four behaviors of clicking, collecting, adding to shopping cart, and paying, represented by numbers 1, 2, 3, and 4 respectively)
user_geohashgeographic location
item_categoryCategory ID (category to which the product belongs)
timetime of user action

3. Data processing

1. Import library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif']=['SimHei'] # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus']=False # Used to display negative signs normally
import seaborn as sns
import warnings

2. Data cleaning and preprocessing

data = pd.read_csv('tianchi_mobile_recommend_train_user.csv')
# remove duplicates
#View missing values
# Missing values ​​are geographic information, which have no effect on user behavior analysis and are not processed
data.drop('user_geohash',axis=1, inplace=True)
#restore index
data = data.reset_index(drop=True)
# Split the time column into a date column and an hour column
data['time'] = pd.to_datetime(data['time'])
data['date'] = data['time']
data['date'] = pd.to_datetime(data['date'])
data['hour'] = data['time'].dt.hour
#Convert item_id and item_category to str
data.item_id = data.item_id.astype(str)
data.item_category = data.item_category.astype(str)

4. User Behavior Analysis (AARRR Model)


First calculate the daily visits, daily unique visitors and per capita visits, and package them into a new df.

#daily visits
pv_daily_s = data.groupby(by='date')['user_id'].count()

#Daily Unique Visitors
uv_daily_s = data.groupby(by='date')['user_id'].nunique()

#Per capita visits (daily visits/daily unique visitors)
uv_pv_s = pv_daily_s / uv_daily_s

df = pd.concat((pv_daily_s,uv_daily_s,uv_pv_s),axis=1)
df.columns = ['pv','uv','pv/uv']

Then analyze the daily visits, daily unique visits, and per capita visits

plt.figure(figsize=(13,13), dpi=320)
ax1 = plt.subplot(311)
ax1.plot(df.index,df.pv, color="b",linestyle = "-")
ax2.plot(df.index,df.uv, color="r",linestyle = "-")
ax3.plot(df.index,df.pv/df.uv, color="k",linestyle = "-")

It is obtained as shown in the following figure:

According to the above figure, it can be concluded that before December 5th, the activity fluctuated at a certain level. After December 5, the activity began to increase significantly, and reached its peak on the day of Double Twelve. It is preliminarily inferred that the possible reason is that after December 5, the Double Twelve warm-up activity started, and user activity increased.

Changes in activity on the day of Double Twelve

Select the data on the day of Double 12, analyze its active time period, the number of visits per hour on the day of Double 12, the number of visitors per hour on the day of Double 12, and the per capita visits per hour on the day of Double 12 to form a new df

#Take out the data on the day of Double 12
data_1212 = data.loc[data['date']=='2014-12-12']
#Check the hourly visits on Double 12
pv_hour_s = data_1212.groupby(by='hour')['user_id'].count()
#View the number of visitors per hour on the day of Double 12
uv_hour_s = data_1212.groupby(by='hour')['user_id'].nunique()
#View the average visits per hour on the day of Double 12
pv_uv_s = pv_hour_s / uv_hour_s
df = pd.concat((pv_hour_s,uv_hour_s,pv_uv_s),axis=1)
df.columns = ['pv_hour','uv_hour','pv/uv']

Visualize the hourly visits, visitors and per capita visits on Double Twelve, and get the following results:

According to the above figure, it can be concluded that Taobao users are more active after 0:00 and 18:00 on the day of Double Twelve, and it falls to the lowest point at 6:00. It is suggested that merchants can set up coupons or take other promotional measures after 18:00 to attract more people to consume and increase the purchase rate.

Changes in the activity of different user behaviors

pv_df = data.pivot_table(index='date',columns='behavior_type',aggfunc='size',fill_value=0)

Plot the aggregates of the four behaviors in a coordinate system to view

plt.plot(pv_df.index,pv_df[3],label='add purchase')
plt.plot(pv_df.index,pv_df[4],label='to pay')

By observing the above figure, it can be concluded that the four behaviors of clicking, collecting, adding to the shopping cart, and paying all reached their peak on the day of Double Twelve. Only observing payment and collection, it was found that the amount of payment on the day of Double Twelve was greater than the amount of collection. It is preliminarily speculated that the possible reason is that the amount of payment is greater than the amount of collection, which means that many users have purchased products beyond the target, which may be impulsive consumption due to the influence of promotions, or to make up orders.

retention rate

Funnel Conversions

#View the visits of different behaviors and encapsulate them in df
s = data['behavior_type'].value_counts()
s.index = ['click','collect','Add to cart','to pay']
df = pd.DataFrame([s.index,s]).T
df.columns = ['user behavior','Views']

#Calculate the conversion rate from click to favorite, from favorite to additional purchase, from additional purchase to payment
temp1 = df['Views'][1:].values
temp2 = df['Views'][0:-1].values
p = temp1 / temp2 * 100
p = list(p)
df['single link conversion rate(%)'] = p
#Calculate overall conversion rate
df['overall conversion rate(%)'] = df['Views'] / df.iloc[0,1] * 100
#Calculate the loss rate (%) of each link
df['Churn rate for each link(%)'] = 100 - df['single link conversion rate(%)']

Use the pyecharts library to draw a funnel chart:

#Funnel conversion chart of overall conversion rate
from pyecharts.charts import Funnel
from pyecharts import options as opts
funnel = Funnel().add(
                series_name = 'overall conversion rate(%)',
                data_pair = [ list(z) for z in zip(df['user behavior'],df['overall conversion rate(%)']) ],
                is_selected = True,
                label_opts = opts.LabelOpts(position = 'inside')
funnel.set_series_opts(tooltip_opts = opts.TooltipOpts(formatter = '{a}<br/>{b}:{c}%'))
funnel.set_global_opts(title_opts = opts.TitleOpts(title = 'overall conversion rate(%)') )

Observing the above picture, it can be concluded that the loss of users mainly occurs in the click-collection link. After analyzing the loss of users in each link, it is preliminarily inferred that the possible reasons are as follows.
User churn from clicks to favorites. ①Users are attracted by the advertisements placed, and after entering, they find that they are seriously inconsistent with expectations, resulting in loss. The suggestion for this is to optimize the advertisements. ②Customers searched or recommended to the list page but failed to find a suitable product, resulting in churn. The suggestion for this is to update the search engine and related algorithms, and push relevant content as accurately as possible. ③If the evaluation of the product is too low, resulting in loss, the suggestion for this is that the merchant should conduct further investigations, analyze the reasons for the low evaluation of the product by users, make further improvements, enhance the user's shopping experience, and finally promote word-of-mouth marketing.
The loss of users from favorites to additional purchases. Such users have strong purchasing needs. It can accurately push promotional information to users to stimulate users to complete purchases.
User loss from additional purchase to payment. The loss of users in this link is often caused by the complexity of the purchase link. Therefore, merchants need to optimize the shopping process and support as many payment methods as possible, such as bank cards, WeChat payment, Alipay payment, Huabei, etc. Considering that the current shopping process to Taobao is difficult to simplify, merchants need to further investigate to understand the reasons why users give up paying, so as to facilitate adjustments.

Repurchase analysis

#Histogram of user purchases
buy_df = data.loc[data['behavior_type'] == 4]
user_buy_s = buy_df.groupby(by='user_id')['behavior_type'].count()

#Calculation of repurchase rate = number of users with purchase times greater than 1/total number of users with purchase behavior
reBuy_rate = user_buy_s[user_buy_s > 1].count() / user_buy_s.count() * 100

The calculated repurchase rate is 91.44722034661264. It can be concluded that the repurchase rate of users in the month from November 18, 2014 to December 18, 2014 was as high as 91.45%.

5. User value analysis (RMF model)

#Take out the data of purchased users separately
buy_df = data.loc[data['behavior_type'] == 4]

#Calculate R: R represents the interval of the customer's latest transaction time
#/np.timedelta64(1,'D') to output days
now_date = buy_df['date'].max()
R = buy_df.groupby(by='user_id')['date'].apply(lambda x:now_date - x.max()) / np.timedelta64(1,'D')
#Calculation F: consumption frequency per user
F = buy_df.groupby(by='user_id')['date'].count()

Since the transaction amount is missing, only the last transaction date and transaction frequency are considered.

rfm = pd.DataFrame(data=[R,F],index=['R','F']).T

#Divide each dimension into two degrees
recent_avg = rfm['R'].mean()
freq_avg = rfm['F'].mean()
#R should be as small as possible, then return 1 if R is less than the mean value, otherwise return 0
def rec_value(x):
    if x < recent_avg:
        return '1'
        return '0'
#F should be as big as possible, then return 1 if F is greater than the mean value, otherwise return 0   
def freq_value(x):
    if x > freq_avg:
        return '1'
        return '0'
rfm['R_value'] = rfm['R'].apply(rec_value)
rfm['F_value'] = rfm['F'].apply(freq_value)
#splicing R and F together
#Determine the user level based on the splicing of R and F
def rfm_value(x):
    if x == '10':#The purchase interval is short, but the purchase frequency is low
        return 'important development customers'
    elif x == '01':#The purchase interval is long, but the purchase frequency is high
        return 'important to keep customers'
    elif x == '00':
        return 'Important to retain customers'
        return 'important value customers'
rfm['user_type'] = rfm['rfm'].apply(rfm_value)
#Calculate the proportion of various users
user_type_count_s = rfm['user_type'].value_counts()
user_type_count_s / user_type_count_s.sum() * 100

In this regard, it can be concluded that important retained customers account for the largest proportion, and the consumption time interval of this type of users is relatively long, and the consumption frequency is low. It is necessary to take the initiative to contact the customer, investigate and find out where the problem is, and wake up the customer through SMS, email, APP push, etc., so as to reduce the loss as much as possible. Important development customers have low consumption frequency, and appropriate discounts or bundled sales can be given to increase the user's purchase frequency and increase the retention rate as much as possible. Important value customers are key users, but there are relatively few users. VIP services can be provided to such customers in a targeted manner; it is important to keep the customer's consumption time interval relatively long, but the consumption frequency is high. Such users may purchase many things at once. For such customers, it is necessary to take the initiative to contact them, pay attention to their shopping habits, do precise marketing, and meet the needs of such users in a timely manner.


The above is all the content of this article. This article uses pandas, numpy, matplotlib and pyecharts to implement the AARRR model and RMF model, and analyzes the user behavior during Taobao Double Twelve to draw general conclusions, which have certain commercial significance.

Tags: Python matplotlib pandas echarts

Posted by whoisKeel on Wed, 25 Jan 2023 11:31:39 +1030