1. Analysis purpose
The commercial value of a customer lies in what that customer is worth to the merchant. As the source of enterprise revenue, customers are the foundation of a business's survival. Strategically, customers not only determine a company's income; they are also a valuable asset the company can leverage, through its brand image, to raise its market value and build a competitive advantage. For an enterprise, customers are a source of profit with commercial value.
This article analyzes Taobao user behavior data around Double Twelve, using the AARRR model to answer three questions: how user activity changes across different time dimensions; how users are retained (repurchase rate and funnel drop-off); and how users break down by value.
2. Data source
Field description
This dataset contains roughly 12 million records of Taobao APP user behavior from November 18, 2014 to December 18, 2014, with six columns, described below:
Field name | Field meaning |
---|---|
user_id | User identity (anonymized) |
item_id | Item ID (anonymized) |
behavior_type | User behavior type (click, collect, add to cart, pay, coded as 1, 2, 3, 4 respectively) |
user_geohash | Geographic location |
item_category | Category ID (the category the item belongs to) |
time | Time of the user action |
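For quick inspection it can help to map the numeric behavior codes to readable labels. A minimal sketch, assuming the same CSV file loaded in the cleaning step below (the `behavior_label` column is added here only for illustration):

```python
import pandas as pd

# Mapping of behavior_type codes to labels, per the field table above
behavior_labels = {1: 'click', 2: 'collect', 3: 'add to cart', 4: 'pay'}

# Peek at a few rows with readable labels (illustrative only)
sample = pd.read_csv('tianchi_mobile_recommend_train_user.csv', nrows=5)
sample['behavior_label'] = sample['behavior_type'].map(behavior_labels)
print(sample[['user_id', 'behavior_type', 'behavior_label']])
```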
3. Data processing
1. Import libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
```
2. Data cleaning and preprocessing
```python
data = pd.read_csv('tianchi_mobile_recommend_train_user.csv')

# Remove duplicates
data.drop_duplicates(inplace=True)

# Check missing values
data.isnull().sum()

# The missing values are all geographic information, which has no bearing on
# this behavior analysis, so the column is dropped rather than imputed
data.drop('user_geohash', axis=1, inplace=True)

# Reset the index
data = data.reset_index(drop=True)

# Split the time column into a date column and an hour column
data['time'] = pd.to_datetime(data['time'])
data['date'] = pd.to_datetime(data['time'].dt.date)
data['hour'] = data['time'].dt.hour
data.head()

# Convert item_id and item_category to str
data.item_id = data.item_id.astype(str)
data.item_category = data.item_category.astype(str)
```
4. User Behavior Analysis (AARRR Model)
Activity
First, calculate daily page views (pv), daily unique visitors (uv), and visits per visitor, and package them into a new DataFrame.
```python
# Daily page views
pv_daily_s = data.groupby(by='date')['user_id'].count()
pv_daily_s.head()

# Daily unique visitors
uv_daily_s = data.groupby(by='date')['user_id'].nunique()
uv_daily_s.head()

# Visits per visitor (daily page views / daily unique visitors)
uv_pv_s = pv_daily_s / uv_daily_s
uv_pv_s.head()

df = pd.concat((pv_daily_s, uv_daily_s, uv_pv_s), axis=1)
df.columns = ['pv', 'uv', 'pv/uv']
df.head()
```
Then plot the daily page views, daily unique visitors, and visits per visitor:
```python
plt.figure(figsize=(13, 13), dpi=320)
ax1 = plt.subplot(311)
ax1.plot(df.index, df.pv, color='b', linestyle='-')
ax1.set_title('pv')
ax2 = plt.subplot(312)
ax2.plot(df.index, df.uv, color='r', linestyle='-')
ax2.set_title('uv')
ax3 = plt.subplot(313)
ax3.plot(df.index, df.pv / df.uv, color='k', linestyle='-')
ax3.set_title('pv/uv')
```
The result is shown in the figure below:
From the figure, activity fluctuated around a steady level before December 5. After December 5 it began to rise noticeably, peaking on the day of Double Twelve. A plausible explanation is that the Double Twelve warm-up campaign started after December 5, lifting user activity.
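To sanity-check this inference, one can compare the average daily page views before and after the assumed warm-up start. A rough sketch using the `pv_daily_s` series built above (the December 5 cutoff is the author's inference, not a fact from the data):

```python
# Compare mean daily pv before and after the assumed warm-up start
cutoff = pd.to_datetime('2014-12-05')
before = pv_daily_s[pv_daily_s.index < cutoff].mean()
after = pv_daily_s[pv_daily_s.index >= cutoff].mean()
print(f'mean daily pv before {cutoff.date()}: {before:,.0f}')
print(f'mean daily pv from {cutoff.date()} on: {after:,.0f}')
```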
Changes in activity on the day of Double Twelve
Select the data for the day of Double Twelve and analyze its active periods: page views per hour, unique visitors per hour, and visits per visitor per hour, packaged into a new DataFrame.
```python
# Take out the data for the day of Double Twelve
data_1212 = data.loc[data['date'] == '2014-12-12']

# Page views per hour on Double Twelve
pv_hour_s = data_1212.groupby(by='hour')['user_id'].count()

# Unique visitors per hour on Double Twelve
uv_hour_s = data_1212.groupby(by='hour')['user_id'].nunique()

# Visits per visitor per hour on Double Twelve
pv_uv_s = pv_hour_s / uv_hour_s

df = pd.concat((pv_hour_s, uv_hour_s, pv_uv_s), axis=1)
df.columns = ['pv_hour', 'uv_hour', 'pv/uv']
df.head()
```
Visualize the hourly page views, unique visitors, and visits per visitor on Double Twelve (plotting code sketched below), which gives the following results:
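The original plotting code for this step is not shown; a minimal sketch mirroring the daily plots above, using the hourly `df` just built:

```python
# Hourly pv, uv and pv/uv on Double Twelve, one panel each
plt.figure(figsize=(13, 13), dpi=320)
ax1 = plt.subplot(311)
ax1.plot(df.index, df['pv_hour'], color='b', linestyle='-')
ax1.set_title('pv per hour')
ax2 = plt.subplot(312)
ax2.plot(df.index, df['uv_hour'], color='r', linestyle='-')
ax2.set_title('uv per hour')
ax3 = plt.subplot(313)
ax3.plot(df.index, df['pv/uv'], color='k', linestyle='-')
ax3.set_title('pv/uv per hour')
```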
From the figure, Taobao users on Double Twelve are most active after 0:00 and again after 18:00, with activity bottoming out around 6:00. Merchants could schedule coupons or other promotions after 18:00 to draw more shoppers and lift the purchase rate.
Changes in the activity of different user behaviors
```python
pv_df = data.pivot_table(index='date', columns='behavior_type', aggfunc='size', fill_value=0)
pv_df.head()
```
Plot the daily totals of the four behaviors on one set of axes:
```python
plt.figure(figsize=(12, 8))
plt.xticks(rotation=30)
plt.plot(pv_df.index, pv_df[1], label='click')
plt.plot(pv_df.index, pv_df[2], label='collect')
plt.plot(pv_df.index, pv_df[3], label='add to cart')
plt.plot(pv_df.index, pv_df[4], label='pay')
plt.legend()
```
The figure shows that all four behaviors (click, collect, add to cart, pay) peaked on the day of Double Twelve. Comparing payments with collections alone, payments on Double Twelve exceeded collections, which suggests that many users bought more than they had planned: possibly impulse purchases driven by the promotion, or extra items added to reach discount thresholds.
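The payment-versus-collection comparison can be read directly off the pivot table; a quick check (a sketch, not part of the original analysis):

```python
# Compare pay (4) and collect (2) counts on the day of Double Twelve
d12 = pd.to_datetime('2014-12-12')
print('pays on Double Twelve:    ', pv_df.loc[d12, 4])
print('collects on Double Twelve:', pv_df.loc[d12, 2])
```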
Retention
Funnel Conversions
```python
# Count the visits for each behavior type and package them into a DataFrame
# (sort by behavior code so the labels line up with codes 1/2/3/4)
s = data['behavior_type'].value_counts().sort_index()
s.index = ['click', 'collect', 'add to cart', 'pay']
df = pd.DataFrame([s.index, s]).T
df.columns = ['user behavior', 'Views']
df
```
```python
# Conversion rate of each step: click->collect, collect->add to cart, add to cart->pay
temp1 = df['Views'][1:].values
temp2 = df['Views'][0:-1].values
p = temp1 / temp2 * 100
p = list(p)
p.insert(0, 100)
df['single-step conversion rate (%)'] = p
df

# Overall conversion rate relative to clicks
df['overall conversion rate (%)'] = df['Views'] / df.iloc[0, 1] * 100
df

# Churn rate of each step
df['churn rate (%)'] = 100 - df['single-step conversion rate (%)']
df
```
Use the pyecharts library to draw a funnel chart:
```python
# Funnel chart of the overall conversion rate
from pyecharts.charts import Funnel
from pyecharts import options as opts

funnel = Funnel().add(
    series_name='overall conversion rate (%)',
    data_pair=[list(z) for z in zip(df['user behavior'], df['overall conversion rate (%)'])],
    is_selected=True,
    label_opts=opts.LabelOpts(position='inside')
)
funnel.set_series_opts(tooltip_opts=opts.TooltipOpts(formatter='{a}<br/>{b}:{c}%'))
funnel.set_global_opts(title_opts=opts.TitleOpts(title='overall conversion rate (%)'))
funnel.render_notebook()
```
The chart shows that user churn is concentrated in the click-to-collect step. Analyzing the churn at each step, plausible causes are as follows.
Churn from click to collect. ① Users are drawn in by ads but find the page falls well short of expectations and leave; the remedy is to refine ad targeting and content. ② Users search or are recommended to a listing page but cannot find a suitable product; the remedy is to improve the search engine and recommendation algorithms so that results match intent as closely as possible. ③ Poor product reviews drive users away; merchants should investigate why ratings are low, make improvements, raise the shopping experience, and ultimately earn word-of-mouth.
Churn from collect to add to cart. These users have a clear purchase intent; precisely targeted promotional pushes can nudge them to complete the purchase.
Churn from add to cart to payment. Churn at this step is often caused by a cumbersome checkout. Merchants should streamline the purchase flow and support as many payment methods as possible (bank card, WeChat Pay, Alipay, Huabei, etc.). Given that Taobao's checkout flow is already hard to simplify further, merchants should also survey users to understand why they abandon payment, and adjust accordingly.
Repurchase analysis
```python
# Histogram of purchases per user
buy_df = data.loc[data['behavior_type'] == 4]
user_buy_s = buy_df.groupby(by='user_id')['behavior_type'].count()
plt.hist(user_buy_s, bins=50)
```
```python
# Repurchase rate = users with more than one purchase / all users with a purchase
reBuy_rate = user_buy_s[user_buy_s > 1].count() / user_buy_s.count() * 100
reBuy_rate
```
The result is 91.44722034661264: in the month from November 18, 2014 to December 18, 2014, the user repurchase rate was as high as 91.45%.
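Note that `user_buy_s` counts individual payment records, so a user who pays for several items in a single day already counts as a repurchaser, which inflates the rate. A variant that requires purchases on more than one distinct day is sketched below (an alternative definition, not the author's method):

```python
# Repurchase defined as purchases on more than one distinct day
buy_days = buy_df.groupby('user_id')['date'].nunique()
reBuy_rate_daily = buy_days[buy_days > 1].count() / buy_days.count() * 100
reBuy_rate_daily
```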
5. User value analysis (RFM model)
```python
# Take out the purchase records separately
buy_df = data.loc[data['behavior_type'] == 4]
buy_df.head()
```
```python
# R: days since the customer's most recent purchase
# (dividing by np.timedelta64(1, 'D') converts the timedelta to days)
now_date = buy_df['date'].max()
R = buy_df.groupby(by='user_id')['date'].apply(lambda x: now_date - x.max()) / np.timedelta64(1, 'D')

# F: purchase frequency per user
F = buy_df.groupby(by='user_id')['date'].count()
```
Since the dataset has no transaction amounts, only recency (last purchase date) and frequency are considered; the monetary dimension (M) is omitted.
```python
rfm = pd.DataFrame(data=[R, F], index=['R', 'F']).T
rfm.head()
```
```python
# Split each dimension into two levels around its mean
recent_avg = rfm['R'].mean()
freq_avg = rfm['F'].mean()

# R should be as small as possible: return '1' if R is below the mean, else '0'
def rec_value(x):
    if x < recent_avg:
        return '1'
    else:
        return '0'

# F should be as large as possible: return '1' if F is above the mean, else '0'
def freq_value(x):
    if x > freq_avg:
        return '1'
    else:
        return '0'

rfm['R_value'] = rfm['R'].apply(rec_value)
rfm['F_value'] = rfm['F'].apply(freq_value)

# Concatenate R and F
rfm['rfm'] = rfm['R_value'].str.cat(rfm['F_value'])

# Assign a user level based on the R/F combination
def rfm_value(x):
    if x == '10':    # recent purchase, but low frequency
        return 'important development customers'
    elif x == '01':  # long since last purchase, but high frequency
        return 'important retention customers'
    elif x == '00':  # long since last purchase and low frequency
        return 'important win-back customers'
    else:            # '11': recent purchase and high frequency
        return 'important value customers'

rfm['user_type'] = rfm['rfm'].apply(rfm_value)
rfm.head()
```
```python
# Proportion of each user type
user_type_count_s = rfm['user_type'].value_counts()
user_type_count_s / user_type_count_s.sum() * 100
```
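The mix is easier to see as a chart; a minimal matplotlib sketch (the original shows only the printed proportions):

```python
# Pie chart of the four RFM user types
plt.figure(figsize=(6, 6))
plt.pie(user_type_count_s, labels=user_type_count_s.index, autopct='%.2f%%')
plt.title('RFM user types')
plt.show()
```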
From this, important win-back customers make up the largest share: their last purchase is long past and their purchase frequency is low. Reach out proactively, find out where the problem lies, and re-activate them via SMS, email, or in-app push to limit churn. Important development customers buy infrequently; targeted discounts or bundles can raise their purchase frequency and retention. Important value customers are the key users, though relatively few; they warrant dedicated VIP service. Important retention customers have a long gap since their last purchase but a high purchase frequency; they may buy many things at once, so contact them proactively, track their shopping habits, market to them precisely, and meet their needs in time.
Summary
That concludes this article. It used pandas, numpy, matplotlib, and pyecharts to implement the AARRR and RFM models, analyzing user behavior around Taobao's Double Twelve to draw conclusions of some commercial relevance.