Original link: http://tecdat.cn/?p=26219
Bank dataset
Our dataset description
This data relates to the direct marketing campaigns of a banking institution, which were conducted by telephone. Often, several contacts with the same customer were needed to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
y - has the customer subscribed to a term deposit? (binary: 'yes', 'no')
Our goal is to find the model that best predicts whether or not a customer subscribes. We will use the following algorithms:
- logistic regression
- random forest
- K-nearest neighbors (KNN)
- decision tree
- Gaussian naive Bayes
- support vector machine
The choice of the best model will be based on:
- accuracy
- AUC
Data preparation
In this section we load the data. Our dataset has 45,211 observations.
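A minimal sketch of the loading step, assuming the UCI bank marketing file bank-full.csv with ';' as the separator (the file name and path are assumptions, as the original post does not show them):

```python
import pandas as pd

# Assumed file name and separator; adjust to wherever the CSV actually lives.
data = pd.read_csv("bank-full.csv", sep=";")
print(data.shape)  # 45,211 observations
```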
Input variables:
Bank customer data
1 - age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed).
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6 - housing: has a housing loan? (categorical: 'no', 'yes', 'unknown')
7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
8 - contact: contact communication type (categorical: 'cellular', 'telephone').
9 - month: month of the last contact (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: duration of the last contact, in seconds (numeric).
12 - campaign: number of contacts made with this customer during this campaign (numeric, includes the last contact).
13 - pdays: number of days since the customer was last contacted in a previous campaign (numeric; 999 means the customer was not previously contacted).
14 - previous: number of contacts made with this customer before this campaign (numeric).
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success').
Social and economic background attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric).
17 - cons.price.idx: consumer price index - monthly indicator (numeric).
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric).
19 - euribor3m: Euribor 3-month rate - daily indicator (numeric).
20 - nr.employed: number of employees - quarterly indicator (numeric).
Output variable (desired target):
- y - has the customer subscribed to a term deposit? (binary: 'yes', 'no')
data.head(5)
Our next step is to look at the variable types and check whether there are any missing values.
df1 = data.dtypes
df1
df2 = data.isnull().sum()
df2
Our next step is to count the occurrences of each value of the categorical variables.
data['y'].value_counts()
data['job'].value_counts()
data['marital'].value_counts()
data['education'].value_counts()
data['housing'].value_counts()
data['loan'].value_counts()
data['contact'].value_counts()
data['month'].value_counts()
data['poutcome'].value_counts()
Descriptive statistics
Numerical summary
data.head(5)
We recode the dependent variable y, replacing 'no' with 0 and 'yes' with 1.
data['y'] = data['y'].map({'no': 0, 'yes': 1})
data.columns
For each numeric variable, we draw a boxplot to see whether there are any visible outliers.
plt.figure(figsize=[10, 25])
ax = plt.subplot(611)
sns.boxplot(data['age'], orient="v")
We can see many visible outliers, especially for balance, campaign and pdays. For pdays, many values lie outside the interquartile range. This variable is a special case: no previous contact is encoded as -1, which is why the plot looks the way it does. The boxplot of previous, which represents the number of contacts performed before this campaign, also shows many values beyond the interquartile range.
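A minimal sketch of drawing boxplots for several numeric columns at once (the column list and subplot layout are assumptions based on the variables discussed above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric columns inspected for outliers
num_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

plt.figure(figsize=[10, 25])
for i, col in enumerate(num_cols, start=1):
    plt.subplot(len(num_cols), 1, i)      # one boxplot per row
    sns.boxplot(x=data[col], orient="h")  # horizontal boxplot of the column
    plt.title(col)
plt.tight_layout()
plt.show()
```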
Histograms
Our next step is to look at the distributions and histograms of the continuous variables.
We can see that no variable has a normal distribution.
plt.figure(figsize=[10, 20])
plt.subplot(611)
g = sns.distplot(data["age"], color="r")
Our next step is to look at the relationship between the dependent variable y and each continuous variable.
g = sns.FacetGrid(data, col='y', size=4)
g.map(sns.distplot, "age")  # e.g. age; repeated for the other continuous variables
The most interesting observation from these plots is that most of the people who said 'no' are between 20 and 40 years old. Around the 20th day of the month and at the end of the month, most people also rejected the offer.
Categorical summary
We create a subset of the data containing only the categorical variables to make it easier to draw bar charts.
data_categorical = data[['job', 'marital', 'education', 'default', 'housing', 'loan', 'month', 'y']]
We also look at the categorical variables to see whether there are any interesting features.
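The plotting code for these bar charts is not shown in the post; a minimal sketch with seaborn countplots (the grid layout is an assumption, the columns follow data_categorical above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'month']

fig, axes = plt.subplots(4, 2, figsize=(14, 20))
for ax, col in zip(axes.ravel(), cat_cols):
    sns.countplot(x=col, data=data_categorical, ax=ax)  # frequency of each category
    ax.tick_params(axis='x', rotation=45)
axes.ravel()[-1].set_visible(False)  # hide the unused eighth panel
plt.tight_layout()
plt.show()
```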
As can be seen from the bar charts above, the most interesting results come from the variables marital status, education and job.
From the chart of marital status we see that most people are married.
From the chart of education we see that the largest group is people with secondary education.
For job, we can see that most people have blue-collar and management jobs.
We also want to see the relationship between our categorical variables and the y variable in mosaic plots.
plt.rcParams['font.size'] = 16.0
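The mosaic plotting code itself is not shown; a minimal sketch with statsmodels' mosaic function (the choice of variables is an assumption based on the discussion below):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# One mosaic per categorical variable against the 0/1 target y
for col in ['marital', 'default', 'housing', 'loan']:
    fig, ax = plt.subplots(figsize=(8, 6))
    mosaic(data, [col, 'y'], ax=ax, title=f'{col} vs y')
    plt.show()
```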
As we can see, most people rejected the offer. In terms of marital status, married people said "no" most often.
In the case of the default variable, most people without credit in default also rejected the offer.
Most people with a housing loan also rejected the offer.
Most people without a personal loan rejected the offer.
Data mining
data.head(5)
We want to go deeper into our variables and see if we can do more with them.
Our next step is to use WOE (weight of evidence) analysis.
finv, IV = datars(data, data.y)
IV
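datars appears to be a helper defined elsewhere in the original notebook; a minimal sketch of how WOE and IV can be computed for a single categorical predictor (the function name woe_iv and the smoothing constant are assumptions):

```python
import numpy as np

def woe_iv(df, feature, target='y'):
    """Weight of evidence and information value for one categorical feature.
    Assumes the target is already coded 0/1."""
    grouped = df.groupby(feature)[target].agg(['sum', 'count'])
    events = grouped['sum']                  # y = 1 in each category
    non_events = grouped['count'] - events   # y = 0 in each category
    # Distribution across categories; 0.5 avoids division by zero / log(0)
    pct_events = (events + 0.5) / events.sum()
    pct_non_events = (non_events + 0.5) / non_events.sum()
    woe = np.log(pct_events / pct_non_events)
    iv = ((pct_events - pct_non_events) * woe).sum()
    return woe, iv

woe_job, iv_job = woe_iv(data, 'job')
print(iv_job)
```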
Based on the WOE analysis, the variables that are useful to us are: pdays, previous, job, housing, balance, month, duration, poutcome and contact.
In the next step, we decide which columns to delete based on the WOE results and the earlier exploration of the variables.
One of the columns we deleted was poutcome. Although its WOE was very high, we decided to drop it because we saw many 'unknown' observations in the earlier analysis.
For the variable duration, the WOE is also quite large, so large that the result looks a little suspicious. We decided to drop it despite its WOE, because our model should recommend, based on past data, whether to call someone at all, and the call duration is not known in advance.
In the case of the variable contact, we drop it because the form of contact is of no use in our model.
We also deleted the variable day: it only represents the day of the month, and its WOE is very small. The last variable we deleted is pdays; although its WOE result is very good, it is not a useful variable for us.
The remaining columns in our analysis are listed below.
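The dropping code is not shown explicitly; a minimal sketch of removing the columns discussed above and listing what remains (assuming these are the exact column names in the frame):

```python
# Drop the columns judged not useful for the model (see the discussion above)
data = data.drop(columns=['poutcome', 'duration', 'contact', 'day', 'pdays'])
data.columns
```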
Feature selection and engineering
To run our algorithms, we first need to convert the string variables into dummy (binary) variables.
data = pd.get_dummies(data=data, columns=['job', 'marital', 'education', 'month'], prefix=['job', 'marital', 'education', 'month'])
We then renamed the columns.
data.head(5)
After creating the dummy variables, we compute Pearson correlations.
from scipy.stats import pearsonr
age = pearsonr(data['age'], data['y'])
corr = data.corr()
sns.heatmap(corr)
We selected the numeric columns to check the correlations. As we can see, there is no noticeable correlation.
Let's look at the relationship between the dependent variable and the continuous variables.
pylab.show()
Cross validation
After all the preparatory work, we can finally split the dataset into a training set and a test set.
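The splitting code is not shown in the post; a minimal sketch with scikit-learn (the 80/20 ratio, stratification and random_state are assumptions):

```python
from sklearn.model_selection import train_test_split

X = data.drop(columns=['y'])
y = data['y']

# Hold out 20% of the data, stratified on the imbalanced target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
```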
Implementation of the algorithms
Logistic regression
K = 5
kf = KFold(n_splits=K, shuffle=True)
logreg = LogisticRegression()
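The per-fold training loop is not shown in the post; below is a minimal sketch of how the confusion matrices, mean accuracy and best AUC could be computed, reusing kf and logreg from above (the variable names accuracylogreg and bestauclogreg follow the summary table later on; whether the folds run over the full data or only the training split is not shown, so this detail is an assumption):

```python
from sklearn import metrics

accuracylogreg, bestauclogreg = 0.0, 0.0
for train_idx, test_idx in kf.split(X_train):
    X_tr, X_te = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_tr, y_te = y_train.iloc[train_idx], y_train.iloc[test_idx]

    logreg.fit(X_tr, y_tr)
    preds = logreg.predict(X_te)

    print(metrics.confusion_matrix(y_te, preds))       # one matrix per fold
    accuracylogreg += metrics.accuracy_score(y_te, preds)
    auc = metrics.roc_auc_score(y_te, logreg.predict_proba(X_te)[:, 1])
    bestauclogreg = max(bestauclogreg, auc)

print("Mean accuracy: ", accuracylogreg/K)
print("The best AUC: ", bestauclogreg)
```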
[[7872 93] [ 992 86]]
[[7919 81] [ 956 86]]
[[7952 60] [ 971 59]]
[[7871 82] [1024 65]]
[[7923 69] [ 975 75]]
Decision tree
dt2 = tree.DecisionTreeClassifier(random_state=1, max_depth=2)
[[7988 0] [1055 0]]
[[7986 0] [1056 0]]
[[7920 30] [1061 31]]
[[8021 0] [1021 0]]
[[7938 39] [1039 26]]
Random forest
random_forest = RandomForestClassifier()
[[7812 183] [ 891 157]]
[[7825 183] [ 870 164]]
[[7774 184] [ 915 169]]
[[7770 177] [ 912 183]]
[[7818 196] [ 866 162]]
K-nearest neighbors (KNN)
classifier = KNeighborsClassifier(n_neighbors=13, metric='minkowski', p=2)
print("Mean accuracy: ", accuracyknn/K)
print("The best AUC: ", bestaucknn)
[[7952 30] [1046 15]]
[[7987 30] [1010 15]]
[[7989 23] [1017 13]]
[[7920 22] [1083 17]]
[[7948 21] [1052 21]]
Gaussian naive Bayes
kf = KFold(n_splits=K, shuffle=True)
gaussian = GaussianNB()
[[7340 690] [ 682 331]]
[[7321 633] [ 699 389]]
[[7291 672] [ 693 386]]
[[7300 659] [ 714 369]]
[[7327 689] [ 682 344]]
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Naive Bayes', 'Decision Tree', 'Random Forest'],
    'Score': [accuracyknn/K, accuracylogreg/K, accuracygnb/K, accuracydt/K, accuracyrf/K],
    'BestAUC': [bestaucknn, bestauclogreg, bestaucgnb, bestaucdt, bestaucrf]})
We see that, based on the AUC value, the best model is naive Bayes. We should not worry too much about it having the lowest accuracy score, because the data are very unbalanced (it is easy to predict y=0). Its confusion matrices show that it predicts a reasonable number of both positive and negative cases. To our surprise, the AUC of the decision tree is only about 50%.
Undersampling
We try undersampling the majority class (y=0).
gTrain, gValid = train_test_split
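The resampling code above is truncated; a minimal sketch of splitting the data and randomly undersampling the majority class of the training set (ratios, variable names and random_state are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Split into training and validation sets
gTrain, gValid = train_test_split(data, test_size=0.2, random_state=1)
ygTrain, ygValid = gTrain['y'], gValid['y']

# Randomly undersample the majority class (y=0) of the training set
# down to the size of the minority class
minority = gTrain[gTrain['y'] == 1]
majority = gTrain[gTrain['y'] == 0].sample(n=len(minority), random_state=1)
gTrainU = pd.concat([majority, minority]).sample(frac=1, random_state=1)  # shuffle
ygTrainU = gTrainU['y']
gTrainU = gTrainU.drop(columns=['y'])
```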
Logistic regression
predsTrain = logreg.predict(gTrainUrandom)
predsTrain = logreg.predict(gTrain20Urandom)
predsTrain = logreg.predict(gTrrandom)
Decision tree
print("Train AUC:", metrics.roc_auc_score(ygTrds))
Random forest
print("Train AUC:", metrics.roc\_auc\_score(ygTr, predsTrain), "Valid AUC:", metrics.roc\_auc\_score(ygVd, preds))
K-nearest neighbors (KNN)
print("Train AUC:", metrics.roc\_auc\_score(ygTrm, predsTrain), "Valid AUC:", metrics.roc\_auc\_score(ygVal10, preds))
Gaussian naive Bayes
print("Train AUC:", metrics.roc\_auc\_score(ygTraom, predsTrain), "Valid AUC:", metrics.roc\_auc\_score(ygid, preds))
Oversampling
We try oversampling the minority class (y=1).
features = data.columns.tolist()
print(features)
features.remove('y')
print(gTrainOSM.shape)
(31945, 39)
smt = SMOTE()
(32345, 39)
smt = SMOTE()
(32595, 39)
ygTrain10OSM = gTrain10OSM['y']
gTrain10OSM = gTrain10OSM.drop(columns=['y'])
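A minimal sketch of the SMOTE oversampling step with imbalanced-learn, reusing gTrain and ygTrain from the earlier split sketch (the variable names and default sampling ratio are assumptions; fit_resample is the current imblearn API):

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class (y=1) of the training set only
smt = SMOTE(random_state=1)
gTrainOSM, ygTrainOSM = smt.fit_resample(gTrain.drop(columns=['y']), ygTrain)
print(gTrainOSM.shape)
```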
Logistic regression
print("Train AUC:", metrics.roc\_auc\_score(ygTrin10SM, predsTrain), "Valid AUC:", metrics.roc\_auc\_score(ygValid, preds))
Decision tree
dt2.fit(gTrainOSM, ygTrainOSM)
predsTrain = dt2.predict(gTrainOSM)
preds = dt2.predict(gValid)
Random forest
random_forest.fit(gTrainOSM, ygTrainOSM)
predsTrain = random_forest.predict(gTrainOSM)
preds = random_forest.predict(gValid)
K-nearest neighbors (KNN)
classifier.fit(gTrainOSM, ygTrainOSM)
predsTrain = classifier.predict(gTrainOSM)
preds = classifier.predict(gValid)
Gaussian naive Bayes
gaussian.fit(gTrainOSM, ygTrainOSM)
predsTrain = gaussian.predict(gTrainOSM)
Conclusion
We see that undersampling and oversampling the target variable y did not improve the AUC much.