Bank machine learning in Python: regression, random forest, K-nearest neighbors, decision tree, Gaussian naive Bayes and support vector machine (SVM) analysis of marketing campaign data

Bank dataset

Dataset description

This data relates to the direct marketing campaigns of a banking institution, which were conducted by telephone. Often, the same customer had to be contacted more than once to assess whether they would ("yes") or would not ("no") subscribe to the product (a bank term deposit).
y - does the customer subscribe to a term deposit? (binary: 'yes', 'no')

Our goal is to find the model that best predicts whether a customer subscribes. We will use the following algorithms:

  • Logistic regression
  • Random forest
  • K-nearest neighbors (KNN)
  • Decision tree
  • Gaussian naive Bayes
  • Support vector machine

The decision to select the best model will be based on:

  • Accuracy
  • AUC (also checked after undersampling and oversampling)

Data preparation

In this section, we load the data. Our data has 45211 observations.

Input variables:
Bank customer data
1 - age (numeric)
2 - job: type of job (categorical: 'administration', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic 4 years', 'basic 6 years', 'basic 9 years', 'high school', 'illiterate', 'professional course', 'university degree', 'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6 - housing: has a housing loan? (categorical: 'no', 'yes', 'unknown')
7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
8 - contact: contact communication type (categorical: 'cellular', 'telephone')
9 - month: month of the year of the last contact (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: duration of the last contact, in seconds (numeric)
12 - campaign: number of contacts made for this customer during this campaign (numeric, includes the last contact)
13 - pdays: number of days since the customer was last contacted in a previous campaign (numeric; 999 means the customer was not previously contacted)
14 - previous: number of contacts made for this customer before this campaign (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
Social and economic context attributes
16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
17 - cons.price.idx: consumer price index, monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
20 - nr.employed: number of employees, quarterly indicator (numeric)

Output variable (desired target):

  • y - does the customer subscribe to a term deposit? (binary: 'yes', 'no')

Our next step is to look at the types of the variables and check whether there are missing values.

df1 = data.dtypes

df2 = data.isnull().sum() 
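
The two checks above can be tried on a small stand-in frame. This is a minimal sketch, assuming pandas and NumPy are installed; the frame `toy` and its values are made up for illustration, since the loading step is not shown here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (hypothetical values).
toy = pd.DataFrame({
    "age": [30, 45, np.nan, 52],
    "job": ["management", None, "student", "technician"],
    "y": ["no", "yes", "no", "no"],
})

column_types = toy.dtypes            # one dtype per column
missing_counts = toy.isnull().sum()  # number of missing cells per column

print(column_types)
print(missing_counts)
```

A column with a non-zero count in `missing_counts` would need imputation or removal before modelling.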

Our next step is to count the values of each variable.

Descriptive statistics

Numerical summary


Recode the dependent variable y: replace 'no' with 0 and 'yes' with 1.

data['y'] = data['y'].map({'no': 0, 'yes': 1})
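
Once y is numeric, the class balance is easy to inspect, which matters later when accuracy is interpreted. A small sketch on made-up labels (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical labels; the real data has far more 'no' than 'yes' answers.
toy = pd.DataFrame({"y": ["no", "no", "yes", "no", "yes", "no"]})

toy["y"] = toy["y"].map({"no": 0, "yes": 1})

# Share of each class after recoding.
class_share = toy["y"].value_counts(normalize=True)
print(class_share)
```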

For each of our variables, we draw a box plot to see if there are any visible outliers.

ax = plt.subplot(611)

We can see many visible outliers, especially for balance, campaign and pdays. For pdays, many observations fall outside the interquartile range. This variable is a special case: customers with no previous contact are coded as -1, which is why the plot looks like this. In the box plot of previous, which represents the number of contacts made before this campaign, we can also see many values beyond the interquartile range.
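
The effect described for pdays can be reproduced numerically. The sketch below applies the usual 1.5 × IQR box-plot rule to a made-up sample in which most values are -1; the helper name `count_iqr_outliers` and the sample are assumptions for illustration.

```python
import numpy as np

def count_iqr_outliers(values, whisker=1.5):
    """Count points beyond the box-plot whiskers (Q1 - w*IQR, Q3 + w*IQR)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Hypothetical pdays-like sample: eight customers never contacted (-1), two contacted.
pdays_like = [-1] * 8 + [90, 180]
n_outliers = count_iqr_outliers(pdays_like)
print(n_outliers)
```

Because both quartiles equal -1, the box collapses to a single line and every real day count is flagged as an outlier.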


Our next step is to look at the distributions and histograms of the continuous variables.
We can see that no variable has a normal distribution.

g = sns.histplot(data["age"], color="r", kde=True)  # distplot is deprecated in recent seaborn releases

Our next step is to look at the relationship between the dependent variable y and each continuous variable.

g = sns.FacetGrid(data, col='y', height=4)  # 'size' was renamed to 'height' in seaborn 0.9

The most interesting observation from these plots is that most people who said no are between 20 and 40 years old. Most people also rejected the offer around the 20th day, towards the end of the month.

Categorical summary

We make a subset of the data that contains only categorical variables, which makes it easier to draw bar charts.

data_categorical = data[['job',
 'default', 'housing',
 'loan', 'month', 'y']]


We also looked at the categorical variables to see if there were any interesting features.
As the bar charts show, the most interesting results come from the variables marital status, education and job.
In the chart for marital status, most people are married.
In the chart for education, the largest group is people with secondary education.
For job, most people work in blue-collar and management jobs.

We also want to see the relationship between our categorical variables and y in a mosaic plot.

plt.rcParams['font.size'] = 16.0

As we can see, most people rejected the offer. In terms of marital status, married people said "no" most often.

In the case of variable default, most people without default credit also rejected the proposal.

Most people with housing loans also rejected the proposal.

Most people without loans rejected the offer.

Data mining

We want to go deeper into our variables and see if we can do more with them.

Our next step is to run a weight of evidence (WOE) analysis.

final_iv, IV = data_vars(data, data.y)

Based on the WOE analysis, the variables that are useful to us are: pdays, previous, job, housing, balance, month, duration, poutcome and contact.
In the next step, we drop the columns that are not useful, based on the WOE results and our earlier look at the variables.
One of the columns we dropped is poutcome. Although its WOE was very high, we dropped it because the previous analysis showed that it contains many unknown observations.
For duration, the WOE is also quite large, to the point that the result looks a little suspicious. We dropped it despite its WOE, because our model should tell us whether it is worth calling someone based on data available before the call, and the duration is only known once the call has been made.
We dropped contact because the form of contact is not useful in our model.
We also dropped day, because it only represents the day of the month and its WOE is very small. The last variable we dropped is pdays. Although its WOE is very good, it is not a useful variable for us.
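
For readers unfamiliar with WOE, the sketch below computes weight of evidence and information value (IV) for one categorical variable. It is not the helper used in this analysis: the function `woe_iv`, the toy frame and the sign convention (WOE = ln of event share over non-event share) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def woe_iv(frame, feature, target):
    """Per-category WOE and total IV for a 0/1 target.

    Assumed convention: WOE = ln(share of events / share of non-events).
    Categories with zero events or zero non-events would need smoothing first.
    """
    counts = frame.groupby(feature)[target].agg(events="sum", total="count")
    counts["non_events"] = counts["total"] - counts["events"]
    event_share = counts["events"] / counts["events"].sum()
    non_event_share = counts["non_events"] / counts["non_events"].sum()
    counts["woe"] = np.log(event_share / non_event_share)
    iv = float(((event_share - non_event_share) * counts["woe"]).sum())
    return counts, iv

# Hypothetical data: subscribers are rarer among customers with a housing loan.
toy = pd.DataFrame({
    "housing": ["yes"] * 4 + ["no"] * 4,
    "y":       [1, 0, 0, 0, 1, 1, 1, 0],
})
table, iv = woe_iv(toy, "housing", "y")
print(table[["events", "non_events", "woe"]])
print("IV:", iv)
```

A larger IV means the variable separates the two classes more strongly; a very large value, as noted for duration, is often a hint that the variable leaks information about the outcome.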

The remaining columns in our analysis:


Feature selection and engineering

To run our algorithms, we first need to convert the string variables to binary (dummy) variables.

data = pd.get_dummies(data=data, columns=['job', 'marital', 'education', 'month'],
                      prefix=['job', 'marital', 'education', 'month'])
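
To see what this call produces, here is the same operation on a two-column toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
})

# Each category of 'marital' becomes its own indicator column, named prefix_category.
encoded = pd.get_dummies(data=toy, columns=["marital"], prefix=["marital"])
print(encoded.columns.tolist())
```

The original string column is removed and numeric columns such as age are left untouched.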

We renamed the columns.


After creating the dummy variables, we computed Pearson correlations.

age = pearsonr(data['age'], data['y'])


We selected the numeric columns to check the correlation. As we can see, there is no correlation.
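
As a reminder of what `pearsonr` returns, the sketch below runs it on two short made-up arrays, assuming SciPy is installed. With a 0/1 target the Pearson coefficient is the point-biserial correlation.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ages and 0/1 subscription labels.
age = np.array([25, 31, 38, 44, 52, 60, 29, 47])
y = np.array([0, 0, 1, 0, 1, 1, 0, 0])

# The result unpacks into the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(age, y)
print("r:", r, "p-value:", p_value)
```

A coefficient near zero only rules out a linear relationship; a variable can still be useful to a tree-based model.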

Let's look at the relationship between the dependent variable and the continuous variables.

Cross-validation

After all the preparatory work, we can finally split the data set into a training set and a test set.

Implementation of the algorithms

Logistic regression

kf = KFold(n_splits=K, shuffle=True)

logreg = LogisticRegression()

Confusion matrices, one per fold:

[[7872   93]
 [ 992   86]]

[[7919   81]
 [ 956   86]]

[[7952   60]
 [ 971   59]]

[[7871   82]
 [1024   65]]

[[7923   69]
 [ 975   75]]
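
The loop that produces one confusion matrix per fold is not shown in full. A minimal sketch of such a loop follows, assuming scikit-learn is installed; the synthetic data from `make_classification`, the variable names and the choice to score AUC on predicted probabilities are assumptions for illustration, not the code used for the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import KFold

# Synthetic, imbalanced stand-in for the prepared bank data (about 12% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)
accuracy_sum, best_auc, matrices = 0.0, 0.0, []

for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]  # probability of class 1
    accuracy_sum += accuracy_score(y[test_idx], preds)
    best_auc = max(best_auc, roc_auc_score(y[test_idx], scores))
    matrices.append(confusion_matrix(y[test_idx], preds))

print("Mean accuracy:", accuracy_sum / K)
print("The best AUC:", best_auc)
for matrix in matrices:
    print(matrix)
```

Swapping `LogisticRegression` for another estimator reuses the same loop for the remaining models.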

Decision tree

dt2 = tree.DecisionTreeClassifier(random_state=1, max_depth=2)

[[7988    0]
 [1055    0]]

[[7986    0]
 [1056    0]]

[[7920   30]
 [1061   31]]

[[8021    0]
 [1021    0]]

[[7938   39]
 [1039   26]]

Random forest

random_forest = RandomForestClassifier

[[7812  183]
 [ 891  157]]

[[7825  183]
 [ 870  164]]

[[7774  184]
 [ 915  169]]

[[7770  177]
 [ 912  183]]

[[7818  196]
 [ 866  162]]

K-nearest neighbors (KNN)

classifier = KNeighborsClassifier(n_neighbors=13, metric='minkowski', p=2)

print("Mean accuracy: ", accuracyknn/K)
print("The best AUC: ", bestaucknn)

[[7952   30]
 [1046   15]]

[[7987   30]
 [1010   15]]

[[7989   23]
 [1017   13]]

[[7920   22]
 [1083   17]]

[[7948   21]
 [1052   21]]

Gaussian naive Bayes

kf = KFold(n_splits=K, shuffle=True)

gaussian = GaussianNB()

[[7340  690]
 [ 682  331]]

[[7321  633]
 [ 699  389]]

[[7291  672]
 [ 693  386]]

[[7300  659]
 [ 714  369]]

[[7327  689]
 [ 682  344]]


models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression',
              'Naive Bayes', 'Decision Tree', 'Random Forest'],
    'Score': [accuracyknn/K, accuracylogreg/K,
              accuracygnb/K, accuracydt/K, accuracyrf/K],
    'BestAUC': [bestaucknn, bestauclogreg, bestaucgnb,
                bestaucdt, bestaucrf]})

We see that the best model based on AUC is naive Bayes. We should not worry too much that it has the lowest accuracy score, because the data is very unbalanced (it is easy to predict y=0). In its confusion matrices, we see that it predicts both true positives and true negatives reasonably well. To our surprise, the AUC of the decision tree is about 50%.
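
The gap between accuracy and AUC can be checked by hand from the confusion matrices printed earlier. The sketch below assumes scikit-learn's layout, [[TN, FP], [FN, TP]], and assumes the AUC is computed from hard class predictions, in which case it equals the mean of recall and specificity. The helper `summarize` is hypothetical.

```python
def summarize(matrix):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)        # share of actual subscribers found
    specificity = tn / (tn + fp)   # share of non-subscribers correctly rejected
    # With hard 0/1 predictions the ROC curve has a single interior point,
    # so the area under it is the mean of recall and specificity.
    hard_label_auc = (recall + specificity) / 2
    return {"accuracy": accuracy, "recall": recall, "hard_label_auc": hard_label_auc}

# First-fold matrices reported above for the decision tree and naive Bayes.
tree = summarize([[7988, 0], [1055, 0]])
bayes = summarize([[7340, 690], [682, 331]])

print("Decision tree:", tree)
print("Naive Bayes:  ", bayes)
```

In that fold the tree predicts y=0 for every customer: its accuracy is high only because y=0 dominates, while its recall is zero and its hard-label AUC is exactly 0.5.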


We try undersampling the majority class y=0.

gTrain, gValid = train_test_split
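
The undersampling step itself is not shown. One common way to do it is sketched below: keep every y=1 row and a random sample of y=0 rows. The function `undersample_majority`, its `ratio` parameter and the toy frame are assumptions for illustration.

```python
import pandas as pd

def undersample_majority(frame, target, ratio=1.0, seed=0):
    """Keep all minority (1) rows and ratio * len(minority) random majority (0) rows."""
    minority = frame[frame[target] == 1]
    majority = frame[frame[target] == 0]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the classes are not stored in two contiguous blocks.
    combined = pd.concat([minority, sampled]).sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)

# Hypothetical training split with 10% positives.
toy = pd.DataFrame({"x": range(100), "y": [1] * 10 + [0] * 90})
balanced = undersample_majority(toy, "y", ratio=1.0)
print(balanced["y"].value_counts())
```

Only the training split should be resampled; the validation split keeps its original class balance so that the scores stay comparable.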

Logistic regression

predsTrain = logreg.predict(gTrainUrandom)

predsTrain = logreg.predict(gTrain20Urandom)

predsTrain = logreg.predict(gTrrandom)

Decision tree


print("Train AUC:", metrics.roc_auc_score(ygTrds))

Random forest

print("Train AUC:", metrics.roc_auc_score(ygTr, predsTrain),
      "Valid AUC:", metrics.roc_auc_score(ygVd, preds))

K-nearest neighbors (KNN)

print("Train AUC:", metrics.roc_auc_score(ygTrm, predsTrain),
      "Valid AUC:", metrics.roc_auc_score(ygVal10, preds))

Gaussian naive Bayes

print("Train AUC:", metrics.roc_auc_score(ygTraom, predsTrain),
      "Valid AUC:", metrics.roc_auc_score(ygid, preds))


We tried oversampling the minority class y=1.

features = data.columns.tolist()

(31945, 39)
smt = SMOT
(32345, 39)
smt = SMOT
(32595, 39)
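
The SMOTE calls above are truncated, so they are not reconstructed here. As a dependency-free stand-in, the sketch below uses plain random oversampling, which duplicates minority rows instead of interpolating new ones as SMOTE does. The function `random_oversample` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority (1) rows until both classes have the same size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra_idx])
    return X[keep], y[keep]

# Hypothetical training split: 100 rows, 3 features, 10 positives.
X_toy = np.arange(300, dtype=float).reshape(100, 3)
y_toy = np.array([1] * 10 + [0] * 90)

X_over, y_over = random_oversample(X_toy, y_toy)
print(X_over.shape, int(y_over.sum()))
```

As with undersampling, the resampling is applied to the training split only, which is why the row counts printed above grow while the number of columns stays at 39.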

Logistic regression

print("Train AUC:", metrics.roc_auc_score(ygTrin10SM, predsTrain),
      "Valid AUC:", metrics.roc_auc_score(ygValid, preds))

Decision tree

dt2.fit(gTrainOSM, ygTrainOSM)
predsTrain = dt2.predict(gTrainOSM)
preds = dt2.predict(gValid)

Random forest

random_forest.fit(gTrainOSM, ygTrainOSM)
predsTrain = random_forest.predict(gTrainOSM)

K-nearest neighbors (KNN)

classifier.fit(gTrainOSM, ygTrainOSM)
predsTrain = classifier.predict(gTrainOSM)
preds = classifier.predict(gValid)

Gaussian naive Bayes

gaussian.fit(gTrainOSM, ygTrainOSM)
predsTrain = gaussian.predict(gTrainOSM)


We see that undersampling and oversampling the variable y do not help the AUC much.



Posted by poujman on Thu, 14 Apr 2022 16:15:02 +0930