Machine Learning Algorithms: Classification Prediction Based on Logistic Regression

1.1 Introduction to logistic regression

Logistic regression (LR) is, despite its name, a classification model. Its two most prominent strengths are the simplicity of the model and its strong interpretability.

Advantages and disadvantages of logistic regression models:

  • Advantages: simple to implement and easy to understand; computationally cheap, fast, and light on storage resources;
  • Disadvantages: prone to underfitting, so classification accuracy may be low

1.2 Application of logistic regression

Logistic regression models are widely used in many fields, including machine learning, most medical fields, and the social sciences. For example, the trauma and injury severity score (TRISS), originally developed by Boyd et al., is widely used to predict the mortality of injured patients. Logistic regression is also used to predict the risk of developing a specific disease (e.g. diabetes, coronary heart disease), and to predict the probability of failure of a system or product in a given process. It appears in marketing applications, such as predicting a customer's propensity to buy a product or cancel an order. In economics it can be used to predict the likelihood that a person will enter the labor market, and in business to predict the likelihood that a homeowner will default on a mortgage. Conditional random fields, used in natural language processing, are an extension of logistic regression to sequential data.

The logistic regression model is also a basic component of many classification pipelines, such as credit card anti-fraud and click-through rate (CTR) estimation based on GBDT + LR. Its advantage is that the output naturally falls between 0 and 1 and has a probabilistic meaning; the model is clear and rests on a solid probabilistic foundation. The fitted parameters represent the influence of each feature on the result, which also makes logistic regression a great tool for understanding data. At the same time, because it is essentially a linear classifier, it cannot handle more complex data distributions. Logistic regression is therefore often used as the baseline for early attempts at a task.
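As a rough illustration of the GBDT + LR pattern mentioned above, here is a minimal sketch on synthetic data (the setup, names, and parameters are assumptions for illustration, not the original author's code): a gradient-boosted tree model is trained first, and the one-hot-encoded leaf indices it produces are used as features for a logistic regression.

## Minimal GBDT + LR sketch on synthetic data (illustrative assumptions throughout)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]       # leaf index of each sample in each tree
enc = OneHotEncoder()                 # one-hot encode leaf indices as LR features
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves), y)  # logistic regression on GBDT-derived features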

2 Learning Objectives

  • Understand the theory of logistic regression
  • Master the sklearn function calls for logistic regression and apply them to prediction on the iris dataset

3 Code flow

Part1 Demo practice
Step1: Library function import
Step2: Model training
Step3: View model parameters
Step4: Data and Model Visualization
Step5: Model prediction
Part2 Logistic regression classification practice based on iris data set
Step1: Library function import
Step2: Data read/load
Step3: Simple view of data information
Step4: Visual description
Step5: Use the logistic regression model to train and predict on the binary classification
Step6: Use the logistic regression model to train and predict on three classifications (multi-classification)

4 Algorithm practice

4.1 Demo practice

Step1: Library function import
## Basic function library
import numpy as np 
## Import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
## Import logistic regression model functions
from sklearn.linear_model import LogisticRegression
Step2: Model training
## Demo: LogisticRegression classification

## Construct the dataset
x_features = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])
## Instantiate the logistic regression model
lr_clf = LogisticRegression()
## Fit the constructed dataset with the logistic regression model
lr_clf = lr_clf.fit(x_features, y_label) # The fitted linear part is z = w0 + w1*x1 + w2*x2
Step3: View model parameters
## View the weights (w1, w2) of the fitted model
print('the weight of Logistic Regression:',lr_clf.coef_)

## View the intercept (w0) of the fitted model
print('the intercept(w0) of Logistic Regression:',lr_clf.intercept_)
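To connect the fitted parameters back to the equation z = w0 + w1*x1 + w2*x2, here is a small check (an illustrative aside, not part of the original demo) that recomputes the decision score by hand and compares it with sklearn's decision_function:

## Recompute z = w0 + w1*x1 + w2*x2 by hand and compare with decision_function
z_manual = x_features @ lr_clf.coef_[0] + lr_clf.intercept_[0]
print(np.allclose(z_manual, lr_clf.decision_function(x_features)))  # expected: True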
Step4: Data and Model Visualization
## Visualize the constructed data points
plt.figure()
plt.scatter(x_features[:,0],x_features[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()

# Visualize decision boundaries
plt.figure()
plt.scatter(x_features[:,0],x_features[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))

z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
z_proba = z_proba[:, 1].reshape(x_grid.shape)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

## Visualize predictions for new samples

plt.figure()
## new point 1
x_features_new1 = np.array([[0, -1]])
plt.scatter(x_features_new1[:,0],x_features_new1[:,1], s=50)
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## new point 2
x_features_new2 = np.array([[1, 2]])
plt.scatter(x_features_new2[:,0],x_features_new2[:,1], s=50)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## Training samples
plt.scatter(x_features[:,0],x_features[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

# Visualize decision boundaries
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

Step5: Model prediction
## Use the trained model to predict the two new points
y_label_new1_predict = lr_clf.predict(x_features_new1)
y_label_new2_predict = lr_clf.predict(x_features_new2)

print('The New point 1 predict class:\n',y_label_new1_predict)
print('The New point 2 predict class:\n',y_label_new2_predict)

## Since logistic regression is a probability prediction model (p = p(y=1|x,\theta) introduced earlier), we can use the predict_proba function to predict the class probabilities
y_label_new1_predict_proba = lr_clf.predict_proba(x_features_new1)
y_label_new2_predict_proba = lr_clf.predict_proba(x_features_new2)

print('The New point 1 predict Probability of each class:\n',y_label_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n',y_label_new2_predict_proba)
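As a quick sanity check (an added aside, not in the original demo), predict is equivalent to thresholding the class-1 probability at 0.5:

## Thresholding P(y=1|x) at 0.5 reproduces the predicted class labels
print((y_label_new1_predict_proba[:, 1] > 0.5).astype(int))  # should match y_label_new1_predict
print((y_label_new2_predict_proba[:, 1] > 0.5).astype(int))  # should match y_label_new2_predict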

4.2 Logistic regression classification practice based on the iris dataset

At the very beginning of the practice, we first need to import some basic libraries: numpy (the fundamental package for scientific computing in Python), pandas (a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool), and matplotlib and seaborn for plotting.

Step1: Library function import
##  Basic function library
import numpy as np 
import pandas as pd

## Drawing function library
import matplotlib.pyplot as plt
import seaborn as sns

This time we choose the iris dataset to try out the method. The dataset contains 5 variables in total, 4 feature variables and 1 target categorical variable, with 150 samples. The target variable is the flower species, all belonging to three species of the genus Iris: iris setosa, iris versicolor and iris virginica. Four features of the three irises, sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm), have historically been used to identify the species.

variable        description
sepal length    sepal length (cm)
sepal width     sepal width (cm)
petal length    petal length (cm)
petal width     petal width (cm)
target          iris species: 'setosa' (0), 'versicolor' (1), 'virginica' (2)
Step2: Data read/load
## We use the iris data that comes with sklearn as data loading, and use Pandas to convert to DataFrame format
from sklearn.datasets import load_iris
data = load_iris() # load the iris dataset
iris_target = data.target # get the labels of the data
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names) #Convert to DataFrame format using Pandas
Step3: Simple view of data information
## Use .info() to view the overall information of the data
iris_features.info()
## For a quick look at the data, we can use .head() for the first rows and .tail() for the last rows
iris_features.head()
iris_features.tail()
## The corresponding category labels, where 0, 1, and 2 represent the flower categories 'setosa', 'versicolor', and 'virginica' respectively
iris_target
## Use the value_counts function to view the number of each category
pd.Series(iris_target).value_counts()
## Do some statistical description of the features
iris_features.describe()
        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count          150.000000        150.000000        150.000000        150.000000
mean             5.843333          3.057333          3.758000          1.199333
std              0.828066          0.435866          1.765298          0.762238
min              4.300000          2.000000          1.000000          0.100000
25%              5.100000          2.800000          1.600000          0.300000
50%              5.800000          3.000000          4.350000          1.300000
75%              6.400000          3.300000          5.100000          1.800000
max              7.900000          4.400000          6.900000          2.500000

From the statistical description we can see the range of variation of the different numerical features.
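The statistics also show that the four features are on different scales. Logistic regression can be sensitive to feature scale (mainly through regularization and solver convergence), so a standardization step is sometimes added. A minimal sketch with sklearn's StandardScaler, offered as an optional aside rather than part of the original flow:

## Optional: standardize each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
iris_features_scaled = scaler.fit_transform(iris_features)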

Step4: Visual description
## Merge labels and feature information
iris_all = iris_features.copy() ## Copy the features to avoid modifying the original data
iris_all['target'] = iris_target
## Scatter visualization of feature and label combinations
sns.pairplot(data=iris_all,diag_kind='hist', hue= 'target')
plt.show()

for col in iris_features.columns:
    sns.boxplot(x='target', y=col, saturation=0.5,palette='pastel', data=iris_all)
    plt.title(col)
    plt.show()

# Select its first three features to draw a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')

iris_all_class0 = iris_all[iris_all['target']==0].values
iris_all_class1 = iris_all[iris_all['target']==1].values
iris_all_class2 = iris_all[iris_all['target']==2].values
# 'setosa'(0), 'versicolor'(1), 'virginica'(2)
ax.scatter(iris_all_class0[:,0], iris_all_class0[:,1], iris_all_class0[:,2],label='setosa')
ax.scatter(iris_all_class1[:,0], iris_all_class1[:,1], iris_all_class1[:,2],label='versicolor')
ax.scatter(iris_all_class2[:,0], iris_all_class2[:,1], iris_all_class2[:,2],label='virginica')
plt.legend()

plt.show()

Step5: Use the logistic regression model to train and predict on the binary classification
# To evaluate model performance properly, we split the data into a training set and a test set: the model is trained on the training set and its performance is verified on the test set.
from sklearn.model_selection import train_test_split

## Select samples whose classes are 0 and 1 (excluding samples with class 2)
iris_features_part = iris_features.iloc[:100]
iris_target_part = iris_target[:100]

## Split the data into 80% training set and 20% test set
x_train, x_test, y_train, y_test = train_test_split(iris_features_part, iris_target_part, test_size = 0.2, random_state = 2020)
## Import logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression
## Define Logistic Regression Model 
clf = LogisticRegression(random_state=0, solver='lbfgs')
# Train a logistic regression model on the training set
clf.fit(x_train, y_train)
## View its corresponding w
print('the weight of Logistic Regression:',clf.coef_)

## View its corresponding w0
print('the intercept(w0) of Logistic Regression:',clf.intercept_)
## Use the trained model to predict on the training and test sets respectively
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics

## Evaluate the model using accuracy: the proportion of correctly predicted samples among all predictions
print('The accuracy of the Logistic Regression on the train set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression on the test set is:',metrics.accuracy_score(y_test,test_predict))

## View the confusion matrix (counts of each combination of true and predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict) # sklearn expects (y_true, y_pred)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize results with heatmaps
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
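Beyond accuracy and the confusion matrix, per-class precision, recall, and F1 can be printed in one call; this is an added illustration rather than part of the original walkthrough:

## Per-class precision, recall and F1 on the test set
print(metrics.classification_report(y_test, test_predict))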

Step6: Use the logistic regression model to train and predict on three classifications (multi-classification)
## Split the data into 80% training set and 20% test set
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target, test_size = 0.2, random_state = 2020)
## Define Logistic Regression Model 
clf = LogisticRegression(random_state=0, solver='lbfgs')
# Train a logistic regression model on the training set
clf.fit(x_train, y_train)
## View its corresponding w
print('the weight of Logistic Regression:\n',clf.coef_)

## View its corresponding w0
print('the intercept(w0) of Logistic Regression:\n',clf.intercept_)

## Since this is a 3-class problem, we obtain three sets of logistic regression parameters (one per class), and the three logistic regressions combine to achieve the three-way classification.
## Use the trained model to predict on the training and test sets respectively
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Since the logistic regression model is a probability prediction model (p = p(y=1|x,\theta) introduced earlier), we can use the predict_proba function to predict the class probabilities
train_predict_proba = clf.predict_proba(x_train)
test_predict_proba = clf.predict_proba(x_test)

print('The test predict Probability of each class:\n',test_predict_proba)
## The first column represents the probability of predicting class 0, the second column represents the probability of predicting class 1, and the third column represents the probability of predicting class 2.
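As a hedged check (this assumes the multinomial/softmax formulation, which recent scikit-learn versions use by default with the lbfgs solver), predict_proba should equal the softmax of the decision scores:

## Softmax of the decision scores should reproduce predict_proba (multinomial assumption)
scores = clf.decision_function(x_test)
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)
print(np.allclose(manual_proba, clf.predict_proba(x_test)))  # expected: True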

## Evaluate the model using accuracy: the proportion of correctly predicted samples among all predictions
print('The accuracy of the Logistic Regression on the train set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression on the test set is:',metrics.accuracy_score(y_test,test_predict))

## View confusion matrix
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict) # sklearn expects (y_true, y_pred)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize results with heatmaps
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

5 Important knowledge points

Introduction to the principle of logistic regression:

Although logistic regression has "regression" in its name, it is actually a classification method, mainly used for binary classification problems (i.e. there are only two outputs, representing two categories). It uses the logistic function (also called the sigmoid function), whose form is:

$\mathrm{logi}(z) = \dfrac{1}{1 + e^{-z}}$

Its corresponding function image can be represented as follows:

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5,5,0.01)
y = 1/(1+np.exp(-x))

plt.plot(x,y)
plt.xlabel('z')
plt.ylabel('y')
plt.grid()
plt.show()

From the figure above we can see that the logistic function is monotonically increasing, takes the value 0.5 at $z=0$, and that the range of $\mathrm{logi}(\cdot)$ is $(0,1)$.

The basic equation of linear regression is $z = w_0 + \sum_{i=1}^{N} w_i x_i$.

Substituting this regression equation into the logistic function gives:

$p = p(y=1 \mid x,\theta) = h_\theta(x,\theta) = \dfrac{1}{1 + e^{-(w_0 + \sum_{i=1}^{N} w_i x_i)}}$

So $p(y=1 \mid x,\theta) = h_\theta(x,\theta)$ and $p(y=0 \mid x,\theta) = 1 - h_\theta(x,\theta)$.

In terms of its principle, logistic regression actually realizes a decision boundary: for the function $p = \frac{1}{1+e^{-z}}$, when $z \ge 0$ we have $p \ge 0.5$ and the sample is classified as 1; when $z < 0$ we have $p < 0.5$ and it is classified as 0. The value $p$ can also be interpreted as the predicted probability of class 1.

For the training of the model: in essence, we use the data to solve for the specific $w$ of the model, thus obtaining a logistic regression model fitted to the current data.
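As a minimal sketch of what solving for $w$ can look like, here is plain gradient descent on the average log-loss (an illustration only; sklearn's lbfgs solver is more sophisticated, and the L2 regularization it applies by default is omitted here):

import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column so w[0] acts as w0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))  # predicted P(y=1|x)
        w -= lr * Xb.T @ (p - y) / len(y)  # gradient step on the average log-loss
    return w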

For multi-classification, the problem can be solved by combining multiple binary logistic regressions, as sketched below.
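One common way to do this is sklearn's OneVsRestClassifier, which fits one binary logistic regression per class; a brief sketch (illustrative, not from the original text):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

## Fit one binary logistic regression per class and combine them
X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 3 binary classifiers, one per class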
