### 1.1 Introduction to logistic regression

Despite its name, logistic regression (LR) is a classification model. Its two most prominent strengths are its simplicity and its strong interpretability.

Advantages and disadvantages of logistic regression models:

- Advantages: simple to implement and easy to understand; low computational cost, fast, and light on storage;
- Disadvantages: prone to underfitting, so classification accuracy may not be high.

### 1.2 Application of logistic regression

Logistic regression models are widely used in many fields, including machine learning, most medical fields, and the social sciences. For example, the Trauma and Injury Severity Score (TRISS), originally developed by Boyd et al., is widely used to predict mortality in injured patients. Logistic regression is also used to predict the risk of developing a specific disease (e.g. diabetes or coronary heart disease), and to predict the probability that a system or product fails in a given process. In marketing, it is used to predict a customer's propensity to buy a product or cancel an order. In economics, it can be used to predict the likelihood that a person chooses to enter the labor market, and in business, the likelihood that a homeowner defaults on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.

Logistic regression is also a basic building block of many classification pipelines, for example credit card fraud detection and click-through rate (CTR) estimation with GBDT + LR. Its advantages are that the output naturally falls between 0 and 1 and has a probabilistic interpretation, and that the model is transparent and rests on a clear probabilistic foundation. The fitted parameters indicate the influence of each feature on the result, which also makes the model a useful tool for understanding data. At the same time, because it is essentially a linear classifier, it cannot handle data with more complex (non-linear) structure. Logistic regression is also often used as a baseline when first attempting a task.

### 2 Learning Objectives

Understand the theory of logistic regression

Learn to use the logistic regression implementation in sklearn and apply it to prediction on the iris dataset

### 3 Code flow

Part1 Demo practice

Step1: Library function import

Step2: Model training

Step3: View model parameters

Step4: Data and Model Visualization

Step5: Model prediction

Part2 Logistic regression classification practice based on iris data set

Step1: Library function import

Step2: Data read/load

Step3: Simple view of data information

Step4: Visual description

Step5: Use the logistic regression model to train and predict on the binary classification

Step6: Use the logistic regression model to train and predict on three classes (multi-class classification)

### 4 Hands-on practice

#### 4.1 Demo practice

##### Step1: Library function import

    # Basic numerical library
    import numpy as np

    # Plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Logistic regression model
    from sklearn.linear_model import LogisticRegression

##### Step2: Model training

    # Demo: classification with LogisticRegression
    # Construct the dataset
    x_fearures = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
    y_label = np.array([0, 0, 0, 1, 1, 1])

    # Create the logistic regression model
    lr_clf = LogisticRegression()

    # Fit the model to the constructed dataset.
    # The fitted linear score is z = w0 + w1*x1 + w2*x2; the predicted probability is the logistic function of z.
    lr_clf = lr_clf.fit(x_fearures, y_label)

##### Step3: View model parameters

    # View the weights w of the fitted model
    print('the weight of Logistic Regression:', lr_clf.coef_)

    # View the intercept w0 of the fitted model
    print('the intercept(w0) of Logistic Regression:', lr_clf.intercept_)
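To see what these parameters mean, one can recompute the predicted probability by hand from `coef_` and `intercept_` and compare it with `predict_proba`. The following is a minimal, self-contained sketch that refits the same toy data; the names `X`, `y`, `manual_proba`, and `sklearn_proba` are illustrative and not part of the tutorial's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the demo (renamed X, y here for brevity)
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# For a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,)
w = clf.coef_[0]
w0 = clf.intercept_[0]

# Linear score z = w0 + w1*x1 + w2*x2, passed through the logistic function
z = w0 + X @ w
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns [P(y=0), P(y=1)] per row; column 1 should match the manual value
sklearn_proba = clf.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

If the two arrays agree, the fitted weights and intercept are exactly the `w` and `w0` of the logistic model described in this tutorial.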

##### Step4: Data and Model Visualization

    # Visualize the constructed sample points
    plt.figure()
    plt.scatter(x_fearures[:, 0], x_fearures[:, 1], c=y_label, s=50, cmap='viridis')
    plt.title('Dataset')
    plt.show()

    # Visualize the decision boundary
    plt.figure()
    plt.scatter(x_fearures[:, 0], x_fearures[:, 1], c=y_label, s=50, cmap='viridis')
    plt.title('Dataset')

    # Evaluate the predicted probability of class 1 on a grid covering the plot area
    nx, ny = 200, 100
    x_min, x_max = plt.xlim()
    y_min, y_max = plt.ylim()
    x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx), np.linspace(y_min, y_max, ny))

    z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
    z_proba = z_proba[:, 1].reshape(x_grid.shape)

    # The decision boundary is the contour where the probability equals 0.5
    plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')
    plt.show()
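Because the model is linear, the 0.5 contour is a straight line that can also be written down directly: the boundary is where `w0 + w1*x1 + w2*x2 = 0`. The sketch below is one possible way to check this, refitting the same toy data; the helper `boundary_x2` is a hypothetical name introduced here, and it assumes the fitted `w2` is non-zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the demo
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

w1, w2 = clf.coef_[0]
w0 = clf.intercept_[0]

def boundary_x2(x1):
    """x2 coordinate of the decision boundary for a given x1 (assumes w2 != 0)."""
    return -(w0 + w1 * x1) / w2

# Points on the line w0 + w1*x1 + w2*x2 = 0 should get probability 0.5
x1_values = np.linspace(-3, 3, 5)
boundary_points = np.column_stack([x1_values, boundary_x2(x1_values)])
boundary_proba = clf.predict_proba(boundary_points)[:, 1]
print(boundary_proba)
```

Plotting `boundary_x2` over the x-axis range would reproduce the line that the contour call draws from the probability grid.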

    # Visualize the new samples to be predicted
    plt.figure()

    # New point 1
    x_fearures_new1 = np.array([[0, -1]])
    plt.scatter(x_fearures_new1[:, 0], x_fearures_new1[:, 1], s=50)
    # The annotation text is passed positionally (the keyword was renamed from `s` to `text` in newer Matplotlib)
    plt.annotate('New point 1', xy=(0, -1), xytext=(-2, 0), color='blue',
                 arrowprops=dict(arrowstyle='-|>', connectionstyle='arc3', color='red'))

    # New point 2
    x_fearures_new2 = np.array([[1, 2]])
    plt.scatter(x_fearures_new2[:, 0], x_fearures_new2[:, 1], s=50)
    plt.annotate('New point 2', xy=(1, 2), xytext=(-1.5, 2.5), color='red',
                 arrowprops=dict(arrowstyle='-|>', connectionstyle='arc3', color='red'))

    # Training samples
    plt.scatter(x_fearures[:, 0], x_fearures[:, 1], c=y_label, s=50, cmap='viridis')
    plt.title('Dataset')

    # Decision boundary
    plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')
    plt.show()

##### Step5: Model prediction

    # Use the trained model to predict the class of each new point
    y_label_new1_predict = lr_clf.predict(x_fearures_new1)
    y_label_new2_predict = lr_clf.predict(x_fearures_new2)

    print('The New point 1 predict class:\n', y_label_new1_predict)
    print('The New point 2 predict class:\n', y_label_new2_predict)

    # Logistic regression is a probabilistic model (p = p(y=1|x,\theta)),
    # so predict_proba can be used to obtain the predicted probability of each class
    y_label_new1_predict_proba = lr_clf.predict_proba(x_fearures_new1)
    y_label_new2_predict_proba = lr_clf.predict_proba(x_fearures_new2)

    print('The New point 1 predict Probability of each class:\n', y_label_new1_predict_proba)
    print('The New point 2 predict Probability of each class:\n', y_label_new2_predict_proba)
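The relationship between `predict` and `predict_proba` can be made explicit: the predicted class is the one with the largest probability. The following sketch, which refits the same toy data so that it runs on its own, illustrates this; `labels_from_proba` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data and new points as in the demo
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

new_points = np.array([[0, -1], [1, 2]])

# One row per sample, one column per class; columns follow the order of clf.classes_
proba = clf.predict_proba(new_points)

# Picking the most probable class reproduces what predict() returns
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
labels = clf.predict(new_points)

print(labels, labels_from_proba)
print(proba.sum(axis=1))  # each row of probabilities sums to 1
```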

#### 4.2 Logistic regression classification practice based on the iris dataset

Before starting, we import some basic libraries: numpy (the fundamental package for scientific computing in Python), pandas (a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool), and matplotlib and seaborn for plotting.

##### Step1: Library function import

    # Basic libraries
    import numpy as np
    import pandas as pd

    # Plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

Here we use the iris dataset to try out the method. The dataset contains 5 variables: 4 feature variables and 1 categorical target variable. There are 150 samples in total. The target variable is the flower's species, one of three species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. The four features, sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm), have been used to identify the species.

variable | description |
---|---|
sepal length | Sepal length (cm) |
sepal width | Sepal width (cm) |
petal length | Petal length (cm) |
petal width | Petal width (cm) |
target | Iris species: 'setosa' (0), 'versicolor' (1), 'virginica' (2) |

##### Step2: Data read/load

    # Load the iris data that ships with sklearn and convert the features to a pandas DataFrame
    from sklearn.datasets import load_iris

    data = load_iris()  # load the dataset
    iris_target = data.target  # label corresponding to each sample
    iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)  # features as a DataFrame
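The numeric labels in `data.target` can be mapped back to species names through `data.target_names`, which is often easier to read when inspecting the data. A small self-contained sketch follows; the name `iris_species` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)

# target_names is an array of species names indexed by the numeric label (0, 1, 2)
iris_species = pd.Series(data.target_names[data.target], name='species')

print(iris_features.shape)
print(iris_species.value_counts())
```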

##### Step3: Simple view of data information

    # Use .info() to view the overall information of the data
    iris_features.info()

    # For a quick look at the data, use .head() for the first rows and .tail() for the last rows
    iris_features.head()

    iris_features.tail()

    # The category labels: 0, 1, and 2 stand for 'setosa', 'versicolor', and 'virginica' respectively
    iris_target

    # Use value_counts to see the number of samples in each category
    pd.Series(iris_target).value_counts()

    # Summary statistics of the features
    iris_features.describe()

statistic | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |

From the statistical description we can see the range of variation of each numerical feature.

##### Step4: Visual description

    # Merge the labels and the feature information
    iris_all = iris_features.copy()  # work on a copy so the original data is not modified
    iris_all['target'] = iris_target

    # Pairwise scatter plots of the features, colored by label
    sns.pairplot(data=iris_all, diag_kind='hist', hue='target')
    plt.show()

    # Box plot of each feature, grouped by label
    for col in iris_features.columns:
        sns.boxplot(x='target', y=col, saturation=0.5, palette='pastel', data=iris_all)
        plt.title(col)
        plt.show()

    # Draw a 3D scatter plot of the first three features
    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    iris_all_class0 = iris_all[iris_all['target'] == 0].values
    iris_all_class1 = iris_all[iris_all['target'] == 1].values
    iris_all_class2 = iris_all[iris_all['target'] == 2].values

    # 'setosa'(0), 'versicolor'(1), 'virginica'(2)
    ax.scatter(iris_all_class0[:, 0], iris_all_class0[:, 1], iris_all_class0[:, 2], label='setosa')
    ax.scatter(iris_all_class1[:, 0], iris_all_class1[:, 1], iris_all_class1[:, 2], label='versicolor')
    ax.scatter(iris_all_class2[:, 0], iris_all_class2[:, 1], iris_all_class2[:, 2], label='virginica')
    plt.legend()
    plt.show()

##### Step5: Use the logistic regression model to train and predict on the binary classification

    # To evaluate the model properly, split the data into a training set and a test set:
    # train on the training set and check performance on the test set.
    from sklearn.model_selection import train_test_split

    # Select the samples of classes 0 and 1 (the first 100 rows; class 2 is excluded)
    iris_features_part = iris_features.iloc[:100]
    iris_target_part = iris_target[:100]

    # 80% / 20% train/test split
    x_train, x_test, y_train, y_test = train_test_split(
        iris_features_part, iris_target_part, test_size=0.2, random_state=2020)

    # Import the logistic regression model from sklearn
    from sklearn.linear_model import LogisticRegression

    # Define the logistic regression model
    clf = LogisticRegression(random_state=0, solver='lbfgs')

    # Train the model on the training set
    clf.fit(x_train, y_train)

    # View the fitted weights w
    print('the weight of Logistic Regression:', clf.coef_)

    # View the fitted intercept w0
    print('the intercept(w0) of Logistic Regression:', clf.intercept_)

    # Use the trained model to predict on the training set and on the test set
    train_predict = clf.predict(x_train)
    test_predict = clf.predict(x_test)

    from sklearn import metrics

    # Evaluate with accuracy: the ratio of correctly predicted samples to all predicted samples
    print('The accuracy of the Logistic Regression on the training set is:', metrics.accuracy_score(y_train, train_predict))
    print('The accuracy of the Logistic Regression on the test set is:', metrics.accuracy_score(y_test, test_predict))

    # Confusion matrix: counts for each combination of true and predicted class.
    # confusion_matrix expects (y_true, y_pred), which puts true classes on rows and predictions on columns.
    confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
    print('The confusion matrix result:\n', confusion_matrix_result)

    # Visualize the result as a heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.show()
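When reading a confusion matrix it helps to know its orientation: scikit-learn's `confusion_matrix` takes the true labels as its first argument and the predictions as its second, and entry `[i, j]` counts samples of true class `i` predicted as class `j`. The sketch below uses small hypothetical label arrays (not the iris results) to show the layout and how accuracy can be read off the diagonal.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, chosen only to illustrate the layout of the matrix
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, metrics.accuracy_score(y_true, y_pred))
```

With this orientation, a heatmap of the matrix should label the y-axis as true labels and the x-axis as predicted labels.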

##### Step6: Use the logistic regression model to train and predict on three classifications (multi-classification)

    # 80% / 20% train/test split on all three classes
    x_train, x_test, y_train, y_test = train_test_split(
        iris_features, iris_target, test_size=0.2, random_state=2020)

    # Define the logistic regression model
    clf = LogisticRegression(random_state=0, solver='lbfgs')

    # Train the model on the training set
    clf.fit(x_train, y_train)

    # View the fitted weights w
    print('the weight of Logistic Regression:\n', clf.coef_)

    # View the fitted intercepts w0
    print('the intercept(w0) of Logistic Regression:\n', clf.intercept_)

    # Since there are 3 classes, coef_ has three rows and intercept_ has three entries:
    # one set of parameters per class, which are combined to make the three-class prediction.
    # (Recent scikit-learn versions fit these jointly as a multinomial model with the lbfgs solver,
    # rather than as three separate binary models.)
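How the per-class parameters are combined depends on the scikit-learn version and solver. In recent versions (0.22 and later, to the best of my knowledge) the default for the `lbfgs` solver on a multi-class problem is a multinomial model, where the class probabilities are the softmax of the per-class linear scores. The following sketch checks that relationship on the full iris data; `max_iter=1000` is an assumption added here only to avoid convergence warnings, and the check would not hold under a one-vs-rest scheme.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of weights and one intercept per class
print(clf.coef_.shape, clf.intercept_.shape)

# Per-class linear scores w0_k + w_k . x for the first five samples
scores = clf.decision_function(X[:5])

# Numerically stable softmax over the class axis
shifted = scores - scores.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Under the multinomial formulation this matches predict_proba
matches_softmax = np.allclose(softmax, clf.predict_proba(X[:5]))
print(matches_softmax)
```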

    # Use the trained model to predict on the training set and on the test set
    train_predict = clf.predict(x_train)
    test_predict = clf.predict(x_test)

    # Logistic regression is a probabilistic model, so predict_proba returns the predicted class probabilities
    train_predict_proba = clf.predict_proba(x_train)
    test_predict_proba = clf.predict_proba(x_test)

    print('The test predict Probability of each class:\n', test_predict_proba)
    # The first, second, and third columns hold the predicted probabilities of classes 0, 1, and 2 respectively.

    # Evaluate with accuracy: the ratio of correctly predicted samples to all predicted samples
    print('The accuracy of the Logistic Regression on the training set is:', metrics.accuracy_score(y_train, train_predict))
    print('The accuracy of the Logistic Regression on the test set is:', metrics.accuracy_score(y_test, test_predict))

    # View the confusion matrix.
    # confusion_matrix expects (y_true, y_pred), which puts true classes on rows and predictions on columns.
    confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
    print('The confusion matrix result:\n', confusion_matrix_result)

    # Visualize the result as a heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.show()

### 5 Key concepts

Introduction to the principle of logistic regression:

Although logistic regression has "regression" in its name, it is actually a classification method, mainly used for two-class problems (that is, there are only two possible outputs, one per class). It is built on the logistic function (also called the sigmoid function), which has the form:

$$\operatorname{logi}(z) = \frac{1}{1 + e^{-z}}$$

Its corresponding function image can be represented as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    # Plot the logistic function on [-5, 5)
    x = np.arange(-5, 5, 0.01)
    y = 1 / (1 + np.exp(-x))

    plt.plot(x, y)
    plt.xlabel('z')
    plt.ylabel('y')
    plt.grid()
    plt.show()

From the figure above, we can see that the logistic function is monotonically increasing, takes the value 0.5 at z = 0, and has range (0, 1).
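These properties can also be checked numerically rather than read off the plot. The sketch below is a small NumPy check under the assumption that a sample grid on [-5, 5] is representative; the helper name `logistic` is introduced here for illustration.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)
p = logistic(z)

value_at_zero = logistic(0.0)                             # value at z = 0
is_increasing = bool(np.all(np.diff(p) > 0))              # strictly increasing on the grid
within_unit_interval = bool(p.min() > 0 and p.max() < 1)  # values stay inside (0, 1)
is_symmetric = bool(np.allclose(logistic(-z), 1 - p))     # logi(-z) = 1 - logi(z)

print(value_at_zero, is_increasing, within_unit_interval, is_symmetric)
```

The last check, `logi(-z) = 1 - logi(z)`, is what makes the two class probabilities of a binary model sum to one.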

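The basic regression equation is

$$z = w_0 + \sum_{i=1}^{N} w_i x_i$$

Substituting the regression equation into the logistic function gives:

$$p = p(y=1 \mid x, \theta) = h_\theta(x, \theta) = \frac{1}{1 + e^{-\left(w_0 + \sum_{i=1}^{N} w_i x_i\right)}}$$

So,

$$p(y=1 \mid x, \theta) = h_\theta(x, \theta), \qquad p(y=0 \mid x, \theta) = 1 - h_\theta(x, \theta)$$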

In principle, logistic regression implements a decision boundary: for the function $y = \frac{1}{1 + e^{-z}}$, when $z \geq 0$ we have $y \geq 0.5$ and the sample is classified as 1; when $z < 0$ we have $y < 0.5$ and the sample is classified as 0. The value $y$ can be interpreted as the predicted probability of class 1.
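The equivalence between thresholding the linear score at 0 and thresholding the probability at 0.5 can be seen in a few lines of NumPy. The weights and sample points below are hypothetical values chosen for illustration, not parameters fitted anywhere in this tutorial.

```python
import numpy as np

# Hypothetical parameters w0, w and sample points, for illustration only
w0 = -1.0
w = np.array([2.0, 0.5])
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [0.0, 4.0]])

# Linear score z = w0 + w1*x1 + w2*x2 and the corresponding probability of class 1
z = w0 + X @ w
p = 1.0 / (1.0 + np.exp(-z))

# Thresholding z at 0 and thresholding p at 0.5 give the same labels
labels_from_z = (z >= 0).astype(int)
labels_from_p = (p >= 0.5).astype(int)

print(z, p)
print(labels_from_z, labels_from_p)
```

The second point lies exactly on the boundary (z = 0, p = 0.5) and is assigned to class 1 under the "greater than or equal" convention used above.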

Training the model essentially means using the data to solve for the specific parameters $w$ of the model, which yields a logistic regression model fitted to the features of the current data.
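One common way to solve for the parameters is to minimize the average log-loss by gradient descent. The sketch below does this in plain NumPy on the toy data from the demo. It is an illustration under stated assumptions, not what scikit-learn does internally: scikit-learn uses a different optimizer and adds L2 regularization by default, so the numbers will differ. The learning rate and the number of steps are arbitrary choices made here.

```python
import numpy as np

# Toy data from the demo section
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w0, w):
    """Average negative log-likelihood of the labels under the model."""
    p = logistic(w0 + X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Start from zero parameters
w0, w = 0.0, np.zeros(X.shape[1])
learning_rate = 0.1  # assumed value, not tuned
initial_loss = log_loss(w0, w)

for _ in range(200):
    p = logistic(w0 + X @ w)
    error = p - y                                    # gradient of the loss w.r.t. the score z
    w0 -= learning_rate * error.mean()               # gradient step for the intercept
    w -= learning_rate * (X.T @ error) / len(y)      # gradient step for the weights

final_loss = log_loss(w0, w)
train_predictions = (logistic(w0 + X @ w) >= 0.5).astype(int)

print(initial_loss, final_loss)
print(w0, w, train_predictions)
```

Because this toy data is linearly separable and no regularization is applied, the weights would keep growing with more steps; the fixed step count simply stops the loop early.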

For multi-class problems, classification can be achieved by combining several binary logistic regressions.
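One way to combine binary models is the one-vs-rest scheme: fit one binary logistic regression per class ("this class" against "all others") and predict the class whose model gives the highest probability. The sketch below builds this by hand on the iris data as an illustration; the names `binary_models`, `class_scores`, and `ovr_predictions` are introduced here, `max_iter=1000` is an assumption added to avoid convergence warnings, and accuracy is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary model per class: label 1 for "this class", 0 for "any other class"
binary_models = [
    LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
    for c in classes
]

# Column k holds model k's estimated probability that a sample belongs to class k
class_scores = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Predict the class whose binary model is most confident
ovr_predictions = classes[np.argmax(class_scores, axis=1)]
ovr_accuracy = np.mean(ovr_predictions == y)
print(ovr_accuracy)
```

scikit-learn also provides `sklearn.multiclass.OneVsRestClassifier`, which wraps the same idea around any binary estimator.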