[DW team learning - hands-on data analysis] Chapter 3: Model building and evaluation - evaluation

Chapter III Model Building and Evaluation - Evaluation

In the previous section we used the sklearn library to split the data set and build models. But how do we know whether a model is useful, and can we trust its results? This section on model evaluation addresses those questions.

Load the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display the minus sign normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set Output Picture Size

Task: Load data and split test set and training set

#Write Code
X = pd.read_csv("clear_data.csv")
X.head()
   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0            0       3  22.0      1      0   7.2500           0         1           0           0           1
1            1       1  38.0      1      0  71.2833           1         0           1           0           0
2            2       3  26.0      0      0   7.9250           1         0           0           0           1
3            3       1  35.0      1      0  53.1000           1         0           0           0           1
4            4       3  35.0      0      0   8.0500           0         1           0           0           1
y = pd.read_csv("train.csv")["Survived"]
y.head()
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
from sklearn.model_selection import train_test_split
#Write Code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 0)
X_train.info(),y_train
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 671 to 80
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  668 non-null    int64  
 1   Pclass       668 non-null    int64  
 2   Age          668 non-null    float64
 3   SibSp        668 non-null    int64  
 4   Parch        668 non-null    int64  
 5   Fare         668 non-null    float64
 6   Sex_female   668 non-null    int64  
 7   Sex_male     668 non-null    int64  
 8   Embarked_C   668 non-null    int64  
 9   Embarked_Q   668 non-null    int64  
 10  Embarked_S   668 non-null    int64  
dtypes: float64(2), int64(9)
memory usage: 82.6 KB





(None,
 671    0
 417    1
 634    0
 323    1
 379    0
       ..
 131    0
 490    0
 528    0
 48     0
 80     0
 Name: Survived, Length: 668, dtype: int64)
X_train.dtypes,y_train.dtypes
(PassengerId      int64
 Pclass           int64
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 Sex_female       int64
 Sex_male         int64
 Embarked_C       int64
 Embarked_Q       int64
 Embarked_S       int64
 dtype: object,
 dtype('int64'))
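
The stratify=y argument used in the split keeps the share of each class about the same in the training and test subsets. Here is a minimal sketch on made-up labels (X_demo and y_demo are hypothetical toy arrays, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 40% of them labelled positive
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# The default test_size is 0.25, so 75 samples go to training and 25 to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(len(y_tr), len(y_te))      # sizes of the two subsets
print(y_tr.mean(), y_te.mean())  # share of positive labels in each subset
```

Without stratify, the positive share in a small test set can drift away from the overall share, which makes scores on it harder to interpret.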
# In this environment, LogisticRegression() with its default lbfgs solver failed inside fit():
#     AttributeError: 'str' object has no attribute 'decode'
# The traceback ends in sklearn\utils\optimize.py (_check_optimize_result), on the
# line that formats the solver's convergence warning. This is probably a version
# mismatch between an older scikit-learn and a newer SciPy, so upgrading
# scikit-learn is the likely fix. As a workaround, the liblinear solver is used
# here because it does not go through that code path.
lr = LogisticRegression(solver="liblinear")
lr.fit(X_train, y_train)
#Write Code
# Random forest classifier with 100 trees and a maximum depth of 5
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
#Write Code
print("Training set score:{:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score:{:.2f}".format(rfc.score(X_test, y_test)))

Training set score:0.86
Testing set score:0.81

Model evaluation

  • Model evaluation estimates the generalization ability of a model.
  • Cross validation is a statistical method for evaluating generalization performance; it is more stable and thorough than a single split into training set and test set.
  • In cross validation, the data is split several times and a separate model is trained for each split.
  • The most commonly used form is k-fold cross validation, where k is chosen by the user, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive.
  • Recall measures how many of the actual positive samples are predicted as positive.
  • The f-score is the harmonic mean of precision and recall.

[Thinking]: To consolidate the concepts above, summarize them in your own words

Task 1: Cross validation

  • Use 10 fold cross validation to evaluate the previous logistic regression model
  • Calculate the average value of cross validation accuracy
#Tip: Cross validation
Image('Snipaste_2020-01-05_16-37-56.png')

Tip 4

  • Cross validation in sklearn is provided by the sklearn.model_selection module

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_score#sklearn.model_selection.cross_val_score

#Write Code
from sklearn.model_selection import cross_val_score
#Write Code
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
# cv sets the cross validation splitting strategy; cv=None uses the default 5-fold cross validation
scores = cross_val_score(rfc, X_train, y_train, cv=10)
# k-fold cross validation score
scores
array([0.85074627, 0.79104478, 0.85074627, 0.7761194 , 0.80597015,
       0.85074627, 0.76119403, 0.82089552, 0.77272727, 0.77272727])
# Average Cross Validation Score
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Thinking 4

  • What is the effect of using more folds (a larger k)?
    A: With more folds, each round trains on more data and tests on less.
    A model trained on more data may fit better, and averaging over more rounds makes the evaluation result more stable and reliable.
    On the other hand, more folds means more models to train. With a large data set each fit already takes a long time, so the total computational cost rises.
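
The trade-off described above can be seen by counting samples per fold. The sketch below assumes a plain KFold over 668 row indices (the size of the training set above). Note that cross_val_score with a classifier and an integer cv uses a stratified variant, so the exact rows differ, but the fold sizes are comparable.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 668  # same size as the training set above
indices = np.arange(n_samples)

fold_sizes = {}
for k in (5, 10):
    kf = KFold(n_splits=k)
    # Take the first split and record (training size, validation size)
    train_idx, val_idx = next(kf.split(indices))
    fold_sizes[k] = (len(train_idx), len(val_idx))
    print(f"k={k}: train={len(train_idx)}, validation={len(val_idx)}, models trained={k}")
```

Going from k=5 to k=10 gives each model more training rows but halves the validation fold and doubles the number of fits.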

Task 2: confusion matrix

  • Compute the confusion matrix for the binary classification problem
  • Calculate precision, recall and f-score

[Thinking] What is the confusion matrix of a binary classification problem? Understand the concept and which metrics are mainly calculated from it
A: I learned this in the previous [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163

#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')

#Tip: how accuracy, precision, recall and f-score are calculated
Image('Snipaste_2020-01-05_16-39-27.png')

Tip 5

  • The confusion matrix function is in the sklearn.metrics module
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html?highlight=confusion_matrix#sklearn.metrics.confusion_matrix
  • It takes the true labels and the predicted labels as input
  • Precision, recall and f-score can be obtained with classification_report from the same module
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report
#Write Code
from sklearn.metrics import confusion_matrix
#Write Code
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
#Write Code
pred = rfc.predict(X_train)

#Confusion matrix
confusion_matrix(y_train, pred)

array([[391,  21],
       [ 64, 192]], dtype=int64)
from sklearn.metrics import classification_report
# Precision, recall and f1 score
print(classification_report(y_train, pred))
              precision    recall  f1-score   support

           0       0.86      0.95      0.90       412
           1       0.90      0.75      0.82       256

    accuracy                           0.87       668
   macro avg       0.88      0.85      0.86       668
weighted avg       0.88      0.87      0.87       668
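
The class 1 row of the report can be reproduced by hand from the confusion matrix printed above. This sketch copies the four counts and applies the standard definitions, treating class 1 (survived) as the positive class:

```python
# Counts copied from the confusion matrix above
# (rows are true labels, columns are predicted labels)
tn, fp = 391, 21
fn, tp = 64, 192

precision = tp / (tp + fp)  # of the predicted positives, how many are real positives
recall = tp / (tp + fn)     # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The rounded values match the report: 0.90, 0.75, 0.82 and 0.87. Keep in mind that these figures were computed on the training set, so they are likely more optimistic than test set results.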

[Thinking]

  • What should I pay attention to when implementing the confusion matrix?
    A: Reference https://blog.csdn.net/zhangxiaohua18/article/details/122311808

Task 3: ROC curve

  • Draw ROC curve

[Thinking] What is the ROC curve, and what problem does it exist to solve?
A: I learned this in the previous [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163

Tip 6

  • The ROC curve function in sklearn is in the sklearn.metrics module
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html?highlight=roc_curve#sklearn.metrics.roc_curve
  • The larger the area under the ROC curve (AUC), the better the model separates the two classes
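
The area under the curve can be computed directly, without plotting. Here is a minimal sketch on four made-up labels and scores (y_true and y_score are illustrative values, not model output from this notebook):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for four samples
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

area = auc(fpr, tpr)                       # trapezoidal area under the points
direct = roc_auc_score(y_true, y_score)    # same quantity in a single call
print(area, direct)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.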
#Write Code
from sklearn.metrics import roc_curve
#Write Code
# In the original run lr was never fitted successfully (its fit() call raised an error),
# so lr.decision_function(X_test) failed here with:
#     AttributeError: 'list' object has no attribute 'shape'
# The fitted random forest is used instead. Its predict_proba gives the probability of
# the positive class, for which the default decision threshold is 0.5.
fpr, tpr, thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0.5
close_default = np.argmin(np.abs(thresholds - 0.5))
plt.plot(fpr[close_default], tpr[close_default], 'o', markersize=10, label="threshold 0.5", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

Thinking 6

  • How can the ROC curve be drawn for a multi-class problem?
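
One common approach is one-vs-rest: treat each class in turn as the positive class and compute a separate curve for it. The sketch below is only an illustration under assumptions. It uses the iris data set bundled with sklearn rather than the Titanic data, and a random forest chosen for convenience.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class example data (an assumption for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, stratify=y_iris, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                     # one probability column per class
y_bin = label_binarize(y_te, classes=clf.classes_)  # one 0/1 indicator column per class

per_class_auc = {}
for i, cls in enumerate(clf.classes_):
    # Class i against all other classes
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    per_class_auc[int(cls)] = auc(fpr, tpr)
    # plt.plot(fpr, tpr, label=f"class {cls}") would draw one curve per class

print(per_class_auc)
```

The per-class curves can be drawn on the same axes, and their areas averaged if a single summary number is needed.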

[Thinking] What information can you get from this ROC curve? What can this information be used for?

Tags: Python Data Analysis

Posted by vocoder on Fri, 23 Sep 2022 02:21:35 +0930